
@eugene-yang
Collaborator

@eugene-yang eugene-yang commented Mar 11, 2022

  1. User-specified fields for the ir_datasets.Docs object. If no field is provided, fall back to default_text() (a future convention, allenai/ir_datasets#72, that ir_datasets is currently working on). If default_text() is not implemented, fall back to the text field.

  2. Support arbitrary fields in TopicProcessor for arbitrary query fields in ir_datasets. This is particularly important for integrating the mt_* and ht_* fields in the HC4 interface in ir_datasets. (citing discussion)

  3. Sample configs for running PSQ and human-translated queries. A severely truncated translation table is added to ./samples/data for demo purposes.
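
The fallback order in item 1 could be sketched roughly as below. This is only an illustration, not the code in this PR: `select_doc_text`, the `sep` parameter, and the `ExampleDoc` type are hypothetical names, and the sketch assumes documents expose their fields as attributes (as namedtuple-style records do).

```python
from typing import NamedTuple, Optional, Sequence


class ExampleDoc(NamedTuple):
    """Hypothetical stand-in for a document record; not a real ir_datasets type."""
    doc_id: str
    title: str
    text: str


def select_doc_text(doc, fields: Optional[Sequence[str]] = None, sep: str = " ") -> str:
    """Pick the text to index from a document.

    Order: user-specified fields, then default_text() if the document
    implements it, then the plain `text` attribute.
    """
    if fields:
        # A misspelled field name raises AttributeError here.
        return sep.join(str(getattr(doc, name)) for name in fields)
    default_text = getattr(doc, "default_text", None)
    if callable(default_text):
        return default_text()
    return doc.text


doc = ExampleDoc("d1", "A title", "Some body text")
print(select_doc_text(doc, ["title", "text"]))  # user-specified fields
print(select_doc_text(doc))                     # falls back to doc.text
```

With this shape, a dataset that later gains `default_text()` would be picked up without any config change, while an explicit field list always takes priority.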

close #32

Comment on lines +95 to +96
LOGGER.warning(f"Using unrecognized topic fields {e}, may cause unexpected results.")
return fields
Member

Why wouldn't we want this to throw an exception?

Member

I'm guessing this is because of item 2:

> Support arbitrary fields in TopicProcessor for arbitrary query fields in ir_datasets. This is particularly important for integrating the mt_* and ht_* fields in the HC4 interface in ir_datasets. (citing https://github.com/allenai/ir_datasets/issues/148)

I'll have to look more into this as I prefer to keep the checking in to catch typos or other problems with data.
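
One possible middle ground between warning and throwing, sketched under assumptions: accept a fixed set of known fields plus the `mt_*` and `ht_*` patterns mentioned in item 2, and let a `strict` flag decide whether anything else raises. The function name `validate_topic_fields`, the `strict` flag, and the contents of `KNOWN_FIELDS` are hypothetical and are not taken from this PR.

```python
import logging
from fnmatch import fnmatchcase
from typing import Iterable, List

LOGGER = logging.getLogger(__name__)

# Assumption: an illustrative set of standard topic fields.
KNOWN_FIELDS = {"title", "desc", "narr"}
# Patterns for the translated query fields named in the PR description.
ALLOWED_PATTERNS = ("mt_*", "ht_*")


def validate_topic_fields(fields: Iterable[str], strict: bool = False) -> List[str]:
    """Return the fields unchanged; warn or raise on names that match nothing."""
    fields = list(fields)
    unknown = [
        name for name in fields
        if name not in KNOWN_FIELDS
        and not any(fnmatchcase(name, pattern) for pattern in ALLOWED_PATTERNS)
    ]
    if unknown:
        message = f"Unrecognized topic fields {unknown}"
        if strict:
            raise ValueError(message)
        LOGGER.warning("%s, may cause unexpected results.", message)
    return fields
```

In strict mode a typo such as `titel` fails fast, while `mt_title` or `ht_desc` still pass through untouched.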

Comment on lines -260 to -261
dataset_lang = LangStandardizer.iso_639_3(self.dataset.queries.lang)
assert dataset_lang == self.lang, f"Query language code from {path} is not {lang} but {dataset_lang}."
Member

A user must list the language in the topics config, but it is not checked against the language of the downloaded dataset. Is the language code of the dataset not consistent? I'd probably put a comment in there to indicate we're not checking the language because xyz.
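
If the check is kept in a softer form rather than removed, it might look something like the sketch below. It does not use the project's `LangStandardizer`; the tiny code table and the names `to_iso_639_3` and `language_matches` are hypothetical, and the sketch assumes a dataset may report no language at all.

```python
import logging
from typing import Optional

LOGGER = logging.getLogger(__name__)

# Small illustrative subset of ISO 639-1 to ISO 639-3 codes, not a full table.
ISO_639_1_TO_3 = {"en": "eng", "zh": "zho", "fa": "fas", "ru": "rus"}


def to_iso_639_3(code: str) -> str:
    """Normalize a language code; codes outside the table pass through lowercased."""
    code = code.strip().lower()
    return ISO_639_1_TO_3.get(code, code)


def language_matches(config_lang: str, dataset_lang: Optional[str]) -> bool:
    """Compare the configured language with the dataset's, logging instead of asserting."""
    if dataset_lang is None:
        # Nothing to compare against, so the check is skipped.
        LOGGER.warning("Dataset reports no query language; skipping the language check.")
        return True
    expected = to_iso_639_3(config_lang)
    actual = to_iso_639_3(dataset_lang)
    if expected != actual:
        LOGGER.warning("Query language is %s but the config says %s.", actual, expected)
        return False
    return True
```

The caller can then decide whether a `False` result should abort the run or only be logged.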

@cash
Member

cash commented Mar 29, 2022

Overall this looks good. I still need to test locally. I will probably also cut down the sample config files. It looks like they include every option, and I want the samples to include just the options that are being used.

@cash
Member

cash commented Mar 31, 2022

ir-datasets hasn't done a release since HC4 has been added. @eugene-yang do you know if Sean is planning to make a new release soon or should we depend on a git commit?

@eugene-yang
Collaborator Author

> ir-datasets hasn't done a release since HC4 has been added. @eugene-yang do you know if Sean is planning to make a new release soon or should we depend on a git commit?

Basing the requirements on a git commit might break after a new version of ir-datasets is released.
I think a better solution is to have @seanmacavaney tag the current version on master as a pre-release version so pip can resolve it as a version that is >=0.5.0 and can transition gracefully to later versions.

@seanmacavaney any thoughts on this?

@cash
Member

cash commented Mar 31, 2022

You can set pip to pull a particular commit
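
For illustration, a commit pin could be written as a PEP 508 direct reference in a `setup.py`-style requirements list, roughly as sketched here. The `<commit-sha>` value is a placeholder and the variable names are hypothetical; nothing in this thread identifies a specific commit.

```python
# Placeholder only: substitute the hash of the commit to depend on.
IR_DATASETS_COMMIT = "<commit-sha>"

# A PEP 508 direct reference that tells pip to install from a specific git commit.
INSTALL_REQUIRES = [
    f"ir_datasets @ git+https://github.com/allenai/ir_datasets.git@{IR_DATASETS_COMMIT}",
]

print(INSTALL_REQUIRES[0])
```

The same string, without the Python quoting, can also be used as a line in a `requirements.txt` file.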

@seanmacavaney

I'm happy to do a release of ir_datasets. On it now.

@seanmacavaney

Done -- ir-datasets==0.5.1 is now on pypi, including hc4, neuclir, etc.

@cash
Member

cash commented Mar 31, 2022

@seanmacavaney thanks!

@eugene-yang
Collaborator Author

@cash are we able to merge this?

@cash
Member

cash commented May 23, 2022

@eugene-yang I got stuck trying to download the data - tried multiple times and it never finished. I'll get back to testing this and fixing issues that we've identified.

@eugene-yang
Collaborator Author

eugene-yang commented May 23, 2022

@cash I updated the download script a couple of weeks ago because the base URL changed for Common Crawl.
Let me know if you still have issues downloading HC4.

Development

Successfully merging this pull request may close these issues.

Better integration with ir_datasets

4 participants