-
Notifications
You must be signed in to change notification settings - Fork 6
Better ir_datasets integration.
#37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
1. User-specfied fields for `ir_datasets.Docs` object. If no field is provided, fall back to `default_text()` (a [future convention](allenai/ir_datasets#72) `ir_datasets` is currently working on). If `default_text()` is not implemented, fall back to `text` field. 2. Support arbitrary field in TopicProcessor for abitrary query field in `ir_datasets`. This is particularly important for integrating the `mt_*` and `ht_*` fields in the HC4 interface in `ir_datasets`. 3. Sample configs for running PSQ and human translated queries. A severely truncated translation table is added to `./samples/data` for demo purposes.
| LOGGER.warning(f"Using unrecognized topic fields {e}, may cause unexpected results.") | ||
| return fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why wouldn't we want this to throw an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing this is because of item 2:
Support arbitrary fields in TopicProcessor for abitrary query fields in ir_datasets. This is particularly important for integrating the mt_* and ht_* fields in the HC4 interface in ir_datasets. (citing https://github.com/allenai/ir_datasets/issues/148)
I'll have to look more into this as I prefer to keep the checking in to catch typos or other problems with data.
| dataset_lang = LangStandardizer.iso_639_3(self.dataset.queries.lang) | ||
| assert dataset_lang == self.lang, f"Query language code from {path} is not {lang} but {dataset_lang}." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A user must list the language in the topics config, but it is not checked against the language of the downloaded dataset. Is the language code of the dataset not consistent? I'd probably put a comment in there to indicate we're not checking the language because xyz.
|
Overall this looks good. I still need to test locally. I will probably also cut down the sample config files. It looks like they every option and I want the samples to include just the options that are being used. |
|
ir-datasets hasn't done a release since HC4 has been added. @eugene-yang do you know if Sean is planning to make a new release soon or should we depend on a git commit? |
Basing the requirements on a git commit might break after a new version of @seanmacavaney any thought on this? |
|
You can set pip to pull a particular commit |
|
I'm happy to do a release of |
|
Done -- |
|
@seanmacavaney thanks! |
|
@cash are we able to merge this? |
|
@eugene-yang I got stuck trying to download the data - tried multiple times and it never finished. I'll get back to testing this and fixing issues that we've identified. |
|
@cash I updated the download script couple weeks ago because the base URL changed for Common Crawl. |
User-specified fields for
ir_datasets.Docsobject. If no field is provided, fall back todefault_text()(a future conventionir_datasetsis currently working on). Ifdefault_text()is not implemented, fall back totextfield.Support arbitrary fields in TopicProcessor for abitrary query fields in
ir_datasets. This is particularly important for integrating themt_*andht_*fields in the HC4 interface inir_datasets. (citing discussion)Sample configs for running PSQ and human translated queries. A severely truncated translation table is added to
./samples/datafor demo purposes.close #32