Problem
The current WorkRB architecture indexes datasets within each task by the `Language` enum. This limits each task to at most one dataset per language.
This constraint prevents supporting tasks with:
- Multiple monolingual datasets per language (e.g., regional variants, domain-specific subsets)
- Cross-lingual datasets (e.g., query language differs from corpus language)
- Multilingual datasets (e.g., corpus spans multiple languages)
The limitation is hard-coded across the data-loading, evaluation-iteration, and result-aggregation code paths.
This issue follows up on the architectural discussion in #30.
Proposal
Generalize dataset indexing from `Language` to arbitrary string identifiers (`dataset_id`).
Key changes in `workrb.tasks.abstract.Task`:
- The attribute `lang_datasets: dict[Language, Dataset]` becomes `datasets: dict[str, Dataset]`, indexing datasets by an arbitrary string
- Add a new method `languages_to_dataset_ids(languages) -> list[str]`
  - Default implementation: `[lang.value for lang in languages]` (a 1:1 mapping), which makes the refactor backward compatible for existing tasks
  - Tasks with more complex language-dataset mappings can override this method to return custom identifiers
- Rename `load_monolingual_data(language, split)` to `load_dataset(dataset_id, split)`
- Add `get_dataset_language(dataset_id) -> Language | None` to enable per-language result aggregation
  - Returns the language for monolingual datasets
  - Returns `None` for cross-lingual or multilingual datasets
Key changes in `results.py`:
- Add a `language` field to `MetricsResult`
- Update `_aggregate_per_language()` to group by the `language` field, skipping datasets where `language` is `None`
The user-facing API remains unchanged. Users continue to specify languages when instantiating tasks:

```python
task = SomeTask(languages=["en", "de"], split="test")
```

Internally, `languages_to_dataset_ids()` maps languages to dataset identifiers.
This proposal is non-breaking: existing tasks work without modification due to the default 1:1 mapping. Result aggregation behavior is preserved for standard tasks. Per-language aggregation remains backward compatible and simply excludes datasets marked as cross-lingual or multilingual.
Alternatives
An alternative approach proposed by @Mattdl is being discussed in #30: indexing datasets by (`query_language`, `corpus_language`) pairs with compound identifiers like `"en-en"` or `"es-nl"`. This would enable aggregation by query language, corpus language, or specific cross-lingual scenarios. A comparison of both approaches is included in that other discussion.
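Under that scheme, compound identifiers could be split back into pairs to support either aggregation axis. This is a sketch of the idea only; the splitting convention and helper names are assumptions, not part of the #30 proposal:

```python
from collections import defaultdict


def parse_dataset_id(dataset_id: str) -> tuple[str, str]:
    # Assumed convention: "es-nl" -> ("es", "nl"),
    # i.e. (query_language, corpus_language).
    query_lang, corpus_lang = dataset_id.split("-")
    return query_lang, corpus_lang


def aggregate_by(scores: dict[str, float], part: int) -> dict[str, list[float]]:
    # part=0 groups scores by query language, part=1 by corpus language.
    grouped: dict[str, list[float]] = defaultdict(list)
    for dataset_id, score in scores.items():
        grouped[parse_dataset_id(dataset_id)[part]].append(score)
    return dict(grouped)
```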
Implementation
Type:
Area(s) of code: paths, modules, or APIs you expect to touch
- `src/workrb/tasks/abstract/base.py`
- `src/workrb/tasks/abstract/ranking_base.py`
- `src/workrb/tasks/abstract/classification_base.py`
- `src/workrb/run.py`
- `src/workrb/config.py`
- `src/workrb/results.py`