Average rank seems to be becoming the standard way to compare foundation models across multiple tasks in other domains, e.g. LLMs.
Currently, NeuCo-Bench presents leaderboards using a weighted average rank, where each task's weight is derived from the performance of the experiments on the leaderboard, i.e. a task-difficulty weighted ranking scheme.
I suggest adding an optional setting to the config files for selecting the type of weighting, either uniform or task-difficulty weighted, with uniform as the default. Uniform weighting makes fewer assumptions about how task importance should be determined, which is why I suggest making it the default.
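To illustrate the difference between the two options, here is a minimal sketch of an average-rank computation where uniform weighting is the default and per-task weights are optional. It assumes higher scores are better, every experiment has a score on every task, and tied experiments receive the mean of their tied positions. The function name `average_rank` and its `weights` argument are hypothetical and not part of NeuCo-Bench; the task-difficulty weights are passed in explicitly because how NeuCo-Bench derives them from leaderboard performance is not reproduced here.

```python
from typing import Mapping, Optional


def average_rank(
    scores: Mapping[str, Mapping[str, float]],
    weights: Optional[Mapping[str, float]] = None,
) -> dict[str, float]:
    """Return the (weighted) average rank per experiment; rank 1 is best.

    scores[task][experiment] -> metric value (assumed: higher is better).
    weights[task] -> task weight; None means uniform weighting (hypothetical API).
    """
    tasks = list(scores)
    if weights is None:
        # Uniform scheme: every task counts equally.
        weights = {t: 1.0 for t in tasks}
    total_weight = sum(weights[t] for t in tasks)

    weighted_rank_sums: dict[str, float] = {}
    for t in tasks:
        # Sort experiments by descending score for this task.
        ordered = sorted(scores[t].items(), key=lambda kv: kv[1], reverse=True)
        i = 0
        while i < len(ordered):
            # Find the block of experiments tied with ordered[i].
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            # Tied experiments share the mean of their 1-based positions.
            shared_rank = (i + j) / 2 + 1
            for k in range(i, j + 1):
                exp = ordered[k][0]
                weighted_rank_sums[exp] = (
                    weighted_rank_sums.get(exp, 0.0) + weights[t] * shared_rank
                )
            i = j + 1

    return {exp: s / total_weight for exp, s in weighted_rank_sums.items()}


if __name__ == "__main__":
    scores = {
        "task_a": {"exp1": 0.9, "exp2": 0.8, "exp3": 0.7},
        "task_b": {"exp1": 0.5, "exp2": 0.6, "exp3": 0.6},
    }
    print(average_rank(scores))  # uniform weighting
    print(average_rank(scores, {"task_a": 3.0, "task_b": 1.0}))  # custom weights
```

With uniform weights, `exp2` has the best average rank (1.75); with the illustrative weights favouring `task_a`, `exp1` moves to the top (1.5). This shows how the choice of weighting alone can reorder a leaderboard, which is why making the scheme explicit in the config seems worthwhile.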