This repository contains code and data for reproducing the results from "How Reliable is Language Model Micro-Benchmarking?" by Gregory Yauney, Shahzaib Saqib Warraich, and Swabha Swayamdipta.
Install requirements:
pip install -r requirements.txt
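Optionally, and not required by the repository, you can install into a fresh virtual environment first; the environment name .venv below is only an illustrative choice:

python -m venv .venv                 # create an isolated environment (name is arbitrary)
source .venv/bin/activate            # activate it (Linux/macOS)
pip install -r requirements.txt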
The graphs-combine-subtasks directory contains the final processed results used in the paper.
If you'd like to reproduce them, run steps 1 and 2 below.
Otherwise, skip to step 3 to plot the existing results.
You can skip this step: download the cached micro-benchmarking results from Google Drive (59 MB) and unzip them into this directory.
First, download the cached model evaluation results from the Open LLM Leaderboard v2: download from Google Drive (550 MB) and unzip the archive into this directory.
Here is an example command to run the micro-benchmarking evaluations for the MMLU-Pro dataset:
python evaluate-microbenchmarks.py \
--selection_techniques Random Random_Subtask_Stratified_Equal Anchor_Points_Weighted Stratified_Random_Sampling tinyBenchmarks DPP \
--num_source_models 300 \
--num_runs 50 \
--benchmark mmlu-pro \
--combine_subtasks \
--same_points \
--num_threads 10
To reproduce the main results in the paper, you will need to run this for the other benchmarks as well: mmlu, bbh, and gpqa. A shell loop over all four benchmarks is sketched below.
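As a sketch, assuming the flags from the MMLU-Pro example carry over unchanged to the other benchmarks, you could loop over all four datasets from a shell:

for benchmark in mmlu-pro mmlu bbh gpqa; do
    python evaluate-microbenchmarks.py \
        --selection_techniques Random Random_Subtask_Stratified_Equal Anchor_Points_Weighted Stratified_Random_Sampling tinyBenchmarks DPP \
        --num_source_models 300 \
        --num_runs 50 \
        --benchmark "$benchmark" \
        --combine_subtasks \
        --same_points \
        --num_threads 10
done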
All results need to be processed by running the following command:
python process-results-combine-subtasks.py
Each file whose name begins with figure reproduces a figure from the paper. For example, figure-1.py will reproduce Figure 1.
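If the figure scripts take no command-line arguments (an assumption; check each script), you can regenerate all figures with a single shell loop:

for script in figure-*.py; do
    python "$script"
done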
Code we wrote for this project is released under the MIT License.
We use and adapt code from Anchor Points, tinyBenchmarks, py-irt, and DPPcoresets; their licenses are available in the licenses directory.