Remove system message for GPQA Diamond #291
Conversation
Thanks @Xceron - Does this mean GPQA Diamond has parity with GPT-OSS now?

@AarushSah hm, almost. I ran into RateLimitErrors when using my Groq API key directly, so I used HF as the aggregator for Groq. main (4 runs): 0.617 (stderr 0.030). The paper has 20B@medium somewhere at 0.66-0.67.

Ah yes -- they do some answer shuffling between repeats. We do not.

@Xceron if your goal is to replicate the GPT-OSS GPQA Diamond results, could you add it to the new GPT-OSS umbrella we have made for AIME? My gut feeling is that n_repeats will have to be passed in as a task arg, and/or each record will have to return multiple samples (corresponding to the number of repeats set).

I'm not entirely sure we should be modifying existing evals unless absolutely necessary - the current rendition of GPQA Diamond works, so a variant deserves its own setup.

I'm not entirely sure why this is needed. It kind of suggests that GPT-OSS is benchmarked differently, whereas in fact it uses industry practices. I would go as far as calling the current GPQA-D in openbench a variant.

Thanks for the clarification! I think we're just talking past each other a bit. My stance isn't that GPT-OSS is "benchmarking differently." It's that openbench tries very hard to keep existing evals stable, because a lot of folks (internal and external) rely on their current behavior for longitudinal comparisons. So for anything that meaningfully changes scoring behavior - even if it moves it toward an external reference - I'd much rather treat it as a variant under the GPT-OSS umbrella, the same way we're handling AIME. That keeps GPQA-Diamond stable, gives you space to add the faithful reproduction you want, and avoids breaking downstream users' expectations.

I see, done!

@Xceron Would be great if we could add the sample shuffling too then!
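The answer shuffling discussed above could look roughly like this. This is only a sketch, not the openbench API: `shuffled_repeats` is a hypothetical helper that expands one multiple-choice record into `n_repeats` samples, each with the answer options in an independently shuffled order, so repeats are not biased by option position.

```python
import random

def shuffled_repeats(question, choices, answer_index, n_repeats, seed=0):
    """Expand one multiple-choice record into n_repeats samples, each with
    the answer options shuffled independently (hypothetical helper, not the
    actual openbench record_to_sample implementation)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_repeats):
        order = list(range(len(choices)))
        rng.shuffle(order)
        shuffled = [choices[i] for i in order]
        # Track where the correct answer landed after shuffling.
        new_answer = order.index(answer_index)
        samples.append(
            {"question": question, "choices": shuffled, "target": new_answer}
        )
    return samples
```

Each returned sample keeps the same correct answer text while its position varies between repeats; the accuracy of the repeats can then be averaged as in the GPT-OSS setup.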
Summary
What are you adding?
Changes Made
Testing
- [ ] Tests pass (`pytest`)
- [ ] Pre-commit hooks pass (`pre-commit run --all-files`)
Checklist
Related Issues
Closes #
Additional Context