Conversation

@Xceron (Contributor) commented Nov 7, 2025

Summary

What are you adding?

  • Bug fix (non-breaking change which fixes an issue)
  • New benchmark/evaluation
  • New model provider
  • CLI enhancement
  • Performance improvement
  • Documentation update
  • API/SDK feature
  • Integration (CI/CD, tools)
  • Export/import functionality
  • Code refactoring
  • Breaking change
  • Other

Changes Made

  • GPQA usually does not come with a system prompt. The current system prompt is taken from simple-evals; however, the gpt-oss repo re-implements GPQA in simple-evals style and no longer uses a system message (see impl).
  • For what it's worth, the same repository sets 1.0 as the default temperature, but I left that as-is here.
  • I also ran gpt-oss 20B on Groq with the current main and with the proposed changes; the latter brings performance closer to the reference scores.

Testing

  • I have run the existing test suite (pytest)
  • I have added tests for my changes
  • I have tested with multiple model providers (if applicable)
  • I have run pre-commit hooks (pre-commit run --all-files)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (if applicable)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Related Issues

Closes #

Additional Context

@AarushSah (Member)

Thanks @Xceron - Does this mean GPQA Diamond has parity with GPT-OSS now?

@Xceron (Contributor, Author) commented Nov 7, 2025

@AarushSah Hm, almost. I ran into RateLimitErrors when using my Groq API key directly, so I used HF as the aggregator for Groq.

main (4 runs): 0.617 (stderr 0.030)
removed sysprompt + temperature 1.0 (4 runs): 0.647 (stderr 0.047)

The paper puts 20B@medium at roughly 0.66–0.67.

@AarushSah (Member) commented Nov 7, 2025

Ah yes -- they do some answer shuffling between repeats. We do not.
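For context, the shuffling described here can be sketched roughly as follows. This is an illustrative sketch only, not the simple-evals or openbench implementation; `shuffle_choices` and its signature are hypothetical. The idea is that each repeat derives its own deterministic permutation of the answer choices, so the correct option's letter can change between repeats.

```python
import random

def shuffle_choices(question, choices, correct_idx, repeat, seed=0):
    """Return (shuffled_choices, correct_letter) for one repeat.

    Seeding on (seed, repeat) makes each repeat's permutation
    reproducible while still varying across repeats.
    (Hypothetical helper; not the actual eval API.)
    """
    rng = random.Random(f"{seed}:{repeat}")  # string seed works on all Python 3 versions
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # Find where the originally-correct choice ended up after shuffling.
    correct_letter = "ABCD"[order.index(correct_idx)]
    return shuffled, correct_letter

choices = ["ethane", "ethene", "ethyne", "benzene"]
for r in range(4):
    opts, letter = shuffle_choices("Which is aromatic?", choices, 3, r)
    print(r, letter, opts)
```

Running the same (seed, repeat) pair twice yields the same permutation, which keeps per-repeat scoring reproducible.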

@AarushSah (Member)

@Xceron if your goal is to replicate the GPT-OSS GPQA Diamond setup, could you add it under the new GPT-OSS umbrella we have made for AIME? My gut feeling is that n_repeats will have to be passed in as a task arg, and/or each record will have to return multiple samples (corresponding to the number of repeats set).

Relevant PRs:
#284
#285
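The record-to-samples idea above might look like the following minimal sketch. The `Sample` dataclass and `record_to_samples` function are hypothetical stand-ins, not the actual openbench or Inspect API; the point is just that one dataset record expands into `n_repeats` samples, with the repeat index carried in metadata where per-repeat shuffling could hook in.

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    """Stand-in for the eval framework's sample type (hypothetical)."""
    input: str
    target: str
    metadata: dict = field(default_factory=dict)

def record_to_samples(record, n_repeats=4):
    """Expand one GPQA record into n_repeats samples, one per repeat."""
    return [
        Sample(
            input=record["question"],
            target=record["answer"],
            metadata={"repeat": r},  # per-repeat answer shuffling could key off this
        )
        for r in range(n_repeats)
    ]

record = {"question": "Which is aromatic?", "answer": "benzene"}
samples = record_to_samples(record, n_repeats=4)
print(len(samples))
```

With n_repeats exposed as a task arg, the variant can match the reference's repeat count without touching the existing GPQA Diamond task.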

@AarushSah (Member)

I'm not entirely sure we should be modifying existing evals unless absolutely necessary - the current rendition of GPQA Diamond works, so a variant deserves its own setup.

@Xceron (Contributor, Author) commented Nov 10, 2025

I'm not entirely sure why this is needed. It kind of suggests that GPT-OSS is benchmarked differently, whereas it in fact follows industry practice. I would go so far as to call the current GPQA-D in openbench a variant.

@AarushSah (Member)

Thanks for the clarification! I think we’re just talking past each other a bit.

My stance isn’t that GPT-OSS is “benchmarking differently.” It’s that OpenBench tries very hard to keep existing evals stable, because a lot of folks (internal + external) rely on their current behavior for longitudinal comparisons.

So for anything that meaningfully changes scoring behavior - even if it moves it toward an external reference - I’d much rather treat it as a variant under the GPT-OSS umbrella, the same way we’re handling AIME.

That keeps GPQA-Diamond stable, gives you space to add the faithful reproduction you want, and avoids breaking downstream users’ expectations.

@Xceron (Contributor, Author) commented Nov 14, 2025

I see, done!

@AarushSah (Member)

@Xceron It would be great if we could add the sample shuffling too, then!
