Conversation

@Xceron (Contributor) commented Nov 7, 2025

Summary

What are you adding?

  • Bug fix (non-breaking change which fixes an issue)
  • New benchmark/evaluation
  • New model provider
  • CLI enhancement
  • Performance improvement
  • Documentation update
  • API/SDK feature
  • Integration (CI/CD, tools)
  • Export/import functionality
  • Code refactoring
  • Breaking change
  • Other

Changes Made

  • GPQA usually does not come with a system prompt. The current system prompt is taken from simple-evals; however, the gpt-oss repo re-implements GPQA in simple-evals style and no longer uses a system message (see impl).
  • For what it's worth, the same repository sets 1.0 as the default temperature, but I left that as-is here.
  • I also ran gpt-oss 20B on Groq with the current main and with the proposed changes; the latter brings performance closer to the reference scores.

Testing

  • I have run the existing test suite (pytest)
  • I have added tests for my changes
  • I have tested with multiple model providers (if applicable)
  • I have run pre-commit hooks (pre-commit run --all-files)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (if applicable)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Related Issues

Closes #

Additional Context

@AarushSah (Member)

Thanks @Xceron - Does this mean GPQA Diamond has parity with GPT-OSS now?

@Xceron (Contributor, Author) commented Nov 7, 2025

@AarushSah Hm, almost. I ran into RateLimitErrors when using my Groq API key directly, so I used HF as the aggregator for Groq.

main (4 runs): 0.617 (stderr 0.030)
removed sysprompt + temperature 1.0 (4 runs): 0.647 (stderr 0.047)

The paper puts 20B@medium at roughly 0.66–0.67.

@AarushSah (Member) commented Nov 7, 2025

Ah yes -- they do some answer shuffling between repeats. We do not.
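For context, the shuffling described here can be sketched roughly as follows. This is an illustrative sketch only, not the simple-evals or openbench implementation; `shuffle_choices` and its signature are hypothetical. The idea is that each repeat derives its own deterministic permutation of the answer choices, so the correct option's letter can change between repeats.

```python
import random

def shuffle_choices(question, choices, correct_idx, repeat, seed=0):
    """Return (shuffled_choices, correct_letter) for one repeat.

    Seeding on (seed, repeat) makes each repeat's permutation
    reproducible while still varying across repeats.
    (Hypothetical helper; not the actual eval API.)
    """
    rng = random.Random(f"{seed}:{repeat}")  # string seed works on all Python 3 versions
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # Find where the originally-correct choice ended up after shuffling.
    correct_letter = "ABCD"[order.index(correct_idx)]
    return shuffled, correct_letter

choices = ["ethane", "ethene", "ethyne", "benzene"]
for r in range(4):
    opts, letter = shuffle_choices("Which is aromatic?", choices, 3, r)
    print(r, letter, opts)
```

Running the same (seed, repeat) pair twice yields the same permutation, which keeps per-repeat scoring reproducible.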

@AarushSah (Member)

@Xceron if your goal is to replicate the GPT-OSS GPQA Diamond setup, could you add it under the new GPT-OSS umbrella we have made for AIME? My gut feeling is that n_repeats will have to be passed in as a task arg, and/or each record will have to return multiple samples (corresponding to the number of repeats set).

Relevant PRs:
#284
#285
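The record-to-samples idea above might look like the following minimal sketch. The `Sample` dataclass and `record_to_samples` function are hypothetical stand-ins, not the actual openbench or Inspect API; the point is just that one dataset record expands into `n_repeats` samples, with the repeat index carried in metadata where per-repeat shuffling could hook in.

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    """Stand-in for the eval framework's sample type (hypothetical)."""
    input: str
    target: str
    metadata: dict = field(default_factory=dict)

def record_to_samples(record, n_repeats=4):
    """Expand one GPQA record into n_repeats samples, one per repeat."""
    return [
        Sample(
            input=record["question"],
            target=record["answer"],
            metadata={"repeat": r},  # per-repeat answer shuffling could key off this
        )
        for r in range(n_repeats)
    ]

record = {"question": "Which is aromatic?", "answer": "benzene"}
samples = record_to_samples(record, n_repeats=4)
print(len(samples))
```

With n_repeats exposed as a task arg, the variant can match the reference's repeat count without touching the existing GPQA Diamond task.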

@AarushSah (Member)

I'm not entirely sure we should be modifying existing evals unless absolutely necessary - the current rendition of GPQA Diamond works, so a variant deserves its own setup.

@Xceron (Contributor, Author) commented Nov 10, 2025

I'm not entirely sure why this is needed. It kind of suggests that GPT-OSS is benchmarked differently, whereas it in fact follows industry practice. I would go so far as to call the current GPQA-D in openbench a variant.

@AarushSah (Member)

Thanks for the clarification! I think we’re just talking past each other a bit.

My stance isn’t that GPT-OSS is “benchmarking differently.” It’s that OpenBench tries very hard to keep existing evals stable, because a lot of folks (internal + external) rely on their current behavior for longitudinal comparisons.

So for anything that meaningfully changes scoring behavior - even if it moves it toward an external reference - I’d much rather treat it as a variant under the GPT-OSS umbrella, the same way we’re handling AIME.

That keeps GPQA-Diamond stable, gives you space to add the faithful reproduction you want, and avoids breaking downstream users’ expectations.

@Xceron (Contributor, Author) commented Nov 14, 2025

I see, done!

@AarushSah (Member)

@Xceron It would be great if we could add the sample shuffling too, then!
