
Can't reproduce AIME24 results following the README #21

@dercaft

Thank you for your work on this exciting project!

I've been experimenting with the latest commit (9e69192), carefully following the instructions in README.md.

With that setup, accuracy on AIME24 is quite low: only 1/30 problems solved (3.3%).

I then tested different combinations of models for the agents and tools:

| Planner | Planner_fixed | Code | Other agents and tools | Correct tasks | Accuracy |
|---|---|---|---|---|---|
| agentflow-planner-7b | Qwen2.5-7B-Instruct | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct | 1/30 | 3.3% |
| Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct | 1/30 | 3.3% |
| agentflow-planner-7b | Qwen2.5-7B-Instruct | Gemini-2.5-pro | Qwen2.5-7B-Instruct | 2/30 | 6.7% |
| GPT-4o | Qwen2.5-7B-Instruct | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct | 2/30 | 6.7% |
| Gemini-2.5-pro | Qwen2.5-7B-Instruct | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct | 1/30 | 3.3% |
| GPT-4o | GPT-4o | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct | 6/30 | 20.0% |
| Gemini-2.5-pro | Gemini-2.5-pro | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct | 6/30 | 20.0% |
| Gemini-2.5-pro | Gemini-2.5-pro | Gemini-2.5-pro | Gemini-2.5-pro | 17/30 | 56.7% |
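For reference, the Accuracy column is just correct/30 (AIME24 has 30 problems, so each problem is worth about 3.3 points). The short sketch below recomputes it from the Correct tasks counts and converts the reported 92% into an approximate number of problems for comparison. The helper name `accuracy_pct` is illustrative only and not part of the project.

```python
# Recompute the "Accuracy" column from the "Correct tasks" counts.
# AIME24 has 30 problems, so each correct answer is worth 1/30 (about 3.3 points).
N_PROBLEMS = 30

def accuracy_pct(correct: int, total: int = N_PROBLEMS) -> float:
    """Return accuracy as a percentage rounded to one decimal place."""
    return round(100 * correct / total, 1)

# Correct-task counts copied from the table, in row order.
correct_counts = [1, 1, 2, 2, 1, 6, 6, 17]

for c in correct_counts:
    print(f"{c}/{N_PROBLEMS} -> {accuracy_pct(c)}%")

# The officially reported 92% expressed as (approximate) problems solved out of 30.
reported_solved = round(0.92 * N_PROBLEMS, 1)
print(f"92% of {N_PROBLEMS} ~= {reported_solved} problems")
```

Even the best configuration (17/30) is roughly 10 to 11 problems short of the reported figure.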

PS: Gemini-2.5-pro's officially reported accuracy on AIME24 is 92%.

Could you please provide any guidance on this?

Thank you for your time and any help you can offer!
