Can't reproduce AIME24 results following the README (#21)

Thank you for your work on this exciting project!

I've been experimenting with the latest commit (9e691927f42c7acaaca4bb697997d6b7f32072c5) and have been carefully following the instructions in the README.md.

Following the README as written, the accuracy I'm seeing is very low: 1/30 correct (3.3%).

I then tested different combinations of models for the agents and tools:

| Planner               | Planner_fixed         | Code                  | Other agents and tools           | Correct tasks | Acc   |
|-----------------------|-----------------------|-----------------------|-----------------------|-------------|-------|
| agentflow-planner-7b  | Qwen2.5-7B-Instruct   | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct   | 1/30        | 3.3%  |
| Qwen2.5-7B-Instruct   | Qwen2.5-7B-Instruct   | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct   | 1/30        | 3.3%  |
| agentflow-planner-7b  | Qwen2.5-7B-Instruct   | Gemini-2.5-pro        | Qwen2.5-7B-Instruct   | 2/30        | 6.7%  |
| GPT-4o                | Qwen2.5-7B-Instruct   | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct   | 2/30        | 6.7%  |
| Gemini-2.5-pro        | Qwen2.5-7B-Instruct   | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct   | 1/30        | 3.3%  |
| GPT-4o                | GPT-4o                | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct   | 6/30        | 20.0% |
| Gemini-2.5-pro        | Gemini-2.5-pro        | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct   | 6/30        | 20.0% |
| Gemini-2.5-pro        | Gemini-2.5-pro        | Gemini-2.5-pro        | Gemini-2.5-pro        | 17/30       | 56.7% |

PS: The officially reported AIME24 accuracy for Gemini-2.5-pro is 92%.
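
For anyone double-checking the numbers: the Acc column is just correct tasks divided by 30. Below is a minimal Python sketch that recomputes it from the counts in the table and shows the gap to the 92% figure quoted above. The row labels and helper names are illustrative shorthand, not identifiers from the repository, and the 92% value is taken as quoted rather than verified here.

```python
# Recompute the "Acc" column from the correct-task counts in the table.
# Labels are shorthand for (Planner / Planner_fixed / Code / Other); they are
# made up for this sketch and do not come from the repository.
TOTAL_TASKS = 30
REPORTED_GEMINI_ACC = 92.0  # figure quoted in the PS, not verified here

results = {
    "agentflow-7b / Qwen / Qwen-Coder / Qwen": 1,
    "Qwen / Qwen / Qwen-Coder / Qwen": 1,
    "agentflow-7b / Qwen / Gemini / Qwen": 2,
    "GPT-4o / Qwen / Qwen-Coder / Qwen": 2,
    "Gemini / Qwen / Qwen-Coder / Qwen": 1,
    "GPT-4o / GPT-4o / Qwen-Coder / Qwen": 6,
    "Gemini / Gemini / Qwen-Coder / Qwen": 6,
    "Gemini / Gemini / Gemini / Gemini": 17,
}


def accuracy(correct: int, total: int = TOTAL_TASKS) -> float:
    """Return accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


def gap_to_reported(correct: int, reported: float = REPORTED_GEMINI_ACC) -> float:
    """Return the difference, in percentage points, to the reported figure."""
    return round(reported - accuracy(correct), 1)


if __name__ == "__main__":
    for label, correct in results.items():
        print(f"{label}: {correct}/{TOTAL_TASKS} = {accuracy(correct)}%")
    best = results["Gemini / Gemini / Gemini / Gemini"]
    print(f"All-Gemini gap to reported: {gap_to_reported(best)} points")
```

Even the all-Gemini-2.5-pro configuration lands well below the quoted figure, which is why the gap looks like a setup or pipeline issue rather than a model-capability issue.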

Could you please provide any guidance on this?

Thank you for your time and any help you can offer!




