Thank you for your work on this exciting project!
I've been experimenting with the latest commit (9e69192) and carefully following the instructions in README.md.
With the README setup, the accuracy I'm seeing is quite low (1/30).
I then tested different combinations of models for the agents and tools:
| Planner | Planner_fixed | Code | Other agents and tools | Correct tasks | Acc |
|---|---|---|---|---|---|
| agentflow-planner-7b | Qwen2.5-7B-Instruct | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct | 1/30 | 3.3% |
| Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct | 1/30 | 3.3% |
| agentflow-planner-7b | Qwen2.5-7B-Instruct | Gemini-2.5-pro | Qwen2.5-7B-Instruct | 2/30 | 6.7% |
| GPT-4o | Qwen2.5-7B-Instruct | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct | 2/30 | 6.7% |
| Gemini-2.5-pro | Qwen2.5-7B-Instruct | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct | 1/30 | 3.3% |
| GPT-4o | GPT-4o | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct | 6/30 | 20.0% |
| Gemini-2.5-pro | Gemini-2.5-pro | Qwen2.5-Coder-7B-Instruct | Qwen2.5-7B-Instruct | 6/30 | 20.0% |
| Gemini-2.5-pro | Gemini-2.5-pro | Gemini-2.5-pro | Gemini-2.5-pro | 17/30 | 56.7% |
P.S. The officially reported AIME24 score for Gemini-2.5-pro is 92%.
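For reference, the Acc column in the table is simply the number of correct tasks divided by 30. Here is a minimal sketch of that calculation; `accuracy_percent` and `TOTAL_TASKS` are hypothetical names I made up for illustration and are not part of the repository:

```python
# Hypothetical helper, not part of the repository: recomputes the "Acc" column.
TOTAL_TASKS = 30  # assumption: 30 tasks, taken from the "x/30" counts in the table


def accuracy_percent(correct: int, total: int = TOTAL_TASKS) -> float:
    """Return accuracy as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total, 1)


# Correct-task counts that appear in the table above.
for correct in (1, 2, 6, 17):
    print(f"{correct}/{TOTAL_TASKS} -> {accuracy_percent(correct)}%")
```

The values printed for 1, 2, 6, and 17 correct tasks should match the percentages reported in the table.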
Could you please provide any guidance on this?
Thank you for your time and any help you can offer!