paperHeader.title: FormulaCode: Evaluating Agentic Optimization on Large Codebases
[paperHeader.authors]
name: Atharva Sehgal
url: https://atharvas.net/
superscript: 1,*
name: James Hou
url: https://jamesahou.github.io/
superscript: 2,*
name: Akanksha Sarkar
url: https://milstein-program.as.cornell.edu/akanksha-sarkar/
superscript: 3
name: Ishaan Mantripragada
url: https://www.linkedin.com/in/ishaanmantri/
superscript: 2
name: Swarat Chaudhuri
url: https://www.cs.utexas.edu/~swarat/
superscript: 1
name: Jennifer J. Sun
url: https://jenjsun.com/
superscript: 3
name: Yisong Yue
url: https://www.yisongyue.com/
superscript: 2
[]
[paperHeader.affiliations]
superscript: 1
label: The University of Texas at Austin
superscript: 2
label: California Institute of Technology
superscript: 3
label: Cornell University
superscript: *
label: Equal contribution
[]
[paperHeader.actions]
label: Live dashboard
icon: activity
href: https://data.formulacode.org/
label: Arxiv
icon: file-text
href: https://arxiv.org/abs/2603.16011
label: GitHub
icon: github
href: https://github.com/formula-code/fc-eval
label: Huggingface
icon: database
href: https://huggingface.co/datasets/formulacode/formulacode-all
[]
paperHeader.abstract.title: Abstract
[paperHeader.abstract.paragraphs]
* Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on synthetic tasks, binary correctness signals, or single-objective evaluation, limiting their ability to assess holistic optimization behavior.
* We introduce FormulaCode, a benchmark for evaluating agentic optimization on large, real-world codebases with fine-grained, multi-objective performance metrics.
* FormulaCode is a <em>live</em> benchmark comprising 957 performance bottlenecks mined from scientific Python repositories on GitHub, each paired with expert-authored patches and, on average, 264.6 community-maintained performance workloads per task, enabling evaluation of the full optimization lifecycle (triage, diagnosis, and resolution) under realistic correctness and performance constraints. Our evaluations reveal that repository-scale, multi-objective optimization remains a major challenge for frontier LLM agents.
[]
paperHeader.leaderboard.title: FormulaCode’s Leaderboard (Tentative)
paperHeader.leaderboard.description: Snapshot of the latest results on FormulaCode. Updated monthly!
paperHeader.hero.eyebrow: Don’t see your model on the leaderboard?
paperHeader.hero.instructions: To evaluate an agent on FormulaCode, follow the <a href=https://github.com/formula-code/fc-eval/>installation instructions</a> and run:
paperHeader.hero.command: fc-eval run --dataset formulacode --config {your-config.json}
paperHeader.hero.body: The next sections dive into FormulaCode’s analysis with interactive visualizations on a representative subset of FormulaCode. For up-to-date results and insights, please read the paper!
paperHeader.hero.cta.label: Read the paper
paperHeader.hero.cta.href: https://arxiv.org/abs/2603.16011
paperHeader.disclaimer: This is an interactive blog post that presents the core ideas of FormulaCode using a very small subset of our dataset. There is a high likelihood that our findings differ from this exposition. Please read the full paper for accurate information!
paperFooter.citation.show: true
paperFooter.citation.title: Citation
paperFooter.citation.bibtex:@misc{sehgal2025formulacode,
title={Evaluating Agentic Optimization on Large Codebases},
author={Atharva Sehgal and James Hou and Akanksha Sarkar and Ishaan Mantripragada and Swarat Chaudhuri and Jennifer J. Sun and Yisong Yue},
year={2026},
eprint={2603.16011},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2603.16011},
}
:end
paperFooter.funding.title: Acknowledgements
paperFooter.funding.description: This work was supported in part by a Slingshot Award from the <a href=https://www.laude.org/>Laude Institute</a>, NSF awards III-#2505097, PPoSS-#2316161, NSF #2505096, NSF #2505098, and gifts from Point72 and OpenAI.
paperFooter.relatedWork.show: false
paperFooter.relatedWork.title: Related Work
paperFooter.relatedWork.text: This project would not be possible without the excellent work of the community. These are some relevant papers to better understand the premise of our work:
<ul>
<li><a href="https://www.nature.com/articles/s41586-023-06924-6">FunSearch: Making new
discoveries in mathematical sciences using Large Language Models</a> </li>
<li><a href="https://arxiv.org/abs/2305.01582">Interpretable Machine Learning for Science
with PySR and SymbolicRegression.jl</a> </li>
<li><a href="https://arxiv.org/abs/2310.19791">LILO: Learning Interpretable Libraries by
Compressing and Documenting Code</a> </li>
<li><a href="https://arxiv.org/abs/1911.12247">LLM-SR: Scientific Equation Discovery via
Programming with Large Language Models</a> </li>
<li><a href="https://arxiv.org/abs/2210.05050">Neurosymbolic Programming for Science</a>
</li>
</ul>
:end
paperFooter.acknowledgements.show: true
paperFooter.acknowledgements.title: Acknowledgements
paperFooter.acknowledgements.text: The website design is based on the template developed by <a href=https://pudding.cool/author/fox-meyer/>Fox Meyer</a> and <a href=https://pudding.cool/author/jan-diehm/>Jan Diehm</a> for their interactive article in <a href='https://pudding.cool/'>Pudding.cool</a> on <a href='https://github.com/the-pudding/wine-animals'>The Pour-ing of species</a> that is distributed under an MIT license. The code itself is based on <a href='https://github.com/the-pudding/svelte-starter' target='_blank'>The Pudding's SvelteKit starter template</a>. The visualizations use <a href='https://layercake.graphics/' target='_blank'>LayerCake</a> and <a href='https://d3js.org/' target='_blank'>D3.js</a>. The source code for this website is available <a href='https://github.com/formula-code/formula-code.github.io/'>here</a>, also under an MIT license.
[.opening]
text: Your codebase isn’t as fast as it used to be and you want to use an agent to optimize the code. You’ve got no preference for a model or agent framework, but you want it to work without any intervention. Which agent-model pair do you choose?
instructions: <span class=tap-click>Tap on a Model/Agent to select.</span> <span class=random-click>Just pick a random one for me.</span>
gpt5: You picked <span class=bold>Terminus 2 + GPT-5</span>. A conservative choice! GPT-5 often overlooks small optimizations in favor of large ones. It is best when you want to produce <em>module-level</em> optimizations. <span class=instructions>How do we know? Keep scrolling.</span>
claude: You picked <span class=bold>Terminus 2 + Claude Sonnet 4.0</span>. A reliable choice! Claude Sonnet 4.0 performs the best in our benchmarks at finding function-level and class-level optimizations, but fails on <em>module-level</em> optimizations. So you might need to give it a hand for large-scale tasks. <span class=instructions>How do we know? Keep scrolling.</span>
oracle: You picked a <span class=bold>Human</span>. At all levels, expert solutions consistently perform well, forming the basis of our comparative study.
gpt5Advantage: <span class=bold>GPT-5</span> has slightly outperformed humans on <em>module</em>-level performance, with a stratified advantage of 1.04.
claudeAdvantage: <span class=bold>Claude</span> has outperformed humans on <em>parameter</em>-level performance, with a stratified advantage of 1.04. However, on <em>module</em>-level performance, its advantage reverses to -0.04 against the human expert.
oracleAdvantage: <span class=bold>Humans</span> are the reference point, so they get an advantage score of 0 by default. This prevents a model from scoring well by simply reproducing the expert’s patch.
gpt5Quad: The <span class=selected-agent-circle-span>Terminus 2/GPT-5 pair you picked</span> falls into the superoptimization quadrant.
claudeQuad: The <span class=selected-agent-circle-span>Terminus 2/Claude Sonnet 4.0 pair you picked</span> falls into the superoptimization quadrant.
oracleQuad: When we compare against <span class=selected-agent-circle-span>Experts</span> on the other axis, the distribution follows the line of equal advantage and hence is in the no-optimization (over experts) zone.
[]
[+.steps]
Your codebase isn’t as fast as it used to be and you want to use an agent to optimize the code. You’ve got no preference for a model or agent framework, but you want it to work without any intervention. Which agent-model pair do you choose?
Couldn’t decide? Maybe this info will help: <span class=bold>Terminus 2 + GPT-5</span> has the highest advantage at producing <em>module-level</em> optimizations, but it often overlooks small optimizations. <span class=bold>Terminus 2 + Claude Sonnet 4.0</span> finds <em>function-level</em> optimizations pretty well, but it might not be the best for deep optimizations. <span class=instructions>How do we know? Keep scrolling.</span>
We scraped 110+ GitHub repositories with crowdsourced performance workloads and identified all pull requests that <em>intended</em> to improve the performance of a specific piece of code. Then, we measured the runtime of the repository before and after to see if the PR’s performance improvement was statistically significant.
After analyzing 1M+ PRs, we were able to identify 961 performance-improving tasks with over 1,472,080 total performance workloads across all tasks. For each of these problems, we asked a frontier LLM agent to optimize the code, given the same tools available to the human developers, and then measured the performance after rejecting optimizations that broke the code. <span class=instructions>Read more in the methodology.</span>
Here’s a cumulative distribution function (CDF) of the <em>speedup ratio</em> for each of our models. <span class=instructions>Hover over a model to see more details!</span> A CDF is essentially a running sum over the histogram; the <em>slower</em> the CDF line rises, the more benchmarks live in the faster region, and the better the model.
At first glance, it looks like our agents are doing pretty well! For <span class=bold>GPT-5</span> and <span class=bold>Claude Sonnet 4.0</span>, the curves have a lot of jagged bumps, and about 3-5% of all benchmarks are outliers where both models show extreme code optimizations. However, 75 to 80% of all benchmarks show only modest improvements, with a speedup of less than 10%.
Moreover, with a median of 81 benchmarks per task, good performance on a lone workload doesn’t tell us much about the <em>holistic</em> performance of such agents. What we really care about is whether models have a <em>consistent advantage</em> at optimizing code.
[]
[.+postIntro]
What emerges from the above analysis is that speedup alone doesn’t capture the full picture.
<em>Performance optimizations rarely have isolated effects</em>; an optimization in one part of the code could significantly slow down or speed up another part of the code.
Instead, we hypothesize that good performance optimizations produce a <span class=yellow-bold>stratified advantage</span>: an advantage that persists across various strata of a codebase (modules, classes, and functions), not just individual workloads. This requires reasoning about multiple workloads across multiple functionalities and target resources, and ensuring we <em>consistently</em> produce speedups.
To understand more, let’s dive deeper into the data.
[]
[.chartScroll]
[.+block]
Instead of looking at the expert-produced speedup and the model-produced speedup separately, let’s look at them together on a scatterplot.
The <span class=bold>Human Speedup</span> is on the y-axis here, so the better the human speedup, the closer it is to the top. And the <span class=bold>Model Speedup</span> is on the x-axis.
[]
[.+block]
Each data point represents a statistically significant workload captured in our benchmark.
The <span class=yellow-bold>highlighted workload</span> lies at position x=1.11 and y=1.38. That is, the human engineer optimized this workload to be 38% faster than the baseline while the agent’s optimization was only 11% faster.
The agent’s achievements are much less impressive now because the agent demonstrates no <strong>Advantage</strong> over the oracle.
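As a quick check of the arithmetic, the gap between the two speedups on this workload is 0.27 in favor of the human.
math: 1.38 - 1.11 = 0.27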
[]
[.+block]
So, where do the most impressive speedups lie? Let’s load the entire dataset and demarcate some regions of interest.
The identity line (y = x) depicts <strong>Equal advantage</strong>. For any workload on this line, an agent-written patch is as good as a human-written patch.
[]
[.+block]
Workloads that cause <em>slowdowns</em> will have a speedup less than 1.00x.
The <strong>No oracle speedup</strong> and <strong>No agent speedup</strong> lines, both drawn at 1.00, help visualize this.
Now, we have four regions of interest.
[]
[.+block]
The <span class=regression-span>Bottom left</span> region characterizes Regressions; these are all the workloads where the agent and the oracle both caused a <strong>Performance Regression</strong>.
This could be an intentional tradeoff, or just a tricky workload for both agents and humans.
[]
[.+block]
The <span class=sub-optimization-span>Top left</span> region shows sub-optimal benchmarks – the benchmarks where the oracle achieved a speedup but the agent caused a regression.
This is the worst region for an agent.
[]
[.+block]
The <span class=under-optimization-span>Top right</span> region shows under-optimized benchmarks – the agent still achieves some speedup but the expert-provided solution was much better.
Depending on resource priorities, a workload here may still be a worthwhile tradeoff.
[]
[.+block]
What we are really interested in are <span class=sweet-rect-span>Super optimizations</span> – these are the workloads where the agent produced optimizations that were better than the oracle optimizations and better than the baseline.
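As a rough sketch of these regions in symbols, write the agent speedup of a workload as <em>a</em> (x-axis) and the oracle speedup as <em>o</em> (y-axis). How points lying exactly on a boundary are assigned is an assumption here, not something the plot specifies.
math: \text{Regression: } a < 1,\ o < 1
math: \text{Sub-optimal: } a < 1,\ o > 1
math: \text{Under-optimized: } a > 1,\ o > a
math: \text{Super optimization: } a > 1,\ a > o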
[]
[.+block]
This allows us to define a notion of <strong>agent advantage</strong>. Mathematically, given two dimensionless vectors depicting the oracle speedups and the agent speedups:
math: \text{oracle-speedup} = \mathbf{o}_{1:N}
math: \text{agent-speedup} = \mathbf{a}_{1:N}
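We can define a metric for the overall performance by calculating the average signed gap between the agent and oracle speedups, that is, how far each point sits from the equal advantage line:
math: \text{advantage} = \frac{1}{N} \sum_{i=1}^{N} (a_i - o_i)
Intuitively, the closer a point is to the equal advantage line, the closer its contribution is to zero; points where the agent beats the oracle contribute positively.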
[]
[.+block]
What if an agent tries to <span class=bold>mimic the Human</span>’s steps?
Unsurprisingly, all the points lie on the equal advantage line. This means that any agent simply replicating a memorized solution would get an advantage of 0.0.
[]
[.+block]
Here’s the <span class=bold>Human v/s Claude</span> plot.
Most benchmarks are either super optimal or under optimal!
Claude’s advantage score here is 0.0749, which means Claude does <em>slightly</em> better than the expert on these problems.
[]
[.+block]
The <span class=bold>Human v/s GPT-5</span> comparison is similar.
We see a few superoptimizations but mostly suboptimizations.
GPT-5’s advantage score is -0.02. So, it’s slightly worse off than humans.
[]
[.+block]
This is surprising. Is Claude truly better than GPT-5 and humans?
This is a good time to talk about our <span class=yellow-bold>grouping scheme</span>.
In the <span class="highlight-grouping">bottom left corner</span>, notice that the current data points aren’t being aggregated. So, we’re still looking at <em>singular</em> workloads.
To investigate the <em>holistic</em> optimization abilities, we can group workloads together based on their prefix strings (e.g., aggregating all workloads under <code>pandas.algorithm.*</code>).
[]
[.+block]
This is the same <span class=bold>Human v/s Claude</span> plot but aggregated on <span class=bold>Modules</span>.
The oracle’s performance increases significantly and most of Claude’s optimizations disappear! The new advantage score is now <span class=yellow-bold>-0.0002</span>.
So, Claude’s aggregate performance optimization capabilities are much weaker than its individual performance optimization capabilities.
With the same aggregation, GPT-5’s advantage score is 0.0034. <em>Their advantage flipped</em>.
[]
[.+block]
But all this is conditioned on our definition of what counts as equal advantage. What if the minimum acceptable speedup is different?
<span class=instructions><span class=slider-span>Use the sliders</span> to set your own criteria for equal advantage, and keep scrolling to see a model-by-model breakdown based on your selection.</span>
[]
[]
[+.postScatter]
Use the nav boxes to navigate through all the model groups.
[]
overview.benchmarkDesign.title: Benchmark Design
[overview.benchmarkDesign.paragraphs]
* Each FormulaCode task evaluates the ability of an agent to optimize a real-world codebase under strict correctness constraints. A task begins with a baseline repository, which represents the unmodified implementation. The agent operates on the baseline and produces a modified version of the repository by making arbitrary repository-level edits.
* Performance evaluation proceeds by executing the full set of workloads on both the baseline and the agent-modified code and comparing their measured outcomes. Improving performance on one workload may degrade performance on others. As a result, optimization in FormulaCode is inherently multi-objective: agents must reason about trade-offs across subsystems and deliver improvements that are broad and consistent rather than localized to a single execution path.
[]
overview.datasetConstruction.title: Dataset Construction
overview.datasetConstruction.intro: FormulaCode consists of multi-workload real-world code optimization problems from 70 repositories. We developed an automated four-stage pipeline that extracts these problems:
[overview.datasetConstruction.steps]
title: 1. Repository Scraping
description: We crawl GitHub repositories with high-quality expert-defined performance workloads.
title: 2. Attribute Filtering
description: We filter out candidate pull requests where the primary intent was not performance-related, using rule-based and LLM-based filters.
title: 3. Environment Synthesis
description: We synthesize environment building scripts using a reflexive LLM agent so that terminal interface tools function correctly.
title: 4. Statistical Validation
description: We filter out all candidate PRs that do not show a statistically significant improvement on the performance workloads.
[]
overview.keyFindings.title: Key Findings
[overview.keyFindings.findings]
title: Agents Improve Runtime but Underperform Experts
description: Agents can generally improve runtime performance, but perform worse than human experts.
title: Local vs. Global Optimization
description: Agents are better at local or function-level optimization than at repository-level optimization.
title: Optimization Strategy Strengths
description: Agents excel at using specific optimization strategies (e.g., parallelizing or batching) and struggle with others (e.g., vectorized operations).
title: Long-Tail Repository Performance
description: Agent performance relative to experts can vary dramatically by popularity of the repository, performing worst on the 4th quintile and best on the 2nd quintile.
title: Cost Efficiency
description: Despite being more expensive per call, agents using frontier LLMs are overall more cost-effective than those using open-weight models.
title: Multi-Workload Tradeoffs
description: Compared to human experts, agents make less favorable performance-cost trade-off decisions.
[]
[overview.landingSections]
title: How does FormulaCode find code optimization tasks?
linkHref: /docs/
linkLabel: Datasmith Documentation ↗
title: Abstract
linkLabel: Read the paper ↗
title: Dataset Statistics
linkHref: https://data.formulacode.org/
linkLabel: Data explorer ↗
captionTemplate: FormulaCode is updated monthly. Last refreshed on {date}. For the latest statistics, visit <a href="https://data.formulacode.org/">data.formulacode.org</a>.
title: Key Findings
linkLabel: Read the paper ↗
title: Leaderboard at a glance
linkHref: /leaderboard/
linkLabel: Full leaderboard ↗
title: Contribute
linkLabel: Join the FormulaCode Discord ↗
caption: FormulaCode is a living code optimization benchmark. Help us cover the long tail by opening a request for a particular data source or a particular model.
[]
overview.compactLeaderboard.title: Compact Leaderboard
overview.compactLeaderboard.buttonText: View Full Leaderboard
overview.submit.title: Don't see your model? Submit it!
overview.submit.repoUrl: https://github.com/formula-code/fc-eval
overview.submit.instructions: To evaluate an agent on FormulaCode, follow the <a href="https://github.com/formula-code/fc-eval">installation instructions</a> and run:
overview.submit.command: $ fc-eval run --dataset formulacode --config [your-config.json]
leaderboardPage.title: FormulaCode Leaderboard
leaderboardPage.global.title: Global Leaderboard
leaderboardPage.stratified.title: Stratified Leaderboard
leaderboardPage.stratified.description: Performance broken down by optimization scope: <strong>L1</strong> (Function), <strong>L2</strong> (Class), <strong>L3</strong> (Module).
leaderboardPage.submit.title: Submit Your Model
leaderboardPage.submit.description: To evaluate your own agent on FormulaCode, follow our installation guide.
leaderboardPage.submit.buttonText: Get Started
leaderboardPage.submit.repoUrl: https://github.com/formula-code/fc-eval
inlineLeaderboard.title: Leaderboard
inlineLeaderboard.description: This leaderboard displays the agent advantage scores by aggregation level. Higher scores indicate better performance relative to the Oracle.
inlineLeaderboard.instructions: Use the thresholding filters above and see how they change the leaderboard.