This document tracks data we'd like access to so the website can ship the features users (and reviewers) actually want. Each item lists what's needed, which page benefits, and what we ship without it.
src/data/website_data_lite.csv currently contains workloads from 3
task_ids across 3 repositories (astropy, pandas, scikit-learn). The paper
claims 957 performance bottlenecks across 70+ repos. Closing that gap is
the single biggest wishlist item.
- Used by: /explorer/, scrollytelling chart, leaderboard recompute.
- Ideal shape: the existing CSV schema (id, level, agent_id, agent/nop, oracle/nop, task_id, agent_recording, repo_name, benchmark_name) with all ~1.4M workloads spanning all 957 tasks.
- Acceptable interim: a representative slice of one task per repo (≈70 rows of (repo, task_id) pairs), enough that filter chips on the explorer feel populated.
- Without it: the explorer reads as a "tiny preview slice." The "Browse N performance workloads" headline is honest but underwhelming.
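
A rough sketch of how the interim slice could be picked, assuming the CSV exposes the repo_name and task_id columns from the schema above. The helper name one_task_per_repo, the pandas and scikit-learn task IDs, and the repo_name format are made up for illustration; the real export may differ.

```python
import csv
import io


def one_task_per_repo(rows):
    """Return the first (repo_name, task_id) pair seen for each repo, sorted by repo."""
    picked = {}
    for row in rows:
        # setdefault keeps the first task_id encountered for a repo.
        picked.setdefault(row["repo_name"], row["task_id"])
    return sorted(picked.items())


# Tiny in-memory stand-in for the real CSV; values are illustrative only.
sample = io.StringIO(
    "id,task_id,repo_name\n"
    "1,astropy_astropy_21,astropy/astropy\n"
    "2,astropy_astropy_22,astropy/astropy\n"
    "3,pandas_example_1,pandas-dev/pandas\n"
    "4,sklearn_example_1,scikit-learn/scikit-learn\n"
)
pairs = one_task_per_repo(csv.DictReader(sample))
for repo, task_id in pairs:
    print(repo, task_id)
```

Taking the first task per repo is the simplest possible rule; a real slice might prefer the task with the most workloads instead.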
Already noted in CLAUDE.md — the time-travel slider on /leaderboard/
stays dormant until each row carries a merged_at ISO 8601 string. The
ranking engine (src/utils/rankingEngine.js) auto-activates once present.
- Used by: /leaderboard/ time-travel slider.
- Ideal shape: add a merged_at column to website_data.csv and propagate via process_remote_data.py → LITE_COLUMNS.
- Without it: slider hidden, leaderboard is static.
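
To make the slider's data need concrete, here is a minimal Python sketch of filtering rows by a cutoff date. It is not the logic in src/utils/rankingEngine.js; the function names parse_merged_at and rows_as_of are hypothetical, and it assumes merged_at always carries a timezone (for example a trailing "Z").

```python
from datetime import datetime, timezone


def parse_merged_at(value):
    """Parse an ISO 8601 timestamp such as 2024-03-15T12:34:56Z."""
    # fromisoformat() accepts a trailing "Z" only on Python 3.11+, so normalize it.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def rows_as_of(rows, cutoff):
    """Keep rows merged at or before cutoff; rows without merged_at are dropped."""
    return [
        row
        for row in rows
        if row.get("merged_at") and parse_merged_at(row["merged_at"]) <= cutoff
    ]


# Illustrative rows; only the merged_at values matter here.
rows = [
    {"task_id": "task_a", "merged_at": "2024-03-15T12:34:56Z"},
    {"task_id": "task_b", "merged_at": "2025-01-01T00:00:00Z"},
    {"task_id": "task_c", "merged_at": ""},
]
cutoff = datetime(2024, 12, 31, tzinfo=timezone.utc)
visible = rows_as_of(rows, cutoff)
print([row["task_id"] for row in visible])
```

Dropping rows with no merged_at is one possible policy; keeping them always visible would be equally defensible.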
MathNet's explorer shows the problem statement. FormulaCode's explorer
currently shows the benchmark harness (the time_* function being
measured). What the audience actually wants to see is the optimization
patch — what the human expert did, and what the agent did, side by side.
- Used by: /explorer/ workload drawer (new "Patch" section), eventually a per-task page.
- Ideal shape: for each task_id, two unified diffs:
  - oracle_patch: the merged human PR's patch
  - agent_patches[<agent_id>]: the patch the agent produced (per agent we ran)

  Plus optionally the full pre/post file contents for syntax-highlighted side-by-side rendering.
- Acceptable interim: even just oracle_patch per task gets us 80% of the way there.
- Without it: the drawer's value is capped at "here's the harness and here are the speedups." No insight into why the agent won/lost.
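
If only pre/post file contents are available, a unified diff can be derived from them. The sketch below uses Python's standard difflib; the helper name make_unified_diff, the file path, and the before/after snippets are invented for illustration and do not come from any real task.

```python
import difflib


def make_unified_diff(before, after, path):
    """Build a unified diff string from two versions of one file."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(lines)


# Made-up example: a manual loop replaced by the built-in sum().
before = "def total(xs):\n    out = 0\n    for x in xs:\n        out += x\n    return out\n"
after = "def total(xs):\n    return sum(xs)\n"
oracle_patch = make_unified_diff(before, after, "example.py")
print(oracle_patch)
```

Storing the diff string directly is still preferable, since it preserves exactly what was merged.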
Each task_id should link back to its origin so users can verify and dig
deeper.
- Used by: /explorer/ workload drawer, future per-task page.
- Ideal shape: a tasks.json keyed by task_id:

      {
        "astropy_astropy_21": {
          "pr_url": "https://github.com/astropy/astropy/pull/12345",
          "pr_title": "Speed up Time initialization with units",
          "merged_at": "2024-03-15T12:34:56Z",
          "summary": "Caches unit conversion in Time constructor to avoid …",
          "optimization_strategy": "caching",
          "files_changed": 3,
          "lines_added": 42,
          "lines_removed": 18
        }
      }

- Without it: the explorer can identify a task by ID but can't say what the task is in plain English.
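
A small sketch of how the drawer could fall back gracefully while tasks.json is incomplete. The function name task_label is hypothetical, and the sample entry simply reuses the illustrative values from the shape above.

```python
def task_label(tasks, task_id):
    """Return 'PR title (strategy)' when metadata exists, else the bare task ID."""
    meta = tasks.get(task_id)
    if not meta:
        return task_id
    title = meta.get("pr_title", task_id)
    strategy = meta.get("optimization_strategy")
    return f"{title} ({strategy})" if strategy else title


# Illustrative metadata mirroring the example shape; not real PR data.
tasks = {
    "astropy_astropy_21": {
        "pr_title": "Speed up Time initialization with units",
        "optimization_strategy": "caching",
    }
}
print(task_label(tasks, "astropy_astropy_21"))
print(task_label(tasks, "unknown_task_1"))
```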
To support the "browse by repo" and any "performance by repo popularity" analysis (the paper mentions a 4th-quintile finding), each repo needs basic metadata.
- Used by: /explorer/ repo chips, future repo-detail page.
- Ideal shape: repos.json:

      {
        "astropy_astropy": {
          "owner": "astropy",
          "repo": "astropy",
          "stars": 4400,
          "language": "Python",
          "description": "Astronomy and astrophysics core library",
          "n_tasks": 21,
          "topic": "scientific-computing"
        }
      }

- Without it: repos render slug-only as astropy/astropy, which is fine, but the page is less informative than MathNet's "Browse by competition" panel.
The paper's cost-efficiency finding ("frontier LLMs are overall more cost effective than open weights models") is a key result that has no home on the site yet.
- Used by: new "Cost vs. advantage" chart on /leaderboard/, agent cards.
- Ideal shape: add to the per-row CSV: agent_cost_usd, agent_input_tokens, agent_output_tokens, agent_wallclock_seconds.
- Without it: we can't surface the cost story visually.
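
Once the cost columns exist, the chart's x-axis is a simple per-agent aggregation. The sketch below assumes rows arrive as dicts of strings (as csv.DictReader would yield) and that blank cost cells should be skipped; the function name cost_by_agent, the second agent ID, and all dollar values are made up.

```python
from collections import defaultdict


def cost_by_agent(rows):
    """Sum agent_cost_usd per agent_id, ignoring rows with no cost recorded."""
    totals = defaultdict(float)
    for row in rows:
        cost = row.get("agent_cost_usd")
        if cost not in (None, ""):
            totals[row["agent_id"]] += float(cost)
    return dict(totals)


# Illustrative rows; the costs are not real measurements.
rows = [
    {"agent_id": "terminus-2,gpt-5", "agent_cost_usd": "0.5"},
    {"agent_id": "terminus-2,gpt-5", "agent_cost_usd": "0.25"},
    {"agent_id": "example-agent,example-model", "agent_cost_usd": ""},
    {"agent_id": "example-agent,example-model", "agent_cost_usd": "1.0"},
]
totals = cost_by_agent(rows)
print(totals)
```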
The paper finds agents excel at parallelization/batching and struggle with vectorization. Surfacing these tags would let users filter the explorer by strategy.
- Used by: /explorer/ optional "Strategy" filter chip; Strategy Explorer section on the landing page (a ToolGrid-style 6–8 column grid, one column per category, tasks listed inside each cell with ✓/✗ for whether the agent matched the human's strategy).
- Ideal shape: per-task labels (["caching", "vectorization", "io"]) on the human PR. Categories from the paper: caching, vectorization, parallelization, batching, memory, io, algorithm, data-structure.
- Pairs well with: wishlist #3 (per-task patches). The Strategy Explorer becomes far more interesting if clicking a cell opens a drawer with the representative diff hunk for that (strategy, agent) cell; even one ~10-line snippet per cell is enough to read as "this is what vectorization looks like in pandas."
- Without it: strategy taxonomy lives only in the paper, not the site.
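
The grid described above can be sketched as a mapping from category to (task, matched) pairs. Note the assumption: the ✓/✗ needs strategy labels for the agent's patch too, which this wishlist item does not ask for explicitly. The function name build_strategy_grid and the task IDs are hypothetical.

```python
CATEGORIES = [
    "caching", "vectorization", "parallelization", "batching",
    "memory", "io", "algorithm", "data-structure",
]


def build_strategy_grid(human_labels, agent_labels):
    """Map each category to (task_id, matched) pairs, one column per category.

    matched is True when the agent's patch carries the same label as the human PR.
    agent_labels is an assumed input, not something the dataset provides today.
    """
    grid = {category: [] for category in CATEGORIES}
    for task_id, labels in human_labels.items():
        used_by_agent = set(agent_labels.get(task_id, []))
        for label in labels:
            if label in grid:  # silently skip labels outside the taxonomy
                grid[label].append((task_id, label in used_by_agent))
    return grid


# Illustrative labels only.
human_labels = {"task_a": ["caching", "vectorization"], "task_b": ["io"]}
agent_labels = {"task_a": ["caching"]}
grid = build_strategy_grid(human_labels, agent_labels)
for category, cells in grid.items():
    marks = ", ".join(f"{task} {'✓' if ok else '✗'}" for task, ok in cells)
    print(f"{category}: {marks}")
```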
The leaderboard currently lists agents as flat IDs (e.g.
terminus-2,gpt-5). For an Agent Explorer patterned on ccunpacked.dev's
slash-command catalog, we need to group them by family.
- Used by: new Agent Explorer section on the landing page (pill grid grouped by agent family, each pill showing the agent's signature strength and a cost-tier badge); future /agents/ per-agent page.
- Ideal shape: an agents.json keyed by agent_id:

      {
        "terminus-2,gpt-5": {
          "agent_family": "Terminus 2",
          "model_family": "GPT",
          "model": "gpt-5",
          "provider": "OpenAI",
          "cost_tier": "frontier",
          "open_weights": false,
          "signature_strength": "module-level optimization",
          "color_category": "frontier-closed"
        }
      }

- Pairs well with: wishlist #6 (per-task cost). Cost tier in the taxonomy plus per-task cost in the CSV unlocks the paper's "frontier vs. open-weights cost-effectiveness" finding as a visual.
- Without it: agents stay as opaque IDs; we can't surface the family-level story (Terminus + frontier-LLM vs. Aider + open-weights, etc.).
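
Grouping for the pill grid is straightforward once agents.json exists. A minimal sketch, assuming every entry carries agent_family; the function name group_by_family and every agent ID other than terminus-2,gpt-5 are placeholders.

```python
from collections import defaultdict


def group_by_family(agents):
    """Group agent IDs by agent_family, with families and IDs sorted for stable rendering."""
    groups = defaultdict(list)
    for agent_id, meta in agents.items():
        groups[meta["agent_family"]].append(agent_id)
    return {family: sorted(ids) for family, ids in sorted(groups.items())}


# Illustrative taxonomy; only the first ID appears in the current leaderboard text.
agents = {
    "terminus-2,gpt-5": {"agent_family": "Terminus 2", "cost_tier": "frontier"},
    "terminus-2,example-model": {"agent_family": "Terminus 2", "cost_tier": "open"},
    "example-agent,example-model": {"agent_family": "Example Agent", "cost_tier": "open"},
}
families = group_by_family(agents)
for family, ids in families.items():
    print(family, ids)
```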
The paper has ~6 sharply phrased findings (local vs. global optimization,
strategy strengths, long-tail repository performance, cost efficiency, …).
copy.json already stores these as {title, description} pairs, but that's
just prose. To render them as a Findings cards grid (the
HiddenFeatures pattern from ccunpacked), each finding needs a category, a
headline metric, and a link to where in the site that finding is visually
demonstrated.
- Used by: new Findings section — tinted cards with category color, one-line description, headline metric chip, "View analysis ↗" link.
- Ideal shape: extend copy.json overview.keyFindings.findings[]:

      {
        "title": "Local vs. Global Optimization",
        "description": "Agents are better at local or function-level …",
        "category": "scope",
        "metric": { "label": "L4 advantage", "value": -0.04 },
        "link": "/leaderboard/?level=L4"
      }

- Without it: findings stay as plain prose cards: readable but inert, with no visual emphasis on the actual numbers and no path from "claim" to "evidence."
- Submission flow / new agent uploads. Following MathNet's lead, we don't need a backend for this. Users submit via the documented fceval CLI; we ingest results into the static dataset.
- User accounts, comments, voting. Not relevant.
- Real-time leaderboard. Cadence is paper/release, not minutes.