| 1 | +# Parity verification: paper ↔ website |
| 2 | + |
| 3 | +This document tracks whether each finding rendered on the FormulaCode landing page faithfully reproduces its corresponding artifact (figure or table) in the paper. It is intended as a maintenance reference: when something looks off, this is where you look first. |
| 4 | + |
| 5 | +The website is meant to be a **mirror** of the paper's results, not a re-derivation. Every site-side finding should trace back to a specific notebook in `_repos/fc-eval/analysis/`. For figures, that notebook writes a specific PDF asset in `atharvas/formulacode-paper`, which a specific `.tex` wrapper pulls in via `\includegraphics{...}`; tables have no PDF asset, so the chain runs from the notebook to the `.tex` table. If any of those links break, the finding is out of parity. |
| 6 | + |
| 7 | +## Status at a glance |
| 8 | + |
| 9 | +Legend: ✅ = matches paper · ⚠️ = partial / known divergence · ❌ = wrong (needs fix) |
| 10 | + |
| 11 | +| # | Site finding | Paper artifact | Status | What still differs | |
| 12 | +|---|---|---|---|---| |
| 13 | +| F1 | Agents improve runtime but underperform experts | Table 1 (Global Leaderboard) | ✅ | Baseline `speedup_geomean` (1.1193) now populated; RP ranking is what `task.compute_leaderboard` returns — methodologically the same as the paper. | |
| 14 | +| F2 | Local vs global optimization | Figure 3 (Stratified advantage) | ✅ | Module-level standout for OpenHands+Claude matches paper. | |
| 15 | +| F3 | Optimization strategy strengths | Table 2 (Per-Tag advantage) | ✅ | 9 of 14 tags populated; the remaining 5 (`approximation, scale, db, io, uncategorized`) have zero tasks classified in `filtered_formulacode-verified.parquet` — matches paper's actual distribution. Headline tags (parallelization, batching, lower_level) are all present. | |
| 16 | +| F4 | Long-tail repository performance | Table 3 (Repo popularity quintiles) | ✅ | Per-task stars sourced from `task.metadata["pr_base_stargazers_count"]`; 39/40 cells populated. Q2 best / Q4 weak matches paper. | |
| 17 | +| F5 | Cost efficiency | Figure 4 (Cost-Performance Pareto) | ✅ | Frontier line + halos; Claude at expensive-end of Pareto matches. | |
| 18 | +| F6 | Multi-workload tradeoffs | Figure 5 (Multi-workload tradeoff) | ✅ | Re-exported from plain `multi_objective_analysis.ipynb` (excludes failed agents). 226 rows = 188 agent + 38 expert. | |
| 19 | +| F7 | Temporal generalization | Table 4 (Temporal analysis) | ⚠️ | Schema now matches paper Table 4's 6-bin layout (`6+ mo before` … `6+ mo after`) and Qwen 3 Coder is excluded — 3 models × 6 bins. **11 of 18 cells populated** vs paper's 18/18; the 7 nulls reflect the upstream task set being sparser per bin than the paper's run. Will fill in as more tasks ingest. | |
| 20 | + |
| 21 | +Open upstream issue tracking the divergences: [`formula-code/fc-eval#19`](https://github.com/formula-code/fc-eval/issues/19). |
| 22 | + |
| 23 | +## Per-finding details |
| 24 | + |
| 25 | +### F1 — Table 1 (Global Leaderboard) |
| 26 | + |
| 27 | +| | | |
| 28 | +|---|---| |
| 29 | +| Paper section | `sections/experiments/tables/tab-advantage-leaderboard.tex` | |
| 30 | +| Paper asset | _(LaTeX table — no PDF)_ | |
| 31 | +| Canonical notebook | `_repos/fc-eval/analysis/leaderboard.ipynb` | |
| 32 | +| API table | `findings_global_leaderboard` | |
| 33 | +| Website component | `src/components/sections/findings/F1_GlobalLeaderboard.svelte` | |
| 34 | +| Cached JSON | `src/data/findings/f1_leaderboard.json` | |
| 35 | +| Render style | Heatmap table — RP rank, agent, model, advantage (diverging RdBu @ 0), speedup geomean (diverging RdBu @ 1×) | |
| 36 | + |
| 37 | +**Parity check:** Baseline row's `speedup_geomean` now shows **1.1193×** (was previously `null`), so the diverging color scale on the Speedup column has a center to read against. The RP ordering is what `analysis.task.compute_leaderboard` returns — that's the paper's methodology (Ranked Pairs voting), not advantage-sorted. The earlier-flagged "Claude at rp_rank=1" expectation came from a stale scaffold that had been sorted on advantage rather than computed via RP. |
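The Speedup column is centered at 1× because a geometric mean treats a 2× speedup and a 2× slowdown symmetrically. Below is a minimal sketch of that aggregation, assuming `speedup_geomean` is a plain geometric mean over per-task speedup ratios. The helper name `geomean` and the input values are illustrative and not taken from `compute_leaderboard`.

```python
import math


def geomean(speedups):
    """Geometric mean of strictly positive speedup ratios."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive ratios")
    # Average in log space, then map back to a ratio.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Made-up ratios for illustration only: a 2x win and a 2x regression cancel out.
balanced = geomean([2.0, 0.5])
print(round(balanced, 6))
```

A value above 1× therefore means the wins outweigh the regressions multiplicatively, which is what the diverging color scale encodes.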
| 38 | + |
| 39 | +### F2 — Figure 3 (Stratified advantage) |
| 40 | + |
| 41 | +| | | |
| 42 | +|---|---| |
| 43 | +| Paper section | `sections/experiments/tables/fig-agg-advantage-ladders.tex` | |
| 44 | +| Paper asset | `figures/assets/agg_ladders.pdf` | |
| 45 | +| Canonical notebook | `_repos/fc-eval/analysis/figure_2_ladders.ipynb` | |
| 46 | +| API table | `findings_stratified_advantage` | |
| 47 | +| Website component | `src/components/sections/findings/F2_StratifiedAdvantage.svelte` | |
| 48 | +| Cached JSON | `src/data/findings/f2_stratified.json` | |
| 49 | +| Render style | Slope chart, straight segments, color by model + dash by harness (OpenHands dashed, Terminus solid), legend below | |
| 50 | + |
| 51 | +**Level naming gotcha:** The API encodes `level1` = Function, `level2` = Class, `level3` = Module, `level4` = overall. The paper's ℓ∈{1,2,3} is coarse→fine (Module→Class→Function), so the website remaps `[level3→Module, level2→Class, level1→Function]` in `f2_stratified.json`. If the API renames its columns, update `fetch_f2_stratified` in `tasks/process_remote_data.py`. |
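The remap above can be sketched as follows. This is a hypothetical illustration, assuming each API row is a dict keyed by `level1` … `level4`; the function name `remap_levels`, the row shape, and the example values are assumptions, and the real logic lives in `fetch_f2_stratified`.

```python
# API level key -> paper label, listed coarse -> fine as in the paper.
# `level4` (overall) is deliberately left out of this sketch.
LEVEL_LABELS = {
    "level3": "Module",
    "level2": "Class",
    "level1": "Function",
}


def remap_levels(row):
    """Turn one API row into points ordered Module -> Class -> Function.

    Raises KeyError if a level column has been renamed upstream, so a
    rename fails loudly instead of silently dropping a level.
    """
    return [
        {"level": label, "advantage": row[api_key]}
        for api_key, label in LEVEL_LABELS.items()
    ]


# Placeholder numbers, not real results.
example = {"level1": 0.05, "level2": 0.10, "level3": 0.30, "level4": 0.12}
points = remap_levels(example)
print([p["level"] for p in points])
```

Keeping the mapping in one dict means an API rename touches a single place.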
| 52 | + |
| 53 | +**Visual sanity:** OpenHands + Claude 4.0 Sonnet should have a noticeably high left endpoint at Module (~+0.30). Other configs should slope flat or rise from Module → Function. Matches the paper's caption. |
| 54 | + |
| 55 | +### F3 — Table 2 (Per-Tag advantage) |
| 56 | + |
| 57 | +| | | |
| 58 | +|---|---| |
| 59 | +| Paper section | `sections/experiments/tables/tab-tag-analysis.tex` | |
| 60 | +| Paper asset | _(LaTeX table — no PDF)_ | |
| 61 | +| Canonical notebook | `_repos/fc-eval/analysis/table_8_tags.ipynb` | |
| 62 | +| API table | `findings_tag_advantage` | |
| 63 | +| Website component | `src/components/sections/findings/F3_StrategyAdvantage.svelte` | |
| 64 | +| Cached JSON | `src/data/findings/f3_tags.json` | |
| 65 | +| Render style | Heatmap table — agent × tag matrix, diverging RdBu @ 0 | |
| 66 | + |
| 67 | +**Parity check:** The exporter now reads from `task.metadata["classification"]` (the same per-PR enum `analysis/table_8_tags.ipynb` uses), so the website table carries 9 tag keys — `parallelization, batching, caching, algorithmic, data_structure, reduce_work, higher_level, micro, lower_level`. The five paper tags missing from the API (`approximation, scale, db, io, uncategorized`) have zero tasks classified into them in `filtered_formulacode-verified.parquet`; that matches Table 2's actual sparsity. The website renders those columns as `—` and otherwise mirrors the paper. |
| 68 | + |
| 69 | +### F4 — Table 3 (Long-tail by repo popularity) |
| 70 | + |
| 71 | +| | | |
| 72 | +|---|---| |
| 73 | +| Paper section | `sections/experiments/tables/tab-long-tail.tex` | |
| 74 | +| Paper asset | _(LaTeX table — no PDF)_ | |
| 75 | +| Canonical notebook | `_repos/fc-eval/analysis/table_9_longtail.ipynb` | |
| 76 | +| API table | `findings_repo_quintiles` | |
| 77 | +| Website component | `src/components/sections/findings/F4_RepoQuintiles.svelte` | |
| 78 | +| Cached JSON | `src/data/findings/f4_longtail.json` | |
| 79 | +| Render style | Heatmap table — agent × Q1–Q5 matrix, diverging RdBu @ 0 | |
| 80 | + |
| 81 | +**Visual sanity:** Q2 column mostly positive (best quintile), Q3 / Q4 columns mostly negative (worst quintiles). Matches the paper's claim: "performance varies dramatically by repository popularity; worst in Q4, best in Q2." |
| 82 | + |
| 83 | +### F5 — Figure 4 (Cost-Performance Pareto) |
| 84 | + |
| 85 | +| | | |
| 86 | +|---|---| |
| 87 | +| Paper section | `sections/experiments/tables/fig-cost-advantage.tex` | |
| 88 | +| Paper asset | `figures/assets/cost_vs_performance.pdf` | |
| 89 | +| Canonical notebook | `_repos/fc-eval/analysis/table_6_cost.ipynb` | |
| 90 | +| API table | `findings_cost_pareto` | |
| 91 | +| Website component | `src/components/sections/findings/F5_CostPareto.svelte` | |
| 92 | +| Cached JSON | `src/data/findings/f5_cost.json` | |
| 93 | +| Render style | Scatter, x = cost-weighted advantage, y = cost (USD, log scale), Pareto points get a halo + dashed frontier line connecting them in cost-ascending order | |
| 94 | + |
| 95 | +**Visual sanity:** OpenHands + Claude 4.0 Sonnet should be the most expensive point on the Pareto frontier. Cheaper Pareto points (Gemini, Qwen, GPT-5) should sit beneath. Matches paper Figure 4 caption. |
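To make the frontier check concrete, here is a minimal sketch of how Pareto points could be selected and ordered by ascending cost. It assumes lower cost and higher advantage are both better; the function name `pareto_frontier`, the tuple layout, and the sample points are illustrative and not taken from the exporter or the Svelte component.

```python
def pareto_frontier(points):
    """Return the non-dominated points, sorted by ascending cost.

    points: iterable of (name, cost_usd, advantage) tuples.
    A point is dominated if another point costs no more and has an
    advantage at least as high, with at least one of the two strictly better.
    """
    # Cheapest first; among equal costs, highest advantage first.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier = []
    best_advantage = float("-inf")
    for name, cost, advantage in ordered:
        # Keep a point only if it beats every cheaper (or equal-cost) point.
        if advantage > best_advantage:
            frontier.append((name, cost, advantage))
            best_advantage = advantage
    return frontier


# Placeholder configurations and numbers, not real results.
sample = [
    ("a", 1.0, -0.05),
    ("b", 2.0, -0.10),
    ("c", 3.0, 0.02),
    ("d", 3.0, 0.01),
    ("e", 10.0, 0.05),
]
frontier = pareto_frontier(sample)
print([name for name, _, _ in frontier])
```

Under this definition the last frontier point is always the most expensive one, which is the position the check above expects OpenHands + Claude 4.0 Sonnet to occupy.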
| 96 | + |
| 97 | +### F6 — Figure 5 (Multi-workload tradeoff) |
| 98 | + |
| 99 | +| | | |
| 100 | +|---|---| |
| 101 | +| Paper section | `sections/experiments/tables/fig-optimization-tradeoff.tex` | |
| 102 | +| Paper asset | `figures/assets/tradeoff.pdf` | |
| 103 | +| Canonical notebook | `_repos/fc-eval/analysis/multi_objective_analysis.ipynb` | |
| 104 | +| API table | `findings_workload_tradeoff` | |
| 105 | +| Website component | `src/components/sections/findings/F6_MultiWorkloadTradeoff.svelte` | |
| 106 | +| Cached JSON | `src/data/findings/f6_tradeoff.json` | |
| 107 | +| Render style | Scatter, x = worst-workload speedup, y = global speedup, y=x identity line, expert points (`is_expert=true`) drawn distinctly | |
| 108 | + |
| 109 | +**Parity check:** Upstream re-exported from the plain `multi_objective_analysis.ipynb` (using `iter_successful_agent_configs(...)` + `_extract_per_benchmark_speedups`) so failed agents are excluded rather than imputed with `speedup=1.0`. Row split is now **188 agent + 38 expert = 226 rows** (was 304 under the imputed variant). The expert count is intentionally lower than the agent count: deduplicating repeated `tasks.txt` entries leaves one row per PR per role. |
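The difference between excluding and imputing failed runs, plus the dedupe step, can be sketched like this. The row shape (`pr`, `role`, `success`) and the function name `tradeoff_rows` are assumptions made for illustration; they are not the notebook's actual helpers, and `role` here stands for whatever identifies the row's producer (an agent configuration or the expert).

```python
def tradeoff_rows(runs):
    """Keep successful runs only and collapse duplicates to one row per (pr, role)."""
    seen = set()
    rows = []
    for run in runs:
        if not run["success"]:
            # Excluded outright; the imputed variant would have kept the
            # run and set its speedup to 1.0 instead.
            continue
        key = (run["pr"], run["role"])
        if key in seen:
            # The same PR listed twice yields a single row per role.
            continue
        seen.add(key)
        rows.append(run)
    return rows


# Placeholder runs, not real results.
runs = [
    {"pr": "pr-1", "role": "expert", "success": True},
    {"pr": "pr-1", "role": "expert", "success": True},   # duplicate entry
    {"pr": "pr-1", "role": "agent-x", "success": True},
    {"pr": "pr-2", "role": "agent-x", "success": False},  # failed, dropped
]
kept = tradeoff_rows(runs)
print(len(kept))
```

Dropping failed runs shrinks the row count, which is consistent with the total falling from 304 to 226.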
| 110 | + |
| 111 | +### F7 — Table 4 (Temporal generalization) |
| 112 | + |
| 113 | +| | | |
| 114 | +|---|---| |
| 115 | +| Paper section | `sections/experiments/tables/tab-temporal.tex` | |
| 116 | +| Paper asset | _(LaTeX table — no PDF)_ | |
| 117 | +| Canonical notebook | `_repos/fc-eval/analysis/figure_1_temporal.ipynb` | |
| 118 | +| API table | `findings_temporal_generalization` | |
| 119 | +| Website component | `src/components/sections/findings/F7_TemporalGeneralization.svelte` | |
| 120 | +| Cached JSON | `src/data/findings/f7_temporal.json` | |
| 121 | +| Render style | Heatmap table — model × 6 paper-Table-4 bins (`6+ mo before` / `3–6 mo before` / `0–3 mo before` / `0–3 mo after` / `3–6 mo after` / `6+ mo after`), sequential blues on speedup | |
| 122 | + |
| 123 | +**Parity check:** Upstream re-shipped `findings_temporal_generalization` with the paper Table 4 binning: six bins (`pre6plus`, `pre3to6`, `pre0to3`, `post0to3`, `post3to6`, `post6plus`), of which the four inner ones are 3-month windows and the two outer ones are open-ended. Three models are included (Claude / GPT-5 / Gemini); Qwen is excluded because cell 5 of `figure_1_temporal.ipynb` pops its cutoff. Cell values are the mean `agent/nop` within each bin. |
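A minimal sketch of the six-bin assignment, assuming each task is described by its offset in months from the model's cutoff (negative means before). The function name `temporal_bin` and the choice of which side each boundary falls on are assumptions; the notebook may treat edge cases differently.

```python
def temporal_bin(months_after_cutoff):
    """Map a signed month offset from the model cutoff to a Table 4 bin key."""
    m = months_after_cutoff
    if m < -6:
        return "pre6plus"   # open-ended: more than 6 months before
    if m < -3:
        return "pre3to6"
    if m < 0:
        return "pre0to3"
    if m < 3:
        return "post0to3"   # assumption: offset 0 counts as "after"
    if m < 6:
        return "post3to6"
    return "post6plus"      # open-ended: 6 or more months after


print([temporal_bin(m) for m in (-7.5, -4, -1, 0, 4, 12)])
```

Only the four inner bins have a fixed 3-month width, so the two outer bins can absorb very different numbers of tasks.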
| 124 | + |
| 125 | +**Remaining gap:** The upstream task set is sparser per bin than the paper's: 11 of 18 cells are non-null in the API today vs the paper's 18/18 in Table 4. Specifically: Claude lacks `pre3to6` / `post6plus`, GPT-5 lacks `pre0to3` / `post0to3`, Gemini lacks `pre3to6` / `pre0to3`. This is a data-distribution gap, not an exporter bug; it will close as more tasks are ingested. |
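Cell counts like these are easy to recheck mechanically. The sketch below assumes the cached JSON can be reshaped into a `{row: {column: value-or-None}}` mapping; that shape, the function name `coverage`, and the toy grid are assumptions, since the actual schema of `f7_temporal.json` is not described here.

```python
def coverage(grid):
    """Return (populated, total, missing) for a {row: {col: value-or-None}} grid."""
    missing = [
        (row, col)
        for row, cells in grid.items()
        for col, value in cells.items()
        if value is None
    ]
    total = sum(len(cells) for cells in grid.values())
    return total - len(missing), total, missing


# Toy grid with placeholder values, not real results.
toy = {
    "model_a": {"pre0to3": 1.0, "post0to3": None, "post3to6": 1.0},
    "model_b": {"pre0to3": None, "post0to3": 1.0, "post3to6": 1.0},
}
populated, total, missing = coverage(toy)
print(f"{populated} of {total} cells populated; missing: {missing}")
```

The same helper would apply to the F3 and F4 heatmaps, where the status table also reports populated-cell counts.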
| 126 | + |
| 127 | +(If you ever want the *figure* version — `temporal_ood.pdf`, a running monthly time series — that's a separate `findings_temporal_evolution` endpoint not yet built.) |
| 128 | + |
| 129 | +## Re-verifying parity |
| 130 | + |
| 131 | +Single-command refresh from the live API: |
| 132 | + |
| 133 | +```bash |
| 134 | +python tasks/process_remote_data.py |
| 135 | +# or, to refresh only the findings/ JSONs without touching the CSV-derived files: |
| 136 | +python -c "import sys; sys.path.insert(0,'tasks'); \ |
| 137 | + from process_remote_data import fetch_all_findings; \ |
| 138 | + fetch_all_findings('src/data/findings')" |
| 139 | +``` |
| 140 | + |
| 141 | +Then open `http://localhost:5173/#key-findings` (or whichever port `npm run dev` chose) and walk each F1–F7 card top-to-bottom. **Each card's caption deep-links into the arXiv HTML at its specific table/figure anchor** so a click jumps to the exact place in the paper to compare against: |
| 142 | + |
| 143 | +| Finding | arXiv anchor | |
| 144 | +|---|---| |
| 145 | +| F1 Global leaderboard | `#S3.T1` (Table 1, §3.1) | |
| 146 | +| F2 Stratified advantage | `#S3.F3` (Figure 3, §3.2) | |
| 147 | +| F3 Per-tag advantage | `#S3.T2` (Table 2, §3.3) | |
| 148 | +| F4 Repo quintiles | `#S3.T3` (Table 3, §3.4) | |
| 149 | +| F5 Cost-Performance Pareto | `#S3.F4` (Figure 4, §3.5.1) | |
| 150 | +| F6 Multi-workload tradeoff | `#S3.F5` (Figure 5, §3.5.2) | |
| 151 | +| F7 Temporal generalization | `#S3.T4` (Table 4, §3.5.3) | |
| 152 | + |
| 153 | +Anchors are managed in `tasks/process_remote_data.py`'s `ARXIV_ANCHORS` map and flow into the `_arxiv` field of each `src/data/findings/f*.json`. |
| 154 | + |
| 155 | +For each card, check: |
| 156 | + |
| 157 | +1. **Title** matches the paper artifact in the table above. |
| 158 | +2. **Caption link** at the bottom of the card points to the right figure/table name. |
| 159 | +3. **Render shape** (table vs scatter vs slope) matches the paper's representation. |
| 160 | +4. **Numeric sanity**: the headline number the paper highlights still shows up (e.g. F2: OpenHands+Claude module-level advantage ≈ +0.30; F1: every agent's advantage is negative; F4: Q2 best, Q4 worst). |
| 161 | + |
| 162 | +If something looks off, update the **Status at a glance** table above and either fix locally (if it's a website rendering issue) or file a follow-up comment on [`fc-eval#19`](https://github.com/formula-code/fc-eval/issues/19) (if it's a data/upstream issue). |
| 163 | + |
| 164 | +## When to update this file |
| 165 | + |
| 166 | +- **Adding a new finding (F8, F9, …):** add a row to the status table and a section under **Per-finding details** with the same seven-row table (paper section / asset / notebook / API table / component / cached JSON / render style). |
| 167 | +- **Renaming an API column:** update the relevant `fetch_f*` function in `tasks/process_remote_data.py`, re-run the fetcher, and update the "API table" row here if columns change shape (not just rename). |
| 168 | +- **Paper figure swapped in the LaTeX source:** verify the `\includegraphics{...}` target still maps to the notebook listed here. If a new PDF is added, retrace the savefig chain in `_repos/fc-eval/analysis/`. |
| 169 | +- **Closing an upstream divergence:** flip the row's ⚠️ to ✅ and trim the "What still differs" column. |