This document is the long-form companion to Variation B in FLOWCHART.md
(reproduced below). It exists to answer, for someone who has never seen the
codebase: why is this pipeline structured the way it is, and not some other
way? Most of the rationale is not in the code — it's in the FormulaCode
paper (formulacode-paper/analysis/paper/sections/dataset/construction.tex
and appendix/dataset.tex) and in operational decisions made over the
project's lifetime. We pull both together here.
flowchart LR
subgraph Discover["① Discover"]
direction TB
S1["Stage 1<br/>scrape_repos"]
S2["Stage 2<br/>scrape_commits<br/>+ symbolic perf filter"]
S1 --> S2
end
subgraph Judge["② Judge"]
direction TB
S3["Stage 3<br/>classify_prs<br/>(perf? + difficulty)"]
end
subgraph Build["③ Build a runnable artifact"]
direction TB
S4["Stage 4<br/>resolve_packages<br/>(env_payload, can_install)"]
S5["Stage 5<br/>render_problems<br/>(issues + LLM extract)"]
S6["Stage 6<br/>synthesize_images<br/>(cache → similar → default → LLM)"]
S4 --> S5 --> S6
end
subgraph Verify["④ Verify & ship"]
direction TB
S7["Stage 7<br/>harbor_healthcheck<br/>(oracle agent, parallel trials)"]
S8["Stage 8<br/>publish<br/>(HuggingFace + DockerHub)"]
S7 --> S8
end
GH(["GitHub REST + GraphQL"]):::ext
LLM(["DSPy / Claude / Gemini / Qwen"]):::ext
UV(["uv pip compile"]):::ext
DOC(["Docker + DockerHub"]):::ext
DAY(["Daytona"]):::ext
HF(["HuggingFace Hub"]):::ext
Discover -->|"perf-like PRs"| Judge
Judge -->|"is_performance_commit=true"| Build
Build -->|"container_name set"| Verify
S1 -.- GH
S2 -.- GH
S3 -.- LLM
S4 -.- UV
S5 -.- GH
S5 -.- LLM
S6 -.- LLM
S6 -.- DOC
S7 -.- DAY
S8 -.- HF
S8 -.- DOC
G1{{"gate: symbolic perf heuristic"}}:::gate
G2{{"gate: max_speedup ≥ 1.05x on Daytona"}}:::gate
S2 --> G1 --> S3
S7 --> G2 --> S8
classDef ext fill:#fff7e0,stroke:#c90,color:#553
classDef gate fill:#fde2e2,stroke:#c33,color:#700
The eight CLI stages are an operational decomposition (each is a
resumable, separately-scheduled fc-data --stage N job). The four phases
are a conceptual decomposition that maps onto how the paper itself
narrates dataset construction. The paper (§3.1, construction.tex) groups
the work into exactly four stages:
The dataset creation pipeline comprises four broad stages: (1) crawl GitHub repositories with high-quality expert-defined performance workloads, (2) filter out all candidate pull requests using rule-based and LLM-based attribute filters where the primary intent of merging the PR was not performance related, (3) synthesize an environment building script so that the terminal interface tools function, (4) Filter all candidate PRs that do not show statistically significant improvement in performance workloads.
Variation B mirrors that grouping 1:1, then exposes the implementation substages inside each phase. The split between paper stage and code stage matters because:
- Paper stage = "what is this for?" Each phase corresponds to one kind of evidence about a PR (does it exist, is it perf, can it be built, does it actually speed things up).
- Code stage = "what is this run as?" Each stage is a unit of resumable
work with its own concurrency knob, table writes, and failure logging.
Splitting
resolve_packages,render_problems, andsynthesize_imagesapart (instead of bundling them as the paper's "stage 3") is a practical decision: each one can fail or be re-run independently, each touches a different external service, and each has different per-item cost (uv compile is seconds; LLM synthesis is dollars).
The two gates on the diagram (the diamonds) are the only two places where the pipeline genuinely throws PRs away on quality grounds. Everything else either drops on hard errors (couldn't fetch, couldn't build) or defers (rate-limited, retry later). Calling the gates out visually is the single most important thing the diagram does.
What. Find every Python repository that has a working ASV benchmark suite, then scrape every PR they've merged in the target window.
Stages. scrape_repos (1) and scrape_commits (2).
Why a separate phase. Discovery is the only phase that operates on
the public GitHub population rather than on rows in our own DB. It is
therefore the only phase whose throughput is bounded by GitHub's rate
limits rather than by our compute. Operationally we run it monthly to
pick up new PRs (fc-data --start-date 2026-01-01 --end-date 2026-01-31);
it should be cheap and idempotent.
How.
- Stage 1: the production discovery path is a CommonSQL query against
the GitHub Public Dataset on BigQuery
that filters for
asv.conf.jsonpresence, ≥min_stars(paper:\minStars{}; code default 500), non-fork, recent activity, Python 3.8+. The paper notes this query runs in ~48 seconds and costs $9.4. The code also supports a fallback path through the GitHub Search API (slower, rate-limited, same result set). Outputs land inrepositories. - Stage 2: for each row in
repositories, paginate merged PRs via the GitHub GraphQL API (paginate_merged_prs), fetch per-PR file changes and the unified diff via REST, and run a regex/heuristicsymbolic_compliance()that flags PRs whose commit message contains performance-positive keywords (speed,optimize,vectorize, etc.) and avoids negative ones (revert,bump version,fix typo). Also drops PRs that touch onlytests/,docs/, packaging, or that exceed size limits (>500 files, >80k LOC, >64k tokens of patch). Outputs populatepull_requestsand the booleanis_performance_commit_symbolic.
Why the symbolic gate is here, not later. The paper is explicit that this is a cost-control measure: the LLM classifier in phase ② is expensive, so we do the cheap regex pre-filter first and only spend LLM tokens on PRs that have at least a prima facie perf signal. The paper also notes the gate is intentionally lossy on the negative side (filters away non-perf), and intentionally permissive on the positive side (anything ambiguous is kept and forwarded to the LLM). This recall- over-precision bias is repeated in stage 3.
External services. GitHub REST + GraphQL only. No LLM in this phase.
Where the data lands. repositories, pull_requests (with patch,
file_changes, is_performance_commit_symbolic).
Failure mode. GitHub rate limit. The pipeline implements DATASMITH_RL_*
backoff knobs (see CLAUDE.md). We tolerate partial scrapes — the next
monthly run picks up what was missed.
What. For each symbolically-flagged PR, decide whether it is bona fide a performance optimization, and if so, label it with an optimization category and an estimated difficulty.
Stages. classify_prs (3) only.
Why a separate phase. This is the first phase that costs LLM tokens per item, and it is the first phase whose decisions are semantic rather than structural. Splitting it from discovery means we can re-run classification (e.g. when we swap LLMs or refine prompts) without re-scraping GitHub.
How.
- Reads PRs where
is_performance_commit_symbolic=Trueand eitheris_performance_commit IS NULLor--forcewas passed. - Runs DSPy's
ProblemClassifier.classify(description, patch, file_change_summary)to get a yes/no on perf. - If yes, runs a second DSPy pass (
ClassifyJudge.classify) for category (13-class taxonomy from the paper: Cache and Reuse, Use Better Algorithm, Micro Optimizations, Use Lower Level System, …) and difficulty (Easy / Medium / Hard). - Updates
pull_requestswithis_performance_commit,classification,difficulty. Concurrent fan-out at N=5.
Why recall over precision. Paper §A.2.3 (appendix/dataset.tex,
\subsubsection{Performance Intent Filtering}):
We explicitly prioritize recall over precision. The prompt is configured to lean towards a "YES" classification in ambiguous cases. This design choice is deliberate, as false positives will be symbolically verified in the subsequent benchmark execution stage, and discarded if they yield no measurable speedup.
This is the load-bearing design decision for the entire pipeline. The final speedup gate in phase ④ is the arbiter of truth, so every upstream filter is allowed to be loose. It is much cheaper to over-include here and have a PR discarded after Harbor measures no speedup than it is to under-include and miss a real optimization that will never surface again. This single principle explains why the pipeline funnels so aggressively: the upstream sieves are deliberately coarse because the downstream sieve is precise.
Why local LLM by default. Paper §A.2: we use a locally-served
openai/gpt-oss-120b (vLLM, exposed via the LiteLLM proxy at
https://model.formulacode.org) for classification and extraction tasks
because they are high-volume, low-complexity. Frontier models are reserved
for build script synthesis (phase ③).
External services. DSPy → local vLLM (gpt-oss-120b).
Where the data lands. Three new columns on pull_requests. No new
table — the classification is small enough to inline.
This phase transforms a labelled PR into a thing we can actually run. It is the longest phase, the most expensive, and the only one whose internal substages have a strict ordering: env → context → container.
Stages. resolve_packages (4), render_problems (5),
synthesize_images (6).
Why a separate phase. Everything before this is metadata. Everything after this requires an executable artifact (a Docker image with the package built from source). This phase exists to bridge that gap, and the bridge is hard — paper §3.1.3 enumerates the reasons:
automatically building such development copies proves to be a non-trivial task for three reasons. (1) Many scientific packages require complex tool interactions, which necessitate a bespoke build process. (2) The build process evolves significantly as a project matures. (3) The documentation for building packages tends to be extremely fragmented…
The three substages are an attempt to break the bridge into pieces that fail loudly and independently.
What. Check out each PR's merge commit, parse pyproject.toml /
setup.py / setup.cfg, run uv pip compile to resolve a fully-pinned
env_payload, record success/failure in packages.
Why first. The paper calls this the "Build Environment" filter and
puts it in stage 2 (rule-based filtering); in code it's a separate stage
because it's compute-intensive (paper: ~13 hours parallelized) and we
want to checkpoint its output. If can_install=False, all downstream
stages skip the PR — there is no point synthesizing a build script for a
PR whose deps can't even resolve.
Why uv. Paper §A.2.2: uv is fast enough that pinning every commit's
dep set in 13 hours is feasible. With pip-tools it would take days.
Concurrency. N=16 (the highest in the pipeline) because each item is I/O-heavy and externally rate-limited only by PyPI.
Output. packages.{python_version, env_payload, build_commands, install_commands, can_install}. The env_payload is later baked into
the Dockerfile as a pinned requirements.txt-equivalent — this is
critical for reproducibility (paper insists on building from source for
performance accuracy, but the build environment must be deterministic).
What. Build a self-contained problem statement for the PR by aggregating the PR body + linked issues, then asking an LLM to extract the initial observation, triage attempts, solution overview, and solution observations from the PR body. Render the result through a Jinja2 template. This is the text a downstream agent (Harbor's oracle, or a benchmark contestant) will actually read.
Why this is hard enough to be its own stage. Paper §A.2.4:
while an issue provides the initial observed performance regression or bottleneck, [issues] frequently bundle multiple optimization directions that are implemented across several pull requests. As a result, a problem statement derived only from the issue can under-specify the problem statement's starting state, leading to an ambiguous task (an agent may optimize a different aspect than the original change).
So we can't just dump the issue body — we need the LLM to narrow the problem to the specific direction the PR took. The extracted sentences are symbolically verified for fidelity (high LCS ratio with the original PR body) so the agent can't drift into hallucinated requirements.
Hard prerequisite gate. Stage 5 requires packages.can_install=True
and the PR must have at least one linked issue plus a non-empty body
(paper §A.2.4 "Context Filtering"). PRs without both are dropped — they
literally lack the raw material for a problem statement.
Output. candidate_prs (rich extracted components per PR) plus an
update to pull_requests.rendered_problem and problem_description.
What. Generate the shell scripts (docker_build_pkg.sh,
docker_build_run.sh) that build an editable install of the package
inside a Docker container at the PR's merge commit, such that ASV,
pytest, and our snapshot-testing tool all run.
Why this is the most complex stage. It is essentially the entire contribution of the paper's stage 3 ("Synthesizing Reproducible Environments"). The state machine in code mirrors the paper's "chronological retrieval + reflexive agent" design exactly:
| State | Maps to paper section |
|---|---|
CHECK_CACHE |
reuse-of-prior-builds (cost amortization) |
FIND_SIMILAR / TRY_SIMILAR |
"Chronological Retrieval" (10 nearest neighbours by commit date from the same repo) |
TRY_DEFAULT |
template-based fallback |
LLM_GENERATE |
"Agentic Synthesis" (10 reflexion turns max) |
FAIL |
drop and log |
Why the cache → similar → default → LLM cascade. Paper §A.2.5, "Chronological Retrieval":
We leverage the insight that build requirements rarely change drastically between adjacent commits. For a given task, we sample 10 successful build scripts from the same repository, sourced from a database of successfully built tasks, sorted by commit date. We first attempt to build the container using the script from the nearest chronological neighbor. If the build or verification fails, we move to the next neighbor until we either find a successful build or run out of neighbors.
The motivation is dollar cost. Paper reports the entire build-script
synthesis effort cost $446 at the time of writing, and explicitly says
this number would explode without the cascade because each LLM-from-scratch
synthesis is the most expensive single operation in the pipeline. This
is also why successful synthesis on one PR triggers _enqueue_neighbors()
in code — adding the ±60-day siblings to the queue means the cache is
maximally warm by the time those siblings are processed.
LLM cascade. The local gpt-oss-120b is tried first; failure falls
back to claude-3-5-sonnet-20241022. Codex / Gemini / Qwen are
additional swap-in agents in the code. Paper notes the largest models
rarely use the available tools (read repo files, parse setup.py) because
the failure logs are usually self-explanatory; smallest models (Llama
3.3 8B) use many tool turns. The 10-turn budget is hard.
Concurrency. N=3 worker pool (lower than other stages) because each worker holds a Docker daemon slot and a non-trivial amount of disk for intermediate layers. Workers share a rate-limit pause state so a 429 from the LLM provider stops everyone.
Output. candidate_containers (the synthesized scripts), an update
to pull_requests.container_name (the DockerHub tag), and an actual
push to DockerHub of base / repo / PR images.
Why three image layers. Cost again — the base image (Python + system deps) is shared across the entire repo; the repo image (source tree at HEAD-ish) is shared across PRs in the same repo; only the PR image is unique. Building bottom-up with cache-on-hit means a new PR in a known repo costs ~minutes, not ~hours.
What. Run the synthesized container through a real benchmark oracle, measure the actual speedup the PR produces, and only publish PRs that pass the speedup threshold.
Stages. harbor_healthcheck (7), publish (8).
Why a separate phase. This is the only phase that produces ground truth. Everything upstream is suggestive — only Harbor on Daytona actually times the code. Splitting verify from ship means we can re-measure (e.g. if hardware changes invalidate prior measurements, or if we want fresh trials) without re-publishing.
What. For each PR with a built container, hand off to Harbor (the
benchmark orchestrator) which materializes a task directory, launches
the oracle agent (a no-op agent that simply applies the human
expert's patch), runs ASV, and produces a reward.json with per-
benchmark speedups. We record one harbor_runs row per trial.
Why an "oracle" agent. This is the diagnostic question "does the human's patch, applied in our container, actually speed things up?" If not, our container is broken (build artifact issue) or the PR was mislabelled (false positive from phase ②). Either way, we drop it. It is critical that this measurement is done with the same agent infrastructure that contestants will use, otherwise we cannot claim the dataset is fair.
Why two environments (docker vs daytona). Paper §A.2.6:
measurement noise is the enemy. ASV's two-stage decision rule (Mann–
Whitney U with p < 0.002, or non-overlapping 99% CIs as fallback)
is conservative by design, but conservatism is meaningless if the
substrate is itself noisy. Daytona gives us a clean EC2 instance per
trial; local docker is fast but shares a host with other workloads.
Local Docker runs are useful for iteration; only Daytona runs gate stage 8.
(from CLAUDE.md). The code records harbor_runs.environment so
we can tell the two apart — and stage 8 explicitly filters to
environment='daytona'.
Concurrency. Harbor's LocalOrchestrator handles trial parallelism
internally; we batch-submit PRs and let Harbor parallelize.
--n-concurrent 16 for Daytona is typical.
What. Pick PRs whose best Daytona harbor_runs row reports
max_speedup ≥ 1.05, serialize them to Parquet via pyarrow, push to
HuggingFace, re-push their Docker images to DockerHub, mark
pull_requests.published_at.
Why the 1.05x threshold. The paper's stage 4 statistical test (Mann–Whitney U at p<0.002) is the correctness test — does the speedup exist statistically? The 1.05x threshold is an interestingness filter — the paper's premise is that LLMs should be evaluated on meaningful optimizations, not noise-floor improvements. A PR that passes Mann–Whitney with a 1.001x speedup is real but uninteresting; including it would inflate dataset size while diluting the benchmark.
Why publish is fail-closed. If Harbor never ran successfully on Daytona, the PR is not published — even if every upstream stage labelled it as a great optimization. This is the design choice that makes the FormulaCode dataset trustworthy: you cannot get into the release without a measured speedup. The paper calls this out as the project's central contribution over SWE-bench-style binary pass/fail benchmarks.
These are decisions that don't belong to any one stage but shape the whole pipeline.
There is no inter-stage messaging system. Stages communicate by writing columns/rows that other stages read. This is by design:
- Resumability. A stage that crashes leaves the DB in a valid state;
the next run picks up where the previous one left off via
--resumeor stage-specific filters. - Observability. The dashboard (
analysis/dashboard) reads the same tables the pipeline writes. Anyone can see the funnel narrowing in real time. - Independence. Stages can be re-run, re-ordered (within phases), or swapped out entirely. The contract is a table schema, not a Python function signature.
- Public read. The
anonrole can readrepositories,pull_requests,candidate_containers,harbor_runs,benchmark_informationoverhttps://api.formulacode.org. The pipeline is in effect a public dataset that happens to update itself.
The pull_requests table is the "spine" — it accretes a new column at
nearly every stage (is_performance_commit_symbolic,
is_performance_commit, classification, difficulty,
rendered_problem, container_name, published_at). Side tables
(packages, candidate_prs, candidate_containers, harbor_runs)
exist where the data is too large or too one-to-many to inline.
Paper §A.2: classification, extraction, and most synthesis runs go to
gpt-oss-120b served by vLLM, exposed via the LiteLLM proxy
(https://model.formulacode.org). Only the synthesis cascade falls
back to Claude 3.5 Sonnet (and only when local models fail repeatedly).
This is partly cost (the entire dataset cost $446 of Anthropic credits)
and partly throughput (local models have no rate limit, just a queue
depth).
This is restated here because it is the single most consequential architectural choice:
- Symbolic filter (stage 2): drop only on negative signal.
- LLM classifier (stage 3): bias toward YES on ambiguity.
- Build / synth (stages 4–6): try every fallback before giving up.
- Speedup gate (stage 7→8): the only hard precision filter in the pipeline.
Inverting this — putting precision early — would either lose real optimizations (false negatives are unrecoverable) or duplicate work (every precise filter has to redo the work the speedup gate already does). The pipeline is shaped like a funnel because the only thing we can actually trust is the ground-truth measurement at the bottom.
Per CLAUDE.md, every operationally tunable constant is read from the
environment with a DATASMITH_ prefix, defaulted to a literal in
Python. This matters for the pipeline because nearly every stage has
a "how aggressive should this be?" knob (rate-limit pause, neighbor
window, retry count, concurrency) and we want operators to tune
without code changes.
Paper §A.1: building from scratch takes ~32 hours and ~2 TB of disk.
We don't do that. The pipeline is run with a monthly date window
(--start-date 2026-01-01 --end-date 2026-01-31) so each invocation
processes only that month's new merged PRs. Caches (build scripts,
base images, repo images) carry over from previous months, so the
amortized per-month cost is small.
Important caveats so a reader doesn't take the diagram too literally:
- Phase boundaries are not transactional. A failure mid-phase doesn't roll back the phase. Each stage commits per-item.
- The "external services" arrows are shared. GitHub touches stages 1, 2, and 5; the LLM proxy touches 3, 5, and 6. The diagram dots these in to show which stage uses which service, but in production a single LiteLLM instance fields all three.
- Stage 6 is the only stage that pushes Docker images during the pipeline. Stage 8 re-pushes to ensure availability, but the initial push happens in stage 6. The diagram simplifies this.
- Harbor is itself a multi-component system. "Stage 7 → Daytona" hides that Harbor materializes task directories, runs the oracle agent, parses ASV output, and aggregates trials. The diagram treats it as one node because the externally visible behavior — submit containers, get speedups — is one operation.
- The paper's "stage 4" (statistical validation) is inside our stage 7, not our stage 8. Mann–Whitney U is what Harbor's reward computation is doing under the hood; the 1.05x filter in publish is on top of that statistical test, not in place of it.
If you need any of those details, fall back to Variation C
(database-as-bus) in FLOWCHART.md or read the per-stage modules
listed in CLAUDE.md.