An autonomous coding agent that tells you the truth.
Forge is a multi-agent engine that plans, writes, reviews, and merges code on your behalf. Unlike most agent frameworks, it refuses to mark work "done" unless the tests actually pass, the diff actually landed, and the worker's self-report matches reality. When an agent lies on the dashboard, CI turns red.
Built on LangGraph, with workers running OpenCode inside isolated git worktrees. Project-agnostic via studio/config.yaml. Current release: v5.1.0 — Quality Intelligence, Multi-Language Toolchains.
Built by Allen Sarkisyan — GitHub · LinkedIn.
The live dashboard during a real session. "OUTCOME LIE PREVENTED" banner shows honesty invariants catching a worker claim that didn't match reality.
| | |
|---|---|
| Aider polyglot · Python (Claude Sonnet 4.5 via Bedrock, Mode A, 3-run variance; 2026-04-18) | 34 / 34 · 100% · median $6.53/run · 86s/ex (details) |
| Aider polyglot · Rust (Claude Sonnet 4.5 via Bedrock, Mode A, single run, n=1; 2026-04-18) | 30 / 30 · 100% · cost pre-telemetry-fix (not reliably captured) (details) |
| Aider polyglot · Go (Claude Sonnet 4.5 via Bedrock, Mode A, single run, n=1; 2026-04-19) | 39 / 39 · 100% · $7.70 · 57 min · 88s/ex (details) |
| Test suite on main (collected 2026-05-06) | 8,400+ tests (see Tests) |
| Honesty invariants active | `epic_outcome_lie_prevented` · `completion_mismatch` · `calibration_skipped_zero_actual` · `subtask_result_contract_violation` |
| Multi-language toolchains | Go · Rust · Java · C++ · Python · TypeScript (auto-detected from project files) |
| Quality intelligence | 18 deterministic modules: impl guard, coherence auditor, vocabulary checker, acceptance DSL, failure taxonomy, advisory tuner, and more |
| License | Apache 2.0 |
Polyglot coverage (as of 2026-04-19): 103 / 225 exercises verified across 3 of 6 languages, all 100%. JavaScript, C++, Java runs are the remaining unknowns; see docs/POLYGLOT.md for methodology and caveats.
Polyglot methodology, caveats, and full comparison to Aider's own published numbers in docs/POLYGLOT.md.
- What Forge does · What it isn't
- Quickstart
- How it works: Architecture · Anatomy of one epic
- Why it's different: How it compares · Design principles · Failure modes
- Reference: Worker runtimes · Configuration · Agent roster · CLI · Observability & testing · Project layout
- Documentation map · Releases · Design influences · License
You give Forge a backlog (a set of epics or tasks). For each one, it:
- Plans — picks the next epic, drafts an architectural spec, and decomposes the work into 2–4 small, reviewable subtasks.
- Builds — fans out parallel workers into isolated git worktrees, each running a test-driven OpenCode session. Language-aware toolchains (auto-detected from go.mod, Cargo.toml, pom.xml, etc.) handle build, test, vet, and format.
- Reviews — the Real Implementation Guard rejects stubs/demos before the diff reaches the reviewer. Then: reviewer, security scan, red team adversarial edge cases, and integration check. The VerificationPipeline runs all lint/vet steps with preflight tool availability checks.
- Merges — applies approved subtasks in layer order, subtracting any pre-existing failures via a baseline so only new regressions trigger a rollback.
- Learns — writes a retrospective, updates calibration data, and feeds lessons back into the planner.
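The baseline-aware merge step above boils down to a set difference. A minimal sketch of the idea (the function name and shapes here are illustrative, not Forge's actual API): only failures that are new relative to the pre-merge snapshot count as regressions.

```python
def new_regressions(baseline_failures: set[str], post_merge_failures: set[str]) -> set[str]:
    """Failures present after the merge but absent from the pre-merge baseline.

    Illustrative sketch: pre-existing failures are subtracted, so a flaky or
    already-broken test cannot veto an unrelated merge.
    """
    return post_merge_failures - baseline_failures

baseline = {"test_legacy_flaky"}            # failing before the merge
after = {"test_legacy_flaky", "test_new"}   # failing after the merge
assert new_regressions(baseline, after) == {"test_new"}  # only the new failure triggers rollback
```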
Every epic terminates on one of six validated outcome kinds (done, shelved, escalated, shelved_by_redteam, abandoned_pending, error) — never an ambiguous "finished". The full event stream is the audit trail; there is no other source of truth.
An LLM that says "I fixed it" is not evidence. Forge treats worker self-reports as claims to be verified:
- `.forge/completion.json` (the worker's claim) is cross-checked against actual test output. Mismatches emit `completion_mismatch` and demote the worker via the Trust Ratchet.
- A run that reports `done` but fanned out zero workers hits an invariant in `emit_epic_outcome_event` and either emits `epic_outcome_lie_prevented` or raises `AssertionError` (under `FORGE_STRICT_OUTCOMES=1`, which CI sets).
- Retrospectives citing models that aren't in `metrics.db`, or dated off-year, are quarantined to `.forge/evolution/<ts>.invalid.md` instead of the canonical feed.
- Not an IDE. For interactive coding, use Cursor or Claude Code.
- Not a one-shot bug fixer. Even a 10-line fix traverses the full plan → build → review → merge loop. For targeted work, `forge task <repo> -d "..."` exists; for issue-only or PR-only flows the scaffolding is present but not in the hot path.
- Not deterministic. Costs, timings, and subtask counts vary run-to-run. The gates are deterministic; the work isn't.
- Not single-model. Every role (planner, worker, reviewer, adversarial) can be a different model. Models are pinned in `studio/models.yaml` — one file, no fallbacks, no auto-swapping. See docs/MODELS.md.
- Not a replacement for human review on novel architecture. The `AUTONOMOUS` trust tier auto-merges small, well-understood changes; anything surprise-magnitude or cross-package still routes to the approval queue.
Prerequisites. Python 3.12+, Docker Desktop (required since R9.E), OpenCode on PATH (only needed for ad-hoc tooling — Forge installs it inside the worker image), and one provider credential (Bedrock API key, AWS credentials, or `OPENROUTER_API_KEY`). Run `make worker-image` once to build `forge-worker:latest`. Go 1.24+ is needed only for the bundled game example; nono only when launching via `./run.sh`.
Install.

```bash
git clone https://github.com/sjqtentacles/forge && cd forge
python3 -m venv .venv && source .venv/bin/activate
pip install -e 'platform[dev]'
cp .env.example .env   # then uncomment the credentials you want to use
```

Run (canonical).

```bash
PYTHONPATH=platform .venv/bin/python3 -m forge.cli run --dashboard
PYTHONPATH=platform .venv/bin/python3 -m forge.cli run --once --max-tasks 3
```

Launcher scripts (not interchangeable):

- `./run.sh` — Bedrock via AWS profile (uses `$AWS_PROFILE`, defaults to `default`; set `AWS_PROFILE` in your shell or `.env` to use a different profile), wraps the CLI in nono.
- `./run-forever.sh` — OpenRouter (`OPENROUTER_API_KEY`), continuous daemon, no sandbox.
Dashboard at http://127.0.0.1:8420. Models are pinned in studio/models.yaml — one file, no fallbacks; see docs/MODELS.md for the policy. The rest of the configuration (project shape, features, gates, retention) lives in studio/config.yaml; full schema in docs/CONFIG-REFERENCE.md. See docs/QUICKSTART.md for a longer walkthrough, troubleshooting table, and test commands.
Two LangGraph state machines compose the engine.
The outer graph schedules epics and retrospects:
```mermaid
graph LR
    startNode[Start] --> BP[Backlog Planner]
    BP -->|"Send per epic"| EP[Epic Pipeline]
    BP -->|"queue empty"| MM[Mega Merge]
    EP --> MM
    MM --> Retro[Retrospective]
    Retro --> endNode[End]
```
The inner graph runs a single epic — Strategist picks it, Architect designs, Decomposer splits into tiny subtasks, workers fan out in isolated worktrees, gated by Reviewer / Security / Red Team / Merge, with fresh-context recovery and a post-merge playtest:
```mermaid
graph LR
    S[Strategist] --> CS[Context Scout]
    CS -->|"first pass"| AR[Architecture Review]
    CS -->|"retry pass"| D[Decomposer]
    CS -->|"fan_out phase"| FO[Fan Out]
    AR --> D
    D --> V[Validator]
    V --> CS
    FO --> W[Workers]
    W --> F[Fixer]
    F --> R[Reviewer]
    R --> SR[Security Review]
    SR --> IC[Integration Check]
    IC --> RT[Red Team]
    RT --> M[Merge]
    M -->|"success"| VA[Verify Acceptance]
    VA --> P[Playtester]
    P --> PL[Persist Learnings]
    PL --> endNode[End]
    RT -->|"edge case found"| F
    M -->|"restyle"| RA[Recovery Advisor]
    RA --> CS
    M -->|"fallback"| Ref[Refactor]
    Ref --> P
```
Forge-on-Forge mode (running Forge against its own monorepo) is configured via platform/forge/engine/forge_self.py — see docs/PROFILES.md for the documented profile and studio/profiles/forge_on_forge.yaml for the wired entry point. The R5 issue-resolution / PR-mode / SWE-bench-stub scaffolding modules were deleted in R6 (see docs/V4.6.md) — the canonical SWE-bench harness lives at forge.swebench.task.
Concrete example: what actually happens when Forge runs HTTP-202: Implement CSV splitter.
| Stage | What happens | Artifact | Key events |
|---|---|---|---|
| Strategist picks | Selects epic from taskboard + design doc + learnings | Spec draft in state | `node_enter {node: strategist}` |
| Architect designs | Interfaces, file layout, acceptance criteria | `.forge/specs/HTTP-202.md` | `node_exit {node: strategist}` |
| Decomposer splits | 2–4 subtasks within quality gates | `state.subtasks_live` + DAG | `decomposer_vote`, `decomposer_lesson_injected` |
| Validator checks | Type conflicts, bad imports (retry-bounded) | — | `node_enter {node: validator}` |
| Context Scout | Impact maps per subtask via CodeGraph | Affected-test lists | — |
| Fan out | Workers dispatched in parallel (layer order) | `.worktrees/HTTP-202a/` etc. | `fan_out {workers: N}` |
| Workers run OpenCode | TDD loop: write test, fail, implement, pass | Worktree diff + `completion.json` | `completion_mismatch?` |
| Fixer (if needed) | Lint/build errors fed back | Updated diff | `node_enter {node: fixer}` |
| Reviewer + preflight | Oversize diffs short-circuit to redecompose | Review verdict | `diff_preflight_reject?` |
| Security + Integration | Vulnerability scan + cross-subtask check | — | — |
| Red Team | Adversarial edge-case tests against approved diffs | New failing test? | `red_team_merge_with_debt?` |
| Merge (baseline-aware) | Layer-ordered; subtracts pre-existing failures | Updated working tree | `merge_baseline`, `merge_rollback?` |
| Playtest + Verify | Headless smoke + acceptance check | docs/PLAYTEST-LOG.md (generated at runtime; not committed) | — |
| Persist Learnings | Outcome inference via `infer_outcome_kind` | `metrics.db`, `calibration.jsonl` | `epic_outcome {kind: done}` (validated) |
Conditional branches not shown: epic_coherence (post-Strategist sanity check), plan_critic (reviews the decomposer's plan), sketch_gate (pre-worker draft check), adversarial_review (reviewer triad + arbiter tiebreaker), red_team_repair, recovery_advisor.
The agent-framework space is crowded and moving weekly; rather than claim features about tools I can't fully verify, here is what Forge specifically does, with code pointers, plus the one benchmark where a direct comparison is fair:
- Honesty invariants. "Refuse to mark work done unless tests actually pass" is enforced by a `completion.json` covenant cross-checked against real test output. Mismatches emit `completion_mismatch` and demote the worker via the Trust Ratchet. Outcome lies (e.g. reporting `done` with zero workers fanned out) hit an invariant in `emit_epic_outcome_event` (platform/forge/logging.py) and either emit `epic_outcome_lie_prevented` or raise `AssertionError` under `FORGE_STRICT_OUTCOMES=1`.
- Per-subtask worktree isolation. Every subtask runs in its own `.worktrees/<epic>/` directory; every test gets its own `.forge/tmp` dir via autouse `_isolate_forge_state`; every worker gets its own MCP server. Cross-contamination is a bug, not a trade-off.
- Baseline-aware merge gate. `evaluate_merge_gate` snapshots pre-existing failures at `pre_merge_head` and subtracts them; only new regressions roll back a branch. Placeholder or null baselines raise `BaselineNotReady` rather than silently passing.
- Validated outcome enum. Epics terminate on one of six kinds (`done`, `shelved`, `escalated`, `shelved_by_redteam`, `abandoned_pending`, `error`), with a paired `epic_outcome_cause`. Off-enum values down-coerce to `error` and emit `epic_outcome_kind_invalid`.
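The down-coercion in the last bullet is a one-branch guard. A hedged sketch of the idea (the function and event shape are illustrative, not the in-tree code): anything off-enum becomes `error` and leaves an audit event behind.

```python
# The six validated outcome kinds from the README; everything else is a lie or a bug.
VALID_OUTCOMES = {"done", "shelved", "escalated",
                  "shelved_by_redteam", "abandoned_pending", "error"}

def coerce_outcome(kind: str, emit) -> str:
    """Illustrative sketch: validate an outcome kind against the enum.

    Off-enum values down-coerce to "error" and emit a diagnostic event via
    the supplied `emit` callable rather than passing through silently.
    """
    if kind in VALID_OUTCOMES:
        return kind
    emit({"event": "epic_outcome_kind_invalid", "raw_kind": kind})
    return "error"

events: list[dict] = []
assert coerce_outcome("done", events.append) == "done"
assert coerce_outcome("finished", events.append) == "error"   # off-enum down-coerces
assert events == [{"event": "epic_outcome_kind_invalid", "raw_kind": "finished"}]
```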
On the one benchmark where numbers are directly comparable — the Aider polyglot benchmark on Claude Sonnet 4.5:
| Harness | Python (34 ex.) | Rust (30 ex.) | Go (39 ex.) | All-6-langs (225 ex.) |
|---|---|---|---|---|
| Aider's own (published) | not published per-language | not published per-language | not published per-language | 77.9% |
| Forge worker (Mode A) | 100% (34/34) · 3-run variance · median $6.53 | 100% (30/30) · single run, n=1 | 100% (39/39) · single run, n=1 · $7.70 | 103/225 measured · 122 remaining (JS · C++ · Java) |
Mode A bypasses Forge's planner/reviewer stack to benchmark just the worker loop; full-stack numbers will differ. See docs/POLYGLOT.md for the methodology, caveats about training-data contamination, and why Mode A was the honest starting point.
- Git is the memory, not the prompt. Retries rebuild the prompt from scratch using `progress.json` + the current diff. No stacked error histories, no prompt degradation. This is the Ralph Wiggum pattern.
- Baselines make failures real. `evaluate_merge_gate` snapshots pre-existing failures at `pre_merge_head` and subtracts them; only new regressions roll back a branch.
- Invariants assert; diagnostics log. Honesty violations raise `AssertionError` under `FORGE_STRICT_OUTCOMES=1`; in production they emit structured events (`epic_outcome_lie_prevented`, `completion_mismatch`, `calibration_skipped_zero_actual`, `subtask_result_contract_violation`).
- Workers can't lie twice. `.forge/completion.json` is cross-checked against actual test output. Repeat liars get demoted by the Trust Ratchet.
- Every outcome is on-enum. Epics terminate on one of six validated kinds — `done`, `shelved`, `escalated`, `shelved_by_redteam`, `abandoned_pending`, `error` — with a paired `epic_outcome_cause`. Off-enum values down-coerce to `error` and emit `epic_outcome_kind_invalid`.
- Isolated by construction. Each subtask gets its own worktree, each test gets its own `.forge/tmp` dir (autouse `_isolate_forge_state`), each worker gets its own MCP server. Cross-contamination is a bug, not a trade-off.
- Small epics beat clever epics. Hard gate at 200 lines / 4 files per subtask, tightening 30% per retry with a 40-line floor. If a diff blows the budget, `diff_preflight` rejects it before the reviewer ever sees it.
- The retrospective is held to the same standard. Reports citing models not in `metrics.db` or dated off-year land in `.forge/evolution/<ts>.invalid.md` sidecars — not the canonical feed.
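The "tightens 30% per retry with a 40-line floor" budget works out to a short geometric decay. A sketch under one assumed reading (each retry keeps 70% of the previous budget, clamped at the floor; the function name is illustrative):

```python
def line_budget(retry: int, base: int = 200, floor: int = 40) -> int:
    """Illustrative sketch of the shrinking diff budget.

    Assumed reading of the README's rule: each retry keeps 70% of the
    previous budget (integer arithmetic), never dropping below the floor.
    """
    budget = base
    for _ in range(retry):
        budget = max(floor, budget * 7 // 10)   # tighten 30%, clamp at floor
    return budget

# Budget per retry: 200 → 140 → 98 → 68 → 47 → 40 (floor reached)
assert [line_budget(r) for r in range(6)] == [200, 140, 98, 68, 47, 40]
```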
| Failure | Detection | Response |
|---|---|---|
| Worker claims pass, tests fail | Completion contract cross-check | `completion_mismatch` + trust demotion |
| "done" with zero workers fanned out | `emit_epic_outcome_event` invariant | `epic_outcome_lie_prevented` (or `AssertionError` under `FORGE_STRICT_OUTCOMES=1`) |
| Diff exceeds the line budget | `diff_preflight.evaluate` pre-review | Reject, tighten decomposer budget 30%, retry |
| Retry loop stuck on oversize | `merge_node` ratio check | Escalate with cause `redecompose.oversize_stuck` |
| `_lines_changed` can't be measured | `worker_diff.compute_lines_changed` returns `None` | Skip calibration row, emit `calibration_skipped_measurement_failure` |
| Calibration row has estimate > 0 AND actual == 0 on a pass | Defensive check in `record_if_eligible` | Drop row, emit `calibration_skipped_zero_actual` |
| Baseline is placeholder or null-aggregate | `detect_regression` guard | Raise `BaselineNotReady`; callers refresh baseline |
| Retrospective cites a model not in `metrics.db` | `retro_validator` check | Quarantine to `.forge/evolution/<ts>.invalid.md` |
| Primary model failure rate spikes | Operator inspects dashboard cost panel + `metrics.db` | Edit `studio/models.yaml`, restart. (Auto-swap was deleted in the 2026-05-02 cleanup; see docs/MODELS.md.) |
| Tests poison the real `.forge/*.db` | `CostTracker` tripwire under `PYTEST_CURRENT_TEST` | Autouse `_isolate_forge_state` redirects every constructor |
| Epic times out | `epic_pipeline` timeout branch | Outcome `error` with cause `epic_pipeline.timeout` (not off-enum `timeout`) |
Forge ships 18 deterministic (no-LLM) quality modules that compose into the pipeline at different stages. All follow TDD — the module tests define the behavioral contract.
Planning quality (pre-worker):
| Module | What it does |
|---|---|
| Plan Critic + Cross-Package Coherence | 7-step reasoning guide catches hidden deps, sizing optimism, and shared-concept drift between subtasks targeting different packages |
| Integration Wiring Subtask | Sanitizer auto-injects a final-layer subtask to wire all packages into the entry point; decomposer prompt rule forces explicit emission |
| Import Graph Injection | Lightweight package-level dependency graph injected into decomposer prompts (budget-capped) |
| Contract Registry | Auto-extracts exported functions/types/constants per package for decomposer context |
| Integration Test Requirement | Auto-appends cross-package test requirement to subtask specs with dependencies |
Worker quality (during build):
| Module | What it does |
|---|---|
| Real Implementation Guard | Static checks for stub patterns: Go (fmt.Println-only main, panic("not implemented")), Python (pass-only, raise NotImplementedError), TypeScript (console.log-only, throw not implemented) |
| Verification Pipeline + Preflight | Configurable multi-step lint/type-check/format pipeline; preflight_check warns about missing binaries on PATH at startup |
| Incremental Testing | Selects only tests affected by changed files via CodeGraph |
| Enhanced Worker Context | Injects neighbor package exports (budget-capped) so workers know available symbols |
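The Real Implementation Guard's Python checks can be sketched with the standard-library `ast` module. This is an illustrative reimplementation of the idea, not the in-tree module: flag functions whose entire body is `pass` or `raise NotImplementedError`.

```python
import ast

def is_stub(fn: ast.FunctionDef) -> bool:
    """True if the function body is only `pass` or `raise NotImplementedError`."""
    if len(fn.body) != 1:
        return False
    stmt = fn.body[0]
    if isinstance(stmt, ast.Pass):
        return True
    if isinstance(stmt, ast.Raise) and stmt.exc is not None:
        # Handle both `raise NotImplementedError` and `raise NotImplementedError(...)`.
        exc = stmt.exc.func if isinstance(stmt.exc, ast.Call) else stmt.exc
        return isinstance(exc, ast.Name) and exc.id == "NotImplementedError"
    return False

def has_stub(source: str) -> bool:
    """Scan a module's source for any stub-only function definition."""
    return any(is_stub(node) for node in ast.walk(ast.parse(source))
               if isinstance(node, ast.FunctionDef))

assert has_stub("def f():\n    pass")
assert has_stub("def g():\n    raise NotImplementedError('todo')")
assert not has_stub("def h():\n    return 42")
```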
Post-build quality (pre-merge):
| Module | What it does |
|---|---|
| Acceptance DSL + Runner | YAML-based scenarios with deterministic assertions (exit_code, stdout_contains, file_exists, runtime_under_s, etc.) |
| Coherence Auditor | Extracts Go/Python constants and switch-case values; flags when two packages diverge on a shared concept |
| Vocabulary Consistency Checker | Groups constants by prefix; cross-references switch variables by stem overlap |
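The deterministic assertions the Acceptance DSL names (`exit_code`, `stdout_contains`, `file_exists`, `runtime_under_s`) are straightforward to evaluate. A minimal sketch in that spirit — the scenario shape and function are illustrative, not the real ACCEPTANCE.yaml schema or runner:

```python
import pathlib
import subprocess
import sys
import time

def run_scenario(cmd: list[str], checks: dict) -> list[str]:
    """Run a command and evaluate deterministic acceptance checks against it.

    Illustrative sketch: returns a list of failure descriptions (empty = pass).
    """
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    failures = []
    if "exit_code" in checks and proc.returncode != checks["exit_code"]:
        failures.append(f"exit_code: got {proc.returncode}")
    if "stdout_contains" in checks and checks["stdout_contains"] not in proc.stdout:
        failures.append("stdout_contains: missing")
    if "file_exists" in checks and not pathlib.Path(checks["file_exists"]).exists():
        failures.append("file_exists: missing")
    if "runtime_under_s" in checks and elapsed >= checks["runtime_under_s"]:
        failures.append("runtime_under_s: exceeded")
    return failures

# A scenario asserting the program prints "ok" and exits 0:
assert run_scenario([sys.executable, "-c", "print('ok')"],
                    {"exit_code": 0, "stdout_contains": "ok"}) == []
```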
Feedback and adaptation:
| Module | What it does |
|---|---|
| Failure Taxonomy | Classifies every rejection into structured categories stored in metrics.db |
| Advisory Tuner | Suggests config changes from failure patterns (never auto-applies) |
| Quality Profiles | Determines lightweight/standard/strict gate set based on project maturity |
| Scenario Generation | Auto-generates ACCEPTANCE.yaml skeletons from epic acceptance criteria |
| Cached Decomposition | Structural hashing; skips redundant validation on retry |
See docs/QUALITY-INTELLIGENCE.md for the full reference.
Forge auto-detects the project language from marker files and selects a best-in-class toolchain:
| Language | Detected by | Build | Test | Lint | Format |
|---|---|---|---|---|---|
| Go | `go.mod` | `go build ./...` | `go test ./...` | `golangci-lint run ./...` | `gofmt` |
| Rust | `Cargo.toml` | `cargo build` | `cargo test` | `cargo clippy -- -D warnings` | `cargo fmt` |
| Java | `pom.xml` / `build.gradle` | `mvn compile` / `gradle build` | `mvn test` / `gradle test` | checkstyle | spotless |
| C++ | `CMakeLists.txt` | `cmake --build` | `ctest` | clang-tidy | clang-format |
| Python | `pyproject.toml` | `pip install -e .` | `pytest` | `ruff check` | `ruff format` |
| TypeScript | `package.json` + `tsconfig.json` | `npm run build` | `npm test` | `biome check` | `biome format` |
The `ToolchainBridge` adapts these Protocol-based implementations to the worker's interface. Override any command via `project.commands` in `studio/config.yaml`.
Since R9.E, Forge runs every OpenCode worker inside a container. The factory (`forge.runtime.create_runtime`) only accepts `worker_runtime: docker`; legacy values (`local`, `claude_code`, `auto`) raise `UnsupportedRuntimeError` at startup so stale configs fail fast.

| Runtime | How it runs | Notes |
|---|---|---|
| `docker` (only supported) | OpenCode in a container built from `Dockerfile.worker`; `docker cp` syncs the worktree in/out, aborts on copy failure | Build once with `make worker-image`. Cost/usage parsing flows through the shared `usage.py` parser and `--format json` + `.forge/completion.json`, just like before. |
Two files. Per project.
| File | Owns |
|---|---|
| `studio/models.yaml` | The only place model IDs and per-model token rates live. One required block of four aliases (`primary`, `secondary`, `reasoning`, `embedding`) plus optional rates. No fallbacks, no model tiers, no auto-swap. Schema and policy: docs/MODELS.md. |
| `studio/config.yaml` | Everything else — project shape, features, quality gates, retention, etc. Schema: docs/CONFIG-REFERENCE.md. |
`studio/config.yaml` sections worth knowing about:

| Section | Controls |
|---|---|
| `project` | Language, paths, commands, docs, sandbox allowlists |
| `worker_runtime` / `worker_image` | Container backend + image |
| `providers` | LLM endpoints (base URL, region, API-key env var). Provider for any given model is derived from the model ID prefix in `forge.providers.routing`; no per-agent `provider:` field. |
| `agents` | Per-agent model (alias from `models.yaml`), temperature, max_tokens, json_mode |
| `features` | `best_of_n`, `adaptive_search`, `mcp_tools`, `record_llm`, `record_patterns`, `record_training`, `evolution_proposals`, `chromadb_indexing` |
| `quality_gates` | Max lines / files per task (default 200 / 4); best-of-N count + threshold |
| `context_budgets` | Per-section token budgets |
| `memory` | Embedding provider, vector store, structured store paths |
| `rate_limits`, `concurrency`, `timeouts` | `max_epics: 1`, `max_workers_per_epic: 4`, Worker 900s, Fixer 300s |
| `reviewer`, `security`, `voting`, `promotion`, `orphan_branch`, `decomposer` | Per-subsystem policy tuning: rubber-stamp alarm, security-mismatch retro threshold, self-consistency collapse short-circuit + temperature controller, canary promote-gate agreement, orphan-branch retry threshold, decomposer compaction + prompt ceiling |
| `retention`, `trust`, `approval` | Retention caps, promote threshold, auto/manual approval |
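For orientation, a fragment along these lines shows how the sections fit together. The key names below are illustrative guesses assembled from the table above, not a verified schema — the authoritative reference is docs/CONFIG-REFERENCE.md:

```yaml
# Illustrative shape only; consult docs/CONFIG-REFERENCE.md for real key names.
worker_runtime: docker
worker_image: forge-worker:latest

quality_gates:
  max_lines_per_task: 200   # hard gate; tightens 30% per retry, 40-line floor
  max_files_per_task: 4

concurrency:
  max_epics: 1
  max_workers_per_epic: 4

timeouts:
  worker_s: 900
  fixer_s: 300
```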
Grouped by phase; every agent has a YAML in studio/agents/ and a row in config.yaml:
| Phase | Agents |
|---|---|
| Plan | Strategist (picks epic), Architect (designs interfaces), Decomposer (2–4 subtasks within gates), Plan Critic (reviews the decomposition), Validator (structural feasibility), Context Scout (impact maps, no LLM) |
| Build | Sketcher (pre-worker draft gate), Worker / Green (OpenCode TDD in worktree with completion.json covenant) |
| Review | Red (adversarial, reasoning tier), Fixer, Reviewer (reasoning tier, soft quality gates), Arbiter (reviewer-disagreement tiebreaker), Security Reviewer, Integration Architect |
| Merge + cleanup | Merge (baseline-aware), Refactor (post-merge dedup), Playtester (headless smoke) |
| Learn | Retrospective (writes validated proposals to .forge/evolution/) |
17 prompt YAMLs under studio/agents/. director.yaml is legacy (superseded by Strategist); critic.yaml exists but isn't in the active agent config.
`python -m forge.cli <subcommand>` — sourced from platform/forge/cli.py:

| Subcommand | Purpose |
|---|---|
| `init` | Scaffold a new project (`studio/`, `.forge/`, starter `config.yaml`) |
| `run` | Launch the engine (`--once`, `--dashboard`, `--max-tasks`, `--max-epic-retries`, `--dashboard-linger`, `--profile`) |
| `status` | Read `.forge/checkpoint.json` (`--costs`, `--diagnostics`, `--deep`) |
| `sleeptime` | Run a meta-evolution cycle |
| `task <repo>` | Clone a repo and run Forge against a free-form task |
| `config validate` | YAML parse + structural check |
| `canary` / `canary promote-gate` | Replay frozen canary epics; sign-off-gated promotion of `features.gate_node_primary` |
| `doctor` | Preflight: config, dirs, PATH, MCP boot, DB schemas |
| `costs` | Breakdown from `.forge/costs.db` |
| `diagnostics` | Failure diagnostics from `.forge/metrics.db` |
| `purge-index` | Evict stale worktree chunks from the code embedding collection |
| `backlog unshelve <epic>` | Clear cooldown + demotion so the planner can re-select |
| `preflight` | Dry-run the subtask acceptance gate without a worker |
| `memory migrate` | Patterns library → project-scoped git-history retrieval |
| `requeue <epic>` | Append to `.forge/priority_queue.json` |
| `priority list` / `priority clear` | Inspect / clear the priority queue |
| `profiles list` | Enumerate known `ProjectProfile` descriptors |

Trace-replay: `canary --replay --against <ref> --shadow --threshold --max-usd`.
Dashboard at http://127.0.0.1:8420 — D3 + dagre workflow DAG, live per-worker Docker status, per-model cost, activity-span hourly rate, subtask kanban, pause/resume, manual approval queue, /healthz + /readyz.
Structured events worth grepping for (all verified in-tree): `epic_outcome_lie_prevented`, `epic_outcome_kind_invalid`, `lines_changed_measurement_failed`, `calibration_skipped_zero_actual`, `calibration_skipped_measurement_failure`, `completion_mismatch`, `subtask_result_contract_violation`, `trust_demotion`, `diff_preflight_reject`, `redecompose_budget_consumed`, `merge_baseline`, `merge_rollback`, `fan_out`, `reviewer_disagreement_arbitrated`. Baseline-readiness failures raise `BaselineNotReady` (exception, not event). Run CI with `FORGE_STRICT_OUTCOMES=1` so the outcome-lie diagnostic hard-asserts.
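Filtering the stream for honesty-invariant hits takes only a few lines. A hedged sketch assuming a JSON-lines event log (the log format and helper are illustrative; the real artifact layout may differ):

```python
import json

# The four honesty-invariant events from the README.
ALARMS = {"epic_outcome_lie_prevented", "completion_mismatch",
          "calibration_skipped_zero_actual", "subtask_result_contract_violation"}

def alarm_events(lines):
    """Yield parsed events whose type is one of the honesty-invariant alarms."""
    for line in lines:
        event = json.loads(line)
        if event.get("event") in ALARMS:
            yield event

log = ['{"event": "fan_out", "workers": 3}',
       '{"event": "completion_mismatch", "subtask": "HTTP-202a"}']
assert [e["event"] for e in alarm_events(log)] == ["completion_mismatch"]
```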
Tests. `python3 -m pytest platform/tests/ -p no:xdist -o "addopts=" -v` runs 8,400+ tests (as of 2026-05-06). Autouse `_isolate_forge_state` (platform/tests/conftest.py) redirects every `.forge/*.db` constructor and `_REPO_ROOT` to a per-test tmp dir. The default pytest invocation uses `-n auto --dist loadscope` from platform/pyproject.toml; the `-p no:xdist -o "addopts="` form above disables parallelism for deterministic output ordering.
Eval harness. 30 gold scenarios in platform/forge/eval/scenarios.py; nightly detect_regression + per-layer attribution; BaselineNotReady on placeholder baselines (v4.5.2). See docs/EVAL-HARNESS.md.
Canary. forge canary replays 3 deterministic epics (epic_fix_typo, epic_add_pure_helper, epic_should_shelve) against .forge/canary/canary_baseline.json.
```
forge/
├── run.sh, run-forever.sh        # launchers (nono sandbox, continuous daemon)
├── Dockerfile, Dockerfile.worker
├── platform/forge/               # Python package
│   ├── cli.py, engine/, runtime/, branching/, context/, memory/
│   ├── mcp/, trust/, eval/, meta/, dashboard/, approval/
│   └── quality/, observability/, integrations/, plugins/
├── platform/tests/               # 8,400+ tests; autouse _isolate_forge_state
├── studio/
│   ├── config.yaml, agents/      # 17 per-agent prompt YAMLs
│   └── workflows/                # orchestrator.py, worker_session.py, retrospective.py
│       └── nodes/                # extracted node modules with typed I/O contracts
├── examples/                     # 8 project profiles: go-game, go-roguelike, go-horror,
│                                 # go-roguefort, go-arcology, go-voidrift, go-ironhold, python-api
├── benchmarks/                   # per-model benchmark harness
├── docs/                         # release notes + deep-dive references
└── .forge/                       # runtime state: costs.db, metrics.db, trust.db,
                                  # canary/, eval/baselines/, specs/, evolution/
```
Full layout and subsystem deep-dives in docs/ARCHITECTURE.md and docs/PLATFORM.md.
Start here, then drill down as needed. Full index at docs/README.md.
| You want to… | Read |
|---|---|
| Install and run Forge | docs/QUICKSTART.md |
| Contribute code | CONTRIBUTING.md |
| Understand the architecture | docs/ARCHITECTURE.md, docs/PLATFORM.md |
| Pick / change models | docs/MODELS.md |
| Configure a project | docs/CONFIG-REFERENCE.md |
| Point Forge at a new codebase | examples/README.md |
| Reproduce the benchmark numbers | benchmarks/README.md, docs/POLYGLOT.md, docs/SWE-BENCH.md |
| See what changed per release | CHANGELOG.md |
| Read design decisions | docs/adr/ |
| Understand the honesty invariants | docs/EVAL-HARNESS.md, docs/V4.5.2.md |
| Version | Theme | Notes |
|---|---|---|
| v5.1.0 (current) | Quality intelligence, multi-language toolchains | 18 quality modules, Go/Rust/Java/C++ toolchains, plan critic enabled, wiring subtask injection, acceptance DSL, VerificationPipeline preflight |
| v5.0.0 | Quality hardening, node architecture, CI gates | Extracted node modules with typed I/O contracts, GitHub Actions CI, 8,400+ tests, 87% coverage |
| v4.5.2 | Outcome honesty, test isolation, release gates | docs/V4.5.2.md |
| v4.5.1.1 | Live-validation patch (HTTP-102 follow-ups) | docs/V4.5.1.1.md |
| v4.5.1 | Live-validation hardening over v4.5 | docs/V4.5.1.md |
| v4.5 | Truthful, budget-aware workers | docs/V4.5.md |
| v4.4.3 | Eval harness (pure infrastructure, no behavior change) | docs/V4.4.3.md |
| v4.4 | Truthful outcomes + robust Forge | docs/V4.4.md |
Older: v4.3.1, v4.3, v4.2, v4 — see docs/V4*.md. Stage notes for v4.5.2 layers B–E live under docs/V4.5.2-stage*.md.
Research

- TDAD (Test-Driven Agentic Development) — AST-based code–test dependency graphs; direct parent of Forge's impact maps.
- Self-consistency (Wang et al., arXiv 2203.11171) — Majority voting over multiple samples. Powers the decomposer / reviewer vote-and-select path with collapse short-circuit + temperature controller (platform/forge/engine/voting.py).
- Ralph Wiggum pattern (Geoffrey Huntley) — Fresh context + git-as-memory iteration over stacked error histories. Drives the `progress.json` retry rebuild.
- SWE-bench Pro (Princeton NLP) — Evaluation taxonomy for realistic, multi-file repository tasks. Forge ships a data layer for running against it (platform/forge/swebench/task.py).

Engineering lineage

- LangGraph — Entire orchestrator is `StateGraph` composition with `Send` fanout and checkpointing.
- Aider (Paul Gauthier) — SEARCH/REPLACE block grammar (platform/forge/tools/diff_applier.py), polyglot benchmark corpus (scripts/_polyglot_one.py), and the anti-pattern lesson that "fuzzy matchers accumulate edge cases" — so we don't have one.
- Model Context Protocol (Anthropic) — Every worker gets a per-worktree MCP server (platform/forge/mcp/) exposing `search_codebase`, `find_related_code`, `check_risk`, `recall_similar_tasks`.
- Circuit breaker (Nygard / Fowler) — Decompose/retry breaker shelves or accepts partial debt (platform/forge/engine/circuit_breakers.py). The cross-model failure-rate auto-swap variant was deleted in the 2026-05-02 cleanup along with `ModelRouter`; model selection is now operator-driven via `studio/models.yaml`.
- tree-sitter — Symbol extraction under platform/forge/context/; feeds the repo map, CodeGraph, and impact-map construction.
Apache License 2.0 — see LICENSE.
Built and maintained by Allen Sarkisyan — GitHub · LinkedIn.
