An autonomous coding agent that tells you the truth.
Forge is a multi-agent engine that plans, writes, reviews, and merges code on your behalf. Unlike most agent frameworks, it refuses to mark work "done" unless the tests actually pass, the diff actually landed, and the worker's self-report matches reality. When an agent lies on the dashboard, CI turns red.
Built on LangGraph, with workers running OpenCode inside isolated git worktrees. Project-agnostic via studio/config.yaml. Current release: v5.1.0 — Quality Intelligence, Multi-Language Toolchains.
Built by Allen Sarkisyan — GitHub · LinkedIn.
The live dashboard during a real session. "OUTCOME LIE PREVENTED" banner shows honesty invariants catching a worker claim that didn't match reality.
| | |
|---|---|
| Aider polyglot · Python (Claude Sonnet 4.5 via Bedrock, Mode A, 3-run variance; 2026-04-18) | 34 / 34 · 100% · median $6.53/run · 86s/ex (details) |
| Aider polyglot · Rust (Claude Sonnet 4.5 via Bedrock, Mode A, single run, n=1; 2026-04-18) | 30 / 30 · 100% · cost pre-telemetry-fix (not reliably captured) (details) |
| Aider polyglot · Go (Claude Sonnet 4.5 via Bedrock, Mode A, single run, n=1; 2026-04-19) | 39 / 39 · 100% · $7.70 · 57 min · 88s/ex (details) |
| Test suite on main (collected 2026-05-06) | 8,400+ tests (see Tests) |
| Honesty invariants active | `epic_outcome_lie_prevented` · `completion_mismatch` · `calibration_skipped_zero_actual` · `subtask_result_contract_violation` |
| Multi-language toolchains | Go · Rust · Java · C++ · Python · TypeScript (auto-detected from project files) |
| Quality intelligence | 18 deterministic modules: impl guard, coherence auditor, vocabulary checker, acceptance DSL, failure taxonomy, advisory tuner, and more |
| License | Apache 2.0 |
Polyglot coverage (as of 2026-04-19): 103 / 225 exercises verified across 3 of 6 languages, all 100%. JavaScript, C++, Java runs are the remaining unknowns; see docs/POLYGLOT.md for methodology and caveats.
Polyglot methodology, caveats, and full comparison to Aider's own published numbers in docs/POLYGLOT.md.
- What Forge does · What it isn't
- Quickstart
- How it works: Architecture · Anatomy of one epic
- Why it's different: How it compares · Design principles · Failure modes
- Reference: Worker runtimes · Configuration · Agent roster · CLI · Observability & testing · Project layout
- Documentation map · Releases · Design influences · License
You give Forge a backlog (a set of epics or tasks). For each one, it:
- Plans — picks the next epic, drafts an architectural spec, and decomposes the work into 2–4 small, reviewable subtasks.
- Builds — fans out parallel workers into isolated git worktrees, each running a test-driven OpenCode session. Language-aware toolchains (auto-detected from go.mod, Cargo.toml, pom.xml, etc.) handle build, test, vet, and format.
- Reviews — the Real Implementation Guard rejects stubs/demos before the diff reaches the reviewer. Then: reviewer, security scan, red team adversarial edge cases, and integration check. The VerificationPipeline runs all lint/vet steps with preflight tool availability checks.
- Merges — applies approved subtasks in layer order, subtracting any pre-existing failures via a baseline so only new regressions trigger a rollback.
- Learns — writes a retrospective, updates calibration data, and feeds lessons back into the planner.
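The baseline-aware merge step above boils down to a set difference. A minimal sketch of the idea (the function name and shapes here are illustrative, not Forge's actual API): only failures that are new relative to the pre-merge snapshot count as regressions.

```python
def new_regressions(baseline_failures: set[str], post_merge_failures: set[str]) -> set[str]:
    """Failures present after the merge but absent from the pre-merge baseline.

    Illustrative sketch: pre-existing failures are subtracted, so a flaky or
    already-broken test cannot veto an unrelated merge.
    """
    return post_merge_failures - baseline_failures

baseline = {"test_legacy_flaky"}            # failing before the merge
after = {"test_legacy_flaky", "test_new"}   # failing after the merge
assert new_regressions(baseline, after) == {"test_new"}  # only the new failure triggers rollback
```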
Every epic terminates on one of six validated outcome kinds (done, shelved, escalated, shelved_by_redteam, abandoned_pending, error) — never an ambiguous "finished". The full event stream is the audit trail; there is no other source of truth.
An LLM that says "I fixed it" is not evidence. Forge treats worker self-reports as claims to be verified:
- `.forge/completion.json` (the worker's claim) is cross-checked against actual test output. Mismatches emit `completion_mismatch` and demote the worker via the Trust Ratchet.
- A run that reports `done` but fanned out zero workers hits an invariant in `emit_epic_outcome_event` and either emits `epic_outcome_lie_prevented` or raises `AssertionError` (under `FORGE_STRICT_OUTCOMES=1`, which CI sets).
- Retrospectives citing models that aren't in `metrics.db`, or dated off-year, are quarantined to `.forge/evolution/<ts>.invalid.md` instead of the canonical feed.
- Not an IDE. For interactive coding, use Cursor or Claude Code.
- Not a one-shot bug fixer. Even a 10-line fix traverses the full plan → build → review → merge loop. For targeted work, `forge task <repo> -d "..."` exists; for issue-only or PR-only flows the scaffolding is present but not in the hot path.
- Not deterministic. Costs, timings, and subtask counts vary run-to-run. The gates are deterministic; the work isn't.
- Not single-model. Every role (planner, worker, reviewer, adversarial) can be a different model. Models are pinned in `studio/models.yaml` — one file, no fallbacks, no auto-swapping. See docs/MODELS.md.
- Not a replacement for human review on novel architecture. The `AUTONOMOUS` trust tier auto-merges small, well-understood changes; anything surprise-magnitude or cross-package still routes to the approval queue.
Prerequisites. Python 3.12+, Docker Desktop (required since R9.E), OpenCode on PATH (only needed for ad-hoc tooling — Forge installs it inside the worker image), and one provider credential (Bedrock API key, AWS credentials, or `OPENROUTER_API_KEY`). Run `make worker-image` once to build `forge-worker:latest`. Go 1.24+ is needed only for the bundled game example; nono only when launching via `./run.sh`.
Install.

```bash
git clone https://github.com/sjqtentacles/forge && cd forge
python3 -m venv .venv && source .venv/bin/activate
pip install -e 'platform[dev]'
cp .env.example .env   # then uncomment the credentials you want to use
```

Run (canonical).

```bash
PYTHONPATH=platform .venv/bin/python3 -m forge.cli run --dashboard
PYTHONPATH=platform .venv/bin/python3 -m forge.cli run --once --max-tasks 3
```

Launcher scripts (not interchangeable):

- `./run.sh` — Bedrock via AWS profile (uses `$AWS_PROFILE`, defaults to `default`; set `AWS_PROFILE` in your shell or `.env` to use a different profile), wraps the CLI in nono.
- `./run-forever.sh` — OpenRouter (`OPENROUTER_API_KEY`), continuous daemon, no sandbox.
Dashboard at http://127.0.0.1:8420. Models are pinned in studio/models.yaml — one file, no fallbacks; see docs/MODELS.md for the policy. The rest of the configuration (project shape, features, gates, retention) lives in studio/config.yaml; full schema in docs/CONFIG-REFERENCE.md. See docs/QUICKSTART.md for a longer walkthrough, troubleshooting table, and test commands.
Two LangGraph state machines compose the engine.
The outer graph schedules epics and retrospects:
```mermaid
graph LR
    startNode[Start] --> BP[Backlog Planner]
    BP -->|"Send per epic"| EP[Epic Pipeline]
    BP -->|"queue empty"| MM[Mega Merge]
    EP --> MM
    MM --> Retro[Retrospective]
    Retro --> endNode[End]
```
The inner graph runs a single epic — Strategist picks it, Architect designs, Decomposer splits into tiny subtasks, workers fan out in isolated worktrees, gated by Reviewer / Security / Red Team / Merge, with fresh-context recovery and a post-merge playtest:
```mermaid
graph LR
    S[Strategist] --> CS[Context Scout]
    CS -->|"first pass"| AR[Architecture Review]
    CS -->|"retry pass"| D[Decomposer]
    CS -->|"fan_out phase"| FO[Fan Out]
    AR --> D
    D --> V[Validator]
    V --> CS
    FO --> W[Workers]
    W --> F[Fixer]
    F --> R[Reviewer]
    R --> SR[Security Review]
    SR --> IC[Integration Check]
    IC --> RT[Red Team]
    RT --> M[Merge]
    M -->|"success"| VA[Verify Acceptance]
    VA --> P[Playtester]
    P --> PL[Persist Learnings]
    PL --> endNode[End]
    RT -->|"edge case found"| F
    M -->|"restyle"| RA[Recovery Advisor]
    RA --> CS
    M -->|"fallback"| Ref[Refactor]
    Ref --> P
```
Forge-on-Forge mode (running Forge against its own monorepo) is configured via platform/forge/engine/forge_self.py — see docs/PROFILES.md for the documented profile and studio/profiles/forge_on_forge.yaml for the wired entry point. The R5 issue-resolution / PR-mode / SWE-bench-stub scaffolding modules were deleted in R6 (see docs/V4.6.md) — the canonical SWE-bench harness lives at forge.swebench.task.
Concrete example: what actually happens when Forge runs HTTP-202: Implement CSV splitter.
| Stage | What happens | Artifact | Key events |
|---|---|---|---|
| Strategist picks | Selects epic from taskboard + design doc + learnings | Spec draft in state | `node_enter {node: strategist}` |
| Architect designs | Interfaces, file layout, acceptance criteria | `.forge/specs/HTTP-202.md` | `node_exit {node: strategist}` |
| Decomposer splits | 2–4 subtasks within quality gates | `state.subtasks_live` + DAG | `decomposer_vote`, `decomposer_lesson_injected` |
| Validator checks | Type conflicts, bad imports (retry-bounded) | — | `node_enter {node: validator}` |
| Context Scout | Impact maps per subtask via CodeGraph | Affected-test lists | — |
| Fan out | Workers dispatched in parallel (layer order) | `.worktrees/HTTP-202a/` etc. | `fan_out {workers: N}` |
| Workers run OpenCode | TDD loop: write test, fail, implement, pass | Worktree diff + `completion.json` | `completion_mismatch?` |
| Fixer (if needed) | Lint/build errors fed back | Updated diff | `node_enter {node: fixer}` |
| Reviewer + preflight | Oversize diffs short-circuit to redecompose | Review verdict | `diff_preflight_reject?` |
| Security + Integration | Vulnerability scan + cross-subtask check | — | — |
| Red Team | Adversarial edge-case tests against approved diffs | New failing test? | `red_team_merge_with_debt?` |
| Merge (baseline-aware) | Layer-ordered; subtracts pre-existing failures | Updated working tree | `merge_baseline`, `merge_rollback?` |
| Playtest + Verify | Headless smoke + acceptance check | docs/PLAYTEST-LOG.md (generated at runtime; not committed) | — |
| Persist Learnings | Outcome inference via `infer_outcome_kind` | `metrics.db`, `calibration.jsonl` | `epic_outcome {kind: done}` (validated) |
Conditional branches not shown: epic_coherence (post-Strategist sanity check), plan_critic (reviews the decomposer's plan), sketch_gate (pre-worker draft check), adversarial_review (reviewer triad + arbiter tiebreaker), red_team_repair, recovery_advisor.
The agent-framework space is crowded and moving weekly; rather than claim features about tools I can't fully verify, here is what Forge specifically does, with code pointers, plus the one benchmark where a direct comparison is fair:
- Honesty invariants. "Refuse to mark work done unless tests actually pass" is enforced by a `completion.json` covenant cross-checked against real test output. Mismatches emit `completion_mismatch` and demote the worker via the Trust Ratchet. Outcome lies (e.g. reporting `done` with zero workers fanned out) hit an invariant in `emit_epic_outcome_event` (platform/forge/logging.py) and either emit `epic_outcome_lie_prevented` or raise `AssertionError` under `FORGE_STRICT_OUTCOMES=1`.
- Per-subtask worktree isolation. Every subtask runs in its own `.worktrees/<epic>/` directory; every test gets its own `.forge/tmp` dir via autouse `_isolate_forge_state`; every worker gets its own MCP server. Cross-contamination is a bug, not a trade-off.
- Baseline-aware merge gate. `evaluate_merge_gate` snapshots pre-existing failures at `pre_merge_head` and subtracts them; only new regressions roll back a branch. Placeholder or null baselines raise `BaselineNotReady` rather than silently passing.
- Validated outcome enum. Epics terminate on one of six kinds (`done`, `shelved`, `escalated`, `shelved_by_redteam`, `abandoned_pending`, `error`), with a paired `epic_outcome_cause`. Off-enum values down-coerce to `error` and emit `epic_outcome_kind_invalid`.
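The down-coercion in the last bullet is a one-branch guard. A hedged sketch of the idea (the function and event shape are illustrative, not the in-tree code): anything off-enum becomes `error` and leaves an audit event behind.

```python
# The six validated outcome kinds from the README; everything else is a lie or a bug.
VALID_OUTCOMES = {"done", "shelved", "escalated",
                  "shelved_by_redteam", "abandoned_pending", "error"}

def coerce_outcome(kind: str, emit) -> str:
    """Illustrative sketch: validate an outcome kind against the enum.

    Off-enum values down-coerce to "error" and emit a diagnostic event via
    the supplied `emit` callable rather than passing through silently.
    """
    if kind in VALID_OUTCOMES:
        return kind
    emit({"event": "epic_outcome_kind_invalid", "raw_kind": kind})
    return "error"

events: list[dict] = []
assert coerce_outcome("done", events.append) == "done"
assert coerce_outcome("finished", events.append) == "error"   # off-enum down-coerces
assert events == [{"event": "epic_outcome_kind_invalid", "raw_kind": "finished"}]
```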
On the one benchmark where numbers are directly comparable — the Aider polyglot benchmark on Claude Sonnet 4.5:
| Harness | Python (34 ex.) | Rust (30 ex.) | Go (39 ex.) | All-6-langs (225 ex.) |
|---|---|---|---|---|
| Aider's own (published) | not published per-language | not published per-language | not published per-language | 77.9% |
| Forge worker (Mode A) | 100% (34/34) · 3-run variance · median $6.53 | 100% (30/30) · single run, n=1 | 100% (39/39) · single run, n=1 · $7.70 | 103/225 measured · 122 remaining (JS · C++ · Java) |
Mode A bypasses Forge's planner/reviewer stack to benchmark just the worker loop; full-stack numbers will differ. See docs/POLYGLOT.md for the methodology, caveats about training-data contamination, and why Mode A was the honest starting point.
- Git is the memory, not the prompt. Retries rebuild the prompt from scratch using `progress.json` + the current diff. No stacked error histories, no prompt degradation. This is the Ralph Wiggum pattern.
- Baselines make failures real. `evaluate_merge_gate` snapshots pre-existing failures at `pre_merge_head` and subtracts them; only new regressions roll back a branch.
- Invariants assert; diagnostics log. Honesty violations raise `AssertionError` under `FORGE_STRICT_OUTCOMES=1`; in production they emit structured events (`epic_outcome_lie_prevented`, `completion_mismatch`, `calibration_skipped_zero_actual`, `subtask_result_contract_violation`).
- Workers can't lie twice. `.forge/completion.json` is cross-checked against actual test output. Repeat liars get demoted by the Trust Ratchet.
- Every outcome is on-enum. Epics terminate on one of six validated kinds — `done`, `shelved`, `escalated`, `shelved_by_redteam`, `abandoned_pending`, `error` — with a paired `epic_outcome_cause`. Off-enum values down-coerce to `error` and emit `epic_outcome_kind_invalid`.
- Isolated by construction. Each subtask gets its own worktree, each test gets its own `.forge/tmp` dir (autouse `_isolate_forge_state`), each worker gets its own MCP server. Cross-contamination is a bug, not a trade-off.
- Small epics beat clever epics. Hard gate at 200 lines / 4 files per subtask, tightening 30% per retry with a 40-line floor. If a diff blows the budget, `diff_preflight` rejects it before the reviewer ever sees it.
- The retrospective is held to the same standard. Reports citing models not in `metrics.db` or dated off-year land in `.forge/evolution/<ts>.invalid.md` sidecars — not the canonical feed.
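The "tightens 30% per retry with a 40-line floor" budget works out to a short geometric decay. A sketch under one assumed reading (each retry keeps 70% of the previous budget, clamped at the floor; the function name is illustrative):

```python
def line_budget(retry: int, base: int = 200, floor: int = 40) -> int:
    """Illustrative sketch of the shrinking diff budget.

    Assumed reading of the README's rule: each retry keeps 70% of the
    previous budget (integer arithmetic), never dropping below the floor.
    """
    budget = base
    for _ in range(retry):
        budget = max(floor, budget * 7 // 10)   # tighten 30%, clamp at floor
    return budget

# Budget per retry: 200 → 140 → 98 → 68 → 47 → 40 (floor reached)
assert [line_budget(r) for r in range(6)] == [200, 140, 98, 68, 47, 40]
```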
| Failure | Detection | Response |
|---|---|---|
| Worker claims pass, tests fail | Completion contract cross-check | `completion_mismatch` + trust demotion |
| "done" with zero workers fanned out | `emit_epic_outcome_event` invariant | `epic_outcome_lie_prevented` (or `AssertionError` under `FORGE_STRICT_OUTCOMES=1`) |
| Diff exceeds the line budget | `diff_preflight.evaluate` pre-review | Reject, tighten decomposer budget 30%, retry |
| Retry loop stuck on oversize | `merge_node` ratio check | Escalate with cause `redecompose.oversize_stuck` |
| `_lines_changed` can't be measured | `worker_diff.compute_lines_changed` returns `None` | Skip calibration row, emit `calibration_skipped_measurement_failure` |
| Calibration row has estimate > 0 AND actual == 0 on a pass | Defensive check in `record_if_eligible` | Drop row, emit `calibration_skipped_zero_actual` |
| Baseline is placeholder or null-aggregate | `detect_regression` guard | Raise `BaselineNotReady`; callers refresh baseline |
| Retrospective cites a model not in `metrics.db` | `retro_validator` check | Quarantine to `.forge/evolution/<ts>.invalid.md` |
| Primary model failure rate spikes | Operator inspects dashboard cost panel + `metrics.db` | Edit `studio/models.yaml`, restart. (Auto-swap was deleted in the 2026-05-02 cleanup; see docs/MODELS.md.) |
| Tests poison the real `.forge/*.db` | `CostTracker` tripwire under `PYTEST_CURRENT_TEST` | Autouse `_isolate_forge_state` redirects every constructor |
| Epic times out | `epic_pipeline` timeout branch | Outcome `error` with cause `epic_pipeline.timeout` (not off-enum `timeout`) |
Forge ships 18 deterministic (no-LLM) quality modules that compose into the pipeline at different stages. All follow TDD — the module tests define the behavioral contract.
Planning quality (pre-worker):
| Module | What it does |
|---|---|
| Plan Critic + Cross-Package Coherence | 7-step reasoning guide catches hidden deps, sizing optimism, and shared-concept drift between subtasks targeting different packages |
| Integration Wiring Subtask | Sanitizer auto-injects a final-layer subtask to wire all packages into the entry point; decomposer prompt rule forces explicit emission |
| Import Graph Injection | Lightweight package-level dependency graph injected into decomposer prompts (budget-capped) |
| Contract Registry | Auto-extracts exported functions/types/constants per package for decomposer context |
| Integration Test Requirement | Auto-appends cross-package test requirement to subtask specs with dependencies |
Worker quality (during build):
| Module | What it does |
|---|---|
| Real Implementation Guard | Static checks for stub patterns: Go (fmt.Println-only main, panic("not implemented")), Python (pass-only, raise NotImplementedError), TypeScript (console.log-only, throw not implemented) |
| Verification Pipeline + Preflight | Configurable multi-step lint/type-check/format pipeline; preflight_check warns about missing binaries on PATH at startup |
| Incremental Testing | Selects only tests affected by changed files via CodeGraph |
| Enhanced Worker Context | Injects neighbor package exports (budget-capped) so workers know available symbols |
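The Real Implementation Guard's Python checks can be sketched with the standard-library `ast` module. This is an illustrative reimplementation of the idea, not the in-tree module: flag functions whose entire body is `pass` or `raise NotImplementedError`.

```python
import ast

def is_stub(fn: ast.FunctionDef) -> bool:
    """True if the function body is only `pass` or `raise NotImplementedError`."""
    if len(fn.body) != 1:
        return False
    stmt = fn.body[0]
    if isinstance(stmt, ast.Pass):
        return True
    if isinstance(stmt, ast.Raise) and stmt.exc is not None:
        # Handle both `raise NotImplementedError` and `raise NotImplementedError(...)`.
        exc = stmt.exc.func if isinstance(stmt.exc, ast.Call) else stmt.exc
        return isinstance(exc, ast.Name) and exc.id == "NotImplementedError"
    return False

def has_stub(source: str) -> bool:
    """Scan a module's source for any stub-only function definition."""
    return any(is_stub(node) for node in ast.walk(ast.parse(source))
               if isinstance(node, ast.FunctionDef))

assert has_stub("def f():\n    pass")
assert has_stub("def g():\n    raise NotImplementedError('todo')")
assert not has_stub("def h():\n    return 42")
```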
Post-build quality (pre-merge):
| Module | What it does |
|---|---|
| Acceptance DSL + Runner | YAML-based scenarios with deterministic assertions (exit_code, stdout_contains, file_exists, runtime_under_s, etc.) |
| Coherence Auditor | Extracts Go/Python constants and switch-case values; flags when two packages diverge on a shared concept |
| Vocabulary Consistency Checker | Groups constants by prefix; cross-references switch variables by stem overlap |
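The deterministic assertions the Acceptance DSL names (`exit_code`, `stdout_contains`, `file_exists`, `runtime_under_s`) are straightforward to evaluate. A minimal sketch in that spirit — the scenario shape and function are illustrative, not the real ACCEPTANCE.yaml schema or runner:

```python
import pathlib
import subprocess
import sys
import time

def run_scenario(cmd: list[str], checks: dict) -> list[str]:
    """Run a command and evaluate deterministic acceptance checks against it.

    Illustrative sketch: returns a list of failure descriptions (empty = pass).
    """
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    failures = []
    if "exit_code" in checks and proc.returncode != checks["exit_code"]:
        failures.append(f"exit_code: got {proc.returncode}")
    if "stdout_contains" in checks and checks["stdout_contains"] not in proc.stdout:
        failures.append("stdout_contains: missing")
    if "file_exists" in checks and not pathlib.Path(checks["file_exists"]).exists():
        failures.append("file_exists: missing")
    if "runtime_under_s" in checks and elapsed >= checks["runtime_under_s"]:
        failures.append("runtime_under_s: exceeded")
    return failures

# A scenario asserting the program prints "ok" and exits 0:
assert run_scenario([sys.executable, "-c", "print('ok')"],
                    {"exit_code": 0, "stdout_contains": "ok"}) == []
```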
Feedback and adaptation:
| Module | What it does |
|---|---|
| Failure Taxonomy | Classifies every rejection into structured categories stored in metrics.db |
| Advisory Tuner | Suggests config changes from failure patterns (never auto-applies) |
| Quality Profiles | Determines lightweight/standard/strict gate set based on project maturity |
| Scenario Generation | Auto-generates ACCEPTANCE.yaml skeletons from epic acceptance criteria |
| Cached Decomposition | Structural hashing; skips redundant validation on retry |
See docs/QUALITY-INTELLIGENCE.md for the full reference.
Forge auto-detects the project language from marker files and selects a best-in-class toolchain:
| Language | Detected by | Build | Test | Lint | Format |
|---|---|---|---|---|---|
| Go | `go.mod` | `go build ./...` | `go test ./...` | `golangci-lint run ./...` | `gofmt` |
| Rust | `Cargo.toml` | `cargo build` | `cargo test` | `cargo clippy -- -D warnings` | `cargo fmt` |
| Java | `pom.xml` / `build.gradle` | `mvn compile` / `gradle build` | `mvn test` / `gradle test` | checkstyle | spotless |
| C++ | `CMakeLists.txt` | `cmake --build` | `ctest` | clang-tidy | clang-format |
| Python | `pyproject.toml` | `pip install -e .` | `pytest` | `ruff check` | `ruff format` |
| TypeScript | `package.json` + `tsconfig.json` | `npm run build` | `npm test` | `biome check` | `biome format` |
The `ToolchainBridge` adapts these Protocol-based implementations to the worker's interface. Override any command via `project.commands` in `studio/config.yaml`.
Since R9.E, Forge runs every OpenCode worker inside a container. The factory (`forge.runtime.create_runtime`) only accepts `worker_runtime: docker`; legacy values (`local`, `claude_code`, `auto`) raise `UnsupportedRuntimeError` at startup so stale configs fail fast.

| Runtime | How it runs | Notes |
|---|---|---|
| `docker` (only supported) | OpenCode in a container built from `Dockerfile.worker`; `docker cp` syncs the worktree in/out, aborts on copy failure | Build once with `make worker-image`. Cost/usage parsing flows through the shared `usage.py` parser and `--format json` + `.forge/completion.json`, just like before. |
Two files. Per project.
| File | Owns |
|---|---|
| `studio/models.yaml` | The only place model IDs and per-model token rates live. One required block of four aliases (`primary`, `secondary`, `reasoning`, `embedding`) plus optional rates. No fallbacks, no model tiers, no auto-swap. Schema and policy: docs/MODELS.md. |
| `studio/config.yaml` | Everything else — project shape, features, quality gates, retention, etc. Schema: docs/CONFIG-REFERENCE.md. |
`studio/config.yaml` sections worth knowing about:

| Section | Controls |
|---|---|
| `project` | Language, paths, commands, docs, sandbox allowlists |
| `worker_runtime` / `worker_image` | Container backend + image |
| `providers` | LLM endpoints (base URL, region, API-key env var). Provider for any given model is derived from the model ID prefix in `forge.providers.routing`; no per-agent `provider:` field. |
| `agents` | Per-agent model (alias from `models.yaml`), temperature, max_tokens, json_mode |
| `features` | `best_of_n`, `adaptive_search`, `mcp_tools`, `record_llm`, `record_patterns`, `record_training`, `evolution_proposals`, `chromadb_indexing` |
| `quality_gates` | Max lines / files per task (default 200 / 4); best-of-N count + threshold |
| `context_budgets` | Per-section token budgets |
| `memory` | Embedding provider, vector store, structured store paths |
| `rate_limits`, `concurrency`, `timeouts` | `max_epics: 1`, `max_workers_per_epic: 4`, Worker 900s, Fixer 300s |
| `reviewer`, `security`, `voting`, `promotion`, `orphan_branch`, `decomposer` | Per-subsystem policy tuning: rubber-stamp alarm, security-mismatch retro threshold, self-consistency collapse short-circuit + temperature controller, canary promote-gate agreement, orphan-branch retry threshold, decomposer compaction + prompt ceiling |
| `retention`, `trust`, `approval` | Retention caps, promote threshold, auto/manual approval |
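For orientation, a fragment along these lines shows how the sections fit together. The key names below are illustrative guesses assembled from the table above, not a verified schema — the authoritative reference is docs/CONFIG-REFERENCE.md:

```yaml
# Illustrative shape only; consult docs/CONFIG-REFERENCE.md for real key names.
worker_runtime: docker
worker_image: forge-worker:latest

quality_gates:
  max_lines_per_task: 200   # hard gate; tightens 30% per retry, 40-line floor
  max_files_per_task: 4

concurrency:
  max_epics: 1
  max_workers_per_epic: 4

timeouts:
  worker_s: 900
  fixer_s: 300
```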
Grouped by phase; every agent has a YAML in studio/agents/ and a row in config.yaml:
| Phase | Agents |
|---|---|
| Plan | Strategist (picks epic), Architect (designs interfaces), Decomposer (2–4 subtasks within gates), Plan Critic (reviews the decomposition), Validator (structural feasibility), Context Scout (impact maps, no LLM) |
| Build | Sketcher (pre-worker draft gate), Worker / Green (OpenCode TDD in worktree with completion.json covenant) |
| Review | Red (adversarial, reasoning tier), Fixer, Reviewer (reasoning tier, soft quality gates), Arbiter (reviewer-disagreement tiebreaker), Security Reviewer, Integration Architect |
| Merge + cleanup | Merge (baseline-aware), Refactor (post-merge dedup), Playtester (headless smoke) |
| Learn | Retrospective (writes validated proposals to .forge/evolution/) |
17 prompt YAMLs under studio/agents/. director.yaml is legacy (superseded by Strategist); critic.yaml exists but isn't in the active agent config.
`python -m forge.cli <subcommand>` — sourced from platform/forge/cli.py:

| Subcommand | Purpose |
|---|---|
| `init` | Scaffold a new project (`studio/`, `.forge/`, starter `config.yaml`) |
| `run` | Launch the engine (`--once`, `--dashboard`, `--max-tasks`, `--max-epic-retries`, `--dashboard-linger`, `--profile`) |
| `status` | Read `.forge/checkpoint.json` (`--costs`, `--diagnostics`, `--deep`) |
| `sleeptime` | Run a meta-evolution cycle |
| `task <repo>` | Clone a repo and run Forge against a free-form task |
| `config validate` | YAML parse + structural check |
| `canary` / `canary promote-gate` | Replay frozen canary epics; sign-off-gated promotion of `features.gate_node_primary` |
| `doctor` | Preflight: config, dirs, PATH, MCP boot, DB schemas |
| `costs` | Breakdown from `.forge/costs.db` |
| `diagnostics` | Failure diagnostics from `.forge/metrics.db` |
| `purge-index` | Evict stale worktree chunks from the code embedding collection |
| `backlog unshelve <epic>` | Clear cooldown + demotion so the planner can re-select |
| `preflight` | Dry-run the subtask acceptance gate without a worker |
| `memory migrate` | Patterns library → project-scoped git-history retrieval |
| `requeue <epic>` | Append to `.forge/priority_queue.json` |
| `priority list` / `priority clear` | Inspect / clear the priority queue |
| `profiles list` | Enumerate known `ProjectProfile` descriptors |

Trace-replay: `canary --replay --against <ref> --shadow --threshold --max-usd`.
Dashboard at http://127.0.0.1:8420 — D3 + dagre workflow DAG, live per-worker Docker status, per-model cost, activity-span hourly rate, subtask kanban, pause/resume, manual approval queue, /healthz + /readyz.
Structured events worth grepping for (all verified in-tree): `epic_outcome_lie_prevented`, `epic_outcome_kind_invalid`, `lines_changed_measurement_failed`, `calibration_skipped_zero_actual`, `calibration_skipped_measurement_failure`, `completion_mismatch`, `subtask_result_contract_violation`, `trust_demotion`, `diff_preflight_reject`, `redecompose_budget_consumed`, `merge_baseline`, `merge_rollback`, `fan_out`, `reviewer_disagreement_arbitrated`. Baseline-readiness failures raise `BaselineNotReady` (exception, not event). Run CI with `FORGE_STRICT_OUTCOMES=1` so the outcome-lie diagnostic hard-asserts.
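Filtering the stream for honesty-invariant hits takes only a few lines. A hedged sketch assuming a JSON-lines event log (the log format and helper are illustrative; the real artifact layout may differ):

```python
import json

# The four honesty-invariant events from the README.
ALARMS = {"epic_outcome_lie_prevented", "completion_mismatch",
          "calibration_skipped_zero_actual", "subtask_result_contract_violation"}

def alarm_events(lines):
    """Yield parsed events whose type is one of the honesty-invariant alarms."""
    for line in lines:
        event = json.loads(line)
        if event.get("event") in ALARMS:
            yield event

log = ['{"event": "fan_out", "workers": 3}',
       '{"event": "completion_mismatch", "subtask": "HTTP-202a"}']
assert [e["event"] for e in alarm_events(log)] == ["completion_mismatch"]
```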
Tests. `python3 -m pytest platform/tests/ -p no:xdist -o "addopts=" -v` runs 8,400+ tests (as of 2026-05-06). Autouse `_isolate_forge_state` (platform/tests/conftest.py) redirects every `.forge/*.db` constructor and `_REPO_ROOT` to a per-test tmp dir. The default pytest invocation uses `-n auto --dist loadscope` from platform/pyproject.toml; the `-p no:xdist -o "addopts="` form above disables parallelism for deterministic output ordering.
Eval harness. 30 gold scenarios in platform/forge/eval/scenarios.py; nightly detect_regression + per-layer attribution; BaselineNotReady on placeholder baselines (v4.5.2). See docs/EVAL-HARNESS.md.
Canary. forge canary replays 3 deterministic epics (epic_fix_typo, epic_add_pure_helper, epic_should_shelve) against .forge/canary/canary_baseline.json.
```
forge/
├── run.sh, run-forever.sh        # launchers (nono sandbox, continuous daemon)
├── Dockerfile, Dockerfile.worker
├── platform/forge/               # Python package
│   ├── cli.py, engine/, runtime/, branching/, context/, memory/
│   ├── mcp/, trust/, eval/, meta/, dashboard/, approval/
│   └── quality/, observability/, integrations/, plugins/
├── platform/tests/               # 8,400+ tests; autouse _isolate_forge_state
├── studio/
│   ├── config.yaml, agents/      # 17 per-agent prompt YAMLs
│   └── workflows/                # orchestrator.py, worker_session.py, retrospective.py
│       └── nodes/                # extracted node modules with typed I/O contracts
├── examples/                     # 8 project profiles: go-game, go-roguelike, go-horror,
│                                 # go-roguefort, go-arcology, go-voidrift, go-ironhold, python-api
├── benchmarks/                   # per-model benchmark harness
├── docs/                         # release notes + deep-dive references
└── .forge/                       # runtime state: costs.db, metrics.db, trust.db,
                                  # canary/, eval/baselines/, specs/, evolution/
```
Full layout and subsystem deep-dives in docs/ARCHITECTURE.md and docs/PLATFORM.md.
Start here, then drill down as needed. Full index at docs/README.md.
| You want to… | Read |
|---|---|
| Install and run Forge | docs/QUICKSTART.md |
| Contribute code | CONTRIBUTING.md |
| Understand the architecture | docs/ARCHITECTURE.md, docs/PLATFORM.md |
| Pick / change models | docs/MODELS.md |
| Configure a project | docs/CONFIG-REFERENCE.md |
| Point Forge at a new codebase | examples/README.md |
| Reproduce the benchmark numbers | benchmarks/README.md, docs/POLYGLOT.md, docs/SWE-BENCH.md |
| See what changed per release | CHANGELOG.md |
| Read design decisions | docs/adr/ |
| Understand the honesty invariants | docs/EVAL-HARNESS.md, docs/V4.5.2.md |
| Version | Theme | Notes |
|---|---|---|
| v5.1.0 (current) | Quality intelligence, multi-language toolchains | 18 quality modules, Go/Rust/Java/C++ toolchains, plan critic enabled, wiring subtask injection, acceptance DSL, VerificationPipeline preflight |
| v5.0.0 | Quality hardening, node architecture, CI gates | Extracted node modules with typed I/O contracts, GitHub Actions CI, 8,400+ tests, 87% coverage |
| v4.5.2 | Outcome honesty, test isolation, release gates | docs/V4.5.2.md |
| v4.5.1.1 | Live-validation patch (HTTP-102 follow-ups) | docs/V4.5.1.1.md |
| v4.5.1 | Live-validation hardening over v4.5 | docs/V4.5.1.md |
| v4.5 | Truthful, budget-aware workers | docs/V4.5.md |
| v4.4.3 | Eval harness (pure infrastructure, no behavior change) | docs/V4.4.3.md |
| v4.4 | Truthful outcomes + robust Forge | docs/V4.4.md |
Older: v4.3.1, v4.3, v4.2, v4 — see docs/V4*.md. Stage notes for v4.5.2 layers B–E live under docs/V4.5.2-stage*.md.
Research

- TDAD (Test-Driven Agentic Development) — AST-based code–test dependency graphs; direct parent of Forge's impact maps.
- Self-consistency (Wang et al., arXiv 2203.11171) — Majority voting over multiple samples. Powers the decomposer / reviewer vote-and-select path with collapse short-circuit + temperature controller (platform/forge/engine/voting.py).
- Ralph Wiggum pattern (Geoffrey Huntley) — Fresh context + git-as-memory iteration over stacked error histories. Drives the `progress.json` retry rebuild.
- SWE-bench Pro (Princeton NLP) — Evaluation taxonomy for realistic, multi-file repository tasks. Forge ships a data layer for running against it (platform/forge/swebench/task.py).

Engineering lineage

- LangGraph — Entire orchestrator is `StateGraph` composition with `Send` fanout and checkpointing.
- Aider (Paul Gauthier) — SEARCH/REPLACE block grammar (platform/forge/tools/diff_applier.py), polyglot benchmark corpus (scripts/_polyglot_one.py), and the anti-pattern lesson that "fuzzy matchers accumulate edge cases" — so we don't have one.
- Model Context Protocol (Anthropic) — Every worker gets a per-worktree MCP server (platform/forge/mcp/) exposing `search_codebase`, `find_related_code`, `check_risk`, `recall_similar_tasks`.
- Circuit breaker (Nygard / Fowler) — Decompose/retry breaker shelves or accepts partial debt (platform/forge/engine/circuit_breakers.py). The cross-model failure-rate auto-swap variant was deleted in the 2026-05-02 cleanup along with `ModelRouter`; model selection is now operator-driven via `studio/models.yaml`.
- tree-sitter — Symbol extraction under platform/forge/context/; feeds the repo map, CodeGraph, and impact-map construction.
Apache License 2.0 — see LICENSE.
Built and maintained by Allen Sarkisyan — GitHub · LinkedIn.
