
Forge

An autonomous coding agent that tells you the truth.

Forge is a multi-agent engine that plans, writes, reviews, and merges code on your behalf. Unlike most agent frameworks, it refuses to mark work "done" unless the tests actually pass, the diff actually landed, and the worker's self-report matches reality. When an agent lies on the dashboard, CI turns red.

Built on LangGraph, with workers running OpenCode inside isolated git worktrees. Project-agnostic via studio/config.yaml. Current release: v5.1.0 — Quality Intelligence, Multi-Language Toolchains.

Built by Allen Sarkisyan (GitHub · LinkedIn).

[Dashboard screenshot: live session building Cryptfall (Go roguelike); subtasks 3/4, phase worker, cost $72.91, pipeline DAG showing Worker:CRYPT-025b active, heartbeat 17s ago, 8-phase progress bar, and an OUTCOME LIE PREVENTED banner for CRYPT-024.]

The live dashboard during a real session. "OUTCOME LIE PREVENTED" banner shows honesty invariants catching a worker claim that didn't match reality.

At a glance

  • Aider polyglot · Python (Claude Sonnet 4.5 via Bedrock, Mode A, 3-run variance; 2026-04-18): 34 / 34 · 100% · median $6.53/run · 86s/ex (details)
  • Aider polyglot · Rust (Claude Sonnet 4.5 via Bedrock, Mode A, single run, n=1; 2026-04-18): 30 / 30 · 100% · cost pre-telemetry-fix (not reliably captured) (details)
  • Aider polyglot · Go (Claude Sonnet 4.5 via Bedrock, Mode A, single run, n=1; 2026-04-19): 39 / 39 · 100% · $7.70 · 57 min · 88s/ex (details)
  • Test suite on main (collected 2026-05-06): 8,400+ tests (see Tests)
  • Honesty invariants active: epic_outcome_lie_prevented · completion_mismatch · calibration_skipped_zero_actual · subtask_result_contract_violation
  • Multi-language toolchains: Go · Rust · Java · C++ · Python · TypeScript (auto-detected from project files)
  • Quality intelligence: 18 deterministic modules (impl guard, coherence auditor, vocabulary checker, acceptance DSL, failure taxonomy, advisory tuner, and more)
  • License: Apache 2.0

Polyglot coverage (as of 2026-04-19): 103 / 225 exercises verified across 3 of 6 languages, all at 100%. JavaScript, C++, and Java runs are the remaining unknowns. Methodology, caveats, and a full comparison to Aider's own published numbers live in docs/POLYGLOT.md.


What Forge does

You give Forge a backlog (a set of epics or tasks). For each one, it:

  1. Plans — picks the next epic, drafts an architectural spec, and decomposes the work into 2–4 small, reviewable subtasks.
  2. Builds — fans out parallel workers into isolated git worktrees, each running a test-driven OpenCode session. Language-aware toolchains (auto-detected from go.mod, Cargo.toml, pom.xml, etc.) handle build, test, vet, and format.
  3. Reviews — the Real Implementation Guard rejects stubs/demos before the diff reaches the reviewer. Then: reviewer, security scan, red team adversarial edge cases, and integration check. The VerificationPipeline runs all lint/vet steps with preflight tool availability checks.
  4. Merges — applies approved subtasks in layer order, subtracting any pre-existing failures via a baseline so only new regressions trigger a rollback.
  5. Learns — writes a retrospective, updates calibration data, and feeds lessons back into the planner.

Every epic terminates on one of six validated outcome kinds (done, shelved, escalated, shelved_by_redteam, abandoned_pending, error) — never an ambiguous "finished". The full event stream is the audit trail; there is no other source of truth.
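The six-kind contract can be sketched as a small enum with defensive coercion. This is a minimal illustration, not the in-tree implementation; the `emit` callback stands in for Forge's structured-event logger:

```python
from enum import Enum

class OutcomeKind(str, Enum):
    DONE = "done"
    SHELVED = "shelved"
    ESCALATED = "escalated"
    SHELVED_BY_REDTEAM = "shelved_by_redteam"
    ABANDONED_PENDING = "abandoned_pending"
    ERROR = "error"

def coerce_outcome(raw: str, emit=lambda event: None) -> OutcomeKind:
    """Off-enum values down-coerce to ERROR and emit a diagnostic event,
    mirroring the documented epic_outcome_kind_invalid behavior."""
    try:
        return OutcomeKind(raw)
    except ValueError:
        emit({"event": "epic_outcome_kind_invalid", "raw": raw})
        return OutcomeKind.ERROR
```

Anything that is not one of the six strings lands on `ERROR` with a diagnostic, never on an invented seventh state.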

The thing most frameworks don't do

An LLM that says "I fixed it" is not evidence. Forge treats worker self-reports as claims to be verified:

  • .forge/completion.json (the worker's claim) is cross-checked against actual test output. Mismatches emit completion_mismatch and demote the worker via the Trust Ratchet.
  • A run that reports done but fanned out zero workers hits an invariant in emit_epic_outcome_event and either emits epic_outcome_lie_prevented or raises AssertionError (under FORGE_STRICT_OUTCOMES=1, which CI sets).
  • Retrospectives citing models that aren't in metrics.db, or dated off-year, are quarantined to .forge/evolution/<ts>.invalid.md instead of the canonical feed.
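The cross-check idea reduces to comparing the claim against observed reality. A minimal sketch, assuming a `tests_passed` field in the completion claim (the actual completion.json schema is not shown here):

```python
def check_completion_claim(claim: dict, tests_passed: bool, emit) -> bool:
    """Treat the worker's self-report as a claim to be verified.
    Returns False (and emits completion_mismatch) when the claim says the
    tests passed but the harness observed otherwise."""
    if bool(claim.get("tests_passed")) and not tests_passed:
        emit({"event": "completion_mismatch", "claim": claim})
        return False
    return True
```

A worker that over-claims is caught at this gate; the Trust Ratchet then handles demotion.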

What Forge isn't

  • Not an IDE. For interactive coding, use Cursor or Claude Code.
  • Not a one-shot bug fixer. Even a 10-line fix traverses the full plan → build → review → merge loop. For targeted work, forge task <repo> -d "..." exists; for issue-only or PR-only flows the scaffolding is present but not in the hot path.
  • Not deterministic. Costs, timings, and subtask counts vary run-to-run. The gates are deterministic; the work isn't.
  • Not single-model. Every role (planner, worker, reviewer, adversarial) can be a different model. Models are pinned in studio/models.yaml — one file, no fallbacks, no auto-swapping. See docs/MODELS.md.
  • Not a replacement for human review on novel architecture. The AUTONOMOUS trust tier auto-merges small, well-understood changes; anything surprise-magnitude or cross-package still routes to the approval queue.

Quickstart

Prerequisites.

  • Python 3.12+
  • Docker Desktop (required since R9.E); run make worker-image once to build forge-worker:latest
  • OpenCode on PATH (only needed for ad-hoc tooling; Forge installs it inside the worker image)
  • One provider credential (Bedrock API key, AWS credentials, or OPENROUTER_API_KEY)
  • Go 1.24+ (only for the bundled game example) and nono (only when launching via ./run.sh)

Install.

git clone https://github.com/sjqtentacles/forge && cd forge
python3 -m venv .venv && source .venv/bin/activate
pip install -e 'platform[dev]'
cp .env.example .env   # then uncomment the credentials you want to use

Run (canonical).

PYTHONPATH=platform .venv/bin/python3 -m forge.cli run --dashboard
PYTHONPATH=platform .venv/bin/python3 -m forge.cli run --once --max-tasks 3

Launcher scripts (not interchangeable):

  • ./run.sh — Bedrock via AWS profile (uses $AWS_PROFILE, defaults to default; set AWS_PROFILE in your shell or .env to use a different profile), wraps the CLI in nono.
  • ./run-forever.sh — OpenRouter (OPENROUTER_API_KEY), continuous daemon, no sandbox.

Dashboard at http://127.0.0.1:8420. Models are pinned in studio/models.yaml — one file, no fallbacks; see docs/MODELS.md for the policy. The rest of the configuration (project shape, features, gates, retention) lives in studio/config.yaml; full schema in docs/CONFIG-REFERENCE.md. See docs/QUICKSTART.md for a longer walkthrough, troubleshooting table, and test commands.


How it works

Architecture

Two LangGraph state machines compose the engine.

The outer graph schedules epics and retrospects:

graph LR
    startNode[Start] --> BP[Backlog Planner]
    BP -->|"Send per epic"| EP[Epic Pipeline]
    BP -->|"queue empty"| MM[Mega Merge]
    EP --> MM
    MM --> Retro[Retrospective]
    Retro --> endNode[End]

The inner graph runs a single epic — Strategist picks it, Architect designs, Decomposer splits into tiny subtasks, workers fan out in isolated worktrees, gated by Reviewer / Security / Red Team / Merge, with fresh-context recovery and a post-merge playtest:

graph LR
    S[Strategist] --> CS[Context Scout]
    CS -->|"first pass"| AR[Architecture Review]
    CS -->|"retry pass"| D[Decomposer]
    CS -->|"fan_out phase"| FO[Fan Out]
    AR --> D
    D --> V[Validator]
    V --> CS
    FO --> W[Workers]
    W --> F[Fixer]
    F --> R[Reviewer]
    R --> SR[Security Review]
    SR --> IC[Integration Check]
    IC --> RT[Red Team]
    RT --> M[Merge]
    M -->|"success"| VA[Verify Acceptance]
    VA --> P[Playtester]
    P --> PL[Persist Learnings]
    PL --> endNode[End]

    RT -->|"edge case found"| F
    M -->|"restyle"| RA[Recovery Advisor]
    RA --> CS
    M -->|"fallback"| Ref[Refactor]
    Ref --> P

Forge-on-Forge mode (running Forge against its own monorepo) is configured via platform/forge/engine/forge_self.py — see docs/PROFILES.md for the documented profile and studio/profiles/forge_on_forge.yaml for the wired entry point. The R5 issue-resolution / PR-mode / SWE-bench-stub scaffolding modules were deleted in R6 (see docs/V4.6.md) — the canonical SWE-bench harness lives at forge.swebench.task.

Anatomy of one epic

Concrete example: what actually happens when Forge runs HTTP-202: Implement CSV splitter.

| Stage | What happens | Artifact | Key events |
|---|---|---|---|
| Strategist picks | Selects epic from taskboard + design doc + learnings | Spec draft in state | node_enter {node: strategist} |
| Architect designs | Interfaces, file layout, acceptance criteria | .forge/specs/HTTP-202.md | node_exit {node: strategist} |
| Decomposer splits | 2–4 subtasks within quality gates | state.subtasks_live + DAG | decomposer_vote, decomposer_lesson_injected |
| Validator checks | Type conflicts, bad imports (retry-bounded) | | node_enter {node: validator} |
| Context Scout | Impact maps per subtask via CodeGraph | Affected-test lists | |
| Fan out | Workers dispatched in parallel (layer order) | .worktrees/HTTP-202a/ etc. | fan_out {workers: N} |
| Workers run | OpenCode TDD loop: write test, fail, implement, pass | Worktree diff + completion.json | completion_mismatch? |
| Fixer (if needed) | Lint/build errors fed back | Updated diff | node_enter {node: fixer} |
| Reviewer + preflight | Oversize diffs short-circuit to redecompose | Review verdict | diff_preflight_reject? |
| Security + Integration | Vulnerability scan + cross-subtask check | | |
| Red Team | Adversarial edge-case tests against approved diffs | New failing test? | red_team_merge_with_debt? |
| Merge (baseline-aware) | Layer-ordered; subtracts pre-existing failures | Updated working tree | merge_baseline, merge_rollback? |
| Playtest + Verify | Headless smoke + acceptance check | docs/PLAYTEST-LOG.md (generated at runtime; not committed) | |
| Persist Learnings | Outcome inference via infer_outcome_kind | metrics.db, calibration.jsonl | epic_outcome {kind: done} (validated) |

Conditional branches not shown: epic_coherence (post-Strategist sanity check), plan_critic (reviews the decomposer's plan), sketch_gate (pre-worker draft check), adversarial_review (reviewer triad + arbiter tiebreaker), red_team_repair, recovery_advisor.


Why it's different

How it compares

The agent-framework space is crowded and moving weekly; rather than claim features about tools I can't fully verify, here is what Forge specifically does, with code pointers, plus the one benchmark where a direct comparison is fair:

  • Honesty invariants. "Refuse to mark work done unless tests actually pass" is enforced by a completion.json covenant cross-checked against real test output. Mismatches emit completion_mismatch and demote the worker via the Trust Ratchet. Outcome lies (e.g. reporting done with zero workers fanned out) hit an invariant in emit_epic_outcome_event (platform/forge/logging.py) and either emit epic_outcome_lie_prevented or raise AssertionError under FORGE_STRICT_OUTCOMES=1.
  • Per-subtask worktree isolation. Every subtask runs in its own .worktrees/<epic>/ directory; every test gets its own .forge/ tmp dir via autouse _isolate_forge_state; every worker gets its own MCP server. Cross-contamination is a bug, not a trade-off.
  • Baseline-aware merge gate. evaluate_merge_gate snapshots pre-existing failures at pre_merge_head and subtracts them; only new regressions roll back a branch. Placeholder or null baselines raise BaselineNotReady rather than silently passing.
  • Validated outcome enum. Epics terminate on one of six kinds (done, shelved, escalated, shelved_by_redteam, abandoned_pending, error), with a paired epic_outcome_cause. Off-enum values down-coerce to error and emit epic_outcome_kind_invalid.

On the one benchmark where numbers are directly comparable — the Aider polyglot benchmark on Claude Sonnet 4.5:

| Harness | Python (34 ex.) | Rust (30 ex.) | Go (39 ex.) | All 6 langs (225 ex.) |
|---|---|---|---|---|
| Aider's own (published) | not published per-language | not published per-language | not published per-language | 77.9% |
| Forge worker (Mode A) | 100% (34/34) · 3-run variance · median $6.53 | 100% (30/30) · single run, n=1 | 100% (39/39) · single run, n=1 · $7.70 | 103/225 measured · 122 remaining (JS · C++ · Java) |

Mode A bypasses Forge's planner/reviewer stack to benchmark just the worker loop; full-stack numbers will differ. See docs/POLYGLOT.md for the methodology, caveats about training-data contamination, and why Mode A was the honest starting point.

Design principles

  1. Git is the memory, not the prompt. Retries rebuild the prompt from scratch using progress.json + the current diff. No stacked error histories, no prompt degradation. This is the Ralph Wiggum pattern.
  2. Baselines make failures real. evaluate_merge_gate snapshots pre-existing failures at pre_merge_head and subtracts them; only new regressions roll back a branch.
  3. Invariants assert; diagnostics log. Honest violations raise AssertionError under FORGE_STRICT_OUTCOMES=1; in production they emit structured events (epic_outcome_lie_prevented, completion_mismatch, calibration_skipped_zero_actual, subtask_result_contract_violation).
  4. Workers can't lie twice. .forge/completion.json is cross-checked against actual test output. Repeat liars get demoted by the Trust Ratchet.
  5. Every outcome is on-enum. Epics terminate on one of six validated kinds — done, shelved, escalated, shelved_by_redteam, abandoned_pending, error — with a paired epic_outcome_cause. Off-enum values down-coerce to error and emit epic_outcome_kind_invalid.
  6. Isolated by construction. Each subtask gets its own worktree, each test gets its own .forge/ tmp dir (autouse _isolate_forge_state), each worker gets its own MCP server. Cross-contamination is a bug, not a trade-off.
  7. Small epics beat clever epics. Hard gate at 200 lines / 4 files per subtask, tightens 30% per retry with a 40-line floor. If a diff blows the budget, diff_preflight rejects it before the reviewer ever sees it.
  8. The retrospective is held to the same standard. Reports citing models not in metrics.db or dated off-year land in .forge/evolution/<ts>.invalid.md sidecars — not the canonical feed.
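Principle 7's budget schedule is simple arithmetic. A sketch using the documented defaults (200-line base, 30% tightening per retry, 40-line floor); the exact rounding behavior is an assumption:

```python
def line_budget(retry: int, base: int = 200, tighten: float = 0.30, floor: int = 40) -> int:
    """Per-subtask diff budget: shrinks 30% per retry, never below the floor.
    Constants mirror the documented defaults; the rounding choice is assumed."""
    return max(floor, round(base * (1 - tighten) ** retry))
```

So retries see budgets of 200, 140, 98, ... and bottom out at 40 lines; a diff over budget is rejected by diff_preflight before review.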

Failure modes and how they're handled

| Failure | Detection | Response |
|---|---|---|
| Worker claims pass, tests fail | Completion contract cross-check | completion_mismatch + trust demotion |
| "done" with zero workers fanned out | emit_epic_outcome_event invariant | epic_outcome_lie_prevented (or AssertionError under FORGE_STRICT_OUTCOMES=1) |
| Diff exceeds the line budget | diff_preflight.evaluate pre-review | Reject, tighten decomposer budget 30%, retry |
| Retry loop stuck on oversize | merge_node ratio check | Escalate with cause redecompose.oversize_stuck |
| _lines_changed can't be measured | worker_diff.compute_lines_changed returns None | Skip calibration row, emit calibration_skipped_measurement_failure |
| Calibration row has estimate > 0 AND actual == 0 on a pass | Defensive check in record_if_eligible | Drop row, emit calibration_skipped_zero_actual |
| Baseline is placeholder or null-aggregate | detect_regression guard | Raise BaselineNotReady; callers refresh baseline |
| Retrospective cites a model not in metrics.db | retro_validator check | Quarantine to .forge/evolution/<ts>.invalid.md |
| Primary model failure rate spikes | Operator inspects dashboard cost panel + metrics.db | Edit studio/models.yaml, restart. (Auto-swap was deleted in the 2026-05-02 cleanup; see docs/MODELS.md.) |
| Tests poison the real .forge/*.db | CostTracker tripwire under PYTEST_CURRENT_TEST | Autouse _isolate_forge_state redirects every constructor |
| Epic times out | epic_pipeline timeout branch | Outcome error with cause epic_pipeline.timeout (not off-enum timeout) |
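The baseline subtraction and BaselineNotReady behavior from the table can be sketched with sets. In this toy version `None` stands in for the placeholder/null-aggregate case; the real evaluate_merge_gate works on test-run snapshots:

```python
class BaselineNotReady(Exception):
    """Raised when the pre-merge failure baseline is missing or a placeholder."""

def detect_regressions(baseline: "set[str] | None", current_failures: "set[str]") -> "set[str]":
    """Subtract pre-existing failures; only NEW failures count as regressions.
    A missing baseline raises rather than silently passing."""
    if baseline is None:
        raise BaselineNotReady("refresh the baseline before evaluating the merge gate")
    return current_failures - baseline
```

Only a non-empty result triggers rollback; failures already present at pre_merge_head never do.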

Quality intelligence

Forge ships 18 deterministic (no-LLM) quality modules that compose into the pipeline at different stages. All follow TDD — the module tests define the behavioral contract.

Planning quality (pre-worker):

| Module | What it does |
|---|---|
| Plan Critic + Cross-Package Coherence | 7-step reasoning guide catches hidden deps, sizing optimism, and shared-concept drift between subtasks targeting different packages |
| Integration Wiring Subtask Sanitizer | Auto-injects a final-layer subtask to wire all packages into the entry point; decomposer prompt rule forces explicit emission |
| Import Graph Injection | Lightweight package-level dependency graph injected into decomposer prompts (budget-capped) |
| Contract Registry | Auto-extracts exported functions/types/constants per package for decomposer context |
| Integration Test Requirement | Auto-appends cross-package test requirement to subtask specs with dependencies |

Worker quality (during build):

| Module | What it does |
|---|---|
| Real Implementation Guard | Static checks for stub patterns: Go (fmt.Println-only main, panic("not implemented")), Python (pass-only, raise NotImplementedError), TypeScript (console.log-only, throw not implemented) |
| Verification Pipeline + Preflight | Configurable multi-step lint/type-check/format pipeline; preflight_check warns about missing binaries on PATH at startup |
| Incremental Testing | Selects only tests affected by changed files via CodeGraph |
| Enhanced Worker Context | Injects neighbor package exports (budget-capped) so workers know available symbols |
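A toy version of the Python-side stub checks, to make the idea concrete. These two regexes cover only the documented pass-only and NotImplementedError shapes; the in-tree guard is richer and language-aware:

```python
import re

# Toy patterns for two of the documented Python stub shapes.
_STUB_PATTERNS = [
    re.compile(r"raise\s+NotImplementedError"),
    re.compile(r"def\s+\w+\([^)]*\):\s*\n\s*pass\s*$", re.MULTILINE),
]

def looks_like_stub(source: str) -> bool:
    """Flag sources whose function bodies are placeholders, not implementations."""
    return any(p.search(source) for p in _STUB_PATTERNS)
```

A diff that trips the guard never reaches the reviewer; it goes straight back to the worker.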

Post-build quality (pre-merge):

| Module | What it does |
|---|---|
| Acceptance DSL + Runner | YAML-based scenarios with deterministic assertions (exit_code, stdout_contains, file_exists, runtime_under_s, etc.) |
| Coherence Auditor | Extracts Go/Python constants and switch-case values; flags when two packages diverge on a shared concept |
| Vocabulary Consistency Checker | Groups constants by prefix; cross-references switch variables by stem overlap |

Feedback and adaptation:

| Module | What it does |
|---|---|
| Failure Taxonomy | Classifies every rejection into structured categories stored in metrics.db |
| Advisory Tuner | Suggests config changes from failure patterns (never auto-applies) |
| Quality Profiles | Determines lightweight/standard/strict gate set based on project maturity |
| Scenario Generation | Auto-generates ACCEPTANCE.yaml skeletons from epic acceptance criteria |
| Cached Decomposition | Structural hashing; skips redundant validation on retry |
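The structural-hashing idea behind Cached Decomposition can be sketched as: hash only the shape of a decomposition, so a retry that changes prose but not structure reuses the cached validation. The `id`/`files`/`deps` field names here are assumptions for illustration:

```python
import hashlib
import json

def structural_hash(subtasks: "list[dict]") -> str:
    """Hash the structure of a decomposition (ids, files, deps), ignoring
    prose fields like descriptions, so identical shapes collide."""
    shape = [
        {"id": s.get("id"),
         "files": sorted(s.get("files", [])),
         "deps": sorted(s.get("deps", []))}
        for s in subtasks
    ]
    return hashlib.sha256(json.dumps(shape, sort_keys=True).encode()).hexdigest()
```

Two plans that differ only in wording hash identically and skip re-validation; touching a file list changes the hash.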

See docs/QUALITY-INTELLIGENCE.md for the full reference.

Multi-language toolchains

Forge auto-detects the project language from marker files and selects a best-in-class toolchain:

| Language | Detected by | Build | Test | Lint | Format |
|---|---|---|---|---|---|
| Go | go.mod | go build ./... | go test ./... | golangci-lint run ./... | gofmt |
| Rust | Cargo.toml | cargo build | cargo test | cargo clippy -- -D warnings | cargo fmt |
| Java | pom.xml / build.gradle | mvn compile / gradle build | mvn test / gradle test | checkstyle | spotless |
| C++ | CMakeLists.txt | cmake --build | ctest | clang-tidy | clang-format |
| Python | pyproject.toml | pip install -e . | pytest | ruff check | ruff format |
| TypeScript | package.json + tsconfig.json | npm run build | npm test | biome check | biome format |

The ToolchainBridge adapts these Protocol-based implementations to the worker's interface. Override any command via project.commands in studio/config.yaml.


Worker runtime

Since R9.E, Forge runs every OpenCode worker inside a container. The factory (forge.runtime.create_runtime) only accepts worker_runtime: docker; legacy values (local, claude_code, auto) raise UnsupportedRuntimeError at startup so stale configs fail fast.

| Runtime | How it runs | Notes |
|---|---|---|
| docker (only supported) | OpenCode in a container built from Dockerfile.worker; docker cp syncs the worktree in/out, aborts on copy failure | Build once with make worker-image. Cost/usage parsing flows through the shared usage.py parser and --format json + .forge/completion.json, just like before. |

Configuration

Two files. Per project.

| File | Owns |
|---|---|
| studio/models.yaml | The only place model IDs and per-model token rates live. One required block of four aliases (primary, secondary, reasoning, embedding) plus optional rates. No fallbacks, no model tiers, no auto-swap. Schema and policy: docs/MODELS.md. |
| studio/config.yaml | Everything else: project shape, features, quality gates, retention, etc. Schema: docs/CONFIG-REFERENCE.md. |

studio/config.yaml sections worth knowing about:

| Section | Controls |
|---|---|
| project | Language, paths, commands, docs, sandbox allowlists |
| worker_runtime / worker_image | Container backend + image |
| providers | LLM endpoints (base URL, region, API-key env var). Provider for any given model is derived from the model ID prefix in forge.providers.routing; no per-agent provider: field. |
| agents | Per-agent model (alias from models.yaml), temperature, max_tokens, json_mode |
| features | best_of_n, adaptive_search, mcp_tools, record_llm, record_patterns, record_training, evolution_proposals, chromadb_indexing |
| quality_gates | Max lines / files per task (default 200 / 4); best-of-N count + threshold |
| context_budgets | Per-section token budgets |
| memory | Embedding provider, vector store, structured store paths |
| rate_limits, concurrency, timeouts | max_epics: 1, max_workers_per_epic: 4, Worker 900s, Fixer 300s |
| reviewer, security, voting, promotion, orphan_branch, decomposer | Per-subsystem policy tuning: rubber-stamp alarm, security-mismatch retro threshold, self-consistency collapse short-circuit + temperature controller, canary promote-gate agreement, orphan-branch retry threshold, decomposer compaction + prompt ceiling |
| retention, trust, approval | Retention caps, promote threshold, auto/manual approval |

Agent roster

Grouped by phase; every agent has a YAML in studio/agents/ and a row in config.yaml:

| Phase | Agents |
|---|---|
| Plan | Strategist (picks epic), Architect (designs interfaces), Decomposer (2–4 subtasks within gates), Plan Critic (reviews the decomposition), Validator (structural feasibility), Context Scout (impact maps, no LLM) |
| Build | Sketcher (pre-worker draft gate), Worker / Green (OpenCode TDD in worktree with completion.json covenant) |
| Review | Red (adversarial, reasoning tier), Fixer, Reviewer (reasoning tier, soft quality gates), Arbiter (reviewer-disagreement tiebreaker), Security Reviewer, Integration Architect |
| Merge + cleanup | Merge (baseline-aware), Refactor (post-merge dedup), Playtester (headless smoke) |
| Learn | Retrospective (writes validated proposals to .forge/evolution/) |

17 prompt YAMLs under studio/agents/. director.yaml is legacy (superseded by Strategist); critic.yaml exists but isn't in the active agent config.

CLI

python -m forge.cli <subcommand> — sourced from platform/forge/cli.py:

| Subcommand | Purpose |
|---|---|
| init | Scaffold a new project (studio/, .forge/, starter config.yaml) |
| run | Launch the engine (--once, --dashboard, --max-tasks, --max-epic-retries, --dashboard-linger, --profile) |
| status | Read .forge/checkpoint.json (--costs, --diagnostics, --deep) |
| sleeptime | Run a meta-evolution cycle |
| task <repo> | Clone a repo and run Forge against a free-form task |
| config validate | YAML parse + structural check |
| canary / canary promote-gate | Replay frozen canary epics; sign-off-gated promotion of features.gate_node_primary |
| doctor | Preflight: config, dirs, PATH, MCP boot, DB schemas |
| costs | Breakdown from .forge/costs.db |
| diagnostics | Failure diagnostics from .forge/metrics.db |
| purge-index | Evict stale worktree chunks from the code embedding collection |
| backlog unshelve <epic> | Clear cooldown + demotion so the planner can re-select |
| preflight | Dry-run the subtask acceptance gate without a worker |
| memory migrate | Patterns library → project-scoped git-history retrieval |
| requeue <epic> | Append to .forge/priority_queue.json |
| priority list / priority clear | Inspect / clear the priority queue |
| profiles list | Enumerate known ProjectProfile descriptors |
Trace-replay: canary --replay --against <ref> --shadow --threshold --max-usd.

Observability and testing

Dashboard at http://127.0.0.1:8420 — D3 + dagre workflow DAG, live per-worker Docker status, per-model cost, activity-span hourly rate, subtask kanban, pause/resume, manual approval queue, /healthz + /readyz.

Structured events worth grepping for (all verified in-tree): epic_outcome_lie_prevented, epic_outcome_kind_invalid, lines_changed_measurement_failed, calibration_skipped_zero_actual, calibration_skipped_measurement_failure, completion_mismatch, subtask_result_contract_violation, trust_demotion, diff_preflight_reject, redecompose_budget_consumed, merge_baseline, merge_rollback, fan_out, reviewer_disagreement_arbitrated. Baseline-readiness failures raise BaselineNotReady (exception, not event). Run CI with FORGE_STRICT_OUTCOMES=1 so the outcome-lie diagnostic hard-asserts.

Tests. python3 -m pytest platform/tests/ -p no:xdist -o "addopts=" -v runs 8,400+ tests (as of 2026-05-06). Autouse _isolate_forge_state (platform/tests/conftest.py) redirects every .forge/*.db constructor and _REPO_ROOT to a per-test tmp dir. The default pytest invocation uses -n auto --dist loadscope from platform/pyproject.toml; the -p no:xdist -o "addopts=" form above disables parallelism for deterministic output ordering.

Eval harness. 30 gold scenarios in platform/forge/eval/scenarios.py; nightly detect_regression + per-layer attribution; BaselineNotReady on placeholder baselines (v4.5.2). See docs/EVAL-HARNESS.md.

Canary. forge canary replays 3 deterministic epics (epic_fix_typo, epic_add_pure_helper, epic_should_shelve) against .forge/canary/canary_baseline.json.

Project layout

forge/
├── run.sh, run-forever.sh     # launchers (nono sandbox, continuous daemon)
├── Dockerfile, Dockerfile.worker
├── platform/forge/            # Python package
│   ├── cli.py, engine/, runtime/, branching/, context/, memory/
│   ├── mcp/, trust/, eval/, meta/, dashboard/, approval/
│   └── quality/, observability/, integrations/, plugins/
├── platform/tests/            # 8,400+ tests; autouse _isolate_forge_state
├── studio/
│   ├── config.yaml, agents/   # 17 per-agent prompt YAMLs
│   └── workflows/             # orchestrator.py, worker_session.py, retrospective.py
│       └── nodes/             # extracted node modules with typed I/O contracts
├── examples/                  # 8 project profiles: go-game, go-roguelike, go-horror,
│                              # go-roguefort, go-arcology, go-voidrift, go-ironhold, python-api
├── benchmarks/                # per-model benchmark harness
├── docs/                      # release notes + deep-dive references
└── .forge/                    # runtime state: costs.db, metrics.db, trust.db,
                               # canary/, eval/baselines/, specs/, evolution/

Full layout and subsystem deep-dives in docs/ARCHITECTURE.md and docs/PLATFORM.md.


Documentation map

Start here, then drill down as needed. Full index at docs/README.md.

| You want to… | Read |
|---|---|
| Install and run Forge | docs/QUICKSTART.md |
| Contribute code | CONTRIBUTING.md |
| Understand the architecture | docs/ARCHITECTURE.md, docs/PLATFORM.md |
| Pick / change models | docs/MODELS.md |
| Configure a project | docs/CONFIG-REFERENCE.md |
| Point Forge at a new codebase | examples/README.md |
| Reproduce the benchmark numbers | benchmarks/README.md, docs/POLYGLOT.md, docs/SWE-BENCH.md |
| See what changed per release | CHANGELOG.md |
| Read design decisions | docs/adr/ |
| Understand the honesty invariants | docs/EVAL-HARNESS.md, docs/V4.5.2.md |

Releases

| Version | Theme | Notes |
|---|---|---|
| v5.1.0 (current) | Quality intelligence, multi-language toolchains | 18 quality modules, Go/Rust/Java/C++ toolchains, plan critic enabled, wiring subtask injection, acceptance DSL, VerificationPipeline preflight |
| v5.0.0 | Quality hardening, node architecture, CI gates | Extracted node modules with typed I/O contracts, GitHub Actions CI, 8,400+ tests, 87% coverage |
| v4.5.2 | Outcome honesty, test isolation, release gates | docs/V4.5.2.md |
| v4.5.1.1 | Live-validation patch (HTTP-102 follow-ups) | docs/V4.5.1.1.md |
| v4.5.1 | Live-validation hardening over v4.5 | docs/V4.5.1.md |
| v4.5 | Truthful, budget-aware workers | docs/V4.5.md |
| v4.4.3 | Eval harness (pure infrastructure, no behavior change) | docs/V4.4.3.md |
| v4.4 | Truthful outcomes + robust Forge | docs/V4.4.md |

Older: v4.3.1, v4.3, v4.2, v4 — see docs/V4*.md. Stage notes for v4.5.2 layers B–E live under docs/V4.5.2-stage*.md.

Design influences

Research

  • TDAD (Test-Driven Agentic Development) — AST-based code–test dependency graphs; direct parent of Forge's impact maps.
  • Self-consistency (Wang et al., arXiv:2203.11171) — Majority voting over multiple samples. Powers the decomposer / reviewer vote-and-select path with collapse short-circuit + temperature controller (platform/forge/engine/voting.py).
  • Ralph Wiggum pattern (Geoffrey Huntley) — Fresh context + git-as-memory iteration over stacked error histories. Drives the progress.json retry rebuild.
  • SWE-bench Pro (Princeton NLP) — Evaluation taxonomy for realistic, multi-file repository tasks. Forge ships a data layer for running against it (platform/forge/swebench/task.py).

Engineering lineage

  • LangGraph — Entire orchestrator is StateGraph composition with Send fanout and checkpointing.
  • Aider (Paul Gauthier) — SEARCH/REPLACE block grammar (platform/forge/tools/diff_applier.py), polyglot benchmark corpus (scripts/_polyglot_one.py), and the anti-pattern lesson that "fuzzy matchers accumulate edge cases" — so we don't have one.
  • Model Context Protocol (Anthropic) — Every worker gets a per-worktree MCP server (platform/forge/mcp/) exposing search_codebase, find_related_code, check_risk, recall_similar_tasks.
  • Circuit breaker (Nygard / Fowler) — Decompose/retry breaker shelves or accepts partial debt (platform/forge/engine/circuit_breakers.py). The cross-model failure-rate auto-swap variant was deleted in the 2026-05-02 cleanup along with ModelRouter; model selection is now operator-driven via studio/models.yaml.
  • tree-sitter — Symbol extraction under platform/forge/context/; feeds the repo map, CodeGraph, and impact-map construction.

License

Apache License 2.0 — see LICENSE.

Maintainer

Built and maintained by Allen Sarkisyan (GitHub · LinkedIn).
