diff --git a/docs/system_design.md b/docs/system_design.md
index 16010d4..6f1dad3 100644
--- a/docs/system_design.md
+++ b/docs/system_design.md
@@ -18,7 +18,7 @@ This document is the single canonical source of truth for how the MultiNet v2.0
 1. [Overview & north stars](#1-overview--north-stars)
 2. [Pipeline DAG: stages, artifacts, invalidation](#2-pipeline-dag-stages-artifacts-invalidation)
 3. [Task spec contract](#3-task-spec-contract)
-4. [Static scoring (13 dimensions)](#4-static-scoring-13-dimensions)
+4. [Static scoring (12 dimensions plus canonical-agent features)](#4-static-scoring-12-dimensions-plus-canonical-agent-features)
 5. [Runtime scoring](#5-runtime-scoring)
 6. [Backend & inference adapter contracts](#6-backend--inference-adapter-contracts)
 7. [Reporting & aggregate](#7-reporting--aggregate)
@@ -74,18 +74,18 @@ The pipeline is a five-stage DAG. Each stage has declared inputs and outputs and
 2. **Solve & Score-static**
    - Inputs: `task.json`.
    - Outputs:
-     - `canonical_paths.json` `{ bfs: { path, steps, states_explored }, greedy: { success, path, steps }, … }`
-     - `scored.json` `{ is_beatable, dimensions[13], fragility, mechanism_necessity_violations, distractor_safety_violations, message }`
+     - `canonical_paths.json` `{ bfs: { actions, positions, optimal_steps, states_explored }, greedy: { success, actions, positions, steps }, … }`
+     - `scored_static.json` `{ is_beatable, dimensions_12, canonical_agent_features, validation, message }`
    - Hash key: `hash(solver_v, scorer_v, task.json, agent_set_v)`.
-   - If `scored.json.is_beatable == false`, downstream stages skip the task; it is logged as ineligible and surfaced in reports.
+   - If `scored_static.json.is_beatable == false`, downstream stages skip the task; it is logged as ineligible and surfaced in reports.
 
 3. **Render-and-Run**
-   - Inputs: `task.json`, `scored.json` (gate on `is_beatable`), backend choice, adapter choice, `model_id`, `seed`.
+   - Inputs: `task.json`, `scored_static.json` (gate on `is_beatable`), backend choice, adapter choice, `model_id`, `seed`.
    - Outputs: `run.json` `{ trajectory, actions, tokens, terminated, success }`.
    - Hash key: `hash(backend_v, adapter_v, model_id, task.json, seed)`.
 
 4. **Score-runtime**
-   - Inputs: `run.json`, `scored.json`, `canonical_paths.json`.
+   - Inputs: `run.json`, `scored_static.json`, `canonical_paths.json`.
    - Outputs: `run_score.json` `{ success, step_ratio, cell_overlap_*, distractor_interactions, irreversible_failures, tokens, composite }`.
    - Hash key: `hash(runtime_scorer_v, inputs)`.
 
@@ -106,7 +106,7 @@ artifacts/
 ├── tasks/<task_id>/
 │   ├── task.json                # Stage 1
 │   ├── canonical_paths.json     # Stage 2 (a)
-│   └── scored.json              # Stage 2 (b) — includes is_beatable
+│   └── scored_static.json       # Stage 2 (b) — includes is_beatable
 ├── runs/<task_id>/<backend>/<adapter>/<model_id>/<seed>/
 │   ├── run.json                 # Stage 3
 │   └── run_score.json           # Stage 4
@@ -218,9 +218,9 @@ Enforced by `TaskSpecification.validate()`:
 
 ---
 
-## 4. Static scoring (13 dimensions)
+## 4. Static scoring (12 dimensions plus canonical-agent features)
 
-Static scoring runs once per task at pipeline stage 2 (Solve & Score-static). It produces `scored.json`, which carries `is_beatable` plus a 13-dimension vector and supporting validation reports. The static scorer consumes `task.json` and `canonical_paths.json`.
+Static scoring runs once per task at pipeline stage 2 (Solve & Score-static). It produces `scored_static.json`, which carries `is_beatable`, a 12-dimension vector, canonical-agent features, and supporting validation reports. The scorer consumes `task.json` and emits this artifact alongside `canonical_paths.json`.
 
 ### 4.1 Dimensions
 
@@ -238,7 +238,7 @@ All raw values are floats (or counts cast to float). Higher = harder *unless* ex
 10. **`wall_density`** — Source: spec. Computation: `len(walls) / grid_size`. Crude (does not separate interior vs functional walls); **calibration target**.
 11. **`partial_observability`** — Source: spec rules. Computation: ordinal `{full: 0, view_cone: 1, fog_of_war: 2}` from `rules.observability`.
 12. **`irreversibility`** — Source: spec rules + mechanisms. Computation: `key_consumption × #doors + #one_shot_switches + #non_bidirectional_teleporters`.
-13. **`greedy_solvability`** — Source: Greedy canonical agent. Computation: `1.0 if greedy succeeds else 0.0`. **Penalty** (greedy-solvable tasks lower the runtime composite, on the rationale that they are less a test of spatial reasoning).
+`greedy_solvability` is recorded separately under `canonical_agent_features`, rather than appended to the calibrated 12-dimension vector. Source: Greedy canonical agent. Computation: `1.0 if greedy succeeds else 0.0`. **Penalty** (greedy-solvable tasks lower the runtime composite, on the rationale that they are less a test of spatial reasoning).
 
 ### 4.2 Static composite (difficulty score)
 
@@ -246,13 +246,13 @@ All raw values are floats (or counts cast to float). Higher = harder *unless* ex
 static_composite = Σ_i (raw_dim_i × calibration.weights[dim_name_i])
 ```
 
-- `calibration.weights` lives in `calibration.yaml`; defaults to `1.0` for all dimensions until empirical tuning.
+- Calibration weights live in `scorer/scorer_config.json` by default; optional JSON or YAML overrides may be passed explicitly. Weights default to `1.0` for all dimensions until empirical tuning.
 - `static_composite` is used for task ranking and live-benchmark filtering (e.g., reject tasks whose composite falls outside a tier's target range).
 - It is *not* used directly in runtime scoring; runtime uses individual dimensions plus a derived "difficulty weight" (Section 5).
 
-### 4.3 Validation reports (also in `scored.json`)
+### 4.3 Validation reports (also in `scored_static.json`)
 
-Beyond the dimension vector, `scored.json` carries the validator's structural reports:
+Beyond the dimension vector, `scored_static.json` carries the validator's structural reports:
 
 - `is_beatable` (bool) and `message` (str) — gate for downstream stages.
 - `mechanism_necessity_violations` (list of strings) — mechanisms whose removal still leaves the task solvable; flags accidental decoration.
@@ -260,6 +260,7 @@ Beyond the dimension vector, `scored.json` carries the validator's structural re
 - `chain_ordering_valid` (bool) — each dependency step actually gates the next.
 
 These do not enter the composite but are surfaced in reports for task-quality auditing.
+Schema-invalid tasks are rejected before canonical planners execute and do not emit score artifacts.
 
 ### 4.4 Calibration notes
 
@@ -272,16 +273,16 @@ These do not enter the composite but are surfaced in reports for task-quality au
 
 ## 5. Runtime scoring
 
-Runtime scoring runs at pipeline stage 4 (Score-runtime), once per `run.json`. It produces `run_score.json`. It consumes the run trajectory plus the static scoring artifacts (`scored.json`, `canonical_paths.json`).
+Runtime scoring runs at pipeline stage 4 (Score-runtime), once per `run.json`. It produces `run_score.json`. It consumes the run trajectory plus the static scoring artifacts (`scored_static.json`, `canonical_paths.json`).
 
 ### 5.1 Per-run signal vector
 
 Recorded for every `(task, backend, adapter, model_id, seed)`:
 
 - `success` (bool) — goal reached within `max_steps`, no terminal hazard.
-- `steps` (int) — agent's actual step count.
+- `steps` (int) — agent's actual step count. Required; runtime scoring rejects missing telemetry.
 - `terminated_reason` (str) — one of `{goal_reached, hazard, max_steps, deadlock, invalid_action_excess}`.
-- `token_count` (int) — total prompt + response tokens summed over all model turns.
+- `token_count` (positive int) — total prompt + response tokens summed over all model turns. Required; runtime scoring rejects missing or non-positive telemetry.
 - `distractor_interactions` (int) — count of distractor-element interactions (any `pickup` / `toggle` / `push` on an element registered as a distractor).
 - `irreversible_failures` (int) — count of irreversible actions that broke solvability, detected by re-running the validator from the post-action state.
 
@@ -298,11 +299,11 @@ composite = success_factor × efficiency_factor × difficulty_weight − greedy_
 ```
 
 - `success_factor = 1.0 if success else 0.0` — hard gate; failed runs score 0 regardless of efficiency.
-- `efficiency_factor = α × step_ratio + β × cell_overlap_bfs + γ × token_efficiency` — weighted blend; default `α = β = γ = 1/3`. `token_efficiency = min(1, baseline_tokens / max(model_tokens, 1))` where `baseline_tokens` lives in `calibration.yaml`.
-- `difficulty_weight = normalize(static_composite)` — harder tasks contribute more. Default normalization: `f(x) = x / max_observed_static_composite_in_suite`.
+- `efficiency_factor = α × step_ratio + β × cell_overlap_bfs + γ × token_efficiency` — weighted blend; default `α = β = γ = 1/3`. `token_efficiency = min(1, baseline_tokens / model_tokens)` where `baseline_tokens` lives in scorer config. Missing or non-positive token telemetry is an artifact error, not a neutral score.
+- `difficulty_weight = normalize(static_composite)` — harder tasks contribute more. Default normalization: `f(x) = x / max_observed_static_composite_in_suite`. Runtime scoring requires that suite maximum either in scorer config or as an explicit runtime argument.
 - `greedy_penalty = δ × greedy_solvability × success_factor` — applied only to successful runs; `δ` is a calibration coefficient with default 0.5.
 
-All Greek-letter coefficients (`α, β, γ, δ`) and the normalization function live in `calibration.yaml`. The design commits to the *shape*, not the values.
+All Greek-letter coefficients (`α, β, γ, δ`) and the normalization value live in scorer config. The design commits to the *shape*, not the values.
 
 ### 5.4 Single-point benchmark score (ARC-AGI style)
 
@@ -340,7 +341,8 @@ Defaults to a uniform mean. Calibration may switch to a tier-weighted or difficu
 ### 5.6 Calibration notes
 
 - All composite coefficients ship as `1.0` or sensible defaults; the design does not claim correctness.
-- `calibration.yaml` is versioned in git; changes bump `calibration_version` and trigger stage-4 / stage-5 invalidation.
+- `scorer/scorer_config.json` is versioned in git; changes bump `calibration_version` and trigger stage-4 / stage-5 invalidation.
+- The shipped config intentionally leaves `difficulty_max_static_score` unset. Runtime scoring requires a calibrated suite maximum through config or `--difficulty-max-static-score`.
 - After a calibration update, the pipeline regenerates `run_score.json` and `reports/` from cached `run.json`. Run records do **not** re-execute model calls. This is a deliberate consequence of the DAG split.
 
 ---
@@ -533,16 +535,16 @@ Status legend:
 
 **2. Validator** — folded into Stage 2
 - ✅ `gridworld/task_validator.py::TaskValidator` does exhaustive BFS over the full mechanism state space, plus `compute_fragility`, `validate_mechanism_necessity`, `validate_chain_ordering`, `validate_distractor_safety`.
-- Delta: surface validation reports into `scored.json` instead of emitting a separate `validity.json`.
+- Delta: surface validation reports into `scored_static.json` instead of emitting a separate `validity.json`.
 
 **3. Solver suite (canonical agents)** — Stage 2
-- ⚠️ BFS exists inside `TaskValidator._find_solution`. Greedy does not yet exist as a separate canonical agent.
-- 🚧 Multi-tier solver suite pending; Greedy is the next addition, then heuristic, then random.
-- Delta: extract BFS path emission as one canonical agent, add Greedy as a peer, write combined output to `canonical_paths.json`.
+- ✅ `gridworld/baselines.py` exposes BFS and Greedy planners; `scorer/solvers.py` writes their combined output to `canonical_paths.json`.
+- 🚧 Heuristic and random canonical-agent peers remain optional future additions.
+- Delta: add calibration runs before extending the canonical-agent feature vector.
 
 **4. Static scorer** — Stage 2
-- ⚠️ `gridworld/scoring.py::compute_12d_score` exists with 12 dimensions matching dimensions 1–12 of §4 (modulo formula calibration).
-- Delta: add dimension 13 (`greedy_solvability`), restructure output to `scored.json` sidecar, move composite weights to `calibration.yaml`, include validation reports.
+- ✅ `scorer/scoring.py::compute_12d_score` exposes the public interface for the 12 calibrated dimensions and writes `scored_static.json` with validation reports plus `canonical_agent_features.greedy_solvability`.
+- Delta: empirically calibrate the shipped placeholder weights.
 
 **5. `MiniGridBackend`** — backend axis
 - ✅ `gridworld/backends/minigrid_backend.py` implements `AbstractGridBackend` for square grids with discrete actions + RGB rendering.
@@ -566,8 +568,8 @@ Status legend:
 - Delta: emit canonical `run.json`; remove inline scoring (move to Stage 4); add per-step trajectory recording.
 
 **10. Runtime scorer** — Stage 4
-- 🚧 Does not exist as a component. Some scoring logic lives inside `evaluation_harness.py`.
-- Delta: new module that consumes `run.json` + `scored.json` + `canonical_paths.json` and produces `run_score.json`.
+- ✅ `scorer/runtime.py` consumes `run.json` + `scored_static.json` + `canonical_paths.json` and produces `run_score.json`.
+- Delta: populate optional interaction diagnostics in runtime producers and calibrate the suite-level difficulty maximum.
 
 **11. Aggregator / reporter** — Stage 5
 - ⚠️ Partial. `evaluation_harness.py` produces some summary dicts; nothing matches the per-run-set artifact layout.
@@ -597,7 +599,7 @@ Items the design intentionally defers. None block initial implementation.
 - DAG runner technology — Snakemake leading candidate; final pick deferred to implementation.
 - Token-efficiency baseline (`baseline_tokens`) — per-task vs global constant; needs a sensible default once a few model runs exist.
 
-### 9.2 Calibration coefficients (live in `calibration.yaml`, default to placeholders)
+### 9.2 Calibration coefficients (live in scorer config, default to placeholders)
 - Runtime composite blend weights `α, β, γ` (step ratio / cell overlap / token efficiency).
 - Greedy penalty coefficient `δ`.
 - `difficulty_weight` normalization function (currently `x / max_observed`; may switch to a percentile or log normalization).
@@ -645,7 +647,7 @@ Mapping to the canonical pipeline:
 | JSON generator | Stage 1 (Generate) | §2.1 |
 | Task spec / Validator | folded into Stage 2 (Solve & Score-static) | §2.1 |
 | BFS-greedy agents | Multi-tier canonical agent suite (Stage 2) | §2.1, §4 |
-| Score calculation (static) | Static scoring (13 dimensions) (Stage 2) | §4 |
+| Score calculation (static) | Static scoring (12 dimensions plus canonical-agent features) (Stage 2) | §4 |
 | Backend Generator | Backend axis: `MiniGridBackend` / `MultiGridBackend` / `TextBackend` | §6 |
 | Inference scripts | Adapter axis: `ModelInterface` implementations | §6 |
 | Scoring code (final score, comparison) | Runtime scoring (Stage 4) + Aggregate (Stage 5) | §5, §7 |
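
Reviewer note on §5.3 of `docs/system_design.md` above: the runtime composite can be sketched as follows. This is a minimal illustration of the documented shape, not the contents of `scorer/runtime.py`. The class `RuntimeCoefficients`, the function `runtime_composite`, the field names, and the `baseline_tokens` placeholder are all assumptions made for this sketch. It also assumes the formula cut off in the hunk header subtracts the `greedy_penalty` term defined in the bullets.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RuntimeCoefficients:
    """Hypothetical container; field names are assumptions, not the real config schema."""

    alpha: float = 1 / 3  # weight on step_ratio
    beta: float = 1 / 3   # weight on cell_overlap_bfs
    gamma: float = 1 / 3  # weight on token_efficiency
    delta: float = 0.5    # greedy penalty coefficient
    baseline_tokens: int = 1000  # placeholder value, not calibrated
    difficulty_max_static_score: float | None = None  # intentionally unset by default


def runtime_composite(
    success: bool,
    step_ratio: float,
    cell_overlap_bfs: float,
    token_count: int,
    static_composite: float,
    greedy_solvability: float,
    coeffs: RuntimeCoefficients,
) -> float:
    """Compute the composite in the shape described by the design doc."""
    # Missing or non-positive telemetry is an artifact error, not a neutral score.
    if token_count <= 0:
        raise ValueError("token_count must be a positive integer")
    suite_max = coeffs.difficulty_max_static_score
    if suite_max is None or suite_max <= 0:
        raise ValueError("a calibrated suite maximum is required")

    success_factor = 1.0 if success else 0.0
    token_efficiency = min(1.0, coeffs.baseline_tokens / token_count)
    efficiency_factor = (
        coeffs.alpha * step_ratio
        + coeffs.beta * cell_overlap_bfs
        + coeffs.gamma * token_efficiency
    )
    difficulty_weight = static_composite / suite_max
    greedy_penalty = coeffs.delta * greedy_solvability * success_factor
    return success_factor * efficiency_factor * difficulty_weight - greedy_penalty
```

With default coefficients, a successful greedy-solvable run at half the suite maximum difficulty and perfect efficiency nets out to zero, which shows how strongly the default `δ = 0.5` penalizes greedy-solvable tasks before calibration.
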
diff --git a/evaluation_harness.py b/evaluation_harness.py
index 57fa3ee..55ea840 100644
--- a/evaluation_harness.py
+++ b/evaluation_harness.py
@@ -22,7 +22,8 @@
     from .gridworld.task_spec import TaskSpecification
     from .gridworld.actions import ACTION_NAMES, ACTION_DESCRIPTIONS
     from .gridworld.task_validator import compute_difficulty
-    from .gridworld.scoring import compute_12d_score
+    from .scorer.io import json_default as _json_default
+    from .scorer.scoring import compute_12d_score
 except ImportError:
     from model_interface import ModelInterface, ModelInput, ModelOutput
     from gridworld.runner.grid_runner import GridRunner, EpisodeResult
@@ -31,14 +32,8 @@
     from gridworld.task_spec import TaskSpecification
     from gridworld.actions import ACTION_NAMES, ACTION_DESCRIPTIONS
     from gridworld.task_validator import compute_difficulty
-    from gridworld.scoring import compute_12d_score
-
-
-def _json_default(value):
-    """Convert NumPy scalars to native Python types for JSON serialization."""
-    if isinstance(value, np.generic):
-        return value.item()
-    raise TypeError(f"Object of type {value.__class__.__name__} is not JSON serializable")
+    from scorer.io import json_default as _json_default
+    from scorer.scoring import compute_12d_score
 
 
 @dataclass
diff --git a/gridworld/__init__.py b/gridworld/__init__.py
index 27425ed..fd567a0 100644
--- a/gridworld/__init__.py
+++ b/gridworld/__init__.py
@@ -1,7 +1,7 @@
 """Gridworld domain for MultiNet-v2.0.
 
-This module provides task schema, validation, and scoring utilities for
-gridworld puzzle specifications.
+This module provides task schema and validation utilities for gridworld
+puzzle specifications.
 """
 
 from .bootstrap import disable_gymnasium_env_plugins
@@ -32,9 +32,6 @@
     TaskValidator,
     compute_difficulty,
 )
-from .scoring import ScoredDifficulty, compute_12d_score
-
-
 __all__ = [
     # Task specification
     "Position",
@@ -58,6 +55,4 @@
     "DifficultyReport",
     "FragilityReport",
     "compute_difficulty",
-    "ScoredDifficulty",
-    "compute_12d_score",
 ]
diff --git a/gridworld/baselines.py b/gridworld/baselines.py
index ee5c8b0..2ab41f3 100644
--- a/gridworld/baselines.py
+++ b/gridworld/baselines.py
@@ -49,6 +49,17 @@ class Transition:
     next_state: PlannerState
 
 
+@dataclass(frozen=True)
+class PlannedPath:
+    """Planner output with replayed positions for scorer/reporting artifacts."""
+
+    success: bool
+    actions: list[int]
+    action_labels: list[str]
+    positions: list[tuple[int, int]]
+    states_explored: int = 0
+
+
 class TaskPlanningContext:
     """Fast lookup tables derived from a ``TaskSpecification``."""
 
@@ -353,10 +364,10 @@ def _shortest_plan(
     ctx: TaskPlanningContext,
     start: PlannerState,
     is_goal: Callable[[PlannerState], bool],
-) -> tuple[list[int], PlannerState | None]:
+) -> tuple[list[int], PlannerState | None, int]:
     """Run BFS over executable actions and return the first shortest plan."""
     if is_goal(start):
-        return [], start
+        return [], start, 1
 
     queue = deque([start])
     parent: dict[PlannerState, tuple[PlannerState, int]] = {}
@@ -370,10 +381,14 @@ def _shortest_plan(
             visited.add(transition.next_state)
             parent[transition.next_state] = (state, transition.action)
             if is_goal(transition.next_state):
-                return _reconstruct_actions(parent, transition.next_state), transition.next_state
+                return (
+                    _reconstruct_actions(parent, transition.next_state),
+                    transition.next_state,
+                    len(visited),
+                )
             queue.append(transition.next_state)
 
-    return [], None
+    return [], None, len(visited)
 
 
 def _shortest_plan_to_interaction(
@@ -437,9 +452,18 @@ def _reconstruct_actions(
 
 
 def _bfs_actions(spec: TaskSpecification) -> list[int]:
+    actions, _ = _bfs_actions_with_stats(spec)
+    return actions
+
+
+def _bfs_actions_with_stats(spec: TaskSpecification) -> tuple[list[int], int]:
     ctx = TaskPlanningContext(spec)
-    actions, _ = _shortest_plan(ctx, ctx.initial_state(), lambda st: st.agent_pos == ctx.goal)
-    return actions or [int(MiniGridActions.DONE)]
+    actions, _, states_explored = _shortest_plan(
+        ctx,
+        ctx.initial_state(),
+        lambda st: st.agent_pos == ctx.goal,
+    )
+    return actions, states_explored
 
 
 def _greedy_actions(spec: TaskSpecification) -> list[int]:
@@ -452,13 +476,81 @@ def _greedy_actions(spec: TaskSpecification) -> list[int]:
             break
         chunk, next_state = _shortest_plan_to_interaction(ctx, state)
         if next_state is None:
-            chunk, next_state = _shortest_plan(ctx, state, lambda st: st.agent_pos == ctx.goal)
+            chunk, next_state, _ = _shortest_plan(
+                ctx,
+                state,
+                lambda st: st.agent_pos == ctx.goal,
+            )
         if next_state is None or not chunk:
             break
         actions.extend(chunk)
         state = next_state
 
-    return actions or [int(MiniGridActions.DONE)]
+    return actions
+
+
+def trace_planned_actions(spec: TaskSpecification, actions: list[int]) -> PlannedPath:
+    """Replay planner actions through the planner graph without running a backend."""
+    ctx = TaskPlanningContext(spec)
+    state = ctx.initial_state()
+    positions = [state.agent_pos]
+    executed_actions: list[int] = []
+    labels: list[str] = []
+
+    for action in actions:
+        if action == int(MiniGridActions.DONE):
+            break
+        executed_actions.append(action)
+        transition = next(
+            (candidate for candidate in _successors(ctx, state) if candidate.action == action),
+            None,
+        )
+        if transition is None:
+            labels.append(f"invalid:{action}")
+            return PlannedPath(
+                success=False,
+                actions=executed_actions,
+                action_labels=labels,
+                positions=positions,
+            )
+        labels.append(transition.label)
+        state = transition.next_state
+        positions.append(state.agent_pos)
+
+    return PlannedPath(
+        success=state.agent_pos == ctx.goal,
+        actions=executed_actions,
+        action_labels=labels,
+        positions=positions,
+    )
+
+
+def plan_bfs_actions(spec: TaskSpecification) -> list[int]:
+    """Return the deterministic BFS baseline action plan."""
+    return _bfs_actions(spec)
+
+
+def plan_greedy_actions(spec: TaskSpecification) -> list[int]:
+    """Return the deterministic greedy baseline action plan."""
+    return _greedy_actions(spec)
+
+
+def plan_bfs_path(spec: TaskSpecification) -> PlannedPath:
+    """Return the BFS baseline plan plus replayed positions."""
+    actions, states_explored = _bfs_actions_with_stats(spec)
+    path = trace_planned_actions(spec, actions)
+    return PlannedPath(
+        success=path.success,
+        actions=path.actions,
+        action_labels=path.action_labels,
+        positions=path.positions,
+        states_explored=states_explored,
+    )
+
+
+def plan_greedy_path(spec: TaskSpecification) -> PlannedPath:
+    """Return the greedy baseline plan plus replayed positions."""
+    return trace_planned_actions(spec, plan_greedy_actions(spec))
 
 
 class PlannedBaselineModel(ModelInterface):
diff --git a/gridworld/scoring.py b/gridworld/scoring.py
deleted file mode 100644
index 9dd3670..0000000
--- a/gridworld/scoring.py
+++ /dev/null
@@ -1,152 +0,0 @@
-"""12-dimension scoring for gridworld tasks."""
-
-from __future__ import annotations
-
-from dataclasses import dataclass, field
-
-from .task_spec import TaskSpecification
-from .task_validator import DifficultyReport, TaskValidator
-
-
-DIMENSION_NAMES = [
-    "optimal_path_length",
-    "search_space_size",
-    "backtracking_required",
-    "fragility",
-    "dependency_depth",
-    "dependency_variety",
-    "distractor_count",
-    "distractor_quality",
-    "grid_size",
-    "wall_density",
-    "partial_observability",
-    "irreversibility",
-]
-
-
-@dataclass
-class ScoredDifficulty:
-    """Full 12-dimension score report."""
-    dimensions: list[float]
-    dimension_names: list[str] = field(default_factory=lambda: DIMENSION_NAMES.copy())
-    composite: float = 0.0
-    weights: list[float] = field(default_factory=lambda: [1.0] * len(DIMENSION_NAMES))
-
-    def to_dict(self) -> dict:
-        return {
-            "dimensions": self.dimensions,
-            "dimension_names": self.dimension_names,
-            "composite": self.composite,
-            "weights": self.weights,
-        }
-
-
-def _count_backtracking(solution: list[tuple[int, int]] | None) -> float:
-    if not solution:
-        return 0.0
-    seen = set()
-    revisits = 0
-    previous_pos = None
-    for pos in solution:
-        if pos == previous_pos:
-            continue
-        if pos in seen:
-            revisits += 1
-        seen.add(pos)
-        previous_pos = pos
-    return float(revisits)
-
-
-def _dependency_variety(spec: TaskSpecification) -> float:
-    if spec.dependency_chain is not None:
-        return float(len({step.type for step in spec.dependency_chain.sequence}))
-
-    variety = 0
-    if spec.mechanisms.keys and spec.mechanisms.doors:
-        variety += 1
-    if spec.mechanisms.switches and spec.mechanisms.gates:
-        variety += 1
-    if spec.mechanisms.blocks:
-        variety += 1
-    if spec.mechanisms.teleporters:
-        variety += 1
-    if spec.mechanisms.hazards:
-        variety += 1
-    return float(variety)
-
-
-def _distractor_quality(spec: TaskSpecification) -> float:
-    if not spec.distractors:
-        return 0.0
-    weights = {
-        "wrong_color_key": 1.0,
-        "inactive_switch": 2.0,
-        "decoy_door": 2.0,
-        "distractor_chain": 3.0,
-    }
-    return float(sum(weights.get(d.type, 1.0) for d in spec.distractors))
-
-
-def _partial_observability(spec: TaskSpecification) -> float:
-    mapping = {"full": 0.0, "view_cone": 1.0, "fog_of_war": 2.0}
-    return mapping.get(spec.rules.observability, 0.0)
-
-
-def _irreversibility(spec: TaskSpecification) -> float:
-    score = 0.0
-    if spec.rules.key_consumption:
-        score += float(len(spec.mechanisms.doors))
-    score += float(sum(1 for switch in spec.mechanisms.switches if switch.switch_type == "one_shot"))
-    score += float(sum(1 for tp in spec.mechanisms.teleporters if not tp.bidirectional))
-    return score
-
-
-def compute_12d_score(
-    spec: TaskSpecification,
-    solver_output: DifficultyReport | None = None,
-    weights: list[float] | None = None,
-) -> ScoredDifficulty:
-    """
-    Compute the full 12-dimension benchmark score.
-
-    This wraps solver-derived metrics with rubric dimensions such as
-    fragility, dependency variety, distractor quality, partial observability,
-    wall density, and irreversibility. The compact solver report remains in
-    compute_difficulty for callers that only need path/search metrics.
-    """
-    validator = TaskValidator(spec)
-    is_beatable, solution, message = validator.validate()
-    if solver_output is None:
-        from .task_validator import compute_difficulty
-
-        solver_output = compute_difficulty(spec)
-
-    fragility = validator.compute_fragility()
-    fragility_value = 0.0 if fragility.min_steps_to_break == -1 else 1.0 / fragility.min_steps_to_break
-
-    width, height = spec.maze.dimensions
-    grid_size = float(width * height)
-    wall_density = float(len(spec.maze.walls) / grid_size) if grid_size else 0.0
-
-    dimensions = [
-        float(solver_output.optimal_steps),
-        float(solver_output.states_explored),
-        float(solver_output.backtrack_count if hasattr(solver_output, "backtrack_count") else _count_backtracking(solution)),
-        fragility_value,
-        float(spec.dependency_chain.depth if spec.dependency_chain is not None else solver_output.dependency_depth),
-        _dependency_variety(spec),
-        float(len(spec.distractors or [])),
-        _distractor_quality(spec),
-        grid_size,
-        wall_density,
-        _partial_observability(spec),
-        _irreversibility(spec),
-    ]
-
-    weight_vector = weights or [1.0] * len(DIMENSION_NAMES)
-    composite = float(sum(d * w for d, w in zip(dimensions, weight_vector)))
-    return ScoredDifficulty(
-        dimensions=dimensions,
-        composite=composite,
-        weights=weight_vector,
-    )
diff --git a/gridworld/task_validator.py b/gridworld/task_validator.py
index aee948f..4befedf 100644
--- a/gridworld/task_validator.py
+++ b/gridworld/task_validator.py
@@ -493,12 +493,13 @@ def validate_chain_ordering(self) -> bool:
                 return False
         return True
 
-    def validate_distractor_safety(self) -> list[str]:
+    def validate_distractor_safety(self, base_beatable: bool | None = None) -> list[str]:
         """Check whether a single distractor interaction can make the task unsolvable."""
         if not self.spec.distractors:
             return []
 
-        base_beatable, _, _ = self.validate()
+        if base_beatable is None:
+            base_beatable, _, _ = self.validate()
         if not base_beatable:
             return ["Base task is not solvable"]
 
@@ -767,17 +768,23 @@ def to_dict(self) -> dict:
         }
 
 
-def compute_difficulty(spec: TaskSpecification) -> DifficultyReport:
+def compute_difficulty(
+    spec: TaskSpecification,
+    validator: TaskValidator | None = None,
+    validation_result: tuple[bool, Optional[list[tuple[int, int]]], str] | None = None,
+) -> DifficultyReport:
     """
     Compute solver-derived difficulty metrics for a task.
 
     This is a compact report centered on BFS output: beatability, shortest
     action count, states explored, coarse mechanism complexity, and a legacy
-    composite score. Use compute_12d_score when the full rubric vector is
+    composite score. Use scorer.scoring.compute_12d_score when the full rubric vector is
     needed for benchmark comparison.
     """
-    validator = TaskValidator(spec)
-    is_beatable, solution, message = validator.validate()
+    task_validator = validator or TaskValidator(spec)
+    if validation_result is None:
+        validation_result = task_validator.validate()
+    is_beatable, solution, message = validation_result
 
     optimal_steps = len(solution) - 1 if solution else 0  # -1 because path includes start
     # Extract states_explored from message
diff --git a/interface/agents/claude.py b/interface/agents/claude.py
index 9a6fc8e..1a2466b 100644
--- a/interface/agents/claude.py
+++ b/interface/agents/claude.py
@@ -17,6 +17,7 @@
     parse_runner_content,
     split_system_prompt,
 )
+from interface.telemetry import normalize_token_usage
 
 logger = logging.getLogger(__name__)
 
@@ -83,7 +84,7 @@ def _post_messages(
     system: Optional[str],
     messages: List[Dict[str, object]],
     timeout: Optional[float],
-) -> str:
+) -> Tuple[str, Optional[Dict[str, int]]]:
     body: Dict[str, object] = {
         "model": model,
         "max_tokens": max_tokens,
@@ -136,7 +137,7 @@ def _post_messages(
     for block in payload.get("content", []) or []:
         if isinstance(block, dict) and block.get("type") == "text":
             parts.append(str(block.get("text", "")))
-    return "".join(parts).strip()
+    return "".join(parts).strip(), normalize_token_usage(payload.get("usage"))
 
 
 @dataclass
@@ -153,6 +154,7 @@ class ClaudeAnthropicAgent:
 
     config: ClaudeAnthropicConfig = field(default_factory=ClaudeAnthropicConfig)
     api_key: Optional[str] = None
+    last_usage: Optional[Dict[str, int]] = field(default=None, init=False)
 
     def __post_init__(self) -> None:
         key = (self.api_key or os.environ.get("ANTHROPIC_API_KEY") or "").strip()
@@ -165,7 +167,7 @@ def __post_init__(self) -> None:
 
     def __call__(self, messages: List[dict]) -> str:
         system, turns = _to_anthropic_turns(messages)
-        return _post_messages(
+        text, self.last_usage = _post_messages(
             self.api_key,
             model=self.config.model,
             max_tokens=self.config.max_tokens,
@@ -174,3 +176,4 @@ def __call__(self, messages: List[dict]) -> str:
             messages=turns,
             timeout=self.config.timeout,
         )
+        return text
diff --git a/interface/agents/qwen35_vl.py b/interface/agents/qwen35_vl.py
index 2ad4e90..3ce2669 100644
--- a/interface/agents/qwen35_vl.py
+++ b/interface/agents/qwen35_vl.py
@@ -78,6 +78,7 @@ class Qwen35VLAgent:
     config: Qwen35VLConfig = field(default_factory=Qwen35VLConfig)
     processor: Any = None
     model: Any = None
+    last_usage: dict[str, int] | None = field(default=None, init=False)
 
     def __post_init__(self) -> None:
         from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
@@ -121,4 +122,9 @@ def __call__(self, messages: List[dict]) -> str:
             )
 
         new_tokens = generated[0][prompt_len:]
+        self.last_usage = {
+            "input_tokens": int(prompt_len),
+            "output_tokens": int(len(new_tokens)),
+            "total_tokens": int(prompt_len + len(new_tokens)),
+        }
         return self.processor.decode(new_tokens, skip_special_tokens=True).strip()
diff --git a/interface/runner.py b/interface/runner.py
index 91dc448..a8c6f7b 100644
--- a/interface/runner.py
+++ b/interface/runner.py
@@ -57,6 +57,18 @@ def _trim_rolling_chat(messages: List[dict], max_pairs: int) -> None:
         del messages[1 : 1 + (tail_len - cap)]
 
 
+def _reset_agent_usage(agent: Callable[[List[dict]], str]) -> None:
+    """Clear per-call telemetry so stale usage cannot leak into a later query."""
+    reset_usage = getattr(agent, "reset_usage", None)
+    if callable(reset_usage):
+        reset_usage()
+        return
+    try:
+        setattr(agent, "last_usage", None)
+    except (AttributeError, TypeError):
+        pass
+
+
 def build_runner(
     config: ExperimentConfig,
     backend: MiniGridBackend,
@@ -159,6 +171,7 @@ def run(
                         len(agent_messages),
                         has_image,
                     )
+                _reset_agent_usage(agent)
                 t_llm = time.perf_counter()
                 model_text = agent(agent_messages)
                 llm_s = time.perf_counter() - t_llm
@@ -177,22 +190,24 @@ def run(
                     )
                 if logger.isEnabledFor(logging.DEBUG):
                     logger.debug("LLM query #%d reply:\n%s", query_count, model_text)
-                transcript.append(
-                    {
-                        "kind": "query",
-                        "query_index": query_count,
-                        "env_step_count": state.step_count,
-                        "agent_messages": copy.deepcopy(agent_messages),
-                        "assistant_reply": model_text,
-                        "parsed_actions": list(action_queue),
-                        "parse_ok": bool(action_queue),
-                        "has_image": has_image,
-                        "llm_latency_s": llm_s,
-                        "chat_history_mode": chat_history,
-                        "agent_message_count": len(agent_messages),
-                        "actions_remaining_before_step": len(action_queue),
-                    }
-                )
+                query_record = {
+                    "kind": "query",
+                    "query_index": query_count,
+                    "env_step_count": state.step_count,
+                    "agent_messages": copy.deepcopy(agent_messages),
+                    "assistant_reply": model_text,
+                    "parsed_actions": list(action_queue),
+                    "parse_ok": bool(action_queue),
+                    "has_image": has_image,
+                    "llm_latency_s": llm_s,
+                    "chat_history_mode": chat_history,
+                    "agent_message_count": len(agent_messages),
+                    "actions_remaining_before_step": len(action_queue),
+                }
+                usage = getattr(agent, "last_usage", None)
+                if isinstance(usage, dict):
+                    query_record["usage"] = dict(usage)
+                transcript.append(query_record)
                 # check if we got any valid actions; 
                 # if not, we'll count it as a parse failure and give feedback, 
                 # but still allow retries until max_parse_retries is reached
diff --git a/interface/smoke_tests/smoke_llm.py b/interface/smoke_tests/smoke_llm.py
index fd7d5e0..8d9058c 100644
--- a/interface/smoke_tests/smoke_llm.py
+++ b/interface/smoke_tests/smoke_llm.py
@@ -80,19 +80,35 @@ def __init__(
     def __call__(self, messages: list[dict]) -> str:
         self._query_seq += 1
         text = self._inner(messages)
-        self._records.append(
-            {
-                "query": self._query_seq,
-                "messages_in_context": len(messages),
-                "reply": text,
-            }
-        )
+        record = {
+            "query": self._query_seq,
+            "messages_in_context": len(messages),
+            "reply": text,
+        }
+        if self.last_usage is not None:
+            record["usage"] = dict(self.last_usage)
+        self._records.append(record)
         if self._log_replies:
             print(f"\n{'=' * 72}\nLLM query {self._query_seq} (messages={len(messages)})\n{'=' * 72}")
             print(text)
             print(f"{'=' * 72}\n")
         return text
 
+    @property
+    def last_usage(self) -> dict[str, int] | None:
+        usage = getattr(self._inner, "last_usage", None)
+        return usage if isinstance(usage, dict) else None
+
+    def reset_usage(self) -> None:
+        reset_usage = getattr(self._inner, "reset_usage", None)
+        if callable(reset_usage):
+            reset_usage()
+            return
+        try:
+            setattr(self._inner, "last_usage", None)
+        except (AttributeError, TypeError):
+            pass
+
 
 def main() -> None:
     parser = argparse.ArgumentParser(
diff --git a/interface/telemetry.py b/interface/telemetry.py
new file mode 100644
index 0000000..dd3a3c4
--- /dev/null
+++ b/interface/telemetry.py
@@ -0,0 +1,42 @@
+"""Shared telemetry normalization for interface producers and scorer consumers."""
+
+from __future__ import annotations
+
+from typing import Any
+
+
+TOKEN_COUNT_KEYS = ("total_tokens", "token_count", "tokens", "model_tokens")
+
+
+def normalize_token_usage(usage: Any) -> dict[str, int] | None:
+    """Normalize provider token usage into input, output, and total counts."""
+    if not isinstance(usage, dict):
+        return None
+    input_tokens = usage.get("input_tokens", usage.get("prompt_tokens"))
+    output_tokens = usage.get("output_tokens", usage.get("completion_tokens"))
+    total_tokens = usage.get("total_tokens")
+    if total_tokens is None and (input_tokens is not None or output_tokens is not None):
+        total_tokens = int(input_tokens or 0) + int(output_tokens or 0)
+
+    normalized = {}
+    if input_tokens is not None:
+        normalized["input_tokens"] = int(input_tokens)
+    if output_tokens is not None:
+        normalized["output_tokens"] = int(output_tokens)
+    if total_tokens is not None:
+        normalized["total_tokens"] = int(total_tokens)
+    return normalized or None
+
+
+def token_count_from_record(record: dict[str, Any]) -> int | None:
+    """Extract one token total without counting nested aliases twice."""
+    for container in (record, record.get("info"), record.get("metadata")):
+        if not isinstance(container, dict):
+            continue
+        for key in TOKEN_COUNT_KEYS:
+            if container.get(key) is not None:
+                return int(container[key])
+        usage = normalize_token_usage(container.get("usage"))
+        if usage is not None and usage.get("total_tokens") is not None:
+            return usage["total_tokens"]
+    return None
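
Reviewer note on `interface/telemetry.py`: the sketch below shows what `normalize_token_usage` returns for the two usage shapes it accepts. The function body is copied from the new file above so the snippet runs on its own. The sample payloads and their values are hypothetical, chosen only to exercise the alias fallback (`prompt_tokens`/`completion_tokens`) and the derived `total_tokens`.

```python
from __future__ import annotations

from typing import Any


def normalize_token_usage(usage: Any) -> dict[str, int] | None:
    """Copy of the helper from interface/telemetry.py, repeated for a standalone demo."""
    if not isinstance(usage, dict):
        return None
    input_tokens = usage.get("input_tokens", usage.get("prompt_tokens"))
    output_tokens = usage.get("output_tokens", usage.get("completion_tokens"))
    total_tokens = usage.get("total_tokens")
    if total_tokens is None and (input_tokens is not None or output_tokens is not None):
        # Derive the total when the provider reports only the parts.
        total_tokens = int(input_tokens or 0) + int(output_tokens or 0)

    normalized = {}
    if input_tokens is not None:
        normalized["input_tokens"] = int(input_tokens)
    if output_tokens is not None:
        normalized["output_tokens"] = int(output_tokens)
    if total_tokens is not None:
        normalized["total_tokens"] = int(total_tokens)
    return normalized or None


# Hypothetical payloads: one with input/output keys, one with prompt/completion aliases.
input_output_style = {"input_tokens": 120, "output_tokens": 30}
prompt_completion_style = {"prompt_tokens": 120, "completion_tokens": 30, "total_tokens": 150}

from_input_output = normalize_token_usage(input_output_style)
from_prompt_completion = normalize_token_usage(prompt_completion_style)
output_only = normalize_token_usage({"output_tokens": 5})
missing = normalize_token_usage(None)
empty = normalize_token_usage({})
```

Both sample payloads normalize to the same three-key dict, which is what lets the runner store `usage` per query without caring which adapter produced it. A non-dict or empty payload yields `None`, so no `usage` key is written.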
diff --git a/pyproject.toml b/pyproject.toml
index 7ce045e..b3c8b25 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -39,6 +39,7 @@ multinet-probe-vlm = "scripts.probe_vlm:main"
 multinet-ollama-vision-check = "scripts.ollama_vision_check:main"
 multinet-ollama-maze-shape-check = "scripts.ollama_maze_shape_check:main"
 multinet-vlm-sanity = "scripts.vlm_sanity_check:main"
+multinet-score-json = "scripts.score_json:main"
 
 [tool.setuptools]
 include-package-data = true
@@ -63,9 +64,11 @@ include = [
     "interface*",
     "mazes*",
     "multigrid*",
+    "scorer*",
     "scripts*",
 ]
 
 [tool.setuptools.package-data]
 gridworld = ["tasks/**/*.json", "tasks/*.json"]
 mazes = ["validation_10/**/*.json", "validation_10/*.json"]
+scorer = ["scorer_config.json"]
diff --git a/scorer/__init__.py b/scorer/__init__.py
new file mode 100644
index 0000000..df4a4db
--- /dev/null
+++ b/scorer/__init__.py
@@ -0,0 +1,33 @@
+"""Standalone scoring package for MultiNet task and run artifacts."""
+
+from .scoring import (
+    CanonicalPathReport,
+    RuntimeScoreArtifact,
+    ScoredDifficulty,
+    ScorerConfig,
+    StaticScoreArtifact,
+    compute_12d_score,
+    compute_canonical_paths,
+    compute_greedy_solvability,
+    compute_runtime_score,
+    compute_static_score_artifact,
+    load_scorer_config,
+    score_runtime_file,
+    score_task_file,
+)
+
+__all__ = [
+    "CanonicalPathReport",
+    "RuntimeScoreArtifact",
+    "ScoredDifficulty",
+    "ScorerConfig",
+    "StaticScoreArtifact",
+    "compute_12d_score",
+    "compute_canonical_paths",
+    "compute_greedy_solvability",
+    "compute_runtime_score",
+    "compute_static_score_artifact",
+    "load_scorer_config",
+    "score_runtime_file",
+    "score_task_file",
+]
diff --git a/scorer/artifacts.py b/scorer/artifacts.py
new file mode 100644
index 0000000..e61a70b
--- /dev/null
+++ b/scorer/artifacts.py
@@ -0,0 +1,170 @@
+"""Dataclasses for scorer artifact payloads."""
+
+from __future__ import annotations
+
+import copy
+from dataclasses import dataclass, field
+from typing import Any
+
+from .config import DIMENSION_NAMES, SCORER_VERSION
+
+
+@dataclass
+class ScoredDifficulty:
+    """Backward-compatible 12-dimension score report."""
+
+    dimensions: list[float]
+    dimension_names: list[str] = field(default_factory=lambda: DIMENSION_NAMES.copy())
+    composite: float = 0.0
+    weights: list[float] = field(default_factory=lambda: [1.0] * len(DIMENSION_NAMES))
+
+    @property
+    def dimensions_by_name(self) -> dict[str, float]:
+        return dict(zip(self.dimension_names, self.dimensions))
+
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "dimensions": list(self.dimensions),
+            "dimension_names": list(self.dimension_names),
+            "composite": self.composite,
+            "weights": list(self.weights),
+        }
+
+
+@dataclass
+class CanonicalPathReport:
+    """Canonical solver trace artifact for a task."""
+
+    task_id: str
+    success: bool
+    actions: list[str]
+    positions: list[tuple[int, int]]
+    optimal_steps: int
+    states_explored: int
+    message: str
+    greedy: dict[str, Any] | None = None
+    producer_version: str = SCORER_VERSION
+
+    @property
+    def bfs(self) -> dict[str, Any]:
+        return {
+            "success": self.success,
+            "actions": list(self.actions),
+            "positions": [list(pos) for pos in self.positions],
+            "optimal_steps": self.optimal_steps,
+            "states_explored": self.states_explored,
+            "message": self.message,
+        }
+
+    def to_dict(self) -> dict[str, Any]:
+        payload = {
+            "task_id": self.task_id,
+            "bfs": self.bfs,
+            "producer_version": self.producer_version,
+        }
+        if self.greedy is not None:
+            payload["greedy"] = copy.deepcopy(self.greedy)
+        return payload
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "CanonicalPathReport":
+        bfs = data.get("bfs", data)
+        return cls(
+            task_id=str(data.get("task_id", "")),
+            success=bool(bfs.get("success", False)),
+            actions=[str(action) for action in bfs.get("actions", [])],
+            positions=[
+                (int(pos[0]), int(pos[1]))
+                for pos in bfs.get("positions", [])
+                if isinstance(pos, (list, tuple)) and len(pos) >= 2
+            ],
+            optimal_steps=int(bfs.get("optimal_steps", 0)),
+            states_explored=int(bfs.get("states_explored", 0)),
+            message=str(bfs.get("message", "")),
+            greedy=copy.deepcopy(data.get("greedy")),
+            producer_version=str(data.get("producer_version", SCORER_VERSION)),
+        )
+
+
+@dataclass
+class StaticScoreArtifact:
+    """Stage 2 static score artifact."""
+
+    task_id: str
+    is_beatable: bool
+    message: str
+    dimensions: dict[str, float]
+    static_score_unweighted: float
+    static_score: float
+    weights: dict[str, float]
+    validation: dict[str, Any]
+    canonical_agent_features: dict[str, float | None]
+    calibration_version: str
+    inputs_hash: str
+    producer_version: str = SCORER_VERSION
+
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "task_id": self.task_id,
+            "is_beatable": self.is_beatable,
+            "message": self.message,
+            "dimensions_12": dict(self.dimensions),
+            "static_score_unweighted": self.static_score_unweighted,
+            "static_score": self.static_score,
+            "weights": dict(self.weights),
+            "validation": copy.deepcopy(self.validation),
+            "canonical_agent_features": dict(self.canonical_agent_features),
+            "calibration_version": self.calibration_version,
+            "inputs_hash": self.inputs_hash,
+            "producer_version": self.producer_version,
+        }
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "StaticScoreArtifact":
+        dimensions = data.get("dimensions_12", data.get("dimensions", {}))
+        if isinstance(dimensions, list):
+            dimensions = dict(zip(DIMENSION_NAMES, dimensions))
+        return cls(
+            task_id=str(data.get("task_id", "")),
+            is_beatable=bool(data.get("is_beatable", False)),
+            message=str(data.get("message", "")),
+            dimensions={str(k): float(v) for k, v in dimensions.items()},
+            static_score_unweighted=float(data.get("static_score_unweighted", 0.0)),
+            static_score=float(data.get("static_score", data.get("composite", 0.0))),
+            weights={str(k): float(v) for k, v in data.get("weights", {}).items()},
+            validation=dict(data.get("validation", {})),
+            canonical_agent_features=dict(data.get("canonical_agent_features", {})),
+            calibration_version=str(data.get("calibration_version", "unknown")),
+            inputs_hash=str(data.get("inputs_hash", "")),
+            producer_version=str(data.get("producer_version", SCORER_VERSION)),
+        )
+
+
+@dataclass
+class RuntimeScoreArtifact:
+    """Stage 4 runtime score artifact for one run."""
+
+    task_id: str
+    backend: str
+    adapter: str
+    model_id: str
+    seed: int | None
+    signals: dict[str, Any]
+    composite: float
+    calibration_version: str
+    inputs_hash: str
+    producer_version: str = SCORER_VERSION
+
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "task_id": self.task_id,
+            "backend": self.backend,
+            "adapter": self.adapter,
+            "model_id": self.model_id,
+            "seed": self.seed,
+            "signals": copy.deepcopy(self.signals),
+            "composite": self.composite,
+            "calibration_version": self.calibration_version,
+            "inputs_hash": self.inputs_hash,
+            "producer_version": self.producer_version,
+        }
diff --git a/scorer/config.py b/scorer/config.py
new file mode 100644
index 0000000..6e48130
--- /dev/null
+++ b/scorer/config.py
@@ -0,0 +1,146 @@
+"""Scorer configuration and calibration defaults."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+from .io import load_json
+
+
+SCORER_VERSION = "0.3.0"
+DEFAULT_CONFIG_PATH = Path(__file__).with_name("scorer_config.json")
+
+DIMENSION_NAMES = [
+    "optimal_path_length",
+    "search_space_size",
+    "backtracking_required",
+    "fragility",
+    "dependency_depth",
+    "dependency_variety",
+    "distractor_count",
+    "distractor_quality",
+    "grid_size",
+    "wall_density",
+    "partial_observability",
+    "irreversibility",
+]
+
+GREEDY_SOLVABILITY_FEATURE = "greedy_solvability"
+
+CANONICAL_AGENT_FEATURE_NAMES = [
+    GREEDY_SOLVABILITY_FEATURE,
+]
+
+DEFAULT_DISTRACTOR_TYPE_WEIGHTS = {
+    "wrong_color_key": 1.0,
+    "inactive_switch": 2.0,
+    "decoy_door": 2.0,
+    "distractor_chain": 3.0,
+}
+
+DEFAULT_RUNTIME_WEIGHTS = {
+    "step_ratio": 1.0,
+    "cell_overlap_bfs": 1.0,
+    "token_efficiency": 1.0,
+    "greedy_penalty": 0.5,
+}
+
+
+def _coerce_float_mapping(
+    values: dict[str, Any] | list[Any] | None,
+    names: list[str],
+    default: float = 1.0,
+) -> dict[str, float]:
+    if values is None:
+        return {name: default for name in names}
+    if isinstance(values, list):
+        if len(values) != len(names):
+            raise ValueError(f"Expected {len(names)} weights, got {len(values)}")
+        result = {name: default for name in names}
+        for name, value in zip(names, values):
+            result[name] = float(value)
+        return result
+    return {name: float(values.get(name, default)) for name in names}
+
+
+@dataclass
+class ScorerConfig:
+    """Weights and runtime coefficients used by the standalone scorer."""
+
+    version: str = "default"
+    static_dimension_weights: dict[str, float] = field(
+        default_factory=lambda: {name: 1.0 for name in DIMENSION_NAMES}
+    )
+    distractor_type_weights: dict[str, float] = field(
+        default_factory=lambda: DEFAULT_DISTRACTOR_TYPE_WEIGHTS.copy()
+    )
+    runtime_weights: dict[str, float] = field(
+        default_factory=lambda: DEFAULT_RUNTIME_WEIGHTS.copy()
+    )
+    baseline_tokens: float = 1000.0
+    difficulty_max_static_score: float | None = None
+
+    @classmethod
+    def default(cls) -> "ScorerConfig":
+        return cls()
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "ScorerConfig":
+        static_weights = data.get("static_dimension_weights", data.get("static_weights"))
+        runtime_weights = data.get("runtime_weights")
+        distractor_weights = data.get("distractor_type_weights", data.get("distractor_weights"))
+
+        difficulty_max = data.get("difficulty_max_static_score")
+        return cls(
+            version=str(data.get("version", "default")),
+            static_dimension_weights=_coerce_float_mapping(static_weights, DIMENSION_NAMES),
+            distractor_type_weights={
+                **DEFAULT_DISTRACTOR_TYPE_WEIGHTS,
+                **{k: float(v) for k, v in (distractor_weights or {}).items()},
+            },
+            runtime_weights={
+                **DEFAULT_RUNTIME_WEIGHTS,
+                **{k: float(v) for k, v in (runtime_weights or {}).items()},
+            },
+            baseline_tokens=float(data.get("baseline_tokens", 1000.0)),
+            difficulty_max_static_score=(
+                None if difficulty_max is None else float(difficulty_max)
+            ),
+        )
+
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "version": self.version,
+            "static_dimension_weights": dict(self.static_dimension_weights),
+            "distractor_type_weights": dict(self.distractor_type_weights),
+            "runtime_weights": dict(self.runtime_weights),
+            "baseline_tokens": self.baseline_tokens,
+            "difficulty_max_static_score": self.difficulty_max_static_score,
+        }
+
+    def static_weight_list(self) -> list[float]:
+        return [self.static_dimension_weights.get(name, 1.0) for name in DIMENSION_NAMES]
+
+
+def load_scorer_config(path: str | Path | None = None) -> ScorerConfig:
+    """Load scorer weights from JSON or YAML; return defaults only when the default file is missing."""
+    config_path = Path(path) if path is not None else DEFAULT_CONFIG_PATH
+    if not config_path.exists():
+        if path is not None:
+            raise FileNotFoundError(f"Scorer config not found: {config_path}")
+        return ScorerConfig.default()
+    if config_path.suffix.lower() in {".yaml", ".yml"}:
+        try:
+            import yaml  # type: ignore
+        except ImportError as exc:
+            raise ImportError(
+                "YAML scorer configs require PyYAML. Use JSON or install PyYAML."
+            ) from exc
+        with open(config_path, "r", encoding="utf-8") as f:
+            data = yaml.safe_load(f) or {}
+        if not isinstance(data, dict):
+            raise ValueError(f"Expected a YAML object in {config_path}")
+        return ScorerConfig.from_dict(data)
+    return ScorerConfig.from_dict(load_json(config_path))
diff --git a/scorer/io.py b/scorer/io.py
new file mode 100644
index 0000000..451fc44
--- /dev/null
+++ b/scorer/io.py
@@ -0,0 +1,62 @@
+"""JSON and hash helpers for scorer artifacts."""
+
+from __future__ import annotations
+
+import hashlib
+import json
+from pathlib import Path
+from typing import Any
+
+from gridworld.task_spec import TaskSpecification
+
+
+def json_default(value: Any) -> Any:
+    if hasattr(value, "item"):
+        return value.item()
+    raise TypeError(f"Object of type {value.__class__.__name__} is not JSON serializable")
+
+
+def load_json(path: str | Path) -> dict[str, Any]:
+    with open(path, "r", encoding="utf-8") as f:
+        data = json.load(f)
+    if not isinstance(data, dict):
+        raise ValueError(f"Expected a JSON object in {path}")
+    return data
+
+
+def dump_json(path: str | Path, payload: dict[str, Any]) -> None:
+    output_path = Path(path)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_path, "w", encoding="utf-8") as f:
+        json.dump(payload, f, indent=2, default=json_default)
+        f.write("\n")
+
+
+def json_files(paths: list[str]) -> list[Path]:
+    """Expand JSON files and directories into a stable file list."""
+    files: list[Path] = []
+    for value in paths:
+        path = Path(value)
+        if path.is_dir():
+            files.extend(sorted(path.rglob("*.json")))
+        else:
+            files.append(path)
+    return files
+
+
+def stable_hash(payload: Any) -> str:
+    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=json_default)
+    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
+
+
+def task_spec_from_payload(data: dict[str, Any]) -> TaskSpecification:
+    if "task_spec" in data and isinstance(data["task_spec"], dict):
+        return TaskSpecification.from_dict(data["task_spec"])
+    if "TaskSpecification" in data and isinstance(data["TaskSpecification"], dict):
+        return TaskSpecification.from_dict(data["TaskSpecification"])
+    required_fields = {"task_id", "maze", "goal", "max_steps"}
+    if not required_fields.issubset(data):
+        raise ValueError(
+            "Input JSON is not a task artifact. Expected task fields or a nested task_spec."
+        )
+    return TaskSpecification.from_dict(data)
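
Reviewer note on `scorer/io.py`: a minimal sketch of why `stable_hash` works as a cache key for `inputs_hash`. It assumes plain JSON-serializable payloads, so the `json_default` hook for NumPy-like scalars is left out. The payload fields and values are hypothetical.

```python
import hashlib
import json
from typing import Any


def stable_hash(payload: Any) -> str:
    """Simplified copy of scorer/io.py::stable_hash (no json_default hook)."""
    # sort_keys plus fixed separators give one canonical encoding per payload.
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


# Hypothetical payloads: same content in a different key order, then a changed value.
hash_a = stable_hash({"task_id": "t1", "max_steps": 50})
hash_b = stable_hash({"max_steps": 50, "task_id": "t1"})
hash_c = stable_hash({"task_id": "t1", "max_steps": 51})

same_for_reordered_keys = hash_a == hash_b
differs_for_changed_value = hash_a != hash_c
```

Key order does not affect the digest, but any change to a value does, so a stage keyed on this hash should rerun only when its inputs really differ.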
diff --git a/scorer/runtime.py b/scorer/runtime.py
new file mode 100644
index 0000000..0d43f8b
--- /dev/null
+++ b/scorer/runtime.py
@@ -0,0 +1,362 @@
+"""Runtime scoring for run and episode JSON artifacts."""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any
+
+from .artifacts import CanonicalPathReport, RuntimeScoreArtifact, StaticScoreArtifact
+from .config import SCORER_VERSION, ScorerConfig
+from .io import dump_json, load_json, stable_hash
+from interface.telemetry import token_count_from_record
+
+
+def _artifact_dict(value: dict[str, Any] | StaticScoreArtifact | CanonicalPathReport) -> dict[str, Any]:
+    if hasattr(value, "to_dict"):
+        return value.to_dict()  # type: ignore[no-any-return]
+    return value
+
+
+def _lookup_path(data: dict[str, Any], *keys: str) -> Any:
+    current: Any = data
+    for key in keys:
+        if not isinstance(current, dict) or key not in current:
+            return None
+        current = current[key]
+    return current
+
+
+def _extract_task_id(run: dict[str, Any], fallback: str = "") -> str:
+    return str(
+        run.get("task_id")
+        or _lookup_path(run, "task_spec", "task_id")
+        or _lookup_path(run, "episode", "task_id")
+        or fallback
+    )
+
+
+def _extract_bool(run: dict[str, Any], *keys: str, default: bool = False) -> bool:
+    for key in keys:
+        value = run.get(key)
+        if value is not None:
+            return bool(value)
+    return default
+
+
+def _extract_steps(run: dict[str, Any]) -> int | None:
+    for key in ("steps", "steps_taken", "steps_used"):
+        if run.get(key) is not None:
+            return int(run[key])
+    signal_steps = _lookup_path(run, "signals", "steps")
+    if signal_steps is not None:
+        return int(signal_steps)
+    final_step = _lookup_path(run, "final_state", "step_count")
+    if final_step is not None:
+        return int(final_step)
+    return None
+
+
+def _extract_token_count(run: dict[str, Any]) -> int | None:
+    for key in ("total_tokens", "token_count", "tokens"):
+        if run.get(key) is not None:
+            return int(run[key])
+    signal_tokens = _lookup_path(run, "signals", "token_count")
+    if signal_tokens is not None:
+        return int(signal_tokens)
+
+    trajectory_total = _sum_record_tokens(run.get("trajectory", []))
+    if trajectory_total is not None:
+        return trajectory_total
+    return _sum_record_tokens(run.get("transcript", []), kind="query")
+
+
+def _sum_record_tokens(records: Any, kind: str | None = None) -> int | None:
+    if not isinstance(records, list):
+        return None
+    total = 0
+    found = False
+    for item in records:
+        if not isinstance(item, dict):
+            continue
+        if kind is not None and item.get("kind") != kind:
+            continue
+        item_tokens = token_count_from_record(item)
+        if item_tokens is not None:
+            total += item_tokens
+            found = True
+    return total if found else None
+
+
+def _state_position(state: Any) -> tuple[int, int] | None:
+    if not isinstance(state, dict):
+        return None
+    raw = state.get("agent_position") or state.get("position")
+    if isinstance(raw, (list, tuple)) and len(raw) >= 2:
+        return int(raw[0]), int(raw[1])
+    return None
+
+
+def _extract_run_positions(run: dict[str, Any]) -> list[tuple[int, int]]:
+    positions: list[tuple[int, int]] = []
+
+    initial_pos = _state_position(run.get("initial_state"))
+    if initial_pos is not None:
+        positions.append(initial_pos)
+
+    for item in run.get("trajectory", []):
+        if not isinstance(item, dict):
+            continue
+        pos = _state_position(item.get("state"))
+        if pos is not None:
+            positions.append(pos)
+
+    for item in run.get("transcript", []):
+        if not isinstance(item, dict):
+            continue
+        if item.get("kind") == "reset":
+            pos = _state_position(item.get("state"))
+        else:
+            pos = _state_position(item.get("state_after"))
+            if pos is None:
+                raw = item.get("position_after")
+                pos = (int(raw[0]), int(raw[1])) if isinstance(raw, (list, tuple)) and len(raw) >= 2 else None
+        if pos is not None:
+            positions.append(pos)
+
+    final_pos = _state_position(run.get("final_state"))
+    if final_pos is not None:
+        positions.append(final_pos)
+
+    deduped: list[tuple[int, int]] = []
+    for pos in positions:
+        if not deduped or deduped[-1] != pos:
+            deduped.append(pos)
+    return deduped
+
+
+def _extract_canonical_positions(
+    canonical_paths: dict[str, Any],
+    agent: str = "bfs",
+) -> list[tuple[int, int]]:
+    path = canonical_paths.get(agent, canonical_paths if agent == "bfs" else {})
+    if not isinstance(path, dict):
+        return []
+    positions = []
+    for pos in path.get("positions", []):
+        if isinstance(pos, (list, tuple)) and len(pos) >= 2:
+            positions.append((int(pos[0]), int(pos[1])))
+    return positions
+
+
+def _cell_overlap(run_positions: list[tuple[int, int]], canonical_positions: list[tuple[int, int]]) -> float:
+    canonical_cells = set(canonical_positions)
+    if not canonical_cells:
+        return 0.0
+    return len(set(run_positions) & canonical_cells) / len(canonical_cells)
+
+
+def _extract_static_score(static_score: dict[str, Any]) -> float:
+    return float(static_score.get("static_score", static_score.get("composite", 0.0)))
+
+
+def _extract_greedy_solvability(static_score: dict[str, Any]) -> float:
+    value = _lookup_path(static_score, "canonical_agent_features", "greedy_solvability")
+    if value is None:
+        raise ValueError("Runtime scoring requires evaluated canonical_agent_features.greedy_solvability")
+    solvability = float(value)
+    if not 0.0 <= solvability <= 1.0:
+        raise ValueError("greedy_solvability must be between 0.0 and 1.0")
+    return solvability
+
+
+def _runtime_weighted_average(signals: dict[str, float], weights: dict[str, float]) -> float:
+    numerator = 0.0
+    denominator = 0.0
+    for key in ("step_ratio", "cell_overlap_bfs", "token_efficiency"):
+        weight = float(weights.get(key, 0.0))
+        numerator += signals[key] * weight
+        denominator += weight
+    return numerator / denominator if denominator else 0.0
+
+
+def _first_present(*values: Any) -> Any:
+    for value in values:
+        if value is not None:
+            return value
+    return None
+
+
+def compute_runtime_score(
+    run: dict[str, Any],
+    static_score: dict[str, Any] | StaticScoreArtifact,
+    canonical_paths: dict[str, Any] | CanonicalPathReport,
+    config: ScorerConfig | None = None,
+    difficulty_max_static_score: float | None = None,
+) -> RuntimeScoreArtifact:
+    """Compute the Stage 4 runtime score for one run JSON payload."""
+    scorer_config = config or ScorerConfig.default()
+    static_data = _artifact_dict(static_score)
+    canonical_data = _artifact_dict(canonical_paths)
+    if _lookup_path(static_data, "validation", "schema_valid") is False:
+        raise ValueError("Runtime scoring requires a schema-valid scored_static.json artifact")
+
+    task_id = _extract_task_id(run, fallback=str(static_data.get("task_id", "")))
+    success = _extract_bool(run, "success", default=bool(_lookup_path(run, "signals", "success") or False))
+    steps = _extract_steps(run)
+    token_count = _extract_token_count(run)
+    canonical_positions = _extract_canonical_positions(canonical_data)
+    greedy_positions = _extract_canonical_positions(canonical_data, agent="greedy")
+    run_positions = _extract_run_positions(run)
+
+    optimal_steps_value = _first_present(
+        _lookup_path(canonical_data, "bfs", "optimal_steps"),
+        canonical_data.get("optimal_steps"),
+        static_data.get("optimal_steps"),
+    )
+    if optimal_steps_value is None:
+        raise ValueError("Runtime scoring requires bfs.optimal_steps in canonical_paths.json")
+    optimal_steps = int(optimal_steps_value)
+    if steps is None:
+        raise ValueError("Runtime scoring requires step telemetry")
+    if steps < 0:
+        raise ValueError("steps must not be negative")
+    step_ratio = 0.0
+    if success and optimal_steps == 0:
+        step_ratio = 1.0 if steps == 0 else 0.0
+    elif success:
+        step_ratio = optimal_steps / max(float(steps), float(optimal_steps), 1.0)
+
+    cell_overlap_bfs = _cell_overlap(run_positions, canonical_positions)
+    cell_overlap_greedy = (
+        _cell_overlap(run_positions, greedy_positions)
+        if isinstance(canonical_data.get("greedy"), dict)
+        else None
+    )
+    if token_count is None:
+        raise ValueError("Runtime scoring requires positive token telemetry")
+    if token_count <= 0:
+        raise ValueError("token_count must be greater than zero")
+    token_efficiency = min(1.0, scorer_config.baseline_tokens / float(token_count))
+
+    static_composite = _extract_static_score(static_data)
+    normalizer = (
+        difficulty_max_static_score
+        if difficulty_max_static_score is not None
+        else scorer_config.difficulty_max_static_score
+    )
+    if normalizer is None:
+        raise ValueError(
+            "Runtime scoring requires difficulty_max_static_score from the task suite "
+            "or scorer config"
+        )
+    if normalizer <= 0:
+        raise ValueError("difficulty_max_static_score must be greater than zero")
+    if static_composite > normalizer:
+        raise ValueError("difficulty_max_static_score must be at least the task static score")
+    difficulty_weight = static_composite / normalizer
+    success_factor = 1.0 if success else 0.0
+    efficiency_signals = {
+        "step_ratio": step_ratio,
+        "cell_overlap_bfs": cell_overlap_bfs,
+        "token_efficiency": token_efficiency,
+    }
+    efficiency_factor = _runtime_weighted_average(
+        efficiency_signals,
+        scorer_config.runtime_weights,
+    )
+    greedy_solvability = _extract_greedy_solvability(static_data)
+    greedy_penalty = (
+        scorer_config.runtime_weights.get("greedy_penalty", 0.0)
+        * greedy_solvability
+        * success_factor
+    )
+    composite = max(0.0, success_factor * efficiency_factor * difficulty_weight - greedy_penalty)
+
+    signals: dict[str, Any] = {
+        "success": success,
+        "steps": steps,
+        "terminated": _extract_bool(run, "terminated", default=False),
+        "truncated": _extract_bool(run, "truncated", default=False),
+        "terminated_reason": run.get("terminated_reason") or run.get("end_reason") or ("success" if success else "unknown"),
+        "reward": run.get("reward", run.get("total_reward")),
+        "token_count": token_count,
+        "optimal_steps": optimal_steps,
+        "step_ratio": step_ratio,
+        "cell_overlap_bfs": cell_overlap_bfs,
+        "cell_overlap_greedy": cell_overlap_greedy,
+        "token_efficiency": token_efficiency,
+        "difficulty_weight": difficulty_weight,
+        "efficiency_factor": efficiency_factor,
+        "greedy_penalty": greedy_penalty,
+    }
+    for key in (
+        "distractor_interactions",
+        "irreversible_failures",
+        "path_choice",
+        "mechanism_interaction_order",
+        "failure_point",
+    ):
+        if run.get(key) is not None:
+            signals[key] = run[key]
+
+    inputs_hash = stable_hash(
+        {
+            "run": {
+                "task_id": task_id,
+                "backend": run.get("backend"),
+                "adapter": run.get("adapter", run.get("agent_or_model")),
+                "model_id": run.get("model_id", run.get("model_name", run.get("agent_or_model"))),
+                "seed": run.get("seed"),
+                "positions": run_positions,
+                "signals": signals,
+            },
+            "static_score": {
+                "task_id": static_data.get("task_id"),
+                "static_score": static_composite,
+                "greedy_solvability": greedy_solvability,
+            },
+            "canonical_paths": {
+                "bfs_positions": canonical_positions,
+                "greedy_positions": greedy_positions,
+                "optimal_steps": optimal_steps,
+            },
+            "config": scorer_config.to_dict(),
+            "scorer_version": SCORER_VERSION,
+        }
+    )
+
+    return RuntimeScoreArtifact(
+        task_id=task_id,
+        backend=str(run.get("backend", "")),
+        adapter=str(run.get("adapter", run.get("agent_or_model", ""))),
+        model_id=str(run.get("model_id", run.get("model_name", run.get("agent_or_model", "")))),
+        seed=int(run["seed"]) if run.get("seed") is not None else None,
+        signals=signals,
+        composite=composite,
+        calibration_version=scorer_config.version,
+        inputs_hash=inputs_hash,
+    )
+
+
+def score_runtime_file(
+    run_path: str | Path,
+    static_score_path: str | Path,
+    canonical_paths_path: str | Path,
+    output_path: str | Path | None = None,
+    config: ScorerConfig | None = None,
+    difficulty_max_static_score: float | None = None,
+) -> RuntimeScoreArtifact:
+    """Score one run JSON file and optionally write run_score.json."""
+    run = load_json(run_path)
+    static_score = load_json(static_score_path)
+    canonical_paths = load_json(canonical_paths_path)
+    score = compute_runtime_score(
+        run,
+        static_score=static_score,
+        canonical_paths=canonical_paths,
+        config=config,
+        difficulty_max_static_score=difficulty_max_static_score,
+    )
+    if output_path is not None:
+        dump_json(output_path, score.to_dict())
+    return score
diff --git a/scorer/scorer_config.json b/scorer/scorer_config.json
new file mode 100644
index 0000000..fb7ed8f
--- /dev/null
+++ b/scorer/scorer_config.json
@@ -0,0 +1,31 @@
+{
+  "version": "default-v2",
+  "static_dimension_weights": {
+    "optimal_path_length": 1.0,
+    "search_space_size": 1.0,
+    "backtracking_required": 1.0,
+    "fragility": 1.0,
+    "dependency_depth": 1.0,
+    "dependency_variety": 1.0,
+    "distractor_count": 1.0,
+    "distractor_quality": 1.0,
+    "grid_size": 1.0,
+    "wall_density": 1.0,
+    "partial_observability": 1.0,
+    "irreversibility": 1.0
+  },
+  "distractor_type_weights": {
+    "wrong_color_key": 1.0,
+    "inactive_switch": 2.0,
+    "decoy_door": 2.0,
+    "distractor_chain": 3.0
+  },
+  "runtime_weights": {
+    "step_ratio": 1.0,
+    "cell_overlap_bfs": 1.0,
+    "token_efficiency": 1.0,
+    "greedy_penalty": 0.5
+  },
+  "baseline_tokens": 1000.0,
+  "difficulty_max_static_score": null
+}
diff --git a/scorer/scoring.py b/scorer/scoring.py
new file mode 100644
index 0000000..6d12100
--- /dev/null
+++ b/scorer/scoring.py
@@ -0,0 +1,45 @@
+"""Public scorer interface for static and runtime analysis."""
+
+from __future__ import annotations
+
+from .artifacts import (
+    CanonicalPathReport,
+    RuntimeScoreArtifact,
+    ScoredDifficulty,
+    StaticScoreArtifact,
+)
+from .config import (
+    CANONICAL_AGENT_FEATURE_NAMES,
+    DEFAULT_CONFIG_PATH,
+    DEFAULT_DISTRACTOR_TYPE_WEIGHTS,
+    DEFAULT_RUNTIME_WEIGHTS,
+    DIMENSION_NAMES,
+    SCORER_VERSION,
+    ScorerConfig,
+    load_scorer_config,
+)
+from .runtime import compute_runtime_score, score_runtime_file
+from .solvers import compute_canonical_paths, compute_greedy_solvability
+from .static import compute_12d_score, compute_static_score_artifact, score_task_file
+
+__all__ = [
+    "CANONICAL_AGENT_FEATURE_NAMES",
+    "DEFAULT_CONFIG_PATH",
+    "DEFAULT_DISTRACTOR_TYPE_WEIGHTS",
+    "DEFAULT_RUNTIME_WEIGHTS",
+    "DIMENSION_NAMES",
+    "SCORER_VERSION",
+    "CanonicalPathReport",
+    "RuntimeScoreArtifact",
+    "ScoredDifficulty",
+    "ScorerConfig",
+    "StaticScoreArtifact",
+    "compute_12d_score",
+    "compute_canonical_paths",
+    "compute_greedy_solvability",
+    "compute_runtime_score",
+    "compute_static_score_artifact",
+    "load_scorer_config",
+    "score_runtime_file",
+    "score_task_file",
+]
diff --git a/scorer/solvers.py b/scorer/solvers.py
new file mode 100644
index 0000000..95fe56f
--- /dev/null
+++ b/scorer/solvers.py
@@ -0,0 +1,74 @@
+"""Canonical solver integration for scorer artifacts."""
+
+from __future__ import annotations
+
+from typing import Any
+
+from gridworld.baselines import PlannedPath, plan_bfs_path, plan_greedy_path
+from gridworld.task_spec import TaskSpecification
+
+from .artifacts import CanonicalPathReport
+
+
+def _path_payload(path: PlannedPath) -> dict[str, Any]:
+    return {
+        "success": path.success,
+        "actions": list(path.action_labels),
+        "positions": [list(pos) for pos in path.positions],
+        "steps": len(path.action_labels),
+    }
+
+
+def require_scorable_spec(spec: TaskSpecification) -> None:
+    """Reject malformed tasks before canonical planners inspect their coordinates."""
+    schema_valid, schema_errors = spec.validate()
+    if not schema_valid:
+        detail = "; ".join(schema_errors)
+        raise ValueError(f"Task {spec.task_id!r} failed schema validation: {detail}")
+
+
+def compute_canonical_paths(
+    spec: TaskSpecification,
+    bfs_path: PlannedPath | None = None,
+    greedy_path: PlannedPath | None = None,
+) -> CanonicalPathReport:
+    """Emit canonical BFS and greedy traces using the merged baseline solvers."""
+    require_scorable_spec(spec)
+    if bfs_path is None:
+        bfs_path = plan_bfs_path(spec)
+    if greedy_path is None:
+        greedy_path = plan_greedy_path(spec)
+
+    if bfs_path.success:
+        message = (
+            f"Solution found in {len(bfs_path.action_labels)} steps "
+            f"({bfs_path.states_explored} states explored)"
+        )
+    elif bfs_path.states_explored:
+        message = (
+            "No solution found "
+            f"({bfs_path.states_explored} states explored, all reachable states checked)"
+        )
+    else:
+        message = "No solution found"
+
+    return CanonicalPathReport(
+        task_id=spec.task_id,
+        success=bfs_path.success,
+        actions=list(bfs_path.action_labels),
+        positions=list(bfs_path.positions),
+        optimal_steps=len(bfs_path.action_labels) if bfs_path.success else 0,
+        states_explored=bfs_path.states_explored,
+        message=message,
+        greedy=_path_payload(greedy_path),
+    )
+
+
+def compute_greedy_solvability(
+    spec: TaskSpecification,
+    greedy_path: PlannedPath | None = None,
+) -> float:
+    """Return 1.0 when the merged greedy planner solves the task, else 0.0."""
+    if greedy_path is None:
+        greedy_path = plan_greedy_path(spec)
+    return 1.0 if greedy_path.success else 0.0
diff --git a/scorer/static.py b/scorer/static.py
new file mode 100644
index 0000000..adac502
--- /dev/null
+++ b/scorer/static.py
@@ -0,0 +1,264 @@
+"""Static task scoring and Stage 2 artifact generation."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+from gridworld.baselines import PlannedPath, plan_bfs_path, plan_greedy_path
+from gridworld.task_spec import TaskSpecification
+from gridworld.task_validator import DifficultyReport, TaskValidator, compute_difficulty
+
+from .artifacts import ScoredDifficulty, StaticScoreArtifact
+from .config import (
+    DEFAULT_DISTRACTOR_TYPE_WEIGHTS,
+    DIMENSION_NAMES,
+    GREEDY_SOLVABILITY_FEATURE,
+    SCORER_VERSION,
+    ScorerConfig,
+)
+from .io import dump_json, load_json, stable_hash, task_spec_from_payload
+from .solvers import compute_canonical_paths, compute_greedy_solvability, require_scorable_spec
+
+
+def _count_backtracking(solution: list[tuple[int, int]] | None) -> float:
+    if not solution:
+        return 0.0
+    seen = set()
+    revisits = 0
+    previous_pos = None
+    for pos in solution:
+        if pos == previous_pos:
+            continue
+        if pos in seen:
+            revisits += 1
+        seen.add(pos)
+        previous_pos = pos
+    return float(revisits)
+
+
+def _dependency_variety(spec: TaskSpecification) -> float:
+    if spec.dependency_chain is not None:
+        return float(len({step.type for step in spec.dependency_chain.sequence}))
+
+    variety = 0
+    if spec.mechanisms.keys and spec.mechanisms.doors:
+        variety += 1
+    if spec.mechanisms.switches and spec.mechanisms.gates:
+        variety += 1
+    if spec.mechanisms.blocks:
+        variety += 1
+    if spec.mechanisms.teleporters:
+        variety += 1
+    if spec.mechanisms.hazards:
+        variety += 1
+    return float(variety)
+
+
+def _distractor_quality(
+    spec: TaskSpecification,
+    distractor_type_weights: dict[str, float] | None = None,
+) -> float:
+    if not spec.distractors:
+        return 0.0
+    weights = distractor_type_weights or DEFAULT_DISTRACTOR_TYPE_WEIGHTS
+    return float(sum(weights.get(d.type, 1.0) for d in spec.distractors))
+
+
+def _partial_observability(spec: TaskSpecification) -> float:
+    mapping = {"full": 0.0, "view_cone": 1.0, "fog_of_war": 2.0}
+    return mapping.get(spec.rules.observability, 0.0)
+
+
+def _irreversibility(spec: TaskSpecification) -> float:
+    score = 0.0
+    if spec.rules.key_consumption:
+        score += float(len(spec.mechanisms.doors))
+    score += float(sum(1 for switch in spec.mechanisms.switches if switch.switch_type == "one_shot"))
+    score += float(sum(1 for tp in spec.mechanisms.teleporters if not tp.bidirectional))
+    return score
+
+
+def compute_12d_score(
+    spec: TaskSpecification,
+    solver_output: DifficultyReport | None = None,
+    weights: list[float] | None = None,
+    config: ScorerConfig | None = None,
+    validator: TaskValidator | None = None,
+    bfs_path: PlannedPath | None = None,
+) -> ScoredDifficulty:
+    """
+    Compute the 12-dimension static benchmark score.
+
+    This keeps the old call shape while calibration and artifact generation
+    live in the standalone scorer package.
+    """
+    require_scorable_spec(spec)
+    scorer_config = config or ScorerConfig.default()
+    task_validator = validator or TaskValidator(spec)
+    if solver_output is None:
+        solver_output = compute_difficulty(spec, validator=task_validator)
+    if bfs_path is None:
+        bfs_path = plan_bfs_path(spec)
+
+    fragility = task_validator.compute_fragility()
+    fragility_value = 0.0 if fragility.min_steps_to_break == -1 else 1.0 / fragility.min_steps_to_break
+
+    width, height = spec.maze.dimensions
+    grid_size = float(width * height)
+    wall_density = float(len(spec.maze.walls) / grid_size) if grid_size else 0.0
+
+    dimensions = [
+        float(len(bfs_path.action_labels) if bfs_path.success else 0),
+        float(bfs_path.states_explored),
+        _count_backtracking(bfs_path.positions),
+        fragility_value,
+        float(spec.dependency_chain.depth if spec.dependency_chain is not None else solver_output.dependency_depth),
+        _dependency_variety(spec),
+        float(len(spec.distractors or [])),
+        _distractor_quality(spec, scorer_config.distractor_type_weights),
+        grid_size,
+        wall_density,
+        _partial_observability(spec),
+        _irreversibility(spec),
+    ]
+
+    weight_vector = (
+        scorer_config.static_weight_list()
+        if weights is None
+        else [float(weight) for weight in weights]
+    )
+    if len(weight_vector) != len(dimensions):
+        raise ValueError(f"Expected {len(dimensions)} static weights, got {len(weight_vector)}")
+    composite = float(sum(d * w for d, w in zip(dimensions, weight_vector)))
+    return ScoredDifficulty(
+        dimensions=dimensions,
+        dimension_names=DIMENSION_NAMES.copy(),
+        composite=composite,
+        weights=weight_vector,
+    )
+
+
+def compute_static_score_artifact(
+    spec: TaskSpecification,
+    config: ScorerConfig | None = None,
+    solver_output: DifficultyReport | None = None,
+    validator: TaskValidator | None = None,
+    validation_result: tuple[bool, list[tuple[int, int]] | None, str] | None = None,
+    bfs_path: PlannedPath | None = None,
+    greedy_path: PlannedPath | None = None,
+) -> StaticScoreArtifact:
+    """Compute the Stage 2 static score artifact for one task."""
+    require_scorable_spec(spec)
+    scorer_config = config or ScorerConfig.default()
+    schema_valid, schema_errors = spec.validate()
+    task_validator = validator or TaskValidator(spec)
+    if validation_result is None:
+        validation_result = task_validator.validate()
+    is_beatable, _, message = validation_result
+    if solver_output is None:
+        solver_output = compute_difficulty(
+            spec,
+            validator=task_validator,
+            validation_result=validation_result,
+        )
+    if bfs_path is None:
+        bfs_path = plan_bfs_path(spec)
+    if is_beatable != bfs_path.success:
+        raise ValueError(
+            "Task validator and canonical BFS disagree on beatability for "
+            f"{spec.task_id!r}"
+        )
+    score = compute_12d_score(
+        spec,
+        solver_output=solver_output,
+        config=scorer_config,
+        validator=task_validator,
+        bfs_path=bfs_path,
+    )
+
+    mechanism_necessity_violations: list[str] = []
+    distractor_safety_violations: list[str] = []
+    chain_ordering_valid = True
+    if schema_valid:
+        mechanism_necessity_violations = task_validator.validate_mechanism_necessity()
+        distractor_safety_violations = task_validator.validate_distractor_safety(
+            base_beatable=is_beatable
+        )
+        chain_ordering_valid = task_validator.validate_chain_ordering()
+
+    dimensions = score.dimensions_by_name
+    static_score_unweighted = float(sum(dimensions.values()))
+    inputs_hash = stable_hash(
+        {
+            "task": spec.to_dict(),
+            "config": scorer_config.to_dict(),
+            "scorer_version": SCORER_VERSION,
+        }
+    )
+
+    return StaticScoreArtifact(
+        task_id=spec.task_id,
+        is_beatable=is_beatable,
+        message=message,
+        dimensions=dimensions,
+        static_score_unweighted=static_score_unweighted,
+        static_score=score.composite,
+        weights=dict(scorer_config.static_dimension_weights),
+        validation={
+            "schema_valid": schema_valid,
+            "schema_errors": schema_errors,
+            "mechanism_necessity_violations": mechanism_necessity_violations,
+            "distractor_safety_violations": distractor_safety_violations,
+            "chain_ordering_valid": chain_ordering_valid,
+        },
+        canonical_agent_features={
+            GREEDY_SOLVABILITY_FEATURE: (
+                compute_greedy_solvability(spec, greedy_path=greedy_path)
+                if schema_valid
+                else None
+            ),
+        },
+        calibration_version=scorer_config.version,
+        inputs_hash=inputs_hash,
+    )
+
+
+def score_task_file(
+    task_path: str | Path,
+    output_dir: str | Path | None = None,
+    config: ScorerConfig | None = None,
+):
+    """Score a task JSON file and optionally write canonical score artifacts."""
+    spec = task_spec_from_payload(load_json(task_path))
+    require_scorable_spec(spec)
+    validator = TaskValidator(spec)
+    validation_result = validator.validate()
+    difficulty = compute_difficulty(
+        spec,
+        validator=validator,
+        validation_result=validation_result,
+    )
+    bfs_path = plan_bfs_path(spec)
+    greedy_path = plan_greedy_path(spec)
+    canonical_paths = compute_canonical_paths(
+        spec,
+        bfs_path=bfs_path,
+        greedy_path=greedy_path,
+    )
+    static_score = compute_static_score_artifact(
+        spec,
+        config=config,
+        solver_output=difficulty,
+        validator=validator,
+        validation_result=validation_result,
+        bfs_path=bfs_path,
+        greedy_path=greedy_path,
+    )
+
+    if output_dir is not None:
+        out = Path(output_dir)
+        out.mkdir(parents=True, exist_ok=True)
+        dump_json(out / "canonical_paths.json", canonical_paths.to_dict())
+        dump_json(out / "scored_static.json", static_score.to_dict())
+
+    return canonical_paths, static_score
diff --git a/scripts/score_json.py b/scripts/score_json.py
new file mode 100644
index 0000000..d2277c6
--- /dev/null
+++ b/scripts/score_json.py
@@ -0,0 +1,172 @@
+#!/usr/bin/env python3
+"""CLI for scoring task and run JSON artifacts."""
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+from scorer.io import dump_json, json_files, load_json
+from scorer.scoring import (
+    ScorerConfig,
+    compute_runtime_score,
+    load_scorer_config,
+    score_runtime_file,
+    score_task_file,
+)
+
+
+def _load_config(args: argparse.Namespace) -> ScorerConfig:
+    return load_scorer_config(args.config)
+
+
+def _static_target_dirs(files: list[Path], output_root: Path | None) -> list[Path]:
+    if output_root is None:
+        return [path.with_name(f"{path.stem}_score") for path in files]
+    if len(files) == 1:
+        return [output_root]
+
+    target_dirs = [output_root / path.stem for path in files]
+    duplicates = sorted(
+        {
+            str(target)
+            for target in target_dirs
+            if target_dirs.count(target) > 1
+        }
+    )
+    if duplicates:
+        raise ValueError(
+            "Static output directories collide for same-stem inputs: "
+            f"{', '.join(duplicates)}. Score those inputs separately or use distinct filenames."
+        )
+    return target_dirs
+
+
+def _default_runtime_output(run_path: str | Path) -> Path:
+    path = Path(run_path)
+    return path.with_name(f"{path.stem}_score.json")
+
+
+def _static(args: argparse.Namespace) -> int:
+    config = _load_config(args)
+    files = json_files(args.inputs)
+    if not files:
+        raise FileNotFoundError("No JSON files matched the static scoring inputs")
+
+    output_root = Path(args.output_dir) if args.output_dir else None
+    for task_path, target_dir in zip(files, _static_target_dirs(files, output_root)):
+        canonical, static_score = score_task_file(
+            task_path,
+            output_dir=target_dir,
+            config=config,
+        )
+        print(
+            f"{static_score.task_id}: static_score={static_score.static_score:.3f}, "
+            f"beatable={static_score.is_beatable}, optimal_steps={canonical.optimal_steps} -> {target_dir}"
+        )
+    return 0
+
+
+def _runtime(args: argparse.Namespace) -> int:
+    config = _load_config(args)
+    output_path = Path(args.output) if args.output else _default_runtime_output(args.run)
+    if (args.static_score is None) != (args.canonical_paths is None):
+        raise ValueError("--static-score and --canonical-paths must be provided together")
+    if (
+        args.difficulty_max_static_score is None
+        and config.difficulty_max_static_score is None
+    ):
+        raise ValueError(
+            "Runtime scoring needs a suite maximum. Pass --difficulty-max-static-score "
+            "or set difficulty_max_static_score in scorer config."
+        )
+
+    if args.static_score and args.canonical_paths:
+        score = score_runtime_file(
+            args.run,
+            static_score_path=args.static_score,
+            canonical_paths_path=args.canonical_paths,
+            output_path=output_path,
+            config=config,
+            difficulty_max_static_score=args.difficulty_max_static_score,
+        )
+    else:
+        if not args.task:
+            raise ValueError(
+                "Runtime scoring needs --static-score and --canonical-paths, "
+                "or --task so those artifacts can be computed."
+            )
+        canonical, static_score = score_task_file(
+            args.task,
+            output_dir=args.artifact_dir,
+            config=config,
+        )
+        run = load_json(args.run)
+        score = compute_runtime_score(
+            run,
+            static_score=static_score,
+            canonical_paths=canonical,
+            config=config,
+            difficulty_max_static_score=args.difficulty_max_static_score,
+        )
+        dump_json(output_path, score.to_dict())
+
+    print(f"{score.task_id}: runtime_score={score.composite:.3f} -> {output_path}")
+    return 0
+
+
+def build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(description="Score MultiNet task and run JSON artifacts.")
+    parser.add_argument(
+        "--config",
+        default=None,
+        help="Optional scorer config JSON/YAML path. Defaults to scorer/scorer_config.json.",
+    )
+
+    subparsers = parser.add_subparsers(dest="command", required=True)
+
+    static_parser = subparsers.add_parser(
+        "static",
+        help="Write canonical_paths.json and scored_static.json for task JSON files.",
+    )
+    static_parser.add_argument("inputs", nargs="+", help="Task JSON files or directories.")
+    static_parser.add_argument(
+        "--output-dir",
+        default=None,
+        help="Directory for score artifacts. Multiple inputs are written under per-file subdirectories; when omitted, each input gets a sibling <stem>_score directory.",
+    )
+    static_parser.set_defaults(func=_static)
+
+    runtime_parser = subparsers.add_parser(
+        "runtime",
+        help="Write a runtime score JSON (default: <run stem>_score.json) for one run/episode JSON artifact.",
+    )
+    runtime_parser.add_argument("run", help="Run or episode JSON file.")
+    runtime_parser.add_argument("--task", default=None, help="Task JSON file, used when static artifacts are omitted.")
+    runtime_parser.add_argument("--static-score", default=None, help="Existing scored_static.json path.")
+    runtime_parser.add_argument("--canonical-paths", default=None, help="Existing canonical_paths.json path.")
+    runtime_parser.add_argument(
+        "--artifact-dir",
+        default=None,
+        help="Optional directory to write computed static artifacts when --task is used.",
+    )
+    runtime_parser.add_argument("--output", default=None, help="Output path for the runtime score JSON. Defaults to <run stem>_score.json beside the run file.")
+    runtime_parser.add_argument(
+        "--difficulty-max-static-score",
+        type=float,
+        default=None,
+        help="Suite max static score for difficulty normalization. Required unless configured.",
+    )
+    runtime_parser.set_defaults(func=_runtime)
+
+    return parser
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = build_parser()
+    args = parser.parse_args(argv)
+    return int(args.func(args))
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/tests/test_interface_token_usage.py b/tests/test_interface_token_usage.py
new file mode 100644
index 0000000..a5919f1
--- /dev/null
+++ b/tests/test_interface_token_usage.py
@@ -0,0 +1,101 @@
+from interface.config import ExperimentConfig
+from interface.loader import default_maze_path, load_task
+from interface.runner import build_runner
+from interface.smoke_tests.plans import v01_empty_room_trajectory
+from interface.smoke_tests.smoke_llm import _AgentRecorder
+from interface.telemetry import normalize_token_usage
+
+
+class UsageReplayAgent:
+    def __init__(self):
+        self._actions = iter(v01_empty_room_trajectory())
+        self.last_usage = None
+
+    def __call__(self, messages):
+        self.last_usage = {
+            "input_tokens": 8,
+            "output_tokens": 2,
+            "total_tokens": 10,
+        }
+        return f"FINAL_OUTPUT: {next(self._actions)}"
+
+
+class FirstQueryUsageReplayAgent(UsageReplayAgent):
+    def __init__(self):
+        super().__init__()
+        self._calls = 0
+
+    def __call__(self, messages):
+        self._calls += 1
+        if self._calls == 1:
+            self.last_usage = {
+                "input_tokens": 8,
+                "output_tokens": 2,
+                "total_tokens": 10,
+            }
+        return f"FINAL_OUTPUT: {next(self._actions)}"
+
+
+def test_normalized_usage_accepts_provider_token_keys():
+    assert normalize_token_usage({"input_tokens": 8, "output_tokens": 2}) == {
+        "input_tokens": 8,
+        "output_tokens": 2,
+        "total_tokens": 10,
+    }
+
+
+def test_agent_recorder_forwards_usage_metadata():
+    records = []
+    recorder = _AgentRecorder(UsageReplayAgent(), records)
+
+    recorder([])
+
+    assert recorder.last_usage == {
+        "input_tokens": 8,
+        "output_tokens": 2,
+        "total_tokens": 10,
+    }
+    assert records[0]["usage"]["total_tokens"] == 10
+
+
+def test_runner_persists_agent_usage_in_query_transcript():
+    maze_path = default_maze_path("V01_empty_room.json")
+    backend, spec = load_task(maze_path)
+    runner = build_runner(
+        ExperimentConfig(
+            observation="text_only",
+            context_window="current",
+            querying="step_by_step",
+            chat_history="stateless",
+        ),
+        backend,
+        spec,
+    )
+
+    result = runner.run(UsageReplayAgent(), verbose=False, maze_path=maze_path)
+    query_records = [item for item in result["transcript"] if item.get("kind") == "query"]
+
+    assert result["success"] is True
+    assert query_records
+    assert query_records[0]["usage"]["total_tokens"] == 10
+
+
+def test_runner_clears_stale_usage_between_queries():
+    maze_path = default_maze_path("V01_empty_room.json")
+    backend, spec = load_task(maze_path)
+    runner = build_runner(
+        ExperimentConfig(
+            observation="text_only",
+            context_window="current",
+            querying="step_by_step",
+            chat_history="stateless",
+        ),
+        backend,
+        spec,
+    )
+
+    result = runner.run(FirstQueryUsageReplayAgent(), verbose=False, maze_path=maze_path)
+    query_records = [item for item in result["transcript"] if item.get("kind") == "query"]
+
+    assert query_records[0]["usage"]["total_tokens"] == 10
+    assert "usage" not in query_records[1]
diff --git a/tests/test_scoring_system.py b/tests/test_scoring_system.py
new file mode 100644
index 0000000..b463e18
--- /dev/null
+++ b/tests/test_scoring_system.py
@@ -0,0 +1,549 @@
+import argparse
+import json
+
+import pytest
+
+from gridworld.actions import MiniGridActions
+from gridworld.baselines import plan_bfs_path, trace_planned_actions
+from gridworld.task_spec import TaskSpecification
+from gridworld.task_validator import TaskValidator
+from scorer.artifacts import CanonicalPathReport, ScoredDifficulty
+from scorer.config import (
+    DEFAULT_CONFIG_PATH,
+    DEFAULT_DISTRACTOR_TYPE_WEIGHTS,
+    DEFAULT_RUNTIME_WEIGHTS,
+    DIMENSION_NAMES,
+    load_scorer_config,
+)
+from scorer.scoring import (
+    ScorerConfig,
+    compute_12d_score,
+    compute_canonical_paths,
+    compute_runtime_score,
+    compute_static_score_artifact,
+    score_task_file,
+)
+from scripts.score_json import _default_runtime_output, _runtime, _static_target_dirs
+
+
+def make_spec(**overrides):
+    data = {
+        "task_id": "scorer_case",
+        "seed": 7,
+        "difficulty_tier": 1,
+        "maze": {
+            "dimensions": [5, 5],
+            "walls": [],
+            "start": [1, 1],
+            "goal": [3, 1],
+        },
+        "mechanisms": {},
+        "rules": {"observability": "full", "view_size": 7},
+        "goal": {"type": "reach_position", "target": [3, 1]},
+        "max_steps": 20,
+    }
+    data.update(overrides)
+    return TaskSpecification.from_dict(data)
+
+
+def test_canonical_paths_include_bfs_actions_and_positions():
+    spec = make_spec()
+
+    report = compute_canonical_paths(spec)
+
+    assert report.success is True
+    assert report.actions == ["move_forward", "move_forward"]
+    assert report.positions == [(1, 1), (2, 1), (3, 1)]
+    assert report.optimal_steps == 2
+    assert report.states_explored > 0
+    assert report.greedy is not None
+    assert report.greedy["success"] is True
+
+
+def test_static_score_uses_configurable_weights():
+    spec = make_spec()
+    default_score = compute_12d_score(spec)
+    config = ScorerConfig.from_dict(
+        {
+            "version": "unit",
+            "static_dimension_weights": {
+                "optimal_path_length": 2.0,
+                "grid_size": 0.0,
+            },
+        }
+    )
+
+    weighted = compute_12d_score(spec, config=config)
+
+    assert weighted.weights[0] == 2.0
+    assert weighted.weights[8] == 0.0
+    assert weighted.composite != default_score.composite
+
+
+def test_static_score_rejects_partial_explicit_weight_vectors():
+    spec = make_spec()
+
+    with pytest.raises(ValueError, match="Expected 12 static weights"):
+        compute_12d_score(spec, weights=[1.0, 2.0])
+    with pytest.raises(ValueError, match="Expected 12 static weights"):
+        compute_12d_score(spec, weights=[])
+
+
+def test_shipped_config_matches_code_defaults():
+    config = load_scorer_config(DEFAULT_CONFIG_PATH)
+
+    assert list(config.static_dimension_weights) == DIMENSION_NAMES
+    assert config.distractor_type_weights == DEFAULT_DISTRACTOR_TYPE_WEIGHTS
+    assert config.runtime_weights == DEFAULT_RUNTIME_WEIGHTS
+
+
+def test_explicit_missing_config_path_fails(tmp_path):
+    with pytest.raises(FileNotFoundError, match="Scorer config not found"):
+        load_scorer_config(tmp_path / "missing_config.json")
+
+
+def test_score_task_file_writes_stage_two_artifacts(tmp_path):
+    spec = make_spec()
+    task_path = tmp_path / "task.json"
+    spec.to_json(str(task_path))
+
+    canonical, static_score = score_task_file(task_path, output_dir=tmp_path / "artifacts")
+
+    assert canonical.success is True
+    assert static_score.is_beatable is True
+    assert (tmp_path / "artifacts" / "canonical_paths.json").exists()
+    scored_path = tmp_path / "artifacts" / "scored_static.json"
+    assert scored_path.exists()
+    with open(scored_path, encoding="utf-8") as f:
+        payload = json.load(f)
+    assert payload["task_id"] == spec.task_id
+    assert "dimensions_12" in payload
+    assert "dimensions" not in payload
+    assert "composite" not in payload
+    assert payload["validation"]["schema_valid"] is True
+    assert payload["canonical_agent_features"]["greedy_solvability"] == 1.0
+
+
+def test_score_task_file_reuses_primary_validator_result(tmp_path, monkeypatch):
+    spec = make_spec()
+    task_path = tmp_path / "task.json"
+    spec.to_json(str(task_path))
+    calls = 0
+    original_validate = TaskValidator.validate
+
+    def count_validate(self, *args, **kwargs):
+        nonlocal calls
+        calls += 1
+        return original_validate(self, *args, **kwargs)
+
+    monkeypatch.setattr(TaskValidator, "validate", count_validate)
+
+    score_task_file(task_path)
+
+    assert calls == 1
+
+
+def test_score_task_file_rejects_invalid_schema_before_planning(tmp_path, monkeypatch):
+    spec = make_spec(
+        maze={
+            "dimensions": [5, 5],
+            "walls": [],
+            "start": [1, 1],
+            "goal": [9, 9],
+        },
+        goal={"type": "reach_position", "target": [9, 9]},
+    )
+    task_path = tmp_path / "task.json"
+    spec.to_json(str(task_path))
+
+    def fail_if_called(*args, **kwargs):
+        raise AssertionError("planner must not execute for schema-invalid tasks")
+
+    monkeypatch.setattr("scorer.static.plan_bfs_path", fail_if_called)
+    monkeypatch.setattr("scorer.static.plan_greedy_path", fail_if_called)
+
+    with pytest.raises(ValueError, match="failed schema validation"):
+        score_task_file(task_path)
+
+
+def test_static_score_uses_canonical_bfs_metrics():
+    spec = make_spec()
+    bfs_path = plan_bfs_path(spec)
+    score = compute_12d_score(spec, bfs_path=bfs_path)
+
+    assert score.dimensions[0] == len(bfs_path.action_labels)
+    assert score.dimensions[1] == bfs_path.states_explored
+
+
+def test_runtime_score_from_episode_json_payload():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+    run = {
+        "task_id": spec.task_id,
+        "backend": "minigrid",
+        "adapter": "unit",
+        "model_id": "unit-model",
+        "seed": 7,
+        "success": True,
+        "steps_taken": 2,
+        "terminated": True,
+        "truncated": False,
+        "total_tokens": 500,
+        "trajectory": [
+            {"state": {"agent_position": [1, 1]}},
+            {"state": {"agent_position": [2, 1]}},
+        ],
+        "final_state": {"agent_position": [3, 1], "step_count": 2},
+    }
+
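+    # Zero out the greedy penalty so a run that also matches the greedy path
+    # can still reach a composite of 1.0.
+    config = ScorerConfig.from_dict({"runtime_weights": {"greedy_penalty": 0.0}})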
+    score = compute_runtime_score(
+        run,
+        static_score=static_score,
+        canonical_paths=canonical,
+        config=config,
+        difficulty_max_static_score=static_score.static_score,
+    )
+
+    assert score.task_id == spec.task_id
+    assert score.composite == 1.0
+    assert score.signals["step_ratio"] == 1.0
+    assert score.signals["cell_overlap_bfs"] == 1.0
+    assert score.signals["cell_overlap_greedy"] == 1.0
+    assert score.signals["token_efficiency"] == 1.0
+    assert "path_choice" not in score.signals
+    assert "distractor_interactions" not in score.signals
+
+
+def test_runtime_score_prefers_interface_state_after_over_row_col_position_after():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+    run = {
+        "success": True,
+        "steps_used": 2,
+        "total_tokens": 100,
+        "end_reason": "success",
+        "task_spec": spec.to_dict(),
+        "initial_state": {"agent_position": [1, 1]},
+        "final_state": {"agent_position": [3, 1], "step_count": 2},
+        "transcript": [
+            {
+                "kind": "reset",
+                "state": {"agent_position": [1, 1]},
+            },
+            {
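+                "kind": "step",
+                # position_after is in (row, col) order; state_after carries the
+                # interface (x, y) position that the scorer is expected to prefer.
+                "position_after": [1, 2],
+                "state_after": {"agent_position": [2, 1]},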
+            },
+            {
+                "kind": "step",
+                "position_after": [1, 3],
+                "state_after": {"agent_position": [3, 1]},
+            },
+        ],
+    }
+
+    config = ScorerConfig.from_dict({"runtime_weights": {"greedy_penalty": 0.0}})
+    score = compute_runtime_score(
+        run,
+        static_score=static_score,
+        canonical_paths=canonical,
+        config=config,
+        difficulty_max_static_score=static_score.static_score,
+    )
+
+    assert score.signals["cell_overlap_bfs"] == 1.0
+
+
+def test_runtime_score_requires_suite_difficulty_normalizer():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+
+    with pytest.raises(ValueError, match="difficulty_max_static_score"):
+        compute_runtime_score(
+            {"success": True, "steps": 2, "total_tokens": 100},
+            static_score=static_score,
+            canonical_paths=canonical,
+        )
+
+
+def test_runtime_score_rejects_suite_max_smaller_than_task_score():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+
+    with pytest.raises(ValueError, match="at least the task static score"):
+        compute_runtime_score(
+            {"success": True, "steps": 2, "total_tokens": 100},
+            static_score=static_score,
+            canonical_paths=canonical,
+            difficulty_max_static_score=static_score.static_score - 1,
+        )
+
+
+def test_runtime_score_rejects_unevaluated_greedy_solvability():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec).to_dict()
+    static_score["canonical_agent_features"]["greedy_solvability"] = None
+
+    with pytest.raises(ValueError, match="greedy_solvability"):
+        compute_runtime_score(
+            {"success": True, "steps": 2, "total_tokens": 100},
+            static_score=static_score,
+            canonical_paths=canonical,
+            difficulty_max_static_score=static_score["static_score"],
+        )
+
+
+def test_runtime_score_rejects_schema_invalid_static_artifact_clearly():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec).to_dict()
+    static_score["validation"]["schema_valid"] = False
+
+    with pytest.raises(ValueError, match="schema-valid"):
+        compute_runtime_score(
+            {"success": True, "steps": 2, "total_tokens": 100},
+            static_score=static_score,
+            canonical_paths=canonical,
+            difficulty_max_static_score=static_score["static_score"],
+        )
+
+
+def test_runtime_token_count_does_not_double_count_nested_step_tokens():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+    score = compute_runtime_score(
+        {
+            "success": True,
+            "steps": 2,
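+            # The same 100 tokens appear at the step level and under `info`;
+            # they must be counted once, not summed.
+            "trajectory": [{"tokens": 100, "info": {"tokens": 100}}],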
+        },
+        static_score=static_score,
+        canonical_paths=canonical,
+        difficulty_max_static_score=static_score.static_score,
+    )
+
+    assert score.signals["token_count"] == 100
+
+
+def test_runtime_token_count_reads_query_transcript_usage():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+    score = compute_runtime_score(
+        {
+            "success": True,
+            "steps": 2,
+            "transcript": [
+                {
+                    "kind": "query",
+                    # 80 input + 20 output tokens should total 100.
+                    "usage": {"input_tokens": 80, "output_tokens": 20},
+                }
+            ],
+        },
+        static_score=static_score,
+        canonical_paths=canonical,
+        difficulty_max_static_score=static_score.static_score,
+    )
+
+    assert score.signals["token_count"] == 100
+
+
+def test_runtime_hash_ignores_non_scoring_transcript_context():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+    base_run = {
+        "success": True,
+        "steps": 2,
+        "total_tokens": 100,
+        "transcript": [
+            {
+                "kind": "query",
+                "agent_messages": [{"role": "user", "content": "first"}],
+            }
+        ],
+    }
+    changed_context = {
+        **base_run,
+        "transcript": [
+            {
+                "kind": "query",
+                "agent_messages": [{"role": "user", "content": "second"}],
+            }
+        ],
+    }
+
+    first = compute_runtime_score(
+        base_run,
+        static_score=static_score,
+        canonical_paths=canonical,
+        difficulty_max_static_score=static_score.static_score,
+    )
+    second = compute_runtime_score(
+        changed_context,
+        static_score=static_score,
+        canonical_paths=canonical,
+        difficulty_max_static_score=static_score.static_score,
+    )
+
+    assert first.inputs_hash == second.inputs_hash
+
+
+@pytest.mark.parametrize("token_count", [None, 0])
+def test_runtime_score_rejects_missing_or_zero_token_telemetry(token_count):
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+    run = {"success": True, "steps": 2}
+    if token_count is not None:
+        run["total_tokens"] = token_count
+
+    with pytest.raises(ValueError, match="token"):
+        compute_runtime_score(
+            run,
+            static_score=static_score,
+            canonical_paths=canonical,
+            difficulty_max_static_score=static_score.static_score,
+        )
+
+
+def test_runtime_score_rejects_missing_step_telemetry():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+
+    with pytest.raises(ValueError, match="step telemetry"):
+        compute_runtime_score(
+            {"success": True, "total_tokens": 100},
+            static_score=static_score,
+            canonical_paths=canonical,
+            difficulty_max_static_score=static_score.static_score,
+        )
+
+
+def test_zero_step_plans_do_not_inflate_optimal_steps_with_done():
+    spec = make_spec(
+        maze={
+            "dimensions": [5, 5],
+            "walls": [],
+            "start": [1, 1],
+            "goal": [1, 1],
+        },
+        goal={"type": "reach_position", "target": [1, 1]},
+    )
+
+    path = plan_bfs_path(spec)
+    # Start equals goal, so a lone DONE action must trace to an empty plan.
+    traced_done = trace_planned_actions(spec, [int(MiniGridActions.DONE)])
+
+    assert path.success is True
+    assert path.action_labels == []
+    assert traced_done.success is True
+    assert traced_done.action_labels == []
+
+
+def test_runtime_zero_step_success_gets_full_step_credit():
+    spec = make_spec(
+        maze={
+            "dimensions": [5, 5],
+            "walls": [],
+            "start": [1, 1],
+            "goal": [1, 1],
+        },
+        goal={"type": "reach_position", "target": [1, 1]},
+    )
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+    score = compute_runtime_score(
+        {
+            "success": True,
+            "steps": 0,
+            "total_tokens": 100,
+            "initial_state": {"agent_position": [1, 1]},
+            "final_state": {"agent_position": [1, 1], "step_count": 0},
+        },
+        static_score=static_score,
+        canonical_paths=canonical,
+        config=ScorerConfig.from_dict({"runtime_weights": {"greedy_penalty": 0.0}}),
+        difficulty_max_static_score=static_score.static_score,
+    )
+
+    assert score.signals["step_ratio"] == 1.0
+    assert score.composite == 1.0
+
+
+def test_static_cli_target_dirs_reject_same_stem_collisions(tmp_path):
+    files = [tmp_path / "a" / "task.json", tmp_path / "b" / "task.json"]
+
+    with pytest.raises(ValueError, match="collide"):
+        _static_target_dirs(files, tmp_path / "scores")
+
+
+def test_runtime_cli_default_output_uses_source_stem(tmp_path):
+    assert _default_runtime_output(tmp_path / "run.json") == tmp_path / "run_score.json"
+    assert _default_runtime_output(tmp_path / "episode.json") == tmp_path / "episode_score.json"
+
+
+def test_runtime_cli_rejects_half_specified_artifacts(tmp_path):
+    args = argparse.Namespace(
+        config=None,
+        run=str(tmp_path / "episode.json"),
+        output=None,
+        static_score=str(tmp_path / "scored_static.json"),
+        canonical_paths=None,
+        task=str(tmp_path / "task.json"),
+        artifact_dir=None,
+        difficulty_max_static_score=100.0,
+    )
+
+    with pytest.raises(ValueError, match="provided together"):
+        _runtime(args)
+
+
+def test_runtime_cli_explains_missing_suite_maximum(tmp_path):
+    args = argparse.Namespace(
+        config=None,
+        run=str(tmp_path / "episode.json"),
+        output=None,
+        static_score=str(tmp_path / "scored_static.json"),
+        canonical_paths=str(tmp_path / "canonical_paths.json"),
+        task=None,
+        artifact_dir=None,
+        difficulty_max_static_score=None,
+    )
+
+    with pytest.raises(ValueError, match="--difficulty-max-static-score"):
+        _runtime(args)
+
+
+def test_artifact_serialization_returns_detached_data():
+    scored = ScoredDifficulty(dimensions=[1.0], dimension_names=["only"], weights=[2.0])
+    scored_payload = scored.to_dict()
+    # Mutate the serialized copy; the source object must stay unchanged.
+    scored_payload["dimensions"][0] = 9.0
+    scored_payload["weights"][0] = 9.0
+
+    canonical = CanonicalPathReport(
+        task_id="task",
+        success=True,
+        actions=["move_forward"],
+        positions=[(1, 1), (2, 1)],
+        optimal_steps=1,
+        states_explored=2,
+        message="ok",
+        greedy={"actions": ["move_forward"]},
+    )
+    canonical_payload = canonical.to_dict()
+    canonical_payload["bfs"]["actions"][0] = "mutated"
+    canonical_payload["greedy"]["actions"][0] = "mutated"
+
+    assert scored.dimensions == [1.0]
+    assert scored.weights == [2.0]
+    assert canonical.actions == ["move_forward"]
+    assert canonical.greedy == {"actions": ["move_forward"]}