diff --git a/docs/system_design.md b/docs/system_design.md
index 16010d4..6f1dad3 100644
--- a/docs/system_design.md
+++ b/docs/system_design.md
@@ -18,7 +18,7 @@ This document is the single canonical source of truth for how the MultiNet v2.0
 1. [Overview & north stars](#1-overview--north-stars)
 2. [Pipeline DAG: stages, artifacts, invalidation](#2-pipeline-dag-stages-artifacts-invalidation)
 3. [Task spec contract](#3-task-spec-contract)
-4. [Static scoring (13 dimensions)](#4-static-scoring-13-dimensions)
+4. [Static scoring (12 dimensions plus canonical-agent features)](#4-static-scoring-12-dimensions-plus-canonical-agent-features)
 5. [Runtime scoring](#5-runtime-scoring)
 6. [Backend & inference adapter contracts](#6-backend--inference-adapter-contracts)
 7. [Reporting & aggregate](#7-reporting--aggregate)
@@ -74,18 +74,18 @@ The pipeline is a five-stage DAG. Each stage has declared inputs and outputs and
 2. **Solve & Score-static**
    - Inputs: `task.json`.
    - Outputs:
-     - `canonical_paths.json` `{ bfs: { path, steps, states_explored }, greedy: { success, path, steps }, … }`
-     - `scored.json` `{ is_beatable, dimensions[13], fragility, mechanism_necessity_violations, distractor_safety_violations, message }`
+     - `canonical_paths.json` `{ bfs: { actions, positions, optimal_steps, states_explored }, greedy: { success, actions, positions, steps }, … }`
+     - `scored_static.json` `{ is_beatable, dimensions_12, canonical_agent_features, validation, message }`
    - Hash key: `hash(solver_v, scorer_v, task.json, agent_set_v)`.
-   - If `scored.json.is_beatable == false`, downstream stages skip the task; it is logged as ineligible and surfaced in reports.
+   - If `scored_static.json.is_beatable == false`, downstream stages skip the task; it is logged as ineligible and surfaced in reports.

 3. **Render-and-Run**
-   - Inputs: `task.json`, `scored.json` (gate on `is_beatable`), backend choice, adapter choice, `model_id`, `seed`.
+   - Inputs: `task.json`, `scored_static.json` (gate on `is_beatable`), backend choice, adapter choice, `model_id`, `seed`.
    - Outputs: `run.json` `{ trajectory, actions, tokens, terminated, success }`.
    - Hash key: `hash(backend_v, adapter_v, model_id, task.json, seed)`.

 4. **Score-runtime**
-   - Inputs: `run.json`, `scored.json`, `canonical_paths.json`.
+   - Inputs: `run.json`, `scored_static.json`, `canonical_paths.json`.
    - Outputs: `run_score.json` `{ success, step_ratio, cell_overlap_*, distractor_interactions, irreversible_failures, tokens, composite }`.
    - Hash key: `hash(runtime_scorer_v, inputs)`.

@@ -106,7 +106,7 @@ artifacts/
 ├── tasks//
 │   ├── task.json  # Stage 1
 │   ├── canonical_paths.json  # Stage 2 (a)
-│   └── scored.json  # Stage 2 (b) — includes is_beatable
+│   └── scored_static.json  # Stage 2 (b) — includes is_beatable
 ├── runs//////
 │   ├── run.json  # Stage 3
 │   └── run_score.json  # Stage 4
@@ -218,9 +218,9 @@ Enforced by `TaskSpecification.validate()`:

 ---

-## 4. Static scoring (13 dimensions)
+## 4. Static scoring (12 dimensions plus canonical-agent features)

-Static scoring runs once per task at pipeline stage 2 (Solve & Score-static). It produces `scored.json`, which carries `is_beatable` plus a 13-dimension vector and supporting validation reports. The static scorer consumes `task.json` and `canonical_paths.json`.
+Static scoring runs once per task at pipeline stage 2 (Solve & Score-static). It produces `scored_static.json`, which carries `is_beatable`, a 12-dimension vector, canonical-agent features, and supporting validation reports. The scorer consumes `task.json` and emits this artifact alongside `canonical_paths.json`.

 ### 4.1 Dimensions

@@ -238,7 +238,7 @@ All raw values are floats (or counts cast to float). Higher = harder *unless* ex
 10. **`wall_density`** — Source: spec. Computation: `len(walls) / grid_size`. Crude (does not separate interior vs functional walls); **calibration target**.
 11. **`partial_observability`** — Source: spec rules. Computation: ordinal `{full: 0, view_cone: 1, fog_of_war: 2}` from `rules.observability`.
 12. **`irreversibility`** — Source: spec rules + mechanisms. Computation: `key_consumption × #doors + #one_shot_switches + #non_bidirectional_teleporters`.
-13. **`greedy_solvability`** — Source: Greedy canonical agent. Computation: `1.0 if greedy succeeds else 0.0`. **Penalty** (greedy-solvable tasks lower the runtime composite, on the rationale that they are less a test of spatial reasoning).
+`greedy_solvability` is recorded separately under `canonical_agent_features`, rather than appended to the calibrated 12-dimension vector. Source: Greedy canonical agent. Computation: `1.0 if greedy succeeds else 0.0`. **Penalty** (greedy-solvable tasks lower the runtime composite, on the rationale that they are less a test of spatial reasoning).

 ### 4.2 Static composite (difficulty score)

@@ -246,13 +246,13 @@ All raw values are floats (or counts cast to float). Higher = harder *unless* ex
 static_composite = Σ_i (raw_dim_i × calibration.weights[dim_name_i])
 ```

-- `calibration.weights` lives in `calibration.yaml`; defaults to `1.0` for all dimensions until empirical tuning.
+- Calibration weights live in `scorer/scorer_config.json` by default; optional JSON or YAML overrides may be passed explicitly. Weights default to `1.0` for all dimensions until empirical tuning.
 - `static_composite` is used for task ranking and live-benchmark filtering (e.g., reject tasks whose composite falls outside a tier's target range).
 - It is *not* used directly in runtime scoring; runtime uses individual dimensions plus a derived "difficulty weight" (Section 5).

-### 4.3 Validation reports (also in `scored.json`)
+### 4.3 Validation reports (also in `scored_static.json`)

-Beyond the dimension vector, `scored.json` carries the validator's structural reports:
+Beyond the dimension vector, `scored_static.json` carries the validator's structural reports:

 - `is_beatable` (bool) and `message` (str) — gate for downstream stages.
 - `mechanism_necessity_violations` (list of strings) — mechanisms whose removal still leaves the task solvable; flags accidental decoration.
@@ -260,6 +260,7 @@ Beyond the dimension vector, `scored.json` carries the validator's structural re
 - `chain_ordering_valid` (bool) — each dependency step actually gates the next.

 These do not enter the composite but are surfaced in reports for task-quality auditing.
+Schema-invalid tasks are rejected before canonical planners execute and do not emit score artifacts.

 ### 4.4 Calibration notes

@@ -272,16 +273,16 @@ These do not enter the composite but are surfaced in reports for task-quality au

 ## 5. Runtime scoring

-Runtime scoring runs at pipeline stage 4 (Score-runtime), once per `run.json`. It produces `run_score.json`. It consumes the run trajectory plus the static scoring artifacts (`scored.json`, `canonical_paths.json`).
+Runtime scoring runs at pipeline stage 4 (Score-runtime), once per `run.json`. It produces `run_score.json`. It consumes the run trajectory plus the static scoring artifacts (`scored_static.json`, `canonical_paths.json`).

 ### 5.1 Per-run signal vector

 Recorded for every `(task, backend, adapter, model_id, seed)`:

 - `success` (bool) — goal reached within `max_steps`, no terminal hazard.
-- `steps` (int) — agent's actual step count.
+- `steps` (int) — agent's actual step count. Required; runtime scoring rejects missing telemetry.
 - `terminated_reason` (str) — one of `{goal_reached, hazard, max_steps, deadlock, invalid_action_excess}`.
-- `token_count` (int) — total prompt + response tokens summed over all model turns.
+- `token_count` (positive int) — total prompt + response tokens summed over all model turns. Required; runtime scoring rejects missing or non-positive telemetry.
 - `distractor_interactions` (int) — count of distractor-element interactions (any `pickup` / `toggle` / `push` on an element registered as a distractor).
 - `irreversible_failures` (int) — count of irreversible actions that broke solvability, detected by re-running the validator from the post-action state.

@@ -298,11 +299,11 @@ composite = success_factor × efficiency_factor × difficulty_weight − greedy_
 ```

 - `success_factor = 1.0 if success else 0.0` — hard gate; failed runs score 0 regardless of efficiency.
-- `efficiency_factor = α × step_ratio + β × cell_overlap_bfs + γ × token_efficiency` — weighted blend; default `α = β = γ = 1/3`. `token_efficiency = min(1, baseline_tokens / max(model_tokens, 1))` where `baseline_tokens` lives in `calibration.yaml`.
-- `difficulty_weight = normalize(static_composite)` — harder tasks contribute more. Default normalization: `f(x) = x / max_observed_static_composite_in_suite`.
+- `efficiency_factor = α × step_ratio + β × cell_overlap_bfs + γ × token_efficiency` — weighted blend; default `α = β = γ = 1/3`. `token_efficiency = min(1, baseline_tokens / model_tokens)` where `baseline_tokens` lives in scorer config. Missing or non-positive token telemetry is an artifact error, not a neutral score.
+- `difficulty_weight = normalize(static_composite)` — harder tasks contribute more. Default normalization: `f(x) = x / max_observed_static_composite_in_suite`. Runtime scoring requires that suite maximum either in scorer config or as an explicit runtime argument.
 - `greedy_penalty = δ × greedy_solvability × success_factor` — applied only to successful runs; `δ` is a calibration coefficient with default 0.5.

-All Greek-letter coefficients (`α, β, γ, δ`) and the normalization function live in `calibration.yaml`. The design commits to the *shape*, not the values.
+All Greek-letter coefficients (`α, β, γ, δ`) and the normalization value live in scorer config. The design commits to the *shape*, not the values.

 ### 5.4 Single-point benchmark score (ARC-AGI style)

@@ -340,7 +341,8 @@ Defaults to a uniform mean. Calibration may switch to a tier-weighted or difficu
 ### 5.6 Calibration notes

 - All composite coefficients ship as `1.0` or sensible defaults; the design does not claim correctness.
-- `calibration.yaml` is versioned in git; changes bump `calibration_version` and trigger stage-4 / stage-5 invalidation.
+- `scorer/scorer_config.json` is versioned in git; changes bump `calibration_version` and trigger stage-4 / stage-5 invalidation.
+- The shipped config intentionally leaves `difficulty_max_static_score` unset. Runtime scoring requires a calibrated suite maximum through config or `--difficulty-max-static-score`.
 - After a calibration update, the pipeline regenerates `run_score.json` and `reports/` from cached `run.json`. Run records do **not** re-execute model calls. This is a deliberate consequence of the DAG split.

 ---
@@ -533,16 +535,16 @@ Status legend:

 **2. Validator** — folded into Stage 2
 - ✅ `gridworld/task_validator.py::TaskValidator` does exhaustive BFS over the full mechanism state space, plus `compute_fragility`, `validate_mechanism_necessity`, `validate_chain_ordering`, `validate_distractor_safety`.
-- Delta: surface validation reports into `scored.json` instead of emitting a separate `validity.json`.
+- Delta: surface validation reports into `scored_static.json` instead of emitting a separate `validity.json`.

 **3. Solver suite (canonical agents)** — Stage 2
-- ⚠️ BFS exists inside `TaskValidator._find_solution`. Greedy does not yet exist as a separate canonical agent.
-- 🚧 Multi-tier solver suite pending; Greedy is the next addition, then heuristic, then random.
-- Delta: extract BFS path emission as one canonical agent, add Greedy as a peer, write combined output to `canonical_paths.json`.
+- ✅ `gridworld/baselines.py` exposes BFS and Greedy planners; `scorer/solvers.py` writes their combined output to `canonical_paths.json`.
+- 🚧 Heuristic and random canonical-agent peers remain optional future additions.
+- Delta: add calibration runs before extending the canonical-agent feature vector.

 **4. Static scorer** — Stage 2
-- ⚠️ `gridworld/scoring.py::compute_12d_score` exists with 12 dimensions matching dimensions 1–12 of §4 (modulo formula calibration).
-- Delta: add dimension 13 (`greedy_solvability`), restructure output to `scored.json` sidecar, move composite weights to `calibration.yaml`, include validation reports.
+- ✅ `scorer/scoring.py::compute_12d_score` exposes the public interface for the 12 calibrated dimensions and writes `scored_static.json` with validation reports plus `canonical_agent_features.greedy_solvability`.
+- Delta: empirically calibrate the shipped placeholder weights.

 **5. `MiniGridBackend`** — backend axis
 - ✅ `gridworld/backends/minigrid_backend.py` implements `AbstractGridBackend` for square grids with discrete actions + RGB rendering.
@@ -566,8 +568,8 @@ Status legend:
 - Delta: emit canonical `run.json`; remove inline scoring (move to Stage 4); add per-step trajectory recording.

 **10. Runtime scorer** — Stage 4
-- 🚧 Does not exist as a component. Some scoring logic lives inside `evaluation_harness.py`.
-- Delta: new module that consumes `run.json` + `scored.json` + `canonical_paths.json` and produces `run_score.json`.
+- ✅ `scorer/runtime.py` consumes `run.json` + `scored_static.json` + `canonical_paths.json` and produces `run_score.json`.
+- Delta: populate optional interaction diagnostics in runtime producers and calibrate the suite-level difficulty maximum.

 **11. Aggregator / reporter** — Stage 5
 - ⚠️ Partial. `evaluation_harness.py` produces some summary dicts; nothing matches the per-run-set artifact layout.
@@ -597,7 +599,7 @@ Items the design intentionally defers. None block initial implementation.
 - DAG runner technology — Snakemake leading candidate; final pick deferred to implementation.
 - Token-efficiency baseline (`baseline_tokens`) — per-task vs global constant; needs a sensible default once a few model runs exist.

-### 9.2 Calibration coefficients (live in `calibration.yaml`, default to placeholders)
+### 9.2 Calibration coefficients (live in scorer config, default to placeholders)
 - Runtime composite blend weights `α, β, γ` (step ratio / cell overlap / token efficiency).
 - Greedy penalty coefficient `δ`.
 - `difficulty_weight` normalization function (currently `x / max_observed`; may switch to a percentile or log normalization).
@@ -645,7 +647,7 @@ Mapping to the canonical pipeline:
 | JSON generator | Stage 1 (Generate) | §2.1 |
 | Task spec / Validator | folded into Stage 2 (Solve & Score-static) | §2.1 |
 | BFS-greedy agents | Multi-tier canonical agent suite (Stage 2) | §2.1, §4 |
-| Score calculation (static) | Static scoring (13 dimensions) (Stage 2) | §4 |
+| Score calculation (static) | Static scoring (12 dimensions plus canonical-agent features) (Stage 2) | §4 |
 | Backend Generator | Backend axis: `MiniGridBackend` / `MultiGridBackend` / `TextBackend` | §6 |
 | Inference scripts | Adapter axis: `ModelInterface` implementations | §6 |
 | Scoring code (final score, comparison) | Runtime scoring (Stage 4) + Aggregate (Stage 5) | §5, §7 |
diff --git a/evaluation_harness.py b/evaluation_harness.py
index 57fa3ee..55ea840 100644
--- a/evaluation_harness.py
+++ b/evaluation_harness.py
@@ -22,7 +22,8 @@
     from .gridworld.task_spec import TaskSpecification
     from .gridworld.actions import ACTION_NAMES, ACTION_DESCRIPTIONS
     from .gridworld.task_validator import compute_difficulty
-    from .gridworld.scoring import compute_12d_score
+    from .scorer.io import json_default as _json_default
+    from .scorer.scoring import compute_12d_score
 except ImportError:
     from model_interface import ModelInterface, ModelInput, ModelOutput
     from gridworld.runner.grid_runner import GridRunner, EpisodeResult
@@ -31,14 +32,8 @@
     from gridworld.task_spec import TaskSpecification
     from gridworld.actions import ACTION_NAMES, ACTION_DESCRIPTIONS
     from gridworld.task_validator import compute_difficulty
-    from gridworld.scoring import compute_12d_score
-
-
-def _json_default(value):
-    """Convert NumPy scalars to native Python types for JSON serialization."""
-    if isinstance(value, np.generic):
-        return value.item()
-    raise TypeError(f"Object of type {value.__class__.__name__} is not JSON serializable")
+    from scorer.io import json_default as _json_default
+    from scorer.scoring import compute_12d_score


 @dataclass
diff --git a/gridworld/__init__.py b/gridworld/__init__.py
index 27425ed..fd567a0 100644
--- a/gridworld/__init__.py
+++ b/gridworld/__init__.py
@@ -1,7 +1,7 @@
 """Gridworld domain for MultiNet-v2.0.

-This module provides task schema, validation, and scoring utilities for
-gridworld puzzle specifications.
+This module provides task schema and validation utilities for gridworld
+puzzle specifications.
 """

 from .bootstrap import disable_gymnasium_env_plugins
@@ -32,9 +32,6 @@
     TaskValidator,
     compute_difficulty,
 )
-from .scoring import ScoredDifficulty, compute_12d_score
-
-
 __all__ = [
     # Task specification
     "Position",
@@ -58,6 +55,4 @@
     "DifficultyReport",
     "FragilityReport",
     "compute_difficulty",
-    "ScoredDifficulty",
-    "compute_12d_score",
 ]
diff --git a/gridworld/baselines.py b/gridworld/baselines.py
index ee5c8b0..2ab41f3 100644
--- a/gridworld/baselines.py
+++ b/gridworld/baselines.py
@@ -49,6 +49,17 @@ class Transition:
     next_state: PlannerState


+@dataclass(frozen=True)
+class PlannedPath:
+    """Planner output with replayed positions for scorer/reporting artifacts."""
+
+    success: bool
+    actions: list[int]
+    action_labels: list[str]
+    positions: list[tuple[int, int]]
+    states_explored: int = 0
+
+
 class TaskPlanningContext:
     """Fast lookup tables derived from a ``TaskSpecification``."""

@@ -353,10 +364,10 @@ def _shortest_plan(
     ctx: TaskPlanningContext,
     start: PlannerState,
     is_goal: Callable[[PlannerState], bool],
-) -> tuple[list[int], PlannerState | None]:
+) -> tuple[list[int], PlannerState | None, int]:
     """Run BFS over executable actions and return the first shortest plan."""
     if is_goal(start):
-        return [], start
+        return [], start, 1

     queue = deque([start])
     parent: dict[PlannerState, tuple[PlannerState, int]] = {}
@@ -370,10 +381,14 @@ def _shortest_plan(
             visited.add(transition.next_state)
             parent[transition.next_state] = (state, transition.action)
             if is_goal(transition.next_state):
-                return _reconstruct_actions(parent, transition.next_state), transition.next_state
+                return (
+                    _reconstruct_actions(parent, transition.next_state),
+                    transition.next_state,
+                    len(visited),
+                )
             queue.append(transition.next_state)

-    return [], None
+    return [], None, len(visited)


 def _shortest_plan_to_interaction(
@@ -437,9 +452,18 @@ def _reconstruct_actions(


 def _bfs_actions(spec: TaskSpecification) -> list[int]:
+    actions, _ = _bfs_actions_with_stats(spec)
+    return actions
+
+
+def _bfs_actions_with_stats(spec: TaskSpecification) -> tuple[list[int], int]:
     ctx = TaskPlanningContext(spec)
-    actions, _ = _shortest_plan(ctx, ctx.initial_state(), lambda st: st.agent_pos == ctx.goal)
-    return actions or [int(MiniGridActions.DONE)]
+    actions, _, states_explored = _shortest_plan(
+        ctx,
+        ctx.initial_state(),
+        lambda st: st.agent_pos == ctx.goal,
+    )
+    return actions, states_explored


 def _greedy_actions(spec: TaskSpecification) -> list[int]:
@@ -452,13 +476,81 @@ def _greedy_actions(spec: TaskSpecification) -> list[int]:
             break
         chunk, next_state = _shortest_plan_to_interaction(ctx, state)
         if next_state is None:
-            chunk, next_state = _shortest_plan(ctx, state, lambda st: st.agent_pos == ctx.goal)
+            chunk, next_state, _ = _shortest_plan(
+                ctx,
+                state,
+                lambda st: st.agent_pos == ctx.goal,
+            )
         if next_state is None or not chunk:
             break
         actions.extend(chunk)
         state = next_state

-    return actions or [int(MiniGridActions.DONE)]
+    return actions
+
+
+def trace_planned_actions(spec: TaskSpecification, actions: list[int]) -> PlannedPath:
+    """Replay planner actions through the planner graph without running a backend."""
+    ctx = TaskPlanningContext(spec)
+    state = ctx.initial_state()
+    positions = [state.agent_pos]
+    executed_actions: list[int] = []
+    labels: list[str] = []
+
+    for action in actions:
+        if action == int(MiniGridActions.DONE):
+            break
+        executed_actions.append(action)
+        transition = next(
+            (candidate for candidate in _successors(ctx, state) if candidate.action == action),
+            None,
+        )
+        if transition is None:
+            labels.append(f"invalid:{action}")
+            return PlannedPath(
+                success=False,
+                actions=executed_actions,
+                action_labels=labels,
+                positions=positions,
+            )
+        labels.append(transition.label)
+        state = transition.next_state
+        positions.append(state.agent_pos)
+
+    return PlannedPath(
+        success=state.agent_pos == ctx.goal,
+        actions=executed_actions,
+        action_labels=labels,
+        positions=positions,
+    )
+
+
+def plan_bfs_actions(spec: TaskSpecification) -> list[int]:
+    """Return the deterministic BFS baseline action plan."""
+    return _bfs_actions(spec)
+
+
+def plan_greedy_actions(spec: TaskSpecification) -> list[int]:
+    """Return the deterministic greedy baseline action plan."""
+    return _greedy_actions(spec)
+
+
+def plan_bfs_path(spec: TaskSpecification) -> PlannedPath:
+    """Return the BFS baseline plan plus replayed positions."""
+    actions, states_explored = _bfs_actions_with_stats(spec)
+    path = trace_planned_actions(spec, actions)
+    return PlannedPath(
+        success=path.success,
+        actions=path.actions,
+        action_labels=path.action_labels,
+        positions=path.positions,
+        states_explored=states_explored,
+    )
+
+
+def plan_greedy_path(spec: TaskSpecification) -> PlannedPath:
+    """Return the greedy baseline plan plus replayed positions."""
+    return trace_planned_actions(spec, plan_greedy_actions(spec))


 class PlannedBaselineModel(ModelInterface):
diff --git a/gridworld/scoring.py b/gridworld/scoring.py
deleted file mode 100644
index 9dd3670..0000000
--- a/gridworld/scoring.py
+++ /dev/null
@@ -1,152 +0,0 @@
-"""12-dimension scoring for gridworld tasks."""
-
-from __future__ import annotations
-
-from dataclasses import dataclass, field
-
-from .task_spec import TaskSpecification
-from .task_validator import DifficultyReport, TaskValidator
-
-
-DIMENSION_NAMES = [
-    "optimal_path_length",
-    "search_space_size",
-    "backtracking_required",
-    "fragility",
-    "dependency_depth",
-    "dependency_variety",
-    "distractor_count",
-    "distractor_quality",
-    "grid_size",
-    "wall_density",
-    "partial_observability",
-    "irreversibility",
-]
-
-
-@dataclass
-class ScoredDifficulty:
-    """Full 12-dimension score report."""
-    dimensions: list[float]
-    dimension_names: list[str] = field(default_factory=lambda: DIMENSION_NAMES.copy())
-    composite: float = 0.0
-    weights: list[float] = field(default_factory=lambda: [1.0] * len(DIMENSION_NAMES))
-
-    def to_dict(self) -> dict:
-        return {
-            "dimensions": self.dimensions,
-            "dimension_names": self.dimension_names,
-            "composite": self.composite,
-            "weights": self.weights,
-        }
-
-
-def _count_backtracking(solution: list[tuple[int, int]] | None) -> float:
-    if not solution:
-        return 0.0
-    seen = set()
-    revisits = 0
-    previous_pos = None
-    for pos in solution:
-        if pos == previous_pos:
-            continue
-        if pos in seen:
-            revisits += 1
-        seen.add(pos)
-        previous_pos = pos
-    return float(revisits)
-
-
-def _dependency_variety(spec: TaskSpecification) -> float:
-    if spec.dependency_chain is not None:
-        return float(len({step.type for step in spec.dependency_chain.sequence}))
-
-    variety = 0
-    if spec.mechanisms.keys and spec.mechanisms.doors:
-        variety += 1
-    if spec.mechanisms.switches and spec.mechanisms.gates:
-        variety += 1
-    if spec.mechanisms.blocks:
-        variety += 1
-    if spec.mechanisms.teleporters:
-        variety += 1
-    if spec.mechanisms.hazards:
-        variety += 1
-    return float(variety)
-
-
-def _distractor_quality(spec: TaskSpecification) -> float:
-    if not spec.distractors:
-        return 0.0
-    weights = {
-        "wrong_color_key": 1.0,
-        "inactive_switch": 2.0,
-        "decoy_door": 2.0,
-        "distractor_chain": 3.0,
-    }
-    return float(sum(weights.get(d.type, 1.0) for d in spec.distractors))
-
-
-def _partial_observability(spec: TaskSpecification) -> float:
-    mapping = {"full": 0.0, "view_cone": 1.0, "fog_of_war": 2.0}
-    return mapping.get(spec.rules.observability, 0.0)
-
-
-def _irreversibility(spec: TaskSpecification) -> float:
-    score = 0.0
-    if spec.rules.key_consumption:
-        score += float(len(spec.mechanisms.doors))
-    score += float(sum(1 for switch in spec.mechanisms.switches if switch.switch_type == "one_shot"))
-    score += float(sum(1 for tp in spec.mechanisms.teleporters if not tp.bidirectional))
-    return score
-
-
-def compute_12d_score(
-    spec: TaskSpecification,
-    solver_output: DifficultyReport | None = None,
-    weights: list[float] | None = None,
-) -> ScoredDifficulty:
-    """
-    Compute the full 12-dimension benchmark score.
- - This wraps solver-derived metrics with rubric dimensions such as - fragility, dependency variety, distractor quality, partial observability, - wall density, and irreversibility. The compact solver report remains in - compute_difficulty for callers that only need path/search metrics. - """ - validator = TaskValidator(spec) - is_beatable, solution, message = validator.validate() - if solver_output is None: - from .task_validator import compute_difficulty - - solver_output = compute_difficulty(spec) - - fragility = validator.compute_fragility() - fragility_value = 0.0 if fragility.min_steps_to_break == -1 else 1.0 / fragility.min_steps_to_break - - width, height = spec.maze.dimensions - grid_size = float(width * height) - wall_density = float(len(spec.maze.walls) / grid_size) if grid_size else 0.0 - - dimensions = [ - float(solver_output.optimal_steps), - float(solver_output.states_explored), - float(solver_output.backtrack_count if hasattr(solver_output, "backtrack_count") else _count_backtracking(solution)), - fragility_value, - float(spec.dependency_chain.depth if spec.dependency_chain is not None else solver_output.dependency_depth), - _dependency_variety(spec), - float(len(spec.distractors or [])), - _distractor_quality(spec), - grid_size, - wall_density, - _partial_observability(spec), - _irreversibility(spec), - ] - - weight_vector = weights or [1.0] * len(DIMENSION_NAMES) - composite = float(sum(d * w for d, w in zip(dimensions, weight_vector))) - return ScoredDifficulty( - dimensions=dimensions, - composite=composite, - weights=weight_vector, - ) diff --git a/gridworld/task_validator.py b/gridworld/task_validator.py index aee948f..4befedf 100644 --- a/gridworld/task_validator.py +++ b/gridworld/task_validator.py @@ -493,12 +493,13 @@ def validate_chain_ordering(self) -> bool: return False return True - def validate_distractor_safety(self) -> list[str]: + def validate_distractor_safety(self, base_beatable: bool | None = None) -> list[str]: """Check whether 
a single distractor interaction can make the task unsolvable.""" if not self.spec.distractors: return [] - base_beatable, _, _ = self.validate() + if base_beatable is None: + base_beatable, _, _ = self.validate() if not base_beatable: return ["Base task is not solvable"] @@ -767,17 +768,23 @@ def to_dict(self) -> dict: } -def compute_difficulty(spec: TaskSpecification) -> DifficultyReport: +def compute_difficulty( + spec: TaskSpecification, + validator: TaskValidator | None = None, + validation_result: tuple[bool, Optional[list[tuple[int, int]]], str] | None = None, +) -> DifficultyReport: """ Compute solver-derived difficulty metrics for a task. This is a compact report centered on BFS output: beatability, shortest action count, states explored, coarse mechanism complexity, and a legacy - composite score. Use compute_12d_score when the full rubric vector is + composite score. Use scorer.scoring.compute_12d_score when the full rubric vector is needed for benchmark comparison. """ - validator = TaskValidator(spec) - is_beatable, solution, message = validator.validate() + task_validator = validator or TaskValidator(spec) + if validation_result is None: + validation_result = task_validator.validate() + is_beatable, solution, message = validation_result optimal_steps = len(solution) - 1 if solution else 0 # -1 because path includes start # Extract states_explored from message diff --git a/interface/agents/claude.py b/interface/agents/claude.py index 9a6fc8e..1a2466b 100644 --- a/interface/agents/claude.py +++ b/interface/agents/claude.py @@ -17,6 +17,7 @@ parse_runner_content, split_system_prompt, ) +from interface.telemetry import normalize_token_usage logger = logging.getLogger(__name__) @@ -83,7 +84,7 @@ def _post_messages( system: Optional[str], messages: List[Dict[str, object]], timeout: Optional[float], -) -> str: +) -> Tuple[str, Optional[Dict[str, int]]]: body: Dict[str, object] = { "model": model, "max_tokens": max_tokens, @@ -136,7 +137,7 @@ def 
_post_messages( for block in payload.get("content", []) or []: if isinstance(block, dict) and block.get("type") == "text": parts.append(str(block.get("text", ""))) - return "".join(parts).strip() + return "".join(parts).strip(), normalize_token_usage(payload.get("usage")) @dataclass @@ -153,6 +154,7 @@ class ClaudeAnthropicAgent: config: ClaudeAnthropicConfig = field(default_factory=ClaudeAnthropicConfig) api_key: Optional[str] = None + last_usage: Optional[Dict[str, int]] = field(default=None, init=False) def __post_init__(self) -> None: key = (self.api_key or os.environ.get("ANTHROPIC_API_KEY") or "").strip() @@ -165,7 +167,7 @@ def __post_init__(self) -> None: def __call__(self, messages: List[dict]) -> str: system, turns = _to_anthropic_turns(messages) - return _post_messages( + text, self.last_usage = _post_messages( self.api_key, model=self.config.model, max_tokens=self.config.max_tokens, @@ -174,3 +176,4 @@ def __call__(self, messages: List[dict]) -> str: messages=turns, timeout=self.config.timeout, ) + return text diff --git a/interface/agents/qwen35_vl.py b/interface/agents/qwen35_vl.py index 2ad4e90..3ce2669 100644 --- a/interface/agents/qwen35_vl.py +++ b/interface/agents/qwen35_vl.py @@ -78,6 +78,7 @@ class Qwen35VLAgent: config: Qwen35VLConfig = field(default_factory=Qwen35VLConfig) processor: Any = None model: Any = None + last_usage: dict[str, int] | None = field(default=None, init=False) def __post_init__(self) -> None: from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration @@ -121,4 +122,9 @@ def __call__(self, messages: List[dict]) -> str: ) new_tokens = generated[0][prompt_len:] + self.last_usage = { + "input_tokens": int(prompt_len), + "output_tokens": int(len(new_tokens)), + "total_tokens": int(prompt_len + len(new_tokens)), + } return self.processor.decode(new_tokens, skip_special_tokens=True).strip() diff --git a/interface/runner.py b/interface/runner.py index 91dc448..a8c6f7b 100644 --- a/interface/runner.py +++ 
b/interface/runner.py @@ -57,6 +57,18 @@ def _trim_rolling_chat(messages: List[dict], max_pairs: int) -> None: del messages[1 : 1 + (tail_len - cap)] +def _reset_agent_usage(agent: Callable[[List[dict]], str]) -> None: + """Clear per-call telemetry so stale usage cannot leak into a later query.""" + reset_usage = getattr(agent, "reset_usage", None) + if callable(reset_usage): + reset_usage() + return + try: + setattr(agent, "last_usage", None) + except (AttributeError, TypeError): + pass + + def build_runner( config: ExperimentConfig, backend: MiniGridBackend, @@ -159,6 +171,7 @@ def run( len(agent_messages), has_image, ) + _reset_agent_usage(agent) t_llm = time.perf_counter() model_text = agent(agent_messages) llm_s = time.perf_counter() - t_llm @@ -177,22 +190,24 @@ def run( ) if logger.isEnabledFor(logging.DEBUG): logger.debug("LLM query #%d reply:\n%s", query_count, model_text) - transcript.append( - { - "kind": "query", - "query_index": query_count, - "env_step_count": state.step_count, - "agent_messages": copy.deepcopy(agent_messages), - "assistant_reply": model_text, - "parsed_actions": list(action_queue), - "parse_ok": bool(action_queue), - "has_image": has_image, - "llm_latency_s": llm_s, - "chat_history_mode": chat_history, - "agent_message_count": len(agent_messages), - "actions_remaining_before_step": len(action_queue), - } - ) + query_record = { + "kind": "query", + "query_index": query_count, + "env_step_count": state.step_count, + "agent_messages": copy.deepcopy(agent_messages), + "assistant_reply": model_text, + "parsed_actions": list(action_queue), + "parse_ok": bool(action_queue), + "has_image": has_image, + "llm_latency_s": llm_s, + "chat_history_mode": chat_history, + "agent_message_count": len(agent_messages), + "actions_remaining_before_step": len(action_queue), + } + usage = getattr(agent, "last_usage", None) + if isinstance(usage, dict): + query_record["usage"] = dict(usage) + transcript.append(query_record) # check if we got any valid 
actions;
             # if not, we'll count it as a parse failure and give feedback,
             # but still allow retries until max_parse_retries is reached
diff --git a/interface/smoke_tests/smoke_llm.py b/interface/smoke_tests/smoke_llm.py
index fd7d5e0..8d9058c 100644
--- a/interface/smoke_tests/smoke_llm.py
+++ b/interface/smoke_tests/smoke_llm.py
@@ -80,19 +80,35 @@ def __init__(
     def __call__(self, messages: list[dict]) -> str:
         self._query_seq += 1
         text = self._inner(messages)
-        self._records.append(
-            {
-                "query": self._query_seq,
-                "messages_in_context": len(messages),
-                "reply": text,
-            }
-        )
+        record = {
+            "query": self._query_seq,
+            "messages_in_context": len(messages),
+            "reply": text,
+        }
+        if self.last_usage is not None:
+            record["usage"] = dict(self.last_usage)
+        self._records.append(record)
         if self._log_replies:
             print(f"\n{'=' * 72}\nLLM query {self._query_seq} (messages={len(messages)})\n{'=' * 72}")
             print(text)
             print(f"{'=' * 72}\n")
         return text

+    @property
+    def last_usage(self) -> dict[str, int] | None:
+        usage = getattr(self._inner, "last_usage", None)
+        return usage if isinstance(usage, dict) else None
+
+    def reset_usage(self) -> None:
+        reset_usage = getattr(self._inner, "reset_usage", None)
+        if callable(reset_usage):
+            reset_usage()
+            return
+        try:
+            setattr(self._inner, "last_usage", None)
+        except (AttributeError, TypeError):
+            pass
+

 def main() -> None:
     parser = argparse.ArgumentParser(
diff --git a/interface/telemetry.py b/interface/telemetry.py
new file mode 100644
index 0000000..dd3a3c4
--- /dev/null
+++ b/interface/telemetry.py
@@ -0,0 +1,42 @@
+"""Shared telemetry normalization for interface producers and scorer consumers."""
+
+from __future__ import annotations
+
+from typing import Any
+
+
+TOKEN_COUNT_KEYS = ("total_tokens", "token_count", "tokens", "model_tokens")
+
+
+def normalize_token_usage(usage: Any) -> dict[str, int] | None:
+    """Normalize provider token usage into input, output, and total counts."""
+    if not isinstance(usage, dict):
+        return None
+    input_tokens = usage.get("input_tokens", usage.get("prompt_tokens"))
+    output_tokens = usage.get("output_tokens", usage.get("completion_tokens"))
+    total_tokens = usage.get("total_tokens")
+    if total_tokens is None and (input_tokens is not None or output_tokens is not None):
+        total_tokens = int(input_tokens or 0) + int(output_tokens or 0)
+
+    normalized = {}
+    if input_tokens is not None:
+        normalized["input_tokens"] = int(input_tokens)
+    if output_tokens is not None:
+        normalized["output_tokens"] = int(output_tokens)
+    if total_tokens is not None:
+        normalized["total_tokens"] = int(total_tokens)
+    return normalized or None
+
+
+def token_count_from_record(record: dict[str, Any]) -> int | None:
+    """Extract one token total without counting nested aliases twice."""
+    for container in (record, record.get("info"), record.get("metadata")):
+        if not isinstance(container, dict):
+            continue
+        for key in TOKEN_COUNT_KEYS:
+            if container.get(key) is not None:
+                return int(container[key])
+        usage = normalize_token_usage(container.get("usage"))
+        if usage is not None and usage.get("total_tokens") is not None:
+            return usage["total_tokens"]
+    return None
diff --git a/pyproject.toml b/pyproject.toml
index 7ce045e..b3c8b25 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -39,6 +39,7 @@ multinet-probe-vlm = "scripts.probe_vlm:main"
 multinet-ollama-vision-check = "scripts.ollama_vision_check:main"
 multinet-ollama-maze-shape-check = "scripts.ollama_maze_shape_check:main"
 multinet-vlm-sanity = "scripts.vlm_sanity_check:main"
+multinet-score-json = "scripts.score_json:main"

 [tool.setuptools]
 include-package-data = true
@@ -63,9 +64,11 @@ include = [
     "interface*",
     "mazes*",
     "multigrid*",
+    "scorer*",
     "scripts*",
 ]

 [tool.setuptools.package-data]
 gridworld = ["tasks/**/*.json", "tasks/*.json"]
 mazes = ["validation_10/**/*.json", "validation_10/*.json"]
+scorer = ["scorer_config.json"]
diff --git a/scorer/__init__.py b/scorer/__init__.py
new file mode 100644
index 0000000..df4a4db
--- /dev/null
+++ b/scorer/__init__.py
@@ -0,0 +1,33 @@
+"""Standalone scoring package for MultiNet task and run artifacts."""
+
+from .scoring import (
+    CanonicalPathReport,
+    RuntimeScoreArtifact,
+    ScoredDifficulty,
+    ScorerConfig,
+    StaticScoreArtifact,
+    compute_12d_score,
+    compute_canonical_paths,
+    compute_greedy_solvability,
+    compute_runtime_score,
+    compute_static_score_artifact,
+    load_scorer_config,
+    score_runtime_file,
+    score_task_file,
+)
+
+__all__ = [
+    "CanonicalPathReport",
+    "RuntimeScoreArtifact",
+    "ScoredDifficulty",
+    "ScorerConfig",
+    "StaticScoreArtifact",
+    "compute_12d_score",
+    "compute_canonical_paths",
+    "compute_greedy_solvability",
+    "compute_runtime_score",
+    "compute_static_score_artifact",
+    "load_scorer_config",
+    "score_runtime_file",
+    "score_task_file",
+]
diff --git a/scorer/artifacts.py b/scorer/artifacts.py
new file mode 100644
index 0000000..e61a70b
--- /dev/null
+++ b/scorer/artifacts.py
@@ -0,0 +1,170 @@
+"""Dataclasses for scorer artifact payloads."""
+
+from __future__ import annotations
+
+import copy
+from dataclasses import dataclass, field
+from typing import Any
+
+from .config import DIMENSION_NAMES, SCORER_VERSION
+
+
+@dataclass
+class ScoredDifficulty:
+    """Backward-compatible 12-dimension score report."""
+
+    dimensions: list[float]
+    dimension_names: list[str] = field(default_factory=lambda: DIMENSION_NAMES.copy())
+    composite: float = 0.0
+    weights: list[float] = field(default_factory=lambda: [1.0] * len(DIMENSION_NAMES))
+
+    @property
+    def dimensions_by_name(self) -> dict[str, float]:
+        return dict(zip(self.dimension_names, self.dimensions))
+
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "dimensions": list(self.dimensions),
+            "dimension_names": list(self.dimension_names),
+            "composite": self.composite,
+            "weights": list(self.weights),
+        }
+
+
+@dataclass
+class CanonicalPathReport:
+    """Canonical solver trace artifact for a task."""
+
+    task_id: str
+    success: bool
+    actions:
list[str]
+    positions: list[tuple[int, int]]
+    optimal_steps: int
+    states_explored: int
+    message: str
+    greedy: dict[str, Any] | None = None
+    producer_version: str = SCORER_VERSION
+
+    @property
+    def bfs(self) -> dict[str, Any]:
+        return {
+            "success": self.success,
+            "actions": list(self.actions),
+            "positions": [list(pos) for pos in self.positions],
+            "optimal_steps": self.optimal_steps,
+            "states_explored": self.states_explored,
+            "message": self.message,
+        }
+
+    def to_dict(self) -> dict[str, Any]:
+        payload = {
+            "task_id": self.task_id,
+            "bfs": self.bfs,
+            "producer_version": self.producer_version,
+        }
+        if self.greedy is not None:
+            payload["greedy"] = copy.deepcopy(self.greedy)
+        return payload
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "CanonicalPathReport":
+        bfs = data.get("bfs", data)
+        return cls(
+            task_id=str(data.get("task_id", "")),
+            success=bool(bfs.get("success", False)),
+            actions=[str(action) for action in bfs.get("actions", [])],
+            positions=[
+                (int(pos[0]), int(pos[1]))
+                for pos in bfs.get("positions", [])
+                if isinstance(pos, (list, tuple)) and len(pos) >= 2
+            ],
+            optimal_steps=int(bfs.get("optimal_steps", 0)),
+            states_explored=int(bfs.get("states_explored", 0)),
+            message=str(bfs.get("message", "")),
+            greedy=copy.deepcopy(data.get("greedy")),
+            producer_version=str(data.get("producer_version", SCORER_VERSION)),
+        )
+
+
+@dataclass
+class StaticScoreArtifact:
+    """Stage 2 static score artifact."""
+
+    task_id: str
+    is_beatable: bool
+    message: str
+    dimensions: dict[str, float]
+    static_score_unweighted: float
+    static_score: float
+    weights: dict[str, float]
+    validation: dict[str, Any]
+    canonical_agent_features: dict[str, float | None]
+    calibration_version: str
+    inputs_hash: str
+    producer_version: str = SCORER_VERSION
+
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "task_id": self.task_id,
+            "is_beatable": self.is_beatable,
+            "message": self.message,
+            "dimensions_12": dict(self.dimensions),
+            "static_score_unweighted": self.static_score_unweighted,
+            "static_score": self.static_score,
+            "weights": dict(self.weights),
+            "validation": copy.deepcopy(self.validation),
+            "canonical_agent_features": dict(self.canonical_agent_features),
+            "calibration_version": self.calibration_version,
+            "inputs_hash": self.inputs_hash,
+            "producer_version": self.producer_version,
+        }
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "StaticScoreArtifact":
+        dimensions = data.get("dimensions_12", data.get("dimensions", {}))
+        if isinstance(dimensions, list):
+            dimensions = dict(zip(DIMENSION_NAMES, dimensions))
+        return cls(
+            task_id=str(data.get("task_id", "")),
+            is_beatable=bool(data.get("is_beatable", False)),
+            message=str(data.get("message", "")),
+            dimensions={str(k): float(v) for k, v in dimensions.items()},
+            static_score_unweighted=float(data.get("static_score_unweighted", 0.0)),
+            static_score=float(data.get("static_score", data.get("composite", 0.0))),
+            weights={str(k): float(v) for k, v in data.get("weights", {}).items()},
+            validation=dict(data.get("validation", {})),
+            canonical_agent_features=dict(data.get("canonical_agent_features", {})),
+            calibration_version=str(data.get("calibration_version", "unknown")),
+            inputs_hash=str(data.get("inputs_hash", "")),
+            producer_version=str(data.get("producer_version", SCORER_VERSION)),
+        )
+
+
+@dataclass
+class RuntimeScoreArtifact:
+    """Stage 4 runtime score artifact for one run."""
+
+    task_id: str
+    backend: str
+    adapter: str
+    model_id: str
+    seed: int | None
+    signals: dict[str, Any]
+    composite: float
+    calibration_version: str
+    inputs_hash: str
+    producer_version: str = SCORER_VERSION
+
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "task_id": self.task_id,
+            "backend": self.backend,
+            "adapter": self.adapter,
+            "model_id": self.model_id,
+            "seed": self.seed,
+            "signals": copy.deepcopy(self.signals),
+            "composite": self.composite,
+            "calibration_version": self.calibration_version,
+            "inputs_hash": self.inputs_hash,
+            "producer_version": self.producer_version,
+        }
diff --git a/scorer/config.py b/scorer/config.py
new file mode 100644
index 0000000..6e48130
--- /dev/null
+++ b/scorer/config.py
@@ -0,0 +1,146 @@
+"""Scorer configuration and calibration defaults."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any
+
+from .io import load_json
+
+
+SCORER_VERSION = "0.3.0"
+DEFAULT_CONFIG_PATH = Path(__file__).with_name("scorer_config.json")
+
+DIMENSION_NAMES = [
+    "optimal_path_length",
+    "search_space_size",
+    "backtracking_required",
+    "fragility",
+    "dependency_depth",
+    "dependency_variety",
+    "distractor_count",
+    "distractor_quality",
+    "grid_size",
+    "wall_density",
+    "partial_observability",
+    "irreversibility",
+]
+
+GREEDY_SOLVABILITY_FEATURE = "greedy_solvability"
+
+CANONICAL_AGENT_FEATURE_NAMES = [
+    GREEDY_SOLVABILITY_FEATURE,
+]
+
+DEFAULT_DISTRACTOR_TYPE_WEIGHTS = {
+    "wrong_color_key": 1.0,
+    "inactive_switch": 2.0,
+    "decoy_door": 2.0,
+    "distractor_chain": 3.0,
+}
+
+DEFAULT_RUNTIME_WEIGHTS = {
+    "step_ratio": 1.0,
+    "cell_overlap_bfs": 1.0,
+    "token_efficiency": 1.0,
+    "greedy_penalty": 0.5,
+}
+
+
+def _coerce_float_mapping(
+    values: dict[str, Any] | list[Any] | None,
+    names: list[str],
+    default: float = 1.0,
+) -> dict[str, float]:
+    if values is None:
+        return {name: default for name in names}
+    if isinstance(values, list):
+        if len(values) != len(names):
+            raise ValueError(f"Expected {len(names)} weights, got {len(values)}")
+        result = {name: default for name in names}
+        for name, value in zip(names, values):
+            result[name] = float(value)
+        return result
+    return {name: float(values.get(name, default)) for name in names}
+
+
+@dataclass
+class ScorerConfig:
+    """Weights and runtime coefficients used by the standalone scorer."""
+
+    version: str = "default"
+    static_dimension_weights: dict[str, float] = field(
+        default_factory=lambda:
{name: 1.0 for name in DIMENSION_NAMES}
+    )
+    distractor_type_weights: dict[str, float] = field(
+        default_factory=lambda: DEFAULT_DISTRACTOR_TYPE_WEIGHTS.copy()
+    )
+    runtime_weights: dict[str, float] = field(
+        default_factory=lambda: DEFAULT_RUNTIME_WEIGHTS.copy()
+    )
+    baseline_tokens: float = 1000.0
+    difficulty_max_static_score: float | None = None
+
+    @classmethod
+    def default(cls) -> "ScorerConfig":
+        return cls()
+
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "ScorerConfig":
+        static_weights = data.get("static_dimension_weights", data.get("static_weights"))
+        runtime_weights = data.get("runtime_weights")
+        distractor_weights = data.get("distractor_type_weights", data.get("distractor_weights"))
+
+        difficulty_max = data.get("difficulty_max_static_score")
+        return cls(
+            version=str(data.get("version", "default")),
+            static_dimension_weights=_coerce_float_mapping(static_weights, DIMENSION_NAMES),
+            distractor_type_weights={
+                **DEFAULT_DISTRACTOR_TYPE_WEIGHTS,
+                **{k: float(v) for k, v in (distractor_weights or {}).items()},
+            },
+            runtime_weights={
+                **DEFAULT_RUNTIME_WEIGHTS,
+                **{k: float(v) for k, v in (runtime_weights or {}).items()},
+            },
+            baseline_tokens=float(data.get("baseline_tokens", 1000.0)),
+            difficulty_max_static_score=(
+                None if difficulty_max is None else float(difficulty_max)
+            ),
+        )
+
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "version": self.version,
+            "static_dimension_weights": dict(self.static_dimension_weights),
+            "distractor_type_weights": dict(self.distractor_type_weights),
+            "runtime_weights": dict(self.runtime_weights),
+            "baseline_tokens": self.baseline_tokens,
+            "difficulty_max_static_score": self.difficulty_max_static_score,
+        }
+
+    def static_weight_list(self) -> list[float]:
+        return [self.static_dimension_weights.get(name, 1.0) for name in DIMENSION_NAMES]
+
+
+def load_scorer_config(path: str | Path | None = None) -> ScorerConfig:
+    """Load scorer weights from JSON, or return defaults if no file exists."""
+    config_path = Path(path) if path is not None else DEFAULT_CONFIG_PATH
+    if not config_path.exists():
+        if path is not None:
+            raise FileNotFoundError(f"Scorer config not found: {config_path}")
+        return ScorerConfig.default()
+    if config_path.suffix.lower() in {".yaml", ".yml"}:
+        try:
+            import yaml  # type: ignore
+        except ImportError as exc:
+            raise ImportError(
+                "YAML scorer configs require PyYAML. Use JSON or install PyYAML."
+            ) from exc
+        with open(config_path, "r") as f:
+            data = yaml.safe_load(f) or {}
+        if not isinstance(data, dict):
+            raise ValueError(f"Expected a YAML object in {config_path}")
+        return ScorerConfig.from_dict(data)
+    return ScorerConfig.from_dict(load_json(config_path))
diff --git a/scorer/io.py b/scorer/io.py
new file mode 100644
index 0000000..451fc44
--- /dev/null
+++ b/scorer/io.py
@@ -0,0 +1,62 @@
+"""JSON and hash helpers for scorer artifacts."""
+
+from __future__ import annotations
+
+import hashlib
+import json
+from pathlib import Path
+from typing import Any
+
+from gridworld.task_spec import TaskSpecification
+
+
+def json_default(value: Any) -> Any:
+    if hasattr(value, "item"):
+        return value.item()
+    raise TypeError(f"Object of type {value.__class__.__name__} is not JSON serializable")
+
+
+def load_json(path: str | Path) -> dict[str, Any]:
+    with open(path, "r") as f:
+        data = json.load(f)
+    if not isinstance(data, dict):
+        raise ValueError(f"Expected a JSON object in {path}")
+    return data
+
+
+def dump_json(path: str | Path, payload: dict[str, Any]) -> None:
+    output_path = Path(path)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_path, "w") as f:
+        json.dump(payload, f, indent=2, default=json_default)
+        f.write("\n")
+
+
+def json_files(paths: list[str]) -> list[Path]:
+    """Expand JSON files and directories into a stable file list."""
+    files: list[Path] = []
+    for value in paths:
+        path = Path(value)
+        if path.is_dir():
+            files.extend(sorted(path.rglob("*.json")))
+        else:
+            files.append(path)
+    return files
+
+
+def stable_hash(payload: Any) -> str:
+    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=json_default)
+    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
+
+
+def task_spec_from_payload(data: dict[str, Any]) -> TaskSpecification:
+    if "task_spec" in data and isinstance(data["task_spec"], dict):
+        return TaskSpecification.from_dict(data["task_spec"])
+    if "TaskSpecification" in data and isinstance(data["TaskSpecification"], dict):
+        return TaskSpecification.from_dict(data)
+    required_fields = {"task_id", "maze", "goal", "max_steps"}
+    if not required_fields.issubset(data):
+        raise ValueError(
+            "Input JSON is not a task artifact. Expected task fields or a nested task_spec."
+        )
+    return TaskSpecification.from_dict(data)
diff --git a/scorer/runtime.py b/scorer/runtime.py
new file mode 100644
index 0000000..0d43f8b
--- /dev/null
+++ b/scorer/runtime.py
@@ -0,0 +1,362 @@
+"""Runtime scoring for run and episode JSON artifacts."""
+
+from __future__ import annotations
+
+from pathlib import Path
+from typing import Any
+
+from .artifacts import CanonicalPathReport, RuntimeScoreArtifact, StaticScoreArtifact
+from .config import SCORER_VERSION, ScorerConfig
+from .io import dump_json, load_json, stable_hash
+from interface.telemetry import token_count_from_record
+
+
+def _artifact_dict(value: dict[str, Any] | StaticScoreArtifact | CanonicalPathReport) -> dict[str, Any]:
+    if hasattr(value, "to_dict"):
+        return value.to_dict()  # type: ignore[no-any-return]
+    return value
+
+
+def _lookup_path(data: dict[str, Any], *keys: str) -> Any:
+    current: Any = data
+    for key in keys:
+        if not isinstance(current, dict) or key not in current:
+            return None
+        current = current[key]
+    return current
+
+
+def _extract_task_id(run: dict[str, Any], fallback: str = "") -> str:
+    return str(
+        run.get("task_id")
+        or _lookup_path(run, "task_spec", "task_id")
+        or _lookup_path(run, "episode", "task_id")
+        or fallback
+    )
+
+
+def _extract_bool(run: dict[str, Any], *keys: str, default: bool = False) -> bool:
+    for key in keys:
+        value = run.get(key)
+        if value is not None:
+            return bool(value)
+    return default
+
+
+def _extract_steps(run: dict[str, Any]) -> int | None:
+    for key in ("steps", "steps_taken", "steps_used"):
+        if run.get(key) is not None:
+            return int(run[key])
+    signal_steps = _lookup_path(run, "signals", "steps")
+    if signal_steps is not None:
+        return int(signal_steps)
+    final_step = _lookup_path(run, "final_state", "step_count")
+    if final_step is not None:
+        return int(final_step)
+    return None
+
+
+def _extract_token_count(run: dict[str, Any]) -> int | None:
+    for key in ("total_tokens", "token_count", "tokens"):
+        if run.get(key) is not None:
+            return int(run[key])
+    signal_tokens = _lookup_path(run, "signals", "token_count")
+    if signal_tokens is not None:
+        return int(signal_tokens)
+
+    trajectory_total = _sum_record_tokens(run.get("trajectory", []))
+    if trajectory_total is not None:
+        return trajectory_total
+    return _sum_record_tokens(run.get("transcript", []), kind="query")
+
+
+def _sum_record_tokens(records: Any, kind: str | None = None) -> int | None:
+    if not isinstance(records, list):
+        return None
+    total = 0
+    found = False
+    for item in records:
+        if not isinstance(item, dict):
+            continue
+        if kind is not None and item.get("kind") != kind:
+            continue
+        item_tokens = token_count_from_record(item)
+        if item_tokens is not None:
+            total += item_tokens
+            found = True
+    return total if found else None
+
+
+def _state_position(state: Any) -> tuple[int, int] | None:
+    if not isinstance(state, dict):
+        return None
+    raw = state.get("agent_position") or state.get("position")
+    if isinstance(raw, (list, tuple)) and len(raw) >= 2:
+        return int(raw[0]), int(raw[1])
+    return None
+
+
+def _extract_run_positions(run: dict[str, Any]) -> list[tuple[int, int]]:
+    positions: list[tuple[int, int]] = []
+
+    initial_pos = _state_position(run.get("initial_state"))
+    if initial_pos is not None:
+        positions.append(initial_pos)
+
+    for item in run.get("trajectory", []):
+        if not isinstance(item, dict):
+            continue
+        pos = _state_position(item.get("state"))
+        if pos is not None:
+            positions.append(pos)
+
+    for item in run.get("transcript", []):
+        if not isinstance(item, dict):
+            continue
+        if item.get("kind") == "reset":
+            pos = _state_position(item.get("state"))
+        else:
+            pos = _state_position(item.get("state_after"))
+            if pos is None:
+                raw = item.get("position_after")
+                pos = (int(raw[0]), int(raw[1])) if isinstance(raw, list) and len(raw) >= 2 else None
+        if pos is not None:
+            positions.append(pos)
+
+    final_pos = _state_position(run.get("final_state"))
+    if final_pos is not None:
+        positions.append(final_pos)
+
+    deduped: list[tuple[int, int]] = []
+    for pos in positions:
+        if not deduped or deduped[-1] != pos:
+            deduped.append(pos)
+    return deduped
+
+
+def _extract_canonical_positions(
+    canonical_paths: dict[str, Any],
+    agent: str = "bfs",
+) -> list[tuple[int, int]]:
+    path = canonical_paths.get(agent, canonical_paths if agent == "bfs" else {})
+    if not isinstance(path, dict):
+        return []
+    positions = []
+    for pos in path.get("positions", []):
+        if isinstance(pos, (list, tuple)) and len(pos) >= 2:
+            positions.append((int(pos[0]), int(pos[1])))
+    return positions
+
+
+def _cell_overlap(run_positions: list[tuple[int, int]], canonical_positions: list[tuple[int, int]]) -> float:
+    canonical_cells = set(canonical_positions)
+    if not canonical_cells:
+        return 0.0
+    return len(set(run_positions) & canonical_cells) / len(canonical_cells)
+
+
+def _extract_static_score(static_score: dict[str, Any]) -> float:
+    return float(static_score.get("static_score", static_score.get("composite", 0.0)))
+
+
+def _extract_greedy_solvability(static_score: dict[str, Any]) -> float:
+    value = _lookup_path(static_score, "canonical_agent_features", "greedy_solvability")
+    if value is None:
+        raise ValueError("Runtime scoring requires evaluated canonical_agent_features.greedy_solvability")
+    solvability = float(value)
+    if not 0.0 <= solvability <= 1.0:
+        raise ValueError("greedy_solvability must be between 0.0 and 1.0")
+    return solvability
+
+
+def _runtime_weighted_average(signals: dict[str, float], weights: dict[str, float]) -> float:
+    numerator = 0.0
+    denominator = 0.0
+    for key in ("step_ratio", "cell_overlap_bfs", "token_efficiency"):
+        weight = float(weights.get(key, 0.0))
+        numerator += signals[key] * weight
+        denominator += weight
+    return numerator / denominator if denominator else 0.0
+
+
+def _first_present(*values: Any) -> Any:
+    for value in values:
+        if value is not None:
+            return value
+    return None
+
+
+def compute_runtime_score(
+    run: dict[str, Any],
+    static_score: dict[str, Any] | StaticScoreArtifact,
+    canonical_paths: dict[str, Any] | CanonicalPathReport,
+    config: ScorerConfig | None = None,
+    difficulty_max_static_score: float | None = None,
+) -> RuntimeScoreArtifact:
+    """Compute the Stage 4 runtime score for one run JSON payload."""
+    scorer_config = config or ScorerConfig.default()
+    static_data = _artifact_dict(static_score)
+    canonical_data = _artifact_dict(canonical_paths)
+    if _lookup_path(static_data, "validation", "schema_valid") is False:
+        raise ValueError("Runtime scoring requires a schema-valid scored_static.json artifact")
+
+    task_id = _extract_task_id(run, fallback=str(static_data.get("task_id", "")))
+    success = _extract_bool(run, "success", default=bool(_lookup_path(run, "signals", "success") or False))
+    steps = _extract_steps(run)
+    token_count = _extract_token_count(run)
+    canonical_positions = _extract_canonical_positions(canonical_data)
+    greedy_positions = _extract_canonical_positions(canonical_data, agent="greedy")
+    run_positions = _extract_run_positions(run)
+
+    optimal_steps_value = _first_present(
+        _lookup_path(canonical_data, "bfs", "optimal_steps"),
+        canonical_data.get("optimal_steps"),
+        static_data.get("optimal_steps"),
+    )
+    if
optimal_steps_value is None:
+        raise ValueError("Runtime scoring requires bfs.optimal_steps in canonical_paths.json")
+    optimal_steps = int(optimal_steps_value)
+    if steps is None:
+        raise ValueError("Runtime scoring requires step telemetry")
+    if steps < 0:
+        raise ValueError("steps must not be negative")
+    step_ratio = 0.0
+    if success and optimal_steps == 0:
+        step_ratio = 1.0 if steps == 0 else 0.0
+    elif success:
+        step_ratio = optimal_steps / max(float(steps), float(optimal_steps), 1.0)
+
+    cell_overlap_bfs = _cell_overlap(run_positions, canonical_positions)
+    cell_overlap_greedy = (
+        _cell_overlap(run_positions, greedy_positions)
+        if isinstance(canonical_data.get("greedy"), dict)
+        else None
+    )
+    if token_count is None:
+        raise ValueError("Runtime scoring requires positive token telemetry")
+    if token_count <= 0:
+        raise ValueError("token_count must be greater than zero")
+    token_efficiency = min(1.0, scorer_config.baseline_tokens / float(token_count))
+
+    static_composite = _extract_static_score(static_data)
+    normalizer = (
+        difficulty_max_static_score
+        if difficulty_max_static_score is not None
+        else scorer_config.difficulty_max_static_score
+    )
+    if normalizer is None:
+        raise ValueError(
+            "Runtime scoring requires difficulty_max_static_score from the task suite "
+            "or scorer config"
+        )
+    if normalizer <= 0:
+        raise ValueError("difficulty_max_static_score must be greater than zero")
+    if static_composite > normalizer:
+        raise ValueError("difficulty_max_static_score must be at least the task static score")
+    difficulty_weight = static_composite / normalizer
+    success_factor = 1.0 if success else 0.0
+    efficiency_signals = {
+        "step_ratio": step_ratio,
+        "cell_overlap_bfs": cell_overlap_bfs,
+        "token_efficiency": token_efficiency,
+    }
+    efficiency_factor = _runtime_weighted_average(
+        efficiency_signals,
+        scorer_config.runtime_weights,
+    )
+    greedy_solvability = _extract_greedy_solvability(static_data)
+    greedy_penalty = (
+        scorer_config.runtime_weights.get("greedy_penalty", 0.0)
+        * greedy_solvability
+        * success_factor
+    )
+    composite = max(0.0, success_factor * efficiency_factor * difficulty_weight - greedy_penalty)
+
+    signals: dict[str, Any] = {
+        "success": success,
+        "steps": steps,
+        "terminated": _extract_bool(run, "terminated", default=False),
+        "truncated": _extract_bool(run, "truncated", default=False),
+        "terminated_reason": run.get("terminated_reason") or run.get("end_reason") or ("success" if success else "unknown"),
+        "reward": run.get("reward", run.get("total_reward")),
+        "token_count": token_count,
+        "optimal_steps": optimal_steps,
+        "step_ratio": step_ratio,
+        "cell_overlap_bfs": cell_overlap_bfs,
+        "cell_overlap_greedy": cell_overlap_greedy,
+        "token_efficiency": token_efficiency,
+        "difficulty_weight": difficulty_weight,
+        "efficiency_factor": efficiency_factor,
+        "greedy_penalty": greedy_penalty,
+    }
+    for key in (
+        "distractor_interactions",
+        "irreversible_failures",
+        "path_choice",
+        "mechanism_interaction_order",
+        "failure_point",
+    ):
+        if run.get(key) is not None:
+            signals[key] = run[key]
+
+    inputs_hash = stable_hash(
+        {
+            "run": {
+                "task_id": task_id,
+                "backend": run.get("backend"),
+                "adapter": run.get("adapter", run.get("agent_or_model")),
+                "model_id": run.get("model_id", run.get("model_name", run.get("agent_or_model"))),
+                "seed": run.get("seed"),
+                "positions": run_positions,
+                "signals": signals,
+            },
+            "static_score": {
+                "task_id": static_data.get("task_id"),
+                "static_score": static_composite,
+                "greedy_solvability": greedy_solvability,
+            },
+            "canonical_paths": {
+                "bfs_positions": canonical_positions,
+                "greedy_positions": greedy_positions,
+                "optimal_steps": optimal_steps,
+            },
+            "config": scorer_config.to_dict(),
+            "scorer_version": SCORER_VERSION,
+        }
+    )
+
+    return RuntimeScoreArtifact(
+        task_id=task_id,
+        backend=str(run.get("backend", "")),
+        adapter=str(run.get("adapter", run.get("agent_or_model", ""))),
+        model_id=str(run.get("model_id", run.get("model_name", run.get("agent_or_model", "")))),
+        seed=int(run["seed"]) if run.get("seed") is not None else None,
+        signals=signals,
+        composite=composite,
+        calibration_version=scorer_config.version,
+        inputs_hash=inputs_hash,
+    )
+
+
+def score_runtime_file(
+    run_path: str | Path,
+    static_score_path: str | Path,
+    canonical_paths_path: str | Path,
+    output_path: str | Path | None = None,
+    config: ScorerConfig | None = None,
+    difficulty_max_static_score: float | None = None,
+) -> RuntimeScoreArtifact:
+    """Score one run JSON file and optionally write run_score.json."""
+    run = load_json(run_path)
+    static_score = load_json(static_score_path)
+    canonical_paths = load_json(canonical_paths_path)
+    score = compute_runtime_score(
+        run,
+        static_score=static_score,
+        canonical_paths=canonical_paths,
+        config=config,
+        difficulty_max_static_score=difficulty_max_static_score,
+    )
+    if output_path is not None:
+        dump_json(output_path, score.to_dict())
+    return score
diff --git a/scorer/scorer_config.json b/scorer/scorer_config.json
new file mode 100644
index 0000000..fb7ed8f
--- /dev/null
+++ b/scorer/scorer_config.json
@@ -0,0 +1,31 @@
+{
+  "version": "default-v2",
+  "static_dimension_weights": {
+    "optimal_path_length": 1.0,
+    "search_space_size": 1.0,
+    "backtracking_required": 1.0,
+    "fragility": 1.0,
+    "dependency_depth": 1.0,
+    "dependency_variety": 1.0,
+    "distractor_count": 1.0,
+    "distractor_quality": 1.0,
+    "grid_size": 1.0,
+    "wall_density": 1.0,
+    "partial_observability": 1.0,
+    "irreversibility": 1.0
+  },
+  "distractor_type_weights": {
+    "wrong_color_key": 1.0,
+    "inactive_switch": 2.0,
+    "decoy_door": 2.0,
+    "distractor_chain": 3.0
+  },
+  "runtime_weights": {
+    "step_ratio": 1.0,
+    "cell_overlap_bfs": 1.0,
+    "token_efficiency": 1.0,
+    "greedy_penalty": 0.5
+  },
+  "baseline_tokens": 1000.0,
+  "difficulty_max_static_score": null
+}
diff --git a/scorer/scoring.py b/scorer/scoring.py
new file mode 100644
index
0000000..6d12100
--- /dev/null
+++ b/scorer/scoring.py
@@ -0,0 +1,45 @@
+"""Public scorer interface for static and runtime analysis."""
+
+from __future__ import annotations
+
+from .artifacts import (
+    CanonicalPathReport,
+    RuntimeScoreArtifact,
+    ScoredDifficulty,
+    StaticScoreArtifact,
+)
+from .config import (
+    CANONICAL_AGENT_FEATURE_NAMES,
+    DEFAULT_CONFIG_PATH,
+    DEFAULT_DISTRACTOR_TYPE_WEIGHTS,
+    DEFAULT_RUNTIME_WEIGHTS,
+    DIMENSION_NAMES,
+    SCORER_VERSION,
+    ScorerConfig,
+    load_scorer_config,
+)
+from .runtime import compute_runtime_score, score_runtime_file
+from .solvers import compute_canonical_paths, compute_greedy_solvability
+from .static import compute_12d_score, compute_static_score_artifact, score_task_file
+
+__all__ = [
+    "CANONICAL_AGENT_FEATURE_NAMES",
+    "DEFAULT_CONFIG_PATH",
+    "DEFAULT_DISTRACTOR_TYPE_WEIGHTS",
+    "DEFAULT_RUNTIME_WEIGHTS",
+    "DIMENSION_NAMES",
+    "SCORER_VERSION",
+    "CanonicalPathReport",
+    "RuntimeScoreArtifact",
+    "ScoredDifficulty",
+    "ScorerConfig",
+    "StaticScoreArtifact",
+    "compute_12d_score",
+    "compute_canonical_paths",
+    "compute_greedy_solvability",
+    "compute_runtime_score",
+    "compute_static_score_artifact",
+    "load_scorer_config",
+    "score_runtime_file",
+    "score_task_file",
+]
diff --git a/scorer/solvers.py b/scorer/solvers.py
new file mode 100644
index 0000000..95fe56f
--- /dev/null
+++ b/scorer/solvers.py
@@ -0,0 +1,74 @@
+"""Canonical solver integration for scorer artifacts."""
+
+from __future__ import annotations
+
+from typing import Any
+
+from gridworld.baselines import PlannedPath, plan_bfs_path, plan_greedy_path
+from gridworld.task_spec import TaskSpecification
+
+from .artifacts import CanonicalPathReport
+
+
+def _path_payload(path) -> dict[str, Any]:
+    return {
+        "success": path.success,
+        "actions": list(path.action_labels),
+        "positions": [list(pos) for pos in path.positions],
+        "steps": len(path.action_labels),
+    }
+
+
+def require_scorable_spec(spec: TaskSpecification) -> None:
+    """Reject malformed tasks before canonical planners inspect their coordinates."""
+    schema_valid, schema_errors = spec.validate()
+    if not schema_valid:
+        detail = "; ".join(schema_errors)
+        raise ValueError(f"Task {spec.task_id!r} failed schema validation: {detail}")
+
+
+def compute_canonical_paths(
+    spec: TaskSpecification,
+    bfs_path: PlannedPath | None = None,
+    greedy_path: PlannedPath | None = None,
+) -> CanonicalPathReport:
+    """Emit canonical BFS and greedy traces using the merged baseline solvers."""
+    require_scorable_spec(spec)
+    if bfs_path is None:
+        bfs_path = plan_bfs_path(spec)
+    if greedy_path is None:
+        greedy_path = plan_greedy_path(spec)
+
+    if bfs_path.success:
+        message = (
+            f"Solution found in {len(bfs_path.action_labels)} steps "
+            f"({bfs_path.states_explored} states explored)"
+        )
+    elif bfs_path.states_explored:
+        message = (
+            "No solution found "
+            f"({bfs_path.states_explored} states explored, all reachable states checked)"
+        )
+    else:
+        message = "No solution found"
+
+    return CanonicalPathReport(
+        task_id=spec.task_id,
+        success=bfs_path.success,
+        actions=list(bfs_path.action_labels),
+        positions=list(bfs_path.positions),
+        optimal_steps=len(bfs_path.action_labels) if bfs_path.success else 0,
+        states_explored=bfs_path.states_explored,
+        message=message,
+        greedy=_path_payload(greedy_path),
+    )
+
+
+def compute_greedy_solvability(
+    spec: TaskSpecification,
+    greedy_path: PlannedPath | None = None,
+) -> float:
+    """Return 1 when the merged greedy planner solves the task, else 0."""
+    if greedy_path is None:
+        greedy_path = plan_greedy_path(spec)
+    return 1.0 if greedy_path.success else 0.0
diff --git a/scorer/static.py b/scorer/static.py
new file mode 100644
index 0000000..adac502
--- /dev/null
+++ b/scorer/static.py
@@ -0,0 +1,264 @@
+"""Static task scoring and Stage 2 artifact generation."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+from gridworld.baselines import PlannedPath, plan_bfs_path, plan_greedy_path
+from gridworld.task_spec import TaskSpecification
+from gridworld.task_validator import DifficultyReport, TaskValidator, compute_difficulty
+
+from .artifacts import ScoredDifficulty, StaticScoreArtifact
+from .config import (
+    DEFAULT_DISTRACTOR_TYPE_WEIGHTS,
+    DIMENSION_NAMES,
+    GREEDY_SOLVABILITY_FEATURE,
+    SCORER_VERSION,
+    ScorerConfig,
+)
+from .io import dump_json, load_json, stable_hash, task_spec_from_payload
+from .solvers import compute_canonical_paths, compute_greedy_solvability, require_scorable_spec
+
+
+def _count_backtracking(solution: list[tuple[int, int]] | None) -> float:
+    if not solution:
+        return 0.0
+    seen = set()
+    revisits = 0
+    previous_pos = None
+    for pos in solution:
+        if pos == previous_pos:
+            continue
+        if pos in seen:
+            revisits += 1
+        seen.add(pos)
+        previous_pos = pos
+    return float(revisits)
+
+
+def _dependency_variety(spec: TaskSpecification) -> float:
+    if spec.dependency_chain is not None:
+        return float(len({step.type for step in spec.dependency_chain.sequence}))
+
+    variety = 0
+    if spec.mechanisms.keys and spec.mechanisms.doors:
+        variety += 1
+    if spec.mechanisms.switches and spec.mechanisms.gates:
+        variety += 1
+    if spec.mechanisms.blocks:
+        variety += 1
+    if spec.mechanisms.teleporters:
+        variety += 1
+    if spec.mechanisms.hazards:
+        variety += 1
+    return float(variety)
+
+
+def _distractor_quality(
+    spec: TaskSpecification,
+    distractor_type_weights: dict[str, float] | None = None,
+) -> float:
+    if not spec.distractors:
+        return 0.0
+    weights = distractor_type_weights or DEFAULT_DISTRACTOR_TYPE_WEIGHTS
+    return float(sum(weights.get(d.type, 1.0) for d in spec.distractors))
+
+
+def _partial_observability(spec: TaskSpecification) -> float:
+    mapping = {"full": 0.0, "view_cone": 1.0, "fog_of_war": 2.0}
+    return mapping.get(spec.rules.observability, 0.0)
+
+
+def _irreversibility(spec: TaskSpecification) -> float:
+    score = 0.0
+    if spec.rules.key_consumption:
+        score += float(len(spec.mechanisms.doors))
+    score += float(sum(1 for switch in spec.mechanisms.switches if switch.switch_type == "one_shot"))
+    score += float(sum(1 for tp in spec.mechanisms.teleporters if not tp.bidirectional))
+    return score
+
+
+def compute_12d_score(
+    spec: TaskSpecification,
+    solver_output: DifficultyReport | None = None,
+    weights: list[float] | None = None,
+    config: ScorerConfig | None = None,
+    validator: TaskValidator | None = None,
+    bfs_path: PlannedPath | None = None,
+) -> ScoredDifficulty:
+    """
+    Compute the 12-dimension static benchmark score.
+
+    This keeps the old call shape while calibration and artifact generation
+    live in the standalone scorer package.
+    """
+    require_scorable_spec(spec)
+    scorer_config = config or ScorerConfig.default()
+    task_validator = validator or TaskValidator(spec)
+    if solver_output is None:
+        solver_output = compute_difficulty(spec, validator=task_validator)
+    if bfs_path is None:
+        bfs_path = plan_bfs_path(spec)
+
+    fragility = task_validator.compute_fragility()
+    fragility_value = 0.0 if fragility.min_steps_to_break == -1 else 1.0 / fragility.min_steps_to_break
+
+    width, height = spec.maze.dimensions
+    grid_size = float(width * height)
+    wall_density = float(len(spec.maze.walls) / grid_size) if grid_size else 0.0
+
+    dimensions = [
+        float(len(bfs_path.action_labels) if bfs_path.success else 0),
+        float(bfs_path.states_explored),
+        _count_backtracking(bfs_path.positions),
+        fragility_value,
+        float(spec.dependency_chain.depth if spec.dependency_chain is not None else solver_output.dependency_depth),
+        _dependency_variety(spec),
+        float(len(spec.distractors or [])),
+        _distractor_quality(spec, scorer_config.distractor_type_weights),
+        grid_size,
+        wall_density,
+        _partial_observability(spec),
+        _irreversibility(spec),
+    ]
+
+    weight_vector = (
+        scorer_config.static_weight_list()
+        if weights is None
+        else [float(weight) for weight in weights]
+    )
+    if len(weight_vector) != len(dimensions):
+        raise ValueError(f"Expected {len(dimensions)} static weights, got {len(weight_vector)}")
+    composite = float(sum(d * w for d, w in zip(dimensions, weight_vector)))
+    return ScoredDifficulty(
+        dimensions=dimensions,
+        dimension_names=DIMENSION_NAMES.copy(),
+        composite=composite,
+        weights=weight_vector,
+    )
+
+
+def compute_static_score_artifact(
+    spec: TaskSpecification,
+    config: ScorerConfig | None = None,
+    solver_output: DifficultyReport | None = None,
+    validator: TaskValidator | None = None,
+    validation_result: tuple[bool, list[tuple[int, int]] | None, str] | None = None,
+    bfs_path: PlannedPath | None = None,
+    greedy_path: PlannedPath | None = None,
+) -> StaticScoreArtifact:
+    """Compute the Stage 2 static score artifact for one task."""
+    require_scorable_spec(spec)
+    scorer_config = config or ScorerConfig.default()
+    schema_valid, schema_errors = spec.validate()
+    task_validator = validator or TaskValidator(spec)
+    if validation_result is None:
+        validation_result = task_validator.validate()
+    is_beatable, _, message = validation_result
+    if solver_output is None:
+        solver_output = compute_difficulty(
+            spec,
+            validator=task_validator,
+            validation_result=validation_result,
+        )
+    if bfs_path is None:
+        bfs_path = plan_bfs_path(spec)
+    if is_beatable != bfs_path.success:
+        raise ValueError(
+            "Task validator and canonical BFS disagree on beatability for "
+            f"{spec.task_id!r}"
+        )
+    score = compute_12d_score(
+        spec,
+        solver_output=solver_output,
+        config=scorer_config,
+        validator=task_validator,
+        bfs_path=bfs_path,
+    )
+
+    mechanism_necessity_violations: list[str] = []
+    distractor_safety_violations: list[str] = []
+    chain_ordering_valid = True
+    if schema_valid:
+        mechanism_necessity_violations = task_validator.validate_mechanism_necessity()
+        distractor_safety_violations = task_validator.validate_distractor_safety(
+            base_beatable=is_beatable
+        )
+        chain_ordering_valid = task_validator.validate_chain_ordering()
+
+    dimensions = score.dimensions_by_name
+    static_score_unweighted = float(sum(dimensions.values()))
+    inputs_hash = stable_hash(
+        {
+            "task": spec.to_dict(),
+            "config": scorer_config.to_dict(),
+            "scorer_version": SCORER_VERSION,
+        }
+    )
+
+    return StaticScoreArtifact(
+        task_id=spec.task_id,
+        is_beatable=is_beatable,
+        message=message,
+        dimensions=dimensions,
+        static_score_unweighted=static_score_unweighted,
+        static_score=score.composite,
+        weights=dict(scorer_config.static_dimension_weights),
+        validation={
+            "schema_valid": schema_valid,
+            "schema_errors": schema_errors,
+            "mechanism_necessity_violations": mechanism_necessity_violations,
+            "distractor_safety_violations": distractor_safety_violations,
+            "chain_ordering_valid": chain_ordering_valid,
+        },
+        canonical_agent_features={
+            GREEDY_SOLVABILITY_FEATURE: (
+                compute_greedy_solvability(spec, greedy_path=greedy_path)
+                if schema_valid
+                else None
+            ),
+        },
+        calibration_version=scorer_config.version,
+        inputs_hash=inputs_hash,
+    )
+
+
+def score_task_file(
+    task_path: str | Path,
+    output_dir: str | Path | None = None,
+    config: ScorerConfig | None = None,
+):
+    """Score a task JSON file and optionally write canonical score artifacts."""
+    spec = task_spec_from_payload(load_json(task_path))
+    require_scorable_spec(spec)
+    validator = TaskValidator(spec)
+    validation_result = validator.validate()
+    difficulty = compute_difficulty(
+        spec,
+        validator=validator,
+        validation_result=validation_result,
+    )
+    bfs_path = plan_bfs_path(spec)
+    greedy_path = plan_greedy_path(spec)
+    canonical_paths = compute_canonical_paths(
+        spec,
+        bfs_path=bfs_path,
+        greedy_path=greedy_path,
+    )
+    static_score = compute_static_score_artifact(
+        spec,
+        config=config,
+        solver_output=difficulty,
+        validator=validator,
+        validation_result=validation_result,
+        bfs_path=bfs_path,
+        greedy_path=greedy_path,
+    )
+
+    if output_dir is not None:
+        out = Path(output_dir)
+        out.mkdir(parents=True, exist_ok=True)
+        dump_json(out / "canonical_paths.json", canonical_paths.to_dict())
+        dump_json(out / "scored_static.json", static_score.to_dict())
+
+    return canonical_paths, static_score
diff --git a/scripts/score_json.py b/scripts/score_json.py
new file mode 100644
index 0000000..d2277c6
--- /dev/null
+++ b/scripts/score_json.py
@@ -0,0 +1,172 @@
+#!/usr/bin/env python3
+"""CLI for scoring task and run JSON artifacts."""
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+from scorer.io import dump_json, json_files, load_json
+from scorer.scoring import (
+    ScorerConfig,
+    compute_runtime_score,
+    load_scorer_config,
+    score_runtime_file,
+    score_task_file,
+)
+
+
+def _load_config(args: argparse.Namespace) -> ScorerConfig:
+    return load_scorer_config(args.config)
+
+
+def _static_target_dirs(files: list[Path], output_root: Path | None) -> list[Path]:
+    if output_root is None:
+        return [path.with_suffix("").with_name(f"{path.stem}_score") for path in files]
+    if len(files) == 1:
+        return [output_root]
+
+    target_dirs = [output_root / path.stem for path in files]
+    duplicates = sorted(
+        {
+            str(target)
+            for target in target_dirs
+            if target_dirs.count(target) > 1
+        }
+    )
+    if duplicates:
+        raise ValueError(
+            "Static output directories collide for same-stem inputs: "
+            f"{', '.join(duplicates)}. Score those inputs separately or use distinct filenames."
+        )
+    return target_dirs
+
+
+def _default_runtime_output(run_path: str | Path) -> Path:
+    path = Path(run_path)
+    return path.with_name(f"{path.stem}_score.json")
+
+
+def _static(args: argparse.Namespace) -> int:
+    config = _load_config(args)
+    files = json_files(args.inputs)
+    if not files:
+        raise FileNotFoundError("No JSON files matched the static scoring inputs")
+
+    output_root = Path(args.output_dir) if args.output_dir else None
+    for task_path, target_dir in zip(files, _static_target_dirs(files, output_root)):
+        canonical, static_score = score_task_file(
+            task_path,
+            output_dir=target_dir,
+            config=config,
+        )
+        print(
+            f"{static_score.task_id}: static_score={static_score.static_score:.3f}, "
+            f"beatable={static_score.is_beatable}, optimal_steps={canonical.optimal_steps} -> {target_dir}"
+        )
+    return 0
+
+
+def _runtime(args: argparse.Namespace) -> int:
+    config = _load_config(args)
+    output_path = Path(args.output) if args.output else _default_runtime_output(args.run)
+    if (args.static_score is None) != (args.canonical_paths is None):
+        raise ValueError("--static-score and --canonical-paths must be provided together")
+    if (
+        args.difficulty_max_static_score is None
+        and config.difficulty_max_static_score is None
+    ):
+        raise ValueError(
+            "Runtime scoring needs a suite maximum. Pass --difficulty-max-static-score "
+            "or set difficulty_max_static_score in scorer config."
+        )
+
+    if args.static_score and args.canonical_paths:
+        score = score_runtime_file(
+            args.run,
+            static_score_path=args.static_score,
+            canonical_paths_path=args.canonical_paths,
+            output_path=output_path,
+            config=config,
+            difficulty_max_static_score=args.difficulty_max_static_score,
+        )
+    else:
+        if not args.task:
+            raise ValueError(
+                "Runtime scoring needs --static-score and --canonical-paths, "
+                "or --task so those artifacts can be computed."
+            )
+        canonical, static_score = score_task_file(
+            args.task,
+            output_dir=args.artifact_dir,
+            config=config,
+        )
+        run = load_json(args.run)
+        score = compute_runtime_score(
+            run,
+            static_score=static_score,
+            canonical_paths=canonical,
+            config=config,
+            difficulty_max_static_score=args.difficulty_max_static_score,
+        )
+        dump_json(output_path, score.to_dict())
+
+    print(f"{score.task_id}: runtime_score={score.composite:.3f} -> {output_path}")
+    return 0
+
+
+def build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(description="Score MultiNet task and run JSON artifacts.")
+    parser.add_argument(
+        "--config",
+        default=None,
+        help="Optional scorer config JSON/YAML path. Defaults to scorer/scorer_config.json.",
+    )
+
+    subparsers = parser.add_subparsers(dest="command", required=True)
+
+    static_parser = subparsers.add_parser(
+        "static",
+        help="Write canonical_paths.json and scored_static.json for task JSON files.",
+    )
+    static_parser.add_argument("inputs", nargs="+", help="Task JSON files or directories.")
+    static_parser.add_argument(
+        "--output-dir",
+        default=None,
+        help="Directory for score artifacts. Multiple inputs are written under per-file subdirectories.",
+    )
+    static_parser.set_defaults(func=_static)
+
+    runtime_parser = subparsers.add_parser(
+        "runtime",
+        help="Write run_score.json for one run/episode JSON artifact.",
+    )
+    runtime_parser.add_argument("run", help="Run or episode JSON file.")
+    runtime_parser.add_argument("--task", default=None, help="Task JSON file, used when static artifacts are omitted.")
+    runtime_parser.add_argument("--static-score", default=None, help="Existing scored_static.json path.")
+    runtime_parser.add_argument("--canonical-paths", default=None, help="Existing canonical_paths.json path.")
+    runtime_parser.add_argument(
+        "--artifact-dir",
+        default=None,
+        help="Optional directory to write computed static artifacts when --task is used.",
+    )
+    runtime_parser.add_argument("--output", default=None, help="Output run_score.json path.")
+    runtime_parser.add_argument(
+        "--difficulty-max-static-score",
+        type=float,
+        default=None,
+        help="Suite max static score for difficulty normalization. Required unless configured.",
+    )
+    runtime_parser.set_defaults(func=_runtime)
+
+    return parser
+
+
+def main(argv: list[str] | None = None) -> int:
+    parser = build_parser()
+    args = parser.parse_args(argv)
+    return int(args.func(args))
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/tests/test_interface_token_usage.py b/tests/test_interface_token_usage.py
new file mode 100644
index 0000000..a5919f1
--- /dev/null
+++ b/tests/test_interface_token_usage.py
@@ -0,0 +1,101 @@
+from interface.config import ExperimentConfig
+from interface.loader import default_maze_path, load_task
+from interface.runner import build_runner
+from interface.smoke_tests.plans import v01_empty_room_trajectory
+from interface.smoke_tests.smoke_llm import _AgentRecorder
+from interface.telemetry import normalize_token_usage
+
+
+class UsageReplayAgent:
+    def __init__(self):
+        self._actions = iter(v01_empty_room_trajectory())
+        self.last_usage = None
+
+    def __call__(self, messages):
+        self.last_usage = {
+            "input_tokens": 8,
+            "output_tokens": 2,
+            "total_tokens": 10,
+        }
+        return f"FINAL_OUTPUT: {next(self._actions)}"
+
+
+class FirstQueryUsageReplayAgent(UsageReplayAgent):
+    def __init__(self):
+        super().__init__()
+        self._calls = 0
+
+    def __call__(self, messages):
+        self._calls += 1
+        if self._calls == 1:
+            self.last_usage = {
+                "input_tokens": 8,
+                "output_tokens": 2,
+                "total_tokens": 10,
+            }
+        return f"FINAL_OUTPUT: {next(self._actions)}"
+
+
+def test_normalized_usage_accepts_provider_token_keys():
+    assert normalize_token_usage({"input_tokens": 8, "output_tokens": 2}) == {
+        "input_tokens": 8,
+        "output_tokens": 2,
+        "total_tokens": 10,
+    }
+
+
+def test_agent_recorder_forwards_usage_metadata():
+    records = []
+    recorder = _AgentRecorder(UsageReplayAgent(), records)
+
+    recorder([])
+
+    assert recorder.last_usage == {
+        "input_tokens": 8,
+        "output_tokens": 2,
+        "total_tokens": 10,
+    }
+    assert records[0]["usage"]["total_tokens"] == 10
+
+
+def test_runner_persists_agent_usage_in_query_transcript():
+    maze_path = default_maze_path("V01_empty_room.json")
+    backend, spec = load_task(maze_path)
+    runner = build_runner(
+        ExperimentConfig(
+            observation="text_only",
+            context_window="current",
+            querying="step_by_step",
+            chat_history="stateless",
+        ),
+        backend,
+        spec,
+    )
+
+    result = runner.run(UsageReplayAgent(), verbose=False, maze_path=maze_path)
+    query_records = [item for item in result["transcript"] if item.get("kind") == "query"]
+
+    assert result["success"] is True
+    assert query_records
+    assert query_records[0]["usage"]["total_tokens"] == 10
+
+
+def test_runner_clears_stale_usage_between_queries():
+    maze_path = default_maze_path("V01_empty_room.json")
+    backend, spec = load_task(maze_path)
+    runner = build_runner(
+        ExperimentConfig(
+            observation="text_only",
+            context_window="current",
+            querying="step_by_step",
+            chat_history="stateless",
+        ),
+        backend,
+        spec,
+    )
+
+    result = runner.run(FirstQueryUsageReplayAgent(), verbose=False, maze_path=maze_path)
+    query_records = [item for item in result["transcript"] if item.get("kind") == "query"]
+
+    assert query_records[0]["usage"]["total_tokens"] == 10
+    assert "usage" not in query_records[1]
diff --git a/tests/test_scoring_system.py b/tests/test_scoring_system.py
new file mode 100644
index 0000000..b463e18
--- /dev/null
+++ b/tests/test_scoring_system.py
@@ -0,0 +1,549 @@
+import argparse
+import json
+
+import pytest
+
+from gridworld.actions import MiniGridActions
+from gridworld.baselines import plan_bfs_path, trace_planned_actions
+from gridworld.task_spec import TaskSpecification
+from gridworld.task_validator import TaskValidator
+from scorer.artifacts import CanonicalPathReport, ScoredDifficulty
+from scorer.config import (
+    DEFAULT_CONFIG_PATH,
+    DEFAULT_DISTRACTOR_TYPE_WEIGHTS,
+    DEFAULT_RUNTIME_WEIGHTS,
+    DIMENSION_NAMES,
+    load_scorer_config,
+)
+from scorer.scoring import (
+    ScorerConfig,
+    compute_12d_score,
+    compute_canonical_paths,
+    compute_runtime_score,
+    compute_static_score_artifact,
+    score_task_file,
+)
+from scripts.score_json import _default_runtime_output, _runtime, _static_target_dirs
+
+
+def make_spec(**overrides):
+    data = {
+        "task_id": "scorer_case",
+        "seed": 7,
+        "difficulty_tier": 1,
+        "maze": {
+            "dimensions": [5, 5],
+            "walls": [],
+            "start": [1, 1],
+            "goal": [3, 1],
+        },
+        "mechanisms": {},
+        "rules": {"observability": "full", "view_size": 7},
+        "goal": {"type": "reach_position", "target": [3, 1]},
+        "max_steps": 20,
+    }
+    data.update(overrides)
+    return TaskSpecification.from_dict(data)
+
+
+def test_canonical_paths_include_bfs_actions_and_positions():
+    spec = make_spec()
+
+    report = compute_canonical_paths(spec)
+
+    assert report.success is True
+    assert report.actions == ["move_forward", "move_forward"]
+    assert report.positions == [(1, 1), (2, 1), (3, 1)]
+    assert report.optimal_steps == 2
+    assert report.states_explored > 0
+    assert report.greedy is not None
+    assert report.greedy["success"] is True
+
+
+def test_static_score_uses_configurable_weights():
+    spec = make_spec()
+    default_score = compute_12d_score(spec)
+    config = ScorerConfig.from_dict(
+        {
+            "version": "unit",
+            "static_dimension_weights": {
+                "optimal_path_length": 2.0,
+                "grid_size": 0.0,
+            },
+        }
+    )
+
+    weighted = compute_12d_score(spec, config=config)
+
+    assert weighted.weights[0] == 2.0
+    assert weighted.weights[8] == 0.0
+    assert weighted.composite != default_score.composite
+
+
+def test_static_score_rejects_partial_explicit_weight_vectors():
+    spec = make_spec()
+
+    with pytest.raises(ValueError, match="Expected 12 static weights"):
+        compute_12d_score(spec, weights=[1.0, 2.0])
+    with pytest.raises(ValueError, match="Expected 12 static weights"):
+        compute_12d_score(spec, weights=[])
+
+
+def test_shipped_config_matches_code_defaults():
+    config = load_scorer_config(DEFAULT_CONFIG_PATH)
+
+    assert list(config.static_dimension_weights) == DIMENSION_NAMES
+    assert config.distractor_type_weights == DEFAULT_DISTRACTOR_TYPE_WEIGHTS
+    assert config.runtime_weights == DEFAULT_RUNTIME_WEIGHTS
+
+
+def test_explicit_missing_config_path_fails(tmp_path):
+    with pytest.raises(FileNotFoundError, match="Scorer config not found"):
+        load_scorer_config(tmp_path / "missing_config.json")
+
+
+def test_score_task_file_writes_stage_two_artifacts(tmp_path):
+    spec = make_spec()
+    task_path = tmp_path / "task.json"
+    spec.to_json(str(task_path))
+
+    canonical, static_score = score_task_file(task_path, output_dir=tmp_path / "artifacts")
+
+    assert canonical.success is True
+    assert static_score.is_beatable is True
+    assert (tmp_path / "artifacts" / "canonical_paths.json").exists()
+    scored_path = tmp_path / "artifacts" / "scored_static.json"
+    assert scored_path.exists()
+    with open(scored_path) as f:
+        payload = json.load(f)
+    assert payload["task_id"] == spec.task_id
+    assert "dimensions_12" in payload
+    assert "dimensions" not in payload
+    assert "composite" not in payload
+    assert payload["validation"]["schema_valid"] is True
+    assert payload["canonical_agent_features"]["greedy_solvability"] == 1.0
+
+
+def test_score_task_file_reuses_primary_validator_result(tmp_path, monkeypatch):
+    spec = make_spec()
+    task_path = tmp_path / "task.json"
+    spec.to_json(str(task_path))
+    calls = 0
+    original_validate = TaskValidator.validate
+
+    def count_validate(self, *args, **kwargs):
+        nonlocal calls
+        calls += 1
+        return original_validate(self, *args, **kwargs)
+
+    monkeypatch.setattr(TaskValidator, "validate", count_validate)
+
+    score_task_file(task_path)
+
+    assert calls == 1
+
+
+def test_score_task_file_rejects_invalid_schema_before_planning(tmp_path, monkeypatch):
+    spec = make_spec(
+        maze={
+            "dimensions": [5, 5],
+            "walls": [],
+            "start": [1, 1],
+            "goal": [9, 9],
+        },
+        goal={"type": "reach_position", "target": [9, 9]},
+    )
+    task_path = tmp_path / "task.json"
+    spec.to_json(str(task_path))
+
+    def fail_if_called(*args, **kwargs):
+        raise AssertionError("planner must not execute for schema-invalid tasks")
+
+    monkeypatch.setattr("scorer.static.plan_bfs_path", fail_if_called)
+    monkeypatch.setattr("scorer.static.plan_greedy_path", fail_if_called)
+
+    with pytest.raises(ValueError, match="failed schema validation"):
+        score_task_file(task_path)
+
+
+def test_static_score_uses_canonical_bfs_metrics():
+    spec = make_spec()
+    bfs_path = plan_bfs_path(spec)
+    score = compute_12d_score(spec, bfs_path=bfs_path)
+
+    assert score.dimensions[0] == len(bfs_path.action_labels)
+    assert score.dimensions[1] == bfs_path.states_explored
+
+
+def test_runtime_score_from_episode_json_payload():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+    run = {
+        "task_id": spec.task_id,
+        "backend": "minigrid",
+        "adapter": "unit",
+        "model_id": "unit-model",
+        "seed": 7,
+        "success": True,
+        "steps_taken": 2,
+        "terminated": True,
+        "truncated": False,
+        "total_tokens": 500,
+        "trajectory": [
+            {"state": {"agent_position": [1, 1]}},
+            {"state": {"agent_position": [2, 1]}},
+        ],
+        "final_state": {"agent_position": [3, 1], "step_count": 2},
+    }
+
+    config = ScorerConfig.from_dict({"runtime_weights": {"greedy_penalty": 0.0}})
+    score = compute_runtime_score(
+        run,
+        static_score=static_score,
+        canonical_paths=canonical,
+        config=config,
+        difficulty_max_static_score=static_score.static_score,
+    )
+
+    assert score.task_id == spec.task_id
+    assert score.composite == 1.0
+    assert score.signals["step_ratio"] == 1.0
+    assert score.signals["cell_overlap_bfs"] == 1.0
+    assert score.signals["cell_overlap_greedy"] == 1.0
+    assert score.signals["token_efficiency"] == 1.0
+    assert "path_choice" not in score.signals
+    assert "distractor_interactions" not in score.signals
+
+
+def test_runtime_score_prefers_interface_state_after_over_row_col_position_after():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+    run = {
+        "success": True,
+        "steps_used": 2,
+        "total_tokens": 100,
+        "end_reason": "success",
+        "task_spec": spec.to_dict(),
+        "initial_state": {"agent_position": [1, 1]},
+        "final_state": {"agent_position": [3, 1], "step_count": 2},
+        "transcript": [
+            {
+                "kind": "reset",
+                "state": {"agent_position": [1, 1]},
+            },
+            {
+                "kind": "step",
+                "position_after": [1, 2],
+                "state_after": {"agent_position": [2, 1]},
+            },
+            {
+                "kind": "step",
+                "position_after": [1, 3],
+                "state_after": {"agent_position": [3, 1]},
+            },
+        ],
+    }
+
+    config = ScorerConfig.from_dict({"runtime_weights": {"greedy_penalty": 0.0}})
+    score = compute_runtime_score(
+        run,
+        static_score=static_score,
+        canonical_paths=canonical,
+        config=config,
+        difficulty_max_static_score=static_score.static_score,
+    )
+
+    assert score.signals["cell_overlap_bfs"] == 1.0
+
+
+def test_runtime_score_requires_suite_difficulty_normalizer():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+
+    with pytest.raises(ValueError, match="difficulty_max_static_score"):
+        compute_runtime_score(
+            {"success": True, "steps": 2, "total_tokens": 100},
+            static_score=static_score,
+            canonical_paths=canonical,
+        )
+
+
+def test_runtime_score_rejects_suite_max_smaller_than_task_score():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+
+    with pytest.raises(ValueError, match="at least the task static score"):
+        compute_runtime_score(
+            {"success": True, "steps": 2, "total_tokens": 100},
+            static_score=static_score,
+            canonical_paths=canonical,
+            difficulty_max_static_score=static_score.static_score - 1,
+        )
+
+
+def test_runtime_score_rejects_unevaluated_greedy_solvability():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec).to_dict()
+    static_score["canonical_agent_features"]["greedy_solvability"] = None
+
+    with pytest.raises(ValueError, match="greedy_solvability"):
+        compute_runtime_score(
+            {"success": True, "steps": 2, "total_tokens": 100},
+            static_score=static_score,
+            canonical_paths=canonical,
+            difficulty_max_static_score=static_score["static_score"],
+        )
+
+
+def test_runtime_score_rejects_schema_invalid_static_artifact_clearly():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec).to_dict()
+    static_score["validation"]["schema_valid"] = False
+
+    with pytest.raises(ValueError, match="schema-valid"):
+        compute_runtime_score(
+            {"success": True, "steps": 2, "total_tokens": 100},
+            static_score=static_score,
+            canonical_paths=canonical,
+            difficulty_max_static_score=static_score["static_score"],
+        )
+
+
+def test_runtime_token_count_does_not_double_count_nested_step_tokens():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+    score = compute_runtime_score(
+        {
+            "success": True,
+            "steps": 2,
+            "trajectory": [{"tokens": 100, "info": {"tokens": 100}}],
+        },
+        static_score=static_score,
+        canonical_paths=canonical,
+        difficulty_max_static_score=static_score.static_score,
+    )
+
+    assert score.signals["token_count"] == 100
+
+
+def test_runtime_token_count_reads_query_transcript_usage():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+    score = compute_runtime_score(
+        {
+            "success": True,
+            "steps": 2,
+            "transcript": [
+                {
+                    "kind": "query",
+                    "usage": {"input_tokens": 80, "output_tokens": 20},
+                }
+            ],
+        },
+        static_score=static_score,
+        canonical_paths=canonical,
+        difficulty_max_static_score=static_score.static_score,
+    )
+
+    assert score.signals["token_count"] == 100
+
+
+def test_runtime_hash_ignores_non_scoring_transcript_context():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+    base_run = {
+        "success": True,
+        "steps": 2,
+        "total_tokens": 100,
+        "transcript": [
+            {
+                "kind": "query",
+                "agent_messages": [{"role": "user", "content": "first"}],
+            }
+        ],
+    }
+    changed_context = {
+        **base_run,
+        "transcript": [
+            {
+                "kind": "query",
+                "agent_messages": [{"role": "user", "content": "second"}],
+            }
+        ],
+    }
+
+    first = compute_runtime_score(
+        base_run,
+        static_score=static_score,
+        canonical_paths=canonical,
+        difficulty_max_static_score=static_score.static_score,
+    )
+    second = compute_runtime_score(
+        changed_context,
+        static_score=static_score,
+        canonical_paths=canonical,
+        difficulty_max_static_score=static_score.static_score,
+    )
+
+    assert first.inputs_hash == second.inputs_hash
+
+
+@pytest.mark.parametrize("token_count", [None, 0])
+def test_runtime_score_rejects_missing_or_zero_token_telemetry(token_count):
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+    run = {"success": True, "steps": 2}
+    if token_count is not None:
+        run["total_tokens"] = token_count
+
+    with pytest.raises(ValueError, match="token"):
+        compute_runtime_score(
+            run,
+            static_score=static_score,
+            canonical_paths=canonical,
+            difficulty_max_static_score=static_score.static_score,
+        )
+
+
+def test_runtime_score_rejects_missing_step_telemetry():
+    spec = make_spec()
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+
+    with pytest.raises(ValueError, match="step telemetry"):
+        compute_runtime_score(
+            {"success": True, "total_tokens": 100},
+            static_score=static_score,
+            canonical_paths=canonical,
+            difficulty_max_static_score=static_score.static_score,
+        )
+
+
+def test_zero_step_plans_do_not_inflate_optimal_steps_with_done():
+    spec = make_spec(
+        maze={
+            "dimensions": [5, 5],
+            "walls": [],
+            "start": [1, 1],
+            "goal": [1, 1],
+        },
+        goal={"type": "reach_position", "target": [1, 1]},
+    )
+
+    path = plan_bfs_path(spec)
+    traced_done = trace_planned_actions(spec, [int(MiniGridActions.DONE)])
+
+    assert path.success is True
+    assert path.action_labels == []
+    assert traced_done.success is True
+    assert traced_done.action_labels == []
+
+
+def test_runtime_zero_step_success_gets_full_step_credit():
+    spec = make_spec(
+        maze={
+            "dimensions": [5, 5],
+            "walls": [],
+            "start": [1, 1],
+            "goal": [1, 1],
+        },
+        goal={"type": "reach_position", "target": [1, 1]},
+    )
+    canonical = compute_canonical_paths(spec)
+    static_score = compute_static_score_artifact(spec)
+    score = compute_runtime_score(
+        {
+            "success": True,
+            "steps": 0,
+            "total_tokens": 100,
+            "initial_state": {"agent_position": [1, 1]},
+            "final_state": {"agent_position": [1, 1], "step_count": 0},
+        },
+        static_score=static_score,
+        canonical_paths=canonical,
+        config=ScorerConfig.from_dict({"runtime_weights": {"greedy_penalty": 0.0}}),
+        difficulty_max_static_score=static_score.static_score,
+    )
+
+    assert score.signals["step_ratio"] == 1.0
+    assert score.composite == 1.0
+
+
+def test_static_cli_target_dirs_reject_same_stem_collisions(tmp_path):
+    files = [tmp_path / "a" / "task.json", tmp_path / "b" / "task.json"]
+
+    with pytest.raises(ValueError, match="collide"):
+        _static_target_dirs(files, tmp_path / "scores")
+
+
+def test_runtime_cli_default_output_uses_source_stem(tmp_path):
+    assert _default_runtime_output(tmp_path / "run.json") == tmp_path / "run_score.json"
+    assert _default_runtime_output(tmp_path / "episode.json") == tmp_path / "episode_score.json"
+
+
+def test_runtime_cli_rejects_half_specified_artifacts(tmp_path):
+    args = argparse.Namespace(
+        config=None,
+        run=str(tmp_path / "episode.json"),
+        output=None,
+        static_score=str(tmp_path / "scored_static.json"),
+        canonical_paths=None,
+        task=str(tmp_path / "task.json"),
+        artifact_dir=None,
+        difficulty_max_static_score=100.0,
+    )
+
+    with pytest.raises(ValueError, match="provided together"):
+        _runtime(args)
+
+
+def test_runtime_cli_explains_missing_suite_maximum(tmp_path):
+    args = argparse.Namespace(
+        config=None,
+        run=str(tmp_path / "episode.json"),
+        output=None,
+        static_score=str(tmp_path / "scored_static.json"),
+        canonical_paths=str(tmp_path / "canonical_paths.json"),
+        task=None,
+        artifact_dir=None,
+        difficulty_max_static_score=None,
+    )
+
+    with pytest.raises(ValueError, match="--difficulty-max-static-score"):
+        _runtime(args)
+
+
+def test_artifact_serialization_returns_detached_data():
+    scored = ScoredDifficulty(dimensions=[1.0], dimension_names=["only"], weights=[2.0])
+    scored_payload = scored.to_dict()
+    scored_payload["dimensions"][0] = 9.0
+    scored_payload["weights"][0] = 9.0
+
+    canonical = CanonicalPathReport(
+        task_id="task",
+        success=True,
+        actions=["move_forward"],
+        positions=[(1, 1), (2, 1)],
+        optimal_steps=1,
+        states_explored=2,
+        message="ok",
+        greedy={"actions": ["move_forward"]},
+    )
+    canonical_payload = canonical.to_dict()
+    canonical_payload["bfs"]["actions"][0] = "mutated"
+    canonical_payload["greedy"]["actions"][0] = "mutated"
+
+    assert scored.dimensions == [1.0]
+    assert scored.weights == [2.0]
+    assert canonical.actions == ["move_forward"]
+    assert canonical.greedy == {"actions": ["move_forward"]}