refactor: modularize tester.py and incident storage layer by txmed82 · Pull Request #153 · txmed82/retrace

txmed82 · 2026-05-15T01:44:02Z

Overview

This PR addresses three critical gaps identified in the Retrace codebase: architectural technical debt, minimal E2E coverage, and naive AI matching capabilities.

Key Changes

1. Architectural Technical Debt (Refactoring)

Tester Package: Transformed the monolithic 3,000-line src/retrace/tester.py into a cohesive package (src/retrace/tester/). Logic is now distributed into models.py, specs.py, assertions.py, and harness.py.
Incident Storage Layer: Extracted incident, alert, and deploy marker logic from src/retrace/storage/core.py into a standalone IncidentRepository. Established a BaseRepository pattern and moved utility functions to a shared helpers.py.
Backward Compatibility: Maintained the existing public API through a facade pattern in Storage and clean exports in tester/__init__.py.

2. E2E Testing (The "Killer Demo Loop")

New E2E Test: Added tests/e2e/test_killer_demo_loop.py which exercises the entire pipeline: Ingest (Sentry compat) -> Incident Creation -> QA Bridge -> Auto-repro -> Fix Generation -> PR Creation.
Improved Reliability: This suite ensures that "broken pipes" between discrete subsystems are caught before release.

3. AI Capabilities (Semantic Code Search)

AI-Enhanced Scorer: Introduced src/retrace/matching/ai_scorer.py which uses the LLMClient to rerank code candidates found via keyword search. This pass reasons about the relationship between symptoms (stack traces, console logs) and file paths, significantly improving fix accuracy.
Integrated Pass: Updated scorer.py and auto_fix.py to leverage this reranking during the fix proposal flow.

Verification

Test Results: 67 tests passing, including 61 core storage/tester tests, the new 'Killer Demo' E2E test, and the new AI Scorer unit tests.
API Integrity: Confirmed that all existing command-line interfaces and internal integrations remain fully functional.

Summary by CodeRabbit

New Features
- Expanded incident management: alert routes, deploy markers, source-map uploads, lifecycle events, linking/moving failures, and rollups
- New tester workflow: spec create/load/validate, engine selection, richer assertion evaluation (including model-backed consensus), run artifacts/results
- AI-assisted matching: optional LLM reranking for code candidates
Tests
- Added end-to-end demo test and unit tests for AI reranking
Chores
- Safer JSON/ID/slug/time helpers, input normalization, and CLI LLM integration tweaks

- Convert src/retrace/tester.py into a package with domain-specific modules (models, specs, harness, assertions) - Extract incident logic from src/retrace/storage/core.py into IncidentRepository - Implement facade pattern in Storage class to maintain backward compatibility - Verified zero regressions with 61+ passing tests

coderabbitai · 2026-05-15T01:44:15Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 9742b894-d4e9-4777-969c-21fa574a9e8d

📥 Commits

Reviewing files that changed from the base of the PR and between a126850 and 0d3dd55.

📒 Files selected for processing (4)

src/retrace/commands/qa.py
src/retrace/matching/scorer.py
src/retrace/tester/__init__.py
src/retrace/tester/harness.py

🚧 Files skipped from review as they are similar to previous changes (3)

src/retrace/matching/scorer.py
src/retrace/commands/qa.py
src/retrace/tester/init.py

📝 Walkthrough

Walkthrough

Adds an LLM-based candidate reranker and integrates optional LLM usage into repository scoring and fix-proposal flows; extracts tester models/specs/assertions into dedicated modules and updates harness to import them; implements storage helpers and a full IncidentRepository; adds unit and E2E tests for the new flows.

Changes

AI reranking & CLI LLM plumbing

Layer / File(s)	Summary
AI reranker implementation `src/retrace/matching/ai_scorer.py`	Adds `rerank_candidates_with_ai` which builds a system/user prompt, calls `llm.chat_json(...)` for a `ranked_files` JSON result, normalizes paths, maps back to `CodeCandidate`, fills/backfills to `top_n`, and falls back to keyword ordering on errors.
Scorer: optional LLM integration `src/retrace/matching/scorer.py`	`score_repo_for_finding(..., llm: Optional[LLMClient] = None)` now conditionally uses `rerank_candidates_with_ai` when an `llm` is provided; otherwise preserves previous local scoring return path.
CLI and auto-fix LLM plumbing `src/retrace/auto_fix.py`, `src/retrace/commands/qa.py`	`propose_fix_for_incident` gains an optional `llm` parameter and passes it into `score_repo_for_finding`; QA CLI commands instantiate `LLMClient(cfg.llm)` and pass the client into `propose_fix_for_incident`.
Unit and E2E tests for flow `tests/test_matching_ai_scorer.py`, `tests/e2e/test_killer_demo_loop.py`	Adds unit tests verifying reranker ordering and fallback behavior, plus an end-to-end QA auto test that posts a mocked Sentry envelope, asserts incident bridging, runs `retrace qa auto` with patched helpers, and validates generated prompt and QA incident PR/state fields.

Tester refactor: models, specs, assertions

Layer / File(s)	Summary
Package public surface `src/retrace/tester/__init__.py`	Re-exports tester public surface: models, specs, assertions, and harness orchestration helpers.
Tester dataclasses and utilities `src/retrace/tester/models.py`	Adds `TesterSpec`, `TesterRunResult`, assertion/artifact dataclasses, constants, `now_iso`, `slugify`, and data-dir helper functions.
Spec persistence and validation `src/retrace/tester/specs.py`	Implements safe spec path resolution, save/load/list, defaults/migrations, `validate_spec`, `create_spec`, API-only detection, and `select_execution_engine`.
Assertion evaluation & classification `src/retrace/tester/assertions.py`	Implements native and model-backed consensus evaluation, LLM judge invocation/JSON parsing, evidence capture/redaction, and failure classification helpers.
Harness imports updated `src/retrace/tester/harness.py`	Removes in-file model/spec/assertion logic and imports the new modules while preserving orchestration behavior.

Storage helpers and IncidentRepository

Layer / File(s)	Summary
Storage helper primitives `src/retrace/storage/helpers.py`	Adds ID/public-id/time helpers, resilient JSON parsing/merging, replay preview helpers, slug generation, updated `_string_values`, and updated status-normalization behaviors.
Repository base `src/retrace/storage/repositories/base.py`	Adds `BaseRepository` holding a `Storage` ref and exposing `_conn()` for sqlite3 connections.
Incident core APIs `src/retrace/storage/repositories/incidents.py`	Adds `IncidentRepository` upsert/link/move, get/find/list, and failure/evidence listing.
Lifecycle and repair `src/retrace/storage/repositories/incidents.py`	Adds app-error lifecycle transitions, lifecycle-event append/listing, repair-task setting, and row→model mapping helpers.
Alerting, deploys, source-maps `src/retrace/storage/repositories/incidents.py`	Implements alert rules/routes and dispatch recording/dedup, deploy marker persistence/queries, and Source Map v3 validation and listing.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

txmed82/retrace#117: Modifies propose_fix_for_incident orchestration and related fix application paths; related to LLM/plumbing changes here.
txmed82/retrace#113: Also updates propose_fix_for_incident and worktree-based fix application; overlaps the fix-proposal call sites.
txmed82/retrace#6: Touches storage helper status-normalization logic in src/retrace/storage/helpers.py and replay-core components used here.

Poem

"I hopped through code to rank each file,
Prompted an LLM to sort with style,
The tester grew tidy, split into parts,
Storage learned incidents and careful arts,
A rabbit cheers — small hops, big smiles!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately summarizes the main refactoring efforts: modularizing the monolithic tester module and reorganizing the incident storage layer into separate, focused components.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch refactor/modularize-tester-and-incidents

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/retrace/tester/harness.py (2)
1630-1630: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

DEFAULT_APP_URL is undefined — will raise NameError at runtime.

Line 1630 uses DEFAULT_APP_URL, but this constant is not imported from .models (lines 31–43), even though it is defined there. The import block includes DEFAULT_HARNESS_COMMAND but not DEFAULT_APP_URL, causing a NameError when the code path at line 1630 executes.
🐛 Proposed fix: add missing import
 from .models import (
     DEFAULT_HARNESS_COMMAND,
+    DEFAULT_APP_URL,
     FAILURE_CLASSIFICATIONS,
     SUITE_PROPOSAL_SCHEMA_VERSION,
     EngineSelection,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/retrace/tester/harness.py` at line 1630, The code references
DEFAULT_APP_URL when computing app_url in harness.py but that constant isn't
imported, causing a NameError; update the import block that currently brings in
DEFAULT_HARNESS_COMMAND from .models to also import DEFAULT_APP_URL (i.e., add
DEFAULT_APP_URL to the tuple of symbols imported from .models) so
DEFAULT_APP_URL is defined when used in the expression that sets app_url.
1830-1944: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Remove the local function definitions that shadow imported functions from .assertions.

The functions _classify_failure, _assertion_text_for_classification, _failed_selector_assertion, and _flake_reason_from_classification are imported from .assertions (lines 44–55) but also defined locally (lines 1830–1944), causing the imports to be shadowed.

The implementations are substantially different, not just duplicates:

_classify_failure: Local version adds "test_bug" classification and more auth/selector checks; assertions.py is simpler

_assertion_text_for_classification: Local version extracts more fields (assertion_type, step, assertion) with JSON serialization; assertions.py only extracts message/expected/actual

_failed_selector_assertion: Local version checks assertion_type; assertions.py checks for keywords in message string

_flake_reason_from_classification: Local version returns the classification itself; assertions.py returns human-readable messages

Determine which version is correct for the codebase, then remove the other to eliminate duplication and confusion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/retrace/tester/harness.py` around lines 1830 - 1944, The local
definitions of _classify_failure, _assertion_text_for_classification,
_failed_selector_assertion, and _flake_reason_from_classification are shadowing
the versions imported from .assertions; remove these four local function
definitions so the module uses the canonical implementations from .assertions
(leave the import lines intact), and then run the test suite and adjust any call
sites if necessary to match the .assertions signatures/behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/retrace/storage/helpers.py`:
- Around line 116-139: This block defines _string_values,
_normalize_github_review_run_status, and _normalize_app_error_incident_status
but the same functions are redefined later, making these implementations dead;
remove the stale duplicate definitions appearing later in the module so only
these correct implementations remain (ensure there is exactly one definition for
each of _string_values, _normalize_github_review_run_status, and
_normalize_app_error_incident_status), update any imports/usages if necessary,
and run tests to confirm comma-separated trace IDs split, unknown GitHub
statuses default to "queued", and "new" app-error status maps to "open".

In `@src/retrace/storage/repositories/incidents.py`:
- Around line 1409-1417: The names field in the source_map validator currently
only checks that names is a list but not that its items are strings; update the
validation in the block handling source_map (the variables source_map and names)
to iterate the list and ensure every element is an instance of str, and if any
element is not a string raise a ValueError like "source_map names must be a list
of strings when provided" so malformed entries (e.g., [1,2]) are rejected before
persisting.
- Around line 1217-1228: The get_deploy_marker method currently falls back to
matching sha (OR sha = ?), which is unsafe because sha is only unique within a
workspace; remove the sha branch from the SQL and the third bind parameter so
the query only checks id OR public_id, and rely on the existing
get_deploy_marker_by_sha() for scoped SHA lookups; update the tuple of
parameters passed to conn.execute and any related docstring/comments in the
get_deploy_marker function to reflect the change.

In `@src/retrace/tester/harness.py`:
- Around line 217-228: There is a duplicate import block that repeats the same
symbols (_assertion_result, _assertion_text_for_classification,
_classify_failure, _evaluate_consensus_assertion,
_evaluate_model_backed_consensus_assertion, _evaluate_native_assertion,
_failed_selector_assertion, _flake_reason_from_classification,
_redacted_response_headers, _response_assertion_evidence); remove the redundant
block so each symbol is only imported once. Locate the repeated import statement
and delete the second occurrence, leaving the original import (the one near the
top of the module) intact to avoid accidental breakage.

In `@src/retrace/tester/specs.py`:
- Around line 24-26: The import of _should_use_playwright inside a TYPE_CHECKING
block is wrong because TYPE_CHECKING is False at runtime; remove the
TYPE_CHECKING guard and any static-only import so that the runtime import
pattern used in _needs_playwright_runtime remains the single source of truth.
Specifically, delete or move the "from .harness import _should_use_playwright"
under the TYPE_CHECKING block in this file (src/retrace/tester/specs.py) and
ensure _needs_playwright_runtime continues to perform the runtime import of
_should_use_playwright (so no NameError occurs when _needs_playwright_runtime
calls _should_use_playwright).

---

Outside diff comments:
In `@src/retrace/tester/harness.py`:
- Line 1630: The code references DEFAULT_APP_URL when computing app_url in
harness.py but that constant isn't imported, causing a NameError; update the
import block that currently brings in DEFAULT_HARNESS_COMMAND from .models to
also import DEFAULT_APP_URL (i.e., add DEFAULT_APP_URL to the tuple of symbols
imported from .models) so DEFAULT_APP_URL is defined when used in the expression
that sets app_url.
- Around line 1830-1944: The local definitions of _classify_failure,
_assertion_text_for_classification, _failed_selector_assertion, and
_flake_reason_from_classification are shadowing the versions imported from
.assertions; remove these four local function definitions so the module uses the
canonical implementations from .assertions (leave the import lines intact), and
then run the test suite and adjust any call sites if necessary to match the
.assertions signatures/behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 2540eeb2-a1b6-42c6-a0ab-0c665cc27839

📥 Commits

Reviewing files that changed from the base of the PR and between c0e873c and 4289a49.

📒 Files selected for processing (10)

src/retrace/storage/core.py
src/retrace/storage/helpers.py
src/retrace/storage/repositories/__init__.py
src/retrace/storage/repositories/base.py
src/retrace/storage/repositories/incidents.py
src/retrace/tester/__init__.py
src/retrace/tester/assertions.py
src/retrace/tester/harness.py
src/retrace/tester/models.py
src/retrace/tester/specs.py

coderabbitai · 2026-05-15T01:48:54Z

+def _string_values(raw: object) -> list[str]:
+    if not raw:
+        return []
+    if isinstance(raw, list):
+        return [str(v) for v in raw if v]
+    if isinstance(raw, str):
+        return [v.strip() for v in raw.split(",") if v.strip()]
+    return [str(raw)]
+
+def _normalize_github_review_run_status(value: object) -> str:
+    GITHUB_REVIEW_RUN_STATUSES = ("queued", "running", "succeeded", "failed", "canceled")
+    s = str(value or "").strip().lower()
+    if s in GITHUB_REVIEW_RUN_STATUSES:
+        return s
+    return "queued"
+
+def _normalize_app_error_incident_status(value: object) -> str:
+    APP_ERROR_INCIDENT_STATUSES = ("open", "triaged", "investigating", "resolved", "ignored")
+    s = str(value or "").strip().lower()
+    if s in APP_ERROR_INCIDENT_STATUSES:
+        return s
+    if s == "new":
+        return "open"
+    raise ValueError(f"invalid app-error incident status: {value!r}")


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove the stale redefinitions below this block.

_string_values, _normalize_github_review_run_status, and _normalize_app_error_incident_status are defined again later in this module, so Python keeps the old implementations instead of these ones. That makes the new behavior dead code: comma-separated trace IDs still won't split, unknown GitHub review statuses will still raise, and "new" incident statuses will still fail normalization.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/retrace/storage/helpers.py` around lines 116 - 139, This block defines _string_values, _normalize_github_review_run_status, and _normalize_app_error_incident_status but the same functions are redefined later, making these implementations dead; remove the stale duplicate definitions appearing later in the module so only these correct implementations remain (ensure there is exactly one definition for each of _string_values, _normalize_github_review_run_status, and _normalize_app_error_incident_status), update any imports/usages if necessary, and run tests to confirm comma-separated trace IDs split, unknown GitHub statuses default to "queued", and "new" app-error status maps to "open".

coderabbitai · 2026-05-15T01:48:54Z

+    def get_deploy_marker(self, deploy_id: str) -> Optional[DeployMarkerRow]:
+        with self._conn() as conn:
+            row = conn.execute(
+                """
+                SELECT *
+                FROM deploy_markers
+                WHERE id = ? OR public_id = ? OR sha = ?
+                ORDER BY deployed_at_ms DESC
+                LIMIT 1
+                """,
+                (deploy_id, deploy_id, deploy_id),
+            ).fetchone()


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't fall back to raw SHA in this unscoped getter.

sha is only unique within a workspace. The OR sha = ? branch can return another project's/environment's deploy marker for the same commit, even though get_deploy_marker_by_sha() already provides the scoped lookup.

Suggested fix

def get_deploy_marker(self, deploy_id: str) -> Optional[DeployMarkerRow]: with self._conn() as conn: row = conn.execute( """ SELECT * FROM deploy_markers - WHERE id = ? OR public_id = ? OR sha = ? - ORDER BY deployed_at_ms DESC - LIMIT 1 + WHERE id = ? OR public_id = ? """, - (deploy_id, deploy_id, deploy_id), + (deploy_id, deploy_id), ).fetchone() return self._deploy_marker_from_row(row) if row is not None else None

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/retrace/storage/repositories/incidents.py` around lines 1217 - 1228, The get_deploy_marker method currently falls back to matching sha (OR sha = ?), which is unsafe because sha is only unique within a workspace; remove the sha branch from the SQL and the third bind parameter so the query only checks id OR public_id, and rely on the existing get_deploy_marker_by_sha() for scoped SHA lookups; update the tuple of parameters passed to conn.execute and any related docstring/comments in the get_deploy_marker function to reflect the change.

coderabbitai · 2026-05-15T01:48:54Z

+        names = source_map.get("names")
+        if names is not None and not isinstance(names, list):
+            raise ValueError("source_map names must be a list when provided")
+        source_root = source_map.get("sourceRoot")
+        if source_root is not None and not isinstance(source_root, str):
+            raise ValueError("source_map sourceRoot must be a string when provided")
+        file_value = source_map.get("file")
+        if file_value is not None and not isinstance(file_value, str):
+            raise ValueError("source_map file must be a string when provided")


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject non-string entries in names.

This validator currently accepts names=[1, 2], so malformed source maps can still be persisted even though this method is the write-time schema gate.

Suggested fix

names = source_map.get("names") - if names is not None and not isinstance(names, list): - raise ValueError("source_map names must be a list when provided") + if names is not None: + if not isinstance(names, list): + raise ValueError("source_map names must be a list when provided") + if not all(isinstance(item, str) for item in names): + raise ValueError("source_map names must contain only strings")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/retrace/storage/repositories/incidents.py` around lines 1409 - 1417, The names field in the source_map validator currently only checks that names is a list but not that its items are strings; update the validation in the block handling source_map (the variables source_map and names) to iterate the list and ensure every element is an instance of str, and if any element is not a string raise a ValueError like "source_map names must be a list of strings when provided" so malformed entries (e.g., [1,2]) are rejected before persisting.

- Implement tests/e2e/test_killer_demo_loop.py for full pipeline validation - Introduce AI-enhanced candidate reranking in src/retrace/matching/ai_scorer.py - Update scorer.py and auto_fix.py to use LLM for more accurate culprit identification - Add unit tests for AI scorer

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/retrace/commands/qa.py (1)

176-190: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Keep the loaded config instead of discarding it.

Line 176 binds _, but Line 189 reads cfg.llm, so retrace qa fix will fail before fix generation starts.

🐛 Proposed fix

-    store, _ = _open_store(config_path)
+    store, cfg = _open_store(config_path)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/retrace/commands/qa.py` around lines 176 - 190, The call to _open_store
currently discards the returned cfg by binding it to `_`, but later code uses
cfg.llm (causing a crash); change the assignment so you capture both return
values (e.g., store, cfg = _open_store(config_path)) or otherwise keep the
config returned by _open_store; ensure the rest of the function (references to
cfg, the LLMClient context manager, and propose_fix_for_incident) use that
preserved cfg instead of the discarded value.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/retrace/matching/ai_scorer.py`:
- Around line 1-10: The module-level docstring must appear before any imports
and unused imports should be removed; move the triple-quoted string
"""AI-enhanced code candidate reranking.""" to the very top of
src/retrace/matching/ai_scorer.py, delete the unused stdlib imports json and
typing.Any, and ensure imports are ordered (stdlib -> third-party -> local) so
you only keep logging and the local imports retrace.llm.client.LLMClient and
retrace.matching.scorer.CodeCandidate.
- Around line 68-73: The loop that maps LLM-ranked paths to final_candidates can
append duplicate entries when ranked_paths contains repeats; update the code in
the block that iterates over ranked_paths to maintain a seen set (e.g.,
seen_paths) of cleaned paths (clean_path) and only append
path_to_candidate[clean_path] if clean_path is not in seen_paths, adding
clean_path to seen_paths when appended; this ensures deduplication of LLM-ranked
paths before populating final_candidates and prevents dropping unique candidates
from the top-N.

In `@src/retrace/matching/scorer.py`:
- Line 403: The function signature uses the types Optional and LLMClient (e.g.,
the parameter "llm: Optional[LLMClient] = None") but neither type is imported,
causing undefined-name errors; add the missing imports at module level—import
Optional from typing and import or reference the LLMClient type from its
defining module (or the correct package) and update imports so Optional and
LLMClient are available for the scorer function signature (e.g., ensure "from
typing import Optional" and "from <llm_module> import LLMClient" are present).

In `@tests/e2e/test_killer_demo_loop.py`:
- Around line 15-21: The file contains unused imports "os" and "pytest" (see the
top import block in tests/e2e/test_killer_demo_loop.py); remove those two import
statements so the module no longer triggers F401 from Ruff—keep the other
imports (uuid, HTTPConnection, Path, patch, MagicMock) intact.

In `@tests/test_matching_ai_scorer.py`:
- Around line 3-7: Remove the unused imports in
tests/test_matching_ai_scorer.py: delete the top-level imports of json and
pytest and keep only the needed unittest.mock import (MagicMock); ensure the
file only imports symbols that are referenced to satisfy Ruff F401.

---

Outside diff comments:
In `@src/retrace/commands/qa.py`:
- Around line 176-190: The call to _open_store currently discards the returned
cfg by binding it to `_`, but later code uses cfg.llm (causing a crash); change
the assignment so you capture both return values (e.g., store, cfg =
_open_store(config_path)) or otherwise keep the config returned by _open_store;
ensure the rest of the function (references to cfg, the LLMClient context
manager, and propose_fix_for_incident) use that preserved cfg instead of the
discarded value.

🪄 Autofix (Beta)

✅ Autofix completed

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e3456ce1-73b8-44dd-81b4-f04a61ab3d2a

📥 Commits

Reviewing files that changed from the base of the PR and between 4289a49 and 54ae193.

📒 Files selected for processing (6)

src/retrace/auto_fix.py
src/retrace/commands/qa.py
src/retrace/matching/ai_scorer.py
src/retrace/matching/scorer.py
tests/e2e/test_killer_demo_loop.py
tests/test_matching_ai_scorer.py

coderabbitai · 2026-05-15T12:32:52Z

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 12 file(s) based on 5 unresolved review comments.

Files modified:

src/retrace/commands/qa.py
src/retrace/matching/ai_scorer.py
src/retrace/matching/scorer.py
src/retrace/storage/core.py
src/retrace/storage/helpers.py
src/retrace/storage/repositories/base.py
src/retrace/storage/repositories/incidents.py
src/retrace/tester/assertions.py
src/retrace/tester/harness.py
src/retrace/tester/specs.py
tests/e2e/test_killer_demo_loop.py
tests/test_matching_ai_scorer.py

Commit: a1268505f1114c896152e49b9530fe1b2278d4bc

The changes have been pushed to the refactor/modularize-tester-and-incidents branch.

Time taken: 3m 31s

Fixed 12 file(s) based on 5 unresolved review comments. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/retrace/tester/harness.py (2)

1612-1612: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Undefined name DEFAULT_APP_URL causes runtime NameError.

DEFAULT_APP_URL is used but never imported or defined. The constant should be imported from .models alongside DEFAULT_HARNESS_COMMAND, or defined locally if it doesn't exist in the models module.

🐛 Proposed fix: add import from .models

 from .models import (
     DEFAULT_HARNESS_COMMAND,
+    DEFAULT_APP_URL,
     SUITE_PROPOSAL_SCHEMA_VERSION,
     TesterArtifact,

If DEFAULT_APP_URL doesn't exist in .models, define it locally:

+DEFAULT_APP_URL = "http://localhost:3000"
+
 logger = logging.getLogger(__name__)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/retrace/tester/harness.py` at line 1612, The NameError arises because
DEFAULT_APP_URL is referenced in the expression that sets app_url (app_url =
(app_url_override or spec.app_url or DEFAULT_APP_URL).strip()) but is not
defined or imported; fix by importing DEFAULT_APP_URL from .models alongside
DEFAULT_HARNESS_COMMAND (or, if .models doesn't export DEFAULT_APP_URL, add a
local constant DEFAULT_APP_URL with the expected default value) and ensure the
import/definition is placed near other constants so app_url has a valid
fallback.

1812-1926: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove redefined functions that are already imported from .assertions.

These four functions are imported from .assertions at lines 41-49 but are also defined locally here. This causes Ruff F811 redefinition errors and is a refactoring artifact that should be cleaned up.

Remove the following local definitions:

_classify_failure (lines 1812-1882)
_assertion_text_for_classification (lines 1885-1903)
_failed_selector_assertion (lines 1906-1915)
_flake_reason_from_classification (lines 1918-1926)

🗑️ Proposed fix: remove redundant local definitions

-def _classify_failure(
-    *,
-    harness_log_path: Path,
-    error: str,
-    assertion_results: list[dict[str, Any]],
-    exit_code: int,
-) -> str:
-    text = ""
-    try:
-        if harness_log_path.exists():
-            text = harness_log_path.read_text(encoding="utf-8", errors="ignore").lower()
-    except Exception:
-        text = ""
-    failed_assertions = [
-        item for item in assertion_results if not bool(item.get("ok", False))
-    ]
-    merged = "\n".join(
-        [
-            text,
-            str(error or ""),
-            _assertion_text_for_classification(failed_assertions),
-        ]
-    ).lower()
-    if any(
-        k in merged
-        for k in [
-            "app did not become reachable",
-            "connection refused",
-            "econnrefused",
-            "net::err_connection_refused",
-            "failed to connect",
-            "could not connect",
-            "server unavailable",
-        ]
-    ):
-        return "environment_failure"
-    if any(k in merged for k in ["timeout", "timed out", "deadline exceeded"]):
-        return "timeout"
-    if any(
-        k in merged
-        for k in [
-            "needs selector",
-            "unsupported_in_playwright",
-            "unsupported native assertion type",
-            "unsupported native step action",
-            "invalid_regex",
-            "unknown action",
-            "malformed",
-        ]
-    ):
-        return "test_bug"
-    if any(k in merged for k in ["401", "403", "unauthorized", "forbidden"]):
-        return "auth_failure"
-    if any(k in merged for k in ["authentication failed", "auth failure", "login failed"]):
-        return "auth_failure"
-    if _failed_selector_assertion(failed_assertions) or any(
-        k in merged
-        for k in [
-            "element not found",
-            "stale element",
-            "waiting for selector",
-            "waiting for locator",
-            "strict mode violation",
-        ]
-    ):
-        return "selector_drift"
-    if failed_assertions:
-        return "app_bug"
-    if int(exit_code) != 0 or error:
-        return "unknown"
-    return "unknown"
-
-
-def _assertion_text_for_classification(items: list[dict[str, Any]]) -> str:
-    chunks: list[str] = []
-    for item in items:
-        for key in (
-            "assertion_type",
-            "message",
-            "actual",
-            "expected",
-            "step",
-            "assertion",
-        ):
-            value = item.get(key)
-            if value is None:
-                continue
-            if isinstance(value, (dict, list)):
-                chunks.append(json.dumps(value, sort_keys=True, default=str))
-            else:
-                chunks.append(str(value))
-    return "\n".join(chunks)
-
-
-def _failed_selector_assertion(items: list[dict[str, Any]]) -> bool:
-    for item in items:
-        assertion_type = str(item.get("assertion_type") or "").lower()
-        if assertion_type in {
-            "selector_visible",
-            "element_visible",
-            "visible",
-        }:
-            return True
-    return False
-
-
-def _flake_reason_from_classification(failure_classification: str) -> str:
-    if failure_classification in {
-        "auth_failure",
-        "environment_failure",
-        "selector_drift",
-        "timeout",
-    }:
-        return failure_classification
-    return ""
-
-
 def load_run_summaries(runs_dir: Path, *, limit: int = 25) -> list[dict[str, Any]]:

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/retrace/tester/harness.py` around lines 1812 - 1926, The four functions
_classify_failure, _assertion_text_for_classification,
_failed_selector_assertion, and _flake_reason_from_classification are duplicates
of the versions imported from .assertions; delete their local definitions in
this file so the module uses the imported implementations (keep any references
to these names elsewhere unchanged), and run tests/linter to confirm the Ruff
F811 redefinition errors are resolved.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/retrace/commands/qa.py`:
- Around line 187-200: propose_fix_for_incident currently always receives an
LLMClient because LLMClient(cfg.llm) is unconditionally constructed in the CLI
handlers; to preserve the optional-LLM contract, only instantiate LLMClient when
LLM is configured and otherwise pass llm=None into propose_fix_for_incident.
Update both call sites that create LLMClient (the block using "with
LLMClient(cfg.llm) as llm" around propose_fix_for_incident and the similar block
at the other handler) to check cfg.llm (or an explicit enabled flag) before
creating the client, and call propose_fix_for_incident(..., llm=llm) or
propose_fix_for_incident(..., llm=None) accordingly so the function’s llm=None
path is reachable.

In `@src/retrace/matching/scorer.py`:
- Around line 446-456: The code currently passes the entire positive-scoring
list `scored` into `rerank_candidates_with_ai(...)`; instead, take a bounded
shortlist (e.g., select the top K entries by existing score) before calling the
LLM to avoid huge prompts—compute K (e.g., `ai_shortlist_size` or reuse
`max(top_n, N)`) and create `shortlist = scored[:K]` (or an explicit sort+slice
if `scored` isn’t pre-ordered), then call `rerank_candidates_with_ai(llm=llm,
candidates=shortlist, title=title, category=category,
evidence_text=evidence_text, top_n=top_n)`.

---

Outside diff comments:
In `@src/retrace/tester/harness.py`:
- Line 1612: The NameError arises because DEFAULT_APP_URL is referenced in the
expression that sets app_url (app_url = (app_url_override or spec.app_url or
DEFAULT_APP_URL).strip()) but is not defined or imported; fix by importing
DEFAULT_APP_URL from .models alongside DEFAULT_HARNESS_COMMAND (or, if .models
doesn't export DEFAULT_APP_URL, add a local constant DEFAULT_APP_URL with the
expected default value) and ensure the import/definition is placed near other
constants so app_url has a valid fallback.
- Around line 1812-1926: The four functions _classify_failure,
_assertion_text_for_classification, _failed_selector_assertion, and
_flake_reason_from_classification are duplicates of the versions imported from
.assertions; delete their local definitions in this file so the module uses the
imported implementations (keep any references to these names elsewhere
unchanged), and run tests/linter to confirm the Ruff F811 redefinition errors
are resolved.

🪄 Autofix (Beta)

✅ Autofix completed

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 14467e64-e00f-4445-8182-4c03d064aa63

📥 Commits

Reviewing files that changed from the base of the PR and between 54ae193 and a126850.

📒 Files selected for processing (12)

src/retrace/commands/qa.py
src/retrace/matching/ai_scorer.py
src/retrace/matching/scorer.py
src/retrace/storage/core.py
src/retrace/storage/helpers.py
src/retrace/storage/repositories/base.py
src/retrace/storage/repositories/incidents.py
src/retrace/tester/assertions.py
src/retrace/tester/harness.py
src/retrace/tester/specs.py
tests/e2e/test_killer_demo_loop.py
tests/test_matching_ai_scorer.py

🚧 Files skipped from review as they are similar to previous changes (7)

src/retrace/storage/repositories/base.py
tests/test_matching_ai_scorer.py
src/retrace/matching/ai_scorer.py
tests/e2e/test_killer_demo_loop.py
src/retrace/storage/helpers.py
src/retrace/tester/specs.py
src/retrace/tester/assertions.py

coderabbitai · 2026-05-15T13:06:08Z

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 4 file(s) based on 2 unresolved review comments.

Files modified:

src/retrace/commands/qa.py
src/retrace/matching/scorer.py
src/retrace/tester/__init__.py
src/retrace/tester/harness.py

Commit: 0d3dd55cdbafeb62362fde26577be3b5e2ee2235

The changes have been pushed to the refactor/modularize-tester-and-incidents branch.

Time taken: 3m 34s

Fixed 4 file(s) based on 2 unresolved review comments. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

Comment thread src/retrace/matching/ai_scorer.py Outdated

Comment thread src/retrace/matching/ai_scorer.py

Comment thread src/retrace/matching/scorer.py

Comment thread tests/e2e/test_killer_demo_loop.py Outdated

Comment thread tests/test_matching_ai_scorer.py Outdated

fix: apply CodeRabbit auto-fixes

a126850

Fixed 12 file(s) based on 5 unresolved review comments. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

Comment thread src/retrace/commands/qa.py Outdated

Comment thread src/retrace/matching/scorer.py

fix: apply CodeRabbit auto-fixes

0d3dd55

Fixed 4 file(s) based on 2 unresolved review comments. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>

txmed82 merged commit 00c0422 into master May 15, 2026
9 of 11 checks passed

coderabbitai Bot mentioned this pull request May 19, 2026

fix: restore phase 0 qa trust #154

Merged

4 tasks

Conversation

txmed82 commented May 15, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Key Changes

1. Architectural Technical Debt (Refactoring)

2. E2E Testing (The "Killer Demo Loop")

3. AI Capabilities (Semantic Code Search)

Verification

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fixes Applied Successfully

Uh oh!

coderabbitai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fixes Applied Successfully

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

txmed82 commented May 15, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 15, 2026 •

edited

Loading

coderabbitai Bot left a comment •

edited

Loading

coderabbitai Bot commented May 15, 2026 •

edited

Loading

coderabbitai Bot left a comment •

edited

Loading

coderabbitai Bot commented May 15, 2026 •

edited

Loading