diff --git a/backend/evaluate/EVALUATION.md b/backend/evaluate/EVALUATION.md index fa2f7f28..a08a6dc2 100644 --- a/backend/evaluate/EVALUATION.md +++ b/backend/evaluate/EVALUATION.md @@ -355,6 +355,131 @@ The judge compares what the chatbot actually said against what it should have sa --- +## Understanding and diagnosing score variance + +### Two noise sources + +Every score you see in an experiment reflects two independent sources of randomness: + +- **Agent variance** — the chatbot gives a slightly different response each time to the same question (LLM temperature > 0). Running the same example ten times will produce ten different outputs, and some will score higher or lower than others. +- **Evaluator variance** — the LLM judge assigns a slightly different score each time it reads the same output. Even a fixed response, shown to the judge five times, may get 1.0 twice and 0.5 three times. + +Total observed variance decomposes as: + +``` +σ²_total = σ²_agent + σ²_evaluator +``` + +This matters because the two sources call for different fixes: agent variance requires more repetitions per example; evaluator variance requires a better judge or a tighter rubric. + +### Measuring evaluator variance + +`measure_evaluator_variance.py` isolates evaluator variance by re-scoring the same fixed agent outputs multiple times. No new agent calls are made, so this is cheap (~10–20× cheaper per sample than a full evaluation run). + +```bash +# Re-score all runs from an experiment 5 times each (default) +uv run python -m evaluate.measure_evaluator_variance \ + --experiment \ + --evaluator "legal correctness" + +# Use more repeats for a tighter estimate; limit to 3 runs per scenario to keep it fast +uv run python -m evaluate.measure_evaluator_variance \ + --experiment \ + --evaluator "legal correctness" \ + -k 7 \ + --runs-per-scenario 3 +``` + +Available evaluator names match the `feedback_key` values: `"legal correctness"` and `"appropriate tone"`. 
Omit `--evaluator` to run all of them. + +### Example: interpreting the output + +Running the script against a 10-repetition experiment (50 total runs across 5 scenarios) with `-k 5` produced: + +``` +=== Per-Scenario Consistency === + +Evaluator: legal correctness + Scenario mean σ 0.0 0.5 1.0 + -------- ------ ------ ------------------- + S0 0.77 0.25 0 23 27 + S1 0.87 0.24 1 10 35 + S2 0.48 0.26 8 36 6 + S3 0.93 0.17 0 7 43 + S4 0.66 0.23 0 34 16 + +=== Evaluator Variance Summary === +(σ is computed per individual run across k re-evaluations of fixed output) + legal correctness: + mean σ = 0.108 (max = 0.400) +``` + +The consistency table pools every score for a scenario across all runs and all k re-evaluations, so its σ column contains both agent and evaluator noise and serves as σ_total below. A row can hold fewer than runs × k scores (S1 has 46 rather than 50), presumably because judge calls that fail are dropped. The evaluator summary gives the key number: **σ_evaluator = 0.108**. + +To decompose variance per scenario, use σ²_agent = σ²_total − σ²_evaluator, which assumes the two sources are independent and treats the mean per-run σ as σ_evaluator for every scenario: + +| Scenario | σ_total | σ_evaluator | σ_agent | Evaluator share | +|---|---|---|---|---| +| S0 | 0.25 | 0.108 | 0.22 | 19% | +| S1 | 0.24 | 0.108 | 0.21 | 20% | +| S2 | 0.26 | 0.108 | 0.24 | 17% | +| S3 | 0.17 | 0.108 | 0.13 | 40% | +| S4 | 0.23 | 0.108 | 0.20 | 22% | + +**What this tells you:** + +- The agent is the dominant noise source (roughly 80% of variance in four of the five scenarios). S3 is the exception: the chatbot is highly consistent there, so the judge itself accounts for 40% of variance. +- A `max = 0.400` evaluator σ is a red flag: with three score tiers, a per-run σ above 0.25 is only possible when a single fixed output received both 0.0 and 1.0 on different judge calls. This points to rubric ambiguity on borderline responses. In this data S2 (notice-giving scenario) was the likely culprit: its mean is the lowest (0.48) and it has the most 0.5 scores, which indicates the judge is uncertain. 
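The decomposition in the table above can be reproduced in a few lines. This is a minimal sketch and not part of the repository: `decompose_variance` is a hypothetical helper name, and it assumes the two noise sources are independent and that the mean per-run σ (0.108) is an adequate stand-in for σ_evaluator. Because it starts from the rounded σ values printed by the script, its results can differ from the table in the last digit.

```python
import math
from typing import Tuple


def decompose_variance(sigma_total: float, sigma_evaluator: float) -> Tuple[float, float]:
    """Return (sigma_agent, evaluator_share) from the two measured sigmas.

    Hypothetical helper: assumes
    sigma_total**2 == sigma_agent**2 + sigma_evaluator**2.
    """
    var_total = sigma_total**2
    var_eval = sigma_evaluator**2
    # Clamp at zero: sampling noise can push the evaluator estimate above the total.
    var_agent = max(var_total - var_eval, 0.0)
    share = var_eval / var_total if var_total > 0 else 0.0
    return math.sqrt(var_agent), share


# Rounded per-scenario sigma_total values copied from the consistency table.
sigma_total_by_scenario = {"S0": 0.25, "S1": 0.24, "S2": 0.26, "S3": 0.17, "S4": 0.23}

for name, sigma_total in sigma_total_by_scenario.items():
    sigma_agent, share = decompose_variance(sigma_total, 0.108)
    print(f"{name}: sigma_agent={sigma_agent:.2f} evaluator_share={share:.0%}")
```

The clamp matters in practice: when a scenario has very little agent variance, a noisy σ_evaluator estimate can exceed σ_total, and the subtraction would otherwise go negative.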
+ +### Decision rules + +| Observation | Diagnosis | Fix | +|---|---|---| +| σ_evaluator << σ_total (evaluator share < 15%) | Variance is mostly agent-side | Increase `--num-repetitions` in evaluation runs | +| σ_evaluator ≈ σ_total (evaluator share > 40%) | Judge stochasticity dominates | Use a stronger judge model; tighten the rubric for borderline cases | +| max σ >> mean σ | Some outputs are genuinely borderline | Review judge rationale; add rubric guidance for that failure mode | +| One scenario has much lower σ than the rest | That scenario is well-defined | Good: use it to calibrate expected score ranges | + +### Drilling into a specific scenario + +Once you identify a noisy scenario, use `--scenario` to focus on it: + +```bash +uv run python -m evaluate.measure_evaluator_variance \ + --experiment \ + --evaluator "legal correctness" \ + --scenario 2 +``` + +Multiple scenario IDs are accepted: `--scenario 2 4`. + +### Improving the rubric for borderline cases + +If max σ is high (> 0.3) for a scenario, the most likely cause is rubric ambiguity on borderline responses — the judge is uncertain what the 0.5 vs 1.0 threshold means for that type of question. + +Steps: +1. Run `--scenario -k 7` to collect enough data to see the pattern. +2. Look at the score distribution: lots of 0.5 scores with some 0.0 and 1.0 suggests the judge is genuinely unsure. Many swings between 0.0 and 1.0 on the same output suggests the rubric doesn't distinguish those cases. +3. Edit `evaluate/evaluators/legal_correctness.md` — add a concrete example of the borderline case and specify which tier it should fall in. +4. Re-run the script against the same experiment to confirm σ drops. + +The rubric lives in a plain markdown file — no code change needed. See [Editing evaluator rubrics](#editing-evaluator-rubrics) for the file locations. 
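Step 2 of the rubric workflow above can be sketched in code. The helper below is hypothetical (it is not part of `measure_evaluator_variance.py`) and assumes you have copied the k scores for each fixed output into plain lists holding only the three rubric tiers. The "at least half are 0.5" cut-off is an assumption chosen for illustration, and the score lists are placeholders rather than real experiment data.

```python
import statistics
from typing import List


def describe_run(scores: List[float]) -> str:
    """Label the judge scores collected for one fixed output.

    Hypothetical helper; assumes every score is exactly 0.0, 0.5, or 1.0.
    """
    if len(scores) < 2:
        return "too few scores"
    if len(set(scores)) == 1:
        return "stable"
    if 0.0 in scores and 1.0 in scores:
        # Assumed cut-off: call it "unsure" when at least half the scores are 0.5.
        if scores.count(0.5) * 2 >= len(scores):
            return "mostly 0.5 with outliers: judge is unsure"
        return "swings between 0.0 and 1.0: rubric does not separate these cases"
    return "adjacent tiers only"


# Placeholder score lists for illustration, not real experiment data.
example_runs = {
    "run A": [0.5, 0.5, 0.5, 0.0, 1.0, 0.5, 0.5],
    "run B": [0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    "run C": [1.0, 1.0, 0.5, 1.0, 1.0, 1.0, 1.0],
}
for name, scores in example_runs.items():
    sigma = statistics.pstdev(scores)
    print(f"{name}: sigma={sigma:.2f} -> {describe_run(scores)}")
```

Outputs labelled as swinging between 0.0 and 1.0 are the ones most worth turning into a concrete rubric example in step 3.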
+ +### Sample size guidance + +With σ_agent ≈ 0.20 (typical for this dataset), the 95% confidence interval on a scenario mean is approximately ± 1.96 × σ_agent / √n: + +| Repetitions per scenario | 95% CI half-width | +|---|---| +| 10 | ± 0.12 | +| 25 | ± 0.08 | +| 35 | ± 0.07 | +| 50 | ± 0.06 | + +**To detect a 0.10-point improvement with 80% power** at this σ_agent (two-sided α = 0.05), you need ~32 repetitions per scenario when the baseline mean is treated as known. Comparing two experiments that are both measured with noise needs roughly twice that, about 63 repetitions per scenario in each experiment. Ten repetitions (the default) is sufficient for spotting large regressions but too noisy to confirm incremental gains. + +--- + ## Viewing and comparing results Open https://smith.langchain.com/ → your dataset → **Experiments** tab. @@ -373,6 +498,83 @@ uv run python evaluate/langsmith_dataset.py experiment compare \ --- +## Claude-assisted analysis with `/analyze-experiment` + +The `/analyze-experiment` skill is a Claude Code command that automates the full experiment investigation workflow, from aggregate scores through root-cause diagnosis, in a single step. + +Invoke it from any Claude Code session in this repository: + +``` +/analyze-experiment +``` + +For example: + +``` +/analyze-experiment tfa-2026-04-13 +/analyze-experiment c663e09e-1234-... +``` + +### What it does + +Claude runs the following automatically and synthesizes the findings: + +1. **Overview**: `experiment show` and `experiment stats` in parallel to get aggregate scores and per-scenario consistency tables. Identifies scenarios with low means or high variance. + +2. **Exemplar selection**: For the worst-performing scenario, fetches all runs sorted worst-to-best and selects one 0.0 run and one 1.0 run from the same scenario. This controls for question difficulty: any difference between them is a signal about agent behaviour, not input variation. + +3. **Trace analysis**: Reads the full execution trace for both exemplars, including tool calls, retrieved passages, and the final model output. 
Looks for which of the following failure modes is present: + + | Failure mode | Signature | + |---|---| + | **Retrieval miss** | The correct statutory text is absent from retrieved passages in both runs; the 1.0 run succeeded on general knowledge rather than retrieval | + | **Query too broad** | The RAG query echoes the user's words verbatim; specific statute language is absent from results | + | **Reasoning failure** | The correct text was retrieved but the model ignored or misapplied it | + | **Instruction conflict** | Two system-prompt rules contradict each other; the model follows one and suppresses the other. Hard signature: empty `content` field with reasoning tokens but no tool calls | + | **Confabulation** | The model asserted specific numbers, dates, or rules not present in retrieved text or system prompt | + | **Misleading retrieval** | Retrieved passages are technically accurate but framed in a way that leads to a wrong inference | + +4. **Recommended fixes** — Matched to the failure mode: + + | Failure mode | Fix | + |---|---| + | Retrieval miss | Corpus update needed; names the missing statute | + | Query too broad | Better RAG query formulation or updated tool description | + | Reasoning failure | System-prompt guidance targeting that reasoning step | + | Instruction conflict | Identifies the conflicting lines; suggests a carve-out or clarified precedence | + | Confabulation | Tighter anti-hallucination instruction | + | Misleading retrieval | System-prompt trigger to search for the protective statute before drawing conclusions | + +### Output format + +Results are presented as: overview table → worst-scenario highlights → root-cause diagnosis → recommended fixes. + +### Comparing two experiments + +If you have a baseline experiment to compare against, include it: + +``` +/analyze-experiment +``` + +Claude will run `experiment compare` in parallel with the overview and highlight which scenarios improved, regressed, or remained unchanged. 
+ +### When to use this vs. the LangSmith UI + +The LangSmith UI is best for browsing individual scores and reading judge rationale for specific examples. The `/analyze-experiment` skill is best when you want a diagnosis: not just the numbers, but a hypothesis about *why* a scenario is failing and what to change. + +A typical workflow after running a new experiment: + +```mermaid +flowchart LR + A["Run evaluation<br/>run_langsmith_evaluation.py"] --> B["/analyze-experiment<br/>in Claude Code"] + B --> C["Root-cause diagnosis<br/>+ recommended fixes"] + C --> D["Edit system_prompt.md<br/>or evaluator rubric"] + D --> A +``` + +--- + ## Typical workflows ### "I want to check quality before a release" @@ -554,6 +756,27 @@ Each file describes what a good answer looks like and the scoring guidelines (1. To refine how the judge scores responses, edit the rubric file and commit. You can also experiment with rubric wording in the LangSmith UI by binding an LLM-as-judge evaluator to your dataset; when you find wording you like, copy it back into the `.md` file and commit so everyone shares the same criteria. +### Testing rubric changes without re-running the agent + +After editing a rubric, use `measure_evaluator_variance.py` to re-score an existing experiment's outputs with the new rubric, with no new agent calls needed: + +```bash +# Score every existing run once with the updated rubric +uv run python -m evaluate.measure_evaluator_variance \ + --experiment \ + --evaluator "legal correctness" \ + -k 1 + +# Focus on a specific scenario that was noisy +uv run python -m evaluate.measure_evaluator_variance \ + --experiment \ + --evaluator "legal correctness" \ + --scenario 2 \ + -k 5 +``` + +Use `-k 1` to get a quick read on how the score distribution shifts. Use `-k 5` or higher on a specific scenario to confirm that evaluator σ has dropped: a tighter rubric should produce a lower mean σ across re-evaluations of the same fixed output. + Heuristic evaluators (citation format, tool usage, performance) are Python code in `langsmith_evaluators.py` and require a developer to modify. 
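One way to read the `-k 1` re-scoring described above is to compare how many outputs land in each score tier before and after the rubric edit. The sketch below is illustrative only: `tier_shift` is a hypothetical helper, and the two score lists are made-up placeholders standing in for scores you would copy from the original experiment and from the re-scored output.

```python
from collections import Counter
from typing import Dict, List

# The three tiers defined by the rubric's scoring guidelines.
TIERS = (0.0, 0.5, 1.0)


def tier_shift(before: List[float], after: List[float]) -> Dict[float, int]:
    """Return the change in count for each tier (after minus before)."""
    before_counts = Counter(before)
    after_counts = Counter(after)
    return {tier: after_counts[tier] - before_counts[tier] for tier in TIERS}


# Placeholder scores for the same five fixed outputs, not real data.
scores_old_rubric = [1.0, 0.5, 0.5, 0.0, 1.0]
scores_new_rubric = [1.0, 0.5, 1.0, 0.0, 1.0]

for tier, delta in tier_shift(scores_old_rubric, scores_new_rubric).items():
    print(f"tier {tier}: {delta:+d}")
```

A shift out of the 0.5 tier after a rubric edit is only meaningful alongside the `-k 5` check, since a single re-score per output cannot separate a real rubric effect from judge noise.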
--- diff --git a/backend/evaluate/evaluators/legal_correctness.md b/backend/evaluate/evaluators/legal_correctness.md index 8e5547ef..5ebb36df 100644 --- a/backend/evaluate/evaluators/legal_correctness.md +++ b/backend/evaluate/evaluators/legal_correctness.md @@ -1,23 +1,35 @@ -A correct answer: -- Legal advice correctly based on Oregon tenant law -- Check if advice aligns with ORS 90 (Landlord-Tenant) -- Verify city-specific rules are correctly applied -- Ensure no false statements about tenant rights -- Provides accurate and complete information -- Contains no factual errors -- Addresses all parts of the question -- Is logically consistent -- Uses precise and accurate terminology +Evaluate whether the response gives legally accurate information under Oregon law +for the tenant's specific situation. -When scoring, you should penalize: -- Factual errors or inaccuracies -- Incomplete or partial answers -- Misleading or ambiguous statements -- Incorrect terminology -- Logical inconsistencies -- Missing key information +**What to focus on:** +- Is the core legal claim correct? (The direct answer to what the tenant asked) +- Are cited statutes and legal rules accurate? +- If city-specific law applies, is it applied correctly? +- Are procedural conditions material to the tenant acting on the advice included? 
+ (e.g., notice must be in writing, delivered by a specific method) -Scoring Guidelines: -- Score 1.0 if response is legally correct and complete -- Score 0.5 if response has minor inaccuracies or omissions -- Score 0.0 if response is legally incorrect or significantly incomplete +**Scoring:** + +Score 1.0 if: +- The core legal claim is correct and cites the relevant Oregon statute +- Procedural conditions the tenant must meet to exercise their rights are included +- No false or misleading statements about tenant rights + +Score 0.5 if: +- The core legal claim is correct but one or more procedural conditions are omitted + (e.g., states the notice period correctly but omits the written-notice or delivery + requirements) +- or the core legal claim is slightly imprecise but not in a way that would harm the + tenant (e.g., cites a related statute rather than the most precise one) +- The response would not lead the tenant to take a harmful or losing action + +Score 0.0 if: +- The core legal claim is wrong (e.g., misstates a deadline, says the tenant cannot + do something they legally can, overstates landlord rights) +- or a false statement is present that could cause the tenant direct harm +- or the response is so incomplete the tenant cannot act on it at all + +**Key distinction between 0.5 and 0.0:** +A wrong *fact* (incorrect law, wrong statute, misstated deadline) → 0.0. +A missing *procedural detail* that doesn't change the core answer → 0.5. +Both correct and complete → 1.0. diff --git a/backend/evaluate/langsmith_evaluators.py b/backend/evaluate/langsmith_evaluators.py index 51cb2314..40c5ad60 100644 --- a/backend/evaluate/langsmith_evaluators.py +++ b/backend/evaluate/langsmith_evaluators.py @@ -13,12 +13,20 @@ from textwrap import dedent from typing import Any, Dict, Final +from langchain_google_genai import ChatGoogleGenerativeAI from openevals import create_llm_as_judge from openevals.types import SimpleEvaluator # NOTE: can (should?) 
use different models for chatbot LLM & evaluator -# EVALUATOR_MODEL_NAME: Final = "gemini-2.5-pro" -EVALUATOR_MODEL_NAME: Final = "gemini-2.5-flash" +EVALUATOR_MODEL_NAME: Final = "gemini-3-flash-preview" + +# Use vertexai=True so the evaluator uses service account credentials rather +# than a Gemini API key. init_chat_model routes gemini-3-flash-preview to +# us-central1 by default, which returns 404 — constructing the judge directly +# avoids that routing issue. +_EVALUATOR_JUDGE: Final = ChatGoogleGenerativeAI( + model=EVALUATOR_MODEL_NAME, vertexai=True, location="global" +) EVALUATORS_DIR: Final = Path(__file__).parent / "evaluators" @@ -75,9 +83,14 @@ def load_rubric(name: str) -> str: ) +# NOTE: do not pass choices=[...] to create_llm_as_judge with Gemini models. +# Openevals serializes choices as a float enum in the structured output schema, +# but Gemini's protobuf layer requires enum values to be strings, not floats, +# and raises a ParseError at runtime. + # Evaluator: Citation Accuracy (LLM-as-Judge). citation_accuracy_evaluator: SimpleEvaluator = create_llm_as_judge( - model=EVALUATOR_MODEL_NAME, + judge=_EVALUATOR_JUDGE, prompt=load_rubric("citation_accuracy"), feedback_key="citation accuracy", continuous=True, @@ -85,7 +98,7 @@ def load_rubric(name: str) -> str: # Evaluator: Legal Correctness (LLM-as-Judge). legal_correctness_evaluator: SimpleEvaluator = create_llm_as_judge( - model=EVALUATOR_MODEL_NAME, + judge=_EVALUATOR_JUDGE, prompt=load_rubric("legal_correctness"), feedback_key="legal correctness", continuous=True, @@ -93,7 +106,7 @@ def load_rubric(name: str) -> str: # Evaluator: Tone & Professionalism (LLM-as-Judge). 
tone_evaluator: SimpleEvaluator = create_llm_as_judge( - model=EVALUATOR_MODEL_NAME, + judge=_EVALUATOR_JUDGE, prompt=load_rubric("tone"), feedback_key="appropriate tone", continuous=True, diff --git a/backend/evaluate/measure_evaluator_variance.py b/backend/evaluate/measure_evaluator_variance.py new file mode 100644 index 00000000..030ee402 --- /dev/null +++ b/backend/evaluate/measure_evaluator_variance.py @@ -0,0 +1,312 @@ +"""Measure LLM judge variance on fixed agent outputs. + +Re-runs evaluators k times on the same fixed inputs/outputs from an existing +LangSmith experiment to isolate evaluator stochasticity from agent stochasticity. + +Total observed variance decomposes as: + σ²_total = σ²_agent + σ²_evaluator + +Run this script against an existing experiment to measure σ_evaluator directly. +If σ_evaluator << σ_total, the noise is agent-side and more agent samples are +the right fix. If σ_evaluator is comparable to σ_total, fix the judge first. + +Usage: + uv run python -m evaluate.measure_evaluator_variance --experiment + uv run python -m evaluate.measure_evaluator_variance --experiment -k 10 + uv run python -m evaluate.measure_evaluator_variance --experiment --runs-per-scenario 3 +""" + +import argparse +import statistics +from collections import defaultdict +from typing import Any, Dict, List, Optional, Tuple + +from langsmith import Client + +from evaluate.langsmith_evaluators import ( + legal_correctness_evaluator, + tone_evaluator, +) +from evaluate.results_display import ScenarioResult, print_consistency_stats +from tenantfirstaid.constants import LANGSMITH_API_KEY + +# All available evaluators, keyed by their feedback_key. 
+_ALL_EVALUATORS = { + "legal correctness": legal_correctness_evaluator, + "appropriate tone": tone_evaluator, +} + + +def _fetch_runs_and_examples( + client: Client, + experiment_name: str, +) -> List[Tuple[Any, Any]]: + """Return (run, example) pairs for all root runs in an experiment.""" + runs = list( + client.list_runs( + project_name=experiment_name, + is_root=True, + ) + ) + + pairs = [] + for run in runs: + if run.reference_example_id is None: + continue + example = client.read_example(run.reference_example_id) + pairs.append((run, example)) + + return pairs + + +def _evaluate_once( + evaluator: Any, + inputs: Dict[str, Any], + outputs: Dict[str, Any], + reference_outputs: Dict[str, Any], +) -> Optional[float]: + """Call an evaluator once and return the float score, or None on failure.""" + try: + result = evaluator( + inputs=inputs, + outputs=outputs, + reference_outputs=reference_outputs, + ) + except Exception as exc: # noqa: BLE001 + print(f" [evaluator error: {exc}]") + return None + + if isinstance(result, dict): + score = result.get("score") + else: + score = getattr(result, "score", None) + + return float(score) if score is not None else None + + +def measure_evaluator_variance( + experiment_name: str, + k: int = 5, + runs_per_scenario: Optional[int] = None, + evaluator_names: Optional[List[str]] = None, + scenario_ids_filter: Optional[List[int]] = None, +) -> None: + """Fetch runs from an experiment, re-evaluate each k times, and report σ. + + Args: + experiment_name: LangSmith project/experiment name to pull runs from. + k: Number of times to re-run each evaluator on each fixed output. + runs_per_scenario: If set, limit how many runs per scenario are probed. + Useful when an experiment has many repetitions and you only need a + representative sample. + evaluator_names: Names of evaluators to run (keys from _ALL_EVALUATORS). + Defaults to all evaluators when None. 
+ scenario_ids_filter: If set, only probe scenarios whose scenario_id is + in this list. Useful for drilling into a single noisy scenario. + """ + if evaluator_names is not None: + unknown = set(evaluator_names) - set(_ALL_EVALUATORS) + if unknown: + raise ValueError( + f"Unknown evaluator(s): {unknown}. Available: {list(_ALL_EVALUATORS)}" + ) + evaluators = {name: _ALL_EVALUATORS[name] for name in evaluator_names} + else: + evaluators = _ALL_EVALUATORS + + client = Client(api_key=LANGSMITH_API_KEY) + + print(f"Fetching runs from experiment: {experiment_name}") + pairs = _fetch_runs_and_examples(client, experiment_name) + + if not pairs: + print("No runs found. Check the experiment name.") + return + + # Group runs by scenario (example_id) and pull scenario metadata. + runs_by_example: Dict[str, List[Any]] = defaultdict(list) + examples_by_id: Dict[str, Any] = {} + for run, example in pairs: + eid = str(example.id) + runs_by_example[eid].append(run) + examples_by_id[eid] = example + + scenario_ids: Dict[str, int] = {} + queries: Dict[str, str] = {} + for eid, example in examples_by_id.items(): + scenario_ids[eid] = (example.metadata or {}).get("scenario_id", 0) + queries[eid] = (example.inputs or {}).get("query", "") + + if scenario_ids_filter is not None: + filter_set = set(scenario_ids_filter) + runs_by_example = { + eid: runs + for eid, runs in runs_by_example.items() + if scenario_ids.get(eid) in filter_set + } + if not runs_by_example: + print(f"No runs found for scenario_id(s) {scenario_ids_filter}.") + return + + total_runs = sum( + min(len(r), runs_per_scenario) if runs_per_scenario else len(r) + for r in runs_by_example.values() + ) + total_evals = total_runs * k * len(evaluators) + print( + f"Found {len(pairs)} runs across {len(runs_by_example)} scenarios. " + f"Will make {total_evals} evaluator calls ({total_runs} runs × {k} repeats × {len(evaluators)} evaluators)." + ) + + # Re-evaluate each run k times and collect per-scenario scores. 
+ scenarios: List[ScenarioResult] = [] + + for eid in sorted(runs_by_example, key=lambda e: scenario_ids.get(e, 0)): + runs = runs_by_example[eid] + example = examples_by_id[eid] + sid = scenario_ids.get(eid, 0) + query = queries.get(eid, "") + label = f'"{query[:68]}{"..." if len(query) > 68 else ""}"' + + if runs_per_scenario is not None: + runs = runs[:runs_per_scenario] + + # per_run_scores[eval_name][run_idx] = [score_1, ..., score_k] + per_run_scores: Dict[str, List[List[float]]] = {name: [] for name in evaluators} + + ref_outputs = example.outputs or {} + + for run_idx, run in enumerate(runs): + run_inputs = run.inputs or {} + run_outputs = run.outputs or {} + print(f" S{sid} run {run_idx + 1}/{len(runs)}", end="", flush=True) + + for eval_name, evaluator in evaluators.items(): + scores_for_run = [] + for _ in range(k): + score = _evaluate_once( + evaluator, run_inputs, run_outputs, ref_outputs + ) + if score is not None: + scores_for_run.append(score) + print(".", end="", flush=True) + per_run_scores[eval_name].append(scores_for_run) + + print() # newline after dots + + # For the consistency table, flatten all k scores per evaluator across + # all probed runs into a single list per scenario. This shows the + # combined evaluator spread across the sampled outputs. + flat_scores: Dict[str, List[float]] = { + name: [s for run_scores in per_run_scores[name] for s in run_scores] + for name in evaluators + } + scenarios.append( + ScenarioResult(label=label, scenario_id=sid, scores=flat_scores) + ) + + # Per-run σ breakdown for this scenario. + print(f" Per-run evaluator σ for S{sid}:") + for eval_name in evaluators: + run_sigmas = [ + statistics.pstdev(run_scores) + for run_scores in per_run_scores[eval_name] + if len(run_scores) >= 2 + ] + if run_sigmas: + mean_sigma = statistics.mean(run_sigmas) + print( + f" {eval_name}: mean σ = {mean_sigma:.3f} (per-run: {[f'{s:.2f}' for s in run_sigmas]})" + ) + + # Reuse the existing consistency display. 
+ print_consistency_stats(scenarios) + + # Summary: mean evaluator σ across all scenarios and runs. + print("\n=== Evaluator Variance Summary ===") + print("(σ is computed per individual run across k re-evaluations of fixed output)") + for eval_name in evaluators: + all_run_sigmas = [] + for scenario in scenarios: + per_run = _per_run_sigmas_from_scenario(scenario, eval_name, k) + all_run_sigmas.extend(per_run) + if all_run_sigmas: + mean_sigma = statistics.mean(all_run_sigmas) + max_sigma = max(all_run_sigmas) + print(f" {eval_name}:") + print(f" mean σ = {mean_sigma:.3f} (max = {max_sigma:.3f})") + print() + print( + "Compare these σ values against the per-scenario σ in your experiment results.\n" + "If evaluator σ << experiment σ, variance is agent-side → increase --num-repetitions.\n" + "If evaluator σ ≈ experiment σ, judge stochasticity dominates → improve the judge." + ) + + +def _per_run_sigmas_from_scenario( + scenario: ScenarioResult, eval_name: str, k: int +) -> List[float]: + """Re-derive per-run σ values from a flattened score list. + + The flat list in ScenarioResult stores the scores for each run contiguously, + in run order, so per-run groups can be recovered by chunking by k. + + Caveat: this assumes every run contributed exactly k scores. Failed + evaluator calls are dropped before flattening, so a failure in any run + shifts the later chunks across run boundaries. + """ + flat = scenario.scores.get(eval_name, []) + # Chunk into groups of k. The last group is smaller if any scores failed, + # and earlier groups may then mix scores from two adjacent runs. 
+ sigmas = [] + for i in range(0, len(flat), k): + chunk = flat[i : i + k] + if len(chunk) >= 2: + sigmas.append(statistics.pstdev(chunk)) + return sigmas + + +if __name__ == "__main__": + parser = argparse.ArgumentParser( + description="Measure LLM judge variance on fixed agent outputs", + formatter_class=argparse.ArgumentDefaultsHelpFormatter, + ) + parser.add_argument( + "--experiment", + required=True, + help="LangSmith experiment name to pull runs from", + ) + parser.add_argument( + "-k", + type=int, + default=5, + help="Number of times to re-run each evaluator per fixed output", + ) + parser.add_argument( + "--runs-per-scenario", + type=int, + default=None, + help="Limit how many runs per scenario are probed (default: all)", + ) + parser.add_argument( + "--evaluator", + dest="evaluators", + nargs="+", + default=None, + metavar="NAME", + help=f"Evaluator(s) to run. Available: {list(_ALL_EVALUATORS)}. Defaults to all.", + ) + parser.add_argument( + "--scenario", + dest="scenarios", + nargs="+", + type=int, + default=None, + metavar="ID", + help="Scenario ID(s) to probe (e.g. --scenario 2). 
Defaults to all.", + ) + + args = parser.parse_args() + + measure_evaluator_variance( + experiment_name=args.experiment, + k=args.k, + runs_per_scenario=args.runs_per_scenario, + evaluator_names=args.evaluators, + scenario_ids_filter=args.scenarios, + ) diff --git a/backend/tests/test_measure_evaluator_variance.py b/backend/tests/test_measure_evaluator_variance.py new file mode 100644 index 00000000..0e342656 --- /dev/null +++ b/backend/tests/test_measure_evaluator_variance.py @@ -0,0 +1,259 @@ +"""Tests for evaluate/measure_evaluator_variance.py.""" + +import statistics +from typing import Any, Dict, Optional +from unittest.mock import MagicMock, patch + +import pytest + +from evaluate.measure_evaluator_variance import ( + _ALL_EVALUATORS, + _evaluate_once, + _per_run_sigmas_from_scenario, + measure_evaluator_variance, +) +from evaluate.results_display import ScenarioResult + +# ── _evaluate_once ───────────────────────────────────────────────────────────── + + +def _make_evaluator(score: Optional[float]) -> Any: + """Return a callable evaluator that always returns the given score.""" + if score is None: + return MagicMock(return_value={"score": None}) + return MagicMock(return_value={"score": score}) + + +def test_evaluate_once_returns_float_from_dict(): + evaluator = _make_evaluator(0.75) + result = _evaluate_once(evaluator, {"q": "a"}, {"output": "b"}, {"ref": "c"}) + assert result == pytest.approx(0.75) + + +def test_evaluate_once_passes_kwargs_correctly(): + evaluator = MagicMock(return_value={"score": 1.0}) + inputs = {"query": "test"} + outputs = {"output": "answer"} + reference = {"reference": "gold"} + _evaluate_once(evaluator, inputs, outputs, reference) + evaluator.assert_called_once_with( + inputs=inputs, outputs=outputs, reference_outputs=reference + ) + + +def test_evaluate_once_handles_attribute_result(): + """Result objects with a .score attribute (not dict) are supported.""" + result_obj = MagicMock() + result_obj.score = 0.5 + evaluator = 
MagicMock(return_value=result_obj) + # Make isinstance(result, dict) return False. + result_obj.__class__ = type("FakeResult", (), {}) + score = _evaluate_once(evaluator, {}, {}, {}) + assert score == pytest.approx(0.5) + + +def test_evaluate_once_returns_none_when_score_is_none(): + evaluator = _make_evaluator(None) + result = _evaluate_once(evaluator, {}, {}, {}) + assert result is None + + +def test_evaluate_once_returns_none_on_exception(capsys): + evaluator = MagicMock(side_effect=RuntimeError("boom")) + result = _evaluate_once(evaluator, {}, {}, {}) + assert result is None + assert "evaluator error" in capsys.readouterr().out + + +# ── _per_run_sigmas_from_scenario ────────────────────────────────────────────── + + +def test_per_run_sigmas_uniform_scores(): + """k uniform scores per run → σ = 0.0 for each run.""" + scenario = ScenarioResult( + label='"q"', + scores={"legal correctness": [1.0, 1.0, 1.0, 0.5, 0.5, 0.5]}, + ) + sigmas = _per_run_sigmas_from_scenario(scenario, "legal correctness", k=3) + assert len(sigmas) == 2 + assert all(s == pytest.approx(0.0) for s in sigmas) + + +def test_per_run_sigmas_mixed_scores(): + """Scores that vary within a run produce non-zero σ.""" + scenario = ScenarioResult( + label='"q"', + scores={"legal correctness": [1.0, 0.0, 1.0, 0.0]}, + ) + sigmas = _per_run_sigmas_from_scenario(scenario, "legal correctness", k=2) + assert len(sigmas) == 2 + expected = statistics.pstdev([1.0, 0.0]) + assert all(s == pytest.approx(expected) for s in sigmas) + + +def test_per_run_sigmas_missing_evaluator_returns_empty(): + scenario = ScenarioResult(label='"q"', scores={}) + sigmas = _per_run_sigmas_from_scenario(scenario, "nonexistent", k=3) + assert sigmas == [] + + +def test_per_run_sigmas_single_score_per_run_excluded(): + """Chunks smaller than 2 are skipped (can't compute σ).""" + scenario = ScenarioResult( + label='"q"', + scores={"legal correctness": [1.0]}, + ) + sigmas = _per_run_sigmas_from_scenario(scenario, "legal 
correctness", k=1) + assert sigmas == [] + + +def test_per_run_sigmas_partial_last_chunk_included(): + """If total scores are not divisible by k, a short last chunk still counts when it has at least 2 scores.""" + # 5 scores with k=3 → chunk [0:3] (size 3, included) + [3:5] (size 2, included) + scenario = ScenarioResult( + label='"q"', + scores={"legal correctness": [1.0, 0.0, 1.0, 0.5, 0.5]}, + ) + sigmas = _per_run_sigmas_from_scenario(scenario, "legal correctness", k=3) + # First chunk [1.0, 0.0, 1.0], second chunk [0.5, 0.5] (size 2, >= 2 so included) + assert len(sigmas) == 2 + + +# ── _ALL_EVALUATORS contents ─────────────────────────────────────────────────── + + +def test_all_evaluators_has_legal_correctness(): + assert "legal correctness" in _ALL_EVALUATORS + + +def test_all_evaluators_has_tone(): + assert "appropriate tone" in _ALL_EVALUATORS + + +# ── measure_evaluator_variance (unit, no network) ───────────────────────────── + + +def _fake_run(example_id: str, inputs: Dict, outputs: Dict) -> MagicMock: + run = MagicMock() + run.reference_example_id = example_id + run.inputs = inputs + run.outputs = outputs + return run + + +def _fake_example(example_id: str, scenario_id: int, query: str) -> MagicMock: + example = MagicMock() + example.id = example_id + example.metadata = {"scenario_id": scenario_id} + example.inputs = {"query": query} + example.outputs = {"reference": "gold answer"} + return example + + +@pytest.fixture() +def fake_pairs(): + """Two runs for scenario 1 and one run for scenario 2.""" + e1 = _fake_example("aaa", 1, "Can I withhold rent?") + e2 = _fake_example("bbb", 2, "How much notice must I give?") + run1a = _fake_run("aaa", {"query": "Can I withhold rent?"}, {"output": "Yes"}) + run1b = _fake_run("aaa", {"query": "Can I withhold rent?"}, {"output": "Maybe"}) + run2a = _fake_run("bbb", {"query": "How much notice?"}, {"output": "30 days"}) + return [(run1a, e1), (run1b, e1), (run2a, e2)] + + +def test_measure_evaluator_variance_calls_evaluator_k_times(fake_pairs): + 
"""With k=3 and a single evaluator over 3 runs, expect 9 evaluator calls total.""" + mock_evaluator = MagicMock(return_value={"score": 1.0}) + with ( + patch( + "evaluate.measure_evaluator_variance._fetch_runs_and_examples", + return_value=fake_pairs, + ), + patch( + "evaluate.measure_evaluator_variance._ALL_EVALUATORS", + {"legal correctness": mock_evaluator}, + ), + patch("evaluate.measure_evaluator_variance.Client"), + patch("evaluate.measure_evaluator_variance.print_consistency_stats"), + ): + measure_evaluator_variance("fake-experiment", k=3) + + # 3 runs × 3 repeats = 9 calls. + assert mock_evaluator.call_count == 9 + + +def test_measure_evaluator_variance_unknown_evaluator_raises(): + with pytest.raises(ValueError, match="Unknown evaluator"): + measure_evaluator_variance( + "fake-experiment", evaluator_names=["nonexistent evaluator"] + ) + + +def test_measure_evaluator_variance_scenario_filter(fake_pairs): + """scenario_ids_filter limits evaluation to matching scenarios only.""" + mock_evaluator = MagicMock(return_value={"score": 0.5}) + with ( + patch( + "evaluate.measure_evaluator_variance._fetch_runs_and_examples", + return_value=fake_pairs, + ), + patch( + "evaluate.measure_evaluator_variance._ALL_EVALUATORS", + {"legal correctness": mock_evaluator}, + ), + patch("evaluate.measure_evaluator_variance.Client"), + patch("evaluate.measure_evaluator_variance.print_consistency_stats"), + ): + # Only scenario 1 (2 runs), k=2 → 4 calls. 
+ measure_evaluator_variance("fake-experiment", k=2, scenario_ids_filter=[1]) + + assert mock_evaluator.call_count == 4 + + +def test_measure_evaluator_variance_runs_per_scenario_limit(fake_pairs): + """runs_per_scenario=1 caps the number of runs processed per scenario.""" + mock_evaluator = MagicMock(return_value={"score": 1.0}) + with ( + patch( + "evaluate.measure_evaluator_variance._fetch_runs_and_examples", + return_value=fake_pairs, + ), + patch( + "evaluate.measure_evaluator_variance._ALL_EVALUATORS", + {"legal correctness": mock_evaluator}, + ), + patch("evaluate.measure_evaluator_variance.Client"), + patch("evaluate.measure_evaluator_variance.print_consistency_stats"), + ): + # 2 scenarios × 1 run × k=2 = 4 calls. + measure_evaluator_variance("fake-experiment", k=2, runs_per_scenario=1) + + assert mock_evaluator.call_count == 4 + + +def test_measure_evaluator_variance_no_runs_prints_message(capsys): + with ( + patch( + "evaluate.measure_evaluator_variance._fetch_runs_and_examples", + return_value=[], + ), + patch("evaluate.measure_evaluator_variance.Client"), + ): + measure_evaluator_variance("fake-experiment") + + assert "No runs found" in capsys.readouterr().out + + +def test_measure_evaluator_variance_no_matching_scenario_prints_message( + fake_pairs, capsys +): + with ( + patch( + "evaluate.measure_evaluator_variance._fetch_runs_and_examples", + return_value=fake_pairs, + ), + patch("evaluate.measure_evaluator_variance.Client"), + ): + measure_evaluator_variance("fake-experiment", scenario_ids_filter=[99]) + + assert "No runs found" in capsys.readouterr().out