SRE Agent Evaluation

This directory contains evaluation suites for the SRE agent.

Scope

Evaluations use intentionally flawed service snippets from:

Evaluations are implemented with Opik.

Structure

  • common: shared helpers used across suites.
  • diagnosis_quality: evaluates diagnosis correctness and fix quality.
  • tool_call: evaluates tool selection and tool call order.

Current suites

The available suites are:

  • tool_call
  • diagnosis_quality

tool_call validates:

  • required tool usage
  • expected tool order
  • optional GitHub usage expectations per case

It uses:

  • real GitHub MCP calls
  • mocked Slack and CloudWatch calls
  • Opik tool spans (task_span) for scoring
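
The required-usage and ordering checks above can be sketched in plain Python. This is a minimal illustration, not the suite's actual scoring code: the function names and the tool names in the usage note are hypothetical, and it assumes the called tool names have already been extracted from the tool spans into a list.

```python
from typing import Iterable, Sequence


def required_tools_used(called: Sequence[str], required: Iterable[str]) -> bool:
    """Return True if every required tool appears at least once in the calls."""
    return set(required).issubset(called)


def tools_in_expected_order(called: Sequence[str], expected_order: Sequence[str]) -> bool:
    """Return True if expected_order is a subsequence of called.

    Other tool calls may appear in between; only relative order matters.
    """
    remaining = iter(called)
    # `tool in remaining` advances the iterator until it finds a match,
    # so each later tool must occur after the previous one.
    return all(tool in remaining for tool in expected_order)


def tool_call_score(
    called: Sequence[str],
    required: Iterable[str],
    expected_order: Sequence[str],
) -> float:
    """Score 1.0 only when both the usage and the ordering checks pass."""
    passed = required_tools_used(called, required) and tools_in_expected_order(
        called, expected_order
    )
    return 1.0 if passed else 0.0
```

For example, with the hypothetical call list `["search_logs", "search_code", "post_message"]`, an expected order of `["search_logs", "post_message"]` passes, while `["post_message", "search_logs"]` fails. Treating order as a subsequence rather than an exact match is a design assumption that tolerates extra calls between the expected ones.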

diagnosis_quality validates:

  • root cause correctness
  • fix quality and actionability
  • match between reported and expected affected services

It uses:

  • real GitHub MCP calls
  • mocked Slack and CloudWatch calls
  • output-field scoring metrics
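
As one illustration of output-field scoring, the affected-services check could be computed as a set overlap. The sketch below is an assumption about how such a metric might look, not the suite's implementation: the function name is hypothetical, and the choice of Jaccard similarity with case-insensitive matching is illustrative.

```python
from typing import Iterable


def affected_services_score(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """Jaccard similarity between predicted and expected service names.

    Names are compared case-insensitively with surrounding whitespace removed.
    Returns a value in [0.0, 1.0]; 1.0 means the two sets are identical.
    """
    predicted_set = {name.strip().lower() for name in predicted}
    expected_set = {name.strip().lower() for name in expected}
    if not predicted_set and not expected_set:
        # Nothing expected and nothing reported counts as a full match.
        return 1.0
    overlap = predicted_set & expected_set
    union = predicted_set | expected_set
    return len(overlap) / len(union)
```

A partial-credit score like this penalises both missed services and spurious ones. A stricter suite could instead require exact set equality.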

Run

If you are running Opik locally, start the Opik platform first:

# Clone the Opik repository
git clone https://github.com/comet-ml/opik.git

# Navigate to the repository
cd opik

# Start the Opik platform
./opik.sh

See comet-ml/opik for details.

When the server is running, open http://localhost:5173/ to view datasets and experiments.

For suite-specific details, see:

  • src/sre_agent/eval/tool_call/README.md
  • src/sre_agent/eval/diagnosis_quality/README.md