74 lines (45 loc) · 1.54 KB

SRE Agent Evaluation

This directory contains evaluation suites for the SRE agent.

Scope

Evaluations use intentionally flawed service snippets from:

sre-agent-eval

Evaluations are implemented with Opik.

Structure

common: shared helpers used across suites.
diagnosis_quality: evaluates diagnosis correctness and fix quality.
tool_call: evaluates tool selection and tool call order.

Current suites

The available suites are:

tool_call
diagnosis_quality

tool_call validates:

required tool usage
expected tool order
optional GitHub usage expectations per case

It uses:

real GitHub MCP calls
mocked Slack and CloudWatch calls
Opik tool spans (task_span) for scoring

diagnosis_quality validates:

root cause correctness
fix quality and actionability
affected services match

It uses:

real GitHub MCP calls
mocked Slack and CloudWatch calls
output-field scoring metrics

Run

If you are running Opik locally, start the Opik platform first:

# Clone the Opik repository
git clone https://github.com/comet-ml/opik.git

# Navigate to the repository
cd opik

# Start the Opik platform
./opik.sh

See comet-ml/opik for details.

When the server is running, open http://localhost:5173/ to view datasets and experiments.

For suite-specific details, see:

src/sre_agent/eval/tool_call/README.md
src/sre_agent/eval/diagnosis_quality/README.md