This directory contains evaluation suites for the SRE agent.
Evaluations use intentionally flawed service snippets from:
Evaluations are implemented with Opik.
common: shared helpers used across suites.diagnosis_quality: evaluates diagnosis correctness and fix quality.tool_call: evaluates tool selection and tool call order.
The available suites are:
tool_calldiagnosis_quality
tool_call validates:
- required tool usage
- expected tool order
- optional GitHub usage expectations per case
It uses:
- real GitHub MCP calls
- mocked Slack and CloudWatch calls
- Opik tool spans (
task_span) for scoring
diagnosis_quality validates:
- root cause correctness
- fix quality and actionability
- affected services match
It uses:
- real GitHub MCP calls
- mocked Slack and CloudWatch calls
- output-field scoring metrics
If you are running Opik locally, start the Opik platform first:
# Clone the Opik repository
git clone https://github.com/comet-ml/opik.git
# Navigate to the repository
cd opik
# Start the Opik platform
./opik.shSee comet-ml/opik for details.
When the server is running, open http://localhost:5173/ to view datasets and experiments.
For suite-specific details, see:
src/sre_agent/eval/tool_call/README.mdsrc/sre_agent/eval/diagnosis_quality/README.md