Summary
While running the ITSM domain tasks (oracle config), we found a verifier SQL query that references a column (applies_to_priority) in the incident_sla table, but that column only exists in the sla_definition table. This causes the verifier to always fail with a SQLite error regardless of agent behavior.
Affected Task
- Task ID: task_20251212_172511_458_e6427839_47076c15 (ITSM task index 1 in oracle split)
- Verifier name: "Verify if the correct SLA is linked to the incident."
The broken query
SELECT COUNT(*) FROM incident_sla
WHERE incident_id = 'INC_003' AND sla_def_id = 'SLA_002'
AND applies_to_priority = 'high';
Error returned:
(sqlite3.OperationalError) no such column: applies_to_priority
Expected behavior
The applies_to_priority column exists in sla_definition, not incident_sla. The query should either:
- Join to sla_definition:
SELECT COUNT(*) FROM incident_sla
JOIN sla_definition ON incident_sla.sla_def_id = sla_definition.sla_def_id
WHERE incident_sla.incident_id = 'INC_003'
AND incident_sla.sla_def_id = 'SLA_002'
AND sla_definition.applies_to_priority = 'high';
- Or simply check without the priority column (since SLA_002 already implies the priority):
SELECT COUNT(*) FROM incident_sla
WHERE incident_id = 'INC_003' AND sla_def_id = 'SLA_002';
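For anyone who wants to reproduce this without the full environment, the following is a minimal sketch using Python's built-in sqlite3 module. The schema is an assumption: it contains only the columns named in this report, and the two inserted rows are made-up sample data, so the real tables will differ. It shows the original query raising the error while both proposed alternatives run.

```python
import sqlite3

# Assumed minimal schema: only the columns mentioned in this report.
# The real ITSM tables almost certainly have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sla_definition (sla_def_id TEXT PRIMARY KEY, applies_to_priority TEXT);
CREATE TABLE incident_sla (incident_id TEXT, sla_def_id TEXT);
-- Made-up sample rows, for illustration only.
INSERT INTO sla_definition VALUES ('SLA_002', 'high');
INSERT INTO incident_sla VALUES ('INC_003', 'SLA_002');
""")

broken = """
SELECT COUNT(*) FROM incident_sla
WHERE incident_id = 'INC_003' AND sla_def_id = 'SLA_002'
AND applies_to_priority = 'high'
"""

joined = """
SELECT COUNT(*) FROM incident_sla
JOIN sla_definition ON incident_sla.sla_def_id = sla_definition.sla_def_id
WHERE incident_sla.incident_id = 'INC_003'
AND incident_sla.sla_def_id = 'SLA_002'
AND sla_definition.applies_to_priority = 'high'
"""

simple = """
SELECT COUNT(*) FROM incident_sla
WHERE incident_id = 'INC_003' AND sla_def_id = 'SLA_002'
"""

# The original verifier query fails before returning any rows.
try:
    conn.execute(broken)
    broken_error = None
except sqlite3.OperationalError as exc:
    broken_error = str(exc)

# Both proposed alternatives execute and count the linked SLA row.
joined_count = conn.execute(joined).fetchone()[0]
simple_count = conn.execute(simple).fetchone()[0]

print("broken query error:", broken_error)
print("joined count:", joined_count)
print("simple count:", simple_count)
```

Because the failure happens when SQLite compiles the statement, it does not depend on the table contents, which is consistent with the verifier failing regardless of agent behavior.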
Additional note: duplicate verifier names and Python dict behavior
This task has 8 verifiers but only 2 unique names. When stored in a Python dict (as in executor.py line 474: verification_results[verifier_name] = result), duplicate names overwrite each other and only the last one survives.
For this task, the earlier duplicate of "Verify if the correct SLA is linked to the incident." uses the correct query (without applies_to_priority), but it gets overwritten by the last duplicate which has the broken query. This means:
- In practice, this verifier is always failing for every model on the leaderboard
- The dict-based deduplication appears unintentional. Is the intended behavior to check all verifiers, or to deduplicate by name?
We'd appreciate clarification on both issues so we can align our scoring correctly.
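To illustrate the overwrite behavior described above, here is a small sketch. It does not reproduce executor.py: the variable names and the pass/fail values are hypothetical, and only the assignment pattern (keying results by verifier name) mirrors the line quoted earlier. The second half shows one possible alternative, keeping results in a list, in case checking all verifiers is the intended behavior.

```python
# Hypothetical verifier outcomes as (name, passed) pairs in execution order.
# The values are illustrative, not taken from a real run.
NAME = "Verify if the correct SLA is linked to the incident."
outcomes = [
    (NAME, True),   # earlier duplicate, query without applies_to_priority
    (NAME, False),  # last duplicate, query that raises the SQLite error
]

# Keying by name: each assignment replaces the previous entry for that name,
# so only the last duplicate's result is kept.
verification_results = {}
for verifier_name, result in outcomes:
    verification_results[verifier_name] = result

# One possible alternative (an assumption, not the project's design):
# keep every result, identified by its position in the verifier list.
all_results = [
    {"index": i, "name": name, "passed": passed}
    for i, (name, passed) in enumerate(outcomes)
]

print(len(verification_results), verification_results[NAME])
print(len(all_results))
```

With a list, a scorer could still aggregate by name afterwards, but the choice of aggregation (all must pass, any may pass, last wins) would be explicit rather than a side effect of dict assignment.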
Environment
- Dataset: ServiceNow-AI/EnterpriseOps-Gym (HuggingFace, oracle config, itsm split)
- MCP Server: shivakrishnareddyma225/enterpriseops-gym-mcp-itsm:latest