ITSM verifier bug: applies_to_priority column referenced in incident_sla table #8

@satish860

Summary

While running the ITSM domain tasks (oracle config), we found a verifier SQL query that references a column (applies_to_priority) in the incident_sla table, but that column exists only in the sla_definition table. As a result, the verifier always fails with a SQLite error, regardless of agent behavior.

Affected Task

  • Task ID: task_20251212_172511_458_e6427839_47076c15 (ITSM task index 1 in oracle split)
  • Verifier name: "Verify if the correct SLA is linked to the incident."

The broken query

SELECT COUNT(*) FROM incident_sla 
WHERE incident_id = 'INC_003' AND sla_def_id = 'SLA_002' 
AND applies_to_priority = 'high';

Error returned:

(sqlite3.OperationalError) no such column: applies_to_priority

Expected behavior

The applies_to_priority column exists in sla_definition, not incident_sla. The query should either:

  1. Join to sla_definition:
SELECT COUNT(*) FROM incident_sla 
JOIN sla_definition ON incident_sla.sla_def_id = sla_definition.sla_def_id
WHERE incident_sla.incident_id = 'INC_003' 
AND incident_sla.sla_def_id = 'SLA_002' 
AND sla_definition.applies_to_priority = 'high';
  2. Or simply check without the priority column (since SLA_002 already implies the priority):
SELECT COUNT(*) FROM incident_sla 
WHERE incident_id = 'INC_003' AND sla_def_id = 'SLA_002';
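
For anyone who wants to reproduce this without the MCP server, here is a minimal sketch using Python's built-in sqlite3 module. The schema is an assumption: it contains only the columns named in this issue, and the real tables almost certainly have more. It runs the broken query and both proposed replacements against one matching row.

```python
import sqlite3

# Hypothetical minimal schema: only the columns mentioned in this issue.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sla_definition (sla_def_id TEXT PRIMARY KEY, applies_to_priority TEXT);
CREATE TABLE incident_sla (incident_id TEXT, sla_def_id TEXT);
INSERT INTO sla_definition VALUES ('SLA_002', 'high');
INSERT INTO incident_sla VALUES ('INC_003', 'SLA_002');
""")

# The verifier query as reported: filters on a column incident_sla does not have.
BROKEN = """SELECT COUNT(*) FROM incident_sla
WHERE incident_id = 'INC_003' AND sla_def_id = 'SLA_002'
AND applies_to_priority = 'high'"""

# Option 1: read the priority from sla_definition via a join.
JOINED = """SELECT COUNT(*) FROM incident_sla
JOIN sla_definition ON incident_sla.sla_def_id = sla_definition.sla_def_id
WHERE incident_sla.incident_id = 'INC_003'
AND incident_sla.sla_def_id = 'SLA_002'
AND sla_definition.applies_to_priority = 'high'"""

# Option 2: drop the priority filter entirely.
SIMPLE = """SELECT COUNT(*) FROM incident_sla
WHERE incident_id = 'INC_003' AND sla_def_id = 'SLA_002'"""


def run(query):
    """Return the COUNT(*) value, or the error text if the query fails."""
    try:
        return conn.execute(query).fetchone()[0]
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"


broken_result = run(BROKEN)
joined_result = run(JOINED)
simple_result = run(SIMPLE)

print("broken:", broken_result)
print("joined:", joined_result)
print("simple:", simple_result)
```

With this toy data the broken query reports the missing column, while both replacements count the single linked row. Whether option 2 is safe on the real dataset depends on SLA_002 actually being the high-priority definition there, which this sketch only assumes.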

Additional note: duplicate verifier names and Python dict behavior

This task has 8 verifiers but only 2 unique names. When stored in a Python dict (as in executor.py line 474: verification_results[verifier_name] = result), duplicate names overwrite each other and only the last one survives.

For this task, the earlier duplicate of "Verify if the correct SLA is linked to the incident." uses the correct query (without applies_to_priority), but it gets overwritten by the last duplicate, which has the broken query. This means:

  • In practice, this verifier fails for every model on the leaderboard
  • The dict-based dedup appears unintentional -- is the intended behavior to check all verifiers, or to deduplicate by name?
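
To illustrate the overwrite, here is a small sketch. The verifier list and pass/fail values are made up for illustration, and the second verifier name is hypothetical; only the dict assignment mirrors the line quoted from executor.py. The list-based alternative is one possible way to keep every result, not necessarily what the maintainers intend.

```python
SLA_NAME = "Verify if the correct SLA is linked to the incident."
OTHER_NAME = "Some other verifier name"  # hypothetical placeholder

# Hypothetical run order: (verifier_name, passed).
verifier_runs = [
    (SLA_NAME, True),    # earlier duplicate, correct query
    (OTHER_NAME, True),
    (SLA_NAME, False),   # last duplicate, broken query
]

# Pattern from the issue: results keyed by name, so later entries overwrite earlier ones.
verification_results = {}
for verifier_name, result in verifier_runs:
    verification_results[verifier_name] = result

# One possible alternative: keep every result, keyed by position instead of name.
all_results = [
    {"index": i, "name": name, "passed": passed}
    for i, (name, passed) in enumerate(verifier_runs)
]

print(len(verifier_runs), "verifiers run,", len(verification_results), "results kept in dict")
print("SLA verifier outcome in dict:", verification_results[SLA_NAME])
print("results kept in list:", len(all_results))
```

In this toy run the dict holds two entries for three verifiers, and the surviving SLA entry is the failing one, matching the behavior described above.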

We'd appreciate clarification on both issues so we can align our scoring correctly.

Environment

  • Dataset: ServiceNow-AI/EnterpriseOps-Gym (HuggingFace, oracle config, itsm split)
  • MCP Server: shivakrishnareddyma225/enterpriseops-gym-mcp-itsm:latest
