Add server-side Slurm-job correlation endpoint #365

Closed
daniel-thom wants to merge 1 commit into main from server-slurm-job-correlations

@daniel-thom
Collaborator

Summary

Tier 2 follow-up to the torc status server-side migration (#363). torc slurm diagnose-logs correlates Slurm job IDs to the Torc jobs they ran. It did this in build_slurm_to_jobs_map by fetching four full lists — scheduled compute nodes, compute nodes, results, and jobs — and joining them through three HashMaps in memory. This was the heaviest client-side join in the codebase.

This PR adds GET /workflows/{id}/slurm_job_correlations, which performs the entire join in a single SQL query:

scheduled_compute_node (scheduler_id = Slurm job ID)
  → compute_node      (linked via the scheduler JSON's scheduler_id)
  → result            (compute_node_id)
  → job

Rows are grouped and ordered by (slurm_job_id, job_id) and cover all runs, matching the prior all_runs=true behavior. Each table is narrowed by its workflow_id index before joining: EXPLAIN QUERY PLAN shows the query starting from result via idx_result_workflow_id and then doing primary-key lookups, with no table scans.
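For readers who want to see what the join chain computes, here is a minimal in-memory model in Rust. It is a sketch, not the SQL from this PR: the struct and field names are hypothetical simplifications of the four tables, and it assumes the value extracted from a compute node's scheduler JSON equals the scheduled compute node's scheduler_id. A BTreeSet stands in for the deduplication and ordering the server does with GROUP BY and ORDER BY; note that it orders slurm_job_id as a string, which may differ from the database's ordering.

```rust
use std::collections::{BTreeSet, HashMap};

// Hypothetical, simplified rows; the real tables have more columns.
pub struct ScheduledComputeNode {
    pub scheduler_id: String, // the Slurm job ID
}

pub struct ComputeNode {
    pub id: i64,
    // Assumption: already extracted from the compute node's scheduler JSON.
    pub scheduler_id: String,
}

pub struct ResultRow {
    pub compute_node_id: i64,
    pub job_id: i64,
}

pub struct Job {
    pub id: i64,
    pub name: String,
}

/// Returns deduplicated (slurm_job_id, job_id, job_name) rows,
/// ordered by (slurm_job_id, job_id).
pub fn correlate(
    scns: &[ScheduledComputeNode],
    nodes: &[ComputeNode],
    results: &[ResultRow],
    jobs: &[Job],
) -> Vec<(String, i64, String)> {
    // Slurm job IDs known to the scheduler for this workflow.
    let slurm_ids: BTreeSet<&str> = scns.iter().map(|s| s.scheduler_id.as_str()).collect();

    // compute node id -> Slurm job ID, keeping only nodes tied to a known Slurm job.
    let node_to_slurm: HashMap<i64, &str> = nodes
        .iter()
        .filter(|n| slurm_ids.contains(n.scheduler_id.as_str()))
        .map(|n| (n.id, n.scheduler_id.as_str()))
        .collect();

    // job id -> job name.
    let job_names: HashMap<i64, &str> = jobs.iter().map(|j| (j.id, j.name.as_str())).collect();

    // The set drops duplicate rows (for example, several runs of one job on
    // the same node) and keeps the rows sorted.
    let mut rows: BTreeSet<(String, i64, String)> = BTreeSet::new();
    for r in results {
        let Some(&slurm) = node_to_slurm.get(&r.compute_node_id) else { continue };
        let Some(&name) = job_names.get(&r.job_id) else { continue };
        rows.insert((slurm.to_string(), r.job_id, name.to_string()));
    }
    rows.into_iter().collect()
}
```

Results whose compute node has no matching scheduled compute node are dropped, which is the inner-join behavior this sketch assumes.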

build_slurm_to_jobs_map now makes a single call and rebuilds the same HashMap<String, Vec<AffectedJob>>, so all consumers (diagnose-logs output) are unchanged. The now-unused pagination imports are removed.

Response shape

// GET /workflows/{id}/slurm_job_correlations
{ "items": [ { "slurm_job_id": "987654", "job_id": 42, "job_name": "train" } ] }

The client groups items by slurm_job_id; the server already deduplicates (GROUP BY) and orders the rows.
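The client-side grouping step could look roughly like the sketch below. It assumes the item fields shown in the response shape above, with job_id as an i64. The AffectedJob fields and the group_by_slurm_job helper are hypothetical, since this description does not show the real definitions in build_slurm_to_jobs_map.

```rust
use std::collections::HashMap;

// Mirrors one element of "items" in the response above.
#[derive(Debug, Clone, PartialEq)]
pub struct SlurmJobCorrelation {
    pub slurm_job_id: String,
    pub job_id: i64,
    pub job_name: String,
}

// Hypothetical fields; the real AffectedJob may carry more or different data.
#[derive(Debug, Clone, PartialEq)]
pub struct AffectedJob {
    pub job_id: i64,
    pub job_name: String,
}

/// Groups already-deduplicated, already-ordered rows by Slurm job ID.
/// Each Vec keeps the server's row order for that Slurm job.
pub fn group_by_slurm_job(
    items: Vec<SlurmJobCorrelation>,
) -> HashMap<String, Vec<AffectedJob>> {
    let mut map: HashMap<String, Vec<AffectedJob>> = HashMap::new();
    for item in items {
        map.entry(item.slurm_job_id).or_default().push(AffectedJob {
            job_id: item.job_id,
            job_name: item.job_name,
        });
    }
    map
}
```

Because the server returns rows that are already unique and sorted, the client needs only one pass and no deduplication of its own.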

Testing

  • New integration tests: test_get_slurm_job_correlations (builds the full SCN→compute_node→result→job chain and asserts the correlation) and test_get_slurm_job_correlations_not_found
  • cargo fmt --check, cargo clippy --all --all-targets --all-features -- -D warnings, dprint check — clean (pre-commit hook)
  • OpenAPI codegen parity check — ok; Rust/Python/Julia clients regenerated

🤖 Generated with Claude Code

`torc slurm diagnose-logs` correlated Slurm job IDs to the Torc jobs they
ran by fetching four full lists (scheduled compute nodes, compute nodes,
results, jobs) and joining them through three HashMaps in
`build_slurm_to_jobs_map` -- the heaviest client-side join in the codebase.

Add `GET /workflows/{id}/slurm_job_correlations`, which performs the whole
join in one SQL query: scheduled_compute_node (scheduler_id = Slurm job ID)
-> compute_node (linked via the scheduler JSON's scheduler_id) -> result ->
job, grouped/ordered by (slurm_job_id, job_id) and covering all runs (matching
the prior all_runs=true behavior). Every table is narrowed by its workflow_id
index before joining; the query plan is index-only with no table scans.

`build_slurm_to_jobs_map` now makes a single call and rebuilds the same
`HashMap<String, Vec<AffectedJob>>`, so all consumers are unchanged. The
now-unused pagination imports are removed.

Adds integration tests for the correlation chain and the 404 path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@daniel-thom
Collaborator Author

Superseded by #366, which consolidates all the remaining server-side aggregation work (these two commits are the first two there) plus the running-jobs command and the results/compute-node listing improvements into a single PR.

@daniel-thom daniel-thom closed this Jun 5, 2026