Server-side aggregation endpoints + scheduler/job visibility in listings#366
Conversation
`torc slurm diagnose-logs` correlated Slurm job IDs to the Torc jobs they
ran by fetching four full lists (scheduled compute nodes, compute nodes,
results, jobs) and joining them through three HashMaps in
`build_slurm_to_jobs_map` -- the heaviest client-side join in the codebase.
Add `GET /workflows/{id}/slurm_job_correlations`, which performs the whole
join in one SQL query: scheduled_compute_node (scheduler_id = Slurm job ID)
-> compute_node (linked via the scheduler JSON's scheduler_id) -> result ->
job, grouped/ordered by (slurm_job_id, job_id) and covering all runs (matching
the prior all_runs=true behavior). Every table is narrowed by its workflow_id
index before joining; the query plan is index-only with no table scans.
`build_slurm_to_jobs_map` now makes a single call and rebuilds the same
`HashMap<String, Vec<AffectedJob>>`, so all consumers are unchanged. The
now-unused pagination imports are removed.
Adds integration tests for the correlation chain and the 404 path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The endpoint returned an unbounded `{ items }` list, inconsistent with every
other list endpoint and the project's 10,000-row transfer ceiling. Add the
standard pagination contract: `offset`/`limit` query params resolved via
`authorize_workflow_and_paginate!`, a `total_count` from a wrapping COUNT,
and the full envelope (offset, max_limit, count, total_count, has_more) built
with `paginated_list_response!`. The count query keeps the same index-only
plan.
`build_slurm_to_jobs_map` now pages through correlations (accumulating into the
same map, so per-Slurm-job Vecs stay ordered) instead of assuming a single
response.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds `GET /workflows/{id}/running_jobs` (paginated) and a `torc jobs running`
CLI command that lists each currently-running job with its compute node name
and, when the node was provisioned by a scheduler, that scheduler's job ID.
The design is scheduler-agnostic: the endpoint joins live job state
(job.compute_node_id, set while running) to the compute node and LEFT JOINs
the scheduled compute node it belongs to. `scheduler_type` comes from the
compute node's type (e.g. "slurm", "local") and `scheduler_job_id` is the
external scheduler job ID (the Slurm job ID today), NULL for nodes not managed
by a scheduler. New scheduler types flow through with no code changes.
Unlike slurm_job_correlations (which joins through results and therefore only
covers finished jobs), this uses the live job->compute_node link so it
reflects what is running right now. Server-side join + pagination keeps it off
the N+1 path; the query is index-only via idx_job_workflow_status.
Includes integration tests (running job on a Slurm node, plus the 404 path)
and regenerated Rust/Python/Julia clients.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The results table unconditionally excluded the Run ID column, so even with --all-runs (which fetches results across every run) there was no way to tell which run a row belonged to. Keep Run ID hidden for the default single-run view, but include it when --all-runs is passed. Adds a CLI test asserting the column is absent by default and present with --all-runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The compute-nodes table omitted scheduler info entirely, so the human-facing view couldn't answer "which Slurm job is this node from?" -- yet `-f json` exposed it (compute_node_type plus the scheduler blob). Add two always-on columns: Scheduler (the node's compute_node_type, e.g. slurm/local) and Scheduler Job ID (the external job ID from the scheduler blob; today slurm_job_id, "-" for local/unscheduled nodes). This is a client-only table change -- the data was already in the list payload. Kept generic (type from compute_node_type, ID from the blob) to match the running-jobs work and extend to future scheduler types. Adds a unit test covering local (no job ID) and Slurm (job ID surfaced) rows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`torc results list` previously fetched every job in the workflow just to label result rows with their job name (get_job_name_map). Add an optional job_name field to ResultModel that the server fills via a LEFT JOIN to job on the read paths (list_results and get_result), so the client reads result.job_name directly and the extra all-jobs fetch is gone. This benefits every consumer of the results API, not just the CLI. job_name is Option<String> + skip_serializing_if and ignored on create/update input. list_results now always aliases `result` as `r` (both the all_runs and workflow_result branches) so the join and sort columns stay unambiguous; get_result moves off the query! macro to a runtime query to avoid LEFT JOIN nullability inference. Clients regenerated. Removes get_job_name_map; asserts job_name in the results list test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR completes the remaining Tier-2 migrations of “client-side joins” into server-side SQL aggregation/list endpoints and surfaces additional scheduler/run visibility in CLI listings, while keeping all list-style responses paginated and aligned with the OpenAPI clients.
Changes:
- Adds two workflow-scoped server endpoints: Slurm job↔Torc job correlations and currently-running jobs (with compute-node + scheduler metadata).
- Denormalizes
results.job_nameonto server read paths (list/get) to eliminate an extra client “fetch all jobs” join. - Improves CLI listings: show Run ID for
results list --all-runs, show scheduler columns forcompute-nodes list, and addstorc jobs running.
Reviewed changes
Copilot reviewed 44 out of 44 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_workflows.rs | Adds integration tests for the new workflow endpoints (correlations + running jobs). |
| tests/test_results.rs | Extends results tests to assert server-populated job_name and Run ID column behavior. |
| src/server/response_types.rs | Exposes new workflow response enums through the response-type re-exports. |
| src/server/live_router.rs | Registers new routes and adds the HTTP handlers + OpenAPI annotations for both endpoints. |
| src/server/http_transport/response_mapping.rs | Adds response mappers for the new response enums. |
| src/server/http_server/workflows_transport.rs | Adds transport-layer methods that authorize workflow access and apply pagination. |
| src/server/http_server.rs | Wires new transport methods into the server trait implementation with tracing instrumentation. |
| src/server/api/workflows.rs | Implements the new SQL-backed server-side aggregation endpoints. |
| src/server/api/results.rs | Adds job_name to result list/get queries via LEFT JOIN job. |
| src/server/api_responses.rs | Introduces GetSlurmJobCorrelationsResponse and GetRunningJobsResponse enums. |
| src/server/api_contract.rs | Adds the new endpoints to the transport API contract. |
| src/openapi_spec.rs | Updates emitted OpenAPI schema/path registration + parity checks for new endpoints and job_name. |
| src/models.rs | Adds ResultModel.job_name and introduces RunningJobs/SlurmCorrelation model + response types. |
| src/lib.rs | Re-exports the new public models from the crate root. |
| src/client/offline_journal.rs | Updates test fixtures to account for the new ResultModel.job_name field. |
| src/client/commands/slurm.rs | Replaces client-side 4-list correlation join with the new server endpoint + paging. |
| src/client/commands/results.rs | Uses server-populated result.job_name and adjusts Run ID column visibility logic. |
| src/client/commands/jobs.rs | Adds torc jobs running command and table output for running jobs. |
| src/client/commands/compute_nodes.rs | Adds scheduler columns and parses scheduler job ID from the scheduler blob for display. |
| src/client/apis/workflows_api.rs | Adds generated Rust client methods + typed errors for the new endpoints. |
| python_client/src/torc/openapi_client/models/slurm_job_correlations_response.py | Generated Python model for correlations list response. |
| python_client/src/torc/openapi_client/models/slurm_job_correlation_model.py | Generated Python model for a correlation row. |
| python_client/src/torc/openapi_client/models/running_jobs_response.py | Generated Python model for running-jobs list response. |
| python_client/src/torc/openapi_client/models/running_job_model.py | Generated Python model for a running job row. |
| python_client/src/torc/openapi_client/models/result_model.py | Adds job_name to the generated Python ResultModel. |
| python_client/src/torc/openapi_client/models/init.py | Exports the newly generated Python models. |
| python_client/src/torc/openapi_client/api/workflows_api.py | Adds generated Python API methods for the new endpoints. |
| python_client/src/torc/openapi_client/init.py | Exports the new Python API surface/types at package root. |
| julia_client/Torc/src/api/models/model_SlurmJobCorrelationsResponse.jl | Generated Julia model for correlations list response. |
| julia_client/Torc/src/api/models/model_SlurmJobCorrelationModel.jl | Generated Julia model for a correlation row. |
| julia_client/Torc/src/api/models/model_RunningJobsResponse.jl | Generated Julia model for running-jobs list response. |
| julia_client/Torc/src/api/models/model_RunningJobModel.jl | Generated Julia model for a running job row. |
| julia_client/Torc/src/api/models/model_ResultModel.jl | Adds job_name to the generated Julia ResultModel. |
| julia_client/Torc/src/api/modelincludes.jl | Includes the new generated Julia model files. |
| julia_client/Torc/src/api/apis/api_WorkflowsApi.jl | Adds generated Julia API functions for the new endpoints. |
| julia_client/julia_client/README.md | Updates Julia client README endpoint + model lists. |
| julia_client/julia_client/docs/WorkflowsApi.md | Documents new workflow endpoints in the Julia client docs. |
| julia_client/julia_client/docs/SlurmJobCorrelationsResponse.md | Adds Julia doc page for correlations response. |
| julia_client/julia_client/docs/SlurmJobCorrelationModel.md | Adds Julia doc page for correlation model. |
| julia_client/julia_client/docs/RunningJobsResponse.md | Adds Julia doc page for running jobs response. |
| julia_client/julia_client/docs/RunningJobModel.md | Adds Julia doc page for running job model. |
| julia_client/julia_client/docs/ResultModel.md | Documents job_name on ResultModel for Julia client. |
| api/openapi.yaml | Adds the new endpoints + schemas and includes job_name on ResultModel. |
| api/openapi.codegen.yaml | Mirrors OpenAPI changes for codegen parity. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ) -> Response<Body> { | ||
| match state | ||
| .server | ||
| .get_slurm_job_correlations(id, query.offset, query.limit, &context) | ||
| .await | ||
| { | ||
| Ok(response) => get_slurm_job_correlations_response(response), | ||
| Err(err) => error_response(StatusCode::INTERNAL_SERVER_ERROR, err.0), | ||
| } |
| ) -> Response<Body> { | ||
| match state | ||
| .server | ||
| .get_running_jobs(id, query.offset, query.limit, &context) | ||
| .await | ||
| { | ||
| Ok(response) => get_running_jobs_response(response), | ||
| Err(err) => error_response(StatusCode::INTERNAL_SERVER_ERROR, err.0), | ||
| } |
…or absent scheduler ID - get_running_jobs / get_slurm_job_correlations: remove the explicit `SELECT 1 FROM workflow` existence checks. The transport's authorize_workflow! already returns 404 for a missing workflow (check_workflow_access always verifies existence, even when access control is disabled), so these were a redundant DB round-trip. Not-found behavior and tests are unchanged. - `torc jobs running`: render an absent scheduler_job_id as "-" instead of an empty string, matching the compute-nodes listing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks @copilot — addressed in 2864f4d. Fixed:
Not changing here:
|
|
@daniel-thom I've opened a new pull request, #367, to work on those changes. Once the pull request is ready, I'll request review from you. |
Use print_json_wrapped so `-f json` output matches the other list commands (jobs list, results list, etc.) and works with `jq '.items[]'` uniformly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks @copilot — fixed in the latest commit. |
Add the job's start_time to RunningJobModel (populated from job.start_time on the server) and an "Elapsed" column to the `torc jobs running` table, reusing the same format_elapsed helper as `torc jobs list`. Makes long-running/stuck jobs easy to spot. start_time is Option<String> + skip_serializing_if; clients regenerated; test asserts it's populated for a started job. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two bugs in the --include-logs report path: - The records array serialized under `results`; renamed to `items` (via serde) to match every other list command and the REST API's paginated responses. - `--failed` was ignored: the include-logs branch called generate_results_report without the flag. Thread `failed` through generate_results_report / build_results_report and filter to non-zero return codes, mirroring the non-include-logs path. Test asserts records serialize under `items` (not `results`). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 47 out of 47 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
src/server/api/results.rs:356
- In
list_results,bind_valuesis built up alongsidewhere_conditionsbut is never used to bind the SQL query (bindings are re-applied manually later). This extra collection adds cognitive overhead and small but unnecessary allocations; either removebind_valuesentirely or refactor to bind from it in one place.
SQLite treats a negative LIMIT (e.g. LIMIT -1) as "no limit", so a request with `?limit=-1` bypassed MAX_RECORD_TRANSFER_COUNT and could fetch unbounded results on any paginated endpoint. Reject negative `limit` (and `offset`) in the shared helper so the cap can't be circumvented. Adds unit tests for the negative/oversized/default cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Fixed the negative- |
Consolidates the remaining Tier-2 server-side aggregation work and several CLI listing improvements into one PR. Follows #363 (
torc status) and #364 (torc watch), both already merged.New server endpoints (paginated, scheduler-agnostic where relevant)
GET /workflows/{id}/slurm_job_correlations— correlates Slurm job IDs to the Torc jobs they ran (scheduled_compute_node → compute_node → result → job) in one SQL query. Replaces the 4-full-list client-side join intorc slurm diagnose-logs'sbuild_slurm_to_jobs_map.GET /workflows/{id}/running_jobs+torc jobs running— lists currently-running jobs with their compute node name and, when scheduler-provisioned, the scheduler type + job ID (Slurm today; generic for future schedulers). Uses the livejob.compute_node_idlink, so it reflects what's running now (the correlations endpoint is results-based = finished jobs only).Both join through
workflow_idindexes (index-only plans, no scans) and use the standard pagination envelope.CLI listing improvements
torc results list --all-runsnow shows the Run ID column (it was unconditionally hidden, so runs were indistinguishable).torc compute-nodes listnow always shows Scheduler (type) and Scheduler Job ID columns — the data was already in the payload, just not surfaced in the table.resultsjob_nameis now populated server-side (ResultModel.job_name, filled via aLEFT JOIN jobonlist_results/get_result). Removes the extra all-jobs fetch (get_job_name_map) and benefits every consumer of the results API, not just the CLI.Notes
Testing
cargo fmt --check,cargo clippy --all --all-targets --all-features -D warnings,dprint check, OpenAPI codegen parity — all clean.🤖 Generated with Claude Code