Server-side aggregation endpoints + scheduler/job visibility in listings by daniel-thom · Pull Request #366 · NatLabRockies/torc

daniel-thom · 2026-06-05T21:17:55Z

Consolidates the remaining Tier-2 server-side aggregation work and several CLI listing improvements into one PR. Follows #363 (torc status) and #364 (torc watch), both already merged.

New server endpoints (paginated, scheduler-agnostic where relevant)

GET /workflows/{id}/slurm_job_correlations — correlates Slurm job IDs to the Torc jobs they ran (scheduled_compute_node → compute_node → result → job) in one SQL query. Replaces the 4-full-list client-side join in torc slurm diagnose-logs's build_slurm_to_jobs_map.
GET /workflows/{id}/running_jobs + torc jobs running — lists currently-running jobs with their compute node name and, when scheduler-provisioned, the scheduler type + job ID (Slurm today; generic for future schedulers). Uses the live job.compute_node_id link, so it reflects what's running now (the correlations endpoint is results-based = finished jobs only).

Both join through workflow_id indexes (index-only plans, no scans) and use the standard pagination envelope.

CLI listing improvements

torc results list --all-runs now shows the Run ID column (it was unconditionally hidden, so runs were indistinguishable).
torc compute-nodes list now always shows Scheduler (type) and Scheduler Job ID columns — the data was already in the payload, just not surfaced in the table.
results job_name is now populated server-side (ResultModel.job_name, filled via a LEFT JOIN job on list_results/get_result). Removes the extra all-jobs fetch (get_job_name_map) and benefits every consumer of the results API, not just the CLI.

Notes

Supersedes Add server-side Slurm-job correlation endpoint #365 (its two commits are the first two here).
All four pieces are list-shaped and carry the pagination envelope.
Rust/Python/Julia clients regenerated; OpenAPI parity check passes.

Testing

cargo fmt --check, cargo clippy --all --all-targets --all-features -D warnings, dprint check, OpenAPI codegen parity — all clean.
21 integration/unit tests across the new endpoints, results listing (all-runs/sorting/filters/job_name), and compute-node rows — all passing.

🤖 Generated with Claude Code

`torc slurm diagnose-logs` correlated Slurm job IDs to the Torc jobs they ran by fetching four full lists (scheduled compute nodes, compute nodes, results, jobs) and joining them through three HashMaps in `build_slurm_to_jobs_map` -- the heaviest client-side join in the codebase. Add `GET /workflows/{id}/slurm_job_correlations`, which performs the whole join in one SQL query: scheduled_compute_node (scheduler_id = Slurm job ID) -> compute_node (linked via the scheduler JSON's scheduler_id) -> result -> job, grouped/ordered by (slurm_job_id, job_id) and covering all runs (matching the prior all_runs=true behavior). Every table is narrowed by its workflow_id index before joining; the query plan is index-only with no table scans. `build_slurm_to_jobs_map` now makes a single call and rebuilds the same `HashMap<String, Vec<AffectedJob>>`, so all consumers are unchanged. The now-unused pagination imports are removed. Adds integration tests for the correlation chain and the 404 path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The endpoint returned an unbounded `{ items }` list, inconsistent with every other list endpoint and the project's 10,000-row transfer ceiling. Add the standard pagination contract: `offset`/`limit` query params resolved via `authorize_workflow_and_paginate!`, a `total_count` from a wrapping COUNT, and the full envelope (offset, max_limit, count, total_count, has_more) built with `paginated_list_response!`. The count query keeps the same index-only plan. `build_slurm_to_jobs_map` now pages through correlations (accumulating into the same map, so per-Slurm-job Vecs stay ordered) instead of assuming a single response. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds `GET /workflows/{id}/running_jobs` (paginated) and a `torc jobs running` CLI command that lists each currently-running job with its compute node name and, when the node was provisioned by a scheduler, that scheduler's job ID. The design is scheduler-agnostic: the endpoint joins live job state (job.compute_node_id, set while running) to the compute node and LEFT JOINs the scheduled compute node it belongs to. `scheduler_type` comes from the compute node's type (e.g. "slurm", "local") and `scheduler_job_id` is the external scheduler job ID (the Slurm job ID today), NULL for nodes not managed by a scheduler. New scheduler types flow through with no code changes. Unlike slurm_job_correlations (which joins through results and therefore only covers finished jobs), this uses the live job->compute_node link so it reflects what is running right now. Server-side join + pagination keeps it off the N+1 path; the query is index-only via idx_job_workflow_status. Includes integration tests (running job on a Slurm node, plus the 404 path) and regenerated Rust/Python/Julia clients. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The results table unconditionally excluded the Run ID column, so even with --all-runs (which fetches results across every run) there was no way to tell which run a row belonged to. Keep Run ID hidden for the default single-run view, but include it when --all-runs is passed. Adds a CLI test asserting the column is absent by default and present with --all-runs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The compute-nodes table omitted scheduler info entirely, so the human-facing view couldn't answer "which Slurm job is this node from?" -- yet `-f json` exposed it (compute_node_type plus the scheduler blob). Add two always-on columns: Scheduler (the node's compute_node_type, e.g. slurm/local) and Scheduler Job ID (the external job ID from the scheduler blob; today slurm_job_id, "-" for local/unscheduled nodes). This is a client-only table change -- the data was already in the list payload. Kept generic (type from compute_node_type, ID from the blob) to match the running-jobs work and extend to future scheduler types. Adds a unit test covering local (no job ID) and Slurm (job ID surfaced) rows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`torc results list` previously fetched every job in the workflow just to label result rows with their job name (get_job_name_map). Add an optional job_name field to ResultModel that the server fills via a LEFT JOIN to job on the read paths (list_results and get_result), so the client reads result.job_name directly and the extra all-jobs fetch is gone. This benefits every consumer of the results API, not just the CLI. job_name is Option<String> + skip_serializing_if and ignored on create/update input. list_results now always aliases `result` as `r` (both the all_runs and workflow_result branches) so the join and sort columns stay unambiguous; get_result moves off the query! macro to a runtime query to avoid LEFT JOIN nullability inference. Clients regenerated. Removes get_job_name_map; asserts job_name in the results list test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR completes the remaining Tier-2 migrations of “client-side joins” into server-side SQL aggregation/list endpoints and surfaces additional scheduler/run visibility in CLI listings, while keeping all list-style responses paginated and aligned with the OpenAPI clients.

Changes:

Adds two workflow-scoped server endpoints: Slurm job↔Torc job correlations and currently-running jobs (with compute-node + scheduler metadata).
Denormalizes results.job_name onto server read paths (list/get) to eliminate an extra client “fetch all jobs” join.
Improves CLI listings: show Run ID for results list --all-runs, show scheduler columns for compute-nodes list, and adds torc jobs running.

Reviewed changes

Copilot reviewed 44 out of 44 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
tests/test_workflows.rs	Adds integration tests for the new workflow endpoints (correlations + running jobs).
tests/test_results.rs	Extends results tests to assert server-populated `job_name` and Run ID column behavior.
src/server/response_types.rs	Exposes new workflow response enums through the response-type re-exports.
src/server/live_router.rs	Registers new routes and adds the HTTP handlers + OpenAPI annotations for both endpoints.
src/server/http_transport/response_mapping.rs	Adds response mappers for the new response enums.
src/server/http_server/workflows_transport.rs	Adds transport-layer methods that authorize workflow access and apply pagination.
src/server/http_server.rs	Wires new transport methods into the server trait implementation with tracing instrumentation.
src/server/api/workflows.rs	Implements the new SQL-backed server-side aggregation endpoints.
src/server/api/results.rs	Adds `job_name` to result list/get queries via `LEFT JOIN job`.
src/server/api_responses.rs	Introduces `GetSlurmJobCorrelationsResponse` and `GetRunningJobsResponse` enums.
src/server/api_contract.rs	Adds the new endpoints to the transport API contract.
src/openapi_spec.rs	Updates emitted OpenAPI schema/path registration + parity checks for new endpoints and `job_name`.
src/models.rs	Adds `ResultModel.job_name` and introduces RunningJobs/SlurmCorrelation model + response types.
src/lib.rs	Re-exports the new public models from the crate root.
src/client/offline_journal.rs	Updates test fixtures to account for the new `ResultModel.job_name` field.
src/client/commands/slurm.rs	Replaces client-side 4-list correlation join with the new server endpoint + paging.
src/client/commands/results.rs	Uses server-populated `result.job_name` and adjusts Run ID column visibility logic.
src/client/commands/jobs.rs	Adds `torc jobs running` command and table output for running jobs.
src/client/commands/compute_nodes.rs	Adds scheduler columns and parses scheduler job ID from the scheduler blob for display.
src/client/apis/workflows_api.rs	Adds generated Rust client methods + typed errors for the new endpoints.
python_client/src/torc/openapi_client/models/slurm_job_correlations_response.py	Generated Python model for correlations list response.
python_client/src/torc/openapi_client/models/slurm_job_correlation_model.py	Generated Python model for a correlation row.
python_client/src/torc/openapi_client/models/running_jobs_response.py	Generated Python model for running-jobs list response.
python_client/src/torc/openapi_client/models/running_job_model.py	Generated Python model for a running job row.
python_client/src/torc/openapi_client/models/result_model.py	Adds `job_name` to the generated Python ResultModel.
python_client/src/torc/openapi_client/models/init.py	Exports the newly generated Python models.
python_client/src/torc/openapi_client/api/workflows_api.py	Adds generated Python API methods for the new endpoints.
python_client/src/torc/openapi_client/init.py	Exports the new Python API surface/types at package root.
julia_client/Torc/src/api/models/model_SlurmJobCorrelationsResponse.jl	Generated Julia model for correlations list response.
julia_client/Torc/src/api/models/model_SlurmJobCorrelationModel.jl	Generated Julia model for a correlation row.
julia_client/Torc/src/api/models/model_RunningJobsResponse.jl	Generated Julia model for running-jobs list response.
julia_client/Torc/src/api/models/model_RunningJobModel.jl	Generated Julia model for a running job row.
julia_client/Torc/src/api/models/model_ResultModel.jl	Adds `job_name` to the generated Julia ResultModel.
julia_client/Torc/src/api/modelincludes.jl	Includes the new generated Julia model files.
julia_client/Torc/src/api/apis/api_WorkflowsApi.jl	Adds generated Julia API functions for the new endpoints.
julia_client/julia_client/README.md	Updates Julia client README endpoint + model lists.
julia_client/julia_client/docs/WorkflowsApi.md	Documents new workflow endpoints in the Julia client docs.
julia_client/julia_client/docs/SlurmJobCorrelationsResponse.md	Adds Julia doc page for correlations response.
julia_client/julia_client/docs/SlurmJobCorrelationModel.md	Adds Julia doc page for correlation model.
julia_client/julia_client/docs/RunningJobsResponse.md	Adds Julia doc page for running jobs response.
julia_client/julia_client/docs/RunningJobModel.md	Adds Julia doc page for running job model.
julia_client/julia_client/docs/ResultModel.md	Documents `job_name` on ResultModel for Julia client.
api/openapi.yaml	Adds the new endpoints + schemas and includes `job_name` on ResultModel.
api/openapi.codegen.yaml	Mirrors OpenAPI changes for codegen parity.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+) -> Response<Body> {
+    match state
+        .server
+        .get_slurm_job_correlations(id, query.offset, query.limit, &context)
+        .await
+    {
+        Ok(response) => get_slurm_job_correlations_response(response),
+        Err(err) => error_response(StatusCode::INTERNAL_SERVER_ERROR, err.0),
+    }


+) -> Response<Body> {
+    match state
+        .server
+        .get_running_jobs(id, query.offset, query.limit, &context)
+        .await
+    {
+        Ok(response) => get_running_jobs_response(response),
+        Err(err) => error_response(StatusCode::INTERNAL_SERVER_ERROR, err.0),
+    }


…or absent scheduler ID - get_running_jobs / get_slurm_job_correlations: remove the explicit `SELECT 1 FROM workflow` existence checks. The transport's authorize_workflow! already returns 404 for a missing workflow (check_workflow_access always verifies existence, even when access control is disabled), so these were a redundant DB round-trip. Not-found behavior and tests are unchanged. - `torc jobs running`: render an absent scheduler_job_id as "-" instead of an empty string, matching the compute-nodes listing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

daniel-thom · 2026-06-05T21:48:31Z

Thanks @copilot — addressed in 2864f4d.

Fixed:

Redundant workflow existence checks (api/workflows.rs running-jobs + correlations) — removed. authorize_workflow! runs in the transport before the impl and AuthorizationService::check_workflow_access always verifies existence first (returning 404) regardless of whether access control is enforced, so the extra SELECT 1 was a duplicate round-trip. 404 behavior and the not-found tests are unchanged.
scheduler_job_id rendering (torc jobs running) — now renders absent IDs as -, matching the compute-nodes listing.

Not changing here:

Pagination errors → HTTP 500 vs 422 (the two live_router handlers). You're right that an oversized limit surfaces as 500, but this is the project-wide behavior for every paginated endpoint: they all go through the shared authorize_workflow_and_paginate! → process_pagination_params (which returns an ApiError) and the live_router maps Err → 500. Returning 422 here would mean changing that shared error mapping for all list endpoints, which is a cross-cutting change out of scope for this PR. Happy to file a follow-up if you'd like it addressed consistently.

Copilot · 2026-06-05T21:48:42Z

@daniel-thom I've opened a new pull request, #367, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot

Pull request overview

Copilot reviewed 44 out of 44 changed files in this pull request and generated 1 comment.

Use print_json_wrapped so `-f json` output matches the other list commands (jobs list, results list, etc.) and works with `jq '.items[]'` uniformly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

daniel-thom · 2026-06-05T22:03:49Z

Thanks @copilot — fixed in the latest commit. torc jobs running -f json now uses print_json_wrapped ({"items": [...]}), matching jobs list/results list and the documented list-command convention.

Add the job's start_time to RunningJobModel (populated from job.start_time on the server) and an "Elapsed" column to the `torc jobs running` table, reusing the same format_elapsed helper as `torc jobs list`. Makes long-running/stuck jobs easy to spot. start_time is Option<String> + skip_serializing_if; clients regenerated; test asserts it's populated for a started job. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two bugs in the --include-logs report path: - The records array serialized under `results`; renamed to `items` (via serde) to match every other list command and the REST API's paginated responses. - `--failed` was ignored: the include-logs branch called generate_results_report without the flag. Thread `failed` through generate_results_report / build_results_report and filter to non-zero return codes, mirroring the non-include-logs path. Test asserts records serialize under `items` (not `results`). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 47 out of 47 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/server/api/results.rs:356

In list_results, bind_values is built up alongside where_conditions but is never used to bind the SQL query (bindings are re-applied manually later). This extra collection adds cognitive overhead and small but unnecessary allocations; either remove bind_values entirely or refactor to bind from it in one place.

SQLite treats a negative LIMIT (e.g. LIMIT -1) as "no limit", so a request with `?limit=-1` bypassed MAX_RECORD_TRANSFER_COUNT and could fetch unbounded results on any paginated endpoint. Reject negative `limit` (and `offset`) in the shared helper so the cap can't be circumvented. Adds unit tests for the negative/oversized/default cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

daniel-thom · 2026-06-06T01:17:28Z

Fixed the negative-limit issue in 91ee045: process_pagination_params now rejects negative limit (and offset), so LIMIT -1 can no longer bypass MAX_RECORD_TRANSFER_COUNT on any paginated endpoint. Added unit tests for the negative/oversized/default cases.

daniel-thom and others added 6 commits June 5, 2026 15:12

daniel-thom mentioned this pull request Jun 5, 2026

Add server-side Slurm-job correlation endpoint #365

Closed

daniel-thom requested a review from Copilot June 5, 2026 21:18

Copilot started reviewing on behalf of daniel-thom June 5, 2026 21:18 View session

Copilot AI reviewed Jun 5, 2026

View reviewed changes

Copilot AI mentioned this pull request Jun 5, 2026

[WIP] Add server-side aggregation endpoints and job visibility in listings #367

Closed

daniel-thom requested a review from Copilot June 5, 2026 21:53

Copilot started reviewing on behalf of daniel-thom June 5, 2026 21:53 View session

Copilot AI reviewed Jun 5, 2026

View reviewed changes

Comment thread src/client/commands/jobs.rs

Copilot started work on behalf of daniel-thom June 5, 2026 22:03 View session

Copilot finished work on behalf of daniel-thom June 5, 2026 22:06

daniel-thom and others added 2 commits June 5, 2026 17:46

daniel-thom requested a review from Copilot June 6, 2026 00:45

Copilot started reviewing on behalf of daniel-thom June 6, 2026 00:46 View session

Copilot AI reviewed Jun 6, 2026

View reviewed changes

Comment thread src/server/http_server/workflows_transport.rs

Comment thread src/client/report_models.rs

daniel-thom mentioned this pull request Jun 6, 2026

Make concurrent initialize_jobs dedup test deterministic #369

Merged

daniel-thom merged commit 8e1930b into main Jun 6, 2026
10 of 11 checks passed

daniel-thom deleted the consolidated-tier2 branch June 6, 2026 02:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Server-side aggregation endpoints + scheduler/job visibility in listings#366

Server-side aggregation endpoints + scheduler/job visibility in listings#366
daniel-thom merged 11 commits into
mainfrom
consolidated-tier2

daniel-thom commented Jun 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

daniel-thom commented Jun 5, 2026

Uh oh!

Copilot AI commented Jun 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

daniel-thom commented Jun 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

daniel-thom commented Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

daniel-thom commented Jun 5, 2026

New server endpoints (paginated, scheduler-agnostic where relevant)

CLI listing improvements

Notes

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

daniel-thom commented Jun 5, 2026

Uh oh!

Copilot AI commented Jun 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

daniel-thom commented Jun 5, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

daniel-thom commented Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants