Skip to content

Server-side aggregation endpoints + scheduler/job visibility in listings#366

Merged
daniel-thom merged 11 commits into
mainfrom
consolidated-tier2
Jun 6, 2026
Merged

Server-side aggregation endpoints + scheduler/job visibility in listings#366
daniel-thom merged 11 commits into
mainfrom
consolidated-tier2

Conversation

@daniel-thom
Copy link
Copy Markdown
Collaborator

Consolidates the remaining Tier-2 server-side aggregation work and several CLI listing improvements into one PR. Follows #363 (torc status) and #364 (torc watch), both already merged.

New server endpoints (paginated, scheduler-agnostic where relevant)

  • GET /workflows/{id}/slurm_job_correlations — correlates Slurm job IDs to the Torc jobs they ran (scheduled_compute_node → compute_node → result → job) in one SQL query. Replaces the 4-full-list client-side join in torc slurm diagnose-logs's build_slurm_to_jobs_map.
  • GET /workflows/{id}/running_jobs + torc jobs running — lists currently-running jobs with their compute node name and, when scheduler-provisioned, the scheduler type + job ID (Slurm today; generic for future schedulers). Uses the live job.compute_node_id link, so it reflects what's running now (the correlations endpoint is results-based = finished jobs only).

Both join through workflow_id indexes (index-only plans, no scans) and use the standard pagination envelope.

CLI listing improvements

  • torc results list --all-runs now shows the Run ID column (it was unconditionally hidden, so runs were indistinguishable).
  • torc compute-nodes list now always shows Scheduler (type) and Scheduler Job ID columns — the data was already in the payload, just not surfaced in the table.
  • results job_name is now populated server-side (ResultModel.job_name, filled via a LEFT JOIN job on list_results/get_result). Removes the extra all-jobs fetch (get_job_name_map) and benefits every consumer of the results API, not just the CLI.

Notes

Testing

  • cargo fmt --check, cargo clippy --all --all-targets --all-features -D warnings, dprint check, OpenAPI codegen parity — all clean.
  • 21 integration/unit tests across the new endpoints, results listing (all-runs/sorting/filters/job_name), and compute-node rows — all passing.

🤖 Generated with Claude Code

daniel-thom and others added 6 commits June 5, 2026 15:12
`torc slurm diagnose-logs` correlated Slurm job IDs to the Torc jobs they
ran by fetching four full lists (scheduled compute nodes, compute nodes,
results, jobs) and joining them through three HashMaps in
`build_slurm_to_jobs_map` -- the heaviest client-side join in the codebase.

Add `GET /workflows/{id}/slurm_job_correlations`, which performs the whole
join in one SQL query: scheduled_compute_node (scheduler_id = Slurm job ID)
-> compute_node (linked via the scheduler JSON's scheduler_id) -> result ->
job, grouped/ordered by (slurm_job_id, job_id) and covering all runs (matching
the prior all_runs=true behavior). Every table is narrowed by its workflow_id
index before joining; the query plan is index-only with no table scans.

`build_slurm_to_jobs_map` now makes a single call and rebuilds the same
`HashMap<String, Vec<AffectedJob>>`, so all consumers are unchanged. The
now-unused pagination imports are removed.

Adds integration tests for the correlation chain and the 404 path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The endpoint returned an unbounded `{ items }` list, inconsistent with every
other list endpoint and the project's 10,000-row transfer ceiling. Add the
standard pagination contract: `offset`/`limit` query params resolved via
`authorize_workflow_and_paginate!`, a `total_count` from a wrapping COUNT,
and the full envelope (offset, max_limit, count, total_count, has_more) built
with `paginated_list_response!`. The count query keeps the same index-only
plan.

`build_slurm_to_jobs_map` now pages through correlations (accumulating into the
same map, so per-Slurm-job Vecs stay ordered) instead of assuming a single
response.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds `GET /workflows/{id}/running_jobs` (paginated) and a `torc jobs running`
CLI command that lists each currently-running job with its compute node name
and, when the node was provisioned by a scheduler, that scheduler's job ID.

The design is scheduler-agnostic: the endpoint joins live job state
(job.compute_node_id, set while running) to the compute node and LEFT JOINs
the scheduled compute node it belongs to. `scheduler_type` comes from the
compute node's type (e.g. "slurm", "local") and `scheduler_job_id` is the
external scheduler job ID (the Slurm job ID today), NULL for nodes not managed
by a scheduler. New scheduler types flow through with no code changes.

Unlike slurm_job_correlations (which joins through results and therefore only
covers finished jobs), this uses the live job->compute_node link so it
reflects what is running right now. Server-side join + pagination keeps it off
the N+1 path; the query is index-only via idx_job_workflow_status.

Includes integration tests (running job on a Slurm node, plus the 404 path)
and regenerated Rust/Python/Julia clients.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The results table unconditionally excluded the Run ID column, so even with
--all-runs (which fetches results across every run) there was no way to tell
which run a row belonged to. Keep Run ID hidden for the default single-run
view, but include it when --all-runs is passed.

Adds a CLI test asserting the column is absent by default and present with
--all-runs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The compute-nodes table omitted scheduler info entirely, so the human-facing
view couldn't answer "which Slurm job is this node from?" -- yet `-f json`
exposed it (compute_node_type plus the scheduler blob). Add two always-on
columns: Scheduler (the node's compute_node_type, e.g. slurm/local) and
Scheduler Job ID (the external job ID from the scheduler blob; today
slurm_job_id, "-" for local/unscheduled nodes).

This is a client-only table change -- the data was already in the list
payload. Kept generic (type from compute_node_type, ID from the blob) to match
the running-jobs work and extend to future scheduler types.

Adds a unit test covering local (no job ID) and Slurm (job ID surfaced) rows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`torc results list` previously fetched every job in the workflow just to label
result rows with their job name (get_job_name_map). Add an optional job_name
field to ResultModel that the server fills via a LEFT JOIN to job on the read
paths (list_results and get_result), so the client reads result.job_name
directly and the extra all-jobs fetch is gone. This benefits every consumer of
the results API, not just the CLI.

job_name is Option<String> + skip_serializing_if and ignored on create/update
input. list_results now always aliases `result` as `r` (both the all_runs and
workflow_result branches) so the join and sort columns stay unambiguous;
get_result moves off the query! macro to a runtime query to avoid LEFT JOIN
nullability inference. Clients regenerated.

Removes get_job_name_map; asserts job_name in the results list test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR completes the remaining Tier-2 migrations of “client-side joins” into server-side SQL aggregation/list endpoints and surfaces additional scheduler/run visibility in CLI listings, while keeping all list-style responses paginated and aligned with the OpenAPI clients.

Changes:

  • Adds two workflow-scoped server endpoints: Slurm job↔Torc job correlations and currently-running jobs (with compute-node + scheduler metadata).
  • Denormalizes results.job_name onto server read paths (list/get) to eliminate an extra client “fetch all jobs” join.
  • Improves CLI listings: show Run ID for results list --all-runs, show scheduler columns for compute-nodes list, and adds torc jobs running.

Reviewed changes

Copilot reviewed 44 out of 44 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/test_workflows.rs Adds integration tests for the new workflow endpoints (correlations + running jobs).
tests/test_results.rs Extends results tests to assert server-populated job_name and Run ID column behavior.
src/server/response_types.rs Exposes new workflow response enums through the response-type re-exports.
src/server/live_router.rs Registers new routes and adds the HTTP handlers + OpenAPI annotations for both endpoints.
src/server/http_transport/response_mapping.rs Adds response mappers for the new response enums.
src/server/http_server/workflows_transport.rs Adds transport-layer methods that authorize workflow access and apply pagination.
src/server/http_server.rs Wires new transport methods into the server trait implementation with tracing instrumentation.
src/server/api/workflows.rs Implements the new SQL-backed server-side aggregation endpoints.
src/server/api/results.rs Adds job_name to result list/get queries via LEFT JOIN job.
src/server/api_responses.rs Introduces GetSlurmJobCorrelationsResponse and GetRunningJobsResponse enums.
src/server/api_contract.rs Adds the new endpoints to the transport API contract.
src/openapi_spec.rs Updates emitted OpenAPI schema/path registration + parity checks for new endpoints and job_name.
src/models.rs Adds ResultModel.job_name and introduces RunningJobs/SlurmCorrelation model + response types.
src/lib.rs Re-exports the new public models from the crate root.
src/client/offline_journal.rs Updates test fixtures to account for the new ResultModel.job_name field.
src/client/commands/slurm.rs Replaces client-side 4-list correlation join with the new server endpoint + paging.
src/client/commands/results.rs Uses server-populated result.job_name and adjusts Run ID column visibility logic.
src/client/commands/jobs.rs Adds torc jobs running command and table output for running jobs.
src/client/commands/compute_nodes.rs Adds scheduler columns and parses scheduler job ID from the scheduler blob for display.
src/client/apis/workflows_api.rs Adds generated Rust client methods + typed errors for the new endpoints.
python_client/src/torc/openapi_client/models/slurm_job_correlations_response.py Generated Python model for correlations list response.
python_client/src/torc/openapi_client/models/slurm_job_correlation_model.py Generated Python model for a correlation row.
python_client/src/torc/openapi_client/models/running_jobs_response.py Generated Python model for running-jobs list response.
python_client/src/torc/openapi_client/models/running_job_model.py Generated Python model for a running job row.
python_client/src/torc/openapi_client/models/result_model.py Adds job_name to the generated Python ResultModel.
python_client/src/torc/openapi_client/models/init.py Exports the newly generated Python models.
python_client/src/torc/openapi_client/api/workflows_api.py Adds generated Python API methods for the new endpoints.
python_client/src/torc/openapi_client/init.py Exports the new Python API surface/types at package root.
julia_client/Torc/src/api/models/model_SlurmJobCorrelationsResponse.jl Generated Julia model for correlations list response.
julia_client/Torc/src/api/models/model_SlurmJobCorrelationModel.jl Generated Julia model for a correlation row.
julia_client/Torc/src/api/models/model_RunningJobsResponse.jl Generated Julia model for running-jobs list response.
julia_client/Torc/src/api/models/model_RunningJobModel.jl Generated Julia model for a running job row.
julia_client/Torc/src/api/models/model_ResultModel.jl Adds job_name to the generated Julia ResultModel.
julia_client/Torc/src/api/modelincludes.jl Includes the new generated Julia model files.
julia_client/Torc/src/api/apis/api_WorkflowsApi.jl Adds generated Julia API functions for the new endpoints.
julia_client/julia_client/README.md Updates Julia client README endpoint + model lists.
julia_client/julia_client/docs/WorkflowsApi.md Documents new workflow endpoints in the Julia client docs.
julia_client/julia_client/docs/SlurmJobCorrelationsResponse.md Adds Julia doc page for correlations response.
julia_client/julia_client/docs/SlurmJobCorrelationModel.md Adds Julia doc page for correlation model.
julia_client/julia_client/docs/RunningJobsResponse.md Adds Julia doc page for running jobs response.
julia_client/julia_client/docs/RunningJobModel.md Adds Julia doc page for running job model.
julia_client/julia_client/docs/ResultModel.md Documents job_name on ResultModel for Julia client.
api/openapi.yaml Adds the new endpoints + schemas and includes job_name on ResultModel.
api/openapi.codegen.yaml Mirrors OpenAPI changes for codegen parity.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/server/live_router.rs
Comment on lines +3799 to +3807
) -> Response<Body> {
match state
.server
.get_slurm_job_correlations(id, query.offset, query.limit, &context)
.await
{
Ok(response) => get_slurm_job_correlations_response(response),
Err(err) => error_response(StatusCode::INTERNAL_SERVER_ERROR, err.0),
}
Comment thread src/server/live_router.rs
Comment on lines +3836 to +3844
) -> Response<Body> {
match state
.server
.get_running_jobs(id, query.offset, query.limit, &context)
.await
{
Ok(response) => get_running_jobs_response(response),
Err(err) => error_response(StatusCode::INTERNAL_SERVER_ERROR, err.0),
}
Comment thread src/server/api/workflows.rs Outdated
Comment thread src/server/api/workflows.rs Outdated
Comment thread src/client/commands/jobs.rs
…or absent scheduler ID

- get_running_jobs / get_slurm_job_correlations: remove the explicit
  `SELECT 1 FROM workflow` existence checks. The transport's authorize_workflow!
  already returns 404 for a missing workflow (check_workflow_access always
  verifies existence, even when access control is disabled), so these were a
  redundant DB round-trip. Not-found behavior and tests are unchanged.
- `torc jobs running`: render an absent scheduler_job_id as "-" instead of an
  empty string, matching the compute-nodes listing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@daniel-thom
Copy link
Copy Markdown
Collaborator Author

Thanks @copilot — addressed in 2864f4d.

Fixed:

  • Redundant workflow existence checks (api/workflows.rs running-jobs + correlations) — removed. authorize_workflow! runs in the transport before the impl and AuthorizationService::check_workflow_access always verifies existence first (returning 404) regardless of whether access control is enforced, so the extra SELECT 1 was a duplicate round-trip. 404 behavior and the not-found tests are unchanged.
  • scheduler_job_id rendering (torc jobs running) — now renders absent IDs as -, matching the compute-nodes listing.

Not changing here:

  • Pagination errors → HTTP 500 vs 422 (the two live_router handlers). You're right that an oversized limit surfaces as 500, but this is the project-wide behavior for every paginated endpoint: they all go through the shared authorize_workflow_and_paginate!process_pagination_params (which returns an ApiError) and the live_router maps Err → 500. Returning 422 here would mean changing that shared error mapping for all list endpoints, which is a cross-cutting change out of scope for this PR. Happy to file a follow-up if you'd like it addressed consistently.

Copy link
Copy Markdown

Copilot AI commented Jun 5, 2026

@daniel-thom I've opened a new pull request, #367, to work on those changes. Once the pull request is ready, I'll request review from you.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 44 out of 44 changed files in this pull request and generated 1 comment.

Comment thread src/client/commands/jobs.rs
Use print_json_wrapped so `-f json` output matches the other list commands
(jobs list, results list, etc.) and works with `jq '.items[]'` uniformly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@daniel-thom
Copy link
Copy Markdown
Collaborator Author

Thanks @copilot — fixed in the latest commit. torc jobs running -f json now uses print_json_wrapped ({"items": [...]}), matching jobs list/results list and the documented list-command convention.

daniel-thom and others added 2 commits June 5, 2026 17:46
Add the job's start_time to RunningJobModel (populated from job.start_time on
the server) and an "Elapsed" column to the `torc jobs running` table, reusing
the same format_elapsed helper as `torc jobs list`. Makes long-running/stuck
jobs easy to spot. start_time is Option<String> + skip_serializing_if; clients
regenerated; test asserts it's populated for a started job.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two bugs in the --include-logs report path:
- The records array serialized under `results`; renamed to `items` (via serde)
  to match every other list command and the REST API's paginated responses.
- `--failed` was ignored: the include-logs branch called generate_results_report
  without the flag. Thread `failed` through generate_results_report /
  build_results_report and filter to non-zero return codes, mirroring the
  non-include-logs path.

Test asserts records serialize under `items` (not `results`).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 47 out of 47 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/server/api/results.rs:356

  • In list_results, bind_values is built up alongside where_conditions but is never used to bind the SQL query (bindings are re-applied manually later). This extra collection adds cognitive overhead and small but unnecessary allocations; either remove bind_values entirely or refactor to bind from it in one place.

Comment thread src/server/http_server/workflows_transport.rs
Comment thread src/client/report_models.rs
SQLite treats a negative LIMIT (e.g. LIMIT -1) as "no limit", so a request with
`?limit=-1` bypassed MAX_RECORD_TRANSFER_COUNT and could fetch unbounded
results on any paginated endpoint. Reject negative `limit` (and `offset`) in
the shared helper so the cap can't be circumvented.

Adds unit tests for the negative/oversized/default cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@daniel-thom
Copy link
Copy Markdown
Collaborator Author

Fixed the negative-limit issue in 91ee045: process_pagination_params now rejects negative limit (and offset), so LIMIT -1 can no longer bypass MAX_RECORD_TRANSFER_COUNT on any paginated endpoint. Added unit tests for the negative/oversized/default cases.

@daniel-thom daniel-thom merged commit 8e1930b into main Jun 6, 2026
10 of 11 checks passed
@daniel-thom daniel-thom deleted the consolidated-tier2 branch June 6, 2026 02:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants