Bulk-create endpoint for ro_crate_entity to scale identifier-bearing file pre-creation #326

@daniel-thom

Problem

create_files (client-side, src/client/workflow_spec.rs) loops over every FileSpec that carries a user-supplied identifier and makes one POST per file via create_input_file_entity_with_identifier, which calls apis::ro_crate_entities_api::create_ro_crate_entity.

For a workflow with N identifier-bearing input files this is N sequential round-trips with no batching:

// src/client/workflow_spec.rs (excerpt)
for (file_spec, file_model) in files.iter().zip(file_models.iter()) {
    let Some(identifier) = file_spec.identifier.as_deref() else { continue; };
    ...
    crate::client::ro_crate_utils::create_input_file_entity_with_identifier(
        config, workflow_id, &file_with_id, identifier,
    )?;
}

The single-file framing in the docs (one DOI on one reference genome) is fine, but parameterized identifier templates that expand to hundreds of files are an obvious use case, and at one round-trip per file they would be noticeably slow. A failure partway through the loop also leaves orphan rows, which are only cleaned up via the workflow's ON DELETE CASCADE if and when the higher-level rollback path fires.

Why now

This shipped behind feat/ro-crate-user-supplied-file-identifiers. Today it works correctly for the docs' framing example, but the same code path will be the choke point as soon as anyone writes:

files:
  - name: input_{i}
    path: data/input_{i}.csv
    identifier: "urn:dataset:run-2026:{i}"
    parameters:
      i: "1:1000"

Better to add the bulk endpoint before users hit it than after.

Proposed solution

Mirror the existing Files API pattern.

Server (src/server/api/ro_crate.rs):

  • New trait method create_ro_crate_entities(body: RoCrateEntitiesModel, context) returning the inserted rows.
  • Implementation: single transaction, one parameterized INSERT per row (or INSERT ... VALUES (...), (...), ... if it fits within SQLite's parameter limit; otherwise chunked).
  • Same authorization as create_ro_crate_entity (workflow_id-scoped via authorize_workflow!).
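The chunking fallback mentioned above could look roughly like the following. This is a minimal sketch, not the actual server code: the EntityRow struct, its three columns, and the build_chunked_inserts helper are hypothetical, and the parameter limit is passed in by the caller (SQLite's default SQLITE_MAX_VARIABLE_NUMBER is 999 before 3.32.0 and 32766 from 3.32.0 on). All generated statements would still run inside the one transaction.

```rust
/// Hypothetical row shape; the real RoCrateEntityModel likely has more fields.
#[derive(Debug, Clone)]
pub struct EntityRow {
    pub workflow_id: i64,
    pub entity_id: String,
    pub metadata: String,
}

/// Number of bound parameters per row (one per hypothetical column above).
const PARAMS_PER_ROW: usize = 3;

/// Builds one multi-row INSERT per chunk so that no single statement binds
/// more than `max_params` parameters. Each returned pair is the SQL text and
/// the slice of rows whose fields should be bound to it, in order.
pub fn build_chunked_inserts(
    rows: &[EntityRow],
    max_params: usize,
) -> Vec<(String, &[EntityRow])> {
    // Always allow at least one row per chunk; `chunks(0)` would panic.
    let rows_per_chunk = (max_params / PARAMS_PER_ROW).max(1);
    rows.chunks(rows_per_chunk)
        .map(|chunk| {
            let placeholders = vec!["(?, ?, ?)"; chunk.len()].join(", ");
            // Table and column names are assumptions for illustration.
            let sql = format!(
                "INSERT INTO ro_crate_entity (workflow_id, entity_id, metadata) VALUES {placeholders}"
            );
            (sql, chunk)
        })
        .collect()
}
```

With a limit of 999 and three parameters per row, the 1000-file example from "Why now" would split into three statements of 333 rows and one statement with the remaining row.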

OpenAPI / generated client:

  • Refresh via cd api && bash sync_openapi.sh all --promote and regenerate clients per CLAUDE.md's add-endpoint checklist.

Client (src/client/workflow_spec.rs + src/client/ro_crate_utils.rs):

  • Replace the per-file loop in create_files with a single bulk call. Collect the identifier-bearing (file_with_id, identifier) pairs into a Vec<RoCrateEntityModel> and POST once.
  • Keep create_input_file_entity_with_identifier for callers that legitimately add one row at a time, or remove it if the loop was its only caller.
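The client-side collection step might be sketched as follows. The struct definitions are simplified, hypothetical stand-ins for the real FileSpec, file model, and RoCrateEntityModel types, and the assumption that the user-supplied identifier becomes the entity_id is mine, not confirmed by the code excerpt above.

```rust
/// Hypothetical, simplified stand-in for the real FileSpec.
pub struct FileSpec {
    pub name: String,
    pub identifier: Option<String>,
}

/// Hypothetical stand-in for the server-assigned file model.
pub struct FileModel {
    pub id: i64,
    pub name: String,
}

/// Hypothetical, simplified stand-in for RoCrateEntityModel.
#[derive(Debug, PartialEq)]
pub struct RoCrateEntityModel {
    pub workflow_id: i64,
    pub file_id: i64,
    pub entity_id: String,
}

/// Collects one entity per identifier-bearing file, skipping files without
/// an identifier, so the caller can send the whole Vec in a single request.
pub fn collect_identifier_entities(
    workflow_id: i64,
    files: &[FileSpec],
    file_models: &[FileModel],
) -> Vec<RoCrateEntityModel> {
    files
        .iter()
        .zip(file_models.iter())
        .filter_map(|(spec, model)| {
            // `?` inside the closure returns None for files with no identifier.
            let identifier = spec.identifier.as_deref()?;
            Some(RoCrateEntityModel {
                workflow_id,
                file_id: model.id,
                entity_id: identifier.to_string(),
            })
        })
        .collect()
}
```

If the resulting Vec is empty, the caller can presumably skip the POST entirely, which keeps workflows without identifiers at zero extra round-trips.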

Tests:

  • Server integration test that a bulk POST inserts all rows in one transaction and rolls back atomically on any per-row violation (e.g. duplicate (workflow_id, entity_id)).
  • Client test exercising a parameterized identifier template that expands to ≥10 files and verifies a single POST is made.
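One way the client test could verify the single POST is to put the bulk call behind a small trait and count invocations on a test double. Everything below is a hypothetical sketch: the EntityApi trait, CountingApi, and both helper functions are illustrative names, and the template expansion assumes an "{i}" placeholder with an inclusive range.

```rust
use std::cell::Cell;

/// Hypothetical minimal entity shape for the test.
#[derive(Debug, Clone, PartialEq)]
pub struct Entity {
    pub entity_id: String,
}

/// Hypothetical seam over the generated client so tests can count requests.
pub trait EntityApi {
    fn create_ro_crate_entities(
        &self,
        workflow_id: i64,
        entities: &[Entity],
    ) -> Result<usize, String>;
}

/// Test double that records how many bulk requests were issued.
pub struct CountingApi {
    pub posts: Cell<usize>,
}

impl EntityApi for CountingApi {
    fn create_ro_crate_entities(
        &self,
        _workflow_id: i64,
        entities: &[Entity],
    ) -> Result<usize, String> {
        self.posts.set(self.posts.get() + 1);
        Ok(entities.len())
    }
}

/// Expands an identifier template over an inclusive range (assumed semantics).
pub fn expand_identifiers(template: &str, start: u32, end: u32) -> Vec<Entity> {
    (start..=end)
        .map(|i| Entity {
            entity_id: template.replace("{i}", &i.to_string()),
        })
        .collect()
}

/// Sends all entities in one bulk call; sends nothing when the list is empty.
pub fn create_identifier_entities(
    api: &dyn EntityApi,
    workflow_id: i64,
    entities: &[Entity],
) -> Result<usize, String> {
    if entities.is_empty() {
        return Ok(0);
    }
    api.create_ro_crate_entities(workflow_id, entities)
}
```

A test would then expand a template to ten entities, call create_identifier_entities once, and assert that the counter on CountingApi reads 1.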

Out of scope (separate follow-ups)

  • A general bulk-update / bulk-upsert endpoint for ro_crate entities (the init-time entity rebuild in create_ro_crate_entity_for_file has its own N+1 pattern of find → update).
  • Server-side batching of the init-time create_entities_for_input_files path — uses single creates today but isn't on the latency-sensitive workflow-creation hot path.

References

  • Surfaced during /review-api on PR for feat/ro-crate-user-supplied-file-identifiers.
  • Precedent: Files API already exposes create_files (bulk) alongside create_file (single) — src/server/api/files.rs:33.
