Add blueprint artifact export#379

Merged
txmed82 merged 1 commit into master from codex/blueprint-export
Jun 4, 2026

@txmed82 txmed82 commented Jun 4, 2026

Owner

Summary

  • Add JSONL export helpers for persisted ClinicalBlueprint artifacts with optional CohortPlan context
  • Add casecrawler export-blueprints with dataset and cohort-plan filters
  • Add helper and CLI tests for blueprint artifact exports

Tests

  • .venv/bin/python -m pytest tests/test_blueprint_export.py tests/test_cli_synthetic.py::test_export_blueprints_command_writes_jsonl -q
  • .venv/bin/python -m ruff check src/casecrawler/export/blueprints.py src/casecrawler/cli.py tests/test_blueprint_export.py tests/test_cli_synthetic.py
  • .venv/bin/python -m pytest tests/test_blueprint_export.py tests/test_cli_synthetic.py tests/test_fine_tuning_export.py -q
  • .venv/bin/python -m pytest -q -m "not optional_backend and not network and not slow"

Summary by CodeRabbit

  • New Features

    • Added export-blueprints CLI command to export clinical blueprint artifacts to JSONL format, with optional filtering by dataset and cohort plan.
    • Validates blueprint existence before export and reports total count of exported artifacts.
  • Tests

    • Added comprehensive tests for blueprint export functionality and CLI command integration.

coderabbitai Bot commented Jun 4, 2026

Walkthrough

This PR adds a new export-blueprints CLI command that exports persisted clinical blueprint artifacts to JSONL format. The implementation includes core serialization functions, a CLI entry point with DatasetStore integration, and comprehensive unit and integration tests validating the export pipeline.

Changes

Blueprint Export Feature

| Layer / File(s) | Summary |
| --- | --- |
| Blueprint export serialization and unit tests (src/casecrawler/export/blueprints.py, tests/test_blueprint_export.py) | export_blueprint_payload constructs JSON payloads from ClinicalBlueprint with optional CohortPlan context, and export_blueprints_jsonl writes blueprints to NDJSON files with optional plan lookup. Unit tests validate payload inclusion of artifact_type, blueprint ID, and cohort plan fields, plus JSONL line count and JSON parsing. |
| CLI export-blueprints command and integration test (src/casecrawler/cli.py, tests/test_cli_synthetic.py) | export-blueprints command loads blueprints from DatasetStore with optional --dataset-id and --cohort-plan-id filters, validates blueprints exist, calls export_blueprints_jsonl, and prints exported count. Integration test seeds the store with a plan and blueprint, runs the command, and asserts JSONL output structure and payload fields. |
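
The export flow summarized above can be sketched with the standard library alone. This is a minimal illustration, not the merged implementation: plain dicts stand in for ClinicalBlueprint and CohortPlan, the helper names `build_payload` and `write_jsonl` are hypothetical, and the `cohort_plan_id` key on a blueprint is an assumption. Only the `artifact_type` value, the `blueprint` and `cohort_plan` payload keys, the `plan_lookup` argument, and the returned count come from the PR.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Iterable, Optional

PlanLookup = Callable[[str], Optional[dict]]


def build_payload(blueprint: dict, plan: Optional[dict] = None) -> dict:
    """Wrap one blueprint (and its plan, if any) in an export envelope."""
    payload: dict[str, Any] = {
        "artifact_type": "casecrawler_clinical_blueprint",
        "blueprint": blueprint,
    }
    if plan is not None:
        payload["cohort_plan"] = plan
    return payload


def write_jsonl(
    blueprints: Iterable[dict],
    output: Path,
    plan_lookup: Optional[PlanLookup] = None,
) -> int:
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    with Path(output).open("w", encoding="utf-8") as handle:
        for blueprint in blueprints:
            # Assumption: a blueprint carries the id of its cohort plan.
            plan_id = blueprint.get("cohort_plan_id")
            plan = plan_lookup(plan_id) if plan_lookup and plan_id else None
            handle.write(json.dumps(build_payload(blueprint, plan)) + "\n")
            count += 1
    return count


# Hypothetical sample data: one blueprint with a plan, one without.
plans = {"plan-1": {"plan_id": "plan-1"}}
blueprints = [
    {"blueprint_id": "bp-1", "cohort_plan_id": "plan-1"},
    {"blueprint_id": "bp-2", "cohort_plan_id": None},
]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "blueprints.jsonl"
    written = write_jsonl(blueprints, out, plan_lookup=plans.get)
    lines = [json.loads(line) for line in out.read_text(encoding="utf-8").splitlines()]
```

Each line parses independently, so a consumer can stream the file without loading it whole; the second record simply omits `cohort_plan` because no plan id is set.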

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • txmed82/case-crawler#371: The PR's serialization logic directly depends on the ClinicalBlueprint and CohortPlan data structures introduced in this PR.
  • txmed82/case-crawler#372: The export-blueprints CLI command depends on new DatasetStore persistence and query methods (e.g., get_cohort_plan, blueprint listing) introduced in this PR.

Poem

🐰 Blueprint bundles bundled neat,
JSONL lines, a data treat,
From store to export, the rabbit's feat,
Cohort plans and blueprints sweet! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.27%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add blueprint artifact export' directly and clearly describes the main change: adding export functionality for blueprint artifacts with JSONL serialization and a new CLI command. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/casecrawler/export/blueprints.py (1)

16-21: ⚡ Quick win

Use model_dump(mode="json") for JSONL export payloads

export_blueprints_jsonl() writes payload via json.dumps(...), while ClinicalBlueprint/CohortPlan include several dict[str, Any] fields (patient, metadata, etc.). Dumping in JSON mode prevents JSONL serialization failures if those Any values (or future schema fields) contain non-JSON-native objects.

Proposed fix
 def export_blueprint_payload(
     blueprint: ClinicalBlueprint,
     *,
     plan: CohortPlan | None = None,
 ) -> dict[str, Any]:
     payload = {
         "artifact_type": "casecrawler_clinical_blueprint",
-        "blueprint": blueprint.model_dump(),
+        "blueprint": blueprint.model_dump(mode="json"),
     }
     if plan is not None:
-        payload["cohort_plan"] = plan.model_dump()
+        payload["cohort_plan"] = plan.model_dump(mode="json")
     return payload
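
The failure mode behind this nitpick can be reproduced without Pydantic. The sketch below is an illustration under stated assumptions: `to_json_safe` is a hypothetical stand-in that mimics only the part of a JSON-mode dump relevant here (turning dates into ISO strings inside nested containers), and the `patient` and `dob` keys are made up for the demo. It is not how Pydantic implements `model_dump(mode="json")`.

```python
import json
from datetime import date
from typing import Any


def to_json_safe(value: Any) -> Any:
    """Hypothetical helper: recursively convert dates to ISO-8601 strings."""
    if isinstance(value, dict):
        return {str(key): to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    if isinstance(value, date):  # also matches datetime, a subclass of date
        return value.isoformat()
    return value


# A Python-mode dump may keep rich objects such as date inside Any-typed fields.
raw = {"patient": {"dob": date(1980, 1, 2)}}

try:
    json.dumps(raw)
    raw_serializable = True
except TypeError:
    # json.dumps has no encoder for date objects.
    raw_serializable = False

# After conversion to JSON-safe primitives the same payload serializes cleanly.
safe_line = json.dumps(to_json_safe(raw))
```

Converting at the payload boundary keeps the writer simple: it can call `json.dumps` without a custom `default=` hook.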
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/casecrawler/export/blueprints.py` around lines 16 - 21, The payload uses
blueprint.model_dump() and plan.model_dump() which can emit non-JSON-native
types; in export_blueprint_payload() change these to
blueprint.model_dump(mode="json") and plan.model_dump(mode="json") so the
ClinicalBlueprint and CohortPlan nested dicts (e.g., patient, metadata) are
serialized into JSON-safe primitives before export_blueprints_jsonl() passes the
payload to json.dumps; update the payload assignment where payload =
{"artifact_type": "casecrawler_clinical_blueprint", "blueprint": ...} and the
conditional payload["cohort_plan"] = ... accordingly.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: fde7eaad-82a1-4cad-8181-4cbba855030e

📥 Commits

Reviewing files that changed from the base of the PR and between d16d9ef and 85cc423.

📒 Files selected for processing (4)
  • src/casecrawler/cli.py
  • src/casecrawler/export/blueprints.py
  • tests/test_blueprint_export.py
  • tests/test_cli_synthetic.py

Comment thread src/casecrawler/cli.py
Comment on lines +2076 to +2080
    blueprints = store.list_blueprints(
        dataset_id=dataset_id,
        cohort_plan_id=cohort_plan_id,
        limit=1_000_000,
    )


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid silently truncating exports at one million rows.

DatasetStore.list_blueprints() applies this limit directly in SQL, so this command will export only the first 1,000,000 matches and still report success. For an export path, that becomes silent data loss. Please page through results or add an iterator-based store API so large blueprint sets are exported completely.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/casecrawler/cli.py` around lines 2076 - 2080, The current call to
DatasetStore.list_blueprints in src/casecrawler/cli.py assigns a one-million
hard limit to blueprints, which silently truncates exports; change the export to
paginate instead of relying on a fixed large limit: either update
DatasetStore.list_blueprints to provide an iterator/generator (e.g., implement a
list_blueprints_iter method that yields pages) or loop calling list_blueprints
with explicit offset/limit until no more rows, and replace the single call that
sets blueprints = store.list_blueprints(...) with a paging loop that
collects/yields all rows for export. Ensure you reference and update the call
site where blueprints is used so the export consumes the paginated iterator or
accumulated full result set.
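
The paging loop suggested in this comment might look like the sketch below. It is a generic illustration only: `iter_all` and `fetch_page` are hypothetical names, and an `offset` parameter on DatasetStore.list_blueprints is an assumption, since the PR only shows `limit`. An in-memory list stands in for the store.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_all(
    fetch_page: Callable[[int, int], Sequence[T]],
    page_size: int = 1000,
) -> Iterator[T]:
    """Yield every row by requesting fixed-size pages until one comes back short."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            # A short page means the result set is exhausted.
            return
        offset += len(page)


# Stand-in for the store: 2,500 rows served through an offset/limit slice.
rows = list(range(2500))
calls: list[tuple[int, int]] = []


def fetch_page(offset: int, limit: int) -> list[int]:
    calls.append((offset, limit))
    return rows[offset:offset + limit]


exported = list(iter_all(fetch_page, page_size=1000))
```

Offset paging assumes a stable sort order in the underlying query; if rows can be inserted during an export, keyset paging on a unique column is the usual alternative.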

Comment thread src/casecrawler/cli.py
Comment on lines +2087 to +2092
    count = export_blueprints_jsonl(
        blueprints,
        output,
        plan_lookup=store.get_cohort_plan,
    )
    click.echo(f"Exported {count} blueprint artifact(s) to {output}")


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wrap export write failures in ClickException.

If the output path is unwritable, this command currently bubbles the raw OSError instead of returning a normal CLI error message.

Proposed fix
-    count = export_blueprints_jsonl(
-        blueprints,
-        output,
-        plan_lookup=store.get_cohort_plan,
-    )
+    try:
+        count = export_blueprints_jsonl(
+            blueprints,
+            output,
+            plan_lookup=store.get_cohort_plan,
+        )
+    except OSError as exc:
+        raise click.ClickException(
+            f"Failed to write blueprint export to {output}: {exc}"
+        ) from exc
     click.echo(f"Exported {count} blueprint artifact(s) to {output}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/casecrawler/cli.py` around lines 2087 - 2092, The call to
export_blueprints_jsonl in the CLI can raise raw OSError on unwritable output
paths; wrap the call in a try/except that catches OSError (and optionally
IOError) around the call to export_blueprints_jsonl(blueprints, output,
plan_lookup=store.get_cohort_plan) and re-raise as click.ClickException with a
clear message including the output path and the original error text (e.g.,
f"Failed to write export to {output}: {err}"), leaving successful behavior
(click.echo of count) unchanged.

@txmed82 txmed82 merged commit 7eb4d95 into master Jun 4, 2026
4 checks passed
@txmed82 txmed82 deleted the codex/blueprint-export branch June 4, 2026 01:13