Creating a new project (similar to cardiac_cta)

This guide shows how to set up an FDA 510(k) workflow like cardiac_cta for a different product area (e.g. another product code, device type, or indication). The cardiac_cta project is the reference implementation; its PROJECT_SUMMARY.md documents every step in detail.

Pipeline overview (order of operations)

| Step | What | Script / command |
|------|------|------------------|
| 1 | Create project dir and pull initial 510(k) data | `uv run prediscope '<query>' --data-dir projects/<my_project>` |
| 2 | (Optional) Deduplicate index | Edit or script: dedupe `index.jsonl` by `k_number` |
| 3 | Download one summary per applicant (sampling) | `uv run python scripts/download_one_per_applicant.py --data-dir projects/<my_project>` |
| 4 | Find devices matching your topic from raw text | Adapt `find_cardiac_from_raw.py` → keywords + output file names (see below) |
| 5 | Download all summaries for those applicants | `uv run python scripts/download_by_k_list.py --data-dir projects/<my_project> --k-list projects/<my_project>/k_numbers_all_<topic>_applicants.txt` |
| 6 | Build knowledge base (JSONL) | Adapt `build_cardiac_knowledge_base.py` to read your topic’s K-list (see below) |
| 7 | (Optional) HTML summary | Adapt `generate_cardiac_summary_html.py` (titles, paths) or skip |
| 8 | (Optional) AI/ML deep dive | `uv run python scripts/extract_ai_ml_deep.py --data-dir projects/<my_project>` |
| 9 | Build RAG chunks for retrieval / MCP | `uv run python scripts/build_rag_chunks.py --data-dir projects/<my_project>` |
| 10 | Use retrieval in Cursor (MCP) or CLI | See MCP_SETUP.md; set `PREDISCOPE_DATA_DIR=projects/<my_project>` |

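Steps 1, 2, 3, 5, 8, and 9 are generic: the scripts only need --data-dir (plus --k-list for step 5), and step 2 is a manual edit or a small script. Steps 4, 6, and 7 are cardiac-specific and must be adapted (or replaced) for your topic. The numbered sections below cover the required steps in order; their numbers do not match the step numbers in the table.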

What to adapt for your project

1. Initial pull (generic)

Use your own openFDA query and project path:

mkdir -p projects/my_project
uv run prediscope 'product_code:XXX' --data-dir projects/my_project
# Add more queries if needed. If the API does not support OR, run each query separately, then merge and dedupe index.jsonl.
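
The merge/dedupe step mentioned above has no dedicated script in the pipeline table. The following is a minimal sketch of one way to do it, assuming `index.jsonl` holds one JSON object per line with a `k_number` field (the table names that field; everything else here, including the function name `dedupe_index`, is hypothetical). It keeps the first record seen for each K-number.

```python
import json
from pathlib import Path


def dedupe_index(index_path: Path) -> int:
    """Rewrite a JSONL index in place, keeping the first record per k_number.

    Returns the number of duplicate records dropped. Records without a
    k_number are kept untouched (an assumption; adjust if that is wrong
    for your data).
    """
    seen = set()
    kept = []
    dropped = 0
    for line in index_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines left over from merging files
        record = json.loads(line)
        k_number = record.get("k_number")
        if k_number is not None and k_number in seen:
            dropped += 1
            continue
        if k_number is not None:
            seen.add(k_number)
        kept.append(json.dumps(record))
    index_path.write_text("".join(row + "\n" for row in kept), encoding="utf-8")
    return dropped
```

To merge the results of several queries, concatenating the individual `index.jsonl` files and then calling `dedupe_index(Path("projects/my_project/index.jsonl"))` should be enough under these assumptions.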

2. One summary per applicant (generic)

uv run python scripts/download_one_per_applicant.py --data-dir projects/my_project

3. Find devices by topic (adapt)

Script: scripts/find_cardiac_from_raw.py
Why it’s specific: Hardcoded keywords (e.g. cardiac, coronary, CTA) and output file names (cardiac_applicants.txt, cardiac_k_numbers.txt, k_numbers_all_cardiac_applicants.txt, cardiac_report.md).

Options:

Copy and adapt: Copy the script to e.g. find_<topic>_from_raw.py, change the keyword list and all output file names (and the K-list filename used in the next step).
Generic script (future): A script that reads a config file (e.g. keywords.txt and out_prefix) would work for any project; for now, the cardiac script is the template.
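
As a rough illustration of the config-driven option, the sketch below reads keywords from a file and returns the K-numbers whose raw text matches any of them. It is not the logic of find_cardiac_from_raw.py; it assumes, hypothetically, that raw text lives in one `<K-number>.txt` file per device, and the function names are made up for this example.

```python
import re
from pathlib import Path


def load_keywords(path: Path) -> list[str]:
    """Read one keyword per line, skipping blank lines and '#' comments."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.strip().startswith("#")]


def matches_topic(text: str, keywords: list[str]) -> bool:
    """Case-insensitive whole-word match against any keyword."""
    return any(
        re.search(rf"\b{re.escape(kw)}\b", text, re.IGNORECASE) for kw in keywords
    )


def find_matching_k_numbers(raw_dir: Path, keywords: list[str]) -> list[str]:
    """Return the file stem (assumed to be the K-number) of each matching file."""
    return sorted(
        p.stem
        for p in raw_dir.glob("*.txt")  # hypothetical layout: raw/<K-number>.txt
        if matches_topic(p.read_text(encoding="utf-8", errors="ignore"), keywords)
    )
```

Whole-word matching is a deliberate choice here: a short keyword such as CTA would otherwise also match inside unrelated words like "dictation".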

Downstream scripts expect a K-number list file (one K per line) for “devices matching my topic” and optionally “all K-numbers from those applicants” for the full download. Name them consistently (e.g. k_numbers_<topic>.txt and k_numbers_all_<topic>_applicants.txt) so the next step can find them.
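
The two list files described above could be produced along these lines. This is a sketch under assumptions: it supposes each `index.jsonl` record carries an `applicant` field alongside `k_number` (the `applicant` field name is a guess, not confirmed by the reference project), and the helper names are hypothetical.

```python
import json
from pathlib import Path


def expand_to_applicants(index_path: Path, topic_k_numbers: set[str]) -> list[str]:
    """Return every K-number filed by an applicant that has a topic match."""
    records = [
        json.loads(line)
        for line in index_path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    # Applicants that own at least one topic-matching device.
    applicants = {
        r["applicant"]
        for r in records
        if r.get("k_number") in topic_k_numbers and r.get("applicant")
    }
    # All K-numbers belonging to those applicants, topic-matching or not.
    return sorted(
        {r["k_number"] for r in records if r.get("applicant") in applicants and r.get("k_number")}
    )


def write_k_list(path: Path, k_numbers: list[str]) -> None:
    """Write one K-number per line, the format downstream scripts expect."""
    path.write_text("".join(k + "\n" for k in k_numbers), encoding="utf-8")
```

With a topic called `mytopic`, the first list would go to `k_numbers_mytopic.txt` and the expanded list to `k_numbers_all_mytopic_applicants.txt`, matching the naming suggested above.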

4. Download by K-list (generic)

uv run python scripts/download_by_k_list.py --data-dir projects/my_project \
  --k-list projects/my_project/k_numbers_all_<topic>_applicants.txt

5. Build knowledge base (adapt)

Script: scripts/build_cardiac_knowledge_base.py
Why it’s specific: It looks for cardiac_k_numbers.txt and writes device records into knowledge_base.jsonl. The logic (read index + K-list + raw, enrich with year_cleared, product_code, has_ai_ml) is reusable.

Options:

Copy and adapt: Copy to e.g. build_<topic>_knowledge_base.py and change the input file from cardiac_k_numbers.txt to your topic’s K-list (e.g. k_numbers_<topic>.txt). Keep the output as knowledge_base.jsonl so downstream scripts (extract_ai_ml_deep, RAG) still work.
Symlink: In your project dir, ln -s k_numbers_<topic>.txt cardiac_k_numbers.txt and run the existing script (quick hack; file names in reports will still say “cardiac”).
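
For orientation when adapting the script, the reusable logic (index + K-list + raw text, enriched with year_cleared, product_code, has_ai_ml) might look roughly like the sketch below. This is not the code of build_cardiac_knowledge_base.py. The input field `decision_date`, the AI/ML keyword pattern, and the function name are all assumptions made for illustration; only the three output field names come from the description above.

```python
import re

# Hypothetical heuristic for has_ai_ml; the real script may use different terms.
AI_ML_PATTERN = re.compile(
    r"\b(artificial intelligence|machine learning|deep learning|neural networks?)\b",
    re.IGNORECASE,
)


def build_records(index_records, topic_k_numbers, raw_texts):
    """Build one knowledge-base record per topic device.

    index_records: iterable of dicts parsed from index.jsonl
    topic_k_numbers: set of K-numbers read from the topic K-list
    raw_texts: dict mapping K-number -> extracted summary text
    """
    records = []
    for rec in index_records:
        k_number = rec.get("k_number")
        if k_number not in topic_k_numbers:
            continue
        date = rec.get("decision_date") or ""  # assumed format: YYYY-MM-DD
        records.append(
            {
                "k_number": k_number,
                "year_cleared": int(date[:4]) if date[:4].isdigit() else None,
                "product_code": rec.get("product_code"),
                "has_ai_ml": bool(AI_ML_PATTERN.search(raw_texts.get(k_number, ""))),
            }
        )
    return records
```

Writing each returned dict as one JSON line to knowledge_base.jsonl would keep the output compatible with the downstream scripts mentioned above, assuming they rely only on that file name and JSONL layout.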

6. RAG chunks and MCP (generic)

uv run python scripts/build_rag_chunks.py --data-dir projects/my_project

Then use the MCP server with your project’s chunks:

In Cursor MCP config, set PREDISCOPE_DATA_DIR to projects/my_project (see MCP_SETUP.md).
Or run retrieval from the CLI:
uv run python scripts/retrieve_chunks.py "your question" --data-dir projects/my_project --top-k 10

Reference docs

Full workflow and script reference: projects/cardiac_cta/PROJECT_SUMMARY.md — narrative and tables for every script.
MCP and retrieval in Cursor: MCP_SETUP.md.
FDA AI agent (RAG, embeddings, next steps): FDA_AI_AGENT.md.

Project layout (what ends up in the repo)

The repo’s .gitignore excludes project data (raw/, summaries/, *.jsonl, and *.txt in project dirs) and keeps docs (README.md, PROJECT_SUMMARY.md, other *.md, *.html) tracked, so you can commit your project’s README and summary without committing PDFs or large JSONL files. See the repo root .gitignore for the exact patterns.
