Creating a new project (similar to cardiac_cta)

This guide helps you set up a similar FDA 510(k) workflow for a different product area (e.g. another product code, device type, or indication). The cardiac_cta project is the reference implementation; its PROJECT_SUMMARY.md documents every step in detail.


Pipeline overview (order of operations)

| Step | What | Script / command |
|------|------|------------------|
| 1 | Create project dir and pull initial 510(k) data | uv run prediscope '<query>' --data-dir projects/<my_project> |
| 2 | (Optional) Deduplicate index | Edit or script: dedupe index.jsonl by k_number |
| 3 | Download one summary per applicant (sampling) | uv run python scripts/download_one_per_applicant.py --data-dir projects/<my_project> |
| 4 | Find devices matching your topic from raw text | Adapt find_cardiac_from_raw.py → keywords + output file names (see below) |
| 5 | Download all summaries for those applicants | uv run python scripts/download_by_k_list.py --data-dir projects/<my_project> --k-list projects/<my_project>/k_numbers_all_<topic>_applicants.txt |
| 6 | Build knowledge base (JSONL) | Adapt build_cardiac_knowledge_base.py to read your topic’s K-list (see below) |
| 7 | (Optional) HTML summary | Adapt generate_cardiac_summary_html.py (titles, paths) or skip |
| 8 | (Optional) AI/ML deep dive | uv run python scripts/extract_ai_ml_deep.py --data-dir projects/<my_project> |
| 9 | Build RAG chunks for retrieval / MCP | uv run python scripts/build_rag_chunks.py --data-dir projects/<my_project> |
| 10 | Use retrieval in Cursor (MCP) or CLI | See MCP_SETUP.md; set PREDISCOPE_DATA_DIR=projects/<my_project> |

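Steps 1, 3, 5, 8, and 9 use generic scripts that need only --data-dir (plus --k-list for step 5). Step 2 is a manual or scripted edit of index.jsonl, and step 10 is configuration or the generic retrieval CLI. Steps 4, 6, and 7 are cardiac-specific and must be adapted (or replaced) for your topic.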


What to adapt for your project

The subsections below are numbered independently of the pipeline steps above: subsections 1 to 6 correspond to pipeline steps 1, 3, 4, 5, 6, and 9 to 10.

1. Initial pull (generic)

Use your own openFDA query and project path:

mkdir -p projects/my_project
uv run prediscope 'product_code:XXX' --data-dir projects/my_project
# Run additional queries if needed. If the API does not support OR, run each query separately, then merge and dedupe index.jsonl by k_number.
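
If several queries leave duplicate entries in the index, the merge/dedupe step could be scripted roughly as follows. This is a minimal sketch, not a script from the repo: it assumes index.jsonl holds one JSON object per line with a k_number field, and dedupe_index is a hypothetical helper name.

```python
import json
from pathlib import Path


def dedupe_index(index_path: Path) -> int:
    """Rewrite index_path in place, keeping the first record per k_number.

    Returns the number of duplicate records dropped.
    Assumption: each non-blank line is a JSON object with a "k_number" key.
    Records without a k_number are always kept.
    """
    seen = set()
    kept = []
    dropped = 0
    with index_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            k = json.loads(line).get("k_number")
            if k is not None and k in seen:
                dropped += 1  # later duplicate of an already-kept record
                continue
            seen.add(k)
            kept.append(line)
    index_path.write_text("".join(row + "\n" for row in kept), encoding="utf-8")
    return dropped
```

A call such as dedupe_index(Path("projects/my_project/index.jsonl")) would then clean the index, assuming the index file sits directly in the project directory.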

2. One summary per applicant (generic)

uv run python scripts/download_one_per_applicant.py --data-dir projects/my_project

3. Find devices by topic (adapt)

Script: scripts/find_cardiac_from_raw.py
Why it’s specific: Hardcoded keywords (e.g. cardiac, coronary, CTA) and output file names (cardiac_applicants.txt, cardiac_k_numbers.txt, k_numbers_all_cardiac_applicants.txt, cardiac_report.md).

Options:

  • Copy and adapt: Copy the script to e.g. find_<topic>_from_raw.py, change the keyword list and all output file names (and the K-list filename used in the next step).
  • Generic script (future): A script that reads a config file (e.g. keywords.txt and out_prefix) would work for any project; for now, the cardiac script is the template.

Downstream scripts expect a K-number list file (one K-number per line) for the devices matching your topic and, optionally, a second list of all K-numbers from those applicants for the full download. Name them consistently (e.g. k_numbers_<topic>.txt and k_numbers_all_<topic>_applicants.txt) so the following steps can find them.
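
The keyword matching and K-list output described above might look like the sketch below. It is not the logic of find_cardiac_from_raw.py, which this guide does not show. The function names and the in-memory texts mapping (K-number to raw summary text) are hypothetical, and how raw text is loaded from the project directory is left out because the layout is not specified here.

```python
import re
from pathlib import Path


def find_matching_k_numbers(texts: dict[str, str], keywords: list[str]) -> list[str]:
    """Return sorted K-numbers whose text contains any keyword.

    Matching is case-insensitive and on whole words, so "CTA" does not
    match inside "lactate". `texts` maps K-number -> raw text (assumed shape).
    """
    if not keywords:
        return []  # an empty keyword list matches nothing
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return sorted(k for k, text in texts.items() if pattern.search(text))


def write_k_list(k_numbers: list[str], out_path: Path) -> None:
    """Write one K-number per line, the K-list format described above."""
    out_path.write_text("".join(f"{k}\n" for k in k_numbers), encoding="utf-8")
```

Whole-word matching is a design choice for short acronyms; plain substring matching would be simpler but tends to produce false positives for keywords such as CTA.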

4. Download by K-list (generic)

uv run python scripts/download_by_k_list.py --data-dir projects/my_project \
  --k-list projects/my_project/k_numbers_all_<topic>_applicants.txt

5. Build knowledge base (adapt)

Script: scripts/build_cardiac_knowledge_base.py
Why it’s specific: It looks for cardiac_k_numbers.txt and writes device records into knowledge_base.jsonl. The logic (read index + K-list + raw, enrich with year_cleared, product_code, has_ai_ml) is reusable.

Options:

  • Copy and adapt: Copy to e.g. build_<topic>_knowledge_base.py and change the input file from cardiac_k_numbers.txt to your topic’s K-list (e.g. k_numbers_<topic>.txt). Keep the output as knowledge_base.jsonl so downstream scripts (extract_ai_ml_deep, RAG) still work.
  • Symlink: In your project dir, ln -s k_numbers_<topic>.txt cardiac_k_numbers.txt and run the existing script (quick hack; file names in reports will still say “cardiac”).
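
The reusable logic (read index + K-list + raw, enrich with year_cleared, product_code, has_ai_ml) can be pictured with the sketch below. It is an illustration under assumptions, not the code of build_cardiac_knowledge_base.py: the index field names (k_number, applicant, product_code, decision_date), the AI/ML keyword pattern, and the function names are all hypothetical.

```python
import json
import re
from pathlib import Path

# Hypothetical rule for the has_ai_ml flag; the real script's rule may differ.
AI_ML_PATTERN = re.compile(
    r"\b(?:machine learning|deep learning|neural network|artificial intelligence)\b",
    re.IGNORECASE,
)


def build_records(index_records, k_numbers, raw_texts):
    """Yield one enriched record per index entry whose k_number is in the K-list.

    index_records: iterable of dicts parsed from index.jsonl (assumed fields).
    k_numbers:     K-numbers read from the topic's K-list file.
    raw_texts:     dict mapping K-number -> raw summary text (may be incomplete).
    """
    wanted = set(k_numbers)
    for rec in index_records:
        k = rec.get("k_number")
        if k not in wanted:
            continue
        year = (rec.get("decision_date") or "")[:4]  # assumed "YYYY-MM-DD"
        yield {
            "k_number": k,
            "applicant": rec.get("applicant"),
            "product_code": rec.get("product_code"),
            "year_cleared": int(year) if year.isdigit() else None,
            "has_ai_ml": bool(AI_ML_PATTERN.search(raw_texts.get(k, ""))),
        }


def write_jsonl(records, out_path: Path) -> None:
    """Write records as JSON Lines, e.g. to knowledge_base.jsonl."""
    with out_path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```

Because only the K-list input is topic-specific, a copy of the script adapted this way can keep writing knowledge_base.jsonl unchanged.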

6. RAG chunks and MCP (generic)

uv run python scripts/build_rag_chunks.py --data-dir projects/my_project

Then use the MCP server with your project’s chunks:

  • In Cursor MCP config, set PREDISCOPE_DATA_DIR to projects/my_project (see MCP_SETUP.md).
  • Or run retrieval from the CLI:
    uv run python scripts/retrieve_chunks.py "your question" --data-dir projects/my_project --top-k 10

Reference docs


Project layout (what ends up in the repo)

The repo's .gitignore excludes project data (raw/, summaries/, *.jsonl, *.txt in project dirs), while docs (README.md, PROJECT_SUMMARY.md, *.md, *.html) are tracked. You can therefore commit your project’s README and summary without committing PDFs or large JSONL files. See the repo root .gitignore for the exact patterns.