DAK Extraction Scripts

Overview

This directory contains Python scripts for extracting and processing Digital Adaptation Kit (DAK) content from WHO SMART Guidelines. These scripts form a comprehensive extraction pipeline that transforms various input formats (Excel files, BPMN diagrams, CQL logic, SVG images, etc.) into FHIR-compatible resources and Implementation Guide content.

The extraction pipeline is designed to process L2 (DAK) content and generate the structured artifacts needed for L3 (FHIR Implementation Guide) content, facilitating the creation of computable clinical guidelines.

Quick Start

Prerequisites

Python 3.x
Required Python dependencies (see installation below)
A WHO SMART Guideline FHIR IG repository structure
Git or the GitHub CLI (gh) for cloning the repositories used in the examples below

Installation

Before running the extraction scripts, install the required Python dependencies:

# From the root of any SMART Guideline repository, with smart-base cloned alongside it
pip install -r ../smart-base/input/scripts/requirements.txt

Or, if running from within the smart-base repository itself:

pip install -r input/scripts/requirements.txt

Usage

To extract DAK content from any WHO SMART Guideline repository, run the main extraction script from the root of the target guideline repository:

# Example: Extract DAK content from smart-immunizations
gh repo clone WorldHealthOrganization/smart-base
gh repo clone WorldHealthOrganization/smart-immunizations
cd smart-immunizations

# Install dependencies
pip install -r ../smart-base/input/scripts/requirements.txt

# Run the extraction
python ../smart-base/input/scripts/extract_dak.py

The script will orchestrate the entire extraction pipeline, processing all available content types and generating FHIR resources in the appropriate directories of the current working directory (the target guideline repository).

Running Against Different Guidelines

The extract_dak.py script can be run against any WHO SMART Guideline repository:

1. Clone the smart-base repository (it contains the extraction scripts).
2. Clone or navigate to your target guideline repository.
3. Install the dependencies from smart-base.
4. Run the extraction script from the root of the target repository, pointing to the smart-base scripts.

# Example with different guideline repositories:
# For smart-malaria:
cd smart-malaria
python ../smart-base/input/scripts/extract_dak.py

# For smart-hiv:
cd smart-hiv
python ../smart-base/input/scripts/extract_dak.py

The extraction will process DAK content from the current directory and generate FHIR resources appropriate for that specific guideline.
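The per-repository commands above could also be driven from one small script. The following is a minimal sketch, assuming smart-base and the guideline checkouts sit side by side in one workspace directory; the helper names (`build_extraction_command`, `plan_runs`, `run_all`) are hypothetical and not part of the extraction scripts.

```python
import subprocess
import sys
from pathlib import Path


def build_extraction_command(smart_base: Path) -> list[str]:
    """Return the argv that runs extract_dak.py from a smart-base checkout."""
    script = smart_base / "input" / "scripts" / "extract_dak.py"
    return [sys.executable, str(script)]


def plan_runs(workspace: Path, guidelines: list[str]) -> list[tuple[Path, list[str]]]:
    """Pair each guideline checkout (used as working directory) with the command."""
    command = build_extraction_command(workspace / "smart-base")
    return [(workspace / name, command) for name in guidelines]


def run_all(workspace: Path, guidelines: list[str]) -> dict[str, int]:
    """Run the extraction in each checkout and return exit codes by guideline."""
    exit_codes = {}
    for cwd, command in plan_runs(workspace, guidelines):
        # cwd matters: the script processes and writes into the current directory.
        exit_codes[cwd.name] = subprocess.run(command, cwd=cwd).returncode
    return exit_codes
```

Calling `run_all(Path.cwd(), ["smart-malaria", "smart-hiv"])` from the workspace directory would then run both extractions in turn and report a non-zero exit code for any that failed.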

File Structure and Functionality

Detailed File Reference

File Name	Goal	Inputs	Outputs
`codesystem_manager.py`	Manages FHIR CodeSystem and ValueSet resources by registering, merging, and rendering codes and properties for DAKs.	Code system IDs, titles, codes, display names, definitions, designations, properties; uses `stringer` for escaping/hashing.	FHIR CodeSystem and ValueSet FSH representations stored in dictionaries or rendered for implementation guides.
`bpmn_extractor.py`	Extracts business process data from BPMN files and transforms them into FHIR FSH format using `bpmn2fhirfsh.xsl`.	BPMN files (`*.bpmn`) from `input/business-processes/`, `bpmn2fhirfsh.xsl`, `installer` object.	FHIR FSH resources (e.g., `SGRequirements`, `SGActor`) stored via `installer.add_resource`, logs transformation success/failure.
`dd_extractor.py`	Extracts data dictionary entries from Excel files, generating FHIR ValueSets linked to business processes, tasks, decision tables, and indicators.	Excel files (`*.xlsx`) from `input/dictionary/`, cover sheet with tab names/descriptions, `installer` object.	FHIR ValueSet FSH representations stored via `installer.add_resource`, logs extraction details.
`DHIExtractor.py`	Extracts digital health intervention (DHI) classifications and categories from text files, creating FHIR CodeSystems, ValueSets, and ConceptMaps.	Text files (`system_categories.txt`, `dhi_v1.txt`) from `input/data/`, `installer` object.	FHIR CodeSystem, ValueSet, ConceptMap FSH representations stored via `installer.add_resource`, logs extraction details.
`extractor.py`	Base class for extracting data from various sources (e.g., Excel, BPMN), providing utility functions for data frame processing and logging.	Input file paths, column mappings, sheet names, `installer` object; subclasses define specific inputs.	Processed data frames with normalized columns, logs, resources stored via `installer` (specific to subclasses).
`extract_dhi.py`	Orchestrates extraction of DHI data using `DHIExtractor`, coordinating with `installer` to process and store results.	Command-line arguments (optional, e.g., `--help`), text files via `DHIExtractor`.	Installed FHIR resources via `installer.install()`, logs success/failure, exits with status code.
`dt_extractor.py`	Extracts decision table logic from Excel and CQL files, generating FHIR ValueSets, PlanDefinitions, ActivityDefinitions, and DMN representations.	Excel files (`.xlsx`) from `input/decision-logic/`, CQL files (`.cql`) from `input/cql/`, `dmn2html.xslt`, `installer` object.	FHIR ValueSet, PlanDefinition, ActivityDefinition FSH, DMN XML, markdown pages stored via `installer.add_resource`/`add_page`, logs details.
`extract_dak.py`	Orchestrates extraction of DAK content by coordinating multiple extractors (data dictionary, BPMN, SVG, requirements, decision tables, personas).	Command-line arguments (optional, e.g., `--help`), files processed by extractors (`dd_extractor`, etc.).	Installed FHIR resources via `installer.install()`, logs success/failure, exits with status code.
`installer.py`	Manages installation of FHIR resources, pages, CQL files, and DMN tables, handling transformations (e.g., via `bpmn2fhirfsh.xsl`, `dmn2html.xslt`, `svg2svg.xsl`) and storage.	FHIR resources, CQL content, markdown pages, DMN XML, XSLT files, `sushi-config.yaml`, `multifile.xsd`, aliases.	Installed files in `input/fsh/`, `input/cql/`, `input/dmn/`, `input/pagecontent/`, logs installation success/failure.
`req_extractor.py`	Extracts functional and non-functional requirements from Excel files, generating FHIR Requirements and ActorDefinition resources.	Excel files (`*.xlsx`) from `input/system-requirements/`, functional/non-functional sheet column mappings, `installer` object.	FHIR Requirements, ActorDefinition FSH stored via `installer.add_resource`, CodeSystem/ValueSet for categories, logs extraction details.
`svg_extractor.py`	Extracts and transforms SVG files from business processes into FHIR-compatible formats using `svg2svg.xsl`.	SVG files (`*.svg`) from `input/business-processes/`, `svg2svg.xsl`, `installer` object.	Transformed SVG files stored in `input/images/`, logs transformation success/failure.
`stringer.py`	Provides utility functions for string manipulation, including escaping, hashing, and ID normalization for FHIR resource generation.	Strings for escaping (XML, markdown, code, rulesets), names for ID conversion, inputs for blank/dash checks.	Escaped strings, hashed IDs, normalized IDs, logs for long ID hashing or errors.
`multifile_processor.py`	Processes multifile XML to apply file changes to a Git repository, handling branching, committing, and pushing.	Multifile XML (`<path_to_multifile.xml>`) with file paths, content, diff formats, Git repository context.	Updated files in repository, Git commits/pushes, logs for parsing and Git operation success/failure.
`generate_valueset_schemas.py`	Generates JSON schemas from FHIR IG publisher expansions.json output, creating enum-based schemas for ValueSet codes.	FHIR expansions.json Bundle with ValueSet resources containing expanded codes.	JSON Schema files with enum constraints for each ValueSet, logs processing details.
`extractpr.py`	Extracts personas/actors content from PDF files containing SMART Guidelines documentation, focusing on Generic Personas and Related Personas tables.	PDF files (`*.pdf`) from `input/personas/`, `installer` object.	FHIR ActorDefinition FSH resources stored via `installer.add_resource`, CodeSystem for persona types, logs extraction details.
`includes/bpmn2fhirfsh.xsl`	Transforms BPMN XML into FHIR FSH, generating resources like Requirements, Actors, Questionnaires, and Decision Tables for business processes.	BPMN XML from `input/business-processes/*.bpmn`, processed via `installer.transform_xml`.	FHIR FSH resources (e.g., `SGRequirements`, `SGActor`) stored via `installer.add_resource`, with links to CodeSystems and StructureDefinitions.
`includes/dmn2html.xslt`	Transforms DMN XML into HTML for displaying decision tables in implementation guides, including decision IDs, rules, triggers, inputs, and outputs.	DMN XML from `installer.add_dmn_table`, processed via `installer.transform_xml`.	HTML files in `input/pagecontent/` (e.g., `<id>.xml`), with links to FHIR CodeSystems, logs transformation details.
`includes/svg2svg.xsl`	Transforms SVG files to ensure compatibility with FHIR implementation guides, likely preserving or modifying business process visualizations.	SVG XML content from `input/business-processes/*.svg`, processed via `installer.transform_xml`.	Transformed SVG files stored in `input/images/`, compatible with FHIR rendering.
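
The table notes that `stringer.py` normalizes IDs and hashes long ones. The actual implementation is not shown here, so the following is only an illustrative sketch of the general idea, based on the FHIR rule that an id matches `[A-Za-z0-9\-\.]{1,64}`. The function name `to_fhir_id`, the choice of SHA-1, and the 10-character digest length are all assumptions.

```python
import hashlib
import re

FHIR_ID_MAX_LENGTH = 64  # FHIR ids match [A-Za-z0-9\-\.]{1,64}


def to_fhir_id(name: str) -> str:
    """Turn a free-text name into a string that is valid as a FHIR id (hypothetical helper)."""
    # Replace every run of disallowed characters with a single dash.
    candidate = re.sub(r"[^A-Za-z0-9.-]+", "-", name).strip("-")
    if not candidate:
        raise ValueError(f"cannot derive an id from {name!r}")
    if len(candidate) <= FHIR_ID_MAX_LENGTH:
        return candidate
    # Too long: keep a readable prefix and append a short digest to keep ids distinct.
    digest = hashlib.sha1(candidate.encode("utf-8")).hexdigest()[:10]
    prefix = candidate[: FHIR_ID_MAX_LENGTH - len(digest) - 1].rstrip("-")
    return f"{prefix}-{digest}"
```

Hashing only when the name is too long keeps most generated ids human-readable while still producing a deterministic id for long names.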

Script Categories

Core Extraction Scripts

extract_dak.py - Main orchestrator coordinating all extraction processes
installer.py - Resource manager handling FHIR installation and transformations
extractor.py - Base class providing common functionality for specialized extractors

Specialized Content Extractors

dd_extractor.py - Data Dictionary extraction from Excel files
req_extractor.py - Requirements processing for functional/non-functional specs
bpmn_extractor.py - Business Process transformation from BPMN to FHIR
dt_extractor.py - Decision Tables conversion to computable formats
svg_extractor.py - Graphics processing for IG compatibility
DHIExtractor.py - Digital Health Interventions classification extraction
extractpr.py - Personas extraction from PDF documents

Supporting Utilities

codesystem_manager.py - Terminology management for CodeSystems and ValueSets
stringer.py - String manipulation utilities for FHIR resource generation
multifile_processor.py - Git integration for automated repository workflows

Post-Processing Scripts

generate_valueset_schemas.py - JSON Schema generation from IG publisher expansions.json output
generate_logical_model_schemas.py - JSON Schema generation from StructureDefinition JSON files for logical models

Schema and Validation Files

Directory/File	Purpose
`xsd/`	Contains XSD schema files for DMN and other XML validation
`includes/multifile.xsd`	Schema for multifile XML processing

Content Processing Flow

Data Dictionary Processing (dd_extractor.py): Extracts terminology and value sets from Excel files
Requirements Processing (req_extractor.py): Converts functional and non-functional requirements into FHIR resources
Business Process Processing (bpmn_extractor.py): Transforms BPMN workflows into FHIR actors and requirements
Decision Logic Processing (dt_extractor.py): Converts decision tables and CQL into executable FHIR resources
Visual Content Processing (svg_extractor.py): Optimizes diagrams for IG presentation
Personas Processing (extractpr.py): Extracts actor definitions from PDF documentation
Resource Installation (installer.py): Coordinates final resource generation and file organization
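
The flow above (extractors add resources to a shared installer, which writes everything at the end) can be sketched as follows. This is not the code of `extract_dak.py` or `installer.py`: `SketchInstaller`, `run_pipeline`, the `add_resource` signature, and the flat `input/fsh/<id>.fsh` layout are assumptions made for illustration, and the real scripts may stop on the first failure rather than continue.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SketchInstaller:
    """Hypothetical stand-in for installer.py: collects resources, installs later."""
    resources: dict[str, str] = field(default_factory=dict)

    def add_resource(self, resource_id: str, fsh: str) -> None:
        self.resources[resource_id] = fsh

    def install(self) -> list[str]:
        # The real installer writes files; this sketch only reports assumed target paths.
        return [f"input/fsh/{rid}.fsh" for rid in sorted(self.resources)]


def run_pipeline(
    steps: list[tuple[str, Callable[[SketchInstaller], bool]]],
    installer: SketchInstaller,
) -> bool:
    """Run each extractor step in order; return True only if every step succeeded."""
    all_ok = True
    for name, step in steps:
        try:
            step_ok = bool(step(installer))
        except Exception as exc:
            # Keep going so one bad input does not hide problems in later steps.
            print(f"{name}: failed ({exc})")
            step_ok = False
        all_ok = all_ok and step_ok
    return all_ok
```

A caller would pass one callable per content type in the order listed above, then call `installer.install()` and turn the boolean result into the process exit status.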

Output Structure

The extraction process generates content in the following directories:

input/fsh/ - FHIR Shorthand (FSH) resource definitions
input/cql/ - Clinical Quality Language files
input/pagecontent/ - Markdown pages and HTML content
input/images/ - Processed SVG diagrams
input/dmn/ - Decision Model and Notation files
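
After a run, it can be useful to check which of these directories actually received content. The sketch below counts files under each directory listed above; `summarize_outputs` is a hypothetical helper, not part of the extraction scripts.

```python
from pathlib import Path

# Output directories named in this README, relative to the guideline repository root.
OUTPUT_DIRS = (
    "input/fsh",
    "input/cql",
    "input/pagecontent",
    "input/images",
    "input/dmn",
)


def summarize_outputs(repo_root: Path) -> dict[str, int]:
    """Count files under each expected output directory (0 if it is missing)."""
    summary = {}
    for relative in OUTPUT_DIRS:
        directory = repo_root / relative
        if directory.is_dir():
            summary[relative] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            summary[relative] = 0
    return summary
```

For example, `summarize_outputs(Path.cwd())` run from the target guideline repository returns a mapping such as `{"input/fsh": ..., "input/cql": ..., ...}` with one count per directory.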

Future Plans

Note: These DAK extraction scripts are currently hosted in this repository as a convenience. They will be migrated to their own dedicated repository in the future to better separate the core FHIR profiles from the extraction tooling.

Additional Utilities

Individual Extractor Scripts

extract_dhi.py - Standalone script for Digital Health Intervention extraction
check_pages.sh - Shell script for page validation

Post-Processing Scripts

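generate_valueset_schemas.py - Generate JSON schemas from IG publisher output
generate_logical_model_schemas.py - Generate JSON schemas from StructureDefinition JSON files for logical models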

ValueSet Schema Generation

The generate_valueset_schemas.py script processes the expansions.json file generated by the FHIR IG publisher and creates JSON schemas for each ValueSet using enum constraints.

Usage:

# Using default paths (output/expansions.json -> output/)
python input/scripts/generate_valueset_schemas.py

# Specifying input file only (output dir defaults to output/)
python input/scripts/generate_valueset_schemas.py path/to/expansions.json

# Specifying both input and output paths
python input/scripts/generate_valueset_schemas.py path/to/expansions.json path/to/output/dir

Output:

Creates three files per ValueSet:
- ValueSet-{id}.schema.json - JSON schema with enum validation
- ValueSet-{id}.displays.json - Display values with multilingual support
- ValueSet-{id}.system.json - System URI mappings
Creates an index.html file with links to all generated schemas
Schema files use enum to constrain values to the expanded codes and reference display/system files
Display files use multilingual structure to support translations
Includes FHIR metadata (ValueSet URL, expansion timestamp, etc.)
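
To make the input side concrete, here is a rough sketch of how the expanded codes could be collected from an expansions.json Bundle. It relies only on standard FHIR structure (`Bundle.entry[].resource` and `ValueSet.expansion.contains`, which may nest); the function names are hypothetical and this is not the script's actual code.

```python
def iter_expansion_codes(contains):
    """Yield (system, code, display) from an expansion.contains list, including nested entries."""
    for item in contains:
        if "code" in item:
            yield item.get("system"), item["code"], item.get("display")
        # Expansion entries may carry their own nested "contains" list.
        yield from iter_expansion_codes(item.get("contains", []))


def codes_by_valueset(bundle):
    """Map each ValueSet id in a Bundle to its list of expanded codes."""
    result = {}
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "ValueSet":
            continue  # skip any non-ValueSet resources in the Bundle
        contains = resource.get("expansion", {}).get("contains", [])
        result[resource["id"]] = list(iter_expansion_codes(contains))
    return result
```

Loading the file with `json.load` and passing the result to `codes_by_valueset` yields one code list per ValueSet, which is the information the enum, display, and system files are built from.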

Example generated files:

Schema file (ValueSet-example.schema.json):

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "http://smart.who.int/base/ValueSet-example.schema.json",
  "title": "Example ValueSet Schema",
  "description": "JSON Schema for Example ValueSet codes. Generated from FHIR expansions.",
  "type": "string",
  "enum": ["code1", "code2", "code3"],
  "fhir:displays": "http://smart.who.int/base/ValueSet-example.displays.json",
  "fhir:system": "http://smart.who.int/base/ValueSet-example.system.json",
  "fhir:valueSet": "http://smart.who.int/base/ValueSet/example",
  "fhir:expansionTimestamp": "2023-01-01T00:00:00Z"
}

Display file (ValueSet-example.displays.json):

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "http://smart.who.int/base/ValueSet-example.displays.json",
  "title": "Example ValueSet Display Values",
  "description": "Display values for Example ValueSet codes. Generated from FHIR expansions.",
  "fhir:displays": {
    "code1": {"en": "Display One"},
    "code2": {"en": "Display Two"},
    "code3": {"en": "Display Three"}
  },
  "fhir:valueSet": "http://smart.who.int/base/ValueSet/example"
}
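
The two example files above share a simple shape, which the following sketch reproduces from a list of codes. It is an illustration only: `build_valueset_documents` and its parameters are hypothetical, the description fields are omitted for brevity, and displays are assumed to be English.

```python
def build_valueset_documents(base_url, valueset_id, title, canonical, codes, timestamp=None):
    """Return (schema, displays) dicts shaped like the two examples above.

    `codes` is a list of (code, display) pairs; displays are stored under "en".
    """
    schema_name = f"ValueSet-{valueset_id}.schema.json"
    displays_name = f"ValueSet-{valueset_id}.displays.json"
    schema = {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "$id": f"{base_url}/{schema_name}",
        "title": f"{title} Schema",
        "type": "string",
        # The enum restricts valid values to the expanded codes.
        "enum": [code for code, _ in codes],
        "fhir:displays": f"{base_url}/{displays_name}",
        "fhir:valueSet": canonical,
    }
    if timestamp is not None:
        schema["fhir:expansionTimestamp"] = timestamp
    displays = {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "$id": f"{base_url}/{displays_name}",
        "title": f"{title} Display Values",
        "fhir:displays": {code: {"en": display} for code, display in codes},
        "fhir:valueSet": canonical,
    }
    return schema, displays
```

Each returned dict can be written with `json.dump`. Keeping displays in a separate document leaves the schema itself small and lets translations be added without touching the enum.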

Logical Model Schema Generation

The generate_logical_model_schemas.py script processes JSON StructureDefinition files generated by the FHIR IG Publisher for FHIR Logical Models and generates JSON schemas for each Logical Model with support for ValueSet bindings.

Usage:

# Using default paths (output -> output/)
python input/scripts/generate_logical_model_schemas.py

# Specifying input directory only (output dir defaults to current directory)
python input/scripts/generate_logical_model_schemas.py output

# Specifying both input and output paths
python input/scripts/generate_logical_model_schemas.py output/StructureDefinition output/schemas

Features:

Processes JSON StructureDefinition files with "kind": "logical"
Maps FHIR datatypes to JSON Schema types (string, boolean, integer, etc.)
Handles cardinality mapping (1..1 → required, 0..1 → optional, 0..* → array)
Supports choice types with oneOf constraints
Detects ValueSet bindings and creates $ref references to ValueSet schemas
Uses canonical URLs from StructureDefinition for schema $id
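
The datatype, cardinality, and binding rules listed above can be sketched for a single element as follows. This is a simplified illustration, not the script's code: the type table is deliberately partial, the `object` fallback for unlisted types is an assumption, and `element_to_property` is a hypothetical name.

```python
# Partial mapping of FHIR primitive types to JSON Schema types (illustrative only).
FHIR_TO_JSON_TYPE = {
    "string": "string",
    "code": "string",
    "uri": "string",
    "date": "string",
    "dateTime": "string",
    "boolean": "boolean",
    "integer": "integer",
    "positiveInt": "integer",
    "decimal": "number",
}


def element_to_property(fhir_type, min_card, max_card, valueset_id=None):
    """Return (property_schema, is_required) for one logical-model element."""
    if valueset_id is not None:
        # A bound element points at the generated ValueSet schema instead of a plain type.
        item = {"$ref": f"ValueSet-{valueset_id}.schema.json"}
    else:
        item = {"type": FHIR_TO_JSON_TYPE.get(fhir_type, "object")}
    # FHIR stores max as a string ("1", "3", "*"); more than one occurrence means an array.
    repeats = max_card == "*" or int(max_card) > 1
    schema = {"type": "array", "items": item} if repeats else item
    return schema, min_card >= 1
```

A generator would call this once per element, place the returned schema under `properties`, and add the element name to `required` whenever the second value is true.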

Output:

Creates one JSON schema file per Logical Model: StructureDefinition-{model-name}.schema.json
Schema $id uses the base URL with StructureDefinition-{model-name}.schema.json pattern to match FHIR canonicals
Includes FHIR metadata and references to ValueSet schemas where applicable

Example generated schema:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "http://smart.who.int/base/StructureDefinition-Animal.schema.json",
  "title": "Animal",
  "description": "Logical Model for representing animals",
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "species": { "$ref": "ValueSet-AnimalSpeciesVS.schema.json" },
    "age": { "type": "integer" }
  },
  "required": ["name", "species"],
  "fhir:logicalModel": "http://smart.who.int/base/StructureDefinition/Animal"
}

For questions or issues with the DAK extraction scripts, please refer to the main repository documentation or submit an issue.
