This repository provides an ETL pipeline to transform the MIMIC-IV database into the Common Longitudinal ICU data Format (CLIF). The latest release is v1.0.0 (October 2025) and transforms MIMIC-IV 3.1 into CLIF 2.1.0.
Note
This project is being submitted to PhysioNet. All tables will be available for direct download for MIMIC-credentialed users once the submission is approved. For future releases, we will upload the download-ready CLIF tables to the PhysioNet project page; if that upload lags behind, use the code in this repository to generate the most up-to-date version.
To run the pipeline, first review the change log to find the latest or preferred version; then follow the instructions in the Usage section below to generate the dataset.
For mapping decisions, see the MIMIC-to-CLIF mapping spreadsheet.
For issues encountered and decisions made during the mapping process, see the ISSUESLOG.
Before running the pipeline, ensure you have:
- PhysioNet credentials: access to the MIMIC-IV dataset on PhysioNet
- MIMIC-IV 3.1 CSV files downloaded from PhysioNet to your machine
- Python 3.10.5+ and Git installed on your machine
- Disk space:
  - ~30 GB for MIMIC-IV 3.1 CSV files (compressed)
  - ~15 GB for MIMIC-IV Parquet conversion
  - ~1 GB for CLIF output tables
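As a rough pre-flight check, the Python version and disk-space requirements listed above could be verified with a short script along these lines. This is a minimal sketch and not part of the repository: the helper names `python_ok` and `disk_ok` are hypothetical, and the default total simply sums the three approximate figures above, so it overestimates the need if the CSV files are already downloaded.

```python
import shutil
import sys

# Approximate figures copied from the prerequisites list above.
MIN_PYTHON = (3, 10, 5)
REQUIRED_GB = {"mimic_csv": 30, "mimic_parquet": 15, "clif_output": 1}


def python_ok(version=None, minimum=MIN_PYTHON):
    """Return True if the given (or running) Python version meets the minimum."""
    version = tuple(sys.version_info[:3]) if version is None else tuple(version)
    return version >= minimum


def disk_ok(path=".", needed_gb=None):
    """Return True if the filesystem holding `path` has `needed_gb` GB free."""
    if needed_gb is None:
        needed_gb = sum(REQUIRED_GB.values())
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Disk space OK:", disk_ok("."))
```

Pass the drive or directory where you plan to store the MIMIC files as `path`, since free space is reported per filesystem.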
If you are an existing user, run `git pull` on the relevant branch and check the change log for the updated CLIF tables that need to be regenerated.
If you are a new user, fork this repository and `git clone` your fork to a local directory.
Copy config/config_template.json to config/config.json and customize the following settings:
- On the backend, the pipeline requires a copy of the MIMIC data in Parquet format for much faster processing.
  - If you have already created a Parquet copy of MIMIC, set `"create_mimic_parquet_from_csv": 0` and provide the absolute path to your MIMIC Parquet files at `"mimic_parquet_dir"`.
  - Otherwise, if you do not yet have a copy of MIMIC in Parquet, set `"create_mimic_parquet_from_csv": 1` and change `"mimic_csv_dir"` under `"default"` to the absolute path of the compressed CSV files (`.csv.gz`) you downloaded from PhysioNet. By default, if you leave `"mimic_parquet_dir"` blank (`""`), the program creates a `/parquet` subdirectory under your `"mimic_csv_dir"`. Optionally, you can store the Parquet files elsewhere, and the program will create a directory at the alternative path you provide.
- Specify the CLIF tables you want in the next run by setting their values to 1 (otherwise 0) under `"clif_tables"`.
  - For example, to recreate two tables (`vitals` and `labs`) that were recently updated:

        {
          "clif_tables": {
            "patient": 0,
            "hospitalization": 0,
            "adt": 0,
            "vitals": 1,
            "labs": 1,
            "patient_assessments": 0,
            "respiratory_support": 0,
            "medication_admin_continuous": 0,
            "position": 0
          }
        }

- To work across multiple devices or workspaces, you can add more workspaces along with their respective CSV and Parquet directory paths. For details, see the example below or `/config/config_example.json`, which shows how I personally specify file paths under three workspace setups: "local," "hpc," and "local_test." This lets you switch between devices or environments without updating file paths each time. Whenever you switch, just update `"current_workspace"` accordingly, e.g. set `"current_workspace": "hpc"`, as long as you have specified a set of directory paths under a key of the same name, i.e. `"hpc": {...}`.
The following example shows two workspaces, "local" and "hpc", with the current workspace set to "local":
- Since I had already created MIMIC Parquet files in my HPC environment, I left `"mimic_csv_dir"` blank (`""`) and only provided the location of my Parquet files at `"mimic_parquet_dir"`.
- For my local device, I elected to convert from CSV by specifying the CSV location at `"mimic_csv_dir"`, while leaving `"mimic_parquet_dir"` blank to use the default setting (creates a `/parquet` subdirectory under the CSV directory).
{
  "current_workspace": "local",
  "hpc": {
    "mimic_csv_dir": "",
    "mimic_parquet_dir": "/some/absolute/path/to/your/project/root/CLIF-MIMIC/data/mimic-data/mimic-iv-3.1/parquet"
  },
  "local": {
    "mimic_csv_dir": "/some/absolute/path/to/your/project/root/CLIF-MIMIC/data/mimic-data/mimic-iv-3.1",
    "mimic_parquet_dir": ""
  }
}
- You can also store multiple versions of the CLIF table outputs by customizing `clif_output_dir_name`. If you leave it blank (`""`), the program defaults to naming it `f"rclif-{CLIF_VERSION}"`. Using this default is recommended if you want to access and store multiple CLIF versions at the same time.
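To make the configuration rules above concrete, here is a minimal sketch of how the settings could be resolved: look up the workspace named by `"current_workspace"`, fall back to a `parquet` subdirectory of the CSV directory when `"mimic_parquet_dir"` is blank, and collect the tables flagged with 1. The function `resolve_settings` is hypothetical and only illustrates the documented behavior; the pipeline's actual implementation may differ.

```python
import json
from pathlib import Path


def resolve_settings(config: dict) -> dict:
    """Resolve workspace paths and selected tables from a config dict.

    Assumes workspaces are top-level keys, as in the example above.
    """
    workspace = config[config["current_workspace"]]
    csv_dir = workspace.get("mimic_csv_dir", "")
    parquet_dir = workspace.get("mimic_parquet_dir", "")

    # Blank parquet dir: default to a parquet/ subdirectory of the CSV dir.
    if not parquet_dir:
        if not csv_dir:
            raise ValueError("Set mimic_csv_dir or mimic_parquet_dir")
        parquet_dir = str(Path(csv_dir) / "parquet")

    # Keep only the tables whose flag is 1.
    tables = [
        name for name, flag in config.get("clif_tables", {}).items() if flag == 1
    ]
    return {"csv_dir": csv_dir, "parquet_dir": parquet_dir, "tables": tables}


if __name__ == "__main__":
    config_path = Path("config/config.json")
    if config_path.exists():
        print(resolve_settings(json.loads(config_path.read_text())))
```

Running this against your own `config/config.json` before starting the pipeline is a quick way to confirm which paths and tables the current workspace points to.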
After navigating to the project directory, ensure you are on the correct branch:
- `main` - for the latest stable version
- `release/<x.x.x>` - for a beta version
To switch to a specific branch (e.g., release/0.2.0):
# Fetch information on all remote branches
git fetch
# Switch to branch release/0.2.0
git switch release/0.2.0
# Verify current branch
git branch
uv is a fast Python package manager that simplifies dependency management.
- Install uv (if not already installed):

      # macOS/Linux
      curl -LsSf https://astral.sh/uv/install.sh | sh
      # Windows (PowerShell)
      powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

- Run the pipeline (once on the correct branch):

      uv run python main.py

Alternatively, to use pip and a virtual environment instead of uv, run the following line by line on the correct branch:
# Create a virtual environment
python3 -m venv .venv/
# Activate the virtual environment
source .venv/bin/activate
# Install the dependencies
pip install -r requirements.txt
# Run the pipeline
python3 main.py
After running the pipeline, you'll find the following outputs:
Generated Parquet files will be in output/rclif-2.1.0/ (or your custom output directory name).
| Table | Typical Size |
|---|---|
| `vitals` | ~265 MB |
| `labs` | ~344 MB |
| `patient_assessments` | ~137 MB |
| `medication_admin_continuous` | ~84 MB |
| `medication_admin_intermittent` | ~48 MB |
| `adt` | ~33 MB |
| `respiratory_support` | ~29 MB |
| `position` | ~24 MB |
| `hospital_diagnosis` | ~20 MB |
| `hospitalization` | ~16 MB |
| `patient_procedures` | ~8.2 MB |
| `crrt_therapy` | ~4.3 MB |
| `code_status` | ~1.1 MB |
| `patient` | ~3 MB |
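To compare your own run against the typical sizes above, a short script can list the Parquet files in the output directory. This is a sketch under one assumption: that each table is written as a single `.parquet` file directly inside the output directory. The helper name `table_sizes_mb` is hypothetical.

```python
from pathlib import Path


def table_sizes_mb(output_dir):
    """Map each Parquet file stem in `output_dir` to its size in MB."""
    return {
        path.stem: path.stat().st_size / 1e6
        for path in sorted(Path(output_dir).glob("*.parquet"))
    }


if __name__ == "__main__":
    # Default output directory name from this README; adjust if customized.
    for name, size in table_sizes_mb("output/rclif-2.1.0").items():
        print(f"{name:<32}{size:>8.1f} MB")
```

A table that is missing or far smaller than its typical size is a hint to check the logs before using the data.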
- `output/logs/clif_mimic_all.log` - All INFO+ messages
- `output/logs/clif_mimic_errors.log` - Warnings and errors only
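For readers unfamiliar with this two-file pattern, the split between the two logs can be reproduced with Python's standard `logging` module by attaching two file handlers with different levels. The sketch below only illustrates the idea; `build_logger` and the message format are assumptions, and the repository's own logging setup may be organized differently.

```python
import logging
from pathlib import Path


def build_logger(log_dir, name="clif_mimic"):
    """Create a logger writing INFO+ to one file and WARNING+ to another."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep messages out of the root logger

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for filename, level in [
        ("clif_mimic_all.log", logging.INFO),  # everything at INFO and above
        ("clif_mimic_errors.log", logging.WARNING),  # warnings and errors only
    ]:
        handler = logging.FileHandler(log_dir / filename)
        handler.setLevel(level)
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

With this arrangement, an empty errors log after a run means no warnings or errors were recorded, so it is usually the first file to check.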
- Change Log
- Issues Log
To contribute to this open-source project, feel free to:
- Open an issue for bugs or feature requests
- Follow branch naming conventions (e.g., `new-table/dialysis`, `fix/vitals-mapping`)
- Submit a pull request for review
This project is licensed under the MIT License.
Important
The MIMIC-IV dataset is subject to the PhysioNet Data Use Agreement. Users must obtain access through PhysioNet before processing.