- Only set up for T1_only model training at the moment
- For now, curated specifically for the ADNI and fragileX datasets
- The wrapper is split into two separate parts because the MSI modules for freesurfer and ants have conflicts
T1w image files need to be in the format `${age}mo_ds-${dataset}_sub-${sub_id}_0000.${images_ext}`, while the corresponding label files need to be in the format `${age}mo_ds-${dataset}_sub-${sub_id}.${labels_ext}`, before being fed into the preprocessing wrapper. To get files into this format, run the `rename_data_files.py` script on a BIDS-valid input directory. A specific file is used below as an example when filling in the arguments to the script, but the script can operate on all files within a set of subjects as long as a `participants.tsv` is provided.
Information on the file name fields:
- `${age}` -> Integer value for the participant age in months, derived from the age (in years) listed in the `participants.tsv` for a given dataset, assuming it is BIDS valid.
- `${dataset}` -> A unique identifier for the source of a data file.
- `${sub_id}` -> Unique subject identifier.
- `${images_ext}` -> The file extension of a given image file. This is likely either going to be `nii.gz` or `mgz`. The preprocessing scripts are currently only set up for one or the other.
- `${labels_ext}` -> The file extension of a given label file. This is likely either going to be `nii.gz` or `mgz`. The preprocessing scripts are currently only set up for one or the other.
```
python ./rename_data_files.py <base_directory> <relative_path_template> <dataset_name> <participants_file> <t1_identifier> <aseg_identifier> <output_directory>
```
- `<base_directory>` is the path to the project root directory where the parent subject directories live. As an example, if attempting to rename a file with the path `/usr/projects/bids_dataset/sub-1234/ses-1234/anat/sub-1234_ses-1234_acq-1234_T1w.mgz`, the `<base_directory>` argument would be `/usr/projects/bids_dataset`.
- `<relative_path_template>` is the relative path to the anatomical or segmentation files within each subject folder. The code uses the `participants.tsv` to gather the unique subject IDs, and it can also accept a placeholder field for session IDs, referring back to the `participants.tsv` to fill that field in automatically based on what sessions are available for a given subject. Building off of the example above, the `<relative_path_template>` would be something like `/ses-{session}/anat`.
- `<dataset_name>` is a string that gets added as a unique identifier for files from subjects belonging to a given dataset. For our example we will use the string "test".
- `<participants_file>` is the absolute path to the `participants.tsv` describing what is available within the base directory. The `participants.tsv` must have headers for "participant_id" and "age", while "session_id" is optional. The "age" information must be in years. Using our example above, the inputted `participants.tsv` would be `/usr/projects/bids_dataset/participants.tsv` and its contents would look something like:
```
participant_id	session_id	age
sub-1234	ses-1234	12
```
- `<t1_identifier>` is a string used to identify which file is the actual anatomical image that needs renaming. Based on the example above, the input to this argument would be the string "T1w.mgz".
- `<aseg_identifier>` is similar to the argument above, but instead distinguishes the segmentation file in particular from the other files in the directory. An example would be something like "aseg.mgz".
- `<output_directory>` is the absolute path where the renamed files will be produced. For our example, the input to this argument would be `/usr/projects/bids_dataset_renamed`, and after running the script, the full path of the renamed image file will be `/usr/projects/bids_dataset_renamed/images_test_renamed/144mo_ds-test_sub-1234_ses-1234_0000.mgz` (the 12-year age in the `participants.tsv` is converted to 144 months).
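As an illustration of the naming convention above, the renamed image filename from the example can be reconstructed as follows. This helper is a hedged sketch of the convention only, not the actual logic inside `rename_data_files.py`:

```python
def build_renamed_image_name(age_years, dataset, participant_id, session_id, ext):
    """Sketch of the output naming convention: age is converted from
    years (as stored in participants.tsv) to months, and the _0000
    suffix that distinguishes image files from label files is appended."""
    age_months = int(age_years) * 12
    # participant_id already carries the "sub-" prefix, e.g. "sub-1234"
    return f"{age_months}mo_ds-{dataset}_{participant_id}_{session_id}_0000.{ext}"

print(build_renamed_image_name(12, "test", "sub-1234", "ses-1234", "mgz"))
# -> 144mo_ds-test_sub-1234_ses-1234_0000.mgz
```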
Part 1 of data preprocessing includes converting anatomical images and label files to `.nii.gz` if needed, reorienting and overwriting the image and label file, then relabeling the segmentations using `./preproc_data_wrapper/relabeling.py`. Note that these preprocessing steps are currently only catered to the ADNI and fragileX datasets.
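The relabeling step amounts to a value-for-value remap of the segmentation. The sketch below shows the general technique with NumPy; the label mapping here is purely hypothetical, since the real mapping lives in `./preproc_data_wrapper/relabeling.py` and is dataset-specific:

```python
import numpy as np

# Hypothetical mapping from source label values to training label values.
# The actual mapping is defined in relabeling.py, not here.
LABEL_MAP = {0: 0, 2: 1, 41: 1, 3: 2, 42: 2}

def relabel(seg, label_map):
    """Remap every voxel value in a segmentation array according to
    label_map, leaving any unmapped values untouched."""
    out = seg.copy()
    for old, new in label_map.items():
        out[seg == old] = new
    return out

seg = np.array([[0, 2, 41],
                [3, 42, 0]])
relabel(seg, LABEL_MAP)
```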
- Within the `preproc_data_wrapper` folder, edit the following variables in the `make_run_files_s3_part1.sh` file if needed: `in_bucket` and `out_bucket`. These likely won't need to be edited, though, as both variables are typically just `"s3://lifespan-seg-model-data/raw"`.
- Execute the `make_run_files_s3_part1.sh` script. This requires one argument identifying the dataset name, so based on our example above, the command will look something like `./preproc_data_wrapper/make_run_files_s3_part1.sh test`. This makes a unique run file, based on the template `template.preproc_data_part1`, for each file available within `${in_bucket}/labels_${dataset}_renamed/`; the run files are then stored in a folder called `./preproc_data_wrapper/run_files_test.preproc_data_part1`.
- Check `./preproc_data_wrapper/resources_preproc_data_part1.sh` to make sure parameters are properly specified. Parameters that may need editing in particular are `--mail-user`, `-p`, `-o`, `-e`, and `-A`.
- Submit the jobs with the following command: `./preproc_data_wrapper/submit_preproc_data_part1.sh 0-10 test`. The two additional arguments are an example array and an example dataset name. The final outputs from our example will be here: `${out_bucket}/images_test_renamed_relabeled/` and `${out_bucket}/labels_test_renamed_relabeled/`
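Since one run file is generated per label file, the array argument in the submit command should cover one index per run file. This small helper is illustrative only (it is not part of the repo) and assumes run files are indexed from 0:

```python
def submit_command(n_run_files, dataset, part=1):
    """Build the part-N submit command, assuming run files are
    indexed 0..n-1 and the array argument must cover all of them."""
    array = f"0-{n_run_files - 1}"
    return (f"./preproc_data_wrapper/submit_preproc_data_part{part}.sh "
            f"{array} {dataset}")

print(submit_command(11, "test"))
# -> ./preproc_data_wrapper/submit_preproc_data_part1.sh 0-10 test
```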
Part 2 of data preprocessing includes applying spatial transformations with `antsApplyTransforms` to align segmentations with the anatomical image using nearest-neighbor interpolation, then converting the data within both the image and label files to integer datatypes.
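The alignment step boils down to an `antsApplyTransforms` call with `-n NearestNeighbor`, which prevents label values from being blended during resampling. The sketch below assembles such a call; the filenames and transform are placeholders, and the wrapper's actual invocation may differ:

```python
import subprocess

def align_labels_cmd(label_file, t1_file, out_file, transform):
    """Assemble an antsApplyTransforms call that resamples a label
    file into the T1 space with nearest-neighbor interpolation."""
    return [
        "antsApplyTransforms",
        "-d", "3",                # 3D images
        "-i", label_file,         # moving segmentation
        "-r", t1_file,            # reference (fixed) anatomical image
        "-o", out_file,
        "-n", "NearestNeighbor",  # preserve discrete label values
        "-t", transform,
    ]

cmd = align_labels_cmd("seg.nii.gz", "t1.nii.gz",
                       "seg_aligned.nii.gz", "affine.mat")
# subprocess.run(cmd, check=True)  # requires ANTs on PATH
```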
- Within the `preproc_data_wrapper` folder, edit the following variables in the `make_run_files_s3_part2.sh` file if needed: `in_bucket` and `out_bucket`. These likely won't need to be edited, though, as the variables are typically just `"s3://lifespan-seg-model-data/raw"` and `"s3://lifespan-seg-model-data/preprocessed/T1-only"` respectively.
- Execute the `make_run_files_s3_part2.sh` script. This requires one argument identifying the dataset name, so based on our example above, the command will look something like `./preproc_data_wrapper/make_run_files_s3_part2.sh test`. This makes a unique run file, based on the template `template.preproc_data_part2`, for each file available within `${in_bucket}/labels_${dataset}_renamed_relabeled/`; the run files are then stored in a folder called `./preproc_data_wrapper/run_files_test.preproc_data_part2`.
- Check `./preproc_data_wrapper/resources_preproc_data_part2.sh` to make sure parameters are properly specified. Parameters that may need editing in particular are `--mail-user`, `-p`, `-o`, `-e`, and `-A`.
- Submit the jobs with the following command: `./preproc_data_wrapper/submit_preproc_data_part2.sh 0-10 test`. The two additional arguments are an example array and an example dataset name. The final outputs from our example will be here: `${out_bucket}/images/` and `${out_bucket}/labels/`
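The integer datatype conversion at the end of part 2 amounts to casting the voxel data. A minimal NumPy sketch of the idea follows; the specific integer type the wrapper uses is an assumption here:

```python
import numpy as np

def to_int(data, dtype=np.int32):
    """Cast voxel data to an integer dtype; values are rounded first
    so near-integer floats don't silently truncate downward."""
    return np.rint(data).astype(dtype)

vals = np.array([0.0, 0.9999, 2.0001])
to_int(vals)
```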
- There are currently 4 Python scripts that could be used for relabeling the input segmentations; however, only `relabeling.py` is relevant at the moment.
- There is also an additional script for making run files called `make_run_files_based_on_csv.py`, which may become relevant if ABCD data is included in the future.
`data-for-lssm` is used to store open access information on all of the images and labels that could be used for training. `notebooks-for-lssm` contains Jupyter notebooks used to curate, from the open access data, the specific images and labels with exceptional quality scores that we can use for training.
`harmonize_adni_for_curated.sh` is a script used to determine whether any preprocessed label files are missing a corresponding image file. If an image file is missing, the corresponding label file is removed. The inputs for this script are a directory containing all available image files for a given dataset and a directory containing all available label files for that dataset.

`label_check.sh` is a script that checks which labels are currently present within a given label file and which labels are missing from it. The input is a directory containing all label files of interest that need to be checked.
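The core check in `harmonize_adni_for_curated.sh` can be sketched in Python as a stem comparison between the two directories; this is a hedged re-implementation of the idea, not the script itself. Note that image names carry the `_0000` suffix while label names do not, and this sketch only identifies orphan labels rather than deleting them:

```python
def orphan_labels(image_names, label_names, ext=".nii.gz"):
    """Return label files that have no matching image file.
    Images are named <stem>_0000<ext>; labels are named <stem><ext>."""
    suffix = f"_0000{ext}"
    image_stems = {name[:-len(suffix)]
                   for name in image_names if name.endswith(suffix)}
    return sorted(lbl for lbl in label_names
                  if lbl.endswith(ext) and lbl[:-len(ext)] not in image_stems)

imgs = ["144mo_ds-test_sub-1234_ses-1234_0000.nii.gz"]
lbls = ["144mo_ds-test_sub-1234_ses-1234.nii.gz",
        "144mo_ds-test_sub-5678_ses-5678.nii.gz"]
orphan_labels(imgs, lbls)
# -> ["144mo_ds-test_sub-5678_ses-5678.nii.gz"]
```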