Skip to content

Pre-import processing steps required for IMPACT studies to pass validation#1298

Draft
jamesqo wants to merge 20 commits into
masterfrom
pre-import-processing
Draft

Pre-import processing steps required for IMPACT studies to pass validation#1298
jamesqo wants to merge 20 commits into
masterfrom
pre-import-processing

Conversation

@jamesqo
Copy link
Copy Markdown
Contributor

@jamesqo jamesqo commented Feb 10, 2026

  • Fixed bug where changes to Cancer Type / Cancer Type Detailed are erased by download_from_s3
  • Add function to generate case lists based on datatype (eg. _all, _cna, etc)
  • Make sure case list generation happens as late as possible for each study prior to final upload to S3

@jamesqo jamesqo marked this pull request as draft February 10, 2026 19:13
}

function preImportProcessingSteps {
updateCancerTypeFromOncotreeCode "$3"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this call needs to come before adding the case lists

Copy link
Copy Markdown
Contributor

@sheridancbio sheridancbio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these changes correct an import problem we are experiencing now, then good.

But it seems like if we are experiencing such import errors, the root cause may be that we have gotten confused about where we keep our data and which locations are primary / authoritative and which locations are derived / downstream / intermediate. We are managing multiple copies of the same data stored in several locations. I believe we are in a transition period, but I think we need to be clear (at least with prominent comments) about which locations are in what state and where it is appropriate to be pulling information from during processing stages (like fetch stage and import stage).

if [ $IMPORT_STATUS_IMPACT -eq 0 ] && [ $FETCH_CVR_IMPACT_FAIL -eq 0 ] ; then
addCancerTypeCaseLists $MSK_IMPACT_DATA_HOME "mskimpact" "data_clinical_mskimpact_data_clinical_cvr.txt"
# clinical file has been updated with cancer types
upload_to_s3 "$MSK_IMPACT_DATA_HOME/data_clinical_mskimpact_data_clinical_cvr.txt" "mskimpact/data_clinical_mskimpact_data_clinical_cvr.txt" "mskimpact-databricks"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These filenames correspond (if I remember correctly) to projects which are present in redcap. I think we are in the process of dropping our use of redcap ... led by @mandawilson I think. I think we need to clearly understand (and perhaps document with comments) where the authoritative store (in terms of study import operations) of clinical information for samples is located. It seems a bad idea to me to maintain multiple persistent copies of the authoritative source of fetched clinical information. So if s3 is our store we should stop using redcap asap ... and also not be checking these files into a github repo which could be confused as valid locations to record updates to clinical data. Instead all updates should be made in s3. If redcap is our current authoritative store I would ask why we are pushing these files into s3 at all.

upload_to_s3 "$MSK_IMPACT_DATA_HOME" "mskimpact" "mskimpact-databricks"
addCancerTypeCaseLists $MSK_IMPACT_DATA_HOME "mskimpact" "data_clinical_sample.txt" "data_clinical_patient.txt"
addDataTypeCaseLists $MSK_IMPACT_DATA_HOME "mskimpact"
upload_to_s3 "$MSK_IMPACT_DATA_HOME" "mskimpact" "mskimpact-databricks" # last mandatory upload of mskimpact
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a similar sentiment about case lists ... although I do understand that these are artifacts which are directly imported and so we may want a recorded snapshot of the case lists persisted along with other study data which was imported. I don't fully understand our current use of S3, but I guess it is being used as a bridge to make data available to databricks. I wonder whether S3 currently contains adequate data so that import operations could be conducted directly from S3 without any references to other repositories or pipelines. I think we are moving towards having pipelines which fetch and push modifications into S3. Perhaps there will be further operations within databricks which will transform data in order to make it ready for import. I hope that we are organizing our S3 locations so that it is clear which data is "fetched data flowing from our cvr pipelines" and which data is "ready-to-import study directories" containing data which needs no further transformation in order to be imported.
But my most important question about pushing case lists into S3:
Is our goal to have a fully assembled / importable study directory in S3 so that our importers can operate only by referencing studies in S3? If that is where we are going then we should go all the way to that goal rather than taking incremental steps which make it "more and more" like an importable study directory. That will lead to confusion I think. If the S3 bucket directories are not importable I would recommend that perhaps we should add an artifact with a name like "THESE_FILES_ARE_NOT_VALID_FOR_IMPORT" so we remember that S3 is not ready as a data source for import.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants