Pre-import processing steps required for IMPACT studies to pass validation by jamesqo · Pull Request #1298 · knowledgesystems/cmo-pipelines

jamesqo · 2026-02-10T19:13:45Z

Fixed bug where changes to Cancer Type / Cancer Type Detailed are erased by download_from_s3
Add function to generate case lists based on datatype (eg. _all, _cna, etc)
Make sure case list generation happens as late as possible for each study prior to final upload to S3

jamesqo · 2026-02-12T20:55:19Z

 }

+function preImportProcessingSteps {
+    updateCancerTypeFromOncotreeCode "$3"


this call needs to come before adding the case lists

… comments

sheridancbio

If these changes correct an import problem we are experiencing now, then good.

But it seems like if we are experiencing such import errors, the root cause may be that we have gotten confused about where we keep our data and which locations are primary / authoritative and which locations are derived / downstream / intermediate. We are managing multiple copies of the same data stored in several locations. I believe we are in a transition period, but I think we need to be clear (at least with prominent comments) about which locations are in what state and where it is appropriate to be pulling information from during processing stages (like fetch stage and import stage).

sheridancbio · 2026-02-20T21:41:02Z

    if [ $IMPORT_STATUS_IMPACT -eq 0 ] && [ $FETCH_CVR_IMPACT_FAIL -eq 0 ] ; then
        addCancerTypeCaseLists $MSK_IMPACT_DATA_HOME "mskimpact" "data_clinical_mskimpact_data_clinical_cvr.txt"
+        # clinical file has been updated with cancer types
+        upload_to_s3 "$MSK_IMPACT_DATA_HOME/data_clinical_mskimpact_data_clinical_cvr.txt" "mskimpact/data_clinical_mskimpact_data_clinical_cvr.txt" "mskimpact-databricks"


These filenames correspond (if I remember correctly) to projects which are present in redcap. I think we are in the process of dropping our use of redcap ... led by @mandawilson I think. I think we need to clearly understand (and perhaps document with comments) where the authoritative store (in terms of study import operations) of clinical information for samples is located. It seems a bad idea to me to maintain multiple persistent copies of the authoritative source of fetched clinical information. So if s3 is our store we should stop using redcap asap ... and also not be checking these files into a github repo which could be confused as valid locations to record updates to clinical data. Instead all updates should be made in s3. If redcap is our current authoritative store I would ask why we are pushing these files into s3 at all.

sheridancbio · 2026-02-20T22:23:25Z

-                    upload_to_s3 "$MSK_IMPACT_DATA_HOME" "mskimpact" "mskimpact-databricks"
+                    addCancerTypeCaseLists $MSK_IMPACT_DATA_HOME "mskimpact" "data_clinical_sample.txt" "data_clinical_patient.txt"
+                    addDataTypeCaseLists $MSK_IMPACT_DATA_HOME "mskimpact"
+                    upload_to_s3 "$MSK_IMPACT_DATA_HOME" "mskimpact" "mskimpact-databricks" # last mandatory upload of mskimpact


I have a similar sentiment about case lists ... although I do understand that these are artifacts which are directly imported and so we may want a recorded snapshot of the case lists persisted along with other study data which was imported. I don't fully understand our current use of S3, but I guess it is being used as a bridge to make data available to databricks. I wonder whether S3 currently contains adequate data so that import operations could be conducted directly from S3 without any references to other repositories or pipelines. I think we are moving towards having pipelines which fetch and push modifications into S3. Perhaps there will be further operations within databricks which will transform data in order to make it ready for import. I hope that we are organizing our S3 locations so that it is clear which data is "fetched data flowing from our cvr pipelines" and which data is "ready-to-import study directories" containing data which needs no further transformation in order to be imported.
But my most important question about pushing case lists into S3:
Is our goal to have a fully assembled / importable study directory in S3 so that our importers can operate only by referencing studies in S3? If that is where we are going then we should go all the way to that goal rather than taking incremental steps which make it "more and more" like an importable study directory. That will lead to confusion I think. If the S3 bucket directories are not importable I would recommend that perhaps we should add an artifact with a name like "THESE_FILES_ARE_NOT_VALID_FOR_IMPORT" so we remember that S3 is not ready as a data source for import.

jamesqo added 3 commits February 9, 2026 12:33

generate fresh case list files prior to each s3 upload

c6a5e4d

add logic to generate missing cancer type case lists

21b1f67

update the CANCER_TYPE columns for all studies

9d0f209

jamesqo marked this pull request as draft February 10, 2026 19:13

jamesqo commented Feb 12, 2026

View reviewed changes

jamesqo added 17 commits February 19, 2026 11:56

fix issue with updateCancerType function

ea868f8

fix dmp function

9f4009e

make sure sample file comes before patient file

fa70b4b

move import trigger touch calls after the preproc steps

bbf8d1c

use the old clinical files prior to redcap export

fca4404

add missing variable defns

7cbf27f

no-op change

3be77e1

remove calls prior to redcap export; mark off final upload calls with…

02f9504

… comments

remove more calls

85efec1

remove more calls

637e5b5

add back call

7f8dac4

wip

98be98c

wip

6cbe48e

fix bug: we were not actually updating the cancer types

a5d8058

more refactoring

023a71b

add some validation to addDataTypeCaseLists

813b41f

add validation to function; fix comments

a1cae1b

sheridancbio reviewed Feb 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pre-import processing steps required for IMPACT studies to pass validation#1298

Pre-import processing steps required for IMPACT studies to pass validation#1298
jamesqo wants to merge 20 commits into
masterfrom
pre-import-processing

jamesqo commented Feb 10, 2026 •

edited

Loading

Uh oh!

jamesqo Feb 12, 2026

Uh oh!

sheridancbio left a comment

Uh oh!

sheridancbio Feb 20, 2026

Uh oh!

sheridancbio Feb 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jamesqo commented Feb 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jamesqo Feb 12, 2026

Choose a reason for hiding this comment

Uh oh!

sheridancbio left a comment

Choose a reason for hiding this comment

Uh oh!

sheridancbio Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

sheridancbio Feb 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jamesqo commented Feb 10, 2026 •

edited

Loading