Pre-import processing steps required for IMPACT studies to pass validation#1298
Pre-import processing steps required for IMPACT studies to pass validation#1298jamesqo wants to merge 20 commits into
Conversation
| } | ||
|
|
||
| function preImportProcessingSteps { | ||
| updateCancerTypeFromOncotreeCode "$3" |
There was a problem hiding this comment.
this call needs to come before adding the case lists
sheridancbio
left a comment
There was a problem hiding this comment.
If these changes correct an import problem we are experiencing now, then good.
But it seems like if we are experiencing such import errors, the root cause may be that we have gotten confused about where we keep our data and which locations are primary / authoritative and which locations are derived / downstream / intermediate. We are managing multiple copies of the same data stored in several locations. I believe we are in a transition period, but I think we need to be clear (at least with prominent comments) about which locations are in what state and where it is appropriate to be pulling information from during processing stages (like fetch stage and import stage).
| if [ $IMPORT_STATUS_IMPACT -eq 0 ] && [ $FETCH_CVR_IMPACT_FAIL -eq 0 ] ; then | ||
| addCancerTypeCaseLists $MSK_IMPACT_DATA_HOME "mskimpact" "data_clinical_mskimpact_data_clinical_cvr.txt" | ||
| # clinical file has been updated with cancer types | ||
| upload_to_s3 "$MSK_IMPACT_DATA_HOME/data_clinical_mskimpact_data_clinical_cvr.txt" "mskimpact/data_clinical_mskimpact_data_clinical_cvr.txt" "mskimpact-databricks" |
There was a problem hiding this comment.
These filenames correspond (if I remember correctly) to projects which are present in redcap. I think we are in the process of dropping our use of redcap ... led by @mandawilson I think. I think we need to clearly understand (and perhaps document with comments) where the authoritative store (in terms of study import operations) of clinical information for samples is located. It seems a bad idea to me to maintain multiple persistent copies of the authoritative source of fetched clinical information. So if s3 is our store we should stop using redcap asap ... and also not be checking these files into a github repo which could be confused as valid locations to record updates to clinical data. Instead all updates should be made in s3. If redcap is our current authoritative store I would ask why we are pushing these files into s3 at all.
| upload_to_s3 "$MSK_IMPACT_DATA_HOME" "mskimpact" "mskimpact-databricks" | ||
| addCancerTypeCaseLists $MSK_IMPACT_DATA_HOME "mskimpact" "data_clinical_sample.txt" "data_clinical_patient.txt" | ||
| addDataTypeCaseLists $MSK_IMPACT_DATA_HOME "mskimpact" | ||
| upload_to_s3 "$MSK_IMPACT_DATA_HOME" "mskimpact" "mskimpact-databricks" # last mandatory upload of mskimpact |
There was a problem hiding this comment.
I have a similar sentiment about case lists ... although I do understand that these are artifacts which are directly imported and so we may want a recorded snapshot of the case lists persisted along with other study data which was imported. I don't fully understand our current use of S3, but I guess it is being used as a bridge to make data available to databricks. I wonder whether S3 currently contains adequate data so that import operations could be conducted directly from S3 without any references to other repositories or pipelines. I think we are moving towards having pipelines which fetch and push modifications into S3. Perhaps there will be further operations within databricks which will transform data in order to make it ready for import. I hope that we are organizing our S3 locations so that it is clear which data is "fetched data flowing from our cvr pipelines" and which data is "ready-to-import study directories" containing data which needs no further transformation in order to be imported.
But my most important question about pushing case lists into S3:
Is our goal to have a fully assembled / importable study directory in S3 so that our importers can operate only by referencing studies in S3? If that is where we are going then we should go all the way to that goal rather than taking incremental steps which make it "more and more" like an importable study directory. That will lead to confusion I think. If the S3 bucket directories are not importable I would recommend that perhaps we should add an artifact with a name like "THESE_FILES_ARE_NOT_VALID_FOR_IMPORT" so we remember that S3 is not ready as a data source for import.
download_from_s3_all,_cna, etc)