Update JUMP profile URLs to new structure #8

@shntnu

Description

This repository may contain references to JUMP profile data that need to be updated to reflect the new directory structure.

Context

The JUMP Cell Painting profiles have been reorganized to a new, cleaner structure. See jump-cellpainting/datasets#155 for details.

Required Changes

Your repository may contain references to the old profile paths that need to be updated:

Old → New Path Mappings

  • /workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet → /workspace/profiles_assembled/ORF/v1.0a/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet

  • /workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier.parquet → /workspace/profiles_assembled/ORF/v1.0a/profiles_wellpos_cc_var_mad_outlier.parquet

  • /workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected.parquet → /workspace/profiles_assembled/CRISPR/v1.0a/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected.parquet

  • /workspace/profiles/jump-profiling-recipe_2024_a917fa7/CRISPR/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony_PCA_corrected/profiles_wellpos_cc_var_mad_outlier.parquet → /workspace/profiles_assembled/CRISPR/v1.0a/profiles_wellpos_cc_var_mad_outlier.parquet

  • /workspace/profiles/jump-profiling-recipe_2024_a917fa7/COMPOUND/profiles_var_mad_int_featselect_harmony/profiles_var_mad_int_featselect_harmony.parquet → /workspace/profiles_assembled/COMPOUND/v1.0/profiles_var_mad_int_featselect_harmony.parquet

  • /workspace/profiles/jump-profiling-recipe_2024_a917fa7/COMPOUND/profiles_var_mad_int_featselect_harmony/profiles_var_mad_int.parquet → /workspace/profiles_assembled/COMPOUND/v1.0/profiles_var_mad_int.parquet

  • /workspace/profiles/jump-profiling-recipe_2024_0224e0f/ALL/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet → /workspace/profiles_assembled/ALL/v1.0b/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet

  • /workspace/profiles/jump-profiling-recipe_2024_0224e0f/ALL/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect.parquet → /workspace/profiles_assembled/ALL/v1.0b/profiles_wellpos_cc_var_mad_outlier_featselect.parquet
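The mappings above appear to follow one rule: drop the recipe directory and the pipeline subdirectory, keep the dataset name and the file name, and insert a version directory (v1.0a for ORF and CRISPR, v1.0b for ALL, v1.0 otherwise). The Python sketch below expresses that rule so it can be checked against individual paths. The names `VERSION_BY_DATASET`, `OLD_PATH`, `new_profile_path`, and `update_paths` are hypothetical and exist only for illustration; the fallback to v1.0 for any other dataset is an assumption inferred from the COMPOUND entries.

```python
import re

# Dataset -> version directory, read off the mappings listed above.
VERSION_BY_DATASET = {"ORF": "v1.0a", "CRISPR": "v1.0a", "ALL": "v1.0b"}
DEFAULT_VERSION = "v1.0"  # assumption: used for COMPOUND and any other dataset

# Old layout: .../profiles/jump-profiling-recipe_2024_<7-char hash>/<DATASET>/<pipeline>/<file>.parquet
# Each path segment is matched with [^/\s]+ so a match cannot run across
# slashes or whitespace into a second path on the same line.
OLD_PATH = re.compile(
    r"/workspace/profiles/jump-profiling-recipe_2024_[a-z0-9]{7}/"
    r"(?P<dataset>[A-Z]+)/[^/\s]+/(?P<filename>[^/\s]+\.parquet)"
)


def new_profile_path(match):
    """Build the new-layout path from one regex match of an old-layout path."""
    dataset = match.group("dataset")
    version = VERSION_BY_DATASET.get(dataset, DEFAULT_VERSION)
    return f"/workspace/profiles_assembled/{dataset}/{version}/{match.group('filename')}"


def update_paths(text):
    """Rewrite every old-layout profile path found in `text`."""
    return OLD_PATH.sub(new_profile_path, text)
```

Calling `update_paths` on a line of code or configuration returns the same line with each old profile path rewritten; text without an old path is returned unchanged. Anything in front of `/workspace/`, such as a bucket prefix, is left as it was.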

Update Script

The following AWK script by @afermg applies these mappings generically, so it handles all profile paths rather than only the ones listed above:

Create a file named update_cpg_location.awk:

# Update the paths of cpg files
# /workspace/profiles/jump-profiling-recipe_2024_a917fa7/ORF/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet
# Is converted to
# /workspace/profiles_assembled/ORF/v1.0a/profiles_wellpos_cc_var_mad_outlier_featselect_sphering_harmony.parquet

BEGIN {
    pattern = "/workspace/profiles/jump-profiling-recipe_2024_[a-z0-9]{7}/([A-Z]+)/.+/(.+[.]parquet)";
}

{
    if (match($0, pattern, captures)){
        version_name = "v1.0";
        if (captures[1]=="ORF" || captures[1]=="CRISPR"){
            version_name = version_name "a";
        };
        
        if (captures[1]=="ALL"){
            version_name = version_name "b";
        };
        replacement = "/workspace/profiles_assembled/" captures[1] "/" version_name "/" captures[2];
        gsub(pattern,replacement);
    };
    print $0
}

To update all relevant files in your codebase:

# Find and update all files containing old profile paths
rg "workspace/profiles/jump-profiling-recipe_2024" -t py -t json -t md -t sh -t org -t csv -t nix -l | xargs awk -i inplace -f update_cpg_location.awk

Note for macOS users: You'll need GNU awk for this script. Install it with brew install gawk and use gawk instead of awk in the command above.

This command:

  • Uses ripgrep (rg) to find files containing the old paths
  • -t restricts the search to the listed file types
  • -l prints only the names of matching files
  • awk -i inplace (a GNU awk extension) modifies those files in place

Important: After running the AWK script, always review the changes with git diff to ensure the transformations were applied correctly. The script handles most cases, but edge cases or typos in the original paths may require manual adjustment.
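As a complement to git diff, a short scan can confirm that no old-style references survived the rewrite. The Python sketch below is one possible check, not part of the original instructions: `find_stale_references`, `OLD_PREFIX`, and `SUFFIXES` are hypothetical names, and the suffix list only approximates the ripgrep file types used in the command above.

```python
from pathlib import Path

# Substring that identifies an old-layout profile path (same string the rg command searches for).
OLD_PREFIX = "workspace/profiles/jump-profiling-recipe_2024"

# Assumption: rough suffix equivalents of the rg types py, json, md, sh, org, csv, nix.
SUFFIXES = {".py", ".json", ".md", ".sh", ".org", ".csv", ".nix"}


def find_stale_references(root):
    """Return (file, line number, line) for every remaining old-path reference under `root`."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in SUFFIXES:
            continue
        if ".git" in path.parts:
            continue  # skip version-control internals
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # unreadable or non-UTF-8 file: skip rather than fail
        for lineno, line in enumerate(text.splitlines(), start=1):
            if OLD_PREFIX in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Running `find_stale_references(".")` from the repository root and getting an empty list suggests the migration is complete for the scanned file types; any entries returned point at lines that still need manual adjustment.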

Additional Note

If your repository also references manifests/profile_index.csv, note that the format has changed from CSV to JSON. See jump-cellpainting/datasets#152 and jump-cellpainting/datasets#155 for details.

Action Required

Please update your code to use the new profile paths. The old paths will be deprecated.

Feel free to reach out if you have any questions or need assistance with the migration.
