This repo contains files, scripts, and analysis related to exploring integration of single-cell and single-nuclei data.
Package dependencies for the analysis workflows in this repository are managed using renv.
For renv to work as intended, you'll need to work within the sc-data-integration.Rproj project in RStudio.
You may need to run renv::restore() upon opening the project to ensure the renv.lock file is synced with the project library.
Each time you install or use new packages, you will want to run renv::snapshot() to update the renv.lock file with any added package and dependencies necessary to run the analyses and scripts in this repo.
If there are dependencies you want to include that are not captured automatically by renv::snapshot(), add them to components/dependencies.R with a call to library() and an explanatory comment.
For example, if dplyr were recommended but not required by a package and you wanted to make sure to include it in the lockfile, you would add library(dplyr) to components/dependencies.R.
Then rerun renv::snapshot().
The main workflow for the integration scripts is written with Snakemake, which will handle most dependencies internally, including the renv environment.
You will need the latest version of Snakemake and the peppy Python package.
The easiest way to install these is with conda and/or mamba, which you will want to set up to use the bioconda and conda-forge channels using the following series of commands:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
You can then install snakemake and peppy into your preferred environment with:
mamba install snakemake peppy
(Use conda install if you do not have mamba installed.)
Python-based environments will be built automatically by Snakemake when the workflow is run, but the environment for R should be built before running the workflow.
To create or update the necessary environment for the R scripts, which includes an isolated version of R, pandoc, and the renv package installation, run the following command from the base of the repository:
bash setup_envs.sh
This script will use Snakemake to install all necessary components for the workflow in an isolated Conda environment. If you are on an Apple Silicon (M1/M2/Arm) Mac, this should properly handle setting up R to use an Intel-based build for compatibility with Bioconductor packages.
This installation may take up to an hour, as all of the R packages will likely have to be compiled from scratch. However, this should be a one-time cost, and ensures that you have all of the tools for the workflow installed and ready.
To use the environment you have just created, you will need to run Snakemake with the --use-conda flag each time.
If there are updates to the renv.lock file only, those can be applied with the following command (on any system):
snakemake --use-conda -c2 build_renv
For exploring data integration, we used test datasets obtained from the Human Cell Atlas (HCA) Data Portal and the Single-cell Pediatric Cancer Atlas (ScPCA) Portal, as well as simulated single-cell data published in Luecken et al. (2022).
All data from the HCA that we are using can be found in the private S3 bucket, s3://sc-data-integration/human_cell_atlas_data.
The simulated data can be downloaded directly from figshare.
All gene expression data used for benchmarking is stored in the private S3 bucket, s3://sc-data-integration.
In order to access these files, you must be a Data Lab staff member and have credentials set up for AWS.
Inside the sample-info folder is metadata related to datasets used for testing data integration.
<project_name>-project-metadata.tsv: This file contains information about each of the projects that are being used for testing integration from a given area (e.g., HCA, simulated, ScPCA). Each row in this file corresponds to a project, dataset, or group of libraries that should be integrated together. All project-metadata.tsv files must contain a project_name column, but may also contain other relevant project information such as the following:
| column_id | contents |
|---|---|
| project_name | The shorthand project name |
| source_id | Unique ID associated with the project |
| tissue_group | Tissue group the project belongs to (e.g. blood, brain, or kidney) |
| files_directory | Files directory on S3 |
| metadata_filename | Name of the metadata file obtained from the HCA |
| celltype_filename | Name of the file containing cell type information as found on the HCA |
| celltype_filetype | Format of the cell type file available on the HCA |
hca-library-metadata.tsv: This file contains information about each library from every project that is being used as a test dataset from the HCA. Each row in this file corresponds to a library and contains the following columns:
| column_id | contents |
|---|---|
| sample_biomaterial_id | Unique ID associated with the individual sample |
| library_biomaterial_id | Unique ID associated with the individual library that was sequenced |
| bundle_uuid | UUID for the individual folder containing each loom file |
| project_name | The shorthand project name assigned by the HCA |
| source_id | Unique ID associated with the project |
| tissue_group | Tissue group the project belongs to (e.g. blood, brain, or kidney) |
| technology | Sequencing/library technology used (e.g. 10Xv2, 10Xv3) |
| seq_unit | Sequencing unit (cell or nucleus) |
| diagnosis | Indicates whether the sample came from diseased or normal tissue |
| organ | Tissue, as specified by the HCA, from which the sample was obtained |
| organ_part | Tissue region, as specified by the HCA, from which the sample was obtained |
| selected_cell_types | Identifies the group of cells selected for prior to sequencing, otherwise NA |
| s3_files_dir | Files directory on S3 |
| loom_file | Loom file name in the format tissue_group/project_name/bundle_uuid/filename |
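The loom_file format described in the table above can be split back into its parts with a few lines of Python. This is only an illustrative sketch: the LoomPath type, the parse_loom_file helper, and the example value are hypothetical and are not part of this repository's scripts.

```python
from typing import NamedTuple


class LoomPath(NamedTuple):
    """The four components of a loom_file value (hypothetical helper type)."""
    tissue_group: str
    project_name: str
    bundle_uuid: str
    filename: str


def parse_loom_file(loom_file: str) -> LoomPath:
    """Split tissue_group/project_name/bundle_uuid/filename into its parts."""
    parts = loom_file.strip("/").split("/")
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 path components, got {len(parts)}: {loom_file!r}"
        )
    return LoomPath(*parts)


# Made-up value for illustration; real values come from hca-library-metadata.tsv
example = parse_loom_file("blood/example-project/example-uuid/example.loom")
print(example.tissue_group, example.filename)
```

Rejecting values that do not have exactly four components makes malformed rows fail loudly rather than silently shifting fields.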
<project_name>-processed-libraries.tsv: This file contains the list of libraries from each project that are being used for testing data integration. This metadata file is required for most scripts, including running scpca-downstream-analyses using 01-run-downstream-analyses.sh and for running the integration workflow. This file must contain the following columns, but may also contain additional columns related to a given dataset:
| column_id | contents |
|---|---|
| sample_biomaterial_id | Unique ID associated with the individual sample |
| library_biomaterial_id | Unique ID associated with the individual library that was sequenced |
| project_name | The shorthand project name |
| integration_input_dir | The directory containing the SingleCellExperiment objects to be used as input to the data integration Snakemake workflow |
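Because the four columns above are required, it can help to check a processed-libraries file before running anything against it. The following is a minimal sketch using only the Python standard library. The missing_columns helper and the in-memory sample file are hypothetical; the workflow itself may validate its inputs differently.

```python
import csv
import io

# Required columns, taken from the table above
REQUIRED_COLUMNS = {
    "sample_biomaterial_id",
    "library_biomaterial_id",
    "project_name",
    "integration_input_dir",
}


def missing_columns(tsv_handle) -> set:
    """Return the required columns that are absent from the TSV header."""
    reader = csv.reader(tsv_handle, delimiter="\t")
    header = next(reader, [])  # an empty file yields an empty header
    return REQUIRED_COLUMNS - set(header)


# Made-up file contents that lack integration_input_dir
sample = io.StringIO(
    "sample_biomaterial_id\tlibrary_biomaterial_id\tproject_name\n"
    "s1\tl1\tdemo\n"
)
print(sorted(missing_columns(sample)))  # ['integration_input_dir']
```

Additional dataset-specific columns are ignored, which matches the statement that the file may contain extra columns.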
hca-celltype-info.tsv: This file is not available in the repo and is stored in the private S3 bucket, s3://sc-data-integration/sample-info. This file contains all available cell type information for projects listed in hca-project-metadata.tsv. This file was created using scripts/00a-reformat-celltype-info.R, which takes as input the cell type information available for each project from the Human Cell Atlas Data Portal. The cell type information for each project, in its original format, is stored in s3://sc-data-integration/human_cell_atlas_data/celltype. Each row corresponds to a single cell and contains the following information:
| column_id | contents |
|---|---|
| sample_biomaterial_id | Unique ID associated with the individual sample |
| library_biomaterial_id | Unique ID associated with the individual library that was sequenced |
| project | The shorthand project name assigned by the HCA |
| barcode | The unique cell barcode |
| celltype | The assigned cell type for a given barcode, obtained from cell type data stored in s3://sc-data-integration/human_cell_atlas_data/celltype |
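Since each row of hca-celltype-info.tsv describes one cell, summarizing a library amounts to counting rows. The sketch below tallies cell types for a single library from a local copy of the file. The celltype_counts helper and the sample rows are hypothetical and shown only to illustrate the column layout.

```python
import csv
import io
from collections import Counter


def celltype_counts(tsv_handle, library_id: str) -> Counter:
    """Count cells per cell type for one library_biomaterial_id."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    return Counter(
        row["celltype"]
        for row in reader
        if row["library_biomaterial_id"] == library_id
    )


# Made-up rows using the documented column names
sample = io.StringIO(
    "sample_biomaterial_id\tlibrary_biomaterial_id\tproject\tbarcode\tcelltype\n"
    "s1\tlib1\tdemo\tAAAC\tT cell\n"
    "s1\tlib1\tdemo\tAAAG\tT cell\n"
    "s1\tlib1\tdemo\tAAAT\tB cell\n"
    "s2\tlib2\tdemo\tAAAC\tmonocyte\n"
)
print(celltype_counts(sample, "lib1"))
```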
All data and intermediate files are stored in the private S3 bucket, s3://sc-data-integration.
The following data can be found in the above S3 bucket within the human_cell_atlas_data folder:
- The loom folder contains the original loom files downloaded from the Human Cell Atlas data portal for each test dataset. Here loom files are nested by tissue_group, project_name, and bundle_uuid.
- The sce folder contains the unfiltered SingleCellExperiment objects saved as RDS files. These SingleCellExperiment objects have been converted from the loom files using the 00-obtain-sce.R script in the scripts directory in this repo. Here RDS files are nested by tissue_group and project_name.
The following data can be found in the S3 bucket within the scib_simulated_data folder:
- The hdf5 folder contains the original hdf5 files for simulated data obtained from figshare.
- The sce folder contains the individual SingleCellExperiment objects stored as rds files after running scripts/00b-obtain-sim-sce.R and scripts/00c-create-sim1-subsets.R.
A separate reference-files folder contains any reference files needed for processing datasets, such as the gtf file needed to generate the mitochondrial gene list found in the reference-files folder in the repository.
In order to access these files, you must be a Data Lab staff member and have credentials set up for AWS. Additionally, some of the scripts in this repository require use of AWS command line tools. We have previously written up detailed instructions on installing the AWS command line tools and configuring your credentials that can be used as a reference.
After AWS command line tools have been set up, the SingleCellExperiment objects found in s3://sc-data-integration/human_cell_atlas_data/sce can be copied to your local computer by running the 00-obtain-sce.R script with the --copy_s3 flag.
Rscript scripts/00-obtain-sce.R --copy_s3
This will copy any SingleCellExperiment objects for libraries listed in hca-processed-libraries.tsv that have already been converted from loom files.
If any libraries listed in hca-processed-libraries.tsv do not have corresponding SingleCellExperiment objects, running 00-obtain-sce.R will also convert those loom files.
The human_cell_atlas_results/scpca-downstream-analyses folder contains all processed SingleCellExperiment objects and the output from running the core workflow in scpca-downstream-analyses.
Within this folder each library that has been processed has its own folder that contains both the processed SingleCellExperiment object and an html summary report.
The SingleCellExperiment objects in this folder have been filtered to remove empty droplets and then run through scpca-downstream-analyses using the scripts/01-run-downstream-analyses.sh script.
This means they contain a logcounts assay with the normalized counts matrix, both PCA and UMAP embeddings, and clustering assignments that can be found in the louvain_10 column of the colData.
The SingleCellExperiment objects present in human_cell_atlas_results/scpca-downstream-analyses should be the objects used as input for integration methods.
These files were produced and synced to S3 using the following script:
Note: To run the below script, you must have available in your path R (v4.1.2), Snakemake and pandoc.
pandoc must be version 1.12.3 or higher, which can be checked using the pandoc -v command.
bash scripts/01-run-downstream-analyses.sh \
--downstream_repo <full path to scpca-downstream-analyses-repo> \
--s3_bucket "s3://sc-data-integration/human_cell_atlas_results/scpca-downstream-analyses"
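The pandoc version requirement noted above (1.12.3 or higher) can also be checked programmatically. The sketch below parses text of the kind printed by pandoc -v and compares it against that minimum. It assumes the first line of the output contains the version number, as in "pandoc 2.19.2". The helper names are hypothetical and not part of this repository.

```python
import re

# Minimum pandoc version stated in the note above
MIN_PANDOC = (1, 12, 3)


def parse_pandoc_version(version_output: str) -> tuple:
    """Extract the version from the first line of `pandoc -v` output."""
    lines = version_output.splitlines()
    match = re.search(r"\d+(?:\.\d+)*", lines[0]) if lines else None
    if match is None:
        raise ValueError("could not find a version number in pandoc output")
    return tuple(int(part) for part in match.group(0).split("."))


def pandoc_is_supported(version_output: str) -> bool:
    """Return True if the reported version is at least MIN_PANDOC."""
    version = parse_pandoc_version(version_output)
    # Pad short versions so that (3, 1) compares as (3, 1, 0)
    padded = version + (0,) * (3 - len(version))
    return padded >= MIN_PANDOC


print(pandoc_is_supported("pandoc 2.19.2"))  # True
print(pandoc_is_supported("pandoc 1.12.2"))  # False
```

In practice the input text could be captured with subprocess.run(["pandoc", "-v"], capture_output=True, text=True).stdout. The functions take a string so they can be exercised without pandoc installed.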
Note: If you wish to run ScPCA data through the integration workflow, rather than HCA data, please see the special guidelines for preparing ScPCA data.
To run the integration workflow, invoke snakemake from the sc-data-integration directory with the following command:
snakemake -c4 --use-conda
You can change the number of cores by replacing the number in the -c4 flag with the number of cores you want to use (here, 4).
Note that you will want to have set up the R conda environment already, especially if you are on an Apple Silicon Mac.
To run the workflow for development, you may wish to specify the config-test.yaml file, which will only run one project through the pipeline to save time:
snakemake -c4 --use-conda --configfile config-test.yaml
Finally, to run the scib_simulated data through the pipeline, use:
snakemake -c4 --use-conda --configfile config-scib_simulated.yaml
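The three invocations above differ only in the core count and the optional config file, so they can be assembled from one small helper when scripting runs. This is a hypothetical convenience sketch: snakemake_command is not part of the repository, and it uses only the flags shown above.

```python
import shlex
from typing import List, Optional


def snakemake_command(cores: int = 4, configfile: Optional[str] = None) -> List[str]:
    """Build the argument list for a workflow run with --use-conda."""
    if cores < 1:
        raise ValueError("cores must be a positive integer")
    args = ["snakemake", f"-c{cores}", "--use-conda"]
    if configfile is not None:
        # e.g. config-test.yaml or config-scib_simulated.yaml
        args += ["--configfile", configfile]
    return args


# Print the command rather than running it
print(shlex.join(snakemake_command(2, "config-test.yaml")))
# snakemake -c2 --use-conda --configfile config-test.yaml
```

Returning a list rather than a single string means the result can be passed directly to subprocess.run without shell quoting concerns.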