This repository contains code related to the manuscript "Integration of 168,000 samples reveals global patterns of the human gut microbiome" by Abdill, Graham et al.
- The `/pipeline` directory contains code used for retrieving, processing, and consolidating the raw data from the Sequence Read Archive.
- `/visualization/setup.R` contains helper functions used in the scripts below.
- `/analysis/make_filtered_data.R` requires one input file, `taxonomic_table.csv`. This can be downloaded from Zenodo (as `taxonomic_table.csv.gz`), decompressed, and used without modification. This script generates the data files used in all the other scripts.
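The decompression step for `taxonomic_table.csv.gz` can be sketched as a small shell function. This is only an illustration: the function name `decompress_table` is hypothetical and not part of this repository, and the location where `make_filtered_data.R` expects to find the resulting file is not stated here, so check the script before running it.

```shell
# Sketch: decompress the Zenodo download without altering its contents.
# Assumption: decompress_table is a hypothetical helper, not repository code.
decompress_table() {
    local archive="$1"             # e.g. taxonomic_table.csv.gz
    local output="${archive%.gz}"  # same path with the .gz suffix removed

    # Refuse inputs that are not .gz files so the input is never overwritten.
    case "$archive" in
        *.gz) ;;
        *) echo "Expected a .gz file: $archive" >&2; return 1 ;;
    esac

    if [ ! -f "$archive" ]; then
        echo "Missing archive: $archive" >&2
        return 1
    fi

    # Write the decompressed copy next to the archive and keep the original.
    gzip -dc "$archive" > "$output" || return 1
    echo "$output"
}
```

Calling `decompress_table taxonomic_table.csv.gz` in the download directory should print the path of the decompressed file, which can then be placed wherever `make_filtered_data.R` reads it from.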
The `/analysis` directory contains code used to generate and evaluate data for the project.
- `pcoa.R` contains the code used for the principal coordinates analysis in Figure 2.
- `rarefaction.R` contains the code used for the taxonomic discovery rate analysis in Figure 1.
- `rarefaction_diversity.R` contains the code used for the rarefaction analysis of Shannon diversity described in Figure 2.
- `cluster_evaluation.R` contains the code used for the bootstrap analysis of clustering strength described in the manuscript.
- `pca.R` contains the code used for the principal components analysis used to determine regional signatures described in the manuscript. It relies on one external file, `sample_metadata.tsv`, available in the paper's associated Zenodo repository.
- `country_inference_check.R` contains the code used for the manual evaluation of the accuracy of the world region inference steps. The power calculation is first, followed by the procedure used to generate the randomly selected samples to validate.
- `phylogenetic.sh` describes generating the Greengenes2-based classifications.
- The gain analysis illustrated in Supplementary Figure 5 has several files:
  - `gain_setup.sh` does the data preparation.
  - `gain_iteration.sh` performs a proportion of the permutations.
  - `gain.R` plots the data as seen in the figure.
- `evident.R` shows the script used to calculate the effect sizes illustrated in Figure 3G.
- Several files show the process for the PERMANOVA analysis described in the results section about Figure 3:
  - `filter_dist.py` does the data preparation.
  - `permanova.R` is the script used to run the analysis.
The `/visualization` directory contains the R code used to generate the figures in our manuscript.
- `map_setup.R` lists the steps for installing the dependencies for generating the map in Figure 2A.
- `setup.R` loads helper functions used in the generation of several figures.
- `figure1.R` generates the panels in Figure 1 and associated supplementary material.
  - It requires one external file, `rarefaction.rds`, that is stored in the `/data` directory.
- `figure2.R` generates the panels in Figure 2 and associated supplementary material. It requires several external files:
  - In the `data/` directory:
    - `rarefaction_diversity.rds`
    - `regions.csv`
  - From the paper's associated Zenodo repository:
    - `sample_metadata.tsv`
  - Generated by `pcoa.R`:
    - `nmds.rds`
    - `pcoa_points.rds`
- `figure3.R` generates Figure 3 and its supplements. It requires several external files:
  - `sra_samples.tsv`, available from the publication as Supplementary Table 7
  - `tech.txt` in the `/data` directory
  - `diff_abundance_results_20240705.tsv` in the `/data` directory
- Figure 4 and its supplements are generated by code across several files:
  - `figure4A.R` generates the panels in Figure 4A, and calls the code for generating Figures 4B and 4C. It requires several external files:
    - Figure 4A requires one external file, `unfiltered_rarefaction_by_read.rds`, that is stored in the `/data` directory.
    - Figure 4C requires `sample_metadata.tsv` from Zenodo.
    - Figures 4C–F require `metadata_from_rpackage.rds` from the `/data` directory.
  - `figure4D.R` and `figure4EF.R` generate the remaining panels.
    - These require `taxa_names.txt` from the `/data` directory.
- `figure5.R` generates the panels in Figure 5. It requires several external files, all available in the `/data` directory:
  - `diff_taxa_counts_for_5A.rds`
  - `fig5A_labels.rds`
  - `diff_abundant_pvalues_for_5B.rds`
  - `metadata_for_diffAbundance.rds`
  - `taxon_names.tsv`
- `figure6.R` generates the panels in Figure 6 and its supplements. It requires several external files, all available in the `/data` directory:
  - `compendium_metadata.csv`
  - `compendium_pca.csv`
  - `country_cluster_bootstrap.100min.rds`
  - `country_cluster_bootstrap.100min.REAL.rds`
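Because each figure script depends on specific input files, it can help to confirm they are present before running a script. The following is a minimal sketch of such a check. The helper `check_inputs` is hypothetical and not part of this repository, and the directory you pass to it is an assumption about where your copy of the data lives.

```shell
# Sketch: report which expected input files are missing from a directory.
# Assumption: check_inputs is a hypothetical helper, not repository code.
check_inputs() {
    local dir="$1"   # directory expected to hold the files, e.g. data
    shift
    local missing=0
    local name
    for name in "$@"; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing: $dir/$name" >&2
            missing=1
        fi
    done
    # Return 0 only when every listed file was found.
    return "$missing"
}
```

For example, assuming the working directory is the repository root, `check_inputs data compendium_metadata.csv compendium_pca.csv country_cluster_bootstrap.100min.rds country_cluster_bootstrap.100min.REAL.rds` would list any of the `figure6.R` inputs that are absent and return a nonzero status if at least one is missing.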
If you have any questions, please contact the corresponding author, Ran Blekhman, at blekhman (at) uchicago.edu. Thanks.