-
Notifications
You must be signed in to change notification settings - Fork 2
2. Data Preprocessing and Normalization
In RNA-seq experiments, the number of sequenced reads mapped to a gene depends on multiple factors:
- The gene's own expression level
- The gene's length
- The sequencing depth
- The expression of all other genes within the sample
To accurately compare gene expression across different conditions, normalization is essential to account for the variability in read counts. This variability arises from differences in total RNA content, library complexity, and contamination. Normalization aims to remove systematic effects not associated with the biological differences of interest.
This section covers the steps needed to transform a read count table into data ready for differential expression (DE) analysis using limma, DESeq2, and edgeR.
- 2.0 Creating a Read Counts Matrix
- 2.1 Data Import
- 2.2 Handling Missing Data
- 2.3 Data Transformation
- 2.4 Pattern Exploration
- 2.5 Data Preparation for DGE Analysis
- 2.6 Handling and Converting Gene IDs

A read counts matrix is the essential starting point for DGE analysis. This matrix contains counts of sequencing reads mapped to each gene across multiple samples. Each row represents a gene, and each column corresponds to a sample. The values in the matrix indicate the number of reads aligning to each gene, providing a measure of expression.
Warning
The read counts matrix serves as the foundation for DGE analysis, as it quantifies expression levels that statistical models can use to test for significant differences between conditions. Proper generation of the counts matrix is crucial for accurate results, as it ensures that the data reflects genuine biological variation rather than technical noise.
Loading Libraries
# Load necessary libraries
library(DESeq2)
library(edgeR)
library(limma)
library(readr)Importing Read Count Table
# Import your read count table from Github
urlfile="https://raw.githubusercontent.com/merlinis12/RNA-Seq-Data-Analysis-in-R/refs/heads/main/data/airway_scaledcounts.csv"
count_data_full <- read_csv(url(urlfile))
# View data dimensions and initial rows
dim(count_data_full)
head(count_data_full)
#gene names as rownames
count_data <- count_data_full[,-1]
rownames(count_data) <- count_data_full[[1]]Caution
Ensure that the rows represent genes and columns represent samples.
Importing Metadata
#read metadata from github
urlfile="https://raw.githubusercontent.com/merlinis12/RNA-Seq-Data-Analysis-in-R/refs/heads/main/data/airway_metadata.csv"
meta_data <- read_csv(url(urlfile))
meta_data$dex <- as.factor(meta_data$dex)
# View data dimensions and initial rows
dim(meta_data)
head(meta_data)
# Define experimental groups
ids <- colnames(count_data)
group <- meta_data[which(meta_data$id == ids),][['dex']]# Modify as per your setup
design <- model.matrix(~group)RNA-Seq data can sometimes contain missing values in the read count table due to technical or biological reasons. These missing values can interfere with differential expression analysis if not handled appropriately. This section explains best practices and methods for handling missing data, ensuring the reliability and robustness of downstream analyses
Understanding Missing Data in RNA-Seq Common Causes of Missing Data
- Low Expression Levels: Genes with very low or no expression in certain samples may have missing or zero counts.
- Technical Failures: Issues during sequencing, library preparation, or data processing can lead to missing values.
- Bioinformatics Processing Errors: Errors in mapping or counting can also result in missing values.
Warning
Proper handling is necessary to ensure accurate results. Missing values can lead to biased estimates of gene expression and affect the power of differential expression analysis.
Several strategies are available to handle missing data in RNA-Seq read count tables:
Tip
Best Practices for Missing Values
- Avoid Imputation: Imputation can introduce bias, particularly in sparse RNA-Seq data. But, if you want to impute your data, compare a model with >no imputation with the model with imputed data.
- Replace with Zeros Where Appropriate: If missing values correspond to no expression, replacing with zero can be suitable.
- Filter Genes with High Proportions of Missing Values: Removing genes with excessive missing values can reduce noise in the analysis.
- Inspect and Justify Decisions: Document and justify your approach to handling missing data, as different datasets may require different methods.
Imputation of Missing Values
Imputation involves replacing missing values with estimated values. However, imputation is generally avoided in RNA-Seq, as it can introduce bias, especially with highly variable genes. Simple imputation methods include filling in missing counts with the mean or median of non-missing values, but more complex methods (like K-nearest neighbors or predictive models) might also be used for certain analyses.
Note
Imputation is not commonly recommended for RNA-Seq data because the sparsity of low-count genes can be biological, and replacing these values may interfere with the accuracy of differential expression analysis.
Removing Genes with Too Many Missing Values
A common approach is to filter out genes with excessive missing values. For example, you might remove genes that are missing in more than 20% of samples. This approach ensures that low-quality genes do not skew the results.
Zero Replacement for Missing Counts
In cases where missing values correspond to zero expression (i.e., genes are not expressed in certain conditions), it may be appropriate to replace missing values with zeros. Many RNA-Seq analysis tools, such as DESeq2 and edgeR, expect zero values instead of missing entries, as they interpret zero counts as evidence of no expression.
Warning
Since log(0)=−∞, you might try using pseudocounts, i. e. y=log(n+1) or more generally, y=log(n+ n0), where n represents the count values and n0 is some positive constant. More info here.
Practical Steps to Handle Missing Data in R
# Load the required libraries
library(DESeq2)
library(edgeR)
library(limma)
# Load your data
read_counts <- read.table("read_counts.txt", header = TRUE, row.names = 1)
# Check for missing values
missing_counts <- sum(is.na(read_counts))
cat("Number of missing values in the read count table:", missing_counts)If you decide to replace missing values with zeros (assuming missing values represent no expression), use the following code:
# Replace Missing Values with Zero
# Replace NA with zero
read_counts[is.na(read_counts)] <- 0Alternatively, you can filter out genes with too many missing values. For instance, remove genes missing in more than 20% of samples:
# Filter Genes with Excessive Missing Values
# Define a threshold for missing data (20%)
threshold <- 0.2
# Calculate the proportion of missing values per gene
missing_proportion <- rowMeans(is.na(read_counts))
# Filter out genes with missing values above the threshold
filtered_counts <- read_counts[missing_proportion <= threshold, ]In differential gene expression (DGE) analysis, transforming normalized read counts before analysis is often beneficial. Transformations, such as log transformations, help stabilize variance across samples, making downstream analyses like clustering or visualization more interpretable. This section covers several transformation methods and provides code examples in R.
Overview of Read Count Transformation
Most DGE tools operate on raw or normalized count values. However, transformations are often applied to facilitate downstream analyses. Common transformations include:
- Log transformation: This helps reduce the skew in data, especially for high-count genes.
-
Variance-stabilizing transformations (VST): Methods like regularized log transformation (
rlog()) in DESeq2 help address heteroskedasticity by shrinking variance for low-count genes.
Log2 Transformation
Log transformations are commonly applied using base 2 (log2), which is easier to interpret in biological terms (doubling, halving, etc.). When using log transformation on count data, a pseudocount (e.g., +1) is added to avoid taking the log of zero.
# Log2 transformation with pseudocount
log.norm.counts <- log2(counts.sf_normalized + 1)
# Visualize transformed data
par(mfrow=c(2,1))
boxplot(counts.sf_normalized, notch=TRUE, main="Untransformed read counts", ylab="Read counts")
boxplot(log.norm.counts, notch=TRUE, main="Log2-transformed read counts", ylab="Log2(read counts)")This transformation is particularly useful for creating visualizations, as it compresses the range of read counts, making patterns more discernible.
Visualizing Transformed Counts and Checking for Variance Homogeneity
Heteroskedasticity occurs when variance changes with the mean, often observed in RNA-Seq data. Visual checks for heteroskedasticity are helpful before choosing a transformation method.
# Pairwise scatter plot to inspect count similarity across samples
plot(log.norm.counts[,1:2], cex=.1, main="Normalized log2(read counts)")
# Mean-SD plot to check for variance dependency
library(vsn)
library(ggplot2)
msd_plot <- meanSdPlot(log.norm.counts, ranks=FALSE, plot=FALSE)
msd_plot$gg + ggtitle("Sequencing depth normalized log2(read counts)") + ylab("Standard deviation")An upward trend on the left side of the mean-SD plot indicates that low-count genes show higher variability, suggesting the need for variance stabilization.
Regularized Log Transformation (rlog)
DESeq2's rlog() function applies regularized log transformation, which shrinks the variance of low-count genes. This transformation is effective for datasets with highly variable counts across genes, as it uses the entire dataset's variance-mean trend for adjustment.
Note: Set the
blindparameter toFALSEif you expect strong differences between experimental conditions; otherwise, set it toTRUEto avoid overestimating dispersion.
# Regularized log transformation
DESeq.rlog <- rlog(DESeq.ds, blind=TRUE)
rlog.norm.counts <- assay(DESeq.rlog)
# Mean-SD plot for rlog-transformed data
msd_plot <- meanSdPlot(rlog.norm.counts, ranks=FALSE, plot=FALSE)
msd_plot$gg + ggtitle("rlog-transformed read counts") + ylab("Standard deviation")This approach provides more stable variance across low-expression genes, making it suitable for downstream clustering and visualization tasks.
Variance-Stabilizing Transformation (VST)
The varianceStabilizingTransformation() function in DESeq2 provides an alternative variance-stabilizing transformation, especially useful for large datasets.
# Apply VST transformation
vst.counts <- varianceStabilizingTransformation(DESeq.ds, blind=TRUE)
# Visualize with a mean-SD plot
msd_plot <- meanSdPlot(assay(vst.counts), ranks=FALSE, plot=FALSE)
msd_plot$gg + ggtitle("VST-transformed read counts") + ylab("Standard deviation")Summary of Transformations
Each transformation has its advantages:
- Log2 Transformation: Quick, simple, and effective for visualizations.
- Regularized Log Transformation (rlog): Useful for smaller datasets with high variability.
- Variance-Stabilizing Transformation (VST): Suitable for larger datasets, producing a stable variance profile.
Tip
- Apply transformations on sequencing-depth-normalized counts.
- Visual checks, such as mean-SD plots, are essential for assessing variance homogeneity.
- Choose transformation methods based on dataset size and variability.
Before conducting DGE analysis, it’s important to explore patterns across your samples. This step helps you understand sample-to-sample variability, identify any outliers, and assess the similarity between replicates. Three common methods for examining global patterns in RNA-Seq data include:
- Pairwise Correlation
- Hierarchical Clustering
- Principal Component Analysis (PCA)
Warning
Pattern exploration of the read counts should be done on normalized and preferably transformed read counts, so that the high variability of low read counts does not occlude potentially informative data trends.
Pairwise Correlation Analysis
Pairwise correlation is a straightforward method to assess similarity between samples. High correlations between replicates indicate consistency, while low correlations might signal technical issues or biological differences.
Steps to Calculate Pairwise Correlation:
- Prepare normalized data: Make sure your read counts are normalized for sequencing depth.
- Calculate correlation matrix: Compute correlations for all pairs of samples using a suitable method (e.g., Pearson or Spearman correlation).
- Visualize correlation matrix: Use a heatmap to visualize correlations, which allows easy identification of high or low similarity between samples.
Example Code in R
# Install required packages if necessary
if (!requireNamespace("pheatmap", quietly = TRUE)) install.packages("pheatmap")
# Load library
library(pheatmap)
# Assume `norm_counts` is a matrix of normalized counts (e.g., log2 transformed)
# Calculate pairwise correlation
cor_matrix <- cor(norm_counts, method="pearson")
# Plot heatmap of the correlation matrix
pheatmap(cor_matrix, main="Pairwise Sample Correlation", display_numbers=TRUE)This heatmap provides a quick overview of which samples are similar or dissimilar based on their gene expression profiles. Values close to 1 indicate high similarity, while lower values suggest potential differences.
Hierarchical Clustering
Hierarchical clustering groups samples based on their similarity, providing a tree-like representation (dendrogram) that visually illustrates relationships among samples. This is particularly useful for detecting clusters of similar samples, identifying outliers, and understanding sample grouping.
Steps for Hierarchical Clustering:
- Calculate a distance matrix: Convert the correlation matrix or normalized data to a distance matrix.
-
Cluster samples: Use hierarchical clustering methods such as
hclust()with a specified linkage method (e.g., average or complete linkage). - Plot a dendrogram: Visualize the hierarchical relationships between samples.
Example Code in R
# Convert correlation matrix to distance matrix
dist_matrix <- as.dist(1 - cor_matrix) # 1 - correlation to get distances
# Perform hierarchical clustering
hclust_res <- hclust(dist_matrix, method="average")
# Plot dendrogram
plot(hclust_res, main="Hierarchical Clustering of Samples", sub="", xlab="", ylab="Distance")Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique that summarizes the variation across samples in a few components. Each principal component represents a portion of the total variance in the data, and plotting samples based on the first two components often reveals sample groupings and potential outliers.
Steps for Performing PCA
- Prepare the data: Normalize and, if needed, log-transform the read count matrix.
-
Run PCA: Use a function like
prcomp()to perform PCA on the samples. - Plot PCA results: A scatterplot of the first two principal components shows the major sources of variation across samples.
Example Code in R
# Perform PCA
pca_res <- prcomp(t(norm_counts), center=TRUE, scale.=TRUE) # Transpose to analyze samples
# Plot PCA
library(ggplot2)
pca_data <- as.data.frame(pca_res$x)
pca_data$Sample <- rownames(pca_data)
ggplot(pca_data, aes(x=PC1, y=PC2, label=Sample)) +
geom_point(size=3) +
geom_text(vjust=1.5) +
labs(title="PCA of Samples", x="Principal Component 1", y="Principal Component 2") +
theme_minimal()In the PCA plot, samples with similar expression profiles will cluster together. The variance explained by each principal component helps determine which components capture the most variation in the data.
Tip
- Pairwise Correlation reveals sample similarity at a high level.
- Hierarchical Clustering groups similar samples and identifies outliers.
- PCA reduces dimensionality, summarizing the main sources of variation.
Important
Each method requires different normalization and transformation approaches to prepare the data for DE analysis:
-
limma: Use
voomon aDGEListobject to convert raw counts into log2 CPM. -
DESeq2: Create a
DESeqDataSetand apply size factor normalization. -
edgeR: Create a
DGEList, apply TMM normalization, and filter lowly expressed genes.
Data preparation for limma: Converting Counts to CPM and Using voom
limma is typically used for microarray analysis but can also handle RNA-Seq data by first applying voom, which transforms count data into log2 counts-per-million (CPM) with associated precision weights.
# Apply edgeR’s DGEList to organize data
dge <- DGEList(counts = count_data)
# Filter out lowly expressed genes (optional but recommended)
keep <- filterByExpr(dge, design)
dge <- dge[keep, ]
# Use voom transformation to convert to log2 CPM and estimate mean-variance relationship
v <- voom(dge, design)
# Inspect the transformed data
head(v$E)Important
The object v now contains log2-transformed counts with associated weights, suitable for DE analysis in limma.
Data Preparation for DESeq2
DESeq2 directly models raw counts without prior transformation, applying its own size factor-based normalization.
#Creating a DESeq2 Dataset
# Define the sample information as a data frame
col_data <- data.frame(
row.names = colnames(count_data),
condition = factor(c("Control", "Control", "Treatment", "Treatment")) # Modify as per your setup
)
# Create DESeq2 dataset object
dds <- DESeqDataSetFromMatrix(countData = count_data, colData = col_data, design = ~condition)
#Normalizing Counts
# Apply DESeq’s normalization
dds <- estimateSizeFactors(dds)
# Inspect normalized counts (optional)
norm_counts <- counts(dds, normalized = TRUE)
head(norm_counts)Important
The dds object is now prepared for DE analysis. DESeq2 will automatically handle normalization during DE testing, but the size factors are already estimated in this step.
Data Preparation for edgeR
edgeR is a count-based method that applies TMM (trimmed mean of M-values) normalization.
#Creating a DGEList Object and Applying TMM Normalization
# Organize count data into a DGEList
dge <- DGEList(counts = count_data, group = group)
# Apply TMM normalization
dge <- calcNormFactors(dge)
# Inspect normalized factors
dge$samples
#Filtering is recommended to remove genes with low counts across samples.
# Filter lowly expressed genes
keep <- filterByExpr(dge)
dge <- dge[keep, ]
# Recalculate normalization factors after filtering
dge <- calcNormFactors(dge)Important
The dge object is now ready for DE analysis using edgeR. Normalization factors have been calculated and applied, making it suitable for downstream analysis.
Other Data Normalization Approaches
Image source: Dillies et al.
The image shows the effects of different normalization approached for read
count differences due to library sizes (TC, total count; UQ, upper quartile; Med, median; DESeq, size factor; TMM,
Trimmed Mean of M-values; Q, quantile) or gene lengths (RPKM).
Caution
As mentioned in Intro2RNAseq, RPKM and total count normalization should be avoided in the context of DE analysis, no matter how often you see them applied in published studies. RPKM, FPKM etc. are only needed if expression values need to be compared between different genes within the same sample for which the different gene lengths must be taken into consideration.
RNA-Seq data may include gene identifiers (GENE IDs) in various formats (e.g., Ensembl IDs, gene symbols), and these may not always match the requirements of specific analysis tools or annotation databases. This section explains how to handle gene IDs in a read count table and convert them to the appropriate format for tools such as DESeq2, edgeR, and limma.
Most differential expression analysis tools require a consistent gene ID format to:
- Ensure accurate gene mapping and annotations.
- Perform pathway or gene ontology enrichment analysis.
- Compare results with other datasets or external databases.
Common Gene ID Types and Formats:
-
Ensembl Gene IDs: e.g.,
ENSG00000141510for the TP53 gene. -
NCBI Gene IDs: e.g.,
7157for the TP53 gene. -
Gene Symbols: e.g.,
TP53, a more recognizable name, especially for human-readable purposes. -
RefSeq IDs: e.g.,
NM_001126114for specific transcript isoforms of TP53. - Custom IDs: Some datasets might use project-specific IDs or require converting IDs for cross-species analysis.
Tip
Best Practices for Gene ID Handling
- Check the Gene ID Format: Ensure the format aligns with downstream databases and tools.
- Convert Using Reliable Packages: biomaRt and AnnotationDbi are recommended for converting gene IDs.
- Handle Missing Mappings: Remove or address genes without mappings, as they can complicate the analysis.
- Document the Conversion Process: Keeping a record of how IDs were converted can be useful for reproducibility and data sharing.
Practical Steps for Inspecting and Converting Gene IDs
First, load your read count table and inspect the gene ID column to understand the format in use.
# Load the read count table
read_counts <- read.table("read_counts.txt", header = TRUE, row.names = 1)
# Preview the first few rows to inspect gene IDs
head(rownames(read_counts))If the IDs are in a format that doesn’t match your annotation database, they will need to be converted.
Using Bioconductor's biomaRt Package
The biomaRt package provides an interface to access Ensembl’s BioMart database, which allows converting gene IDs between formats. Here’s how to use biomaRt to convert Ensembl IDs to gene symbols (or vice versa).
-
Install and Load
biomaRtif (!requireNamespace("biomaRt", quietly = TRUE)) { install.packages("BiocManager") BiocManager::install("biomaRt") } library(biomaRt)
-
Set Up BioMart
Configure the connection to the Ensembl database and specify your dataset (e.g., human genes).
# Connect to Ensembl BioMart ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl") # Human genes
-
Convert Gene IDs
Use the
getBM()function to retrieve the desired mappings. For example, to convert Ensembl IDs to gene symbols:# List of Ensembl IDs to convert ensembl_ids <- rownames(read_counts) # Retrieve mapping information gene_mapping <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"), filters = "ensembl_gene_id", values = ensembl_ids, mart = ensembl) # Preview the mapping table head(gene_mapping)
-
Merge with Original Data
Merge the gene mapping table with your read count data:
# Merge the mapping table with read counts read_counts_mapped <- merge(gene_mapping, read_counts, by.x = "ensembl_gene_id", by.y = "row.names", all.y = TRUE) # Set gene symbols as row names rownames(read_counts_mapped) <- read_counts_mapped$hgnc_symbol read_counts_mapped <- read_counts_mapped[, -c(1:2)] # Remove ID columns if not needed
Using the AnnotationDbi and org.Hs.eg.db Packages
For human data, AnnotationDbi and org.Hs.eg.db provide another reliable way to convert IDs without needing an internet connection.
-
Install and Load Packages
if (!requireNamespace("AnnotationDbi", quietly = TRUE) || !requireNamespace("org.Hs.eg.db", quietly = TRUE)) { BiocManager::install(c("AnnotationDbi", "org.Hs.eg.db")) } library(AnnotationDbi) library(org.Hs.eg.db)
-
Convert Gene IDs
Suppose the IDs are in Ensembl format, and you want to convert them to gene symbols:
# Convert Ensembl IDs to Gene Symbols gene_symbols <- mapIds(org.Hs.eg.db, keys = rownames(read_counts), column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first") # Update row names in read count data rownames(read_counts) <- gene_symbols
Handling Missing Mappings
Not all Ensembl IDs will have a corresponding gene symbol, which can result in NA values. You may wish to filter out rows with NA symbols to avoid errors in downstream analysis:
# Remove rows with missing gene symbols
read_counts <- read_counts[!is.na(rownames(read_counts)), ]Important
Gene ID Conversion Tools
- biomaRt (R package): provides an interface to Ensembl’s BioMart database, allowing conversion between various gene ID formats, including Ensembl IDs, gene symbols, RefSeq IDs, and more. It also supports batch conversion and offers flexible querying options.
- AnnotationDbi and Organism-Specific Packages (R packages): use AnnotationDbi along with organism-specific databases (e.g., org.Hs.eg.db for humans) to map Ensembl IDs, gene symbols, and other types of identifiers.
- DAVID (Database for Annotation, Visualization, and Integrated Discovery): an online tool for functional annotation and gene ID conversion. DAVID can convert various gene ID types and provides options for enrichment analysis.
- g:Profiler: an online tool that offers batch conversion of gene IDs across multiple species and data types. It also includes functional profiling features for downstream analysis.
- UniProt ID Mapping: allows conversion of protein-related identifiers across a wide range of formats, including Ensembl, NCBI, and UniProt IDs. Recommended for users focused on protein-coding genes or for projects involving proteomics.
- mygene.info: provides a RESTful API for real-time gene ID conversion and annotation, supporting various ID types and offering easy integration with web applications and scripts.
- BioDBnet (Biological Database Network): offers a web interface for ID conversion and retrieval of functional annotations from multiple biological databases. It supports batch processing and provides a range of output options.
- Ensembl BioMart (Web Interface): official web interface for accessing Ensembl’s BioMart, allowing ID conversions and data retrieval across species. It includes customizable output options. No need for programming skills.
- Love, M., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12), 550.
- Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140.
- Ritchie, M. E., et al. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research, 43(7), e47.