Introduction

Batch effects refer to sources of unwanted variation that are unrelated to and can obscure the biological factors of interest. Handling batch effects is a necessary step to improve the reproducibility of research studies. However, methods developed for batch effects often rely on strong assumptions about data characteristics and the types of batch effects.

This hands-on practice provides guidelines for batch effect detection, management, and evaluation of method effectiveness through visual and numerical approaches. Although the practice will illustrate the workflow using microbiome data, the concepts and techniques presented are broadly applicable to any type of omics data.

Case study description

Anaerobic Digestion This study explored microbial indicators that could improve the efficacy of the Anaerobic Digestion (AD) bioprocess and prevent its failure (Chapleur et al. 2016). Samples were treated with two different ranges of phenol concentrations (effect of interest) and processed on five different dates (batch effect). This study exhibits a clear and strong batch effect with an approx. balanced batch x treatment design.

Packages installation and loading

First, let’s load the packages necessary for the analysis, and check the version of each package.

# CRAN
cran.pkgs <- c('pheatmap', 'vegan', 'ruv', 'ggplot2', 
               'performance', 'gridExtra')

# gridExtra:grid.arrange

# install.packages(cran.pkgs)

# Bioconductor
bioc.pkgs <- c('mixOmics', 'sva', 'limma', 'Biobase', 'metagenomeSeq', 
               'PLSDAbatch')

# if(!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# BiocManager::install(bioc.pkgs)  

# load packages 
suppressMessages(suppressWarnings(sapply(c(cran.pkgs, bioc.pkgs), require, 
                                         character.only = TRUE)))

##      pheatmap         vegan           ruv       ggplot2   performance 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##     gridExtra      mixOmics           sva         limma       Biobase 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
## metagenomeSeq    PLSDAbatch 
##          TRUE          TRUE

# print package versions
sapply(c(cran.pkgs, bioc.pkgs), package.version)

##      pheatmap         vegan           ruv       ggplot2   performance 
##      "1.0.12"      "2.6-10"     "0.9.7.1"       "3.5.2"      "0.13.0" 
##     gridExtra      mixOmics           sva         limma       Biobase 
##         "2.3"      "6.30.0"      "3.52.0"      "3.60.6"      "2.64.0" 
## metagenomeSeq    PLSDAbatch 
##      "1.46.0"       "1.1.1"

Load data

Before detecting batch effects, the data should be properly preprocessed. The preprocessing steps may vary depending on the type of omics data.

load(file = './example_ADdata.rda')

ad.clr = example_ADdata$data
ad.trt = example_ADdata$treatment
ad.batch = example_ADdata$batch

dim(ad.clr)

## [1]  75 231

The microbial data include 231 OTUs and 75 samples.

Exercise 1: How many samples are there within each treatment and batch group?

Batch effect detection

Principal component analysis (PCA)

To visualise the data, let’s apply pca() function from mixOmics package and Scatter_Density() function from PLSDAbatch to represent the PCA sample plot with densities.

ad.pca.before <- pca(ad.clr, ncomp = 3, scale = TRUE)

Scatter_Density(object = ad.pca.before, batch = ad.batch, trt = ad.trt, 
                title = 'AD data', trt.legend.title = 'Phenol conc.')

PCA sample plot with densities for the AD data.

Exercise 2: Interpret the PCA plot created above

Boxplots and density plots

We can also visualise individual variables (OTUs in this case). For example, we can select the top OTU driving the major variance in PCA using selectVar() in mixOmics package. We can then plot this OTU as boxplots and density plots using box_plot() and density_plot() in PLSDAbatch.

ad.OTU.name <- selectVar(ad.pca.before, comp = 1)$name[1]

ad.OTU_batch <- data.frame(value = ad.clr[,ad.OTU.name], batch = ad.batch)
head(ad.OTU_batch)

box_plot(df = ad.OTU_batch, title = paste(ad.OTU.name, '(AD data)'), 
         x.angle = 30)

Boxplots of sample values for “OTU28” in the AD data.

density_plot(df = ad.OTU_batch, title = paste(ad.OTU.name, '(AD data)'))

Density plots of sample values for “OTU28” in the AD data.

The boxplot and density plot indicated a strong date effect because of the differences between “14/04/2016”, “21/09/2017” and the other dates in the “OTU28”.

To assess statistical significance, we can apply a linear regression model to “OTU28” using linear_regres() from PLSDAbatch with batch and treatment effects as covariates. To compare “14/04/2016” and “21/09/2017” to the other batches, we need to set them as the reference levels, respectively, using relevel() from stats.

# reference level: 14/04/2016
ad.batch <- relevel(x = ad.batch, ref = '14/04/2016')

ad.OTU.lm <- linear_regres(data = ad.clr[,ad.OTU.name], 
                           trt = ad.trt, 
                           batch.fix = ad.batch, 
                           type = 'linear model')
summary(ad.OTU.lm$model$data)

## 
## Call:
## lm(formula = data[, i] ~ trt + batch.fix)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9384 -0.3279  0.1635  0.3849  0.9887 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.8501     0.2196   3.871 0.000243 ***
## trt1-2               -1.6871     0.1754  -9.617 2.27e-14 ***
## batch.fix09/04/2015   1.5963     0.2950   5.410 8.55e-07 ***
## batch.fix01/07/2016   2.0839     0.2345   8.886 4.82e-13 ***
## batch.fix14/11/2016   1.7405     0.2480   7.018 1.24e-09 ***
## batch.fix21/09/2017   1.2646     0.2690   4.701 1.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7033 on 69 degrees of freedom
## Multiple R-squared:  0.7546, Adjusted R-squared:  0.7368 
## F-statistic: 42.44 on 5 and 69 DF,  p-value: < 2.2e-16

# reference level: 21/09/2017
ad.batch <- relevel(x = ad.batch, ref = '21/09/2017')

ad.OTU.lm <- linear_regres(data = ad.clr[,ad.OTU.name], 
                           trt = ad.trt, 
                           batch.fix = ad.batch, 
                           type = 'linear model')
summary(ad.OTU.lm$model$data)

## 
## Call:
## lm(formula = data[, i] ~ trt + batch.fix)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9384 -0.3279  0.1635  0.3849  0.9887 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.1147     0.2502   8.453 2.97e-12 ***
## trt1-2               -1.6871     0.1754  -9.617 2.27e-14 ***
## batch.fix14/04/2016  -1.2646     0.2690  -4.701 1.28e-05 ***
## batch.fix09/04/2015   0.3317     0.3139   1.056  0.29446    
## batch.fix01/07/2016   0.8193     0.2573   3.185  0.00218 ** 
## batch.fix14/11/2016   0.4759     0.2705   1.760  0.08292 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7033 on 69 degrees of freedom
## Multiple R-squared:  0.7546, Adjusted R-squared:  0.7368 
## F-statistic: 42.44 on 5 and 69 DF,  p-value: < 2.2e-16

We observed P < 0.001 for the regression coefficients associated with all other batches when “14/04/2016” was set as the reference level. This confirms the differences between the samples from batch “14/04/2016” and those from the other batches, as previously observed in the plots.

Exercise 3: Interpret the result when “21/09/2017” was set as the reference level

Heatmap

We can also use a heatmap to visualise the data using pheatmap package. The data first need to be scaled across both OTUs and samples.

# scale the clr data on both OTUs and samples
ad.clr.s <- scale(ad.clr, center = TRUE, scale = TRUE)
ad.clr.ss <- scale(t(ad.clr.s), center = TRUE, scale = TRUE)

ad.anno_col <- data.frame(Batch = ad.batch, Treatment = ad.trt)
ad.anno_colors <- list(Batch = color.mixo(seq_len(5)), 
                       Treatment = pb_color(seq_len(2)))
names(ad.anno_colors$Batch) = levels(ad.batch)
names(ad.anno_colors$Treatment) = levels(ad.trt)

pheatmap(ad.clr.ss, 
         cluster_rows = FALSE, 
         fontsize_row = 4, 
         fontsize_col = 6,
         fontsize = 8,
         clustering_distance_rows = 'euclidean',
         clustering_method = 'ward.D',
         treeheight_row = 30,
         annotation_col = ad.anno_col,
         annotation_colors = ad.anno_colors,
         border_color = 'NA',
         main = 'AD data - Scaled')

Hierarchical clustering of samples in the AD data.

In the heatmap, samples from the batch dated “14/04/2016” clustered together and were distinct from the other samples, indicating a clear batch effect.

Partial redundancy analysis (pRDA)

To quantitatively evaluate batch effects, we can apply pRDA with varpart() function from vegan R package.

ad.factors.df <- data.frame(trt = ad.trt, batch = ad.batch)
head(ad.factors.df)

ad.rda.before <- varpart(ad.clr, ~ trt, ~ batch, 
                         data = ad.factors.df, scale = TRUE)
ad.rda.before$part$indfract

In the result, X1 and X2 represent the first and second covariates fitted in the model. [a] and [b] indicate the independent proportions of variance explained by X1 and X2, respectively, while [c] represents the shared variance between X1 and X2. In the AD data, the variance explained by batch (X2) was larger than that explained by treatment (X1), with some intersection (shared variance) reflected in [c] (Adj.R.squared = 0.013). A greater shared variance suggests a more unbalanced batch × treatment design. Therefore, in this study, we considered the design to be approximately balanced.

Managing batch effects

Accounting for batch effects

The methods we use to account for batch effects include those specifically designed for microbiome data, such as the zero-inflated Gaussian mixture model (see the section “To go further”), as well as methods adapted for microbiome data, including linear regression, Surrogate variable analysis (SVA; see “To go further”) and Remove unwanted variation in 4 steps (RUV4). Among these, SVA and RUV4 are designed to handle unknown batch effects.

Linear regression: linear model (LM) and linear mixed model (LMM)

Linear regression will be conducted using linear_regres() function in PLSDAbatch. We integrated the performance package, which assesses the performance of regression models, into function linear_regres(). Therefore, we can apply check_model() from performance to the outputs of linear_regres() to diagnose the validity of models fitted with treatment and batch effects for each variable (LÃŒdecke et al. 2020).

We can also extract performance metrics such as adjusted R2, RMSE, RSE, AIC and BIC for models fitted with and without batch effects, which are included in the outputs of linear_regres().

Let’s apply type = "linear model" to the AD data, given its balanced batch x treatment design.

ad.lm <- linear_regres(data = ad.clr, 
                       trt = ad.trt, 
                       batch.fix = ad.batch, 
                       type = 'linear model',
                       p.adjust.method = 'fdr')

# p values adjusted for batch effects
ad.p.adj <- ad.lm$adj.p 

check_model(ad.lm$model$OTU12)

Diagnostic plots for the model fitted with batch effects for “OTU12” in the AD data.

To assess the validity of the model fitted with both treatment and batch effects, we can use diagnostic plots to check whether the assumptions are met for each microbial variable. For example, for “OTU12”, the assumptions of linearity (or homoscedasticity) and homogeneity of variance were not satisfied (top panel). No sample was classified as an outlier with a Cook’s distance greater than or equal to 0.9, however, samples “57”, “39”, “47”, “44” and “16” were relatively close to the contour lines. The correlation between batch (batch.fix) and treatment (trt) effects was very low (< 5), indicating a well-fitted model with low collinearity (middle panel). The distribution of residuals was very close to normal (bottom panel). For microbial variables that violate some model assumptions, their results should be interpreted with caution.

For the performance metrics of models fitted with or without batch effects, we show results for a subset of variables as an example only.

head(ad.lm$adj.R2)

The adjusted $R^2$ of the model with both treatment and batch effects was higher for all the listed OTUs compared to the model with treatment effects only, suggesting that including batch effects explained more variance in the data and resulted in a better-fitting model.

We can also compare the AIC of models fitted with and without batch effects.

head(ad.lm$AIC)

A lower AIC indicates a better fit, as seen here for the model fitted with batch effects across all OTUs.

Both results strongly suggest that batch effects should be included in the linear model.

Remove unwanted variation in 4 steps (RUV4)

Before applying RUV4 (RUV4() from ruv package), we need to specify negative control variables and the number of batch factors to estimate.

Empirical negative controls that are not significantly differentially abundant (adjusted P > 0.05), based on a linear regression using treatment information as the only covariate, can be used here. However, in this case, we assume that batch and treatment are not correlated; otherwise, the selection of empirical negative controls will be biased, and the final batch effect adjustment may be spurious.

Therefore, we will use a loop to fit a linear regression for each microbial variable and adjust the p values of the treatment effects for multiple comparisons using p.adjust() from stats. The empirical negative controls can then be identified based on the adjusted p values.

# empirical negative controls
ad.empir.p <- c()
for(e in seq_len(ncol(ad.clr))){
  ad.empir.lm <- lm(ad.clr[,e] ~ ad.trt)
  ad.empir.p[e] <- summary(ad.empir.lm)$coefficients[2,4]
}
ad.empir.p.adj <- p.adjust(p = ad.empir.p, method = 'fdr')
ad.nc <- ad.empir.p.adj > 0.05

The number of batch factors k can be determined using getK() function.

# estimate k
ad.k.res <- getK(Y = ad.clr, X = ad.trt, ctl = ad.nc)
ad.k <- ad.k.res$k

After all required parameters are estimated, let’s apply RUV4() with the known treatment variables, the selected negative control variables, and the estimated number of batch factors k. The resulting p values should also be adjusted for multiple comparisons.

# RUV4
ad.ruv4 <- RUV4(Y = ad.clr, X = ad.trt, ctl = ad.nc, k = ad.k) 
ad.ruv4.p <- ad.ruv4$p
ad.ruv4.p.adj <- p.adjust(ad.ruv4.p, method = "fdr")

Note: A package named RUVSeq has been developed for count data. It provides RUVg() which uses negative control variables, as well as other functions such as RUVs() and RUVr(), which use sample replicates (Moskovicz et al. 2020) or residuals from regression on treatment effects to estimate and account for latent batch effects. However, for CLR-transformed data, we still recommend using the standard ruv package.

Another method, SVA, which accounts for unknown batch effects, can be found in the section “To go further”.

Correcting for batch effects

The methods we will use to correct for batch effects include ComBat, PLSDA-batch and Remove Unwanted Variation-III (RUVIII). Among these, RUVIII is designed to correct for unknown batch effects. Other methods, such as removeBatchEffect, sPLSDA-batch and percentile normalisation (PN), can be found in the section “To go further”.

ComBat

The ComBat() function (from sva package) supports both parametric and non-parametric correction, controlled by the option par.prior. For parametric adjustment, the model’s validity can be assessed by setting prior.plots = T (Leek et al. 2012).

For AD data, we apply a non-parametric correction (par.prior = FALSE), using the input batch grouping information (batch) and the treatment design matrix (mod) to calculate the batch effect corrected data ad.ComBat. We chose the non-parametric approach based on the assumption that microbial abundance data, even after CLR transformation, do not follow a standard distribution.

# the treatment design matrix
ad.mod <- model.matrix( ~ ad.trt)
ad.ComBat <- t(ComBat(t(ad.clr), batch = ad.batch, 
                      mod = ad.mod, par.prior = FALSE))

## Found5batches

## Adjusting for1covariate(s) or covariate level(s)

## Standardizing Data across genes

## Fitting L/S model and finding priors

## Finding nonparametric adjustments

## Adjusting the Data

PLSDA-batch

Before applying PLSDA-batch, we need to specify the optimal number of components related to treatment (ncomp.trt) and batch effects (ncomp.bat).

To determine ncomp.trt, we will use the plsda() from mixOmics with only the treatment grouping information to estimate the optimal number of treatment-related components to retain.

# estimate the number of treatment components
ad.trt.tune <- plsda(X = ad.clr, Y = ad.trt, ncomp = 5)
ad.trt.tune$prop_expl_var #1

## $X
##      comp1      comp2      comp3      comp4      comp5 
## 0.18619506 0.07873817 0.08257396 0.09263497 0.06594757 
## 
## $Y
##      comp1      comp2      comp3      comp4      comp5 
## 1.00000000 0.33857374 0.17315267 0.10551296 0.08185822

We should choose the number of components that explain 100% of the variance in the outcome matrix Y. From the result, 1 component was sufficient to preserve the treatment information.

To determine ncomp.bat, we will use the PLSDA_batch() function (PLSDAbatch package) with both treatment and batch grouping information, along with the specified number of treatment-related components, to estimate the optimal number of batch components to remove.

# estimate the number of batch components
ad.batch.tune <- PLSDA_batch(X = ad.clr, 
                             Y.trt = ad.trt, Y.bat = ad.batch,
                             ncomp.trt = 1, ncomp.bat = 10)
ad.batch.tune$explained_variance.bat

## $X
##      comp1      comp2      comp3      comp4      comp5      comp6      comp7 
## 0.17470922 0.11481264 0.10122717 0.07507395 0.03919879 0.03703988 0.03446079 
##      comp8      comp9     comp10 
## 0.02810058 0.02097436 0.01398721 
## 
## $Y
##       comp1       comp2       comp3       comp4       comp5       comp6 
## 0.247465374 0.261574081 0.230138238 0.260822307 0.230187383 0.267943709 
##       comp7       comp8       comp9      comp10 
## 0.256012373 0.241124438 0.004732098 0.234942402

sum(ad.batch.tune$explained_variance.bat$Y[seq_len(4)]) #4

## [1] 1

Using the same criterion as for selecting treatment components, we will choose the number of batch-related components that explain 100% of the variance in the batch outcome matrix (Y.bat). According to the result, 4 components are required to remove the batch effects.

Then let’s correct for batch effects by applying PLSDA_batch() with the input treatment and batch grouping information, along with the estimated optimal numbers of related components.

ad.PLSDA_batch.res <- PLSDA_batch(X = ad.clr, 
                                  Y.trt = ad.trt, Y.bat = ad.batch,
                                  ncomp.trt = 1, ncomp.bat = 4)
ad.PLSDA_batch <- ad.PLSDA_batch.res$X.nobatch

Note: Comparatively, PLSDA-batch (PLSDAbatch package) is more suitable for weak batch effects, while from the same package, sparse PLSDA-batch is better suited for strong batch effects (see section “To go further”), weighted PLSDA-batch is specifically designed for unbalanced but not nested batch x treatment designs.

Remove Unwanted Variation-III (RUVIII)

The RUVIII() function is from ruv package. Similar to RUV4(), it requires empirical negative control variables and the number of unwanted factors (k) to remove. We will use those estimated in the RUV4 section. In addition, it requires sample replicates, which should be structured into a mapping matrix using replicate.matrix(). With these elements, we can then obtain the batch effect corrected data by applying RUVIII().

ad.metadata <- example_ADdata$metadata
ad.replicates <- ad.metadata$sample_name.data.extraction
head(table(ad.replicates, ad.batch))

##              ad.batch
## ad.replicates 21/09/2017 14/04/2016 09/04/2015 01/07/2016 14/11/2016
##        E1aJ0           0          0          0          1          0
##        E1aJ16          0          0          1          0          0
##        E1aJ40          1          1          0          0          0
##        E1aJ5           0          1          0          0          0
##        E1bJ16          0          0          0          1          0
##        E1bJ40          0          0          0          1          0

ad.replicates.matrix <- replicate.matrix(ad.replicates)

ad.RUVIII <- RUVIII(Y = ad.clr, M = ad.replicates.matrix, 
                    ctl = ad.nc, k = ad.k)
rownames(ad.RUVIII) <- rownames(ad.clr)

Assessing batch effect correction

Methods that detect batch effects

PCA

First, we can compare the PCA sample plots before and after batch effect correction using different methods.

ad.pca.before <- pca(ad.clr, ncomp = 3, scale = TRUE)
ad.pca.ComBat <- pca(ad.ComBat, ncomp = 3, scale = TRUE)
ad.pca.PLSDA_batch <- pca(ad.PLSDA_batch, ncomp = 3, scale = TRUE)
ad.pca.RUVIII <- pca(ad.RUVIII, ncomp = 3, scale = TRUE)

# order batches
ad.batch = factor(ad.metadata$sequencing_run_date, 
                  levels = unique(ad.metadata$sequencing_run_date))

ad.pca.before.plot <- Scatter_Density(object = ad.pca.before, 
                                      batch = ad.batch, 
                                      trt = ad.trt, 
                                      title = 'Before correction')

ad.pca.ComBat.plot <- Scatter_Density(object = ad.pca.ComBat, 
                                      batch = ad.batch, 
                                      trt = ad.trt, 
                                      title = 'ComBat')

ad.pca.PLSDA_batch.plot <- Scatter_Density(object = ad.pca.PLSDA_batch, 
                                           batch = ad.batch, 
                                           trt = ad.trt, 
                                           title = 'PLSDA-batch')

ad.pca.RUVIII.plot <- Scatter_Density(object = ad.pca.RUVIII, 
                                      batch = ad.batch, 
                                      trt = ad.trt, 
                                      title = 'RUVIII')

PCA sample plots with densities before and after batch effect correction in the AD data.

Exercise 4: Interpret the PCA plots created above

We can also compare boxplots and density plots of key variables identified by PCA, as well as heatmaps that highlight distinct patterns before and after batch effect correction. However, due to time limitations, we will not show the results here.

pRDA

As a quantitative measure, we can calculate the global explained variance across all microbial variables using pRDA. The results can be visualised using partVar_plot() from PLSDAbatch package.

# arrange data before and after batch effect correction into a list
ad.corrected.list <- list(`Before correction` = ad.clr, 
                          ComBat = ad.ComBat, 
                          `PLSDA-batch` = ad.PLSDA_batch, 
                          RUVIII = ad.RUVIII)

ad.prop.df <- data.frame(Treatment = NA, Batch = NA, 
                         Intersection = NA, 
                         Residuals = NA) 

# run rda in a loop
for(i in seq_len(length(ad.corrected.list))){
  rda.res = varpart(ad.corrected.list[[i]], ~ trt, ~ batch,
                    data = ad.factors.df, scale = TRUE)
  ad.prop.df[i, ] <- rda.res$part$indfract$Adj.R.squared}

rownames(ad.prop.df) = names(ad.corrected.list)

# change the order of explained variance
ad.prop.df <- ad.prop.df[, c(1,3,2,4)]

# remove values less than zero, and recalculate the proportions
ad.prop.df[ad.prop.df < 0] = 0
ad.prop.df <- as.data.frame(t(apply(ad.prop.df, 1, 
                                    function(x){x/sum(x)})))

partVar_plot(prop.df = ad.prop.df)

Global explained variance before and after batch effect correction in the AD data.

As shown in the figure above, the intersection between batch and treatment variance was small (1.3%), indicating that the batch x treatment design is not highly unbalanced. As a result, unweighted PLSDA-batch remained applicable, and the weighted version was not used. Among all methods, the data corrected with PLSDA-batch showed the best performance, explaining a higher proportion of treatment variance compared to the others. Notably, replicates in AD data are not present across all batches, highlighting the critical role of sample replicates in the effectiveness of RUVIII. Therefore, we calculated pseudo replicates and recalculated the batch effect corrected data. The result showed clear improvement, which can be found in the section “To go further”.

Other methods

$\mathbf{R^2}$

We can also calculate the $R^2$ from one-way ANOVA for each variable to evaluate the proportion of explained variance, using lm() from stats package. To ensure comparability of $R^2$ values across variables, we will scale the corrected data prior to the $R^2$ calculation. The results will then be visualised by ggplot() from ggplot2 R package.

# scale
ad.corr_scale.list <- lapply(ad.corrected.list, 
                             function(x){apply(x, 2, scale)})

# perform one-way ANOVA on each variable within each dataset
ad.r_values.list <- list()
for(i in seq_len(length(ad.corr_scale.list))){
  # for each dataset
  ad.r_values <- data.frame(trt = NA, batch = NA)
  for(c in seq_len(ncol(ad.corr_scale.list[[i]]))){
    # for each variable
    ad.fit.res.trt <- lm(ad.corr_scale.list[[i]][,c] ~ ad.trt)
    ad.r_values[c,1] <- summary(ad.fit.res.trt)$r.squared
    ad.fit.res.batch <- lm(ad.corr_scale.list[[i]][,c] ~ ad.batch)
    ad.r_values[c,2] <- summary(ad.fit.res.batch)$r.squared
  }
  ad.r_values.list[[i]] <- ad.r_values
}
names(ad.r_values.list) <- names(ad.corr_scale.list)
head(ad.r_values.list$`Before correction`)

# generate boxplots for each dataset
ad.boxp.list <- list()
for(i in seq_len(length(ad.r_values.list))){
  ad.boxp.list[[i]] <- 
    data.frame(r2 = c(ad.r_values.list[[i]][ ,'trt'],
                      ad.r_values.list[[i]][ ,'batch']), 
               Effects = as.factor(rep(c('Treatment','Batch'), 
                                       each = 231)))
}
names(ad.boxp.list) <- names(ad.r_values.list)
head(ad.boxp.list$`Before correction`)

ad.r2.boxp <- rbind(ad.boxp.list$`Before correction`,
                    ad.boxp.list$ComBat,
                    ad.boxp.list$`PLSDA-batch`,
                    ad.boxp.list$RUVIII)

ad.r2.boxp$methods <- rep(c('Before correction', 
                            'ComBat','PLSDA-batch',
                            'RUVIII'), each = 462)

ad.r2.boxp$methods <- factor(ad.r2.boxp$methods, 
                             levels = unique(ad.r2.boxp$methods))
head(ad.r2.boxp)

ggplot(ad.r2.boxp, aes(x = Effects, y = r2, fill = Effects)) +
  geom_boxplot(alpha = 0.80) +
  theme_bw() + 
  theme(text = element_text(size = 18),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_text(angle = 60, hjust = 1, size = 18),
        axis.text.y = element_text(size = 18),
        panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank(),
        legend.position = "right") + facet_grid( ~ methods) + 
  scale_fill_manual(values=pb_color(c(12,14)))

$R^2$ values for each microbial variable before and after batch effect correction.

As shown in the plots, the data corrected using ComBat still contained a few variables with a large proportion of batch variance. In the case of RUVIII, the corrected data exhibited a greater proportion of batch variance than treatment variance.

Alignment scores

Before applying alignment_score() function from PLSDAbatch, we need to specify the proportion of data variance to explain (var), the number of nearest neighbours (k) and the number of PCs to calculate (ncomp) for the internal PCA. We can then use ggplot() function from ggplot2 to visualise the results.

# calculate the alignment scores
ad.scores <- c()
names(ad.batch) <- rownames(ad.clr)
for(i in seq_len(length(ad.corrected.list))){
  res <- alignment_score(data = ad.corrected.list[[i]], 
                         batch = ad.batch, 
                         var = 0.95, 
                         k = 8, 
                         ncomp = 50)
  ad.scores <- c(ad.scores, res)
}
head(ad.scores)

## [1] 0.3766892 0.7179054 0.6689189 0.5304054

# rearrange the data for ggplot
ad.scores.df <- data.frame(scores = ad.scores, 
                           methods = names(ad.corrected.list))

ad.scores.df$methods <- factor(ad.scores.df$methods, 
                               levels = rev(names(ad.corrected.list)))

head(ad.scores.df)

ggplot() + geom_col(aes(x = ad.scores.df$methods, 
                        y = ad.scores.df$scores)) + 
  geom_text(aes(x = ad.scores.df$methods, 
                y = ad.scores.df$scores/2, 
                label = round(ad.scores.df$scores, 3)), 
            size = 3, col = 'white') + 
  coord_flip() + theme_bw() + ylab('Alignment Scores') + 
  xlab('') + ylim(0,0.85)

Alignment scores before and after batch effect correction using different methods for the AD data.

The alignment scores complement the PCA results, especially when batch effect removal is difficult to evaluate from PCA sample plots alone. For example, in the previous PCA plots (see section “PCA”), we observed that samples from different batches appeared more intermixed after batch effect correction, regardless of the method used. However, comparing the performance of different methods remained challenging.

Since a higher alignment score indicates better mixing of samples across batches, the bar plot above shows that ComBat achieved superior performance compared to the other methods. However, the $R^2$ analysis revealed that the data corrected with ComBat still contained a few variables with a large proportion of batch variance (see section “$R^2$”). This highlights the importance of evaluating correction effectiveness using multiple techniques to ensure an unbiased assessment.

In this example, the lower alignment score observed for the data corrected using PLSDA-batch compared to ComBat may be due to differences in the PCA sample projections. Specifically, the data corrected with ComBat exhibited greater variance in its PCA projection, whereas the data with PLSDA-batch showed smaller variance. Smaller variance in the projection can lead to lower alignment scores, as it increases the likelihood that samples from different batches appear as nearest neighbours. Nonetheless, the pRDA results (see section “pRDA”) quantitatively confirmed that PLSDA-batch effectively removed batch variance entirely (Wang and Lê Cao 2023).

Discussion and conclusions

This workshop presents a comprehensive framework for managing batch effects, using microbiome data as an illustrative example. The input for this workflow is preprocessed data. For different types of omics data, the preprocessing steps may vary.

To detect batch effects, both visual tools (e.g., PCA, boxplots, density plots, and heatmaps) and quantitative methods such as pRDA can be applied. If batch effects appear negligible, for instance, when pRDA shows only a minimal proportion of variance explained by batch or when PCA does not reveal clear batch driven clusters, batch effect management may not be necessary. However, when batch effects are substantial, two strategies can be considered: accounting for batch effects during modeling, or removing them from the data prior to analysis.

Both strategies assume a balanced batch × treatment design. If the design is nested, batch effects can only be accounted for using a linear mixed model. If the design is unbalanced but not nested, batch effects can either be controlled using standard linear models or removed using weighted PLSDA-batch.

In most cases, batch grouping information is assumed to be known. When it is not, batch estimation methods such as RUV4, RUVIII and SVA can be employed. RUV-based methods rely on negative control variables and/or sample replicates, while SVA estimates batch effects from variables that are least affected by treatment (see section “To go further”). However, for RUV methods, these negative controls and sample replicates must capture the full spectrum of batch variation; otherwise, batch effects may not be completely addressed. We emphasised this issue in the section “Assessing batch effect correction - pRDA”, where limited replicates posed challenges for batch effect removal. Moreover, some estimation methods (e.g., SVA) assume that batch effects are independent of treatment. When batch and treatment are correlated, the batch effect cannot be accurately estimated and may lead to spurious associations.

Additionally, many methods for batch effect management assume systematic batch effects across variables, and some models are highly influenced by this assumption (e.g., SVA, ComBat and RUV methods). Therefore, we recommend validating the model before applying it. For an appropriate choice of method, see the figure below: Decision tree for method selection.

The next critical step is to evaluate the effectiveness of batch effect management. While such evaluation is often implicit in methods that account for batch effects (e.g., linear regression), it is essential for methods that remove them. This can be done by comparing the data before and after correction using the same tools employed for batch effect detection. Additionally, $R^2$ values can be calculated for each microbial variable to quantify the variance explained by batch and treatment, and alignment scores can be used to assess how well samples from different batches are mixed. However, individual evaluation tools may have limitations. For example, PCA relies on visual interpretation, while alignment scores focus solely on distances in the PCA projection space. Therefore, a robust conclusion should be based on multiple complementary evaluation methods.

Once batch effects have been adequately addressed, downstream analyses, such as multivariate discriminant analysis or univariate differential analysis, can be performed. Among these, multivariate methods are often more appropriate for microbiome data, given the natural correlations among microbial variables arising from biological interactions. Different types of methods may be suitable for different omics data.

Decision tree for method selection.

To go further

Due to time limitations, the methods shown in the decision tree that we don’t have time to cover in practice are presented in the following section.

Accounting for batch effects

Methods designed for microbiome data

Zero-inflated Gaussian mixture model (CSS+ZIG)

To use the ZIG model, we first need to create an MRexperiment object by applying newMRexperiment() (from metagenomeSeq package) to the microbiome count data, along with annotated dataframes containing metadata and taxonomic information, respectively, which are generated using AnnotatedDataFrame() from Biobase package.

# Creating a MRexperiment object (make sure no NA in metadata)
## metadata
AD.phenotypeData = AnnotatedDataFrame(data = as.data.frame(ad.metadata))

## taxonomic info
AD.taxaData = AnnotatedDataFrame(data = as.data.frame(example_ADdata$taxa))

AD.obj = newMRexperiment(counts = t(example_ADdata$count), 
                         phenoData = AD.phenotypeData, 
                         featureData = AD.taxaData)
AD.obj

## MRexperiment (storageMode: environment)
## assayData: 567 features, 75 samples 
##   element names: counts 
## protocolData: none
## phenoData
##   sampleNames: E1aJ16_A E5aJ16_A ... E8cJ96_E (75 total)
##   varLabels: sample_name.data.extraction analysis_name ...
##     initial_phenol_concentration.regroup (7 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: OTU12 OTU29 ... OTU17710 (567 total)
##   fvarLabels: Kingdom Phylum ... Species (7 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

The AD count data are then filtered using filterData() function (from metagenomeSeq package).

To extract the count data from the MRexperiment object, we can use MRcounts() function.

# filtering data to maintain a threshold of minimum depth or OTU presence

## before filtering
dim(MRcounts(AD.obj))

## [1] 567  75

AD.obj = filterData(obj = AD.obj, present = 20, depth = 5)

## after filtering
dim(MRcounts(AD.obj))

## [1] 289  75

After filtering, the AD count data were reduced to 289 OTUs and 75 samples.

We will then calculate the percentile for CSS normalisation using cumNormStatFast() function (from metagenomeSeq package). CSS normalisation can be applied with cumNorm() and the normalised data can be exported using MRcounts() with norm = TRUE. The normalisation scaling factors for each sample, which are the sum of counts up to the calculated percentile, can be accessed using normFactors(). We can calculate the log-transformed scaling factors by diving them with their median, which are preferred over the default scaling factors that are divided by 1000 (log2(normFactors(obj)/1000 + 1)).

# calculate the percentile for CSS normalisation
AD.pctl = cumNormStatFast(obj = AD.obj)

## Default value being used.

# CSS normalisation
AD.obj <- cumNorm(obj = AD.obj, p = AD.pctl)

# export normalised data
AD.norm.data <- MRcounts(obj = AD.obj, norm = TRUE)

# normalisation scaling factors for each sample 
AD.normFactor = normFactors(object = AD.obj)

# log-transformed scaling factors
AD.normFactor = log2(AD.normFactor/median(AD.normFactor) + 1)

After that, let’s create a design matrix with the treatment variable (phenol_conc), batch variable (seq_run) and the log-transformed scaling factors using model.matrix(). We can then apply the ZIG model using fitZig() function. The argument useCSSoffset = FALSE should be set to avoid using the default scaling factors, as we have already included our customised scaling factor (AD.normFactor) in the design matrix.

# treatment variable
phenol_conc = pData(object = AD.obj)$initial_phenol_concentration.regroup

# batch variable
seq_run = pData(object = AD.obj)$sequencing_run_date

# build a design matrix
AD.mod.full = model.matrix(~ phenol_conc + seq_run + AD.normFactor)

# settings for the fitZig() function
AD.settings <- zigControl(maxit = 10, verbose = TRUE)

# apply the ZIG model
ADfit <- fitZig(obj = AD.obj, 
                mod = AD.mod.full, 
                useCSSoffset = FALSE, 
                control = AD.settings)

## it= 0, nll=123.44, log10(eps+1)=Inf, stillActive=289
## it= 1, nll=134.33, log10(eps+1)=0.04, stillActive=10
## it= 2, nll=134.64, log10(eps+1)=0.01, stillActive=1
## it= 3, nll=134.80, log10(eps+1)=0.00, stillActive=0

head(ADfit@fit$coefficients)

##        (Intercept) phenol_conc1-2 seq_run09/04/2015 seq_run14/04/2016
## OTU12   0.81493490     0.68096987        -0.3914984       -0.46251653
## OTU29   0.13034146    -0.74964792         3.2036768        1.73851963
## OTU77  -0.03529492     0.03456233        -0.4279860       -0.69409045
## OTU86   0.69888725    -1.18545474        -0.9872078       -0.74385441
## OTU97  -0.23371589    -0.21160945        -1.0906897        0.02215167
## OTU106 -0.62766893     0.13556615         2.9251996        0.21997853
##        seq_run14/11/2016 seq_run21/09/2017 AD.normFactor
## OTU12         0.03486566         0.2163230     4.3931417
## OTU29         1.58298972         5.6148999     1.3016901
## OTU77         0.59730201         1.4620328     2.0975373
## OTU86         0.28063764         0.6643703     2.6490927
## OTU97         0.14623346         0.9857519     2.3412736
## OTU106        0.15511965         4.1640912     0.5575642

The OTUs with the top 50 smallest p values can be extracted using MRcoefs(). Let’s set eff = 0.5 so that only OTUs with at least the “0.5” quantile (50%) number of effective samples (i.e., positive samples + estimated undersampling zeroes) are extracted.

We also set the arguments coef = 2 to focus on the second column of the coefficient matrix, which corresponds to the treatment effects, and group = 3 to sort the OTUs by their p values in increasing order.

ADcoefs <- MRcoefs(ADfit, coef = 2, 
                   group = 3, number = 50, 
                   eff = 0.5)
head(ADcoefs)

Methods adapted for microbiome data

Surrogate variable analysis (SVA)

SVA only accounts for unknown batch effects. Therefore, we assume that the batch grouping information in the AD data is unknown, and that the batch effects are systematic.

We first need to build two design matrices with (ad.mod) and without (ad.mod0) treatment grouping information using model.matrix() function from stats. We will then use num.sv() from sva package to determine the number of surrogate or batch-related variables n.sv to estimate in function sva().

# the design with treatment
ad.mod <- model.matrix( ~ ad.trt)
# the design without treatment
ad.mod0 <- model.matrix( ~ 1, data = ad.trt)

# estimate the number of batch-related variables
ad.sva.n <- num.sv(dat = t(ad.clr), mod = ad.mod, method = 'leek')

# estimate batch-related variables
ad.sva <- sva(t(ad.clr), 
              ad.mod, ad.mod0, 
              n.sv = ad.sva.n)

## Number of significant surrogate variables is:  4 
## Iteration (out of 5 ):1  2  3  4  5

The estimated batch effects are then input into f.pvalue() to calculate the p values of treatment effects. In SVA, batch effects are assumed to be independent of treatment effects. If they are correlated due to study design or biological reasons, the estimation of batch-related variables will be biased and may affect the accuracy of downstream linear regression results. However, some correlation may exist between the estimated batch effects and treatment due to how SVA estimates batch-related variables from those least affected by treatment. These variables may still carry some treatment information, leading to estimated batch effects that are weakly correlated with treatment effects. (Wang and LêCao 2020).

# include estimated batch-related variables in the linear model
ad.mod.batch <- cbind(ad.mod, ad.sva$sv)
ad.mod0.batch <- cbind(ad.mod0, ad.sva$sv)

# linear model
ad.sva.p <- f.pvalue(t(ad.clr), ad.mod.batch, ad.mod0.batch)
ad.sva.p.adj <- p.adjust(ad.sva.p, method = 'fdr')

Correcting for batch effects

removeBatchEffect

Before applying removeBatchEffect, we need to prepare a design matrix (ad.mod) that includes the treatment grouping information, which is used to preserve treatment effects during batch effect correction. This design matrix can be generated using model.matrix() function from stats.

Then we can use removeBatchEffect() function (limma package) with the batch grouping information (batch) and the treatment design matrix (design) as inputs to calculate the batch effect corrected data ad.rBE.

# generate treatment design matrix
ad.mod <- model.matrix( ~ ad.trt)

ad.rBE <- t(removeBatchEffect(t(ad.clr), 
                              batch = ad.batch, 
                              design = ad.mod))

sPLSDA-batch

Applying sPLSDA-batch is similar to applying PLSDA-batch. We will use the same function PLSDA_batch(), but need to specify the number of variables to select on each component, typically for the treatment-related components (keepX.trt). To determine the optimal number of variables to select, we will use tune.splsda() function from mixOmics package (Rohart et al. 2017), testing a range of possible values for each component (test.keepX).

# estimate the number of variables to select per treatment component
set.seed(777)

# generate a range of possible values
ad.test.keepX = c(seq(1, 10, 1), seq(20, 100, 10), 
                  seq(150, 231, 50), 231)

ad.trt.tune.v <- tune.splsda(X = ad.clr, 
                             Y = ad.trt, 
                             ncomp = 1, 
                             test.keepX = ad.test.keepX, 
                             validation = 'Mfold', 
                             folds = 4, 
                             nrepeat = 50)
ad.trt.tune.v$choice.keepX #100

Here, the optimal number of variables to select for the treatment component was 100. Since we have adjusted the amount of treatment variation to preserve, we need to re-estimate the optimal number of components related to batch effects using the same criterion described in the section “PLSDA-batch”.

# estimate the number of batch components
ad.batch.tune <- PLSDA_batch(X = ad.clr, 
                             Y.trt = ad.trt, 
                             Y.bat = ad.batch,
                             ncomp.trt = 1, keepX.trt = 100,
                             ncomp.bat = 10)
ad.batch.tune$explained_variance.bat #4

## $X
##      comp1      comp2      comp3      comp4      comp5      comp6      comp7 
## 0.17420018 0.11477097 0.09813477 0.07894965 0.03744028 0.03625205 0.03696400 
##      comp8      comp9     comp10 
## 0.02622186 0.02347569 0.01481321 
## 
## $Y
##     comp1     comp2     comp3     comp4     comp5     comp6     comp7     comp8 
## 0.2474775 0.2606716 0.2301081 0.2617429 0.2327513 0.2550725 0.2468812 0.1405583 
##     comp9    comp10 
## 0.1247367 0.2547368

sum(ad.batch.tune$explained_variance.bat$Y[seq_len(4)])

## [1] 1

According to the result, we needed 4 batch-related components to remove batch variance from the data using function PLSDA_batch().

ad.sPLSDA_batch.res <- PLSDA_batch(X = ad.clr, 
                                   Y.trt = ad.trt, Y.bat = ad.batch,
                                   ncomp.trt = 1, keepX.trt = 100,
                                   ncomp.bat = 4)
ad.sPLSDA_batch <- ad.sPLSDA_batch.res$X.nobatch

Note: For an unbalanced batch x treatment design (except in the case of a nested design), we can set balance = FALSE in PLSDA_batch() function to apply weighted PLSDA-batch.

Percentile Normalisation (PN)

For PN correction, let’s apply percentile_norm() function from PLSDAbatch package and specify a control group (ctrl.grp).

ad.PN <- percentile_norm(data = ad.clr, batch = ad.batch, 
                         trt = ad.trt, ctrl.grp = '0-0.5')

RUVIII with pseudo replicates

In the previous RUVIII application using the AD study’s original sample replicates, the result was poor due to uneven representation of replicates across batches. To address this, we will generate pseudo-replicates by calculating the sample mean of each batch within each treatment group. Since the means across batches are expected to be similar for the same treatment, these pseudo-replicates can serve as substitutes for actual replicates.

# add group means as the pseudo-replicates
ad_group_mean <- matrix(ncol = ncol(ad.clr))
ad_mean_trt_lab <- c()
ad_mean_batch_lab <- c()
for(te in levels(ad.trt)){
  for(be in levels(ad.batch)){
    ad_group_mean <- rbind(ad_group_mean,apply(ad.clr[ad.trt == te & ad.batch == be, ], 2, mean))
    ad_mean_trt_lab <- c(ad_mean_trt_lab, te)
    ad_mean_batch_lab <- c(ad_mean_batch_lab, be)
  }
}
ad_group_mean <- ad_group_mean[-1,]
ad_new_ind <- rep(c("mean1", "mean2"), each = 5)


ad.replicates.new <- as.factor(c(as.character(ad.metadata$sample_name.data.extraction), ad_new_ind))
ad.replicates.matrix.new <- replicate.matrix(ad.replicates.new)
ad.clr.new <- rbind(ad.clr, ad_group_mean)
ad.trt.new <- as.factor(c(as.character(ad.trt), ad_mean_trt_lab))
ad.batch.new <- as.factor(c(as.character(ad.batch), ad_mean_batch_lab))

# recalculate empirical negative controls
ad.empir.p.new <- c()
for(e in seq_len(ncol(ad.clr.new))){
  ad.empir.lm <- lm(ad.clr.new[,e] ~ ad.trt.new)
  ad.empir.p.new[e] <- summary(ad.empir.lm)$coefficients[2,4]
}
ad.empir.p.adj.new <- p.adjust(p = ad.empir.p.new, method = 'fdr')
ad.nc.new <- ad.empir.p.adj.new > 0.05

# estimate k
ad.k.res.new <- getK(Y = ad.clr.new, X = ad.trt.new, ctl = ad.nc.new)
ad.k.new <- ad.k.res.new$k

# RUVIII
ad.RUVIII.new <- RUVIII(Y = ad.clr.new, 
                        M = ad.replicates.matrix.new, 
                        ctl = ad.nc.new, 
                        k = ad.k.new)

rownames(ad.RUVIII.new) <- rownames(ad.clr.new)

Assessing batch effect correction

PCA

Here, we compare the PCA sample plots before and after batch effect correction using all methods presented in the workshop.

ad.pca.before <- pca(ad.clr, ncomp = 3, scale = TRUE)
ad.pca.rBE <- pca(ad.rBE, ncomp = 3, scale = TRUE)
ad.pca.ComBat <- pca(ad.ComBat, ncomp = 3, scale = TRUE)
ad.pca.PLSDA_batch <- pca(ad.PLSDA_batch, ncomp = 3, scale = TRUE)
ad.pca.sPLSDA_batch <- pca(ad.sPLSDA_batch, ncomp = 3, scale = TRUE)
ad.pca.PN <- pca(ad.PN, ncomp = 3, scale = TRUE)
ad.pca.RUVIII <- pca(ad.RUVIII, ncomp = 3, scale = TRUE)
ad.pca.RUVIII.new <- pca(ad.RUVIII.new, ncomp = 3, scale = TRUE)

# order batches
ad.batch = factor(ad.metadata$sequencing_run_date, 
                  levels = unique(ad.metadata$sequencing_run_date))

ad.pca.before.plot <- Scatter_Density(object = ad.pca.before, 
                                      batch = ad.batch, 
                                      trt = ad.trt, 
                                      title = 'Before correction')

ad.pca.rBE.plot <- Scatter_Density(object = ad.pca.rBE, 
                                   batch = ad.batch, 
                                   trt = ad.trt, 
                                   title = 'removeBatchEffect')

ad.pca.ComBat.plot <- Scatter_Density(object = ad.pca.ComBat, 
                                      batch = ad.batch, 
                                      trt = ad.trt, 
                                      title = 'ComBat')

ad.pca.PLSDA_batch.plot <- Scatter_Density(object = ad.pca.PLSDA_batch, 
                                           batch = ad.batch, 
                                           trt = ad.trt, 
                                           title = 'PLSDA-batch')

ad.pca.sPLSDA_batch.plot <- Scatter_Density(object = ad.pca.sPLSDA_batch, 
                                            batch = ad.batch, 
                                            trt = ad.trt, 
                                            title = 'sPLSDA-batch')

ad.pca.PN.plot <- Scatter_Density(object = ad.pca.PN, 
                                  batch = ad.batch, 
                                  trt = ad.trt, 
                                  title = 'Percentile Normalisation')

ad.pca.RUVIII.plot <- Scatter_Density(object = ad.pca.RUVIII, 
                                      batch = ad.batch, 
                                      trt = ad.trt, 
                                      title = 'RUVIII')

ad.pca.RUVIII.new.plot <- Scatter_Density(object = ad.pca.RUVIII.new, 
                                      batch = ad.batch.new, 
                                      trt = ad.trt.new, 
                                      title = 'RUVIII+pseR')

$PCA sample plots with densities before and after batch effect correction in the AD data.$

PCA sample plots with densities before and after batch effect correction in the AD data.

pRDA

We can also calculate and display the pRDA results for the data before and after batch effect correction using all methods presented in the workshop.

ad.corrected.list <- list(`Before correction` = ad.clr, 
                          removeBatchEffect = ad.rBE, 
                          ComBat = ad.ComBat, 
                          `PLSDA-batch` = ad.PLSDA_batch, 
                          `sPLSDA-batch` = ad.sPLSDA_batch, 
                          `Percentile Normalisation` = ad.PN,
                          RUVIII = ad.RUVIII)

ad.prop.df <- data.frame(Treatment = NA, Batch = NA, 
                         Intersection = NA, 
                         Residuals = NA) 
for(i in seq_len(length(ad.corrected.list))){
  rda.res = varpart(ad.corrected.list[[i]], ~ trt, ~ batch,
                    data = ad.factors.df, scale = TRUE)
  ad.prop.df[i, ] <- rda.res$part$indfract$Adj.R.squared}

rownames(ad.prop.df) = names(ad.corrected.list)

ad.prop.df <- ad.prop.df[, c(1,3,2,4)]

ad.prop.df[ad.prop.df < 0] = 0
ad.prop.df <- as.data.frame(t(apply(ad.prop.df, 1, 
                                    function(x){x/sum(x)})))

# RUVIII + pseudo replicates
rda.ruviii = varpart(ad.RUVIII.new, ~ ad.trt.new, ~ ad.batch.new, scale = TRUE)
RUVIII.prop = rda.ruviii$part$indfract$Adj.R.squared[c(1,3,2,4)]

RUVIII.prop[RUVIII.prop < 0] = 0
RUVIII.prop <- RUVIII.prop/sum(RUVIII.prop)

ad.prop.df <- rbind(ad.prop.df, `RUVIII+pseR` = RUVIII.prop)

partVar_plot(prop.df = ad.prop.df)

Global explained variance before and after batch effect correction in the AD data.

We can also apply additional methods introduced in the workshop to evaluate batch effect correction.

Session Info

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Australia/Melbourne
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] PLSDAbatch_1.1.1     metagenomeSeq_1.46.0 RColorBrewer_1.1-3  
##  [4] glmnet_4.1-8         Matrix_1.7-3         Biobase_2.64.0      
##  [7] BiocGenerics_0.50.0  limma_3.60.6         sva_3.52.0          
## [10] BiocParallel_1.38.0  genefilter_1.86.0    mgcv_1.9-1          
## [13] nlme_3.1-164         mixOmics_6.30.0      MASS_7.3-60.2       
## [16] gridExtra_2.3        performance_0.13.0   ggplot2_3.5.2       
## [19] ruv_0.9.7.1          vegan_2.6-10         lattice_0.22-6      
## [22] permute_0.9-7        pheatmap_1.0.12     
## 
## loaded via a namespace (and not attached):
##   [1] rstudioapi_0.16.0               jsonlite_1.8.8                 
##   [3] shape_1.4.6.1                   datawizard_1.0.2               
##   [5] magrittr_2.0.3                  farver_2.1.2                   
##   [7] nloptr_2.1.1                    rmarkdown_2.27                 
##   [9] fs_1.6.4                        zlibbioc_1.50.0                
##  [11] vctrs_0.6.5                     memoise_2.0.1                  
##  [13] minqa_1.2.7                     rstatix_0.7.2                  
##  [15] S4Arrays_1.4.1                  htmltools_0.5.8.1              
##  [17] broom_1.0.6                     SparseArray_1.4.8              
##  [19] sass_0.4.9                      KernSmooth_2.23-24             
##  [21] bslib_0.8.0                     httr2_1.0.3                    
##  [23] plyr_1.8.9                      cachem_1.1.0                   
##  [25] igraph_2.0.3                    lifecycle_1.0.4                
##  [27] iterators_1.0.14                pkgconfig_2.0.3                
##  [29] R6_2.5.1                        fastmap_1.2.0                  
##  [31] GenomeInfoDbData_1.2.12         rbibutils_2.2.16               
##  [33] MatrixGenerics_1.16.0           numDeriv_2016.8-1.1            
##  [35] digest_0.6.36                   colorspace_2.1-1               
##  [37] rARPACK_0.11-0                  patchwork_1.3.0                
##  [39] AnnotationDbi_1.66.0            S4Vectors_0.42.1               
##  [41] RSpectra_0.16-2                 GenomicRanges_1.56.1           
##  [43] ellipse_0.5.0                   RSQLite_2.3.7                  
##  [45] ggpubr_0.6.0                    labeling_0.4.3                 
##  [47] fansi_1.0.6                     TreeSummarizedExperiment_2.12.0
##  [49] httr_1.4.7                      abind_1.4-5                    
##  [51] compiler_4.4.1                  bit64_4.0.5                    
##  [53] withr_3.0.1                     backports_1.5.0                
##  [55] carData_3.0-5                   DBI_1.2.3                      
##  [57] highr_0.11                      gplots_3.2.0                   
##  [59] ggsignif_0.6.4                  rappdirs_0.3.3                 
##  [61] DelayedArray_0.30.1             corpcor_1.6.10                 
##  [63] gtools_3.9.5                    caTools_1.18.3                 
##  [65] Wrench_1.22.0                   tools_4.4.1                    
##  [67] ape_5.8                         glue_1.7.0                     
##  [69] grid_4.4.1                      cluster_2.1.6                  
##  [71] reshape2_1.4.4                  see_0.11.0                     
##  [73] generics_0.1.3                  gtable_0.3.5                   
##  [75] tidyr_1.3.1                     car_3.1-2                      
##  [77] utf8_1.2.4                      XVector_0.44.0                 
##  [79] ggrepel_0.9.5                   foreach_1.5.2                  
##  [81] pillar_1.9.0                    stringr_1.5.1                  
##  [83] yulab.utils_0.1.6               splines_4.4.1                  
##  [85] dplyr_1.1.4                     treeio_1.28.0                  
##  [87] survival_3.6-4                  bit_4.0.5                      
##  [89] annotate_1.82.0                 tidyselect_1.2.1               
##  [91] SingleCellExperiment_1.26.0     locfit_1.5-9.10                
##  [93] Biostrings_2.72.1               knitr_1.48                     
##  [95] IRanges_2.38.1                  SummarizedExperiment_1.34.0    
##  [97] edgeR_4.2.1                     stats4_4.4.1                   
##  [99] xfun_0.46                       statmod_1.5.0                  
## [101] matrixStats_1.5.0               stringi_1.8.4                  
## [103] UCSC.utils_1.0.0                lazyeval_0.2.2                 
## [105] yaml_2.3.10                     boot_1.3-30                    
## [107] evaluate_0.24.0                 codetools_0.2-20               
## [109] tibble_3.2.1                    BiocManager_1.30.24            
## [111] cli_3.6.3                       xtable_1.8-4                   
## [113] Rdpack_2.6.1                    munsell_0.5.1                  
## [115] jquerylib_0.1.4                 Rcpp_1.0.13                    
## [117] GenomeInfoDb_1.40.1             png_0.1-8                      
## [119] XML_3.99-0.17                   parallel_4.4.1                 
## [121] blob_1.2.4                      bayestestR_0.15.3              
## [123] bitops_1.0-8                    lme4_1.1-35.5                  
## [125] tidytree_0.4.6                  lmerTest_3.1-3                 
## [127] scales_1.3.0                    insight_1.2.0                  
## [129] purrr_1.0.2                     crayon_1.5.3                   
## [131] BiocStyle_2.30.0                rlang_1.1.4                    
## [133] KEGGREST_1.44.1

References

Chapleur, Olivier, Céline Madigou, Raphaël Civade, Yohan Rodolphe, Laurent Mazéas, and Théodore Bouchez. 2016. “Increasing Concentrations of Phenol Progressively Affect Anaerobic Digestion of Cellulose and Associated Microbial Communities.” Biodegradation 27 (1): 15–27.

LÃŒdecke, Daniel, Dominique Makowski, Philip Waggoner, and Indrajeet Patil. 2020. “Performance: Assessment of Regression Models Performance.” CRAN. https://doi.org/10.5281/zenodo.3952174.

Leek, Jeffrey T, W Evan Johnson, Hilary S Parker, Andrew E Jaffe, and John D Storey. 2012. “The Sva Package for Removing Batch Effects and Other Unwanted Variation in High-Throughput Experiments.” Bioinformatics 28 (6): 882–83.

Moskovicz, Veronica, Rina Ben-El, Guy Horev, and Boaz Mizrahi. 2020. “Skin Microbiota Dynamics Following b. Subtilis Formulation Challenge.”

Rohart, Florian, Benoı̂t Gautier, Amrit Singh, and Kim-Anh Lê Cao. 2017. “mixOmics: An r Package for ‘Omics Feature Selection and Multiple Data Integration.” PLoS Computational Biology 13 (11): e1005752.

Wang, Yiwen, and Kim-Anh Lê Cao. 2023. “PLSDA-Batch: A Multivariate Framework to Correct for Batch Effects in Microbiome Data.” Briefings in Bioinformatics 24 (2): bbac622.

Wang, Yiwen, and Kim-Anh LêCao. 2020. “Managing Batch Effects in Microbiome Data.” Briefings in Bioinformatics 21 (6): 1954–70.

Batch effect management

Eva

04 June 2025

Introduction

Case study description

Packages installation and loading

Load data

Exercise 1: How many samples are there within each treatment and batch group?

Answer

Batch effect detection

Principal component analysis (PCA)

Exercise 2: Interpret the PCA plot created above

Answer

Boxplots and density plots

Exercise 3: Interpret the result when “21/09/2017” was set as the reference level

Answer

Heatmap

Partial redundancy analysis (pRDA)

Managing batch effects

Accounting for batch effects

Linear regression: linear model (LM) and linear mixed model (LMM)

Remove unwanted variation in 4 steps (RUV4)

Correcting for batch effects

ComBat

PLSDA-batch

Remove Unwanted Variation-III (RUVIII)

Assessing batch effect correction

Methods that detect batch effects

PCA

Exercise 4: Interpret the PCA plots created above

Answer

pRDA

Other methods

\(\mathbf{R^2}\)

Alignment scores

Discussion and conclusions

To go further

Accounting for batch effects

Methods designed for microbiome data

Zero-inflated Gaussian mixture model (CSS+ZIG)

Methods adapted for microbiome data

Surrogate variable analysis (SVA)

Correcting for batch effects

removeBatchEffect

sPLSDA-batch

Percentile Normalisation (PN)

RUVIII with pseudo replicates

Assessing batch effect correction

PCA

pRDA

Session Info

References