Linear models are powerful statistical techniques for modelling relationships between variables (e.g genes). Biological datasets are often rich with covariates that can impact the response variable. Linear models can reveal intricate relationships in these data. Linear mixed models, extending traditional linear models can address unwanted variation commonly encountered because of complex study designs.
The workshop materials drew inspiration from the r-linear-abacbs2018 workshop conducted by Monash Bioinformatics, as well as the QCBS R workshops conducted by Quebec Centre for Biodiversity.
# Install tidyverse package using CRAN
# install.packages(c("tidyverse", "nlme", "ggpubr", "MuMIn"))
# Install edgeR and limma packages using BiocManager
#if (!requireNamespace("BiocManager", quietly = TRUE)) {
#install.packages("BiocManager")
#}
#BiocManager::install(c("edgeR","limma"))
library(tidyverse) # working with data frames, plotting
library(edgeR) # cpm, etc -- RNA-Seq normalization
library(limma) # lmFit, etc -- fitting many models
library(nlme) # lme
library(ggpubr) #ggqqplot
library(MuMIn) #AICc
Tooth growth in mouse embryos is studied using RNA-Seq. The RNA expression levels of several genes are examined in the cells that form the upper and lower first molars, in eight individual mouse embryos that have been dissected after different times of embryo development. The measurements are in terms of “Reads Per Million”, essentially the fraction of RNA in each sample belonging to each gene, times 1 million. This data was extracted from ARCHS4.
The files we will use are csv comma-separated, so we will use the
read_csv()
function from the tidyverse.
Your turn 2.1: Download the data.zip file here.
Unzip the file and store the content in the data
folder in your working directory.
For the first part of the workshop, we will use the counts file
called teeth.csv
stored in the data folder. The path to the
file should be data/teeth.csv
. This file includes count
information for three genes, which we will use up to Section 5.
Your turn 2.2: Load the data into your R working
directory. The contents of the counts file will be stored in an
object named teeth
. Afterwards, explore
the gene_smoc1
across different days using a scatter
plot.
# read in counts file
teeth <- read_csv("data/teeth.csv")
ggplot(teeth,aes(x=day, y=gene_smoc1))+
geom_point()+theme_bw()+ylab("Smoc1")
Interpret the scatter plot created above.
Let’s fit a simple linear regression model using the
smoc1
gene counts as the dependent variable and the number
of days as the independent variable. We can then explore the residual
plots to identify any potential violations of the model assumptions.
Your turn 2.3: Run the following command line. From the output, identify if there are any assumption violations.
# fitting a simple linear regression
smoc1_fit <- lm(gene_smoc1 ~ day, data=teeth)
summary(smoc1_fit)
##
## Call:
## lm(formula = gene_smoc1 ~ day, data = teeth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.780 -20.254 -1.399 23.652 35.448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.238 89.349 -0.148 0.884
## day 4.212 5.485 0.768 0.455
##
## Residual standard error: 25.13 on 14 degrees of freedom
## Multiple R-squared: 0.04041, Adjusted R-squared: -0.02813
## F-statistic: 0.5896 on 1 and 14 DF, p-value: 0.4553
par(mfrow=c(2,2))
plot(smoc1_fit)
We can observe a clear violation of linearity in the plot of residuals versus fitted values. To account for this, you might consider including a squared value of days in the linear model to see if this improves the residuals.
# fitting a simple linear regression
day2<-teeth$day^2
smoc1_fit_sq <- lm(gene_smoc1 ~ day+day2, data=teeth)
summary(smoc1_fit_sq)
par(mfrow=c(2,2))
plot(smoc1_fit_sq)
Your turn 2.4: Examine the code above, run and then look at the residual plots to see if the modification helped.
Here onwards we will use log2 transformation on all gene counts before fitting any model.
In Section 2, we conducted an exploration of simple linear regression
using the smoc1
gene from our dataset. Simple linear
regression employs a single numerical independent variable to explain a
corresponding numerical dependent variable. Additionally, we examined
how data transformation can enhance the assumptions of our model.
Let’s fit an ANOVA model using the log2 transformed
gene_pou3f3
gene counts as the dependent variable, and the
tooth’s position (upper or lower) as the independent variable. Note that
in this case, the tooth variable has only two groups (i.e., upper and
lower). In such situations, you could use a t-test to compare the two
groups. However, we will use ANOVA to explore these two groups to
demonstrate the use of ANOVA.
Your turn 3.1: First, you will explore the scatter
plot and convert the tooth
variable into a factor variable.
Before fitting the ANOVA model, examine the QQ plot to check the
normality of the dependent variable.
# converting to a factor
teeth$tooth <- factor(teeth$tooth, c("lower","upper"))
ggplot(teeth, aes(x=tooth, y=log2(gene_pou3f3))) +
geom_point()+
theme_bw()
ggqqplot(teeth, x="log2(gene_pou3f3)", facet.by = "tooth")
We can run our ANOVA in R using different functions. The most basic
and common functions we can use are aov()
and
lm()
. Because ANOVA is a type of linear model, we can use
the lm()
function.
Your turn 3.2: Let’s see what lm()
produces for log2 transformed gene_pou3f3
.
#ANOVA using lm function
aov.lm_pou3f3 <- lm(log2(gene_pou3f3) ~ tooth, data=teeth)
summary(aov.lm_pou3f3)
##
## Call:
## lm(formula = log2(gene_pou3f3) ~ tooth, data = teeth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57445 -0.39358 -0.02223 0.40193 0.65208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2574 0.1659 19.63 1.38e-11 ***
## toothupper 3.7434 0.2346 15.96 2.24e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4693 on 14 degrees of freedom
## Multiple R-squared: 0.9479, Adjusted R-squared: 0.9441
## F-statistic: 254.6 on 1 and 14 DF, p-value: 2.244e-10
The F-statistic is the test statistic for ANOVA and is a combination of the sums of squares described above. The associated p-value can help provide a significance interepretation on the F-statistic. That is, a \(p-value<0.05\) tells us that at least one group mean differs from another at the \(\alpha=0.05\) level of significance.
Your turn 3.3: Let’s explore aov()
for
log2 transformed gene_pou3f3
.
# Compute the analysis of variance
aov_pou3f3 <- aov(log2(gene_pou3f3) ~ tooth, data=teeth)
# Summary of the analysis
summary(aov_pou3f3)
## Df Sum Sq Mean Sq F value Pr(>F)
## tooth 1 56.05 56.05 254.6 2.24e-10 ***
## Residuals 14 3.08 0.22
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When using the aov()
function, if the p-value is less
than 0.05, you would want to identify which groups significantly differ
from each other. To do this, you could perform multiple comparisons.
This is an analysis conducted after ANOVA that helps us quantify the
differences between groups. A very common test for this purpose is
Tukey’s Honest Significant Differences (HSD). The Tukey HSD procedure
conducts a pairwise comparison of all possible combinations of groups,
testing for significant differences between their means, while adjusting
the p-value.
Your turn 3.4: Let’s explore multiple comparisons on
the aov_pou3f3
model.
TukeyHSD(aov_pou3f3)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = log2(gene_pou3f3) ~ tooth, data = teeth)
##
## $tooth
## diff lwr upr p adj
## upper-lower 3.743405 3.24018 4.24663 0
In Section 3, we investigated ANOVA using the pou3f3
gene from our dataset. ANOVA is utilized to compare three or more groups
and thus, it employs a single categorical independent variable to
examine a numerical dependent variable. Initially, we verified the
normality assumption for the dependent variable within each group. If
the ANOVA results are significant, which indicates that at least one
group significantly differs from another, then post-hoc multiple
comparisons can be conducted as demonstrated in Subsection 3.4.
We will use the lme()
function to fit linear regression
models on the log2 transformed gene_ace data, treating day
as a fixed effect and tooth
as a random effect.
Your turn 4.1: Fit two linear mixed models: the first one assumes that the random effects have the same slope but different intercepts, while the second one assumes that the random effects have different intercepts and slopes. Compare the two models based on their corrected Akaike Information Criterion (AIC) values.
#Different intecept for random effect
lmm_acefit2<-nlme::lme(log2(gene_ace) ~ day ,
random= (~1 | tooth), data = teeth)
#Different intecept and slope for random effect
lmm_acefit3<-nlme::lme(log2(gene_ace) ~ day ,
random= (~1+ day| tooth), data = teeth)
# Compare two models
MuMIn::AICc(lmm_acefit2,lmm_acefit3)
summary(lmm_acefit2)
## Linear mixed-effects model fit by REML
## Data: teeth
## AIC BIC logLik
## 3.24186 5.798089 2.37907
##
## Random effects:
## Formula: ~1 | tooth
## (Intercept) Residual
## StdDev: 0.1909977 0.1510182
##
## Fixed effects: log2(gene_ace) ~ day
## Value Std.Error DF t-value p-value
## (Intercept) -13.227623 0.5535735 13 -23.89497 0
## day 0.986802 0.0329549 13 29.94403 0
## Correlation:
## (Intr)
## day -0.967
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -1.18379736 -0.76118373 -0.00285813 0.36112191 2.30557892
##
## Number of Observations: 16
## Number of Groups: 2
In Section 4, we investigated linear mixed models using the
ace
gene from our dataset. Linear mixed models are useful
when there are blocking variables in the data that we do not want to
evaluate the effects of. Here, we used the lme()
function
from thenlme
R package to fit the linear mixed model.
However, you can also use the lmer()
function from the
lme4
R package.
In this section we look at fitting linear models to all genes in the data, which includes a total of 32,544 genes. We will use the package limma from Bioconductor. This is a very brief demonstration, and there is much more to this package. See the excellent usersguide.pdf
Your turn 5.1: Use the following R code to load the data, convert the read counts to Reads Per Million (RPM), and then apply a log2 transformation to the results. There are some concepts that we will only briefly touch on: trimmed mean of M-values (TMM) normalization is employed as a minor adjustment to the total number of reads in each sample (i.e., the library size). A small constant is added to the counts to prevent the calculation of log2(0). These steps are described in more detail in the edgeR and limma manuals.
counts_df <- read_csv("data/teeth-read-counts.csv")
counts <- as.matrix(counts_df[,-1])
rownames(counts) <- counts_df[[1]]
#calculate TMM normalization factors
dgelist <- calcNormFactors(DGEList(counts))
#get the normalized counts
log2_cpms <- cpm(dgelist, log=TRUE, prior.count=0.25)
There is chance of detecting differential expression in genes with very low read counts. Including these genes will require a larger False Discovery Rate correction, and also confuses limma’s Empirical Bayes parameter estimation. Let’s only retain genes with around 5 reads per sample or more.
keep <- rowSums(counts>=5)==16
log2_cpms_filtered <- log2_cpms[keep,]
How many genes were present before the filtering, and how many remain after the filtering?
Your turn 5.2: We use limma to fit a linear model to each gene with day as independent variable. The same model formula will be used in each case. limma does not automatically convert a formula into a model matrix, so we have to do this step manually. Here we are using a model formula following a different linear trend over time.
X <- model.matrix(~ day, data=teeth)
X
## (Intercept) day
## 1 1 14.5
## 2 1 14.5
## 3 1 15.0
## 4 1 15.0
## 5 1 15.5
## 6 1 15.5
## 7 1 16.0
## 8 1 16.0
## 9 1 16.5
## 10 1 16.5
## 11 1 17.0
## 12 1 17.0
## 13 1 17.5
## 14 1 17.5
## 15 1 18.0
## 16 1 18.0
## attr(,"assign")
## [1] 0 1
fit <- lmFit(log2_cpms_filtered, X)
fit$coefficients[1:5,]
## (Intercept) day
## 0610007P14Rik 6.4967648 -0.07408778
## 0610009B22Rik 3.4651959 0.01200518
## 0610009L18Rik -0.1494263 0.08878789
## 0610009O20Rik 6.7599797 -0.04497871
## 0610010F05Rik 5.9782132 -0.07663791
efit <- eBayes(fit)
topTable(efit, n=5, coef = c("day"))
Your turn 5.3: We will use the below code to model to tooth as a blocking variable.
block <- teeth$tooth
design <- model.matrix(~ day, data = teeth)
dup_corrs <- duplicateCorrelation(log2_cpms_filtered,
design = design, block = block)
fit <- lmFit(log2_cpms_filtered, X, block = block,
correlation = dup_corrs$consensus.correlation)
efit <- eBayes(fit, trend=TRUE)
topTable(efit, n=5, coef = c("day"))
Similar to section 5.2, use ‘limma’ to fit a linear model for each gene, with ‘day’, ‘tooth’, and their interactions as independent variables.
Extract the coefficient values for the first five genes.
Check top genes for each coefficient.
Your turn 5.4: We will now plot the significant genes identified in Exercise 4 along with the log fold change and expression level.
all_results <- topTable(efit1,coef = c("day"), n=Inf)
significant <- all_results$adj.P.Val <= 0.05
ggplot(all_results, aes(x=AveExpr, y=logFC)) +
geom_point(size=0.1, color="grey") +
geom_point(data=all_results[significant,], size=0.1)
Your turn 5.5: We will now plot the top gene.
# Filter Tfap2b gene
Tfap2b <- log2_cpms["Tfap2b",]
# Fitting a linear model
Tfap2b_fit <- lm(Tfap2b ~ day*tooth, data=teeth)
# creating a set of values to be used to fit the line
more_data <- expand.grid(
day=seq(14.3,18.2,by=0.01),
tooth=as_factor(c("lower","upper")))
p <- ggplot(teeth,aes(x=day,group=tooth))
more_ci <- cbind(more_data,
predict(Tfap2b_fit, more_data, interval="confidence"))
p <- p +
geom_ribbon(data=more_ci, aes(ymin=lwr,ymax=upr),alpha=0.1) +
geom_line(data=more_ci,aes(y=fit,color=tooth))
p + geom_point(aes(y=Tfap2b,color=tooth)) +
ylab("Tfap2b")
In Section 5, we delved into the limma package, a powerful tool for the analysis of gene expression data, particularly from microarray and RNA-Seq experiments. We began by fitting linear models to our data using limma. This approach allowed us to model the relationship between our gene expression data and independent variables. The linear models provided us with estimates of the effects of these variables on gene expression, as well as statistical tests for the significance of these effects. In addition to standard linear models, we also fitted linear mixed effect models with limma. These models are particularly useful when our data has blocking variables. We used the results of our models to explore the top genes in our dataset. These are the genes that showed the most significant effects.
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Australia/Melbourne
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MuMIn_1.47.5 ggpubr_0.6.0 nlme_3.1-164 edgeR_4.0.16
## [5] limma_3.58.1 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
## [9] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
## [13] tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.5 xfun_0.43 bslib_0.7.0 rstatix_0.7.2
## [5] lattice_0.22-6 tzdb_0.4.0 vctrs_0.6.5 tools_4.3.2
## [9] generics_0.1.3 stats4_4.3.2 parallel_4.3.2 fansi_1.0.6
## [13] highr_0.10 pkgconfig_2.0.3 Matrix_1.6-5 lifecycle_1.0.4
## [17] farver_2.1.1 compiler_4.3.2 statmod_1.5.0 munsell_0.5.1
## [21] carData_3.0-5 htmltools_0.5.8.1 sass_0.4.9 yaml_2.3.8
## [25] pillar_1.9.0 car_3.1-2 crayon_1.5.2 jquerylib_0.1.4
## [29] cachem_1.0.8 abind_1.4-5 tidyselect_1.2.1 locfit_1.5-9.9
## [33] digest_0.6.35 stringi_1.8.3 splines_4.3.2 labeling_0.4.3
## [37] fastmap_1.1.1 grid_4.3.2 colorspace_2.1-0 cli_3.6.2
## [41] magrittr_2.0.3 utf8_1.2.4 broom_1.0.5 withr_3.0.0
## [45] scales_1.3.0 backports_1.4.1 bit64_4.0.5 timechange_0.3.0
## [49] rmarkdown_2.26 bit_4.0.5 ggsignif_0.6.4 hms_1.1.3
## [53] evaluate_0.23 knitr_1.45 rlang_1.1.3 Rcpp_1.0.12
## [57] glue_1.7.0 rstudioapi_0.16.0 vroom_1.6.5 jsonlite_1.8.8
## [61] R6_2.5.1