1. Introduction to linear models

Linear models are powerful statistical techniques for modelling relationships between variables (e.g genes). Biological datasets are often rich with covariates that can impact the response variable. Linear models can reveal intricate relationships in these data. Linear mixed models, extending traditional linear models can address unwanted variation commonly encountered because of complex study designs.

The workshop materials drew inspiration from the r-linear-abacbs2018 workshop conducted by Monash Bioinformatics, as well as the QCBS R workshops conducted by Quebec Centre for Biodiversity.

Loading packages

# Install tidyverse package using CRAN
# install.packages(c("tidyverse", "nlme", "ggpubr", "MuMIn"))

# Install edgeR and limma packages using BiocManager
#if (!requireNamespace("BiocManager", quietly = TRUE)) {
  #install.packages("BiocManager")
#}
#BiocManager::install(c("edgeR","limma"))

library(tidyverse)  # working with data frames, plotting
library(edgeR)      # cpm, etc -- RNA-Seq normalization
library(limma)      # lmFit, etc -- fitting many models
library(nlme)       # lme
library(ggpubr)     #ggqqplot
library(MuMIn)      #AICc

Data used

Tooth growth in mouse embryos is studied using RNA-Seq. The RNA expression levels of several genes are examined in the cells that form the upper and lower first molars, in eight individual mouse embryos that have been dissected after different times of embryo development. The measurements are in terms of “Reads Per Million”, essentially the fraction of RNA in each sample belonging to each gene, times 1 million. This data was extracted from ARCHS4.

2. Simple linear regression (Single numerical predictor)

Load the data

The files we will use are csv comma-separated, so we will use the read_csv() function from the tidyverse.

Your turn 2.1: Download the data.zip file here. Unzip the file and store the content in the data folder in your working directory.

For the first part of the workshop, we will use the counts file called teeth.csv stored in the data folder. The path to the file should be data/teeth.csv. This file includes count information for three genes, which we will use up to Section 5.

Your turn 2.2: Load the data into your R working directory. The contents of the counts file will be stored in an object named teeth. Afterwards, explore the gene_smoc1 across different days using a scatter plot.

# read in counts file
teeth <- read_csv("data/teeth.csv")

ggplot(teeth,aes(x=day, y=gene_smoc1))+
  geom_point()+theme_bw()+ylab("Smoc1")

Exercise 1: Interpretation

Interpret the scatter plot created above.

Fitting a simple linear relationship

Let’s fit a simple linear regression model using the smoc1 gene counts as the dependent variable and the number of days as the independent variable. We can then explore the residual plots to identify any potential violations of the model assumptions.

Your turn 2.3: Run the following command line. From the output, identify if there are any assumption violations.

# fitting a simple linear regression
smoc1_fit <- lm(gene_smoc1 ~ day, data=teeth)
summary(smoc1_fit)

## 
## Call:
## lm(formula = gene_smoc1 ~ day, data = teeth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.780 -20.254  -1.399  23.652  35.448 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -13.238     89.349  -0.148    0.884
## day            4.212      5.485   0.768    0.455
## 
## Residual standard error: 25.13 on 14 degrees of freedom
## Multiple R-squared:  0.04041,    Adjusted R-squared:  -0.02813 
## F-statistic: 0.5896 on 1 and 14 DF,  p-value: 0.4553

par(mfrow=c(2,2))
plot(smoc1_fit)

We can observe a clear violation of linearity in the plot of residuals versus fitted values. To account for this, you might consider including a squared value of days in the linear model to see if this improves the residuals.

# fitting a simple linear regression
day2<-teeth$day^2
smoc1_fit_sq <- lm(gene_smoc1 ~ day+day2, data=teeth)
summary(smoc1_fit_sq)

par(mfrow=c(2,2))
plot(smoc1_fit_sq)

Your turn 2.4: Examine the code above, run and then look at the residual plots to see if the modification helped.

Exercise 2: Log2 transformation of gene counts

A common transformation used in RNA-seq models is the log2 transformation. Write your own R code to apply this transformation to the smoc1 gene counts.
Explore if the transformation improves our previous results.

Here onwards we will use log2 transformation on all gene counts before fitting any model.

Summary

In Section 2, we conducted an exploration of simple linear regression using the smoc1 gene from our dataset. Simple linear regression employs a single numerical independent variable to explain a corresponding numerical dependent variable. Additionally, we examined how data transformation can enhance the assumptions of our model.

3. Analysis of Variance (ANOVA) (Single categorical predictor)

Let’s fit an ANOVA model using the log2 transformed gene_pou3f3 gene counts as the dependent variable, and the tooth’s position (upper or lower) as the independent variable. Note that in this case, the tooth variable has only two groups (i.e., upper and lower). In such situations, you could use a t-test to compare the two groups. However, we will use ANOVA to explore these two groups to demonstrate the use of ANOVA.

Your turn 3.1: First, you will explore the scatter plot and convert the tooth variable into a factor variable. Before fitting the ANOVA model, examine the QQ plot to check the normality of the dependent variable.

# converting to a factor
teeth$tooth <- factor(teeth$tooth, c("lower","upper"))

ggplot(teeth, aes(x=tooth, y=log2(gene_pou3f3))) + 
  geom_point()+
  theme_bw()

ggqqplot(teeth, x="log2(gene_pou3f3)", facet.by = "tooth")

We can run our ANOVA in R using different functions. The most basic and common functions we can use are aov() and lm(). Because ANOVA is a type of linear model, we can use the lm() function.

Your turn 3.2: Let’s see what lm() produces for log2 transformed gene_pou3f3.

#ANOVA using lm function
aov.lm_pou3f3 <- lm(log2(gene_pou3f3) ~ tooth, data=teeth)
summary(aov.lm_pou3f3)

## 
## Call:
## lm(formula = log2(gene_pou3f3) ~ tooth, data = teeth)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57445 -0.39358 -0.02223  0.40193  0.65208 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.2574     0.1659   19.63 1.38e-11 ***
## toothupper    3.7434     0.2346   15.96 2.24e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4693 on 14 degrees of freedom
## Multiple R-squared:  0.9479, Adjusted R-squared:  0.9441 
## F-statistic: 254.6 on 1 and 14 DF,  p-value: 2.244e-10

The F-statistic is the test statistic for ANOVA and is a combination of the sums of squares described above. The associated p-value can help provide a significance interepretation on the F-statistic. That is, a \(p-value<0.05\) tells us that at least one group mean differs from another at the \(\alpha=0.05\) level of significance.

Your turn 3.3: Let’s explore aov() for log2 transformed gene_pou3f3.

# Compute the analysis of variance
aov_pou3f3 <- aov(log2(gene_pou3f3) ~ tooth, data=teeth)
# Summary of the analysis
summary(aov_pou3f3)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## tooth        1  56.05   56.05   254.6 2.24e-10 ***
## Residuals   14   3.08    0.22                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When using the aov() function, if the p-value is less than 0.05, you would want to identify which groups significantly differ from each other. To do this, you could perform multiple comparisons. This is an analysis conducted after ANOVA that helps us quantify the differences between groups. A very common test for this purpose is Tukey’s Honest Significant Differences (HSD). The Tukey HSD procedure conducts a pairwise comparison of all possible combinations of groups, testing for significant differences between their means, while adjusting the p-value.

Your turn 3.4: Let’s explore multiple comparisons on the aov_pou3f3 model.

TukeyHSD(aov_pou3f3)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = log2(gene_pou3f3) ~ tooth, data = teeth)
## 
## $tooth
##                 diff     lwr     upr p adj
## upper-lower 3.743405 3.24018 4.24663     0

Summary

In Section 3, we investigated ANOVA using the pou3f3 gene from our dataset. ANOVA is utilized to compare three or more groups and thus, it employs a single categorical independent variable to examine a numerical dependent variable. Initially, we verified the normality assumption for the dependent variable within each group. If the ANOVA results are significant, which indicates that at least one group significantly differs from another, then post-hoc multiple comparisons can be conducted as demonstrated in Subsection 3.4.

4. Linear mixed model

We will use the lme() function to fit linear regression models on the log2 transformed gene_ace data, treating day as a fixed effect and tooth as a random effect.

Your turn 4.1: Fit two linear mixed models: the first one assumes that the random effects have the same slope but different intercepts, while the second one assumes that the random effects have different intercepts and slopes. Compare the two models based on their corrected Akaike Information Criterion (AIC) values.

#Different intecept for random effect
lmm_acefit2<-nlme::lme(log2(gene_ace) ~ day , 
                       random= (~1 | tooth), data = teeth)

#Different intecept and slope for random effect
lmm_acefit3<-nlme::lme(log2(gene_ace) ~ day , 
                       random= (~1+ day| tooth), data = teeth)

# Compare two models
MuMIn::AICc(lmm_acefit2,lmm_acefit3)

summary(lmm_acefit2)

## Linear mixed-effects model fit by REML
##   Data: teeth 
##       AIC      BIC  logLik
##   3.24186 5.798089 2.37907
## 
## Random effects:
##  Formula: ~1 | tooth
##         (Intercept)  Residual
## StdDev:   0.1909977 0.1510182
## 
## Fixed effects:  log2(gene_ace) ~ day 
##                  Value Std.Error DF   t-value p-value
## (Intercept) -13.227623 0.5535735 13 -23.89497       0
## day           0.986802 0.0329549 13  29.94403       0
##  Correlation: 
##     (Intr)
## day -0.967
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -1.18379736 -0.76118373 -0.00285813  0.36112191  2.30557892 
## 
## Number of Observations: 16
## Number of Groups: 2

Summary

In Section 4, we investigated linear mixed models using the ace gene from our dataset. Linear mixed models are useful when there are blocking variables in the data that we do not want to evaluate the effects of. Here, we used the lme() function from thenlme R package to fit the linear mixed model. However, you can also use the lmer() function from the lme4 R package.

5. Testing many genes with limma

In this section we look at fitting linear models to all genes in the data, which includes a total of 32,544 genes. We will use the package limma from Bioconductor. This is a very brief demonstration, and there is much more to this package. See the excellent usersguide.pdf

Load, normalize, log transform

Your turn 5.1: Use the following R code to load the data, convert the read counts to Reads Per Million (RPM), and then apply a log2 transformation to the results. There are some concepts that we will only briefly touch on: trimmed mean of M-values (TMM) normalization is employed as a minor adjustment to the total number of reads in each sample (i.e., the library size). A small constant is added to the counts to prevent the calculation of log2(0). These steps are described in more detail in the edgeR and limma manuals.

counts_df <- read_csv("data/teeth-read-counts.csv")
counts <- as.matrix(counts_df[,-1])
rownames(counts) <- counts_df[[1]]

#calculate TMM normalization factors
dgelist <- calcNormFactors(DGEList(counts))

#get the normalized counts
log2_cpms <- cpm(dgelist, log=TRUE, prior.count=0.25)

There is chance of detecting differential expression in genes with very low read counts. Including these genes will require a larger False Discovery Rate correction, and also confuses limma’s Empirical Bayes parameter estimation. Let’s only retain genes with around 5 reads per sample or more.

keep <- rowSums(counts>=5)==16
log2_cpms_filtered <- log2_cpms[keep,]

Exercise 3: Examine the filtered genes

How many genes were present before the filtering, and how many remain after the filtering?

Fitting different models and exploring top genes

Your turn 5.2: We use limma to fit a linear model to each gene with day as independent variable. The same model formula will be used in each case. limma does not automatically convert a formula into a model matrix, so we have to do this step manually. Here we are using a model formula following a different linear trend over time.

X <- model.matrix(~ day, data=teeth)
X

##    (Intercept)  day
## 1            1 14.5
## 2            1 14.5
## 3            1 15.0
## 4            1 15.0
## 5            1 15.5
## 6            1 15.5
## 7            1 16.0
## 8            1 16.0
## 9            1 16.5
## 10           1 16.5
## 11           1 17.0
## 12           1 17.0
## 13           1 17.5
## 14           1 17.5
## 15           1 18.0
## 16           1 18.0
## attr(,"assign")
## [1] 0 1

fit <- lmFit(log2_cpms_filtered, X)
fit$coefficients[1:5,]

##               (Intercept)         day
## 0610007P14Rik   6.4967648 -0.07408778
## 0610009B22Rik   3.4651959  0.01200518
## 0610009L18Rik  -0.1494263  0.08878789
## 0610009O20Rik   6.7599797 -0.04497871
## 0610010F05Rik   5.9782132 -0.07663791

efit <- eBayes(fit)
topTable(efit, n=5, coef = c("day"))

Mixed effect models with limma

Your turn 5.3: We will use the below code to model to tooth as a blocking variable.

block <- teeth$tooth
design <- model.matrix(~ day, data = teeth)
dup_corrs <- duplicateCorrelation(log2_cpms_filtered,
                                  design = design, block = block)
fit <- lmFit(log2_cpms_filtered, X, block = block,
             correlation = dup_corrs$consensus.correlation)
efit <- eBayes(fit, trend=TRUE)
topTable(efit, n=5, coef = c("day"))

Exercise 4: Fit linear models with limma with more independent variables.

Similar to section 5.2, use ‘limma’ to fit a linear model for each gene, with ‘day’, ‘tooth’, and their interactions as independent variables.
Extract the coefficient values for the first five genes.
Check top genes for each coefficient.

Your turn 5.4: We will now plot the significant genes identified in Exercise 4 along with the log fold change and expression level.

all_results <- topTable(efit1,coef = c("day"), n=Inf)

significant <- all_results$adj.P.Val <= 0.05

ggplot(all_results, aes(x=AveExpr, y=logFC)) + 
  geom_point(size=0.1, color="grey") +
  geom_point(data=all_results[significant,], size=0.1)

Your turn 5.5: We will now plot the top gene.

# Filter Tfap2b gene
Tfap2b <- log2_cpms["Tfap2b",]

# Fitting a linear model
Tfap2b_fit <- lm(Tfap2b ~  day*tooth, data=teeth)

# creating a set of values to be used to fit the line
more_data <- expand.grid(
  day=seq(14.3,18.2,by=0.01),
  tooth=as_factor(c("lower","upper")))

p <- ggplot(teeth,aes(x=day,group=tooth))
more_ci <- cbind(more_data, 
                 predict(Tfap2b_fit, more_data, interval="confidence"))
p <- p + 
      geom_ribbon(data=more_ci, aes(ymin=lwr,ymax=upr),alpha=0.1) + 
      geom_line(data=more_ci,aes(y=fit,color=tooth))

p + geom_point(aes(y=Tfap2b,color=tooth)) +
    ylab("Tfap2b")

Summary

In Section 5, we delved into the limma package, a powerful tool for the analysis of gene expression data, particularly from microarray and RNA-Seq experiments. We began by fitting linear models to our data using limma. This approach allowed us to model the relationship between our gene expression data and independent variables. The linear models provided us with estimates of the effects of these variables on gene expression, as well as statistical tests for the significance of these effects. In addition to standard linear models, we also fitted linear mixed effect models with limma. These models are particularly useful when our data has blocking variables. We used the results of our models to explore the top genes in our dataset. These are the genes that showed the most significant effects.

6. Session Info

sessionInfo()

## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Australia/Melbourne
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] MuMIn_1.47.5    ggpubr_0.6.0    nlme_3.1-164    edgeR_4.0.16   
##  [5] limma_3.58.1    lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1  
##  [9] dplyr_1.1.4     purrr_1.0.2     readr_2.1.5     tidyr_1.3.1    
## [13] tibble_3.2.1    ggplot2_3.5.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5      xfun_0.43         bslib_0.7.0       rstatix_0.7.2    
##  [5] lattice_0.22-6    tzdb_0.4.0        vctrs_0.6.5       tools_4.3.2      
##  [9] generics_0.1.3    stats4_4.3.2      parallel_4.3.2    fansi_1.0.6      
## [13] highr_0.10        pkgconfig_2.0.3   Matrix_1.6-5      lifecycle_1.0.4  
## [17] farver_2.1.1      compiler_4.3.2    statmod_1.5.0     munsell_0.5.1    
## [21] carData_3.0-5     htmltools_0.5.8.1 sass_0.4.9        yaml_2.3.8       
## [25] pillar_1.9.0      car_3.1-2         crayon_1.5.2      jquerylib_0.1.4  
## [29] cachem_1.0.8      abind_1.4-5       tidyselect_1.2.1  locfit_1.5-9.9   
## [33] digest_0.6.35     stringi_1.8.3     splines_4.3.2     labeling_0.4.3   
## [37] fastmap_1.1.1     grid_4.3.2        colorspace_2.1-0  cli_3.6.2        
## [41] magrittr_2.0.3    utf8_1.2.4        broom_1.0.5       withr_3.0.0      
## [45] scales_1.3.0      backports_1.4.1   bit64_4.0.5       timechange_0.3.0 
## [49] rmarkdown_2.26    bit_4.0.5         ggsignif_0.6.4    hms_1.1.3        
## [53] evaluate_0.23     knitr_1.45        rlang_1.1.3       Rcpp_1.0.12      
## [57] glue_1.7.0        rstudioapi_0.16.0 vroom_1.6.5       jsonlite_1.8.8   
## [61] R6_2.5.1

Linear Models

Saritha Kodikara

29 April 2024

1. Introduction to linear models

Loading packages

Data used

2. Simple linear regression (Single numerical predictor)

Load the data

Exercise 1: Interpretation

Answer

Fitting a simple linear relationship

Exercise 2: Log2 transformation of gene counts

Answer

Summary

3. Analysis of Variance (ANOVA) (Single categorical predictor)

Summary

4. Linear mixed model

Summary

5. Testing many genes with limma

Load, normalize, log transform

Exercise 3: Examine the filtered genes

Answer

Fitting different models and exploring top genes

Mixed effect models with limma

Exercise 4: Fit linear models with limma with more independent variables.

Answer

Summary

6. Session Info