Before you start 


Remember to create a working directory, and use an R script in Rstudio to type your command lines, as we covered in our previous workhop

Packages

Seurat

Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single-cell transcriptomic measurements, and to integrate types of single-cell data. After this short introduction workshop you can read Seurat offical website to dive a bit deeper.

SeuratData

SeuratData is a mechanism for distributing datasets in the form of Seurat objects using R’s internal package and data management systems. It represents an easy way for users to get access to datasets that are used in the Seurat vignettes.

Installation

(If you have installed them before workshop, you do not need to run this block of code.)

install.packages('Seurat')

if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

remotes::install_github("satijalab/seurat-data", quiet = TRUE)

1. Load the packages

We will use Seurat V5, which was published last year. Seurat V5 has gradually gained popularity due to its faster running speed. However, Seurat V5 has some data structure changes compared with older versions (V3 & V4), which may cause some old codes to fail to run. More details can be found on this website.

library(Seurat)
options(Seurat.object.assay.version = "v5")
library(SeuratData)

Load the data

We will use the pbmcsca dataset. This public dataset includes single-cell RNA-seq data of human peripheral blood mononuclear cells (PBMCs) using multiple sequencing platforms. Only data that passed quality control are included in the pbmcsca dataset.

InstallData("pbmcsca")

data("pbmcsca")
pbmcsca <- UpdateSeuratObject(pbmcsca)

(You can ignore the warning message).

table(pbmcsca$Method)
## 
##   10x Chromium (v2) 10x Chromium (v2) A 10x Chromium (v2) B   10x Chromium (v3) 
##                3362                3222                3222                3222 
##            CEL-Seq2            Drop-seq             inDrops            Seq-Well 
##                 526                6584                6584                3773 
##          Smart-seq2 
##                 526

The table indicates the number of cells sequenced using different platforms.

In this workshop, we will consider two scRNA-seq data sets generated from two platforms, 10x Chromium (v2) & 10x Chromium (v3). PBMCs were sequenced from two patients.

Extract raw counts

The raw count matrix and the information of each gene and each cell are saved in a Seurat object pbmc_10x_v2 and pbmc_10x_v3 independently. In addition, we combine the two sequencing results without any processing and store them in the Seurat object pbmc_combo:

pbmc_10x_v2 <- pbmcsca[,pbmcsca$Method == "10x Chromium (v2)"]
pbmc_10x_v3 <- pbmcsca[,pbmcsca$Method == "10x Chromium (v3)"]
pbmc_combo <- pbmcsca[,pbmcsca$Method %in% c("10x Chromium (v2)", "10x Chromium (v3)")]

Data Structure of a Seurat object

The Seurat object is a representation of single-cell expression data for R; each Seurat object revolves around a set of cells and consists of one or more Assay objects, or individual representations of expression data (eg. RNA-seq, ATAC-seq, etc). These assays can be reduced from their high-dimensional state to a lower-dimension state and stored as DimReduc objects.

Seurat objects also store additional metadata, both at the cell and feature level (stored within individual assays). The object is designed to be as self-contained as possible, and easily extendable to new methods.

We use Seurat object pbmc_10x_v3 as an example.

The raw count matrix of scRNA-seq experiment is here:

pbmc_10x_v3@assays$RNA

ⓘ Count matrix in Seurat A count matrix from a Seurat object displays the genes in rows and the cells in columns.

The information and labels of each cell is here:

pbmc_10x_v3@meta.data

The dimension reduction information is stored here:

pbmc_10x_v3@reductions

2. Analysis of single-cell RNA-seq data from a single experiment

Let’s start with a simple case: the data generated using the the 10x Chromium (v3) platform (i.e the Seurat object pbmc_10x_v3.

Let’s first take a look at how many cells and genes passed Quality Control (QC).

ⓘ Count matrix in Seurat A count matrix from a Seurat object displays the genes in rows and the cells in columns.

dim(pbmc_10x_v3)
## [1] 33694  3222

Here we have 3,222 cells with 33,694 genes that passed QC.

Normalization

We use the Seurat function NormalizeData() to normalize raw counts. By default, Seurat implements a global-scaling normalization method called LogNormalize that normalizes the gene expression measurements for each cell by the total number of counts accross all genes, and multiplies this by a scaling factor (10,000 by default), then log transforms the result:

pbmc_10x_v3 <- NormalizeData(object=pbmc_10x_v3, normalization.method = "LogNormalize", 
    scale.factor = 10000)

Feature Selection

We use the Seurat function FindVariableFeatures() to select highly variable genes (HVGs) which have most of useful information for downstream analysis.

Here we select the top 3,000 most variable genes to save some computing time. In practice, you can select more genes (5,000 or more) to preserve more information from the scRNA-seq experiment:

pbmc_10x_v3 <- FindVariableFeatures(pbmc_10x_v3, selection.method = "vst", nfeatures = 3000)

After feature selection, the downstream Seurat functions will only use these HVGs, unless otherwise specified.

Scaling

Many downstream statistical analyses requires data matrix to be centred and scaled. We use the Seurat function ScaleData() fro this

pbmc_10x_v3 <- ScaleData(pbmc_10x_v3)

(Note that ScaleData() can also be used to remove some unwanted variation. The sources of variation may include, for example, technical noise, batch effects, or even biological sources of variation (cell cycle stage). As suggested by Buettner et al, 2015, regressing these signals out of the analysis can improve downstream dimensionality reduction and clustering. This can be done by specifying the vars.to.regress variable in ScaleData(), eg ScaleData(pbmc_10x_v3, vars.to.regress = nUMI) can regress out the effects of nUMI ie library size.)

The scaled z-scored residuals of these models are stored in the scale.data slot, and are used for downstream dimension reduction and clustering tasks.

Principal component analysis (PCA)

We perform PCA on the scaled data. By default, HVGs in pbmc_10x_v3@var.genes are used as input, but they can be defined by specifying the argument pc.genes.

Performing dimensionality reduction on highly variable genes can improve performance. However, with UMI data – particularly after regressing out technical variables, we often see that PCA returns similar (albeit slower) results when run on much larger subsets of genes, including the whole transcriptome.

We run PCA on top 3,000 most variable genes:

pbmc_10x_v3 <- RunPCA(pbmc_10x_v3, features = VariableFeatures(object = pbmc_10x_v3))
## PC_ 1 
## Positive:  LYZ, FCN1, CLEC7A, CPVL, SERPINA1, SPI1, S100A9, AIF1, NAMPT, CSTA 
##     CTSS, MAFB, MPEG1, NCF2, FGL2, VCAN, S100A8, TYMP, CST3, LST1 
##     CYBB, CFD, FCER1G, TGFBI, SLC11A1, GRN, CD14, SLC7A7, PSAP, RNF130 
## Negative:  IL32, CCL5, TRBC2, TRAC, CST7, CD69, RORA, CTSW, CD247, ITM2A 
##     TRBC1, C12orf75, IL7R, CD8A, GZMA, NKG7, CD7, LDHB, GZMH, CD6 
##     CD8B, PRF1, BCL11B, LYAR, FGFBP2, HOPX, LTB, TCF7, KLRD1, ZFP36L2 
## PC_ 2 
## Positive:  NRGN, PF4, SDPR, HIST1H2AC, MAP3K7CL, GNG11, PPBP, GPX1, TUBB1, SPARC 
##     CLU, PGRMC1, RGS18, FTH1, MARCH2, TREML1, HIST1H3H, ACRBP, AP003068.23, NCOA4 
##     PRKAR2B, CMTM5, CD9, CA2, TAGLN2, TSC22D1, CTTN, HIST1H2BJ, MTURN, TMSB4X 
## Negative:  EEF1A1, TMSB10, RPS2, RPS12, RPS18, RPLP1, RPS8, IL32, S100A4, RPLP0 
##     PFN1, NKG7, CYBA, ARL4C, HSPA8, CST7, HSP90AA1, ZFP36L2, S100A6, ANXA1 
##     CTSW, CORO1A, LDHA, CD247, GZMA, CALR, YBX1, S100A10, CFL1, ARPC3 
## PC_ 3 
## Positive:  CCL5, TMSB4X, SRGN, NKG7, ACTB, CST7, S100A4, CTSW, ANXA1, GZMH 
##     FGFBP2, PRF1, GZMA, IL32, GZMB, C12orf75, KLRD1, GNLY, GAPDH, CD247 
##     MYO1F, NRGN, ID2, ACTG1, CD7, PF4, SDPR, PPBP, HIST1H2AC, CD8A 
## Negative:  CD79A, HLA-DQA1, MS4A1, BANK1, IGHM, LINC00926, IGHD, TNFRSF13C, HLA-DQB1, CD74 
##     IGKC, HLA-DRA, BLK, CD83, ADAM28, CD22, HLA-DRB1, CD37, NFKBID, CD79B 
##     P2RX5, VPREB3, IGLC2, JUND, FCER2, TCOF1, GNG7, COBLL1, RALGPS2, CD19 
## PC_ 4 
## Positive:  IL7R, S100A12, VCAN, S100A8, SLC2A3, CD14, CSF3R, S100A9, MS4A6A, CD93 
##     IER3, FOS, RCAN3, ZFP36L2, EGR1, RP11-1143G9.4, RGCC, MGST1, LEPROTL1, CLEC4E 
##     SOCS3, VIM, THBS1, SELL, CXCL8, LTB, SGK1, TNFAIP3, MAL, IRF2BP2 
## Negative:  FCGR3A, CDKN1C, HES4, CSF1R, CKB, RHOC, ZNF703, MTSS1, TCF7L2, SIGLEC10 
##     MS4A7, FAM110A, HLA-DPA1, IFITM2, CTSL, PLD4, BATF3, SLC2A6, IFITM3, ABI3 
##     GZMB, NKG7, HLA-DPB1, FGFBP2, CTD-2006K23.1, CD79B, BID, LRRC25, GZMH, CTSC 
## PC_ 5 
## Positive:  GZMB, FGFBP2, GZMH, PRF1, CST7, NKG7, GNLY, KLRD1, GZMA, CTSW 
##     CCL5, SPON2, ADGRG1, PRSS23, KLRF1, CCL4, ZEB2, CLIC3, S1PR5, ITGAM 
##     HOPX, CEP78, CYBA, TRDC, C1orf21, MATK, TTC38, C12orf75, HLA-DRB1, HLA-DPB1 
## Negative:  IL7R, LTB, LEPROTL1, PAG1, MAL, CDKN1C, RCAN3, LEF1, TCF7, NOSIP 
##     CAMK4, LDHB, TOB1, TRABD2A, HES4, CSF1R, CKB, JUNB, TMEM123, ZFP36L2 
##     NELL2, RNASET2, CD28, BCL11B, TNFRSF25, CD27, CTSL, CD40LG, ZNF703, AQP3

The PCA result is stored in pbmc_10x_v3@reduction. The output information tells us which genes are positively, or negatvely correlated with the top 5 principal components.

ⓘ Choosing parameters in PCA (genes and pcs) How many genes to choose for PCA and how many PCs to use for downstream analysis is a complex and important issue that is out of the scope of today’s workshop.

But we highly recommend you to read this document before analyzing your own scRNA-seq dat, where the authors show how to use some visualization methods to guide your choice.

2D Visualization with UMAP

Uniform manifold approximation and project (UMAP) is a tool for visualising higher dimensional data. It takes PCs as input. Here we use the top 30 PCs:

pbmc_10x_v3 <- RunUMAP(pbmc_10x_v3, dims=1:30)

We draw the UMAP plot with the Dimplot() function by specifying the argument reduction = 'umap'. We can colour points (cells) by using the metadata information (pbmc_10x_v3@meta.data) by specifying the argument group.by. For example,

DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')

Exercise 1: Plot different UMAP with a different number of PCs. Do you observe any changes? What about using different tuning parameters

Show solutions

pbmc_10x_v3 <- RunUMAP(pbmc_10x_v3, dims = 1:2)
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')
pbmc_10x_v3 <- RunUMAP(pbmc_10x_v3, dims = 1:10)
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')
pbmc_10x_v3 <- RunUMAP(pbmc_10x_v3, dims = 1:30)
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')
pbmc_10x_v3 <- RunUMAP(pbmc_10x_v3, dims = 1:50)
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')

Clustering

By default, Seurat uses the Louvain algorithm.

The Louvain algorithm requires a neighbor graph as input. Therefore, we first run the FindNeighbors() function first. Note that FindNeighbors() is also based on PCs, here we use all 30 top PCs:

pbmc_10x_v3 <- FindNeighbors(pbmc_10x_v3, dims = 1:30)

We then run the FindClusters() function for clustering. The argument Algorithm=1 means we are using the Louvain algorithm for clustering:

pbmc_10x_v3 <- FindClusters(object = pbmc_10x_v3, resolution = 0.3, algorithm=1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 3222
## Number of edges: 134293
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9386
## Number of communities: 10
## Elapsed time: 0 seconds

The key tunning parameter here is resolution, which determines how many clusters we get. The number of clusters to have depends on biology. It can be, eg, the number of cell types you expect to see in the sample.

Other options are available such as Algorithm=4 for the Leiden algorithm, but you have to install Python and some Python packages first. See ??FindClusters for more details.

We can use UMAP to visualize the clustering results. The argument group.by='seurat_clusters' is used to colour the cells according to the clustering results.

DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'seurat_clusters')

Challenge 1: Characterising cluster 3 based on gene expression

According to the UMAP representation above, it seems that cluster is different from the other cells and cell clusters. Let’s dig a bit further to characterize this cluster.

First, we look for marker genes that were significantly differential expressed in Cluster 3 compared with other clusters. We use the FindAllMarkers() function to identify marker genes for each cluster.

This command line takes some time to run:

pbmc_10x_v3.markers <- FindAllMarkers(pbmc_10x_v3, min.pct = .25, logfc.threshold = .25)

We then extract the top 5 marker of cluster 3 with the smallest p-values:

cluster3.markers <- pbmc_10x_v3.markers[which(pbmc_10x_v3.markers$cluster==3),]
cluster3.markers[1:5, ]

We then search these marker genes in this database website https://panglaodb.se/search.html. We need to remove the dash (‘-’) before searching.

Based on the result provided by the database, it looks like cluster 3 might be a cluster of B cells. We can check whether this is the case by checking the cell-type ground truth information pre-saved in the pbmc_10x_v3 object, and plotting this as UMAP:

DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'CellType')

Yes, we are correct, cluster 3 is annotated as B cells.

Very often we might have a few genes of interest, say CD14, a marker of CD14+ monocytes. Then we can look for clusters where CD14 is up-regulated. This can be done by:

Idents(object = pbmc_10x_v3) <- "seurat_clusters"
VlnPlot(pbmc_10x_v3, features = 'CD14')

So, cluster 4 is high likely to be annotated as CD14+ monocytes.

ⓘ Ground-truth about cell-types Typically, cell-type information is unknown in single cell data, as this is what we are trying to find out! But publicly available datasets may have been previously annotated using the technique higlighted above, or using other reference datasets.

3. Automatic cell type annotation for pbmc_10x_v2

Suppose we have annotated pbmc_10x_v3. We can then use it as a reference to annotate a new PBMC dataset (eg pbmc_10x_v2). This can be done computationally, so that we do not need to go through the process of clustering, finding differentially expressed genes and looking up gene names.

Remember: pbmc_10x_v2 and pbmc_10x_v3 include different cells from different patients and sequenced using platforms.

Pre-processing and visualization for pbmc_10x_v2

Exercise 2: use the functions presented in Section 2 to normalise, select highly variable genes, scale, run a PCA and visualize the data using UMAP.

Cell-type annotation

Using pbmc_10x_v3 as a reference

In Seurat we can learn cell type annotation results from one scRNA-seq data to provide cell type annotation for another scRNA-seq dataset.

We use the FindTransferAnchors() function to predict which cells in two datasets are of the same cell type. Here we use pbmc_10x_v3 as the reference data set, and pbmc_10x_v2 is the query data.

We specify that we use the top 30 PCs:

anchors <- FindTransferAnchors(reference = pbmc_10x_v3, query = pbmc_10x_v2, 
                               dims = 1:30)

We then assign the cell-type of the cells in pbmc_10x_v2 using the TransferData() function:

predictions <- TransferData(anchorset = anchors, refdata = pbmc_10x_v3$CellType, 
                                 dims = 1:30)

Seurat will provide a table with the most likely cell type and the probability of each cell type. We assign the most likely cell type to the pbmc_10x_v2 object:

pbmc_10x_v2@meta.data$CellType_Prediction <- predictions$predicted.id 

We then use UMAP to visualize this annotation:

DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'CellType_Prediction')

Using Azimuth (a website tool)

Azimuth is a web application that uses an annotated reference dataset to automate the processing, analysis, and interpretation of a new single-cell RNA-seq or ATAC-seq experiment.

The input of Azimuth can be a Seurat object. In order to reduce the size of the uploaded file, we retain only the useful information for cell type annotation with the following command lines:

DefaultAssay(pbmc_10x_v2) <- "RNA"
pbmc_10x_v2_simple <- DietSeurat(object = pbmc_10x_v2, assays = "RNA")
saveRDS(pbmc_10x_v2_simple, 'pbmc_10x_v2.Rds')

An Rds file called pbmc_10x_v2.Rds is saved in your working directory. You can check where is your working directory by using the getwd() function.

Exercise 3 Open the Azimuth website: https://azimuth.hubmapconsortium.org/.

For cell type annotation (see also the slides here):

  1. Find ‘References for scRNA-seq Queries’ -> Then find ‘Human - PBMC’ -> click ‘Go to App’

  2. Click ‘Browse’ -> find ‘pbmc_10x_v2.Rds’ at your working directory -> Click ‘Open’

  3. Waiting for the Rds file upload to the website

  4. Click ‘Map cells to reference’

  5. Click ‘Download Results’

  6. Find ‘Predicted cell types and scores (TSV)’

  7. Click ‘Download’ to get the cell type annotation result stored in azimuth_pred.tsv

  8. Copy the tsv file (azimuth_pred.tsv) to your R working directory

The tsv file has the same data structure of Seurat annotation result (predictions). We read the tsv file, then add the annotation result to the pbmc_10x_v2 object with the AddMetaData() function:

azimuth_predictions <- read.delim('azimuth_pred.tsv', row.names = 1)
pbmc_10x_v2 <- AddMetaData(object = pbmc_10x_v2, metadata = azimuth_predictions)

We use UMAP to visualize the cell type annotation result from Azimuth.

DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'predicted.celltype.l2')

Comparing annotations

Here is the cell type annotation results provided by the data provider:

DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'CellType')

Exercise 4 Assuming the ground truth of the cell type annotation is from the provider, which cell type annotation is better from Seurat or Azimuth? Why?

Show solutions

4. Integration of two data sets

We have combined the two data sets without any processing in the Seurat object pbmc_combo. In this section we will focus on combining these data sets (note: the process would be similar if you are combining different patients).

Challenge 2: Why can’t we analyze pbmc_combo directly?

Exercise 5: use the functions presented in Section 2 to normalise, select highly variable genes, scale, run a PCA and visualize the data using UMAP on pbmc_combo. Highlight the issues in this basic analysis where we are combining two independent data sets.

Removing batch effects

In Seurat, we use the FindIntegrationAnchors() function to identify cells with similar biological information between two data sets. The difference between cells in two data sets with similar biological information is considered as batch effect:

anchor_combo <- FindIntegrationAnchors(object.list = list(pbmc_10x_v2, pbmc_10x_v3), dims = 1:30)

We then use the IntegrateData() function to remove batch effects and integrate the two data sets:

pbmc_combo <- IntegrateData(anchorset = anchor_combo, dims = 1:30)

Exercise 6: Below is the code to visualize the batch corrected data using UMAP (we need to scale the data first). What do you observe?

# Scaling
pbmc_combo.all.genes <- rownames(pbmc_combo)
pbmc_combo <- ScaleData(pbmc_combo, features = pbmc_combo.all.genes)

# PCA
pbmc_combo <- RunPCA(pbmc_combo, features = VariableFeatures(object = pbmc_combo))

# UMAP
pbmc_combo <- FindNeighbors(pbmc_combo, dims = 1:30)
pbmc_combo <- RunUMAP(pbmc_combo, dims=1:30)
DimPlot(pbmc_combo, reduction = "umap", label = FALSE, group.by = 'Method')

DimPlot(pbmc_combo, reduction = "umap", label = FALSE, group.by = 'CellType')

Show solutions