Remember to create a working directory, and use an R script in Rstudio to type your command lines, as we covered in our previous workhop
Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single-cell transcriptomic measurements, and to integrate types of single-cell data. After this short introduction workshop you can read Seurat offical website to dive a bit deeper.
SeuratData
is a mechanism for distributing datasets in
the form of Seurat objects using Râs internal package and data
management systems. It represents an easy way for users to get access to
datasets that are used in the Seurat vignettes.
(If you have installed them before workshop, you do not need to run this block of code.)
install.packages('Seurat')
if (!requireNamespace("remotes", quietly = TRUE)) {
install.packages("remotes")
}
remotes::install_github("satijalab/seurat-data", quiet = TRUE)
We will use Seurat V5, which was published last year. Seurat V5 has gradually gained popularity due to its faster running speed. However, Seurat V5 has some data structure changes compared with older versions (V3 & V4), which may cause some old codes to fail to run. More details can be found on this website.
library(Seurat)
options(Seurat.object.assay.version = "v5")
library(SeuratData)
We will use the pbmcsca
dataset. This public dataset
includes single-cell RNA-seq data of human peripheral blood mononuclear
cells (PBMCs) using multiple sequencing platforms. Only data that
passed quality control are included in the pbmcsca
dataset.
InstallData("pbmcsca")
data("pbmcsca")
pbmcsca <- UpdateSeuratObject(pbmcsca)
(You can ignore the warning message).
table(pbmcsca$Method)
##
## 10x Chromium (v2) 10x Chromium (v2) A 10x Chromium (v2) B 10x Chromium (v3)
## 3362 3222 3222 3222
## CEL-Seq2 Drop-seq inDrops Seq-Well
## 526 6584 6584 3773
## Smart-seq2
## 526
The table indicates the number of cells sequenced using different platforms.
In this workshop, we will consider two scRNA-seq data sets generated from two platforms, 10x Chromium (v2) & 10x Chromium (v3). PBMCs were sequenced from two patients.
The raw count matrix and the information of each gene and each cell
are saved in a Seurat object pbmc_10x_v2
and
pbmc_10x_v3
independently. In addition, we combine the two
sequencing results without any processing and store them in the Seurat
object pbmc_combo
:
pbmc_10x_v2 <- pbmcsca[,pbmcsca$Method == "10x Chromium (v2)"]
pbmc_10x_v3 <- pbmcsca[,pbmcsca$Method == "10x Chromium (v3)"]
pbmc_combo <- pbmcsca[,pbmcsca$Method %in% c("10x Chromium (v2)", "10x Chromium (v3)")]
The Seurat object is a representation of single-cell expression data
for R; each Seurat object revolves around a set of cells and consists of
one or more Assay objects, or individual representations of expression
data (eg. RNA-seq, ATAC-seq, etc). These assays can be reduced from
their high-dimensional state to a lower-dimension state and stored as
DimReduc
objects.
Seurat objects also store additional metadata, both at the cell and feature level (stored within individual assays). The object is designed to be as self-contained as possible, and easily extendable to new methods.
We use Seurat object pbmc_10x_v3
as an example.
The raw count matrix of scRNA-seq experiment is here:
pbmc_10x_v3@assays$RNA
â Count matrix in Seurat A count matrix from a Seurat object displays the genes in rows and the cells in columns.
The information and labels of each cell is here:
pbmc_10x_v3@meta.data
The dimension reduction information is stored here:
pbmc_10x_v3@reductions
Letâs start with a simple case: the data generated using the the 10x
Chromium (v3) platform (i.e the Seurat object
pbmc_10x_v3
.
Letâs first take a look at how many cells and genes passed Quality Control (QC).
â Count matrix in Seurat A count matrix from a Seurat object displays the genes in rows and the cells in columns.
dim(pbmc_10x_v3)
## [1] 33694 3222
Here we have 3,222 cells with 33,694 genes that passed QC.
We use the Seurat function NormalizeData()
to normalize
raw counts. By default, Seurat implements a global-scaling normalization
method called LogNormalize
that normalizes the gene
expression measurements for each cell by the total number of counts
accross all genes, and multiplies this by a scaling factor (10,000 by
default), then log transforms the result:
pbmc_10x_v3 <- NormalizeData(object=pbmc_10x_v3, normalization.method = "LogNormalize",
scale.factor = 10000)
We use the Seurat function FindVariableFeatures()
to
select highly variable genes (HVGs) which have most of useful
information for downstream analysis.
Here we select the top 3,000 most variable genes to save some computing time. In practice, you can select more genes (5,000 or more) to preserve more information from the scRNA-seq experiment:
pbmc_10x_v3 <- FindVariableFeatures(pbmc_10x_v3, selection.method = "vst", nfeatures = 3000)
After feature selection, the downstream Seurat functions will only use these HVGs, unless otherwise specified.
Many downstream statistical analyses requires data matrix to be
centred and scaled. We use the Seurat function ScaleData()
fro this
pbmc_10x_v3 <- ScaleData(pbmc_10x_v3)
(Note that ScaleData()
can also be used to remove some
unwanted variation. The sources of variation may include, for example,
technical noise, batch effects, or even biological sources of variation
(cell cycle stage). As suggested by Buettner et al,
2015, regressing these signals out of the analysis can improve
downstream dimensionality reduction and clustering. This can be done by
specifying the vars.to.regress
variable in
ScaleData()
, eg
ScaleData(pbmc_10x_v3, vars.to.regress = nUMI)
can regress
out the effects of nUMI
ie library size.)
The scaled z-scored residuals of these models are stored in the
scale.data
slot, and are used for downstream dimension
reduction and clustering tasks.
We perform PCA on the scaled data. By default, HVGs in
pbmc_10x_v3@var.genes
are used as input, but they can be
defined by specifying the argument pc.genes
.
Performing dimensionality reduction on highly variable genes can improve performance. However, with UMI data â particularly after regressing out technical variables, we often see that PCA returns similar (albeit slower) results when run on much larger subsets of genes, including the whole transcriptome.
We run PCA on top 3,000 most variable genes:
pbmc_10x_v3 <- RunPCA(pbmc_10x_v3, features = VariableFeatures(object = pbmc_10x_v3))
## PC_ 1
## Positive: LYZ, FCN1, CLEC7A, CPVL, SERPINA1, SPI1, S100A9, AIF1, NAMPT, CSTA
## CTSS, MAFB, MPEG1, NCF2, FGL2, VCAN, S100A8, TYMP, CST3, LST1
## CYBB, CFD, FCER1G, TGFBI, SLC11A1, GRN, CD14, SLC7A7, PSAP, RNF130
## Negative: IL32, CCL5, TRBC2, TRAC, CST7, CD69, RORA, CTSW, CD247, ITM2A
## TRBC1, C12orf75, IL7R, CD8A, GZMA, NKG7, CD7, LDHB, GZMH, CD6
## CD8B, PRF1, BCL11B, LYAR, FGFBP2, HOPX, LTB, TCF7, KLRD1, ZFP36L2
## PC_ 2
## Positive: NRGN, PF4, SDPR, HIST1H2AC, MAP3K7CL, GNG11, PPBP, GPX1, TUBB1, SPARC
## CLU, PGRMC1, RGS18, FTH1, MARCH2, TREML1, HIST1H3H, ACRBP, AP003068.23, NCOA4
## PRKAR2B, CMTM5, CD9, CA2, TAGLN2, TSC22D1, CTTN, HIST1H2BJ, MTURN, TMSB4X
## Negative: EEF1A1, TMSB10, RPS2, RPS12, RPS18, RPLP1, RPS8, IL32, S100A4, RPLP0
## PFN1, NKG7, CYBA, ARL4C, HSPA8, CST7, HSP90AA1, ZFP36L2, S100A6, ANXA1
## CTSW, CORO1A, LDHA, CD247, GZMA, CALR, YBX1, S100A10, CFL1, ARPC3
## PC_ 3
## Positive: CCL5, TMSB4X, SRGN, NKG7, ACTB, CST7, S100A4, CTSW, ANXA1, GZMH
## FGFBP2, PRF1, GZMA, IL32, GZMB, C12orf75, KLRD1, GNLY, GAPDH, CD247
## MYO1F, NRGN, ID2, ACTG1, CD7, PF4, SDPR, PPBP, HIST1H2AC, CD8A
## Negative: CD79A, HLA-DQA1, MS4A1, BANK1, IGHM, LINC00926, IGHD, TNFRSF13C, HLA-DQB1, CD74
## IGKC, HLA-DRA, BLK, CD83, ADAM28, CD22, HLA-DRB1, CD37, NFKBID, CD79B
## P2RX5, VPREB3, IGLC2, JUND, FCER2, TCOF1, GNG7, COBLL1, RALGPS2, CD19
## PC_ 4
## Positive: IL7R, S100A12, VCAN, S100A8, SLC2A3, CD14, CSF3R, S100A9, MS4A6A, CD93
## IER3, FOS, RCAN3, ZFP36L2, EGR1, RP11-1143G9.4, RGCC, MGST1, LEPROTL1, CLEC4E
## SOCS3, VIM, THBS1, SELL, CXCL8, LTB, SGK1, TNFAIP3, MAL, IRF2BP2
## Negative: FCGR3A, CDKN1C, HES4, CSF1R, CKB, RHOC, ZNF703, MTSS1, TCF7L2, SIGLEC10
## MS4A7, FAM110A, HLA-DPA1, IFITM2, CTSL, PLD4, BATF3, SLC2A6, IFITM3, ABI3
## GZMB, NKG7, HLA-DPB1, FGFBP2, CTD-2006K23.1, CD79B, BID, LRRC25, GZMH, CTSC
## PC_ 5
## Positive: GZMB, FGFBP2, GZMH, PRF1, CST7, NKG7, GNLY, KLRD1, GZMA, CTSW
## CCL5, SPON2, ADGRG1, PRSS23, KLRF1, CCL4, ZEB2, CLIC3, S1PR5, ITGAM
## HOPX, CEP78, CYBA, TRDC, C1orf21, MATK, TTC38, C12orf75, HLA-DRB1, HLA-DPB1
## Negative: IL7R, LTB, LEPROTL1, PAG1, MAL, CDKN1C, RCAN3, LEF1, TCF7, NOSIP
## CAMK4, LDHB, TOB1, TRABD2A, HES4, CSF1R, CKB, JUNB, TMEM123, ZFP36L2
## NELL2, RNASET2, CD28, BCL11B, TNFRSF25, CD27, CTSL, CD40LG, ZNF703, AQP3
The PCA result is stored in pbmc_10x_v3@reduction
. The
output information tells us which genes are positively, or negatvely
correlated with the top 5 principal components.
â Choosing parameters in PCA (genes and pcs) How many genes to choose for PCA and how many PCs to use for downstream analysis is a complex and important issue that is out of the scope of todayâs workshop.
But we highly recommend you to read this document before analyzing your own scRNA-seq dat, where the authors show how to use some visualization methods to guide your choice.
Uniform manifold approximation and project (UMAP) is a tool for visualising higher dimensional data. It takes PCs as input. Here we use the top 30 PCs:
pbmc_10x_v3 <- RunUMAP(pbmc_10x_v3, dims=1:30)
We draw the UMAP plot with the Dimplot()
function by
specifying the argument reduction = 'umap'
. We can colour
points (cells) by using the metadata information
(pbmc_10x_v3@meta.data
) by specifying the argument
group.by
. For example,
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')
Exercise 1: Plot different UMAP with a different number of PCs. Do you observe any changes? What about using different tuning parameters
pbmc_10x_v3 <- RunUMAP(pbmc_10x_v3, dims = 1:2)
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')
pbmc_10x_v3 <- RunUMAP(pbmc_10x_v3, dims = 1:10)
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')
pbmc_10x_v3 <- RunUMAP(pbmc_10x_v3, dims = 1:30)
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')
pbmc_10x_v3 <- RunUMAP(pbmc_10x_v3, dims = 1:50)
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'Method')
By default, Seurat uses the Louvain algorithm.
The Louvain algorithm requires a neighbor graph as input. Therefore,
we first run the FindNeighbors()
function first. Note that
FindNeighbors()
is also based on PCs, here we use all 30
top PCs:
pbmc_10x_v3 <- FindNeighbors(pbmc_10x_v3, dims = 1:30)
We then run the FindClusters()
function for clustering.
The argument Algorithm=1
means we are using the Louvain
algorithm for clustering:
pbmc_10x_v3 <- FindClusters(object = pbmc_10x_v3, resolution = 0.3, algorithm=1)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 3222
## Number of edges: 134293
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9386
## Number of communities: 10
## Elapsed time: 0 seconds
The key tunning parameter here is resolution
, which
determines how many clusters we get. The number of clusters to have
depends on biology. It can be, eg, the number of cell types you expect
to see in the sample.
Other options are available such as Algorithm=4
for the
Leiden algorithm, but you have to install Python and some Python
packages first. See ??FindClusters
for more details.
We can use UMAP to visualize the clustering results. The argument
group.by='seurat_clusters'
is used to colour the cells
according to the clustering results.
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'seurat_clusters')
According to the UMAP representation above, it seems that cluster is different from the other cells and cell clusters. Letâs dig a bit further to characterize this cluster.
First, we look for marker genes that were significantly differential
expressed in Cluster 3 compared with other clusters. We use the
FindAllMarkers()
function to identify marker genes for each
cluster.
This command line takes some time to run:
pbmc_10x_v3.markers <- FindAllMarkers(pbmc_10x_v3, min.pct = .25, logfc.threshold = .25)
We then extract the top 5 marker of cluster 3 with the smallest p-values:
cluster3.markers <- pbmc_10x_v3.markers[which(pbmc_10x_v3.markers$cluster==3),]
cluster3.markers[1:5, ]
We then search these marker genes in this database website https://panglaodb.se/search.html. We need to remove the dash (â-â) before searching.
Based on the result provided by the database, it looks like cluster 3
might be a cluster of B cells. We can check whether this is the case by
checking the cell-type ground truth
information pre-saved
in the pbmc_10x_v3 object
, and plotting this as UMAP:
DimPlot(pbmc_10x_v3, reduction = "umap", label = TRUE, group.by = 'CellType')
Yes, we are correct, cluster 3 is annotated as B cells.
Very often we might have a few genes of interest, say CD14, a marker of CD14+ monocytes. Then we can look for clusters where CD14 is up-regulated. This can be done by:
Idents(object = pbmc_10x_v3) <- "seurat_clusters"
VlnPlot(pbmc_10x_v3, features = 'CD14')
So, cluster 4 is high likely to be annotated as CD14+ monocytes.
â Ground-truth about cell-types Typically, cell-type information is unknown in single cell data, as this is what we are trying to find out! But publicly available datasets may have been previously annotated using the technique higlighted above, or using other reference datasets.
pbmc_10x_v2
Suppose we have annotated pbmc_10x_v3
. We can then use
it as a reference to annotate a new PBMC dataset (eg
pbmc_10x_v2
). This can be done computationally, so that we
do not need to go through the process of clustering, finding
differentially expressed genes and looking up gene names.
Remember: pbmc_10x_v2
and
pbmc_10x_v3
include different cells from different patients
and sequenced using platforms.
pbmc_10x_v2
Exercise 2: use the functions presented in Section 2 to normalise, select highly variable genes, scale, run a PCA and visualize the data using UMAP.
pbmc_10x_v3
as a referenceIn Seurat we can learn cell type annotation results from one scRNA-seq data to provide cell type annotation for another scRNA-seq dataset.
We use the FindTransferAnchors()
function to predict
which cells in two datasets are of the same cell type. Here we use
pbmc_10x_v3
as the reference data set, and
pbmc_10x_v2
is the query data.
We specify that we use the top 30 PCs:
anchors <- FindTransferAnchors(reference = pbmc_10x_v3, query = pbmc_10x_v2,
dims = 1:30)
We then assign the cell-type of the cells in pbmc_10x_v2
using the TransferData()
function:
predictions <- TransferData(anchorset = anchors, refdata = pbmc_10x_v3$CellType,
dims = 1:30)
Seurat will provide a table with the most likely cell type and the
probability of each cell type. We assign the most likely cell type to
the pbmc_10x_v2
object:
pbmc_10x_v2@meta.data$CellType_Prediction <- predictions$predicted.id
We then use UMAP to visualize this annotation:
DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'CellType_Prediction')
Azimuth is a web application that uses an annotated reference dataset to automate the processing, analysis, and interpretation of a new single-cell RNA-seq or ATAC-seq experiment.
The input of Azimuth can be a Seurat object. In order to reduce the size of the uploaded file, we retain only the useful information for cell type annotation with the following command lines:
DefaultAssay(pbmc_10x_v2) <- "RNA"
pbmc_10x_v2_simple <- DietSeurat(object = pbmc_10x_v2, assays = "RNA")
saveRDS(pbmc_10x_v2_simple, 'pbmc_10x_v2.Rds')
An Rds file called pbmc_10x_v2.Rds
is saved in your
working directory. You can check where is your working directory by
using the getwd()
function.
Exercise 3 Open the Azimuth website: https://azimuth.hubmapconsortium.org/.
For cell type annotation (see also the slides here):
Find âReferences for scRNA-seq Queriesâ -> Then find âHuman - PBMCâ -> click âGo to Appâ
Click âBrowseâ -> find âpbmc_10x_v2.Rdsâ at your working directory -> Click âOpenâ
Waiting for the Rds file upload to the website
Click âMap cells to referenceâ
Click âDownload Resultsâ
Find âPredicted cell types and scores (TSV)â
Click âDownloadâ to get the cell type annotation result stored in azimuth_pred.tsv
Copy the tsv file (azimuth_pred.tsv) to your R working directory
The tsv file has the same data structure of Seurat annotation result
(predictions). We read the tsv file, then add the annotation result to
the pbmc_10x_v2
object with the AddMetaData()
function:
azimuth_predictions <- read.delim('azimuth_pred.tsv', row.names = 1)
pbmc_10x_v2 <- AddMetaData(object = pbmc_10x_v2, metadata = azimuth_predictions)
We use UMAP to visualize the cell type annotation result from Azimuth.
DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'predicted.celltype.l2')
Here is the cell type annotation results provided by the data provider:
DimPlot(pbmc_10x_v2, reduction = "umap", label = FALSE, group.by = 'CellType')
Exercise 4 Assuming the ground truth of the cell type annotation is from the provider, which cell type annotation is better from Seurat or Azimuth? Why?
We have combined the two data sets without any processing in the
Seurat object pbmc_combo
. In this section we will focus on
combining these data sets (note: the process would be similar if you are
combining different patients).
pbmc_combo
directly?Exercise 5: use the functions presented in Section 2
to normalise, select highly variable genes, scale, run a PCA and
visualize the data using UMAP on pbmc_combo
. Highlight the
issues in this basic analysis where we are combining two independent
data sets.
In Seurat, we use the FindIntegrationAnchors()
function
to identify cells with similar biological information between two data
sets. The difference between cells in two data sets with similar
biological information is considered as batch effect:
anchor_combo <- FindIntegrationAnchors(object.list = list(pbmc_10x_v2, pbmc_10x_v3), dims = 1:30)
We then use the IntegrateData()
function to remove batch
effects and integrate the two data sets:
pbmc_combo <- IntegrateData(anchorset = anchor_combo, dims = 1:30)
Exercise 6: Below is the code to visualize the batch corrected data using UMAP (we need to scale the data first). What do you observe?
# Scaling
pbmc_combo.all.genes <- rownames(pbmc_combo)
pbmc_combo <- ScaleData(pbmc_combo, features = pbmc_combo.all.genes)
# PCA
pbmc_combo <- RunPCA(pbmc_combo, features = VariableFeatures(object = pbmc_combo))
# UMAP
pbmc_combo <- FindNeighbors(pbmc_combo, dims = 1:30)
pbmc_combo <- RunUMAP(pbmc_combo, dims=1:30)
DimPlot(pbmc_combo, reduction = "umap", label = FALSE, group.by = 'Method')
DimPlot(pbmc_combo, reduction = "umap", label = FALSE, group.by = 'CellType')