In the following tutorial, we examine the recently published Microwell-seq “Mouse Cell Atlas”, composed of hundreds of thousands of cells derived from all major mouse organs. For those that are getting started using Seurat, we recommend first working through our 3k PBMC tutorial, which introduces the basic functionality of the package.
Our goal is to demonstrate a workflow for handling very large datasets in Seurat, emphasizing recent improvements we have made for speed and memory efficiency. We do not perform downstream biological analyses on the resulting clusters, but encourage users to explore this dataset and interpret this exciting resource. All analyses here are performed in memory, but we also now support storage on-disk (using the HDF5-based loom framework). See this vignette for a workflow of the same MCA dataset using loomR.
The original data for the MCA is available here. For ease of getting started, we provide a sparse matrix R data file (.rds) file containing the combined expression matrix and the published metadata file here.
library(Matrix)
library(Seurat)
mca.matrix <- readRDS(file = "~/Downloads/MCA/MCA_merged_mat.rds")
mca.metadata <- read.csv("~/Downloads/MCA/MCA_All-batch-removed-assignments.csv",
row.names = 1)
We will analyze ~242,000 cells that were assigned a cluster ID in the original study. As a result, we don’t do perform additional QC steps or filtration steps here.
mca <- CreateSeuratObject(raw.data = mca.matrix, meta.data = mca.metadata, project = "MouseCellAtlas")
# Only keep annotated cells
mca <- SubsetData(mca, cells.use = rownames(mca@meta.data[!is.na(mca@meta.data$ClusterID),
]), do.clean = TRUE)
# Leaves us with 242k cells
mca
## An object of class seurat in project MouseCellAtlas
## 39855 genes across 242533 samples.
We perform standard log-normalization.
mca <- NormalizeData(object = mca, normalization.method = "LogNormalize", scale.factor = 10000)
FindVariableGenes
calculates the variance and mean for each gene in the dataset in the dataset (storing this in object@hvg.info
), and sorts genes by their variance/mean ratio (VMR). We have observed that for large-cell datasets with unique molecular identifiers, selecting highly variable genes (HVG) simply based on VMR is an efficient and robust strategy. Here, we select the top 1,000 HVG for downstream analysis.
mca <- FindVariableGenes(object = mca, mean.function = ExpMean, dispersion.function = LogVMR,
do.plot = FALSE)
hv.genes <- head(rownames(mca@hvg.info), 1000)
We calculate and regress out mitochondrial expression per cell.
num.cores
argument. Thanks to aymanm for this improvement.
mito.genes <- grep(pattern = "^mt-", x = rownames(x = mca@data), value = TRUE)
percent.mito <- Matrix::colSums(mca@raw.data[mito.genes, ])/Matrix::colSums(mca@raw.data)
mca <- AddMetaData(object = mca, metadata = percent.mito, col.name = "percent.mito")
mca <- ScaleData(object = mca, genes.use = hv.genes, display.progress = FALSE,
vars.to.regress = "percent.mito", do.par = TRUE, num.cores = 4)
mca <- RunPCA(object = mca, pc.genes = hv.genes, pcs.compute = 100, do.print = TRUE,
pcs.print = 1:5, genes.print = 5)
## [1] "PC1"
## [1] "Prm1" "Prm2" "Meig1" "Fabp9" "Ldhc"
## [1] ""
## [1] "Tmsb4x" "B2m" "Fth1" "Ftl1" "Rpl18a"
## [1] ""
## [1] ""
## [1] "PC2"
## [1] "Sparc" "Col2a1" "Col9a2" "Col9a1" "Col9a3"
## [1] ""
## [1] "Defa24" "Reg3b" "Defa30" "Igkc" "Reg3g"
## [1] ""
## [1] ""
## [1] "PC3"
## [1] "Camp" "Ngp" "S100a9" "S100a8" "Chil3"
## [1] ""
## [1] "Defa24" "Reg3b" "Defa30" "Reg3g" "Defa17"
## [1] ""
## [1] ""
## [1] "PC4"
## [1] "Csn1s2b" "Wap" "Trf" "Lalba" "Glycam1"
## [1] ""
## [1] "S100a8" "S100a9" "Lyz2" "Tyrobp" "Srgn"
## [1] ""
## [1] ""
## [1] "PC5"
## [1] "Wap" "Csn1s2b" "Lalba" "Glycam1" "Csn1s2a"
## [1] ""
## [1] "Tmsb10" "Rplp1" "Rpl13" "Rps27" "Rps15a"
## [1] ""
## [1] ""
JackStraw
now also features mutli-core parallelization. However, for data sets of this size, the saturation of explained variance, along with the visualization of PC ‘metagenes’, are likely to be more effective. We select 75 PCs here for downstream analysis.
PCElbowPlot(object = mca, num.pc = 100)
PCHeatmap(mca, pc.use = c(1:3, 70:75), cells.use = 500, do.balanced = TRUE)
We now cluster the data in lower-dimensional space. Our approach was heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data [SNN-Cliq, Xu and Su, Bioinformatics, 2015] and CyTOF data [PhenoGraph, Levine et al., Cell, 2015].
nn.eps
parameter. Setting this at 0 (the default) represents an exact neighbor search.
n.start
parameter to reduce clustering time.
mca <- FindClusters(object = mca, reduction.type = "pca", dims.use = 1:75, resolution = 3,
save.SNN = TRUE, n.start = 10, nn.eps = 0.5, print.output = FALSE)
We use the same principal components that were input for clustering, for two-dimensional visualization with t-SNE or UMAP.
max_iter
can sometimes provide better cluster resolution.
umap-learn
python package.
min_dist
and/or n_neighbors
may help with larger datasets
vector.friendly = TRUE
will convert only the single cells in the t-SNE plot into a PNG file, while the remainder of the image (axes, labels, etc.) remain as vector graphics. We’ve found that this can be very useful when importing into Adobe Illustrator.
mca <- RunTSNE(object = mca, reduction.use = "pca", dims.use = 1:75, tsne.method = "FIt-SNE",
nthreads = 4, reduction.name = "FItSNE", reduction.key = "FItSNE_", fast_tsne_path = "/home/andrew/Software/FIt-SNE/bin/fast_tsne",
max_iter = 2000)
mca <- RunUMAP(object = mca, reduction.use = "pca", dims.use = 1:75, min_dist = 0.75)
library(cowplot)
p1 <- DimPlot(object = mca, reduction.use = "FItSNE", no.legend = TRUE, do.return = TRUE,
vector.friendly = TRUE, pt.size = 0.1) + ggtitle("FIt-SNE") + theme(plot.title = element_text(hjust = 0.5))
p2 <- DimPlot(object = mca, reduction.use = "umap", no.legend = TRUE, do.return = TRUE,
vector.friendly = TRUE, pt.size = 0.1) + ggtitle("UMAP") + theme(plot.title = element_text(hjust = 0.5))
plot_grid(p1, p2)
FeaturePlot(mca, c("S100a9", "Sftpc"), reduction.use = "FItSNE", dark.theme = TRUE,
pt.size = 0.1, vector.friendly = TRUE)
These tests were performed on a desktop computer running Ubuntu 16.04.4 LTS with an Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz and 32GB of RAM.
Function | Time |
---|---|
CreateSeuratObject | 14sec |
NormalizeData | 12sec |
FindVariableGenes | 19sec |
ScaleData (no regression) | 4sec |
ScaleData (w/regression) | 1min 20sec |
ScaleData (w/regression + parallelization) | 49sec |
RunPCA | 5min 3sec |
FindClusters (default) | 52min 11sec |
FindClusters (10 starts + nn.eps = 0.5) | 18min 35sec |
RunTSNE (Rtsne) | 1hr 7min |
RunTSNE (FIt-SNE) | 7min 31sec |