In the following tutorial, we examine the recently published Microwell-seq “Mouse Cell Atlas”, composed of hundreds of thousands of cells derived from all major mouse organs. For those that are getting started using Seurat, we recommend first working through our 3k PBMC tutorial, which introduces the basic functionality of the package.

Our goal is to demonstrate a workflow for handling very large datasets in Seurat, emphasizing recent improvements we have made for speed and memory efficiency. We do not perform downstream biological analyses on the resulting clusters, but encourage users to explore this dataset and interpret this exciting resource. All analyses here are performed in memory, but we also now support storage on-disk (using the HDF5-based loom framework). See this vignette for a workflow of the same MCA dataset using loomR.

Setup the Seurat Object

The original data for the MCA is available here. For ease of getting started, we provide a sparse matrix R data file (.rds) file containing the combined expression matrix and the published metadata file here.

library(Seurat)
mca.matrix <- readRDS(file = "../data/MCA_merged_mat.rds")
mca.metadata <- read.csv(file = "../data/MCA_All-batch-removed-assignments.csv", row.names = 1)

We will analyze ~242,000 cells that were assigned a cluster ID in the original study. As a result, we don’t do perform additional QC steps or filtration steps here.

mca <- CreateSeuratObject(counts = mca.matrix, meta.data = mca.metadata, project = "MouseCellAtlas")
# Only keep annotated cells
mca <- subset(mca, cells = names(which(!is.na(mca$ClusterID))))
# Leaves us with 242k cells
mca
## An object of class Seurat 
## 39855 features across 242533 samples within 1 assay 
## Active assay: RNA (39855 features)

Data Preprocessing

We perform standard log-normalization.

mca <- NormalizeData(mca, normalization.method = "LogNormalize", scale.factor = 10000)

FindVariableGenes calculates the variance and mean for each gene in the dataset in the dataset (storing this in object[[assay]]@meta.features). We have observed that for large-cell datasets with unique molecular identifiers, selecting highly variable genes (HVG) simply based on variance mean ratio (VMR) is an efficient and robust strategy. Here, we select the top 1,000 HVG for downstream analysis.

mca <- FindVariableFeatures(mca)

We calculate and regress out mitochondrial expression per cell.

Suggestions for large datasets
  • ScaleData now has the option for multi-core parallel processing, using the future framework.
  • You can perform gene scaling on only the HVF, dramatically improving speed and memory use. Since dimensional reduction is run only on HVF, this will not affect downstream results.
mca[["percent.mt"]] <- PercentageFeatureSet(mca, pattern = "^mt-")
mca <- ScaleData(mca, vars.to.regress = "percent.mt")

Dimensional Reduction (PCA)

mca <- RunPCA(mca, npcs = 100, ndims.print = 1:5, nfeatures.print = 5)
## PC_ 1 
## Positive:  Lyz2, S100a8, S100a6, S100a9, Igkc 
## Negative:  Fabp9, Meig1, Prm1, Ldhc, Prm2 
## PC_ 2 
## Positive:  Lypd8, Reg3b, Lgals2, Reg3g, Lgals4 
## Negative:  Sparc, Col2a1, Col9a2, Col9a1, Col9a3 
## PC_ 3 
## Positive:  Plp1, Mobp, Mag, Ermn, Cldn11 
## Negative:  Sparc, Mgp, Col2a1, Col9a2, Col9a1 
## PC_ 4 
## Positive:  Zpbp2, Hdgfl1, Pabpc6, Tmbim7, Tuba3a 
## Negative:  1700031M16Rik, 1700042G07Rik, 4930570D08Rik, 1700016P04Rik, H1fnt 
## PC_ 5 
## Positive:  Plp1, Mobp, Mag, Ermn, Cldn11 
## Negative:  S100a8, S100a9, Camp, Ngp, Ly6c2
Suggestions for large datasets
  • To select downstream PCs for analysis, JackStraw now also features mutli-core parallelization. However, for data sets of this size, the saturation of explained variance, along with the visualization of PC ‘metagenes’, are likely to be more effective. We select 75 PCs here for downstream analysis.
ElbowPlot(mca, ndims = 100)

DimHeatmap(mca, dims = c(1:3, 70:75), cells = 500, balanced = TRUE)

Graph-based clustering

We now cluster the data in lower-dimensional space. Our approach was heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data [SNN-Cliq, Xu and Su, Bioinformatics, 2015] and CyTOF data [PhenoGraph, Levine et al., Cell, 2015].

Suggestions for large datasets
  • The construction of the shared nearest neighbor graph is now fully implemented in C++, significantly improving performance.
  • To further increase speed, you can employ an approximate nearest neighbor search via the RANN package by increasing the nn.eps parameter. Setting this at 0 (the default) represents an exact neighbor search.
  • By default, we perform 100 random starts for clustering and select the result with highest modularity. You can lower this through the n.start parameter to reduce clustering time.
mca <- FindNeighbors(mca, reduction = "pca", dims = 1:75, nn.eps = 0.5)
mca <- FindClusters(mca, resolution = 3, n.start = 10)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 242533
## Number of edges: 11218977
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9232
## Number of communities: 148
## Elapsed time: 138 seconds

Visualization (t-SNE/UMAP)

We use the same principal components that were input for clustering, for two-dimensional visualization with t-SNE or UMAP.

Suggestions for large datasets
  • We now support the use of a modified Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE, Kluger Lab, preprint here). You need to first compile and install the software.
  • Increasing max_iter can sometimes provide better cluster resolution.
  • We also now support the use of UMAP for dimensional reduction (preprint here). To run you need to install the umap-learn python package.
  • Adjusting min_dist and/or n_neighbors may help with larger datasets
  • For visualization, the AugmentPlot function will convert only the single cells in the t-SNE plot into a PNG file, while the remainder of the image (axes, labels, etc.) remain as vector graphics. We’ve found that this can be very useful when importing into Adobe Illustrator.
mca <- RunTSNE(mca, dims = 1:75, tsne.method = "FIt-SNE", nthreads = 4, max_iter = 2000)
mca <- RunUMAP(mca, dims = 1:75, min.dist = 0.75)
library(ggplot2)
p1 <- DimPlot(mca, reduction = "tsne", pt.size = 0.1) + ggtitle(label = "FIt-SNE")
p2 <- DimPlot(mca, reduction = "umap", pt.size = 0.1) + ggtitle(label = "UMAP")
p1 <- AugmentPlot(plot = p1)
p2 <- AugmentPlot(plot = p2)
CombinePlots(plots = list(p1, p2), legend = "none")

p3 <- FeaturePlot(mca, features = c("S100a9", "Sftpc"), reduction = "tsne", pt.size = 0.1, combine = FALSE)
p3 <- lapply(X = p3, FUN = function(x) AugmentPlot(x + DarkTheme() + NoLegend()))
CombinePlots(plots = p3)

Elapsed time for key steps

These tests were performed on a desktop computer running Ubuntu 16.04.5 LTS with an Intel(R) Core(TM) i7-6800K CPU @ 3.40GHz and 96 GB of RAM.

Function Time
CreateSeuratObject 14sec
NormalizeData 12sec
FindVariableGenes 22sec
ScaleData (no regression) 13sec
ScaleData (w/regression) 3min 7sec
ScaleData (w/regression + parallelization) 1min 21sec
RunPCA 6min 24sec
FindNeighbors + FindClusters (default) 38min 15sec
FindNeighbors(nn.eps = 0.5) + FindClusters (10 starts) 18min 56sec
RunTSNE (Rtsne) 36min 58sec
RunTSNE (FIt-SNE) 9min 49sec