In the following tutorial, we examine the recently published Microwell-seq “Mouse Cell Atlas” (MCA), composed of hundreds of thousands of cells derived from all major mouse organs. This tutorial mirrors our recently released vignette, but exemplifies new functionality that is currently in beta release.
Our goal is to demonstrate a workflow for handling very large datasets in Seurat that leverages on-disk storage instead of loading data directly into memory. Preprocessing and downstream analysis steps (e.g. normalization, variable gene selection, PCA) are computed using block processing, where results are computed sequentially on small chunks of the data. We use the HDF5-based loom storage format, originally introduced with a Python API by Sten Linnarsson’s lab, alongside its R API, loomR.
While we obtain essentially identical results with both workflows, the on-disk approach enables us to analyze substantially larger datasets than can fit in memory. However, the need to perform read/write operations on disk does reduce speed, so most users, particularly those with datasets of <1M cells, will not require these features.
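To make the block-processing idea concrete, here is a minimal sketch (using hdf5r directly, not the Seurat workflow itself) that computes per-gene sums one chunk of cells at a time. It assumes the mca.loom file created later in this tutorial, where the matrix is stored on disk as cells x genes.
library(hdf5r)
lfile <- H5File$new(filename = "mca.loom", mode = "r")
mat <- lfile[["matrix"]]
n.cells <- mat$dims[1]
chunk.size <- 1000
# Accumulate per-gene sums one block of cells at a time, so that only
# chunk.size rows of the matrix are ever held in memory
gene.sums <- numeric(length = mat$dims[2])
for (start in seq(from = 1, to = n.cells, by = chunk.size)) {
    end <- min(start + chunk.size - 1, n.cells)
    gene.sums <- gene.sums + colSums(x = mat[start:end, ])
}
lfile$close_all()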
Lastly, we acknowledge the exceptional work of Aaron Lun (beachmat package) and the Bioconductor team (rhdf5 package) in developing powerful interfaces between R and HDF5. While we use infrastructure from the hdf5r package in this beta, we look forward to integrating support for these methods in a future release.
To follow along, install the beta releases of hdf5r, loomR, and the loom branch of Seurat:
devtools::install_github(repo = 'hhoeflin/hdf5r')
devtools::install_github(repo = 'mojaveazure/loomR', ref = 'develop')
devtools::install_github(repo = 'satijalab/seurat', ref = 'loom')
The original data for the MCA is available here. For ease of getting started, we provide a sparse-matrix R data file (.rds) containing the combined expression matrix, along with the published metadata file, here.
library(loomR)
library(Seurat)
mca.matrix <- readRDS(file = "~/Downloads/MCA/MCA_merged_mat.rds")
mca.metadata <- read.csv("~/Downloads/MCA/MCA_All-batch-removed-assignments.csv",
row.names = 1)
We will analyze ~242,000 cells that were assigned a cluster ID in the original study. As a result, we don’t perform additional QC or filtration steps here.
# Only keep annotated cells
cells.use <- which(x = colnames(x = mca.matrix) %in% rownames(x = mca.metadata))
mca.matrix <- mca.matrix[, cells.use]
mca.metadata <- mca.metadata[colnames(x = mca.matrix), ]
# Create the loom file
mca <- create(filename = "mca.loom", data = mca.matrix, display.progress = FALSE,
calc.numi = TRUE)
# Leaves us with 242k cells
mca
## Class: loom
## Filename: /home/paul/Software/loomR/mca-subset.loom
## Access type: H5F_ACC_RDONLY
## Attributes: chunks, version
## Listing:
## name obj_type dataset.dims dataset.type_class
## col_attrs H5I_GROUP <NA> <NA>
## col_graphs H5I_GROUP <NA> <NA>
## layers H5I_GROUP <NA> <NA>
## matrix H5I_DATASET 242533 x 39855 H5T_FLOAT
## row_attrs H5I_GROUP <NA> <NA>
## row_graphs H5I_GROUP <NA> <NA>
We will also add tissue data to the loom file.
# Pull the tissue information for the cells we're analysing
tissues <- as.character(x = mca.metadata[, "Tissue"])
# Works similarly to AddMetaData for Seurat objects
mca$add.col.attribute(attribute = list(tissue = tissues))
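As a quick check, the new attribute can be read back directly from the file (a small sketch; any column attribute can be inspected the same way):
# Read the first few tissue labels back from the loom file
head(x = mca[["col_attrs/tissue"]][], n = 5)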
loom objects in Seurat
We have implemented the same function names and syntax for interacting with loom objects. However, in this beta release, please note that not all Seurat functions have methods for loom objects. To get a general idea of which functions work on loom objects, use the methods function to list all functions where a method for loom objects has been defined.
methods(class = "loom")
## [1] BlockCov BuildSNN coerce
## [4] Convert DimPlot DownsampleSeurat
## [7] FetchData FindClusters FindVariableGenes
## [10] GetAllCalcParam GetAssayData GetCalcParam
## [13] GetDimReduction GetVariableGenes initialize
## [16] NormalizeData ProjectSeurat RunPCA
## [19] RunTSNE ScaleData SetCalcParams
## [22] SetDimReduction show slotsFromS3
## [25] subset SubsetSeurat
## see '?methods' for accessing help and source code
In addition to these methods, the following functions simply work on both Seurat and loom objects, as they use the methods listed above: GetCellEmbeddings, GetGeneLoadings, DimTopGenes, DimTopCells, PrintDim, FeaturePlot, DimHeatmap, and DimElbowPlot.
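If you are unsure whether a particular function is supported, you can check the methods list programmatically (a small sketch; methods() returns a character-based object that we coerce here):
# TRUE if a loom method has been defined for NormalizeData
"NormalizeData" %in% as.character(x = methods(class = "loom"))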
We perform standard log-normalization.
NormalizeData(object = mca, chunk.size = 1000, scale.factor = 10000, display.progress = FALSE)
Identify the top 1,000 genes sorted by variance/mean ratio.
FindVariableGenes(object = mca)
hv.genes <- head(x = GetVariableGenes(object = mca)$index, n = 1000)
We calculate and regress out mitochondrial expression per cell.
# Pull the indices of the mitochondrial genes
mito.genes <- grep(pattern = "^mt-", x = mca[["row_attrs/gene_names"]][], value = FALSE)
# Calculate percent.mito and store it directly in the loom object. This is
# similar to creating a vector with the percentage of expression from
# mitochondrial genes, then using AddMetaData to add it to an object
mca$apply(name = "col_attrs/percent_mito", FUN = function(mat) {
return(rowSums(x = mat[, mito.genes])/rowSums(x = mat))
}, MARGIN = 2, dataset.use = "matrix")
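As with the tissue attribute above, we can sanity-check the stored values directly (a quick sketch):
# Summarize the newly stored per-cell mitochondrial fractions
summary(object = mca[["col_attrs/percent_mito"]][])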
ScaleData(object = mca, genes.use = hv.genes, chunk.size = 20, display.progress = FALSE,
vars.to.regress = "percent_mito")
RunPCA(object = mca, pc.genes = hv.genes, online.pca = FALSE, pcs.compute = 100,
do.print = TRUE, pcs.print = 1:5, genes.print = 5, display.progress = FALSE)
JackStraw now also features multi-core parallelization. However, for datasets of this size, the saturation of explained variance, along with the visualization of PC ‘metagenes’, are likely to be more effective. We select 75 PCs here for downstream analysis.
PCElbowPlot(object = mca, num.pc = 100)
PCHeatmap(mca, pc.use = c(1:3, 70:75), cells.use = 500, do.balanced = TRUE)
We now cluster the data in lower-dimensional space. Our approach was heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data [SNN-Cliq, Xu and Su, Bioinformatics, 2015] and CyTOF data [PhenoGraph, Levine et al., Cell, 2015].
To accelerate clustering on a dataset of this size, we use an approximate nearest neighbor search, controlled by the nn.eps parameter. Setting this at 0 (the default) represents an exact neighbor search. We also lower the n.start parameter to reduce clustering time.
FindClusters(object = mca, reduction.type = "pca", dims.use = 1:75, resolution = 3,
save.SNN = TRUE, n.start = 10, nn.eps = 0.5, print.output = FALSE)
We use the same principal components that were input to clustering for two-dimensional visualization with t-SNE. Setting vector.friendly = TRUE will rasterize only the single cells in the t-SNE plot into a PNG layer, while the remainder of the image (axes, labels, etc.) remains as vector graphics; we’ve found this very useful when importing plots into Adobe Illustrator. Increasing max_iter can sometimes provide better cluster resolution.
RunTSNE(object = mca, reduction.use = "pca", dims.use = 1:75, tsne.method = "FIt-SNE",
max_iter = 2000, nthreads = 4, overwrite = TRUE)
p1 <- DimPlot(object = mca, reduction.use = "tsne", no.legend = TRUE, do.return = TRUE,
pt.size = 0.1, ident.use = "col_attrs/res.3", vector.friendly = TRUE) +
ggtitle("Cluster ID") + theme(plot.title = element_text(hjust = 0.5))
p2 <- DimPlot(object = mca, reduction.use = "tsne", no.legend = TRUE, do.return = TRUE,
pt.size = 0.1, ident.use = "col_attrs/tissue", vector.friendly = TRUE) +
ggtitle("Tissue") + theme(plot.title = element_text(hjust = 0.5))
plot_grid(p1, p2)
FeaturePlot(mca, c("S100a9", "Sftpc"), reduction.use = "tsne", dark.theme = TRUE,
pt.size = 0.1, vector.friendly = TRUE)
mca$close_all()
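If you want to revisit the analysis later, the loom file can be reopened with loomR's connect function (a sketch; mode = "r+" opens the file for both reading and writing):
# Reconnect to the loom file in read/write mode
mca <- connect(filename = "mca.loom", mode = "r+")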
These tests were performed on a laptop computer running Ubuntu 16.04.4 LTS with an Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz and 16GB of RAM.
| Function | Time |
|---|---|
| create | 1 hr 21 min |
| NormalizeData | 1 hr 27 min |
| FindVariableGenes | 21 min 24 sec |
| ScaleData (with regression) | 40 min 10 sec |
| RunPCA | 26 min 24 sec |
| FindClusters (10 starts, nn.eps = 0.5) | 36 min 48 sec |
| RunTSNE (FIt-SNE) | 17 min 45 sec |