As single cell datasets continue to grow in size, computational requirements are growing exponentially. We’ve noticed that, even when using sparse matrices, Seurat analysis can be challenging for datasets >100,000 cells, primarily due to difficulties in storing the full dataset in memory. Instead of storing data in memory, the HDF5 data format offers efficient, on-disk storage, that is scalable to massive datasets even >1M cells.
The Linnarson lab has developed an HDF5-based data structure, loom, to easily store single cell genomics datasets and metadata. They also released a Python API, called loompy, (full details can be found here) to interact with loom files.
To complement loompy, we are introducing loomR: an R implementation of the loom API. Though still in development, loomR provides a way to access and interact with loom files from R. This tutorial will walk through how to install loomR, interact with loom objects, take advantage of the chunking mechanisms built into loomR. Finally, we introduce initial steps in the Seurat workflow that enable direct compatibility with loom files, towards the goal of making Seurat fully HDF5-compatible in the near future.
loomR is freely available on GitHub and currently in development. A CRAN release is forthcoming as well.
The pre-requisites for loomR include the C++ HDF5 library (installation details can be found here), and the devtools package for GitHub installation.
Full documentation on the loom
class’s fields and methods are found in the help page for the loom
class (see ?loomR::loom
)
# Install devtools from CRAN
install.packages("devtools")
# Use devtools to install hdf5r and loomR from GitHub
devtools::install_github(repo = "hhoeflin/hdf5r")
devtools::install_github(repo = "mojaveazure/loomR", ref = "develop")
# Load loomR
library(loomR)
loom
objectsloomR implements the loom API through the R6-based loom
class. Interacting with R6 objects is slightly different than S4 objects (like Seurat objects). Briefly, R6 fields are accessed using the $
sigil instead of the @
operator (i.e. my.object$field.name
), and R6 methods are called directly from (and modify the object) (i.e. my.object$method()
). We provide examples below, and also direct interested users to this page for a more complete introduction to R6 classes.
loom
objectsUnlike standard R objects that load all data contained within them into memory, loom
objects are merely connections to a file on disk, which enables scaling to massive datasets with low memory consumption. You can connect
to an existing loom file (example here), create your own from an expression matrix using loomR::create
, or create a loom file from an existing Seurat object using Convert
(covered later in the tutorial).
# Connect to the loom file in read/write mode
lfile <- connect(filename = "pbmc.loom", mode = "r+")
lfile
## Class: loom
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Attributes: version, chunks
## Listing:
## name obj_type dataset.dims dataset.type_class
## col_attrs H5I_GROUP <NA> <NA>
## col_graphs H5I_GROUP <NA> <NA>
## layers H5I_GROUP <NA> <NA>
## matrix H5I_DATASET 2700 x 13714 H5T_FLOAT
## row_attrs H5I_GROUP <NA> <NA>
## row_graphs H5I_GROUP <NA> <NA>
A loom
object is a container for six sub-objects: one dataset (matrix
) and five groups (layers
, row_attrs
, col_attrs
, row_graphs
, and col_graphs
); you can read full details at the loom file specification.
matrix
layers
matrix
, must have the same dimensions as matrix
row_attrs
and col_attrs
row_attrs
or \(m\) for
col_attrs
row_graphs
and col_graphs
a
(row index), b
(column index), and w
(value)
loom
objectsInteracting with loom
objects is done in two ways: using the double subset [[
operator or the S3 $
sigil.
# Viewing the `matrix` dataset with the double subset [[ operator You can
# also use the $ sigil, i.e. lfile$matrix
lfile[["matrix"]]
## Class: H5D
## Dataset: /matrix
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_IEEE_F64LE
## Space: Type=Simple Dims=2700 x 13714 Maxdims=Inf x Inf
## Chunk: 32 x 32
To view datasets within groups, we can either use the double subset [[
or $
operators.
# Viewing a dataset in the 'col_attrs' group with the double subset [[
# operator and full UNIX-style path
lfile[["col_attrs/cell_names"]]
## Class: H5D
## Dataset: /col_attrs/cell_names
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_STRING {
## STRSIZE H5T_VARIABLE;
## STRPAD H5T_STR_NULLTERM;
## CSET H5T_CSET_ASCII;
## CTYPE H5T_C_S1;
## }
## Space: Type=Simple Dims=2700 Maxdims=Inf
## Chunk: 1024
# Viewing a dataset in the 'row_attrs' group with S3 $ chaining
lfile$row.attrs$gene_names
## Class: H5D
## Dataset: /row_attrs/gene_names
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_STRING {
## STRSIZE H5T_VARIABLE;
## STRPAD H5T_STR_NULLTERM;
## CSET H5T_CSET_ASCII;
## CTYPE H5T_C_S1;
## }
## Space: Type=Simple Dims=13714 Maxdims=Inf
## Chunk: 1024
Note that the above commands simply return a description of the data stored on disk. To access the data itself, we have to use a single subset [
operator on the description.
Accessing data requires integer indices, or empty []
to load an entire dataset ([, ]
for an entire two-dimensional dataset). The data is never loaded into memory until the single subset [
access operator is called. This prevents incredibly large datasets from being loaded into memory and allows accessing only the dataset you need
# Access the upper left corner of the data matrix
lfile[["matrix"]][1:5, 1:5]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [3,] 0 0 0 0 0
## [4,] 0 0 0 0 0
## [5,] 0 0 0 0 0
# Access the full data matrix (here, using the $ instead of the [[ operator
# to access the matrix)
full.matrix <- lfile$matrix[, ]
dim(x = full.matrix)
## [1] 2700 13714
# Access all gene names
gene.names <- lfile[["row_attrs/gene_names"]][]
head(x = gene.names)
## [1] "AL627309.1" "AP006222.2" "RP11-206L10.2" "RP11-206L10.9"
## [5] "LINC00115" "NOC2L"
get.attribute.df
As metadata is stored in several datasets within a loom file, each loom
object has a get.attribute.df
method: a method for collecting various metadata datasets and organizing them into a data frame for ease of use. This method takes a direction to look in (either 1
or 2
for row (gene) or column (cell) metadata, respectively) and a list of metadata dataset names. See below in the “Chunk-based iteration” section for details about MARGINs in loomR.
# Pull three bits of metadata from the column attributes
attrs <- c("nUMI", "nGene", "orig.ident")
attr.df <- lfile$get.attribute.df(MARGIN = 2, attribute.names = attrs)
head(x = attr.df)
## nUMI nGene orig.ident
## AAACATACAACCAC 2419 779 SeuratProject
## AAACATTGAGCTAC 4903 1352 SeuratProject
## AAACATTGATCAGC 3147 1129 SeuratProject
## AAACCGTGCTTCCG 2639 960 SeuratProject
## AAACCGTGTATGCG 980 521 SeuratProject
## AAACGCACTGGTAC 2163 781 SeuratProject
row.attrs
slot still refers to gene metadata, and the col.attrs
still refers to the cell metadata
# Print the number of genes
lfile[["row_attrs/gene_names"]]$dims
## [1] 13714
# Is the number of genes the same as the second dimension (typically
# columns) for the matrix?
lfile[["row_attrs/gene_names"]]$dims == lfile[["matrix"]]$dims[2]
## [1] TRUE
# For the sake of consistency within the single-cell community, we've
# reversed the dimensions for the `shape` field. As such, the number of
# genes is stored in `lfile$shape[1]`; the number of cells is stored in the
# second field
lfile[["row_attrs/gene_names"]]$dims == lfile$shape[1]
## [1] TRUE
To access data from a subset of genes (or a subset of cells), see the examples below:
# Pull gene expression data for all genes, for the first 5 cells Note that
# we're using the row position for cells
data.subset <- lfile[["matrix"]][1:5, ]
dim(x = data.subset)
## [1] 5 13714
# You can transpose this matrix if you wish to restore the standard
# orientation
data.subset <- t(x = data.subset)
dim(x = data.subset)
## [1] 13714 5
# Pull gene expression data for the gene MS4A1 Note that we're using the
# column position for genes
data.gene <- lfile[["matrix"]][, lfile$row.attrs$gene_names[] == "MS4A1"]
head(x = data.gene)
## [1] 0 6 0 0 0 0
loom
objectsWe can layers, gene-level metadata (row_attrs
), and cell-level metadata (col_attrs
) to a loom object using loomR. You can read full details at the loom file specification.
Methods for adding layers and matrices are provided by the loom
class with add.layer
, add.row.attribute
, and add.col.attribute
. All of the adding methods take a named list of either matrices or vectors. For example, to ENSEMBL IDs to gene-level metadata would be done as follows:
# Generate random ENSEMBL IDs for demonstration purposes
ensembl.ids <- paste0("ENSG0000", 1:length(x = lfile$row.attrs$gene_names[]))
# Use add.row.attribute to add the IDs Note that if you want to overwrite an
# existing value, set overwrite = TRUE
lfile$add.row.attribute(list(ensembl.id = ensembl.ids), overwrite = TRUE)
lfile[["row_attrs"]]
## Class: H5Group
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Group: /row_attrs
## Listing:
## name obj_type dataset.dims dataset.type_class
## ensembl.id H5I_DATASET 13714 H5T_STRING
## gene_names H5I_DATASET 13714 H5T_STRING
# Find the ENSEMBL ID for TCL1A
lfile[["row_attrs/ensembl.id"]][lfile$row.attrs$gene_names[] == "TCL1A"]
## [1] "ENSG00009584"
loomR provides built-in chunking methods for loom
objects. For massive datasets, this allows you to load the data in small pieces, to optimize memory usage. loomR implements the map
and apply
methods (which perform calculations as the batches are read in). Under the hood, these use the batch.scan
and batch.next
methods, which do the low-level data chunking. Each method works on any dataset in the loom file, including the one-dimensional arrays in row_attrs
and col_attrs
.
The map
method performs a calculation across all genes or cells without loading all data into memory at once, though the final results are kept in memory. For example, we can calculate the total number of UMI per cell.
# Map rowSums to `matrix`, using 500 cells at a time, returning a vector
nUMI_map <- lfile$map(FUN = rowSums, MARGIN = 2, chunk.size = 500, dataset.use = "matrix",
display.progress = FALSE)
# How long is nUMI_map? It should be equal to the number of cells in
# `matrix`
length(x = nUMI_map) == lfile$matrix$dims[1]
## [1] TRUE
loom
object have a MARGIN
argument; this argument tells the loom file on which dimension to iterate over, add, or fetch data. To keep consistent with other R tools for single-cell RNAseq analysis, a MARGIN
of 1 represents the rows, or genes, while a MARGIN
of 2 represents the columns, or cells. This also applies to the shape
field of a loom
object: index 1 represents the number of genes in a loom file while index 2 represents the number of cells.
The apply
method is similar to map
, except it stores the results in the loom file rather than returning them to the user. Therefore, it uses the least amount of memory as memory is only being used for the current iteration. To save a result to disk, a name must be provided. This name must be a full UNIX-style path (group/dataset), and the result will take the shape of datasets stored in said group.
# Apply rowSums to `matrix`, using 500 cells at a time, storing in
# `col_attrs/umi_apply`
lfile$apply(name = "col_attrs/umi_apply", FUN = rowSums, MARGIN = 2, chunk.size = 500,
dataset.use = "matrix", display.progress = FALSE, overwrite = TRUE)
lfile$col.attrs$umi_apply
## Class: H5D
## Dataset: /col_attrs/umi_apply
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_IEEE_F64LE
## Space: Type=Simple Dims=2700 Maxdims=Inf
## Chunk: 1024
# Ensure that all values are the same as doing a non-chunked calculation
all(lfile$col.attrs$umi_apply[] == rowSums(x = lfile$matrix[, ]))
## [1] TRUE
map
and apply
support working with certain values instead of all values in a given dataset. For example, one could use apply
to scale the expression values of certain genes instead of all genes. This is done by passing a vector of integers to the index.use
argument of either map
or apply
where each value in this vector is the numeric index of the row or column to use in the analysis. For map
, the result will be either a vector that’s as long as index.use
or a matrix where one dimension is the length of index.use
. For apply
, as all datasets within the loom file must be a certain size, skipped indices will be filled with NAs.
loom
objectsBecause loom
objects are connections to a file, they must be closed to ensure all data is written out to disk properly. This is done using the close_all
method.
lfile$close_all()
We are in the process of adding loom file support to Seurat, a single-cell RNAseq analysis toolkit developed by the Satija Lab, using loomR as the API. Currently, loom support in Seurat is available only in the ‘loom’ branch on GitHub. We can use devtools to install this branch directly from GitHub.
devtools::install_github(repo = "satijalab/seurat", ref = "loom")
library(Seurat)
loom
object from a Seurat object (converting between Seurat and loom)Seurat offers a conversion function to go from Seurat objects to loom files. The reverse conversion is currently in progress.
# Load the pbmc_small dataset included in Seurat
data("pbmc_small")
pbmc_small
## An object of class seurat in project SeuratProject
## 230 genes across 80 samples.
# Convert from Seurat to loom Convert takes and object in 'from', a name of
# a class in 'to', and, for conversions to loom, a filename
pfile <- Convert(from = pbmc_small, to = "loom", filename = "pbmc_small.loom",
display.progress = FALSE)
pfile
## Class: loom
## Filename: /home/paul/Documents/Satija/pbmc_small.loom
## Access type: H5F_ACC_RDWR
## Attributes: version, chunks
## Listing:
## name obj_type dataset.dims dataset.type_class
## col_attrs H5I_GROUP <NA> <NA>
## col_graphs H5I_GROUP <NA> <NA>
## layers H5I_GROUP <NA> <NA>
## matrix H5I_DATASET 80 x 230 H5T_FLOAT
## row_attrs H5I_GROUP <NA> <NA>
## row_graphs H5I_GROUP <NA> <NA>
loom
Support in Seurat
loom
objects within Seurat is constanly being improved and expanded. Not only do analysis functions work on loom
objects, but several visualization and Seurat-style accessor functions have now been implemented for loom
objects. For greater detail about using Seurat to analyze data stored in a loom file, please see this vignette walking through analysis of the Microwell-seq “Mouse Cell Atlas”.
On the loom branch, we’ve modified a subset of Seurat functions to work seamlessly with standard Seurat syntax. This is the S3/S4 style of R programming rather than the R6 found in loomR. The only difference is that since loom
objects are R6 objects, results do not need to be explicitly at the end of a function call, as the loom
objects are modified directly (see below).
# Normalize data, and find variable genes using the typical Seurat workflow
pbmc_small <- NormalizeData(object = pbmc_small, display.progress = FALSE)
pbmc_small <- FindVariableGenes(object = pbmc_small, display.progress = FALSE,
do.plot = FALSE)
head(x = pbmc_small@hvg.info)
## gene.mean gene.dispersion gene.dispersion.scaled
## PCMT1 3.942220 7.751848 2.808417
## PPBP 5.555949 7.652876 1.216898
## LYAR 4.231004 7.577377 1.528749
## VDAC3 4.128322 7.383980 1.296982
## KHDRBS1 3.562833 7.367928 2.476809
## IGLL5 3.758330 7.319567 2.018088
# Run the same workflow using the loom object
NormalizeData(object = pfile, overwrite = TRUE, display.progress = FALSE)
FindVariableGenes(object = pfile, overwrite = TRUE, display.progress = FALSE)
# Normalized data goes into the 'norm_data' layer, variable gene information
# goes into 'row_attrs' Are the results equal?
par(mfrow = c(1, 2))
plot(x = t(x = pfile$layers$norm_data[, ]), y = pbmc_small@data, main = "Normalized Data",
xlab = "loom", ylab = "Seurat")
plot(x = pfile$row.attrs$gene_means[], y = pbmc_small@hvg.info[pfile$row.attrs$gene_names[],
"gene.mean"], main = "Gene Means", xlab = "loom", ylab = "Seurat")
For more information about using loom
objects with Seurat, please see this vignette for more details. As always, remember to close loom
objects when finished using them.
pfile$close_all()