Introduction to loom

As single cell datasets continue to grow in size, computational requirements are growing exponentially. We’ve noticed that, even when using sparse matrices, Seurat analysis can be challenging for datasets >100,000 cells, primarily due to difficulties in storing the full dataset in memory. Instead of storing data in memory, the HDF5 data format offers efficient, on-disk storage, that is scalable to massive datasets even >1M cells.

The Linnarson lab has developed an HDF5-based data structure, loom, to easily store single cell genomics datasets and metadata. They also released a Python API, called loompy, (full details can be found here) to interact with loom files.

To complement loompy, we are introducing loomR: an R implementation of the loom API. Though still in development, loomR provides a way to access and interact with loom files from R. This tutorial will walk through how to install loomR, interact with loom objects, take advantage of the chunking mechanisms built into loomR. Finally, we introduce initial steps in the Seurat workflow that enable direct compatibility with loom files, towards the goal of making Seurat fully HDF5-compatible in the near future.

loomR

loomR is freely available on GitHub and currently in development. A CRAN release is forthcoming as well.

Installation

The pre-requisites for loomR include the C++ HDF5 library (installation details can be found here), and the devtools package for GitHub installation.

Full documentation on the loom class’s fields and methods are found in the help page for the loom class (see ?loomR::loom)

# Install devtools from CRAN
install.packages("devtools")
# Use devtools to install hdf5r and loomR from GitHub
devtools::install_github(repo = "hhoeflin/hdf5r")
devtools::install_github(repo = "mojaveazure/loomR", ref = "develop")
# Load loomR
library(loomR)

Introduction to loom objects

loomR implements the loom API through the R6-based loom class. Interacting with R6 objects is slightly different than S4 objects (like Seurat objects). Briefly, R6 fields are accessed using the $ sigil instead of the @ operator (i.e. my.object$field.name), and R6 methods are called directly from (and modify the object) (i.e. my.object$method()). We provide examples below, and also direct interested users to this page for a more complete introduction to R6 classes.

Connecting to loom objects

Unlike standard R objects that load all data contained within them into memory, loom objects are merely connections to a file on disk, which enables scaling to massive datasets with low memory consumption. You can connect to an existing loom file (example here), create your own from an expression matrix using loomR::create, or create a loom file from an existing Seurat object using Convert (covered later in the tutorial).

# Connect to the loom file in read/write mode
lfile <- connect(filename = "pbmc.loom", mode = "r+")
lfile
## Class: loom
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Attributes: version, chunks
## Listing:
##        name    obj_type dataset.dims dataset.type_class
##   col_attrs   H5I_GROUP         <NA>               <NA>
##  col_graphs   H5I_GROUP         <NA>               <NA>
##      layers   H5I_GROUP         <NA>               <NA>
##      matrix H5I_DATASET 2700 x 13714          H5T_FLOAT
##   row_attrs   H5I_GROUP         <NA>               <NA>
##  row_graphs   H5I_GROUP         <NA>               <NA>

A loom object is a container for six sub-objects: one dataset (matrix) and five groups (layers, row_attrs, col_attrs, row_graphs, and col_graphs); you can read full details at the loom file specification.

Loom file structure
The loom file is simply an HDF5 file with a strict structure imposed on it. This structure helps keep consistency in an otherwise unordered binary file and provides security in the knowledge of which data is which. Below is a summary of the file structure and rules imposed on each dataset; for more details, please read the loom file specification.
matrix
The root of a loom file’s structure, it has two dimensions of \(n\) genes and \(m\) cells
layers
Alternative representations of the data in matrix, must have the same dimensions as matrix
row_attrs and col_attrs
Metadata for rows (genes) and columns (cells), respectively; each dataset in these groups must be one- or two-dimensional and the first dimension must be \(n\) for row_attrs or \(m\) forcol_attrs
row_graphs and col_graphs
Sparse cluster graphs in coordinate form; each graph is a group with three equal-length datasets: a (row index), b (column index), and w (value)

Interacting with loom objects

Interacting with loom objects is done in two ways: using the double subset [[ operator or the S3 $ sigil.

# Viewing the `matrix` dataset with the double subset [[ operator You can
# also use the $ sigil, i.e. lfile$matrix
lfile[["matrix"]]
## Class: H5D
## Dataset: /matrix
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_IEEE_F64LE
## Space: Type=Simple     Dims=2700 x 13714     Maxdims=Inf x Inf
## Chunk: 32 x 32

To view datasets within groups, we can either use the double subset [[ or $ operators.

# Viewing a dataset in the 'col_attrs' group with the double subset [[
# operator and full UNIX-style path
lfile[["col_attrs/cell_names"]]
## Class: H5D
## Dataset: /col_attrs/cell_names
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_STRING {
##       STRSIZE H5T_VARIABLE;
##       STRPAD H5T_STR_NULLTERM;
##       CSET H5T_CSET_ASCII;
##       CTYPE H5T_C_S1;
##    }
## Space: Type=Simple     Dims=2700     Maxdims=Inf
## Chunk: 1024
# Viewing a dataset in the 'row_attrs' group with S3 $ chaining
lfile$row.attrs$gene_names
## Class: H5D
## Dataset: /row_attrs/gene_names
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_STRING {
##       STRSIZE H5T_VARIABLE;
##       STRPAD H5T_STR_NULLTERM;
##       CSET H5T_CSET_ASCII;
##       CTYPE H5T_C_S1;
##    }
## Space: Type=Simple     Dims=13714     Maxdims=Inf
## Chunk: 1024

Note that the above commands simply return a description of the data stored on disk. To access the data itself, we have to use a single subset [ operator on the description.

Accessing data requires integer indices, or empty [] to load an entire dataset ([, ] for an entire two-dimensional dataset). The data is never loaded into memory until the single subset [ access operator is called. This prevents incredibly large datasets from being loaded into memory and allows accessing only the dataset you need

# Access the upper left corner of the data matrix
lfile[["matrix"]][1:5, 1:5]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0
# Access the full data matrix (here, using the $ instead of the [[ operator
# to access the matrix)
full.matrix <- lfile$matrix[, ]
dim(x = full.matrix)
## [1]  2700 13714
# Access all gene names
gene.names <- lfile[["row_attrs/gene_names"]][]
head(x = gene.names)
## [1] "AL627309.1"    "AP006222.2"    "RP11-206L10.2" "RP11-206L10.9"
## [5] "LINC00115"     "NOC2L"
get.attribute.df

As metadata is stored in several datasets within a loom file, each loom object has a get.attribute.df method: a method for collecting various metadata datasets and organizing them into a data frame for ease of use. This method takes a direction to look in (either 1 or 2 for row (gene) or column (cell) metadata, respectively) and a list of metadata dataset names. See below in the “Chunk-based iteration” section for details about MARGINs in loomR.

# Pull three bits of metadata from the column attributes
attrs <- c("nUMI", "nGene", "orig.ident")
attr.df <- lfile$get.attribute.df(MARGIN = 2, attribute.names = attrs)
head(x = attr.df)
##                nUMI nGene    orig.ident
## AAACATACAACCAC 2419   779 SeuratProject
## AAACATTGAGCTAC 4903  1352 SeuratProject
## AAACATTGATCAGC 3147  1129 SeuratProject
## AAACCGTGCTTCCG 2639   960 SeuratProject
## AAACCGTGTATGCG  980   521 SeuratProject
## AAACGCACTGGTAC 2163   781 SeuratProject

Matrices in loomR

Matrix Transposition in loomR
In order to maintain maximum efficiency and I/O speed, the HDF5 library transposes the access for the underlying data matrix. This means that genes are stored in columns, and cells are stored in rows. The boost in speed is well worth it, and we’ve taken precaution in developing loomR so that all methods take this into account, and properly transpose data for users. However, when interacting with the data matrices directly in loomR, please be sure to keep this in mind. Note also that, in order to maintain variable compatibility with loomR, the row.attrs slot still refers to gene metadata, and the col.attrs still refers to the cell metadata
# Print the number of genes
lfile[["row_attrs/gene_names"]]$dims
## [1] 13714
# Is the number of genes the same as the second dimension (typically
# columns) for the matrix?
lfile[["row_attrs/gene_names"]]$dims == lfile[["matrix"]]$dims[2]
## [1] TRUE
# For the sake of consistency within the single-cell community, we've
# reversed the dimensions for the `shape` field.  As such, the number of
# genes is stored in `lfile$shape[1]`; the number of cells is stored in the
# second field
lfile[["row_attrs/gene_names"]]$dims == lfile$shape[1]
## [1] TRUE

To access data from a subset of genes (or a subset of cells), see the examples below:

# Pull gene expression data for all genes, for the first 5 cells Note that
# we're using the row position for cells
data.subset <- lfile[["matrix"]][1:5, ]
dim(x = data.subset)
## [1]     5 13714
# You can transpose this matrix if you wish to restore the standard
# orientation
data.subset <- t(x = data.subset)
dim(x = data.subset)
## [1] 13714     5
# Pull gene expression data for the gene MS4A1 Note that we're using the
# column position for genes
data.gene <- lfile[["matrix"]][, lfile$row.attrs$gene_names[] == "MS4A1"]
head(x = data.gene)
## [1] 0 6 0 0 0 0

Adding data to loom objects

We can layers, gene-level metadata (row_attrs), and cell-level metadata (col_attrs) to a loom object using loomR. You can read full details at the loom file specification.

Methods for adding layers and matrices are provided by the loom class with add.layer, add.row.attribute, and add.col.attribute. All of the adding methods take a named list of either matrices or vectors. For example, to ENSEMBL IDs to gene-level metadata would be done as follows:

# Generate random ENSEMBL IDs for demonstration purposes
ensembl.ids <- paste0("ENSG0000", 1:length(x = lfile$row.attrs$gene_names[]))
# Use add.row.attribute to add the IDs Note that if you want to overwrite an
# existing value, set overwrite = TRUE
lfile$add.row.attribute(list(ensembl.id = ensembl.ids), overwrite = TRUE)
lfile[["row_attrs"]]
## Class: H5Group
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Group: /row_attrs
## Listing:
##        name    obj_type dataset.dims dataset.type_class
##  ensembl.id H5I_DATASET        13714         H5T_STRING
##  gene_names H5I_DATASET        13714         H5T_STRING
# Find the ENSEMBL ID for TCL1A
lfile[["row_attrs/ensembl.id"]][lfile$row.attrs$gene_names[] == "TCL1A"]
## [1] "ENSG00009584"

Chunk-based iteration

loomR provides built-in chunking methods for loom objects. For massive datasets, this allows you to load the data in small pieces, to optimize memory usage. loomR implements the map and apply methods (which perform calculations as the batches are read in). Under the hood, these use the batch.scan and batch.next methods, which do the low-level data chunking. Each method works on any dataset in the loom file, including the one-dimensional arrays in row_attrs and col_attrs.

The map method performs a calculation across all genes or cells without loading all data into memory at once, though the final results are kept in memory. For example, we can calculate the total number of UMI per cell.

# Map rowSums to `matrix`, using 500 cells at a time, returning a vector
nUMI_map <- lfile$map(FUN = rowSums, MARGIN = 2, chunk.size = 500, dataset.use = "matrix", 
    display.progress = FALSE)
# How long is nUMI_map? It should be equal to the number of cells in
# `matrix`
length(x = nUMI_map) == lfile$matrix$dims[1]
## [1] TRUE
MARGINs
Several methods for a loom object have a MARGIN argument; this argument tells the loom file on which dimension to iterate over, add, or fetch data. To keep consistent with other R tools for single-cell RNAseq analysis, a MARGIN of 1 represents the rows, or genes, while a MARGIN of 2 represents the columns, or cells. This also applies to the shape field of a loom object: index 1 represents the number of genes in a loom file while index 2 represents the number of cells.

The apply method is similar to map, except it stores the results in the loom file rather than returning them to the user. Therefore, it uses the least amount of memory as memory is only being used for the current iteration. To save a result to disk, a name must be provided. This name must be a full UNIX-style path (group/dataset), and the result will take the shape of datasets stored in said group.

# Apply rowSums to `matrix`, using 500 cells at a time, storing in
# `col_attrs/umi_apply`
lfile$apply(name = "col_attrs/umi_apply", FUN = rowSums, MARGIN = 2, chunk.size = 500, 
    dataset.use = "matrix", display.progress = FALSE, overwrite = TRUE)
lfile$col.attrs$umi_apply
## Class: H5D
## Dataset: /col_attrs/umi_apply
## Filename: /home/paul/Documents/Satija/pbmc.loom
## Access type: H5F_ACC_RDWR
## Datatype: H5T_IEEE_F64LE
## Space: Type=Simple     Dims=2700     Maxdims=Inf
## Chunk: 1024
# Ensure that all values are the same as doing a non-chunked calculation
all(lfile$col.attrs$umi_apply[] == rowSums(x = lfile$matrix[, ]))
## [1] TRUE
Selective-chunking
Both map and apply support working with certain values instead of all values in a given dataset. For example, one could use apply to scale the expression values of certain genes instead of all genes. This is done by passing a vector of integers to the index.use argument of either map or apply where each value in this vector is the numeric index of the row or column to use in the analysis. For map, the result will be either a vector that’s as long as index.use or a matrix where one dimension is the length of index.use. For apply, as all datasets within the loom file must be a certain size, skipped indices will be filled with NAs.

Closing loom objects

Because loom objects are connections to a file, they must be closed to ensure all data is written out to disk properly. This is done using the close_all method.

lfile$close_all()

loomR and Seurat

We are in the process of adding loom file support to Seurat, a single-cell RNAseq analysis toolkit developed by the Satija Lab, using loomR as the API. Currently, loom support in Seurat is available only in the ‘loom’ branch on GitHub. We can use devtools to install this branch directly from GitHub.

devtools::install_github(repo = "satijalab/seurat", ref = "loom")
library(Seurat)

Creating a loom object from a Seurat object (converting between Seurat and loom)

Seurat offers a conversion function to go from Seurat objects to loom files. The reverse conversion is currently in progress.

# Load the pbmc_small dataset included in Seurat
data("pbmc_small")
pbmc_small
## An object of class seurat in project SeuratProject 
##  230 genes across 80 samples.
# Convert from Seurat to loom Convert takes and object in 'from', a name of
# a class in 'to', and, for conversions to loom, a filename
pfile <- Convert(from = pbmc_small, to = "loom", filename = "pbmc_small.loom", 
    display.progress = FALSE)
pfile
## Class: loom
## Filename: /home/paul/Documents/Satija/pbmc_small.loom
## Access type: H5F_ACC_RDWR
## Attributes: version, chunks
## Listing:
##        name    obj_type dataset.dims dataset.type_class
##   col_attrs   H5I_GROUP         <NA>               <NA>
##  col_graphs   H5I_GROUP         <NA>               <NA>
##      layers   H5I_GROUP         <NA>               <NA>
##      matrix H5I_DATASET     80 x 230          H5T_FLOAT
##   row_attrs   H5I_GROUP         <NA>               <NA>
##  row_graphs   H5I_GROUP         <NA>               <NA>

Seurat standard workflow

loom Support in Seurat
Support for loom objects within Seurat is constanly being improved and expanded. Not only do analysis functions work on loom objects, but several visualization and Seurat-style accessor functions have now been implemented for loom objects. For greater detail about using Seurat to analyze data stored in a loom file, please see this vignette walking through analysis of the Microwell-seq “Mouse Cell Atlas”.

On the loom branch, we’ve modified a subset of Seurat functions to work seamlessly with standard Seurat syntax. This is the S3/S4 style of R programming rather than the R6 found in loomR. The only difference is that since loom objects are R6 objects, results do not need to be explicitly at the end of a function call, as the loom objects are modified directly (see below).

# Normalize data, and find variable genes using the typical Seurat workflow
pbmc_small <- NormalizeData(object = pbmc_small, display.progress = FALSE)
pbmc_small <- FindVariableGenes(object = pbmc_small, display.progress = FALSE, 
    do.plot = FALSE)
head(x = pbmc_small@hvg.info)
##         gene.mean gene.dispersion gene.dispersion.scaled
## PCMT1    3.942220        7.751848               2.808417
## PPBP     5.555949        7.652876               1.216898
## LYAR     4.231004        7.577377               1.528749
## VDAC3    4.128322        7.383980               1.296982
## KHDRBS1  3.562833        7.367928               2.476809
## IGLL5    3.758330        7.319567               2.018088
# Run the same workflow using the loom object
NormalizeData(object = pfile, overwrite = TRUE, display.progress = FALSE)
FindVariableGenes(object = pfile, overwrite = TRUE, display.progress = FALSE)
# Normalized data goes into the 'norm_data' layer, variable gene information
# goes into 'row_attrs' Are the results equal?
par(mfrow = c(1, 2))
plot(x = t(x = pfile$layers$norm_data[, ]), y = pbmc_small@data, main = "Normalized Data", 
    xlab = "loom", ylab = "Seurat")
plot(x = pfile$row.attrs$gene_means[], y = pbmc_small@hvg.info[pfile$row.attrs$gene_names[], 
    "gene.mean"], main = "Gene Means", xlab = "loom", ylab = "Seurat")

For more information about using loom objects with Seurat, please see this vignette for more details. As always, remember to close loom objects when finished using them.

pfile$close_all()

Extra Information
  • For more information about loom files, please see the documentation here
  • To track the development of loomR, follow its GitHub repository
  • Bug reports for loomR can be filed here
  • A full function and method reference for loomR can be found here
  • For details about hdf5r, the underlying HDF5 library used by loomR, please see its GitHub page
  • Information about loomR’s matrix transposing can be found here and here
  • To learn more about R6, a series of vignettes are available here
  • Details about loom support in Seurat can be found on the loom branch of Seurat, hosted on GitHub
  • Bug reports for Seurat can be filed on its GitHub repository