Intro: Seurat v4 Reference Mapping

This vignette introduces the process of mapping query datasets to annotated references in Seurat. In this example, we map one of the first scRNA-seq datasets released by 10X Genomics of 2,700 PBMC to our recently described CITE-seq reference of 162,000 PBMC measured with 228 antibodies. We chose this example to demonstrate how supervised analysis guided by a reference dataset can help to enumerate cell states that would be challenging to find with unsupervised analysis. In a second example, we demonstrate how to serially map Human Cell Atlas datasets of human BMNC profiled from different individuals onto a consistent reference.

We have previously demonstrated how to use reference-mapping approach to annotate cell labels in a query dataset . In Seurat v4, we have substantially improved the speed and memory requirements for integrative tasks including reference mapping, and also include new functionality to project query cells onto a previously computed UMAP visualization.

In this vignette, we demonstrate how to use a previously established reference to interpret an scRNA-seq query:

  • Annotate each query cell based on a set of reference-defined cell states
  • Project each query cell onto a previously computed UMAP visualization
  • Impute the predicted levels of surface proteins that were measured in the CITE-seq reference

To run this vignette please install Seurat v4, available on CRAN. Additionally, you will need to install the SeuratDisk package.


Example 1: Mapping human peripheral blood cells

A Multimodal PBMC Reference Dataset

We load the reference (download here) from our recent preprint, and visualize the pre-computed UMAP. This reference is stored as an h5Seurat file, a format that enables on-disk storage of multimodal Seurat objects (more details on h5Seurat and SeuratDisk can be found here).

reference <- LoadH5Seurat("../data/pbmc_multimodal.h5seurat")
DimPlot(object = reference, reduction = "wnn.umap", = "celltype.l2", label = TRUE, label.size = 3, repel = TRUE) + NoLegend()


To demonstrate mapping to this multimodal reference, we will use a dataset of 2,700 PBMCs generated by 10x Genomics and available via SeuratData.

The reference was normalized using SCTransform(), so we use the same approach to normalize the query here.

pbmc3k <- SCTransform(pbmc3k, verbose = FALSE)

We then find anchors between reference and query. As described in the manuscript, we used a precomputed supervised PCA (spca) transformation for this example. We recommend the use of supervised PCA for CITE-seq datasets, and demonstrate how to compute this transformation on the next tab of this vignette. However, you can also use a standard PCA transformation.

anchors <- FindTransferAnchors(
  reference = reference,
  query = pbmc3k,
  normalization.method = "SCT",
  reference.reduction = "spca",
  dims = 1:50

We then transfer cell type labels and protein data from the reference to the query. Additionally, we project the query data onto the UMAP structure of the reference.

pbmc3k <- MapQuery(
  anchorset = anchors,
  query = pbmc3k,
  reference = reference,
  refdata = list(
    celltype.l1 = "celltype.l1",
    celltype.l2 = "celltype.l2",
    predicted_ADT = "ADT"
  reference.reduction = "spca", 
  reduction.model = "wnn.umap"

What is MapQuery doing?

MapQuery() is a wrapper around three functions: TransferData(), IntegrateEmbeddings(), and ProjectUMAP(). TransferData() is used to transfer cell type labels and impute the ADT values. IntegrateEmbeddings() and ProjectUMAP() are used to project the query data onto the UMAP structure of the reference. The equivalent code for doing this with the intermediate functions is below:

pbmc3k <- TransferData(
  anchorset = anchors, 
  reference = reference,
  query = pbmc3k,
  refdata = list(
    celltype.l1 = "celltype.l1",
    celltype.l2 = "celltype.l2",
    predicted_ADT = "ADT")
pbmc3k <- IntegrateEmbeddings(
  anchorset = anchors,
  reference = reference,
  query = pbmc3k, = "ref.spca"
pbmc3k <- ProjectUMAP(
  query = pbmc3k, 
  query.reduction = "ref.spca", 
  reference = reference, 
  reference.reduction = "spca", 
  reduction.model = "wnn.umap"

Explore the mapping results

We can now visualize the 2,700 query cells. They have been projected into a UMAP visualization defined by the reference, and each has received annotations at two levels of granularity (level 1, and level 2).

p1 = DimPlot(pbmc3k, reduction = "ref.umap", = "predicted.celltype.l1", label = TRUE, label.size = 3, repel = TRUE) + NoLegend()
p2 = DimPlot(pbmc3k, reduction = "ref.umap", = "predicted.celltype.l2", label = TRUE, label.size = 3 ,repel = TRUE) + NoLegend()
p1 + p2

The reference-mapped dataset helps us identify cell types that were previously blended in an unsupervised analysis of the query dataset. Just a few examples include plasmacytoid dendritic cells (pDC), hematopoietic stem and progenitor cells (HSPC), regulatory T cells (Treg), CD8 Naive T cells, cells, CD56+ NK cells, memory, and naive B cells, and plasmablasts.

Each prediction is assigned a score between 0 and 1.

FeaturePlot(pbmc3k, features = c("pDC", "CD16 Mono", "Treg"),  reduction = "ref.umap", cols = c("lightgrey", "darkred"), ncol = 3) & theme(plot.title = element_text(size = 10))

We can verify our predictions by exploring the expression of canonical marker genes. For example, CLEC4C and LIRA4 have been reported as markers of pDC identity, consistent with our predictions. Similarly, if we perform differential expression to identify markers of Tregs, we identify a set of canonical markers including RTKN2, CTLA4, FOXP3, and IL2RA.

Idents(pbmc3k) <- 'predicted.celltype.l2'
VlnPlot(pbmc3k, features = c("CLEC4C", "LILRA4"), sort = TRUE) + NoLegend()

treg_markers <- FindMarkers(pbmc3k, ident.1 = "Treg", only.pos = TRUE, logfc.threshold = 0.1)
##                       p_val avg_log2FC pct.1 pct.2    p_val_adj
## FOXP3          8.428389e-41  0.2002914 0.152 0.002 1.059617e-36
## RP11-1399P15.1 1.702575e-21  0.4749271 0.273 0.021 2.140478e-17
## RTKN2          3.890305e-17  0.2339929 0.121 0.005 4.890892e-13
## TIGIT          3.773433e-16  0.5934260 0.424 0.065 4.743960e-12
## F5             4.035005e-16  0.2318413 0.121 0.005 5.072808e-12
## FRS2           2.983908e-15  0.1906088 0.152 0.009 3.751369e-11

Finally, we can visualize the imputed levels of surface protein, which were inferred based on the CITE-seq reference.

DefaultAssay(pbmc3k) <- 'predicted_ADT'
# see a list of proteins: rownames(pbmc3k)
FeaturePlot(pbmc3k, features = c("CD3-1", "CD45RA", "IgD"), reduction = "ref.umap", cols = c("lightgrey", "darkgreen"), ncol = 3)

Computing a new UMAP visualiztion

In the previous examples, we visualize the query cells after mapping to the reference-derived UMAP. Keeping a consistent visualization can assist with the interpretation of new datasets. However, if there are cell states that are present in the query dataset that are not represented in the reference, they will project to the most similar cell in the reference. This is the expected behavior and functionality as established by the UMAP package, but can potentially mask the presence of new cell types in the query which may be of interest.

In our manuscript, we map a query dataset containing developing and differentiated neutrophils, which are not included in our reference. We find that computing a new UMAP (‘de novo visualization’) after merging the reference and query can help to identify these populations, as demonstrated in Supplementary Figure 8. In the ‘de novo’ visualization, unique cell states in the query remain separated. In this example, the 2,700 PBMC does not contain unique cell states, but we demonstrate how to compute this visualization below.

We emphasize that if users are attempting to map datasets where the underlying samples are not PBMC, or contain cell types that are not present in the reference, computing a ‘de novo’ visualization is an important step in interpreting their dataset.

#merge reference and query
reference$id <- 'reference'
pbmc3k$id <- 'query'
refquery <- merge(reference, pbmc3k)
refquery[["spca"]] <- merge(reference[["spca"]], pbmc3k[["ref.spca"]])
refquery <- RunUMAP(refquery, reduction = 'spca', dims = 1:50)
DimPlot(refquery, = 'id', shuffle = TRUE)

Example 2: Mapping human bone marrow cells

A Multimodal BMNC Reference Dataset

As a second example, we map a dataset of human bone marrow mononuclear (BMNC) cells from eight individual donors, produced by the Human Cell Atlas. As a reference, we use the CITE-seq reference of human BMNC that we analyzed using weighted-nearest neighbor analysis (WNN).

This vignette exhibits the same reference-mapping functionality as the PBMC example on the previous tab. In addition, we also demonstrate:

  • How to construct a supervised PCA (sPCA) transformation
  • How to serially map multiple datasets to the same reference
  • Optimization steps to further enhance to speed of mapping
# Both datasets are available through SeuratData
#load reference data
bm <- LoadData(ds = "bmcite")
#load query data
hcabm40k <- LoadData(ds = "hcabm40k")

The reference dataset contains a WNN graph, reflecting a weighted combination of the RNA and protein data in this CITE-seq experiment.

We can compute a UMAP visualization based on this graph. We set return.model = TRUE, which will enable us to project query datasets onto this visualization.

bm <- RunUMAP(bm, = "weighted.nn", = "wnn.umap", 
              reduction.key = "wnnUMAP_", return.model = TRUE)
DimPlot(bm, = "celltype.l2", reduction = "wnn.umap") 

Computing an sPCA transformation

As described in our manuscript, we first compute a ‘supervised’ PCA. This identifies the transformation of the transcriptome data that best encapsulates the structure of the WNN graph. This allows a weighted combination of the protein and RNA measurements to ‘supervise’ the PCA, and highlight the most relevant sources of variation. After computing this transformation, we can project it onto a query dataset. We can also compute and project a PCA projection, but recommend the use of sPCA when working with multimodal references that have been constructed with WNN analysis.

The sPCA calculation is performed once, and then can be rapidly projected onto each query dataset.

bm <- ScaleData(bm, assay = 'RNA')
bm <- RunSPCA(bm, assay = 'RNA', graph = 'wsnn')

Computing a cached neighbor index

Since we will be mapping multiple query samples to the same reference, we can cache particular steps that only involve the reference. This step is optional but will improve speed when mapping multiple samples.

We compute the first 50 neighbors in the sPCA space of the reference. We store this information in the spca.annoy.neighbors object within the reference Seurat object and also cache the annoy index data structure (via cache.index = TRUE).

bm <- FindNeighbors(
  object = bm,
  reduction = "spca",
  dims = 1:50, = "spca.annoy.neighbors", 
  k.param = 50,
  cache.index = TRUE,
  return.neighbor = TRUE,
  l2.norm = TRUE

How can I save and load a cached annoy index?

If you want to save and load a cached index for a Neighbor object generated with method = "annoy" and cache.index = TRUE, use the SaveAnnoyIndex()/LoadAnnoyIndex() functions. Importantly, this index cannot be saved normally to an RDS or RDA file, so it will not persist correctly across R session restarts or saveRDS/readRDS for the Seurat object containing it. Instead, use LoadAnnoyIndex() to add the Annoy index to the Neighbor object every time R restarts or you load the reference Seurat object from RDS. The file created by SaveAnnoyIndex() can be distributed along with a reference Seurat object, and added to the Neighbor object in the reference.

## A Neighbor object containing the 50 nearest neighbors for 30672 cells
SaveAnnoyIndex(object = bm[["spca.annoy.neighbors"]], file = "../data/reftmp.idx")
bm[["spca.annoy.neighbors"]] <- LoadAnnoyIndex(object = bm[["spca.annoy.neighbors"]], file = "../data/reftmp.idx")

Query dataset preprocessing

Here we will demonstrate mapping multiple donor bone marrow samples to the multimodal bone marrow reference. These query datasets are derived from the Human Cell Atlas (HCA) Immune Cell Atlas Bone marrow dataset and are available through SeuratData. This dataset is provided as a single merged object with 8 donors. We first split the data back into 8 separate Seurat objects, one for each original donor to map individually.

hcabm40k.batches <- SplitObject(hcabm40k, = "orig.ident")

We then normalize the query in the same manner as the reference. Here, the reference was normalized using log-normalization via NormalizeData(). If the reference had been normalized using SCTransform(), the query must be normalized with SCTransform() as well.

hcabm40k.batches <- lapply(X = hcabm40k.batches, FUN = NormalizeData, verbose = FALSE)


We then find anchors between each donor query dataset and the multimodal reference. This command is optimized to minimize mapping time, by passing in a pre-computed set of reference neighbors, and turning off anchor filtration.

anchors <- list()
for (i in 1:length(hcabm40k.batches)) {
  anchors[[i]] <- FindTransferAnchors(
    reference = bm,
    query = hcabm40k.batches[[i]],
    k.filter = NA,
    reference.reduction = "spca", 
    reference.neighbors = "spca.annoy.neighbors", 
    dims = 1:50

We then individually map each of the datasets.

for (i in 1:length(hcabm40k.batches)) {
  hcabm40k.batches[[i]] <- MapQuery(
    anchorset = anchors[[i]], 
    query = hcabm40k.batches[[i]],
    reference = bm, 
    refdata = list(
      celltype = "celltype.l2", 
      predicted_ADT = "ADT"),
    reference.reduction = "spca",
    reduction.model = "wnn.umap"

Explore the mapping results

Now that mapping is complete, we can visualize the results for individual objects

p1 <- DimPlot(hcabm40k.batches[[1]], reduction = 'ref.umap', = 'predicted.celltype', label.size = 3)
p2 <- DimPlot(hcabm40k.batches[[2]], reduction = 'ref.umap', = 'predicted.celltype', label.size = 3)
p1 + p2 + plot_layout(guides = "collect")

We can also merge all the objects into one dataset. Note that they have all been integrated into a common space, defined by the reference. We can then visualize the results together.

# Merge the batches 
hcabm40k <- merge(hcabm40k.batches[[1]], hcabm40k.batches[2:length(hcabm40k.batches)], merge.dr = "ref.umap")
DimPlot(hcabm40k, reduction = "ref.umap", =  "predicted.celltype", label = TRUE, repel = TRUE, label.size = 3) + NoLegend()

We can visualize gene expression, cluster prediction scores, and (imputed) surface protein levels in the query cells:

p3 <- FeaturePlot(hcabm40k, features = c("rna_TRDC", "rna_MPO", "rna_AVP"), reduction = 'ref.umap', 
                  max.cutoff = 3, ncol = 3)

# cell type prediction scores
DefaultAssay(hcabm40k) <- 'prediction.score.celltype'
p4 <- FeaturePlot(hcabm40k, features = c("CD16 Mono", "HSC", "Prog-RBC"), ncol = 3, 
                  cols = c("lightgrey", "darkred"))

# imputed protein levels
DefaultAssay(hcabm40k) <- 'predicted_ADT'
p5 <- FeaturePlot(hcabm40k, features = c("CD45RA", "CD16", "CD161"), reduction = 'ref.umap',
                  min.cutoff = 'q10', max.cutoff = 'q99', cols = c("lightgrey", "darkgreen") ,
                  ncol = 3)
p3 / p4 / p5

Session Info

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
##  [1] dplyr_1.0.4                        thp1.eccite.SeuratData_3.1.5      
##  [3] stxBrain.SeuratData_0.1.1          ssHippo.SeuratData_3.1.4          
##  [5] pbmcsca.SeuratData_3.0.0           pbmcMultiome.SeuratData_0.0.0.9000
##  [7] pbmc3k.SeuratData_3.1.4            panc8.SeuratData_3.0.2            
##  [9] ifnb.SeuratData_3.1.0              hcabm40k.SeuratData_3.0.0         
## [11] bmcite.SeuratData_0.3.0            SeuratData_0.2.1                  
## [13] patchwork_1.1.1                    ggplot2_3.3.3                     
## [15] SeuratDisk_0.0.0.9018              SeuratObject_4.0.0                
## [17] Seurat_4.0.0                      
## loaded via a namespace (and not attached):
##   [1] systemfonts_1.0.0    plyr_1.8.6           igraph_1.2.6        
##   [4] lazyeval_0.2.2       splines_4.0.3        listenv_0.8.0       
##   [7] scattermore_0.7      digest_0.6.27        htmltools_0.5.1.1   
##  [10] magrittr_2.0.1       memoise_2.0.0        tensor_1.5          
##  [13] cluster_2.1.0        ROCR_1.0-11          limma_3.46.0        
##  [16] globals_0.14.0       matrixStats_0.58.0   pkgdown_1.6.1       
##  [19] colorspace_2.0-0     rappdirs_0.3.3       ggrepel_0.9.1       
##  [22] textshaping_0.2.1    xfun_0.20            crayon_1.4.0        
##  [25] jsonlite_1.7.2       spatstat_1.64-1      spatstat.data_1.7-0 
##  [28] survival_3.2-7       zoo_1.8-8            glue_1.4.2          
##  [31] polyclip_1.10-0      gtable_0.3.0         leiden_0.3.7        
##  [34] future.apply_1.7.0   abind_1.4-5          scales_1.1.1        
##  [37] DBI_1.1.1            miniUI_0.1.1.1       Rcpp_1.0.6          
##  [40] viridisLite_0.3.0    xtable_1.8-4         reticulate_1.18     
##  [43] bit_4.0.4            htmlwidgets_1.5.3    httr_1.4.2          
##  [46] RColorBrewer_1.1-2   ellipsis_0.3.1       ica_1.0-2           
##  [49] pkgconfig_2.0.3      farver_2.0.3         uwot_0.1.10         
##  [52] deldir_0.2-9         tidyselect_1.1.0     labeling_0.4.2      
##  [55] rlang_0.4.10         reshape2_1.4.4       later_1.1.0.1       
##  [58] munsell_0.5.0        tools_4.0.3          cachem_1.0.3        
##  [61] cli_2.3.0            generics_0.1.0       ggridges_0.5.3      
##  [64] evaluate_0.14        stringr_1.4.0        fastmap_1.1.0       
##  [67] yaml_2.2.1           ragg_0.4.1           goftest_1.2-2       
##  [70] knitr_1.31           bit64_4.0.5          fs_1.5.0            
##  [73] fitdistrplus_1.1-3   purrr_0.3.4          RANN_2.6.1          
##  [76] pbapply_1.4-3        future_1.21.0        nlme_3.1-149        
##  [79] mime_0.9             rstudioapi_0.13      hdf5r_1.3.3         
##  [82] compiler_4.0.3       plotly_4.9.3         png_0.1-7           
##  [85] spatstat.utils_2.0-0 tibble_3.0.6         stringi_1.5.3       
##  [88] ps_1.5.0             highr_0.8            RSpectra_0.16-0     
##  [91] desc_1.2.0           lattice_0.20-41      Matrix_1.2-18       
##  [94] vctrs_0.3.6          pillar_1.4.7         lifecycle_0.2.0     
##  [97] lmtest_0.9-38        RcppAnnoy_0.0.18     data.table_1.13.6   
## [100] cowplot_1.1.1        irlba_2.3.3          httpuv_1.5.5        
## [103] R6_2.5.0             promises_1.1.1       KernSmooth_2.23-17  
## [106] gridExtra_2.3        parallelly_1.23.0    codetools_0.2-16    
## [109] MASS_7.3-53          assertthat_0.2.1     rprojroot_2.0.2     
## [112] withr_2.4.1          sctransform_0.3.2    mgcv_1.8-33         
## [115] parallel_4.0.3       grid_4.0.3           rpart_4.1-15        
## [118] tidyr_1.1.2          rmarkdown_2.6        Rtsne_0.15          
## [121] shiny_1.6.0