vignettes/integration_mapping.Rmd
integration_mapping.RmdIn this vignette, we first build an integrated reference and then demonstrate how to leverage this reference to annotate new query datasets. Generating an integrated reference follows the same workflow described in more detail in the integration introduction vignette. Once generated, this reference can be used to analyze additional query datasets through tasks like cell type label transfer and projecting query cells onto reference UMAPs. Notably, this does not require correction of the underlying raw query data and can therefore be an efficient strategy if a high quality reference is available.
For the purposes of this example, we’ve chosen human pancreatic islet
cell datasets produced across four technologies, CelSeq (GSE81076)
CelSeq2 (GSE85241), Fluidigm C1 (GSE86469), and SMART-Seq2
(E-MTAB-5061). For convenience, we distribute this dataset through our
SeuratData
package. The metadata contains the technology (tech column)
and cell type annotations (celltype column) for each cell
in the four datasets.
InstallData("panc8")As a demonstration, we will use a subset of technologies to construct a reference. We will then map the remaining datasets onto this reference. We start by selecting cells from two technologies, and performing an analysis without integration.
##
## celseq celseq2 fluidigmc1 indrop smartseq2
## 1004 2285 638 8569 2394
# we will use data from 2 technologies for the reference
pancreas.ref <- subset(panc8, tech %in% c("celseq2", "smartseq2"))
pancreas.ref[["RNA"]] <- split(pancreas.ref[["RNA"]], f = pancreas.ref$tech)
# pre-process dataset (without integration)
pancreas.ref <- NormalizeData(pancreas.ref)
pancreas.ref <- FindVariableFeatures(pancreas.ref)
pancreas.ref <- ScaleData(pancreas.ref)
pancreas.ref <- RunPCA(pancreas.ref)
pancreas.ref <- FindNeighbors(pancreas.ref, dims = 1:30)
pancreas.ref <- FindClusters(pancreas.ref)## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 4679
## Number of edges: 174953
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9180
## Number of communities: 19
## Elapsed time: 0 seconds
pancreas.ref <- RunUMAP(pancreas.ref, dims = 1:30)
DimPlot(pancreas.ref, group.by = c("celltype", "tech"))
Next, we integrate the datasets into a shared reference. Please see our introduction to integration vignette.
pancreas.ref <- IntegrateLayers(object = pancreas.ref, method = CCAIntegration, orig.reduction = "pca",
new.reduction = "integrated.cca", verbose = FALSE)
pancreas.ref <- FindNeighbors(pancreas.ref, reduction = "integrated.cca", dims = 1:30)
pancreas.ref <- FindClusters(pancreas.ref)## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 4679
## Number of edges: 190152
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8680
## Number of communities: 15
## Elapsed time: 0 seconds
pancreas.ref <- RunUMAP(pancreas.ref, reduction = "integrated.cca", dims = 1:30)
DimPlot(pancreas.ref, group.by = c("tech", "celltype"))
Seurat also supports the projection of reference data (or meta data) onto a query object. While many of the methods are conserved (both procedures begin by identifying anchors), there is an important distinction between data transfer and integration: in data transfer, Seurat does not correct or modify the query expression data. Seurat has an option (set by default) to project the PCA structure of a reference onto the query, instead of learning a joint structure with CCA. We generally suggest using this option when projecting data between scRNA-seq datasets.
After finding anchors, we use the TransferData()
function to classify the query cells based on reference data (a vector
of reference cell type labels). TransferData() returns a
matrix with predicted IDs and prediction scores, which we can add to the
query metadata.
# select two technologies for the query datasets
pancreas.query <- subset(panc8, tech %in% c("fluidigmc1", "celseq"))
pancreas.query <- NormalizeData(pancreas.query)
pancreas.anchors <- FindTransferAnchors(reference = pancreas.ref, query = pancreas.query, dims = 1:30,
reference.reduction = "pca")
predictions <- TransferData(anchorset = pancreas.anchors, refdata = pancreas.ref$celltype, dims = 1:30)
pancreas.query <- AddMetaData(pancreas.query, metadata = predictions)Because we have the original label annotations from our full integrated analysis, we can evaluate how well our predicted cell type annotations match the full reference. In this example, we find that there is a high agreement in cell type classification, with over 96% of cells being labeled correctly.
pancreas.query$prediction.match <- pancreas.query$predicted.id == pancreas.query$celltype
table(pancreas.query$prediction.match)##
## FALSE TRUE
## 63 1579
To verify this further, we can examine some canonical cell type markers for specific pancreatic islet cell populations. Note that even though some of these cell types are only represented by one or two cells (e.g. epsilon cells), we are still able to classify them correctly.
table(pancreas.query$predicted.id)##
## acinar activated_stellate alpha beta
## 262 39 436 419
## delta ductal endothelial gamma
## 73 330 19 41
## macrophage mast schwann
## 15 2 6
VlnPlot(pancreas.query, c("REG1A", "PPY", "SST", "GHRL", "VWF", "SOX10"), group.by = "predicted.id")
We also enable projection of a query onto the reference UMAP
structure. This can be achieved by computing the reference UMAP model
and then calling MapQuery() instead of
TransferData().
pancreas.ref <- RunUMAP(pancreas.ref, dims = 1:30, reduction = "integrated.cca", return.model = TRUE)
pancreas.query <- MapQuery(anchorset = pancreas.anchors, reference = pancreas.ref, query = pancreas.query,
refdata = list(celltype = "celltype"), reference.reduction = "pca", reduction.model = "umap")MapQuery doing?
MapQuery() is a wrapper around three functions:
TransferData(), IntegrateEmbeddings(), and
ProjectUMAP(). TransferData() is used to
transfer cell type labels and impute the ADT values;
IntegrateEmbeddings() is used to integrate reference with
query by correcting the query’s projected low-dimensional embeddings;
and finally ProjectUMAP() is used to project the query data
onto the UMAP structure of the reference. The equivalent code for doing
this with the intermediate functions is below:
pancreas.query <- TransferData(anchorset = pancreas.anchors, reference = pancreas.ref, query = pancreas.query,
refdata = list(celltype = "celltype"))
pancreas.query <- IntegrateEmbeddings(anchorset = pancreas.anchors, reference = pancreas.ref, query = pancreas.query,
new.reduction.name = "ref.pca")
pancreas.query <- ProjectUMAP(query = pancreas.query, query.reduction = "ref.pca", reference = pancreas.ref,
reference.reduction = "pca", reduction.model = "umap")We can now visualize the query cells alongside our reference.
p1 <- DimPlot(pancreas.ref, reduction = "umap", group.by = "celltype", label = TRUE, label.size = 3,
repel = TRUE) + NoLegend() + ggtitle("Reference annotations")
p2 <- DimPlot(pancreas.query, reduction = "ref.umap", group.by = "predicted.celltype", label = TRUE,
label.size = 3, repel = TRUE) + NoLegend() + ggtitle("Query transferred labels")
p1 + p2
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] future_1.70.0 ggplot2_4.0.3 pbmcsca.SeuratData_3.0.0
## [4] pbmcref.SeuratData_1.0.0 panc8.SeuratData_3.0.2 SeuratData_0.2.2.9002
## [7] Seurat_5.5.0 SeuratObject_5.4.0 sp_2.2-1
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 jsonlite_2.0.0 magrittr_2.0.5
## [4] ggbeeswarm_0.7.3 spatstat.utils_3.2-2 farver_2.1.2
## [7] rmarkdown_2.30 fs_2.1.0 ragg_1.5.1
## [10] vctrs_0.7.1 ROCR_1.0-12 spatstat.explore_3.8-0
## [13] htmltools_0.5.9 sass_0.4.10 sctransform_0.4.3
## [16] parallelly_1.47.0 KernSmooth_2.23-26 bslib_0.10.0
## [19] htmlwidgets_1.6.4 desc_1.4.3 ica_1.0-3
## [22] plyr_1.8.9 plotly_4.12.0 zoo_1.8-15
## [25] cachem_1.1.0 igraph_2.3.0 mime_0.13
## [28] lifecycle_1.0.5 pkgconfig_2.0.3 Matrix_1.7-5
## [31] R6_2.6.1 fastmap_1.2.0 fitdistrplus_1.2-6
## [34] shiny_1.13.0 digest_0.6.39 patchwork_1.3.2
## [37] tensor_1.5.1 RSpectra_0.16-2 irlba_2.3.7
## [40] textshaping_1.0.5 labeling_0.4.3 progressr_0.19.0
## [43] spatstat.sparse_3.1-0 httr_1.4.8 polyclip_1.10-7
## [46] abind_1.4-8 compiler_4.5.2 withr_3.0.2
## [49] S7_0.2.1 fastDummies_1.7.6 MASS_7.3-65
## [52] rappdirs_0.3.4 tools_4.5.2 vipor_0.4.7
## [55] lmtest_0.9-40 otel_0.2.0 beeswarm_0.4.0
## [58] httpuv_1.6.16 future.apply_1.20.2 goftest_1.2-3
## [61] glue_1.8.0 nlme_3.1-168 promises_1.5.0
## [64] grid_4.5.2 Rtsne_0.17 cluster_2.1.8.2
## [67] reshape2_1.4.5 generics_0.1.4 gtable_0.3.6
## [70] spatstat.data_3.1-9 tidyr_1.3.2 data.table_1.18.2.1
## [73] spatstat.geom_3.7-3 RcppAnnoy_0.0.23 ggrepel_0.9.8
## [76] RANN_2.6.2 pillar_1.11.1 stringr_1.6.0
## [79] spam_2.11-3 RcppHNSW_0.6.0 later_1.4.8
## [82] splines_4.5.2 dplyr_1.2.0 lattice_0.22-7
## [85] survival_3.8-3 deldir_2.0-4 tidyselect_1.2.1
## [88] miniUI_0.1.2 pbapply_1.7-4 knitr_1.51
## [91] gridExtra_2.3 scattermore_1.2 xfun_0.56
## [94] matrixStats_1.5.0 stringi_1.8.7 lazyeval_0.2.3
## [97] yaml_2.3.12 evaluate_1.0.5 codetools_0.2-20
## [100] tibble_3.3.1 cli_3.6.6 uwot_0.2.4
## [103] xtable_1.8-8 reticulate_1.46.0 systemfonts_1.3.2
## [106] jquerylib_0.1.4 Rcpp_1.1.1-1.1 globals_0.19.1
## [109] spatstat.random_3.4-5 png_0.1-9 ggrastr_1.0.2
## [112] spatstat.univar_3.1-7 parallel_4.5.2 pkgdown_2.2.0
## [115] dotCall64_1.2 listenv_0.10.1 viridisLite_0.4.3
## [118] scales_1.4.0 ggridges_0.5.7 purrr_1.2.1
## [121] crayon_1.5.3 rlang_1.2.0 cowplot_1.2.0
## [124] formatR_1.14