Here we demonstrate the integration of multiple single-cell chromatin datasets derived from human PBMCs. One dataset was generated using the 10x Genomics multiome technology, and includes DNA accessibility and gene expression information for each cell. The other dataset was profiled using 10x Genomics scATAC-seq, and includes DNA accessibility data only.

We will integrate the two datasets together using the shared DNA accessibility assay, using tools available in the Seurat package. Furthermore, we will demonstrate transferring both continuous (gene expression) and categorical (cell labels) information from a reference to a query single-cell chromatin dataset.

View data download code

The PBMC multiome and scATAC-seq data can be downloaded from the 10x website:

Preprocessing

Here we’ll load the PBMC multiome data pre-processed in our multiome vignette, and create a new object from the scATAC-seq data:

library(Signac)
library(Seurat)
library(ggplot2)

# load the pre-processed multiome data
pbmc.multi <- readRDS("../vignette_data/pbmc_multiomic.rds")

# process the scATAC data
# first count fragments per cell
fragpath <- "../vignette_data/atac_pbmc_10k_nextgem_fragments.tsv.gz"
fragcounts <- CountFragments(fragments = fragpath)
atac.cells <- fragcounts[fragcounts$frequency_count > 2000, "CB"]

# create the fragment object
atac.frags <- CreateFragmentObject(path = fragpath, cells = atac.cells)

An important first step in any integrative analysis of single-cell chromatin data is to ensure that the same features are measured in each dataset. Here, we quantify the multiome peaks in the ATAC dataset to ensure that there are common features across the two datasets.

# quantify multiome peaks in the scATAC-seq dataset
counts <- FeatureMatrix(
  fragments = atac.frags,
  features = granges(pbmc.multi),
  cells = atac.cells
)

# create object
atac.assay <- CreateChromatinAssay(
  counts = counts,
  min.features = 1000,
  fragments = atac.frags
)
pbmc.atac <- CreateSeuratObject(counts = atac.assay, assay = "peaks")
pbmc.atac <- subset(pbmc.atac, nCount_peaks > 2000 & nCount_peaks < 30000)

# compute LSI
pbmc.atac <- FindTopFeatures(pbmc.atac, min.cutoff = 10)
pbmc.atac <- RunTFIDF(pbmc.atac)
pbmc.atac <- RunSVD(pbmc.atac)

Next we can merge the multiome and scATAC datasets together and observe that there is a difference between them that appears to be due to the batch (experiment and technology-specific variation).

# first add dataset-identifying metadata
pbmc.atac$dataset <- "ATAC"
pbmc.multi$dataset <- "Multiome"

# merge
pbmc.combined <- merge(pbmc.atac, pbmc.multi)

# process the combined dataset
pbmc.combined <- FindTopFeatures(pbmc.combined, min.cutoff = 10)
pbmc.combined <- RunTFIDF(pbmc.combined)
pbmc.combined <- RunSVD(pbmc.combined)
pbmc.combined <- RunUMAP(pbmc.combined, reduction = "lsi", dims = 2:30)
p1 <- DimPlot(pbmc.combined, group.by = "dataset")

Integration

To find integration anchors between the two datasets, we need to project them into a shared low-dimensional space. To do this, we’ll use reciprocal LSI projection (projecting each dataset into the others LSI space) by setting reduction="rlsi". For more information about the data integration methods in Seurat, see our recent paper and the Seurat website.

Rather than integrating the normalized data matrix, as is typically done for scRNA-seq data, we’ll integrate the low-dimensional cell embeddings (the LSI coordinates) across the datasets using the IntegrateEmbeddings() function. This is much better suited to scATAC-seq data, as we typically have a very sparse matrix with a large number of features. Note that this requires that we first compute an uncorrected LSI embedding using the merged dataset (as we did above).

# find integration anchors
integration.anchors <- FindIntegrationAnchors(
  object.list = list(pbmc.multi, pbmc.atac),
  anchor.features = rownames(pbmc.multi),
  reduction = "rlsi",
  dims = 2:30
)

# integrate LSI embeddings
integrated <- IntegrateEmbeddings(
  anchorset = integration.anchors,
  reductions = pbmc.combined[["lsi"]],
  new.reduction.name = "integrated_lsi",
  dims.to.integrate = 1:30
)

# create a new UMAP using the integrated embeddings
integrated <- RunUMAP(integrated, reduction = "integrated_lsi", dims = 2:30)
p2 <- DimPlot(integrated, group.by = "dataset")

Finally, we can compare the results of the merged and integrated datasets, and find that the integration has successfully removed the technology-specific variation in the dataset while retaining the cell-type-specific (biological) variation.

(p1 + ggtitle("Merged")) | (p2 + ggtitle("Integrated"))

Here we’ve demonstrated the integration method using two datasets, but the same workflow can be applied to integrate any number of datasets.

Reference mapping

In cases where we have a large, high-quality dataset, or a dataset containing unique information not present in other datasets (cell type annotations or additional data modalities, for example), we often want to use that dataset as a reference and map queries onto it so that we can interpret these query datasets in the context of the existing reference.

To demonstrate how to do this using single-cell chromatin reference and query datasets, we’ll treat the PBMC multiome dataset here as a reference and map the scATAC-seq dataset to it using the FindTransferAnchors() and MapQuery() functions from Seurat.

# compute UMAP and store the UMAP model
pbmc.multi <- RunUMAP(pbmc.multi, reduction = "lsi", dims = 2:30, return.model = TRUE)

# find transfer anchors
transfer.anchors <- FindTransferAnchors(
  reference = pbmc.multi,
  query = pbmc.atac,
  reference.reduction = "lsi",
  reduction = "lsiproject",
  dims = 2:30
)

# map query onto the reference dataset
pbmc.atac <- MapQuery(
  anchorset = transfer.anchors,
  reference = pbmc.multi,
  query = pbmc.atac,
  refdata = pbmc.multi$predicted.id,
  reference.reduction = "lsi",
  new.reduction.name = "ref.lsi",
  reduction.model = 'umap'
)

What is MapQuery() doing?

MapQuery() is a wrapper function that runs TransferData(), IntegrateEmbeddings(), and ProjectUMAP() for a query dataset, and sets sensible default parameters based on how the anchor object was generated. For finer control over the parameters used by each of these functions, you can pass parameters through MapQuery() to each function using the transferdata.args, integrateembeddings.args, and projectumap.args arguments for MapQuery(), or you can run each of the functions yourself. For example:

pbmc.atac <- TransferData(
  anchorset = transfer.anchors, 
  reference = pbmc.multi,
  weight.reduction = "lsiproject",
  query = pbmc.atac,
  refdata = list(
    celltype = "predicted.id",
    predicted_RNA = "RNA")
)
pbmc.atac <- IntegrateEmbeddings(
  anchorset = transfer.anchors,
  reference = pbmc.multi,
  query = pbmc.atac, 
  reductions = "lsiproject",
  new.reduction.name = "ref.lsi"
)
pbmc.atac <- ProjectUMAP(
  query = pbmc.atac, 
  query.reduction = "ref.lsi",
  reference = pbmc.multi, 
  reference.reduction = "lsi",
  reduction.model = "umap"
)

By running MapQuery(), we have mapped the scATAC-seq dataset onto the the multimodal reference, and enabled cell type labels to be transferred from reference to query. We can visualize these reference mapping results and the cell type labels now associated with the scATAC-seq dataset:

p1 <- DimPlot(pbmc.multi, reduction = "umap", group.by = "predicted.id", label = TRUE, repel = TRUE) + NoLegend() + ggtitle("Reference")
p2 <- DimPlot(pbmc.atac, reduction = "ref.umap", group.by = "predicted.id", label = TRUE, repel = TRUE) + NoLegend() + ggtitle("Query")

p1 | p2

For more information about multimodal reference mapping, see the Seurat vignette.

RNA imputation

Above we transferred categorical information (the cell labels) and mapped the query data onto an existing reference UMAP. We can also transfer continuous data from the reference to the query in the same way. Here we demonstrate transferring the gene expression values from the PBMC multiome dataset (that measured DNA accessibility and gene expression in the same cells) to the PBMC scATAC-seq dataset (that measured DNA accessibility only). Note that we could also transfer these values using the MapQuery() function call above by setting the refdata parameter to a list of values.

# predict gene expression values
rna <- TransferData(
  anchorset = transfer.anchors,
  refdata = GetAssayData(pbmc.multi, assay = "RNA", slot = "data"),
  weight.reduction = pbmc.atac[["lsi"]],
  dims = 2:30
)

# add predicted values as a new assay
pbmc.atac[["predicted"]] <- rna

We can look at some immune marker genes and see that the predicted expression patterns match our expectation based on known expression patterns.

DefaultAssay(pbmc.atac) <- "predicted"

FeaturePlot(
  object = pbmc.atac,
  features = c('MS4A1', 'CD3D', 'LEF1', 'NKG7', 'TREM1', 'LYZ'),
  pt.size = 0.1,
  max.cutoff = 'q95',
  reduction = "ref.umap",
  ncol = 3
)

Session Info

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.5      SeuratObject_4.0.2 Seurat_4.0.4       Signac_1.4.0      
## 
## loaded via a namespace (and not attached):
##   [1] fastmatch_1.1-3        systemfonts_1.0.2      plyr_1.8.6            
##   [4] igraph_1.2.6           lazyeval_0.2.2         splines_4.1.0         
##   [7] BiocParallel_1.26.2    listenv_0.8.0          SnowballC_0.7.0       
##  [10] scattermore_0.7        GenomeInfoDb_1.28.4    digest_0.6.27         
##  [13] htmltools_0.5.2        fansi_0.5.0            magrittr_2.0.1        
##  [16] memoise_2.0.0          tensor_1.5             cluster_2.1.2         
##  [19] ROCR_1.0-11            globals_0.14.0         Biostrings_2.60.2     
##  [22] matrixStats_0.61.0     docopt_0.7.1           pkgdown_1.6.1.9001    
##  [25] spatstat.sparse_2.0-0  colorspace_2.0-2       ggrepel_0.9.1         
##  [28] textshaping_0.3.5      xfun_0.26              dplyr_1.0.7           
##  [31] sparsesvd_0.2          crayon_1.4.1           RCurl_1.98-1.5        
##  [34] jsonlite_1.7.2         spatstat.data_2.1-0    survival_3.2-11       
##  [37] zoo_1.8-9              glue_1.4.2             polyclip_1.10-0       
##  [40] gtable_0.3.0           zlibbioc_1.38.0        XVector_0.32.0        
##  [43] leiden_0.3.9           future.apply_1.8.1     BiocGenerics_0.38.0   
##  [46] abind_1.4-5            scales_1.1.1           DBI_1.1.1             
##  [49] miniUI_0.1.1.1         Rcpp_1.0.7             viridisLite_0.4.0     
##  [52] xtable_1.8-4           reticulate_1.22        spatstat.core_2.3-0   
##  [55] stats4_4.1.0           htmlwidgets_1.5.4      httr_1.4.2            
##  [58] RColorBrewer_1.1-2     ellipsis_0.3.2         ica_1.0-2             
##  [61] farver_2.1.0           pkgconfig_2.0.3        ggseqlogo_0.1         
##  [64] sass_0.4.0             uwot_0.1.10            deldir_0.2-10         
##  [67] utf8_1.2.2             labeling_0.4.2         tidyselect_1.1.1      
##  [70] rlang_0.4.11           reshape2_1.4.4         later_1.3.0           
##  [73] munsell_0.5.0          tools_4.1.0            cachem_1.0.6          
##  [76] generics_0.1.0         ggridges_0.5.3         evaluate_0.14         
##  [79] stringr_1.4.0          fastmap_1.1.0          yaml_2.2.1            
##  [82] ragg_1.1.3             goftest_1.2-2          knitr_1.34            
##  [85] fs_1.5.0               fitdistrplus_1.1-5     purrr_0.3.4           
##  [88] RANN_2.6.1             pbapply_1.5-0          future_1.22.1         
##  [91] nlme_3.1-152           mime_0.11              slam_0.1-48           
##  [94] RcppRoll_0.3.0         compiler_4.1.0         plotly_4.9.4.1        
##  [97] png_0.1-7              spatstat.utils_2.2-0   tweenr_1.0.2          
## [100] tibble_3.1.4           bslib_0.3.0            stringi_1.7.4         
## [103] highr_0.9              RSpectra_0.16-0        desc_1.3.0            
## [106] lattice_0.20-44        Matrix_1.3-4           vctrs_0.3.8           
## [109] pillar_1.6.2           lifecycle_1.0.0        spatstat.geom_2.2-2   
## [112] lmtest_0.9-38          jquerylib_0.1.4        RcppAnnoy_0.0.19      
## [115] data.table_1.14.0      cowplot_1.1.1          bitops_1.0-7          
## [118] irlba_2.3.3            httpuv_1.6.3           patchwork_1.1.1       
## [121] GenomicRanges_1.44.0   R6_2.5.1               promises_1.2.0.1      
## [124] lsa_0.73.2             KernSmooth_2.23-20     gridExtra_2.3         
## [127] IRanges_2.26.0         parallelly_1.28.1      codetools_0.2-18      
## [130] MASS_7.3-54            assertthat_0.2.1       rprojroot_2.0.2       
## [133] withr_2.4.2            qlcMatrix_0.9.7        sctransform_0.3.2     
## [136] Rsamtools_2.8.0        S4Vectors_0.30.0       GenomeInfoDbData_1.2.6
## [139] mgcv_1.8-36            parallel_4.1.0         grid_4.1.0            
## [142] rpart_4.1-15           tidyr_1.1.3            rmarkdown_2.11        
## [145] Rtsne_0.15             ggforce_0.3.3          shiny_1.6.0