vignettes/atacseq_integration_vignette.Rmd
Single-cell transcriptomics has transformed our ability to characterize cell states, but deep biological understanding requires more than a taxonomic listing of clusters. As new methods arise to measure distinct cellular modalities, a key analytical challenge is to integrate these datasets to better understand cellular identity and function. For example, users may perform scRNA-seq and scATAC-seq experiments on the same biological system and wish to annotate both datasets consistently with the same set of cell type labels. This analysis is particularly challenging because scATAC-seq datasets are difficult to annotate, due to both the sparsity of genomic data collected at single-cell resolution and the lack of the interpretable gene markers that scRNA-seq data provide.
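In Stuart*, Butler* et al., 2019, we introduce methods to integrate scRNA-seq and scATAC-seq datasets collected from the same biological system, and demonstrate these methods in this vignette. In particular, we demonstrate how to use an annotated scRNA-seq dataset to label cells from an scATAC-seq experiment, and how to co-embed cells from both modalities for visualization.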
This vignette makes extensive use of the Signac package, recently developed for the analysis of chromatin datasets collected at single-cell resolution, including scATAC-seq. Please see the Signac website for additional vignettes and documentation for analyzing scATAC-seq data.
We demonstrate these methods using a publicly available ‘multiome’ dataset of ~12,000 human PBMCs from 10x Genomics. In this dataset, scRNA-seq and scATAC-seq profiles were collected simultaneously in the same cells. For the purposes of this vignette, we treat the datasets as originating from two different experiments and integrate them. Since they were originally measured in the same cells, this provides a ground truth that we can use to assess the accuracy of integration. We emphasize that our use of the multiome dataset here is for demonstration and evaluation purposes, and that users should apply these methods to scRNA-seq and scATAC-seq datasets that are collected separately. We provide a separate weighted nearest neighbors (WNN) vignette that describes analysis strategies for multi-omic single-cell data.
The PBMC multiome dataset is available from 10x Genomics. To facilitate easy loading and exploration, it is also available as part of our SeuratData package. We load the RNA and ATAC data separately, and pretend that these profiles were measured in separate experiments. We annotated these cells in our WNN vignette, and the annotations are also included in SeuratData.
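# packages used throughout this vignette
library(Seurat)
library(SeuratData)
library(Signac)
library(EnsDb.Hsapiens.v86)
library(ggplot2)
library(cowplot)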
# install the dataset and load requirements
InstallData("pbmcMultiome")
# load both modalities
pbmc.rna <- LoadData("pbmcMultiome", "pbmc.rna")
pbmc.atac <- LoadData("pbmcMultiome", "pbmc.atac")
# repeat QC steps performed in the WNN vignette
pbmc.rna <- subset(pbmc.rna, seurat_annotations != "filtered")
pbmc.atac <- subset(pbmc.atac, seurat_annotations != "filtered")
# Perform standard analysis of each modality independently
# RNA analysis
pbmc.rna <- NormalizeData(pbmc.rna)
pbmc.rna <- FindVariableFeatures(pbmc.rna)
pbmc.rna <- ScaleData(pbmc.rna)
pbmc.rna <- RunPCA(pbmc.rna)
pbmc.rna <- RunUMAP(pbmc.rna, dims = 1:30)
# ATAC analysis: add gene annotation information
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotations) <- "UCSC"
genome(annotations) <- "hg38"
Annotation(pbmc.atac) <- annotations
# We exclude the first dimension as this is typically correlated with sequencing depth
pbmc.atac <- RunTFIDF(pbmc.atac)
pbmc.atac <- FindTopFeatures(pbmc.atac, min.cutoff = "q0")
pbmc.atac <- RunSVD(pbmc.atac)
pbmc.atac <- RunUMAP(pbmc.atac, reduction = "lsi", dims = 2:30, reduction.name = "umap.atac", reduction.key = "atacUMAP_")
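The TF-IDF and SVD steps above are often described together as latent semantic indexing (LSI). The following is a minimal base-R sketch of that idea on simulated counts, not Signac's implementation: the toy matrix, the `1e4` scale factor, and all variable names are assumptions made for illustration. It also shows how one might check whether a component tracks sequencing depth, which is the motivation for passing `dims = 2:30` above.

```r
set.seed(42)
n.peaks <- 200
n.cells <- 30

# Hypothetical per-cell depth factor so that cells differ in total counts
depth <- runif(n.cells, min = 0.2, max = 2)
counts <- matrix(rpois(n.peaks * n.cells, lambda = rep(depth, each = n.peaks)),
                 nrow = n.peaks, ncol = n.cells)
counts <- counts[rowSums(counts) > 0, , drop = FALSE]  # drop empty peaks

# Term frequency: each cell's counts divided by that cell's total
tf <- sweep(counts, 2, colSums(counts), "/")
# Inverse document frequency: number of cells over each peak's total count
idf <- ncol(counts) / rowSums(counts)
# Log-scaled TF-IDF (assumed scale factor of 1e4)
tfidf <- log1p(tf * idf * 1e4)

# SVD of the TF-IDF matrix; cell embeddings are the right singular vectors
# scaled by their singular values
lsi <- svd(tfidf, nu = 0, nv = 10)
embedding <- lsi$v %*% diag(lsi$d[1:10])

# Correlation of each component with total counts per cell
depth.cor <- cor(embedding, colSums(counts))
round(depth.cor[1:3, 1], 2)
```

On real data, a component that is strongly correlated with depth carries mostly technical signal, which is why it is commonly left out of downstream steps.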
Now we plot the results from both modalities. Cells have been previously annotated based on transcriptomic state. We will predict annotations for the scATAC-seq cells.
In order to identify ‘anchors’ between scRNA-seq and scATAC-seq experiments, we first generate a rough estimate of the transcriptional activity of each gene by quantifying ATAC-seq counts in the 2 kb-upstream region and gene body, using the GeneActivity() function in the Signac package. The ensuing gene activity scores from the scATAC-seq data are then used as input for canonical correlation analysis, along with the gene expression quantifications from scRNA-seq. We perform this quantification for all genes identified as being highly variable from the scRNA-seq dataset.
# quantify gene activity
gene.activities <- GeneActivity(pbmc.atac, features = VariableFeatures(pbmc.rna))
# add gene activities as a new assay
pbmc.atac[["ACTIVITY"]] <- CreateAssayObject(counts = gene.activities)
# normalize gene activities
DefaultAssay(pbmc.atac) <- "ACTIVITY"
pbmc.atac <- NormalizeData(pbmc.atac)
pbmc.atac <- ScaleData(pbmc.atac, features = rownames(pbmc.atac))
# Identify anchors
transfer.anchors <- FindTransferAnchors(reference = pbmc.rna, query = pbmc.atac, features = VariableFeatures(object = pbmc.rna),
    reference.assay = "RNA", query.assay = "ACTIVITY", reduction = "cca")
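To make the gene activity quantification used above more concrete, here is a toy base-R sketch of the counting logic: extend each gene 2 kb upstream (taking strand into account) and count the fragments from each cell that overlap the extended window. The gene coordinates, fragments, and variable names are all hypothetical, and this is only an illustration of the idea rather than what `GeneActivity()` does internally.

```r
# Hypothetical genes on a single chromosome
genes <- data.frame(
  gene   = c("geneA", "geneB"),
  start  = c(5000, 20000),
  end    = c(8000, 26000),
  strand = c("+", "-"),
  stringsAsFactors = FALSE
)
upstream <- 2000

# Upstream is before the start for "+" genes and after the end for "-" genes
genes$win.start <- ifelse(genes$strand == "+", genes$start - upstream, genes$start)
genes$win.end   <- ifelse(genes$strand == "+", genes$end, genes$end + upstream)

# Hypothetical ATAC fragments, each assigned to a cell barcode
frags <- data.frame(
  cell  = c("c1", "c1", "c2", "c2", "c2"),
  start = c(3500, 9000, 27000, 21000, 100),
  end   = c(3700, 9200, 27200, 21100, 300),
  stringsAsFactors = FALSE
)

# Count fragments per cell that overlap each gene window
cells <- unique(frags$cell)
activity <- sapply(cells, function(cl) {
  f <- frags[frags$cell == cl, ]
  sapply(seq_len(nrow(genes)), function(i) {
    sum(f$start <= genes$win.end[i] & f$end >= genes$win.start[i])
  })
})
dimnames(activity) <- list(genes$gene, cells)
activity
```

The result is a gene-by-cell count matrix, which is the same shape of object that is stored in the ACTIVITY assay above.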
After identifying anchors, we can transfer annotations from the scRNA-seq dataset onto the scATAC-seq cells. The annotations are stored in the seurat_annotations field, and are provided as input to the refdata parameter. The output will contain a matrix with predictions and confidence scores for each ATAC-seq cell.
celltype.predictions <- TransferData(anchorset = transfer.anchors, refdata = pbmc.rna$seurat_annotations,
    weight.reduction = pbmc.atac[["lsi"]], dims = 2:30)
pbmc.atac <- AddMetaData(pbmc.atac, metadata = celltype.predictions)
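Conceptually, the predictions and confidence scores can be thought of as a weighted vote over anchors. The sketch below is a simplified base-R illustration under that assumption and is not Seurat's implementation: the anchor labels, the weight matrix, and the variable names are invented, and each row of weights is assumed to sum to 1.

```r
# Labels of four hypothetical reference anchor cells
ref.labels <- c("B", "B", "T", "T")
classes <- unique(ref.labels)

# One-hot encoding of the anchor labels (anchors x classes)
onehot <- sapply(classes, function(cl) as.numeric(ref.labels == cl))

# Hypothetical anchor weights for two query (ATAC) cells; rows sum to 1
weights <- rbind(
  q1 = c(0.6, 0.3, 0.1, 0.0),
  q2 = c(0.1, 0.1, 0.4, 0.4)
)

# Per-class score for each query cell is the weighted vote of the anchors
scores <- weights %*% onehot

# Predicted label is the best-scoring class; its score is the confidence
predicted.id <- classes[apply(scores, 1, which.max)]
prediction.score.max <- apply(scores, 1, max)

data.frame(predicted.id, prediction.score.max)
```

In this toy case `q1` is labelled "B" with a score of 0.9 and `q2` is labelled "T" with a score of 0.8. A cell whose weight is split evenly between classes would receive a low maximum score, which is how uncertain predictions show up.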
In FindTransferAnchors(), we typically project the PCA structure from the reference onto the query when transferring between scRNA-seq datasets. However, when transferring across modalities we find that CCA better captures the shared feature correlation structure and therefore set reduction = 'cca' here. Additionally, by default in TransferData() we use the same projected PCA structure to compute the weights of the local neighborhood of anchors that influence each cell’s prediction. In the case of scRNA-seq to scATAC-seq transfer, we use the low dimensional space learned by computing an LSI on the ATAC-seq data to compute these weights as this better captures the internal structure of the ATAC-seq data.
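For intuition about why CCA suits the cross-modality case, the sketch below shows one simplified way to obtain a shared low-dimensional space from two matrices measured over the same genes: standardize each cell, form the cell-by-cell cross-product across datasets, and take its SVD. This is a base-R toy on random data, offered loosely in the spirit of the approach and not as Seurat's implementation; the matrix sizes and names are assumptions.

```r
set.seed(1)
n.genes <- 50

# Hypothetical genes x cells matrices sharing the same 50 genes
rna <- matrix(rnorm(n.genes * 20), nrow = n.genes)  # 20 "RNA" cells
act <- matrix(rnorm(n.genes * 15), nrow = n.genes)  # 15 "ATAC" cells

# Standardize each cell (column) across genes
rna.std <- scale(rna)
act.std <- scale(act)

# Cross-product compares every RNA cell with every ATAC cell over shared genes
cross <- crossprod(rna.std, act.std)  # 20 x 15

# The leading singular vectors give paired embeddings for the two datasets
n.cc <- 5
cc <- svd(cross, nu = n.cc, nv = n.cc)
rna.embedding <- cc$u  # 20 cells x 5 components
act.embedding <- cc$v  # 15 cells x 5 components

dim(rna.embedding)
dim(act.embedding)
```

Because the decomposition is driven by covariation shared between the two datasets, variation that exists only in one modality contributes less than it would in a PCA fitted to the reference alone.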
After performing transfer, the ATAC-seq cells have predicted annotations (transferred from the scRNA-seq dataset) stored in the predicted.id field. Since these cells were measured with the multiome kit, we also have a ground-truth annotation that can be used for evaluation. You can see that the predicted and actual annotations are extremely similar.
pbmc.atac$annotation_correct <- pbmc.atac$predicted.id == pbmc.atac$seurat_annotations
p1 <- DimPlot(pbmc.atac, group.by = "predicted.id", label = TRUE) + NoLegend() + ggtitle("Predicted annotation")
p2 <- DimPlot(pbmc.atac, group.by = "seurat_annotations", label = TRUE) + NoLegend() + ggtitle("Ground-truth annotation")
p1 | p2
In this example, the annotation for an scATAC-seq profile is correctly predicted via scRNA-seq integration ~90% of the time. In addition, the prediction.score.max field quantifies the uncertainty associated with our predicted annotations. We can see that cells that are correctly annotated are typically associated with high prediction scores (>90%), while cells that are incorrectly annotated are associated with sharply lower prediction scores (<50%). Incorrect assignments also tend to reflect closely related cell types (e.g. Intermediate vs. Naive B cells).
predictions <- table(pbmc.atac$seurat_annotations, pbmc.atac$predicted.id)
predictions <- predictions/rowSums(predictions) # normalize for number of cells in each cell type
predictions <- as.data.frame(predictions)
p1 <- ggplot(predictions, aes(Var1, Var2, fill = Freq)) + geom_tile() + scale_fill_gradient(name = "Fraction of cells",
    low = "#ffffc8", high = "#7d0025") + xlab("Cell type annotation (RNA)") + ylab("Predicted cell type label (ATAC)") +
    theme_cowplot() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
correct <- length(which(pbmc.atac$seurat_annotations == pbmc.atac$predicted.id))
incorrect <- length(which(pbmc.atac$seurat_annotations != pbmc.atac$predicted.id))
data <- FetchData(pbmc.atac, vars = c("prediction.score.max", "annotation_correct"))
p2 <- ggplot(data, aes(prediction.score.max, fill = annotation_correct, colour = annotation_correct)) +
    geom_density(alpha = 0.5) + theme_cowplot() + scale_fill_discrete(name = "Annotation Correct",
    labels = c(paste0("FALSE (n = ", incorrect, ")"), paste0("TRUE (n = ", correct, ")"))) + scale_color_discrete(name = "Annotation Correct",
    labels = c(paste0("FALSE (n = ", incorrect, ")"), paste0("TRUE (n = ", correct, ")"))) + xlab("Prediction Score")
p1 + p2
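The evaluation above boils down to three quantities: overall accuracy, a row-normalized confusion table, and the prediction score split by whether the label was correct. A small base-R sketch on invented labels and scores (all values below are hypothetical) shows the arithmetic without any plotting:

```r
# Hypothetical ground-truth labels, predicted labels, and prediction scores
truth <- c("B", "B", "T", "T", "NK")
pred  <- c("B", "T", "T", "T", "NK")
score <- c(0.95, 0.45, 0.90, 0.92, 0.97)

# Fraction of cells whose predicted label matches the ground truth
accuracy <- mean(truth == pred)

# Confusion table normalized so that each ground-truth row sums to 1
confusion <- table(truth, pred)
confusion.norm <- prop.table(confusion, margin = 1)

# Mean prediction score for incorrect (FALSE) and correct (TRUE) cells
score.by.correct <- tapply(score, truth == pred, mean)

accuracy
confusion.norm
score.by.correct
```

Normalizing by row, as the vignette code does with `rowSums()`, keeps rare cell types from being visually swamped by abundant ones.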
In addition to transferring labels across modalities, it is also possible to visualize scRNA-seq and scATAC-seq cells on the same plot. We emphasize that this step is optional and primarily for visualization. Typically, when we perform integrative analysis between scRNA-seq and scATAC-seq datasets, we focus on label transfer as described above. We demonstrate our workflow for co-embedding below, and again highlight that this is for demonstration purposes, especially as in this particular case both the scRNA-seq profiles and scATAC-seq profiles were actually measured in the same cells.
In order to perform co-embedding, we first ‘impute’ RNA expression into the scATAC-seq cells based on the previously computed anchors, and then merge the datasets.
# note that we restrict the imputation to variable genes from scRNA-seq, but could impute the
# full transcriptome if we wanted to
genes.use <- VariableFeatures(pbmc.rna)
refdata <- GetAssayData(pbmc.rna, assay = "RNA", slot = "data")[genes.use, ]
# refdata (input) contains a scRNA-seq expression matrix for the scRNA-seq cells. imputation
# (output) will contain an imputed scRNA-seq matrix for each of the ATAC cells
imputation <- TransferData(anchorset = transfer.anchors, refdata = refdata, weight.reduction = pbmc.atac[["lsi"]],
    dims = 2:30)
pbmc.atac[["RNA"]] <- imputation
coembed <- merge(x = pbmc.rna, y = pbmc.atac)
# Finally, we run PCA and UMAP on this combined object, to visualize the co-embedding of both
# datasets
coembed <- ScaleData(coembed, features = genes.use, do.scale = FALSE)
coembed <- RunPCA(coembed, features = genes.use, verbose = FALSE)
coembed <- RunUMAP(coembed, dims = 1:30)
DimPlot(coembed, group.by = c("orig.ident", "seurat_annotations"))
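The imputation and co-embedding steps can be pictured with a small base-R sketch, assuming (as a simplification, not a description of Seurat's internals) that each query cell's imputed expression is a weighted average of reference cells. The expression values, weights, and names below are all made up for illustration.

```r
# Hypothetical reference expression: 3 genes x 4 reference (RNA) cells
refdata <- matrix(c(1, 0, 2,
                    3, 0, 0,
                    0, 4, 1,
                    0, 2, 5),
                  nrow = 3,
                  dimnames = list(c("g1", "g2", "g3"), paste0("r", 1:4)))

# Hypothetical weights linking 2 query (ATAC) cells to the reference cells;
# each row sums to 1
weights <- rbind(
  q1 = c(0.5, 0.5, 0.00, 0.00),
  q2 = c(0.0, 0.0, 0.25, 0.75)
)

# Imputed expression for each query cell: weighted average of reference cells
imputed <- refdata %*% t(weights)  # 3 genes x 2 query cells

# "Merge" measured and imputed profiles, then run PCA on the combined cells.
# Genes are centered but not scaled, mirroring do.scale = FALSE above.
combined <- cbind(refdata, imputed)
pca <- prcomp(t(combined), center = TRUE, scale. = FALSE)

imputed
round(pca$x[, 1:2], 2)
```

Because the imputed profiles are built from the reference cells themselves, query cells land close to the reference cells they were weighted toward, which is one reason the co-embedding is best treated as a visualization rather than independent evidence.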
Session Info
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] cowplot_1.1.1 ggplot2_3.3.5
## [3] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.16.4
## [5] AnnotationFilter_1.16.0 GenomicFeatures_1.44.0
## [7] AnnotationDbi_1.54.1 Biobase_2.52.0
## [9] GenomicRanges_1.44.0 GenomeInfoDb_1.28.1
## [11] IRanges_2.26.0 S4Vectors_0.30.0
## [13] BiocGenerics_0.38.0 Signac_1.3.0
## [15] SeuratObject_4.0.2 Seurat_4.0.4
## [17] thp1.eccite.SeuratData_3.1.5 stxBrain.SeuratData_0.1.1
## [19] ssHippo.SeuratData_3.1.4 pbmcsca.SeuratData_3.0.0
## [21] pbmcMultiome.SeuratData_0.1.1 pbmc3k.SeuratData_3.1.4
## [23] panc8.SeuratData_3.0.2 ifnb.SeuratData_3.1.0
## [25] hcabm40k.SeuratData_3.0.0 bmcite.SeuratData_0.3.0
## [27] SeuratData_0.2.1
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 SnowballC_0.7.0
## [3] rtracklayer_1.52.0 scattermore_0.7
## [5] ragg_1.1.3 tidyr_1.1.3
## [7] bit64_4.0.5 knitr_1.33
## [9] irlba_2.3.3 DelayedArray_0.18.0
## [11] data.table_1.14.0 rpart_4.1-15
## [13] KEGGREST_1.32.0 RCurl_1.98-1.3
## [15] generics_0.1.0 RSQLite_2.2.7
## [17] RANN_2.6.1 future_1.21.0
## [19] bit_4.0.4 spatstat.data_2.1-0
## [21] xml2_1.3.2 httpuv_1.6.1
## [23] SummarizedExperiment_1.22.0 assertthat_0.2.1
## [25] xfun_0.25 hms_1.1.0
## [27] jquerylib_0.1.4 evaluate_0.14
## [29] promises_1.2.0.1 fansi_0.5.0
## [31] restfulr_0.0.13 progress_1.2.2
## [33] dbplyr_2.1.1 igraph_1.2.6
## [35] DBI_1.1.1 htmlwidgets_1.5.3
## [37] sparsesvd_0.2 spatstat.geom_2.1-0
## [39] purrr_0.3.4 ellipsis_0.3.2
## [41] RSpectra_0.16-0 dplyr_1.0.7
## [43] backports_1.2.1 biomaRt_2.48.2
## [45] deldir_0.2-10 MatrixGenerics_1.4.2
## [47] vctrs_0.3.8 ROCR_1.0-11
## [49] abind_1.4-5 cachem_1.0.5
## [51] withr_2.4.2 ggforce_0.3.3
## [53] BSgenome_1.60.0 checkmate_2.0.0
## [55] sctransform_0.3.2 GenomicAlignments_1.28.0
## [57] prettyunits_1.1.1 goftest_1.2-2
## [59] cluster_2.1.2 lazyeval_0.2.2
## [61] crayon_1.4.1 labeling_0.4.2
## [63] pkgconfig_2.0.3 slam_0.1-48
## [65] tweenr_1.0.2 nlme_3.1-152
## [67] ProtGenerics_1.24.0 nnet_7.3-16
## [69] rlang_0.4.11 globals_0.14.0
## [71] lifecycle_1.0.0 miniUI_0.1.1.1
## [73] filelock_1.0.2 BiocFileCache_2.0.0
## [75] dichromat_2.0-0 rprojroot_2.0.2
## [77] polyclip_1.10-0 matrixStats_0.60.0
## [79] lmtest_0.9-38 Matrix_1.3-3
## [81] ggseqlogo_0.1 zoo_1.8-9
## [83] base64enc_0.1-3 ggridges_0.5.3
## [85] png_0.1-7 viridisLite_0.4.0
## [87] rjson_0.2.20 bitops_1.0-7
## [89] KernSmooth_2.23-20 Biostrings_2.60.2
## [91] blob_1.2.2 stringr_1.4.0
## [93] parallelly_1.26.0 jpeg_0.1-9
## [95] scales_1.1.1 memoise_2.0.0
## [97] magrittr_2.0.1 plyr_1.8.6
## [99] ica_1.0-2 zlibbioc_1.38.0
## [101] compiler_4.1.0 BiocIO_1.2.0
## [103] RColorBrewer_1.1-2 fitdistrplus_1.1-5
## [105] Rsamtools_2.8.0 cli_3.0.1
## [107] XVector_0.32.0 listenv_0.8.0
## [109] patchwork_1.1.1 pbapply_1.4-3
## [111] htmlTable_2.2.1 formatR_1.11
## [113] Formula_1.2-4 MASS_7.3-54
## [115] mgcv_1.8-35 tidyselect_1.1.1
## [117] stringi_1.7.3 textshaping_0.3.5
## [119] highr_0.9 yaml_2.2.1
## [121] latticeExtra_0.6-29 ggrepel_0.9.1
## [123] grid_4.1.0 VariantAnnotation_1.38.0
## [125] sass_0.4.0 fastmatch_1.1-3
## [127] tools_4.1.0 future.apply_1.7.0
## [129] rstudioapi_0.13 foreign_0.8-81
## [131] lsa_0.73.2 gridExtra_2.3
## [133] farver_2.1.0 Rtsne_0.15
## [135] digest_0.6.27 shiny_1.6.0
## [137] qlcMatrix_0.9.7 Rcpp_1.0.7
## [139] later_1.2.0 RcppAnnoy_0.0.18
## [141] httr_1.4.2 biovizBase_1.40.0
## [143] colorspace_2.0-2 XML_3.99-0.6
## [145] fs_1.5.0 tensor_1.5
## [147] reticulate_1.20 splines_4.1.0
## [149] uwot_0.1.10 RcppRoll_0.3.0
## [151] spatstat.utils_2.1-0 pkgdown_1.6.1
## [153] plotly_4.9.4 systemfonts_1.0.2
## [155] xtable_1.8-4 jsonlite_1.7.2
## [157] R6_2.5.0 Hmisc_4.5-0
## [159] pillar_1.6.2 htmltools_0.5.1.1
## [161] mime_0.11 glue_1.4.2
## [163] fastmap_1.1.0 BiocParallel_1.26.1
## [165] codetools_0.2-18 utf8_1.2.2
## [167] lattice_0.20-44 bslib_0.2.5.1
## [169] spatstat.sparse_2.0-0 tibble_3.1.3
## [171] curl_4.3.2 leiden_0.3.8
## [173] survival_3.2-11 rmarkdown_2.10
## [175] docopt_0.7.1 desc_1.3.0
## [177] munsell_0.5.0 GenomeInfoDbData_1.2.6
## [179] reshape2_1.4.4 gtable_0.3.0
## [181] spatstat.core_2.1-2