Update February 2020: we now have developed a separate package, Signac, for the analysis and integration of scATAC-seq data. See the Signac website for up-to-date vignettes and documentation for analysing scATAC-seq data.

In this vignette, we demonstrate our new data transfer method in the context of scATAC-seq to

Classify cells measured with scATAC-seq based on clustering results from scRNA-seq
Co-embed scATAC-seq and scRNA-seq data

For this example we'll be working with the 10X PBMC datasets that contain ~10K cells for both scRNA-seq and scATAC-seq. You can download those here: scATAC-seq, scATAC-seq metadata, scRNA-seq

Overall our integration procedure consists of the following steps:

Estimate RNA-seq levels from ATAC-seq (quantify gene expression 'activity' from ATAC-seq reads)
Learn the internal structure of the ATAC-seq data on its own (accomplished using LSI)
Identify 'anchors' between the ATAC-seq and RNA-seq datasets
Transfer data between datasets (either transfer labels for classification, or impute RNA levels in the ATAC-seq data to enable co-embedding)

Gene activity quantification

First, we load in the provided peak matrix and collapse the peak matrix to a "gene activity matrix". Here, we make the simplifying assumption that a gene's activity can be quantified by simply summing all counts within the gene body + 2kb upstream but our method is compatible with any method that returns a gene by cell matrix (e.g. Cicero).

library(Seurat)
library(ggplot2)
library(patchwork)
peaks <- Read10X_h5("../data/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
# create a gene activity matrix from the peak matrix and GTF, using chromosomes 1:22, X, and Y.
# Peaks that fall within gene bodies, or 2kb upstream of a gene, are considered
activity.matrix <- CreateGeneActivityMatrix(peak.matrix = peaks, annotation.file = "../data/Homo_sapiens.GRCh37.82.gtf", 
    seq.levels = c(1:22, "X", "Y"), upstream = 2000, verbose = TRUE)

Object setup

Next, we'll set up the Seurat object and store both the original peak counts in the "ATAC" Assay and the gene activity matrix in the "RNA" Assay. As a QC step, we also filter out all cells here with fewer than 5K total counts in the scATAC-seq data, though you may need to modify this threshold for your experiment.

pbmc.atac <- CreateSeuratObject(counts = peaks, assay = "ATAC", project = "10x_ATAC")
pbmc.atac[["ACTIVITY"]] <- CreateAssayObject(counts = activity.matrix)
meta <- read.table("../data/atac_v1_pbmc_10k_singlecell.csv", sep = ",", header = TRUE, row.names = 1, 
    stringsAsFactors = FALSE)
meta <- meta[colnames(pbmc.atac), ]
pbmc.atac <- AddMetaData(pbmc.atac, metadata = meta)
pbmc.atac <- subset(pbmc.atac, subset = nCount_ATAC > 5000)
pbmc.atac$tech <- "atac"

Data preprocessing

Here, we process the gene activity matrix in order to find anchors between cells in the scATAC-seq dataset and the scRNA-seq dataset.

DefaultAssay(pbmc.atac) <- "ACTIVITY"
pbmc.atac <- FindVariableFeatures(pbmc.atac)
pbmc.atac <- NormalizeData(pbmc.atac)
pbmc.atac <- ScaleData(pbmc.atac)

We also process the peak matrix. Here we perform latent semantic indexing to reduce the dimensionality of the scATAC-seq data. This procedure learns an 'internal' structure for the scRNA-seq data, and is important when determining the appropriate weights for the anchors when transferring information. We utilize Latent Semantic Indexing (LSI) to learn the structure of ATAC-seq data, as proposed in Cusanovich et al, Science 2015 and implemented in the RunLSI function. LSI is implemented here by performing computing the term frequency-inverse document frequency (TF-IDF) followed by SVD.

We use all peaks that have at least 100 reads across all cells, and reduce dimensionality to 50. The parameters chosen here are inspired by previous scATAC-seq analyses, and we do not typically change them, but you can do so.

DefaultAssay(pbmc.atac) <- "ATAC"
VariableFeatures(pbmc.atac) <- names(which(Matrix::rowSums(pbmc.atac) > 100))
pbmc.atac <- RunLSI(pbmc.atac, n = 50, scale.max = NULL)
pbmc.atac <- RunUMAP(pbmc.atac, reduction = "lsi", dims = 1:50)

We have previously pre-processed and clustered a scRNA-seq dataset using the standard workflow in Seurat, and provide the object here. You can see that we can identify standard clusters of myeloid and lymphoid PBMCs.

pbmc.rna <- readRDS("../data/pbmc_10k_v3.rds")
pbmc.rna$tech <- "rna"

p1 <- DimPlot(pbmc.atac, reduction = "umap") + NoLegend() + ggtitle("scATAC-seq")
p2 <- DimPlot(pbmc.rna, group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend() + ggtitle("scRNA-seq")
p1 + p2

Now, we can identify anchors between the scATAC-seq dataset and the scRNA-seq dataset and use these anchors to transfer the celltype labels we learned from the 10K scRNA-seq data to the scATAC-seq cells.

transfer.anchors <- FindTransferAnchors(reference = pbmc.rna, query = pbmc.atac, features = VariableFeatures(object = pbmc.rna), 
    reference.assay = "RNA", query.assay = "ACTIVITY", reduction = "cca")

To transfer the cluster ids, we provide a vector of previously annotated cell type labels for the RNA to the refdata parameter. The output will contain a matrix with predictions and confidence scores for each ATAC-seq cell.

celltype.predictions <- TransferData(anchorset = transfer.anchors, refdata = pbmc.rna$celltype, 
    weight.reduction = pbmc.atac[["lsi"]])
pbmc.atac <- AddMetaData(pbmc.atac, metadata = celltype.predictions)

Why do you choose different (non-default) values for reduction and weight.reduction?

In FindTransferAnchors, we typically project the PCA structure from the reference onto the query when transferring between scRNA-seq datasets. However, when transferring across modalities we find that CCA better captures the shared feature correlation structure and therefore set reduction = 'cca' here. Additionally, by default in TransferData we use the same projected PCA structure to compute the weights of the local neighborhood of anchors that influence each cell's prediction. In the case of scRNA-seq to scATA-seq transfer, we use the low dimensional space learned by computing an LSI on the ATAC-seq data to compute these weights as this better captures the "internal" structure of the ATAC-seq data.

We can then examine the distribution of prediction scores and optionally filter out those cells with low scores. Here, we find that over 95% of the cells receive a score of 0.5 or greater.

hist(pbmc.atac$prediction.score.max)
abline(v = 0.5, col = "red")

table(pbmc.atac$prediction.score.max > 0.5)

## 
## FALSE  TRUE 
##   290  7576

We can then view the predicted cell types on a UMAP representation of the scATAC-seq data and find that the transferred labels are highly consistent with the UMAP structure.

pbmc.atac.filtered <- subset(pbmc.atac, subset = prediction.score.max > 0.5)
pbmc.atac.filtered$predicted.id <- factor(pbmc.atac.filtered$predicted.id, levels = levels(pbmc.rna))  # to make the colors match
p1 <- DimPlot(pbmc.atac.filtered, group.by = "predicted.id", label = TRUE, repel = TRUE) + ggtitle("scATAC-seq cells") + 
    NoLegend() + scale_colour_hue(drop = FALSE)
p2 <- DimPlot(pbmc.rna, group.by = "celltype", label = TRUE, repel = TRUE) + ggtitle("scRNA-seq cells") + 
    NoLegend()
p1 + p2

After transferring cell type labels, you can then perform downstream analyses on a cell type specific level. For example, you could find sets of enhancers that are specific for certain cell types and look for motif enrichment. While not all of these types of downstream analyses are directly supported in Seurat, stay tuned for updates in this space.

Co-embedding

Finally, to visualize all the cells together, we can co-embed the scRNA-seq and scATAC-seq cells in the same low dimensional space. Here, we use the same anchors used earlier to transfer cell type labels to impute RNA-seq values for the scATAC-seq cells. We then merge the measured and imputed scRNA-seq data and run a standard UMAP analysis to visualize all the cells together. Note that this step is for visualization purposes only and is not a necessary part of the data transfer analysis.

# note that we restrict the imputation to variable genes from scRNA-seq, but could impute the
# full transcriptome if we wanted to
genes.use <- VariableFeatures(pbmc.rna)
refdata <- GetAssayData(pbmc.rna, assay = "RNA", slot = "data")[genes.use, ]

# refdata (input) contains a scRNA-seq expression matrix for the scRNA-seq cells.  imputation
# (output) will contain an imputed scRNA-seq matrix for each of the ATAC cells
imputation <- TransferData(anchorset = transfer.anchors, refdata = refdata, weight.reduction = pbmc.atac[["lsi"]])

# this line adds the imputed data matrix to the pbmc.atac object
pbmc.atac[["RNA"]] <- imputation
coembed <- merge(x = pbmc.rna, y = pbmc.atac)

# Finally, we run PCA and UMAP on this combined object, to visualize the co-embedding of both
# datasets
coembed <- ScaleData(coembed, features = genes.use, do.scale = FALSE)
coembed <- RunPCA(coembed, features = genes.use, verbose = FALSE)
coembed <- RunUMAP(coembed, dims = 1:30)
coembed$celltype <- ifelse(!is.na(coembed$celltype), coembed$celltype, coembed$predicted.id)

Here we plot all cells colored by either their assigned cell type (from the 10K dataset) or their predicted cell type from the data transfer procedure.

p1 <- DimPlot(coembed, group.by = "tech")
p2 <- DimPlot(coembed, group.by = "celltype", label = TRUE, repel = TRUE)
p1 + p2

How can I further investigate unmatched groups of cells that only appear in one assay?

Upon inspection of the UMAP embeddings, there appeared several groups of cells that appeared to be present only in a single assay. First, the platelet cells appeared only in the scRNA-seq data. These cells are thought to be undergoing biogenesis from megakaryocytes to platelets and therefore either completely lack nuclear material or their chromatin state is uncoupled from their transcriptome. As a result, we would not expect these to align in this analysis.

DimPlot(coembed, split.by = "tech", group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend()

Additionally, there appeared to be a population next to the B cell progenitors that is composed entirely of scATAC-seq cells and is not integrated well with the scRNA-seq cells. Further examination of the metadata for these cells revealed a high number of reads mapping to blacklisted regions (as provided by the 10x Genomics QC metrics). This suggests that these barcodes could represent dead or dying cells, ambient DNA, or another technical artifact not represented in the scRNA-seq dataset.

coembed$blacklist_region_fragments[is.na(coembed$blacklist_region_fragments)] <- 0
FeaturePlot(coembed, features = "blacklist_region_fragments", max.cutoff = 500)

How can I evaluate the success of cross-modality integration?

There are several lines of evidence here that give us confidence in this analysis.

Overall, the prediction scores are high which suggest a high degree of confidence in our cell type assignments.
We observe good agreement between the scATAC-seq only dimensional reduction (UMAP plot above) and the transferred labels.
The co-embedding based on the same set of anchors gives good mixing between the two modalities.
When we collapse the ATAC-seq reads on a per-cluster basis into "pseudo bulk" profiles, we observe good concordance with the chromatin patterns observed in bulk data (Supplementary Figure S3)

Additionally, we have performed a similar analysis, transferring cell type labels from scRNA-seq to scATAC-seq data in the context of the mouse brain and have observed good performance based on similar reasoning to the points above ( Figure 3). These analyses make the assumption that there is generally a positive correlation between chromatin accessibility and gene expression. In cases such as developing systems, where accessibility may not be a good indicator of transcription, the method demonstrated here may not be able to form good anchors and the resulting integration across these specific modalities may be of limited value.

PBMC scATAC-seq Vignette

Compiled: October 02, 2020

Gene activity quantification

Object setup

Data preprocessing

Co-embedding