PBMC scATAC-seq Vignette
Compiled: April 17, 2020
Update February 2020: we now have developed a separate package, Signac, for the analysis and integration of scATAC-seq data. See the Signac website for up-to-date vignettes and documentation for analysing scATAC-seq data.
In this vignette, we demonstrate our new data transfer method in the context of scATAC-seq to
- Classify cells measured with scATAC-seq based on clustering results from scRNA-seq
- Co-embed scATAC-seq and scRNA-seq data
Overall our integration procedure consists of the following steps:
- Estimate RNA-seq levels from ATAC-seq (quantify gene expression ‘activity’ from ATAC-seq reads)
- Learn the internal structure of the ATAC-seq data on its own (accomplished using LSI)
- Identify ‘anchors’ between the ATAC-seq and RNA-seq datasets
- Transfer data between datasets (either transfer labels for classification, or impute RNA levels in the ATAC-seq data to enable co-embedding)
Gene activity quantification
First, we load in the provided peak matrix and collapse the peak matrix to a “gene activity matrix”. Here, we make the simplifying assumption that a gene’s activity can be quantified by simply summing all counts within the gene body + 2kb upstream but our method is compatible with any method that returns a gene by cell matrix (e.g. Cicero).
library(Seurat) library(ggplot2) library(patchwork) peaks <- Read10X_h5("../data/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5") # create a gene activity matrix from the peak matrix and GTF, using chromosomes 1:22, X, and Y. # Peaks that fall within gene bodies, or 2kb upstream of a gene, are considered activity.matrix <- CreateGeneActivityMatrix(peak.matrix = peaks, annotation.file = "../data/Homo_sapiens.GRCh37.82.gtf", seq.levels = c(1:22, "X", "Y"), upstream = 2000, verbose = TRUE)
Next, we’ll set up the
Seurat object and store both the original peak counts in the “ATAC”
Assay and the gene activity matrix in the “RNA”
Assay. As a QC step, we also filter out all cells here with fewer than 5K total counts in the scATAC-seq data, though you may need to modify this threshold for your experiment.
pbmc.atac <- CreateSeuratObject(counts = peaks, assay = "ATAC", project = "10x_ATAC") pbmc.atac[["ACTIVITY"]] <- CreateAssayObject(counts = activity.matrix) meta <- read.table("../data/atac_v1_pbmc_10k_singlecell.csv", sep = ",", header = TRUE, row.names = 1, stringsAsFactors = FALSE) meta <- meta[colnames(pbmc.atac), ] pbmc.atac <- AddMetaData(pbmc.atac, metadata = meta) pbmc.atac <- subset(pbmc.atac, subset = nCount_ATAC > 5000) pbmc.atac$tech <- "atac"
Here, we process the gene activity matrix in order to find anchors between cells in the scATAC-seq dataset and the scRNA-seq dataset.
DefaultAssay(pbmc.atac) <- "ACTIVITY" pbmc.atac <- FindVariableFeatures(pbmc.atac) pbmc.atac <- NormalizeData(pbmc.atac) pbmc.atac <- ScaleData(pbmc.atac)
We also process the peak matrix. Here we perform latent semantic indexing to reduce the dimensionality of the scATAC-seq data. This procedure learns an ‘internal’ structure for the scRNA-seq data, and is important when determining the appropriate weights for the anchors when transferring information. We utilize Latent Semantic Indexing (LSI) to learn the structure of ATAC-seq data, as proposed in Cusanovich et al, Science 2015 and implemented in the
RunLSI function. LSI is implemented here by performing computing the term frequency-inverse document frequency (TF-IDF) followed by SVD.
We use all peaks that have at least 100 reads across all cells, and reduce dimensionality to 50. The parameters chosen here are inspired by previous scATAC-seq analyses, and we do not typically change them, but you can do so.
DefaultAssay(pbmc.atac) <- "ATAC" VariableFeatures(pbmc.atac) <- names(which(Matrix::rowSums(pbmc.atac) > 100)) pbmc.atac <- RunLSI(pbmc.atac, n = 50, scale.max = NULL) pbmc.atac <- RunUMAP(pbmc.atac, reduction = "lsi", dims = 1:50)
We have previously pre-processed and clustered a scRNA-seq dataset using the standard workflow in Seurat, and provide the object here. You can see that we can identify standard clusters of myeloid and lymphoid PBMCs.
pbmc.rna <- readRDS("../data/pbmc_10k_v3.rds") pbmc.rna$tech <- "rna"
p1 <- DimPlot(pbmc.atac, reduction = "umap") + NoLegend() + ggtitle("scATAC-seq") p2 <- DimPlot(pbmc.rna, group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend() + ggtitle("scRNA-seq") p1 + p2