For this tutorial, we will be analyzing a single-cell ATAC-seq dataset of human peripheral blood mononuclear cells (PBMCs) provided by 10x Genomics. The following files are used in this vignette, all available through the 10x Genomics website:
First load in Signac, Seurat, and some other packages we will be using for analyzing human data.
The vignette below presents a detailed workflow for the pre-processing, QC, clustering, visualization, integration, and interpretation of single-cell ATAC-seq data. While there are many steps (some of which are computationally intensive), we note that the core analytical components are similar to single-cell RNA-seq, and can be run quickly in a few lines. For example, the following lines encapsulate our scATAC-seq workflow (but does not include the calculation of advanced QC metrics, or integration with scRNA-seq data).
library(dplyr) counts <- Read10X_h5(filename = "/home/stuartt/github/chrom/vignette_data/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5") metadata <- read.csv( file = "/home/stuartt/github/chrom/vignette_data/atac_v1_pbmc_10k_singlecell.csv", header = TRUE, row.names = 1 ) pbmc_fast <- CreateSeuratObject(counts = counts,meta.data = metadata) %>% subset(peak_region_fragments > 1000) %>% RunTFIDF() %>% FindTopFeatures(cutoff = 'q75') %>% RunSVD(reduction.name='lsi') %>% FindNeighbors(reduction='lsi', dims=1:30) %>% FindClusters(resolution = 0.5,verbose = FALSE) %>% RunUMAP(reduction='lsi', dims=1:30) DimPlot(pbmc_fast, label = FALSE)
Below, we describe these steps in detail, beginning with pre-processing.
When pre-processing chromatin data, Signac uses information from two related input files, both of which are created by CellRanger:
Peak/Cell matrix. This is analogous to the gene expression count matrix used to analyze single-cell RNA-seq. However, instead of genes, each row of the matrix represents a region of the genome (a ‘peak’), that is predicted to represent a region of open chromatin. Each value in the matrix represents the number of Tn5 cut sites for each single barcode (i.e. cell) that map within each peak. You can find more detail on the 10X Website.
Fragment file. This represents a full list of all unique fragments across all single cells. It is a substantially larger file, is slower to work with, and is stored on-disk (instead of in memory). However, the advantage of retaining this file is that it contains all fragments associated with each single cell, as opposed to only reads that map to peaks.
For most analyses we work with the peak/cell matrix, but for some (e.g. creating a ‘Gene Activity Matrix’), we find that restricting only to reads in peaks can adversely affect results. We therefore use both files in this vignette. We start by creating a Seurat object using the peak/cell matrix, and store the path to the fragment file on disk in the Seurat object:
counts <- Read10X_h5(filename = "/home/stuartt/github/chrom/vignette_data/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5") metadata <- read.csv( file = "/home/stuartt/github/chrom/vignette_data/atac_v1_pbmc_10k_singlecell.csv", header = TRUE, row.names = 1 ) pbmc <- CreateSeuratObject( counts = counts, assay = 'peaks', project = 'ATAC', min.cells = 1, meta.data = metadata )
## Warning in CreateSeuratObject(counts = counts, assay = "peaks", project = ## "ATAC", : Some cells in meta.data not present in provided counts matrix.
fragment.path <- '/home/stuartt/github/chrom/vignette_data/atac_v1_pbmc_10k_fragments.tsv.gz' pbmc <- SetFragments( object = pbmc, file = fragment.path )
## An object of class Seurat ## 89307 features across 8728 samples within 1 assay ## Active assay: peaks (89307 features)
Optional step: Filtering the fragment file
To make downstream steps that use this file faster, we can filter the fragments file to contain only reads from cells that we retain in the analysis. This is optional, but slow, and only needs to be performed once. Running this command writes a new file to disk and indexes the file so it is ready to be used by Signac.
fragment_file_filtered <- "/home/stuartt/github/chrom/vignette_data/atac_v1_pbmc_10k_filtered_fragments.tsv.bgz"
FilterFragments( fragment.path = fragment.path, cells = colnames(pbmc), output.path = fragment_file_filtered )
pbmc <- SetFragments(object = pbmc, file = fragment_file_filtered)
We can now compute some QC metrics for the scATAC-seq experiment. We currently suggest the following metrics below to assess data quality. As with scRNA-seq, the expected range of values for these parameters will vary depending on your biological system, cell viability, etc.
Nucleosome banding pattern: The histogram of fragment sizes (determined from the paired-end sequencing reads) should exhibit a strong nucleosome banding pattern. We calculate this per single cell, and quantify the approximate ratio of mononucleosomal to nucleosome-free fragments (stored as
nucleosome_signal). Note that by default, this is calculated only on chr1 reads (see the
region parameter) to save time.
Total number of fragments in peaks: A measure of cellular sequencing depth / complexity. Cells with very few reads may need to be excluded due to low sequencing depth. Cells with extremely high levels may represent doublets or nuclear clumps.
Fraction of fragments in peaks. Represents the fraction of total fragments that fall within ATAC-seq peaks. Cells with low values (i.e. <15-20%) often represent low-quality cells or technical artifacts that should be removed.
Ratio reads in ‘blacklist’ sites. The ENCODE project has provided a list of blacklist regions, representing reads which are often associated with artifactual signal. Cells with a high proportion of reads mapping to these areas (compared to reads mapping to peaks) often represent technical artifacts and should be removed. ENCODE blacklist regions for human (hg19 and GRCh38), mouse (mm10), Drosophila (dm3), and C. elegans (ce10) are included in the Signac package.
Note that the last three metrics can be obtained from the output of CellRanger (which is stored in the object metadata), but can also be calculated for non-10x datasets using Signac (more information at the end of this document).
pbmc <- NucleosomeSignal(object = pbmc)
pbmc$pct_reads_in_peaks <- pbmc$peak_region_fragments / pbmc$total * 100 pbmc$blacklist_ratio <- pbmc$blacklist_region_fragments / pbmc$peak_region_fragments plot1 <- VlnPlot( object = pbmc, features = c('pct_reads_in_peaks', 'blacklist_ratio', 'nucleosome_signal'), pt.size = 0.1) + NoLegend() plot2_a <- VlnPlot( object = pbmc, features = 'peak_region_fragments', pt.size = 0.1, log = TRUE) + NoLegend() plot2_b <- FeatureScatter(pbmc,"peak_region_fragments",'nucleosome_signal', pt.size = 0.1) + NoLegend() plot2_c <- FeatureScatter(pbmc,"peak_region_fragments",'blacklist_ratio', pt.size = 0.1) + NoLegend() plot2 <- CombinePlots(plots = list(plot2_a,plot2_b,plot2_c), ncol = 3) CombinePlots(list(plot1,plot2),ncol = 1)
We can also look at the fragment length periodicity for all the cells, and group by cells with high or low nucleosomal signal strength. You can see that cells which are outliers for the mononucleosomal/ nucleosome-free ratio (based off the plots above) have different banding patterns. The remaining cells exhibit a pattern that is typical for a successful ATAC-seq experiment.
We remove cells that are outliers for these QC metrics.
pbmc$nucleosome_group <- ifelse(pbmc$nucleosome_signal > 10, 'NS > 10', 'NS < 10') PeriodPlot(object = pbmc, group.by = 'nucleosome_group')
#filter out cells pbmc <- subset(pbmc, subset = peak_region_fragments > 1000 & peak_region_fragments < 20000 & pct_reads_in_peaks > 15 & blacklist_ratio < 0.05 & nucleosome_signal < 10)
Normalization: Signac performs term frequency-inverse document frequency (TF-IDF) normalization. This is a two-step normalization procedure, that both normalizes across cells to correct for differences in cellular sequencing depth, and across peaks to give higher values to more rare peaks.
Feature selection: The largely binary nature of scATAC-seq data makes it challenging to perform ‘variable’ feature selection, as we do for scRNA-seq. Instead, we can choose to use only the top n% of features (peaks) for dimensional reduction, or remove features present in less that n cells with the
FindTopFeatures function. Here, we will all features, though we note that we see very similar results when using only a subset of features (try setting min.cutoff to ‘q75’ to use the top 25% all peaks), with faster runtimes. Features used for dimensional reduction are automatically set as
VariableFeatures for the Seurat object by this function.
Dimensional reduction: We next run a singular value decomposition (SVD) on the TD-IDF normalized matrix, using the features (peaks) selected above. This returns a low-dimensional representation of the object (for users who are more familiar with scRNA-seq, you can think of this as analogous to the output of PCA).
Now that the cells are embedded in a low-dimensional space, we can use methods commonly applied for the analysis of scRNA-seq data to perform graph-based clustering, and non-linear dimension reduction for visualization. The functions
FindClusters all come from the Seurat package.
pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 1:30) pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 1:30) pbmc <- FindClusters(object = pbmc, verbose = FALSE) DimPlot(object = pbmc, label = TRUE) + NoLegend()
The UMAP visualization reveals the presence of multiple cell groups in human blood. If you are familiar with our scRNA-seq analyses of PBMC, you may even recognize the presence of certain myeloid and lymphoid populations in the scATAC-seq data. However, annotating and interpreting clusters is more challenging in scATAC-seq data, as much less is known about the functional roles of noncoding genomic regions than is known about protein coding regions (genes).
However, we can try to quantify the activity of each gene in the genome by assessing the chromatin accessibility associated with each gene, and create a new gene activity assay derived from the scATAC-seq data. Here we will use a simple approach of summing the reads intersecting the gene body and promoter region (we also recommend exploring the Cicero tool, which can accomplish a similar goal).
To create a gene activity matrix, we extract gene coordinates for the human genome from EnsembleDB, and extend them to include the 2kb upstream region (as promoter accessibility is often correlated with gene expression). We then count the number of fragments for each cell that map to each of these regions, using the using the
FeatureMatrix function. This takes any set of genomic coordinates, counts the number of reads intersecting these coordinates in each cell, and returns a sparse matrix.
# extract gene coordinates from Ensembl, and ensure name formatting is consistent with Seurat object gene.coords <- genes(EnsDb.Hsapiens.v75, filter = ~ gene_biotype == "protein_coding") seqlevelsStyle(gene.coords) <- 'UCSC' genebody.coords <- keepStandardChromosomes(gene.coords, pruning.mode = 'coarse') genebodyandpromoter.coords <- Extend(x = gene.coords, upstream = 2000, downstream = 0) # create a gene by cell matrix gene.activities <- FeatureMatrix( fragments = fragment.path, features = genebodyandpromoter.coords, cells = colnames(pbmc), chunk = 10 ) # convert rownames from chromsomal coordinates into gene names gene.key <- genebodyandpromoter.coords$gene_name names(gene.key) <- GRangesToString(grange = genebodyandpromoter.coords) rownames(gene.activities) <- gene.key[rownames(gene.activities)] # add the gene activity matrix to the Seurat object as a new assay, and normalize it pbmc[['RNA']] <- CreateAssayObject(counts = gene.activities) pbmc <- NormalizeData( object = pbmc, assay = 'RNA', normalization.method = 'LogNormalize', scale.factor = median(pbmc$nCount_RNA) )
Now, we can visualize the ‘activities’ of canonical marker genes to help interpret our ATAC-seq clusters. Note that the activities will be much noisier than scRNA-seq measurements. This is because they represent predictions from sparse chromatin data, but also because they assume a general correspondence between gene body/promoter accessibility and expression, which may not always be the case. Nonetheless, we can begin to discern populations of monocytes, B, T, and NK cells. However, further subdivision of these cell types is challenging based on supervised analysis alone.
DefaultAssay(pbmc) <- 'RNA' FeaturePlot( object = pbmc, features = c('MS4A1', 'CD3D', 'LEF1', 'NKG7', 'TREM1', 'LYZ'), pt.size = 0.1, max.cutoff = 'q95', ncol = 3 )
To help interpret the scATAC-seq data, we can classify cells based on an scRNA-seq experiment from the same biological system (human PBMC). We utilize methods for cross-modality integration and label transfer, described here, with a more in-depth tutorial here. We aim to identify shared correlation patterns in the gene activity matrix and scRNA-seq dataset in order to identify matched biological states across the two modalities. This procedure returns a classification score for each cell, based on each scRNA-seq defined cluster label. Here we load a pre-processed scRNA-seq dataset for human PBMCs, also provided by 10x Genomics. You can download the raw data for this experiment from the 10x website, and view the code used to construct this object on GitHub. Alternatively, you can download the pre-processed Seurat object here.
# Load the pre-processed scRNA-seq data for PBMCs pbmc_rna <- readRDS("/home/stuartt/github/chrom/vignette_data/pbmc_10k_v3.rds") transfer.anchors <- FindTransferAnchors( reference = pbmc_rna, query = pbmc, reduction = 'cca' ) predicted.labels <- TransferData( anchorset = transfer.anchors, refdata = pbmc_rna$celltype, weight.reduction = pbmc[['lsi']] ) pbmc <- AddMetaData(object = pbmc, metadata = predicted.labels)
library(ggplot2) plot1 <- DimPlot( object = pbmc_rna, group.by = 'celltype', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scRNA-seq') plot2 <- DimPlot( object = pbmc, group.by = 'predicted.id', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scATAC-seq') CombinePlots(list(plot1,plot2), ncol = 2)
You can see that the RNA-based classifications are entirely consistent with the UMAP visualization, computed only on the ATAC-seq data. We can now easily annotate our scATAC-seq derived clusters (alternately, we could use the RNA classifications themselves). We note that cluster 16 maps to CD14+ Monocytes, but is a very small cluster with lower QC metrics. As this group is likely representing low-quality cells, we remove it from downstream analysis.
pbmc <- subset(pbmc,idents = 16, invert = TRUE) pbmc <- RenameIdents( object = pbmc, '0' = 'CD14 Mono', '1' = 'CD4 Memory', '2' = 'CD14 Mono', '3' = 'CD8 Naive', '5' = 'CD8 Effector', '4' = 'DN T', '6' = 'NK CD56Dim', '7' = 'CD4 Naive', '8' = 'pre-B', '9' = 'CD16 Mono', '10' = 'CD4 Naive', '11' = 'pro-B', '12' = 'CD8 Effector', '13' = 'DC', '14' = 'NK CD56bright', '15' = 'pDC' )
To find differentially accessible regions between clusters of cells, we can perform a differential accessibility (DA) test. We utilize logistic regression for DA, as suggested by Ntranos et al. 2018 for scRNA-seq data, and add the total number of fragments as a latent variable to mitigate the effect of differential sequencing depth on the result. Here we will focus on comparing Naive CD4 cells and CD14 monocytes, but any groups of cells can be compared using these methods. We can also visualize these chromatin ‘markers’ on a violin plot, feature plot, dot plot, heat map, or any visualization tool in Seurat.
# switch back to working with peaks instead of gene activities DefaultAssay(pbmc) <- 'peaks' da_peaks <- FindMarkers( object = pbmc, ident.1 = "CD4 Naive", ident.2 = "CD14 Mono", min.pct = 0.2, test.use = 'LR', latent.vars = 'peak_region_fragments' ) head(da_peaks)
## p_val avg_logFC pct.1 pct.2 ## chr14:99721608-99741934 9.337072e-306 0.8607908 0.860 0.031 ## chr14:99695477-99720910 1.356438e-244 0.8506591 0.803 0.030 ## chr17:80084198-80086094 1.254994e-237 1.1841526 0.679 0.009 ## chr7:142501666-142511108 9.370214e-229 0.7440743 0.757 0.037 ## chr6:44025105-44028184 3.268806e-215 -0.9358845 0.044 0.619 ## chr2:113581628-113594911 2.761339e-214 -0.9382337 0.042 0.659 ## p_val_adj ## chr14:99721608-99741934 8.338659e-301 ## chr14:99695477-99720910 1.211394e-239 ## chr17:80084198-80086094 1.120797e-232 ## chr7:142501666-142511108 8.368257e-224 ## chr6:44025105-44028184 2.919273e-210 ## chr2:113581628-113594911 2.466069e-209
plot1 <- VlnPlot( object = pbmc, features = rownames(da_peaks), ncol = 3, pt.size = 0.1, idents = c("CD4 Memory","CD14 Mono") ) plot2 <- FeaturePlot( object = pbmc, features = rownames(da_peaks), ncol = 3, pt.size = 0.1 ) CombinePlots(list(plot1,plot2))
Peak coordinates can be difficult to interpret alone. We can find the closest gene to each of these peaks using the
ClosestFeature function and providing an EnsDb annotation. If you explore the gene lists, you will see that peaks open in Naive T cells are close to genes such as BCL11B and GATA3 (key regulator of T cell differentiation), while peaks open in monocytes are close to genes such as CEBPB (key regulator of monocyte differentiation). We could follow up this result further by doing gene ontology enrichment analysis on the gene sets returned by
ClosestFeature, and there are many R packages that can do this (see the
GOstats package for example).
open_cd4naive <- rownames(da_peaks[da_peaks$avg_logFC > 0.5, ]) open_cd14mono <- rownames(da_peaks[da_peaks$avg_logFC < -0.5, ]) closest_genes_cd4naive <- ClosestFeature( regions = open_cd4naive, annotation = EnsDb.Hsapiens.v75, sep = c(':', '-') ) closest_genes_cd14mono <- ClosestFeature( regions = open_cd14mono, annotation = EnsDb.Hsapiens.v75, sep = c(':', '-') )
## gene_id gene_name gene_biotype ## ENSG00000127152 ENSG00000127152 BCL11B protein_coding ## ENSG00000127152.1 ENSG00000127152 BCL11B protein_coding ## ENSG00000176155 ENSG00000176155 CCDC57 protein_coding ## ENSG00000204983 ENSG00000204983 PRSS1 protein_coding ## ENSG00000259003 ENSG00000259003 AE000662.92 protein_coding ## ENSG00000134709 ENSG00000134709 HOOK1 protein_coding ## seq_coord_system symbol entrezid ## ENSG00000127152 chromosome BCL11B 64919 ## ENSG00000127152.1 chromosome BCL11B 64919 ## ENSG00000176155 chromosome CCDC57 284001 ## ENSG00000204983 chromosome PRSS1 5645, 5644 ## ENSG00000259003 chromosome AE000662.92 NA ## ENSG00000134709 chromosome HOOK1 51361 ## region distance ## ENSG00000127152 chr14-99635624-99737861 0 ## ENSG00000127152.1 chr14-99635624-99737861 0 ## ENSG00000176155 chr17-80059336-80170706 0 ## ENSG00000204983 chr7-142457319-142460923 40742 ## ENSG00000259003 chr14-23025534-23027939 0 ## ENSG00000134709 chr1-60280458-60342050 0
## gene_id gene_name gene_biotype ## ENSG00000181577 ENSG00000181577 C6orf223 protein_coding ## ENSG00000125538 ENSG00000125538 IL1B protein_coding ## ENSG00000172216 ENSG00000172216 CEBPB protein_coding ## ENSG00000138621 ENSG00000138621 PPCDC protein_coding ## ENSG00000186350 ENSG00000186350 RXRA protein_coding ## ENSG00000273269 ENSG00000273269 RP11-761B3.1 protein_coding ## seq_coord_system symbol entrezid ## ENSG00000181577 chromosome C6orf223 221416 ## ENSG00000125538 chromosome IL1B 3553 ## ENSG00000172216 chromosome CEBPB 1051 ## ENSG00000138621 chromosome PPCDC 60490 ## ENSG00000186350 chromosome RXRA 6256 ## ENSG00000273269 chromosome RP11-761B3.1 NA ## region distance ## ENSG00000181577 chr6-43968317-43973695 51409 ## ENSG00000125538 chr2-113587328-113594480 0 ## ENSG00000172216 chr20-48807376-48809212 80581 ## ENSG00000138621 chr15-75315896-75409803 0 ## ENSG00000186350 chr9-137208944-137332431 0 ## ENSG00000273269 chr2-47293080-47403650 0
We can also create coverage plots grouped by cluster around any genomic region using the
CoveragePlot function. These represent ‘pseudo-bulk’ accessibility tracks, where all cells within a cluster have been averaged together, in order to visualize a more robust chromatin landscape (thanks to Andrew Hill for giving the inspiration for this function in his excellent blog post).
# set plotting order levels(pbmc) <- c("CD4 Naive","CD4 Memory","CD8 Naive","CD8 Effector","DN T","NK CD56bright","NK CD56Dim","pre-B",'pro-B',"pDC","DC","CD14 Mono",'CD16 Mono') CoveragePlot( object = pbmc, region = rownames(da_peaks)[c(1,5)], sep = c(":", "-"), annotation = EnsDb.Hsapiens.v75, extend.upstream = 20000, extend.downstream = 20000, ncol = 1 )
Working with datasets that were not quantified using CellRanger
The CellRanger software from 10x Genomics generates several useful QC metrics per-cell, as well as a peak/cell matrix and an indexed fragments file. In the above vignette, we utilize the CellRanger outputs, but provide alternative functions in Signac for many of the same purposes here.
FeatureMatrix function can be used to generate a count matrix containing any set of genomic ranges in its rows. These regions could be a set of peaks, or bins that span the entire genome.
# not run # peak_ranges should be a set of genomic ranges spanning the set of peaks to be quantified per cell peak_matrix <- FeatureMatrix( fragments = fragment.path, features = peak_ranges )
For convenience, we also include a
GenomeBinMatrix function that will generate a set of genomic ranges spanning the entire genome for you, and run
FeatureMatrix internally to produce a genome bin/cell matrix.
# not run bin_matrix <- GenomeBinMatrix( fragments = fragment.path, genome = 'hg19', binsize = 5000 )
FRiP will count the fraction of reads in peaks for each cell, given a peak/cell assay and a bin/cell assay. Note that this can be run on a subset of the genome, so that a bin/cell assay does not need to be computed for the whole genome. This will return a Seurat object will metadata added corresponding to the fraction of reads in peaks for each cell.
# not run pbmc <- FRiP( object = pbmc, peak.assay = 'peaks', bin.assay = 'bins' )
The ratio of reads in genomic blacklist regions, that are known to artifactually accumulate reads in genome sequencing assays, can be diagnostic of low-quality cells. We provide blacklist region coordinates for several genomes (hg19, hg38, mm9, mm10, ce10, ce11, dm3, dm6) in the Signac package for convenience. These regions were provided by the ENCODE consortium, and we encourage users to cite their paper if you use the regions in your analysis. The
FractionCountsInRegion function can be used to calculate the fraction of all counts within a given set of regions per cell. We can use this function and the blacklist regions to find the fraction of blacklist counts per cell.
pbmc <- FractionCountsInRegion( object = pbmc, assay = 'peaks', regions = blacklist_hg19 )