For this tutorial, we will be analyzing a single-cell ATAC-seq dataset of human peripheral blood mononuclear cells (PBMCs) provided by 10x Genomics. The following files are used in this vignette, all available through the 10x Genomics website:

First load in Signac, Seurat, and some other packages we will be using for analyzing human data.

library(Signac)
library(Seurat)
library(GenomeInfoDb)
library(EnsDb.Hsapiens.v75)
set.seed(1234)

Quickstart

The vignette below presents a detailed workflow for the pre-processing, QC, clustering, visualization, integration, and interpretation of single-cell ATAC-seq data. While there are many steps (some of which are computationally intensive), we note that the core analytical components are similar to single-cell RNA-seq, and can be run quickly in a few lines. For example, the following lines encapsulate our scATAC-seq workflow (but does not include the calculation of advanced QC metrics, or integration with scRNA-seq data).

library(dplyr)

counts <- Read10X_h5(filename = "/home/stuartt/github/chrom/vignette_data/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
  file = "/home/stuartt/github/chrom/vignette_data/atac_v1_pbmc_10k_singlecell.csv",
  header = TRUE,
  row.names = 1
)

pbmc_fast <- CreateSeuratObject(counts = counts,meta.data = metadata) %>%
  subset(peak_region_fragments > 1000) %>%
  RunTFIDF() %>%
  FindTopFeatures(cutoff = 'q75') %>%
  RunSVD(reduction.name='lsi') %>%
  FindNeighbors(reduction='lsi', dims=1:30) %>%
  FindClusters(resolution = 0.5,verbose = FALSE) %>%
  RunUMAP(reduction='lsi', dims=1:30)

DimPlot(pbmc_fast, label = FALSE)

Below, we describe these steps in detail, beginning with pre-processing.

Pre-processing workflow

When pre-processing chromatin data, Signac uses information from two related input files, both of which are created by CellRanger:

  • Peak/Cell matrix. This is analogous to the gene expression count matrix used to analyze single-cell RNA-seq. However, instead of genes, each row of the matrix represents a region of the genome (a ‘peak’), that is predicted to represent a region of open chromatin. Each value in the matrix represents the number of Tn5 cut sites for each single barcode (i.e. cell) that map within each peak. You can find more detail on the 10X Website.

  • Fragment file. This represents a full list of all unique fragments across all single cells. It is a substantially larger file, is slower to work with, and is stored on-disk (instead of in memory). However, the advantage of retaining this file is that it contains all fragments associated with each single cell, as opposed to only reads that map to peaks.

For most analyses we work with the peak/cell matrix, but for some (e.g. creating a ‘Gene Activity Matrix’), we find that restricting only to reads in peaks can adversely affect results. We therefore use both files in this vignette. We start by creating a Seurat object using the peak/cell matrix, and store the path to the fragment file on disk in the Seurat object:

counts <- Read10X_h5(filename = "/home/stuartt/github/chrom/vignette_data/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
  file = "/home/stuartt/github/chrom/vignette_data/atac_v1_pbmc_10k_singlecell.csv",
  header = TRUE,
  row.names = 1
)

pbmc <- CreateSeuratObject(
  counts = counts,
  assay = 'peaks',
  project = 'ATAC',
  min.cells = 1,
  meta.data = metadata
)
## Warning in CreateSeuratObject(counts = counts, assay = "peaks", project =
## "ATAC", : Some cells in meta.data not present in provided counts matrix.
fragment.path <- '/home/stuartt/github/chrom/vignette_data/atac_v1_pbmc_10k_fragments.tsv.gz'

pbmc <- SetFragments(
  object = pbmc,
  file = fragment.path
)
pbmc
## An object of class Seurat
## 89307 features across 8728 samples within 1 assay
## Active assay: peaks (89307 features)

Optional step: Filtering the fragment file

To make downstream steps that use this file faster, we can filter the fragments file to contain only reads from cells that we retain in the analysis. This is optional, but slow, and only needs to be performed once. Running this command writes a new file to disk and indexes the file so it is ready to be used by Signac.

fragment_file_filtered <- "/home/stuartt/github/chrom/vignette_data/atac_v1_pbmc_10k_filtered_fragments.tsv.bgz"
FilterFragments(
  fragment.path = fragment.path,
  cells = colnames(pbmc),
  output.path = fragment_file_filtered
)
pbmc <- SetFragments(object = pbmc, file = fragment_file_filtered)

Computing QC Metrics

We can now compute some QC metrics for the scATAC-seq experiment. We currently suggest the following metrics below to assess data quality. As with scRNA-seq, the expected range of values for these parameters will vary depending on your biological system, cell viability, etc.

  • Nucleosome banding pattern: The histogram of fragment sizes (determined from the paired-end sequencing reads) should exhibit a strong nucleosome banding pattern. We calculate this per single cell, and quantify the approximate ratio of mononucleosomal to nucleosome-free fragments (stored as nucleosome_signal). Note that by default, this is calculated only on chr1 reads (see the region parameter) to save time.

  • Total number of fragments in peaks: A measure of cellular sequencing depth / complexity. Cells with very few reads may need to be excluded due to low sequencing depth. Cells with extremely high levels may represent doublets or nuclear clumps.

  • Fraction of fragments in peaks. Represents the fraction of total fragments that fall within ATAC-seq peaks. Cells with low values (i.e. <15-20%) often represent low-quality cells or technical artifacts that should be removed.

  • Ratio reads in ‘blacklist’ sites. The ENCODE project has provided a list of blacklist regions, representing reads which are often associated with artifactual signal. Cells with a high proportion of reads mapping to these areas (compared to reads mapping to peaks) often represent technical artifacts and should be removed. ENCODE blacklist regions for human (hg19 and GRCh38), mouse (mm10), Drosophila (dm3), and C. elegans (ce10) are included in the Signac package.

Note that the last three metrics can be obtained from the output of CellRanger (which is stored in the object metadata), but can also be calculated for non-10x datasets using Signac (more information at the end of this document).

pbmc <- NucleosomeSignal(object = pbmc)
pbmc$pct_reads_in_peaks <- pbmc$peak_region_fragments / pbmc$total * 100
pbmc$blacklist_ratio <- pbmc$blacklist_region_fragments / pbmc$peak_region_fragments

plot1 <- VlnPlot(
  object = pbmc,
  features = c('pct_reads_in_peaks', 'blacklist_ratio', 'nucleosome_signal'),
  pt.size = 0.1) + NoLegend()

plot2_a <- VlnPlot(
  object = pbmc,
  features = 'peak_region_fragments',
  pt.size = 0.1, log = TRUE) + NoLegend()

plot2_b <- FeatureScatter(pbmc,"peak_region_fragments",'nucleosome_signal', pt.size = 0.1) + NoLegend()
plot2_c <- FeatureScatter(pbmc,"peak_region_fragments",'blacklist_ratio', pt.size = 0.1) + NoLegend()
plot2 <- CombinePlots(plots = list(plot2_a,plot2_b,plot2_c), ncol = 3)

CombinePlots(list(plot1,plot2),ncol = 1)

We can also look at the fragment length periodicity for all the cells, and group by cells with high or low nucleosomal signal strength. You can see that cells which are outliers for the mononucleosomal/ nucleosome-free ratio (based off the plots above) have different banding patterns. The remaining cells exhibit a pattern that is typical for a successful ATAC-seq experiment.

We remove cells that are outliers for these QC metrics.

pbmc$nucleosome_group <- ifelse(pbmc$nucleosome_signal > 10, 'NS > 10', 'NS < 10')
PeriodPlot(object = pbmc, group.by = 'nucleosome_group')

#filter out cells 
pbmc <- subset(pbmc, subset = peak_region_fragments > 1000 & peak_region_fragments < 20000 & pct_reads_in_peaks > 15 & blacklist_ratio < 0.05 & nucleosome_signal < 10)

Normalization and linear dimensional reduction

  • Normalization: Signac performs term frequency-inverse document frequency (TF-IDF) normalization. This is a two-step normalization procedure, that both normalizes across cells to correct for differences in cellular sequencing depth, and across peaks to give higher values to more rare peaks.

  • Feature selection: The largely binary nature of scATAC-seq data makes it challenging to perform ‘variable’ feature selection, as we do for scRNA-seq. Instead, we can choose to use only the top n% of features (peaks) for dimensional reduction, or remove features present in less that n cells with the FindTopFeatures function. Here, we will all features, though we note that we see very similar results when using only a subset of features (try setting min.cutoff to ‘q75’ to use the top 25% all peaks), with faster runtimes. Features used for dimensional reduction are automatically set as VariableFeatures for the Seurat object by this function.

  • Dimensional reduction: We next run a singular value decomposition (SVD) on the TD-IDF normalized matrix, using the features (peaks) selected above. This returns a low-dimensional representation of the object (for users who are more familiar with scRNA-seq, you can think of this as analogous to the output of PCA).

pbmc <- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(
  object = pbmc,
  assay = 'peaks',
  reduction.key = 'LSI_',
  reduction.name = 'lsi'
)

Non-linear dimension reduction and clustering

Now that the cells are embedded in a low-dimensional space, we can use methods commonly applied for the analysis of scRNA-seq data to perform graph-based clustering, and non-linear dimension reduction for visualization. The functions RunUMAP, FindNeighbors, and FindClusters all come from the Seurat package.

pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 1:30)
pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 1:30)
pbmc <- FindClusters(object = pbmc, verbose = FALSE)
DimPlot(object = pbmc, label = TRUE) + NoLegend()

Create a gene activity matrix

The UMAP visualization reveals the presence of multiple cell groups in human blood. If you are familiar with our scRNA-seq analyses of PBMC, you may even recognize the presence of certain myeloid and lymphoid populations in the scATAC-seq data. However, annotating and interpreting clusters is more challenging in scATAC-seq data, as much less is known about the functional roles of noncoding genomic regions than is known about protein coding regions (genes).

However, we can try to quantify the activity of each gene in the genome by assessing the chromatin accessibility associated with each gene, and create a new gene activity assay derived from the scATAC-seq data. Here we will use a simple approach of summing the reads intersecting the gene body and promoter region (we also recommend exploring the Cicero tool, which can accomplish a similar goal).

To create a gene activity matrix, we extract gene coordinates for the human genome from EnsembleDB, and extend them to include the 2kb upstream region (as promoter accessibility is often correlated with gene expression). We then count the number of fragments for each cell that map to each of these regions, using the using the FeatureMatrix function. This takes any set of genomic coordinates, counts the number of reads intersecting these coordinates in each cell, and returns a sparse matrix.

# extract gene coordinates from Ensembl, and ensure name formatting is consistent with Seurat object 
gene.coords <- genes(EnsDb.Hsapiens.v75, filter = ~ gene_biotype == "protein_coding")
seqlevelsStyle(gene.coords) <- 'UCSC'
genebody.coords <- keepStandardChromosomes(gene.coords, pruning.mode = 'coarse')
genebodyandpromoter.coords <- Extend(x = gene.coords, upstream = 2000, downstream = 0)

# create a gene by cell matrix
gene.activities <- FeatureMatrix(
  fragments = fragment.path,
  features = genebodyandpromoter.coords,
  cells = colnames(pbmc),
  chunk = 10
)

# convert rownames from chromsomal coordinates into gene names
gene.key <- genebodyandpromoter.coords$gene_name
names(gene.key) <- GRangesToString(grange = genebodyandpromoter.coords)
rownames(gene.activities) <- gene.key[rownames(gene.activities)]

# add the gene activity matrix to the Seurat object as a new assay, and normalize it
pbmc[['RNA']] <- CreateAssayObject(counts = gene.activities)
pbmc <- NormalizeData(
  object = pbmc,
  assay = 'RNA',
  normalization.method = 'LogNormalize',
  scale.factor = median(pbmc$nCount_RNA)
)

Now, we can visualize the ‘activities’ of canonical marker genes to help interpret our ATAC-seq clusters. Note that the activities will be much noisier than scRNA-seq measurements. This is because they represent predictions from sparse chromatin data, but also because they assume a general correspondence between gene body/promoter accessibility and expression, which may not always be the case. Nonetheless, we can begin to discern populations of monocytes, B, T, and NK cells. However, further subdivision of these cell types is challenging based on supervised analysis alone.

DefaultAssay(pbmc) <- 'RNA'

FeaturePlot(
  object = pbmc,
  features = c('MS4A1', 'CD3D', 'LEF1', 'NKG7', 'TREM1', 'LYZ'),
  pt.size = 0.1,
  max.cutoff = 'q95',
  ncol = 3
)

Integrating with scRNA-seq data

To help interpret the scATAC-seq data, we can classify cells based on an scRNA-seq experiment from the same biological system (human PBMC). We utilize methods for cross-modality integration and label transfer, described here, with a more in-depth tutorial here. We aim to identify shared correlation patterns in the gene activity matrix and scRNA-seq dataset in order to identify matched biological states across the two modalities. This procedure returns a classification score for each cell, based on each scRNA-seq defined cluster label. Here we load a pre-processed scRNA-seq dataset for human PBMCs, also provided by 10x Genomics. You can download the raw data for this experiment from the 10x website, and view the code used to construct this object on GitHub. Alternatively, you can download the pre-processed Seurat object here.

# Load the pre-processed scRNA-seq data for PBMCs
pbmc_rna <- readRDS("/home/stuartt/github/chrom/vignette_data/pbmc_10k_v3.rds")

transfer.anchors <- FindTransferAnchors(
  reference = pbmc_rna,
  query = pbmc,
  reduction = 'cca'
)

predicted.labels <- TransferData(
  anchorset = transfer.anchors,
  refdata = pbmc_rna$celltype,
  weight.reduction = pbmc[['lsi']]
)

pbmc <- AddMetaData(object = pbmc, metadata = predicted.labels)
library(ggplot2)

plot1 <- DimPlot(
  object = pbmc_rna,
  group.by = 'celltype',
  label = TRUE,
  repel = TRUE) + NoLegend() + ggtitle('scRNA-seq')

plot2 <- DimPlot(
  object = pbmc,
  group.by = 'predicted.id',
  label = TRUE,
  repel = TRUE) + NoLegend() + ggtitle('scATAC-seq')

CombinePlots(list(plot1,plot2), ncol = 2)

You can see that the RNA-based classifications are entirely consistent with the UMAP visualization, computed only on the ATAC-seq data. We can now easily annotate our scATAC-seq derived clusters (alternately, we could use the RNA classifications themselves). We note that cluster 16 maps to CD14+ Monocytes, but is a very small cluster with lower QC metrics. As this group is likely representing low-quality cells, we remove it from downstream analysis.

pbmc <- subset(pbmc,idents = 16, invert = TRUE)
pbmc <- RenameIdents(
  object = pbmc,
  '0' = 'CD14 Mono',
  '1' = 'CD4 Memory',
  '2' = 'CD14 Mono',
  '3' = 'CD8 Naive',
  '5' = 'CD8 Effector',
  '4' = 'DN T',
  '6' = 'NK CD56Dim',
  '7' = 'CD4 Naive',
  '8' = 'pre-B',
  '9' = 'CD16 Mono',
  '10' = 'CD4 Naive',
  '11' = 'pro-B',
  '12' = 'CD8 Effector',
  '13' = 'DC',
  '14' = 'NK CD56bright',
  '15' = 'pDC'
)

Find differentially accessible peaks between clusters

To find differentially accessible regions between clusters of cells, we can perform a differential accessibility (DA) test. We utilize logistic regression for DA, as suggested by Ntranos et al. 2018 for scRNA-seq data, and add the total number of fragments as a latent variable to mitigate the effect of differential sequencing depth on the result. Here we will focus on comparing Naive CD4 cells and CD14 monocytes, but any groups of cells can be compared using these methods. We can also visualize these chromatin ‘markers’ on a violin plot, feature plot, dot plot, heat map, or any visualization tool in Seurat.

# switch back to working with peaks instead of gene activities
DefaultAssay(pbmc) <- 'peaks'

da_peaks <- FindMarkers(
  object = pbmc,
  ident.1 = "CD4 Naive",
  ident.2 = "CD14 Mono",
  min.pct = 0.2,
  test.use = 'LR',
  latent.vars = 'peak_region_fragments'
)

head(da_peaks)
##                                  p_val  avg_logFC pct.1 pct.2
## chr14:99721608-99741934  9.337072e-306  0.8607908 0.860 0.031
## chr14:99695477-99720910  1.356438e-244  0.8506591 0.803 0.030
## chr17:80084198-80086094  1.254994e-237  1.1841526 0.679 0.009
## chr7:142501666-142511108 9.370214e-229  0.7440743 0.757 0.037
## chr6:44025105-44028184   3.268806e-215 -0.9358845 0.044 0.619
## chr2:113581628-113594911 2.761339e-214 -0.9382337 0.042 0.659
##                              p_val_adj
## chr14:99721608-99741934  8.338659e-301
## chr14:99695477-99720910  1.211394e-239
## chr17:80084198-80086094  1.120797e-232
## chr7:142501666-142511108 8.368257e-224
## chr6:44025105-44028184   2.919273e-210
## chr2:113581628-113594911 2.466069e-209
plot1 <- VlnPlot(
  object = pbmc,
  features = rownames(da_peaks)[1],
  ncol = 3,
  pt.size = 0.1,
  idents = c("CD4 Memory","CD14 Mono")
)
plot2 <- FeaturePlot(
  object = pbmc,
  features = rownames(da_peaks)[1],
  ncol = 3,
  pt.size = 0.1
)
CombinePlots(list(plot1,plot2))

Peak coordinates can be difficult to interpret alone. We can find the closest gene to each of these peaks using the ClosestFeature function and providing an EnsDb annotation. If you explore the gene lists, you will see that peaks open in Naive T cells are close to genes such as BCL11B and GATA3 (key regulator of T cell differentiation), while peaks open in monocytes are close to genes such as CEBPB (key regulator of monocyte differentiation). We could follow up this result further by doing gene ontology enrichment analysis on the gene sets returned by ClosestFeature, and there are many R packages that can do this (see the GOstats package for example).

open_cd4naive <- rownames(da_peaks[da_peaks$avg_logFC > 0.5, ])
open_cd14mono <- rownames(da_peaks[da_peaks$avg_logFC < -0.5, ])

closest_genes_cd4naive <- ClosestFeature(
  regions = open_cd4naive,
  annotation = EnsDb.Hsapiens.v75,
  sep = c(':', '-')
)
closest_genes_cd14mono <- ClosestFeature(
  regions = open_cd14mono,
  annotation = EnsDb.Hsapiens.v75,
  sep = c(':', '-')
)
head(closest_genes_cd4naive)
##                           gene_id   gene_name   gene_biotype
## ENSG00000127152   ENSG00000127152      BCL11B protein_coding
## ENSG00000127152.1 ENSG00000127152      BCL11B protein_coding
## ENSG00000176155   ENSG00000176155      CCDC57 protein_coding
## ENSG00000204983   ENSG00000204983       PRSS1 protein_coding
## ENSG00000259003   ENSG00000259003 AE000662.92 protein_coding
## ENSG00000134709   ENSG00000134709       HOOK1 protein_coding
##                   seq_coord_system      symbol   entrezid
## ENSG00000127152         chromosome      BCL11B      64919
## ENSG00000127152.1       chromosome      BCL11B      64919
## ENSG00000176155         chromosome      CCDC57     284001
## ENSG00000204983         chromosome       PRSS1 5645, 5644
## ENSG00000259003         chromosome AE000662.92         NA
## ENSG00000134709         chromosome       HOOK1      51361
##                                     region distance
## ENSG00000127152    chr14-99635624-99737861        0
## ENSG00000127152.1  chr14-99635624-99737861        0
## ENSG00000176155    chr17-80059336-80170706        0
## ENSG00000204983   chr7-142457319-142460923    40742
## ENSG00000259003    chr14-23025534-23027939        0
## ENSG00000134709     chr1-60280458-60342050        0
head(closest_genes_cd14mono)
##                         gene_id    gene_name   gene_biotype
## ENSG00000181577 ENSG00000181577     C6orf223 protein_coding
## ENSG00000125538 ENSG00000125538         IL1B protein_coding
## ENSG00000172216 ENSG00000172216        CEBPB protein_coding
## ENSG00000138621 ENSG00000138621        PPCDC protein_coding
## ENSG00000186350 ENSG00000186350         RXRA protein_coding
## ENSG00000273269 ENSG00000273269 RP11-761B3.1 protein_coding
##                 seq_coord_system       symbol entrezid
## ENSG00000181577       chromosome     C6orf223   221416
## ENSG00000125538       chromosome         IL1B     3553
## ENSG00000172216       chromosome        CEBPB     1051
## ENSG00000138621       chromosome        PPCDC    60490
## ENSG00000186350       chromosome         RXRA     6256
## ENSG00000273269       chromosome RP11-761B3.1       NA
##                                   region distance
## ENSG00000181577   chr6-43968317-43973695    51409
## ENSG00000125538 chr2-113587328-113594480        0
## ENSG00000172216  chr20-48807376-48809212    80581
## ENSG00000138621  chr15-75315896-75409803        0
## ENSG00000186350 chr9-137208944-137332431        0
## ENSG00000273269   chr2-47293080-47403650        0

We can also create coverage plots grouped by cluster around any genomic region using the CoveragePlot function. These represent ‘pseudo-bulk’ accessibility tracks, where all cells within a cluster have been averaged together, in order to visualize a more robust chromatin landscape (thanks to Andrew Hill for giving the inspiration for this function in his excellent blog post).

# set plotting order
levels(pbmc) <- c("CD4 Naive","CD4 Memory","CD8 Naive","CD8 Effector","DN T","NK CD56bright","NK CD56Dim","pre-B",'pro-B',"pDC","DC","CD14 Mono",'CD16 Mono')

CoveragePlot(
  object = pbmc,
  region = rownames(da_peaks)[c(1,5)],
  sep = c(":", "-"),
  annotation = EnsDb.Hsapiens.v75,
  extend.upstream = 20000,
  extend.downstream = 20000,
  ncol = 1
)

Working with datasets that were not quantified using CellRanger

The CellRanger software from 10x Genomics generates several useful QC metrics per-cell, as well as a peak/cell matrix and an indexed fragments file. In the above vignette, we utilize the CellRanger outputs, but provide alternative functions in Signac for many of the same purposes here.

Generating a peak/cell or bin/cell matrix

The FeatureMatrix function can be used to generate a count matrix containing any set of genomic ranges in its rows. These regions could be a set of peaks, or bins that span the entire genome.

# not run
# peak_ranges should be a set of genomic ranges spanning the set of peaks to be quantified per cell
peak_matrix <- FeatureMatrix(
  fragments = fragment.path,
  features = peak_ranges
)

For convenience, we also include a GenomeBinMatrix function that will generate a set of genomic ranges spanning the entire genome for you, and run FeatureMatrix internally to produce a genome bin/cell matrix.

# not run
bin_matrix <- GenomeBinMatrix(
  fragments = fragment.path,
  genome = 'hg19',
  binsize = 5000
)

Counting fraction of reads in peaks

The function FRiP will count the fraction of reads in peaks for each cell, given a peak/cell assay and a bin/cell assay. Note that this can be run on a subset of the genome, so that a bin/cell assay does not need to be computed for the whole genome. This will return a Seurat object will metadata added corresponding to the fraction of reads in peaks for each cell.

# not run
pbmc <- FRiP(
  object = pbmc,
  peak.assay = 'peaks',
  bin.assay = 'bins'
)

Counting fragments in genome blacklist regions

The ratio of reads in genomic blacklist regions, that are known to artifactually accumulate reads in genome sequencing assays, can be diagnostic of low-quality cells. We provide blacklist region coordinates for several genomes (hg19, hg38, mm9, mm10, ce10, ce11, dm3, dm6) in the Signac package for convenience. These regions were provided by the ENCODE consortium, and we encourage users to cite their paper if you use the regions in your analysis. The FractionCountsInRegion function can be used to calculate the fraction of all counts within a given set of regions per cell. We can use this function and the blacklist regions to find the fraction of blacklist counts per cell.

pbmc <- FractionCountsInRegion(
  object = pbmc,
  assay = 'peaks',
  regions = blacklist_hg19
)