In this tutorial, we will perform DNA sequence motif analysis in Signac. We will explore two complementary options for performing motif analysis: one by finding overrepresented motifs in a set of differentially accessible peaks, one method performing differential motif activity analysis between groups of cells.

In this demonstration we use data from the adult mouse brain. See our vignette for the code used to generate this object, and links to the raw data. First, load the required packages and the pre-computed Seurat object:

library(Signac)
library(Seurat)
library(JASPAR2020)
library(TFBSTools)
library(BSgenome.Mmusculus.UCSC.mm10)
library(patchwork)
set.seed(1234)
mouse_brain <- readRDS("../vignette_data/adult_mouse_brain.rds")
mouse_brain
## An object of class Seurat 
## 178793 features across 3512 samples within 2 assays 
## Active assay: peaks (156815 features, 156815 variable features)
##  1 other assay present: RNA
##  2 dimensional reductions calculated: lsi, umap
p1 <- DimPlot(mouse_brain, label = TRUE, pt.size = 0.1) + NoLegend()
p1

The Motif class

To facilitate motif analysis in Signac, we have create the Motif class to store all the required information, including a list of position weight matrices (PWMs) or position frequency matrices (PFMs) and a motif occurrence matrix. Here, we construct a Motif object and add it to our mouse brain dataset. A motif object can be added to any Seurat assay using the SetAssayData() function. See the object interaction vignette for more information.

# Get a list of motif position frequency matrices from the JASPAR database
pfm <- getMatrixSet(
  x = JASPAR2020,
  opts = list(species = 9606, all_versions = FALSE)
)

# Scan the DNA sequence of each peak for the presence of each motif
motif.matrix <- CreateMotifMatrix(
  features = granges(mouse_brain),
  pwm = pfm,
  genome = 'mm10',
  use.counts = FALSE
)

# Create a new Mofif object to store the results
motif <- CreateMotifObject(
  data = motif.matrix,
  pwm = pfm
)

# Add the Motif object to the assay
mouse_brain <- SetAssayData(
  object = mouse_brain,
  assay = 'peaks',
  slot = 'motifs',
  new.data = motif
)
mouse_brain[["peaks"]]
## ChromatinAssay data with 156815 features for 3512 cells
## Variable features: 156815 
## Genome: mm10 
## Annotation present: TRUE 
## Motifs present: TRUE 
## Fragment files: 1

In order to test for overrepresented motifs, we also need to compute some sequence characteristics of the peaks, such as GC content, sequence length, and dinucleotide frequency. The RegionStats() function computes this information for us and stores the results in the feature metadata in the Seurat object.

mouse_brain <- RegionStats(object = mouse_brain, genome = BSgenome.Mmusculus.UCSC.mm10)

Finding overrepresented motifs

To identify potentially important cell-type-specific regulatory sequences, we can search for DNA motifs that are overrepresented in a set of peaks that are differentially accessible between cell types.

Here, we find differentially accessible peaks between Pvalb and Sst inhibitory interneurons. We then perform a hypergeometric test to test the probability of observing the motif at the given frequency by chance, comparing with a background set of peaks matched for GC content.

da_peaks <- FindMarkers(
  object = mouse_brain,
  ident.1 = 'Pvalb',
  ident.2 = 'Sst',
  only.pos = TRUE,
  test.use = 'LR',
  latent.vars = 'nCount_peaks'
)

# get top differentially accessible peaks
top.da.peak <- rownames(da_peaks[da_peaks$p_val < 0.005, ])
Optional: choosing a set of background peaks

Matching the set of background peaks is essential when finding enriched DNA sequence motifs. By default, we choose a set of peaks matched for GC content, but it can be sometimes be beneficial to further restrict the background peaks to those that are accessible in the groups of cells compared when finding differentially accessible peaks.

The AccessiblePeaks() function can be used to find a set of peaks that are open in a subset of cells. We can use this function to first restrict the set of possible background peaks to those peaks that were open in the set of cells compared in FindMarkers(), and then create a GC-content-matched set of peaks from this larger set using MatchRegionStats().

# find peaks open in Pvalb or Sst cells
open.peaks <- AccessiblePeaks(mouse_brain, idents = c("Pvalb", "Sst"))

# match the overall GC content in the peak set
peaks.matched <- MatchRegionStats(
  meta.feature = GetAssayData(mouse_brain, assay = "peaks", slot = "meta.features")[open.peaks, ],
  regions = top.da.peak,
  n = 50000
)
## Matching GC.percent distribution

peaks.matched can then be used as the background peak set by setting background=peaks.matched in FindMotifs().

# test enrichment
enriched.motifs <- FindMotifs(
  object = mouse_brain,
  features = top.da.peak
)
## Selecting background regions to match input
##               sequence characteristics
## Matching GC.percent distribution
## Testing motif enrichment in 2128 regions
knitr::kable(head(enriched.motifs))
motif observed background percent.observed percent.background fold.enrichment pvalue qvalue motif.name
MA0497.1 MA0497.1 890 9056 41.82331 22.6400 1.847319 0 0 MEF2C
MA0052.4 MA0052.4 842 8672 39.56767 21.6800 1.825077 0 0 MEF2A
MA1151.1 MA1151.1 445 3248 20.91165 8.1200 2.575327 0 0 RORC
MA0773.1 MA0773.1 600 5414 28.19549 13.5350 2.083154 0 0 MEF2D
MA1150.1 MA1150.1 420 3146 19.73684 7.8650 2.509452 0 0 RORB
MA0660.1 MA0660.1 515 4363 24.20113 10.9075 2.218760 0 0 MEF2B

We can also plot the position weight matrices for the motifs, so we can visualize the different motif sequences.

MotifPlot(
  object = mouse_brain,
  motifs = head(rownames(enriched.motifs))
)

We and others have previously shown that Mef-family motifs, particularly Mef2c, are enriched in Pvalb-specific peaks in scATAC-seq data (https://doi.org/10.1016/j.cell.2019.05.031; https://doi.org/10.1101/615179), and further shown that Mef2c is required for the development of Pvalb interneurons (https://www.nature.com/articles/nature25999). Here our results are consistent with these findings, and we observe a strong enrichment of Mef-family motifs in the top results from FindMotifs().

Computing motif activities

We can also compute a per-cell motif activity score by running chromVAR. This allows us to visualize motif activities per cell, and also provides an alternative method of identifying differentially-active motifs between cell types.

ChromVAR identifies motifs associated with variability in chromatin accessibility between cells. See the chromVAR paper for a complete description of the method.

mouse_brain <- RunChromVAR(
  object = mouse_brain,
  genome = BSgenome.Mmusculus.UCSC.mm10
)

DefaultAssay(mouse_brain) <- 'chromvar'

# look at the activity of Mef2c
p2 <- FeaturePlot(
  object = mouse_brain,
  features = "MA0497.1",
  min.cutoff = 'q10',
  max.cutoff = 'q90',
  pt.size = 0.1
)
p1 + p2

We can also directly test for differential activity scores between cell types. This tends to give similar results as performing an enrichment test on differentially accessible peaks between the cell types (shown above).

differential.activity <- FindMarkers(
  object = mouse_brain,
  ident.1 = 'Pvalb',
  ident.2 = 'Sst',
  only.pos = TRUE,
  test.use = 'LR',
  latent.vars = 'nCount_peaks'
)

MotifPlot(
  object = mouse_brain,
  motifs = head(rownames(differential.activity)),
  assay = 'peaks'
)

Session Info
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] patchwork_1.0.1                    BSgenome.Mmusculus.UCSC.mm10_1.4.0
##  [3] BSgenome_1.56.0                    rtracklayer_1.48.0                
##  [5] Biostrings_2.56.0                  XVector_0.28.0                    
##  [7] GenomicRanges_1.40.0               GenomeInfoDb_1.24.2               
##  [9] IRanges_2.22.2                     S4Vectors_0.26.1                  
## [11] BiocGenerics_0.34.0                TFBSTools_1.26.0                  
## [13] JASPAR2020_0.99.10                 Seurat_3.2.0.9006                 
## [15] Signac_1.0.0                      
## 
## loaded via a namespace (and not attached):
##   [1] rappdirs_0.3.1              SnowballC_0.7.0            
##   [3] GGally_2.0.0                R.methodsS3_1.8.0          
##   [5] nabor_0.5.0                 tidyr_1.1.1                
##   [7] ggplot2_3.3.2               acepack_1.4.1              
##   [9] bit64_4.0.2                 knitr_1.29                 
##  [11] irlba_2.3.3                 DelayedArray_0.14.1        
##  [13] R.utils_2.9.2               data.table_1.13.0          
##  [15] rpart_4.1-15                KEGGREST_1.28.0            
##  [17] RCurl_1.98-1.2              AnnotationFilter_1.12.0    
##  [19] generics_0.0.2              GenomicFeatures_1.40.1     
##  [21] cowplot_1.0.0               RSQLite_2.2.0              
##  [23] RANN_2.6.1                  chromVAR_1.10.0            
##  [25] future_1.18.0               bit_4.0.4                  
##  [27] spatstat.data_1.4-3         httpuv_1.5.4               
##  [29] SummarizedExperiment_1.18.2 assertthat_0.2.1           
##  [31] DirichletMultinomial_1.30.0 xfun_0.16                  
##  [33] hms_0.5.3                   evaluate_0.14              
##  [35] promises_1.1.1              progress_1.2.2             
##  [37] caTools_1.18.0              dbplyr_1.4.4               
##  [39] igraph_1.2.5                DBI_1.1.0                  
##  [41] htmlwidgets_1.5.1           reshape_0.8.8              
##  [43] purrr_0.3.4                 ellipsis_0.3.1             
##  [45] dplyr_1.0.1                 backports_1.1.8            
##  [47] annotate_1.66.0             biomaRt_2.44.1             
##  [49] deldir_0.1-28               vctrs_0.3.2                
##  [51] Biobase_2.48.0              ensembldb_2.12.1           
##  [53] ROCR_1.0-11                 abind_1.4-5                
##  [55] checkmate_2.0.0             sctransform_0.2.1          
##  [57] GenomicAlignments_1.24.0    prettyunits_1.1.1          
##  [59] goftest_1.2-2               cluster_2.1.0              
##  [61] ape_5.4                     lazyeval_0.2.2             
##  [63] seqLogo_1.54.3              crayon_1.3.4               
##  [65] pkgconfig_2.0.3             labeling_0.3               
##  [67] nlme_3.1-148                ProtGenerics_1.20.0        
##  [69] nnet_7.3-14                 rlang_0.4.7                
##  [71] globals_0.12.5              lifecycle_0.2.0            
##  [73] miniUI_0.1.1.1              BiocFileCache_1.12.1       
##  [75] rsvd_1.0.3                  dichromat_2.0-0            
##  [77] rprojroot_1.3-2             polyclip_1.10-0            
##  [79] matrixStats_0.56.0          lmtest_0.9-37              
##  [81] graph_1.66.0                Matrix_1.2-18              
##  [83] ggseqlogo_0.1               zoo_1.8-8                  
##  [85] base64enc_0.1-3             ggridges_0.5.2             
##  [87] png_0.1-7                   viridisLite_0.3.0          
##  [89] bitops_1.0-6                R.oo_1.23.0                
##  [91] KernSmooth_2.23-17          blob_1.2.1                 
##  [93] stringr_1.4.0               qvalue_2.20.0              
##  [95] readr_1.3.1                 jpeg_0.1-8.1               
##  [97] CNEr_1.24.0                 scales_1.1.1               
##  [99] memoise_1.1.0               magrittr_1.5               
## [101] plyr_1.8.6                  ica_1.0-2                  
## [103] zlibbioc_1.34.0             compiler_4.0.1             
## [105] RColorBrewer_1.1-2          fitdistrplus_1.1-1         
## [107] Rsamtools_2.4.0             listenv_0.8.0              
## [109] pbapply_1.4-2               htmlTable_2.0.1            
## [111] Formula_1.2-3               MASS_7.3-51.6              
## [113] mgcv_1.8-31                 tidyselect_1.1.0           
## [115] stringi_1.4.6               highr_0.8                  
## [117] yaml_2.2.1                  askpass_1.1                
## [119] latticeExtra_0.6-29         ggrepel_0.8.2              
## [121] grid_4.0.1                  VariantAnnotation_1.34.0   
## [123] fastmatch_1.1-0             tools_4.0.1                
## [125] future.apply_1.6.0          rstudioapi_0.11            
## [127] TFMPvalue_0.0.8             foreign_0.8-80             
## [129] lsa_0.73.2                  gridExtra_2.3              
## [131] farver_2.0.3                Rtsne_0.15                 
## [133] digest_0.6.25               BiocManager_1.30.10        
## [135] shiny_1.5.0                 pracma_2.2.9               
## [137] motifmatchr_1.10.0          Rcpp_1.0.5                 
## [139] later_1.1.0.1               RcppAnnoy_0.0.16           
## [141] OrganismDbi_1.30.0          httr_1.4.2                 
## [143] AnnotationDbi_1.50.3        ggbio_1.36.0               
## [145] biovizBase_1.36.0           colorspace_1.4-1           
## [147] XML_3.99-0.5                fs_1.5.0                   
## [149] tensor_1.5                  reticulate_1.16            
## [151] splines_4.0.1               uwot_0.1.8                 
## [153] RBGL_1.64.0                 RcppRoll_0.3.0             
## [155] spatstat.utils_1.17-0       pkgdown_1.5.1              
## [157] plotly_4.9.2.1              xtable_1.8-4               
## [159] jsonlite_1.7.0              poweRlaw_0.70.6            
## [161] spatstat_1.64-1             R6_2.4.1                   
## [163] Hmisc_4.4-0                 pillar_1.4.6               
## [165] htmltools_0.5.0             mime_0.9                   
## [167] glue_1.4.1                  fastmap_1.0.1              
## [169] DT_0.14                     BiocParallel_1.22.0        
## [171] codetools_0.2-16            lattice_0.20-41            
## [173] tibble_3.0.3                curl_4.3                   
## [175] leiden_0.3.3                gtools_3.8.2               
## [177] GO.db_3.11.4                openssl_1.4.2              
## [179] survival_3.2-3              rmarkdown_2.2              
## [181] desc_1.2.0                  munsell_0.5.0              
## [183] GenomeInfoDbData_1.2.3      reshape2_1.4.4             
## [185] gtable_0.3.0