In this tutorial, we will perform DNA sequence motif analysis in Signac. We will explore two complementary options for performing motif analysis: one by finding overrepresented motifs in a set of differentially accessible peaks, one method performing differential motif activity analysis between groups of cells.

In this demonstration we use data from the adult mouse brain. See our vignette for the code used to generate this object, and links to the raw data. First, load the required packages and the pre-computed Seurat object:

library(Signac)
library(Seurat)
library(JASPAR2018)
library(TFBSTools)
library(BSgenome.Mmusculus.UCSC.mm10)
set.seed(1234)
mouse_brain <- readRDS("../vignette_data/adult_mouse_brain.rds")
mouse_brain
## An object of class Seurat
## 178653 features across 3795 samples within 2 assays
## Active assay: peaks (157203 features)
##  1 other assay present: RNA
##  2 dimensional reductions calculated: lsi, umap
p1 <- DimPlot(mouse_brain, label = TRUE, pt.size = 0.1) + NoLegend()
p1

The Motif class

To facilitate motif analysis in Signac, we have create the Motif class to store all the required information, including a list of position weight matrices (PWMs) or position frequency matrices (PFMs) and a motif occurrence matrix. Here, we construct a Motif object and add it to our mouse brain dataset. A motif object can be added to any Seurat assay using the AddMotifObject function:

# Get a list of motif position frequency matrices from the JASPAR database
pfm <- getMatrixSet(
  x = JASPAR2018,
  opts = list(species = 9606, all_versions = FALSE)
)

# Scan the DNA sequence of each peak for the presence of each motif
motif.matrix <- CreateMotifMatrix(
  features = StringToGRanges(rownames(mouse_brain), sep = c(":", "-")),
  pwm = pfm,
  genome = 'mm10',
  sep = c(":", "-")
)

# Create a new Mofif object to store the results
motif <- CreateMotifObject(
  data = motif.matrix,
  pwm = pfm
)

# Add the Motif object to the assay
mouse_brain[['peaks']] <- AddMotifObject(
  object = mouse_brain[['peaks']],
  motif.object = motif
)

In order to test for overrepresented motifs, we also need to compute some sequence characteristics of the peaks, such as GC content, sequence length, and dinucleotide frequency. The RegionStats function computes this information for us and stores the results in the feature metadata in the Seurat object.

mouse_brain <- RegionStats(
  object = mouse_brain,
  genome = BSgenome.Mmusculus.UCSC.mm10,
  sep = c(":", "-")
)

Finding overrepresented motifs

To identify potentially important cell-type-specific regulatory sequences, we can search for DNA motifs that are overrepresented in a set of peaks that are differentially accessible between cell types.

Here, we find differentially accessible peaks between Pvalb and Sst inhibitory interneurons. We then perform a hypergeometric test to test the probability of observing the motif at the given frequency by chance, comparing with a background set of peaks matched for GC content.

da_peaks <- FindMarkers(
  object = mouse_brain,
  ident.1 = 'Pvalb',
  ident.2 = 'Sst',
  only.pos = TRUE,
  test.use = 'LR',
  latent.vars = 'nCount_peaks'
)

# Test the differentially accessible peaks for overrepresented motifs
enriched.motifs <- FindMotifs(
  object = mouse_brain,
  features = head(rownames(da_peaks), 1000)
)
head(enriched.motifs)
##             motif observed background percent.observed percent.background
## MA0773.1 MA0773.1      317       5398             31.7            13.4950
## MA0497.1 MA0497.1      439       9116             43.9            22.7900
## MA0072.1 MA0072.1      221       3122             22.1             7.8050
## MA1151.1 MA1151.1      226       3273             22.6             8.1825
## MA0660.1 MA0660.1      265       4335             26.5            10.8375
## MA0052.3 MA0052.3      320       5941             32.0            14.8525
##          fold.enrichment       pvalue  motif.name
## MA0773.1        2.349018 3.432237e-51       MEF2D
## MA0497.1        1.926283 1.040556e-50       MEF2C
## MA0072.1        2.831518 1.380437e-46 RORA(var.2)
## MA1151.1        2.761992 6.054530e-46        RORC
## MA0660.1        2.445213 9.259599e-45       MEF2B
## MA0052.3        2.154519 1.167511e-43       MEF2A

We can also plot the position weight matrices for the motifs, so we can visualize the different motif sequences.

MotifPlot(
  object = mouse_brain,
  motifs = head(rownames(enriched.motifs))
)

We and others have previously shown that Mef-family motifs, particularly Mef2c, are enriched in Pvalb-specific peaks in scATAC-seq data (https://doi.org/10.1016/j.cell.2019.05.031; https://doi.org/10.1101/615179), and further shown that Mef2c is required for the development of Pvalb interneurons (https://www.nature.com/articles/nature25999). Here our results are consistent with these findings, and we observe a strong enrichment of Mef-family motifs in the top results from FindMotifs.

Computing motif activities

We can also compute a per-cell motif activity score by running chromVAR. This allows us to visualize motif activities per cell, and also provides an alternative method of identifying differentially-active motifs between cell types.

ChromVAR identifies motifs associated with variability in chromatin accessibility between cells. See the chromVAR paper for a complete description of the method.

mouse_brain <- RunChromVAR(
  object = mouse_brain,
  genome = BSgenome.Mmusculus.UCSC.mm10
)

DefaultAssay(mouse_brain) <- 'chromvar'

# look at the activity of Mef2c, the top hit from the overrepresentation testing
p2 <- FeaturePlot(
  object = mouse_brain,
  features = rownames(enriched.motifs)[[1]],
  min.cutoff = 'q10',
  max.cutoff = 'q90',
  pt.size = 0.1
)
CombinePlots(list(p1, p2), ncol = 2)

We can also directly test for differential activity scores between cell types. This tends to give similar results as performing an enrichment test on differentially accessible peaks between the cell types (shown above)

differential.activity <- FindMarkers(
  object = mouse_brain,
  ident.1 = 'Pvalb',
  ident.2 = 'Sst',
  only.pos = TRUE,
  test.use = 'LR',
  latent.vars = 'nCount_peaks'
)

MotifPlot(
  object = mouse_brain,
  motifs = head(rownames(differential.activity)),
  assay = 'peaks'
)

Session Info

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets
## [8] methods   base
##
## other attached packages:
##  [1] BSgenome.Mmusculus.UCSC.mm10_1.4.0 BSgenome_1.52.0
##  [3] rtracklayer_1.44.4                 Biostrings_2.52.0
##  [5] XVector_0.24.0                     GenomicRanges_1.36.1
##  [7] GenomeInfoDb_1.20.0                IRanges_2.18.3
##  [9] S4Vectors_0.22.1                   BiocGenerics_0.30.0
## [11] TFBSTools_1.22.0                   JASPAR2018_1.1.1
## [13] Seurat_3.1.1.9010                  Signac_0.1.6
##
## loaded via a namespace (and not attached):
##   [1] reticulate_1.13             R.utils_2.9.0
##   [3] tidyselect_0.2.5            poweRlaw_0.70.2
##   [5] RSQLite_2.1.2               AnnotationDbi_1.46.1
##   [7] htmlwidgets_1.5.1           grid_3.6.1
##   [9] BiocParallel_1.18.1         Rtsne_0.15
##  [11] munsell_0.5.0               codetools_0.2-16
##  [13] ica_1.0-2                   DT_0.9
##  [15] miniUI_0.1.1.1              future_1.15.0
##  [17] colorspace_1.4-1            chromVAR_1.6.0
##  [19] OrganismDbi_1.26.0          Biobase_2.44.0
##  [21] knitr_1.25                  rstudioapi_0.10
##  [23] ROCR_1.0-7                  gbRd_0.4-11
##  [25] listenv_0.7.0               labeling_0.3
##  [27] Rdpack_0.11-0               GenomeInfoDbData_1.2.1
##  [29] bit64_0.9-7                 rprojroot_1.3-2
##  [31] vctrs_0.2.0                 xfun_0.10
##  [33] biovizBase_1.32.0           ggseqlogo_0.1
##  [35] R6_2.4.0                    rsvd_1.0.2
##  [37] VGAM_1.1-1                  AnnotationFilter_1.8.0
##  [39] bitops_1.0-6                reshape_0.8.8
##  [41] DelayedArray_0.10.0         assertthat_0.2.1
##  [43] promises_1.1.0              SDMTools_1.1-221.1
##  [45] scales_1.0.0                nnet_7.3-12
##  [47] gtable_0.3.0                npsurv_0.4-0
##  [49] globals_0.12.4              ggbio_1.32.0
##  [51] ensembldb_2.8.1             seqLogo_1.50.0
##  [53] rlang_0.4.1                 zeallot_0.1.0
##  [55] splines_3.6.1               lazyeval_0.2.2
##  [57] acepack_1.4.1               dichromat_2.0-0
##  [59] checkmate_1.9.4             BiocManager_1.30.9
##  [61] yaml_2.2.0                  reshape2_1.4.3
##  [63] GenomicFeatures_1.36.4      backports_1.1.5
##  [65] httpuv_1.5.2                Hmisc_4.3-0
##  [67] RBGL_1.60.0                 tools_3.6.1
##  [69] nabor_0.5.0                 ggplot2_3.2.1
##  [71] ellipsis_0.3.0              gplots_3.0.1.1
##  [73] RColorBrewer_1.1-2          ggridges_0.5.1
##  [75] Rcpp_1.0.3                  plyr_1.8.4
##  [77] base64enc_0.1-3             progress_1.2.2
##  [79] zlibbioc_1.30.0             purrr_0.3.3
##  [81] RCurl_1.95-4.12             prettyunits_1.0.2
##  [83] rpart_4.1-15                pbapply_1.4-2
##  [85] cowplot_1.0.0               zoo_1.8-6
##  [87] SummarizedExperiment_1.14.1 ggrepel_0.8.1
##  [89] cluster_2.1.0               fs_1.3.1
##  [91] motifmatchr_1.6.0           magrittr_1.5
##  [93] data.table_1.12.6           lmtest_0.9-37
##  [95] RANN_2.6.1                  ProtGenerics_1.16.0
##  [97] fitdistrplus_1.0-14         matrixStats_0.55.0
##  [99] mime_0.7                    hms_0.5.2
## [101] lsei_1.2-0                  evaluate_0.14
## [103] xtable_1.8-4                XML_3.98-1.20
## [105] gridExtra_2.3               compiler_3.6.1
## [107] biomaRt_2.40.5              tibble_2.1.3
## [109] KernSmooth_2.23-16          crayon_1.3.4
## [111] R.oo_1.23.0                 htmltools_0.4.0
## [113] later_1.0.0                 Formula_1.2-3
## [115] tidyr_1.0.0                 RcppParallel_4.4.4
## [117] DBI_1.0.0                   MASS_7.3-51.4
## [119] Matrix_1.2-17               readr_1.3.1
## [121] R.methodsS3_1.7.1           gdata_2.18.0
## [123] metap_1.1                   igraph_1.2.4.1
## [125] pkgconfig_2.0.3             pkgdown_1.4.1
## [127] GenomicAlignments_1.20.1    TFMPvalue_0.0.8
## [129] foreign_0.8-72              plotly_4.9.1
## [131] annotate_1.62.0             DirichletMultinomial_1.26.0
## [133] bibtex_0.4.2                VariantAnnotation_1.30.1
## [135] stringr_1.4.0               digest_0.6.22
## [137] sctransform_0.2.0           RcppAnnoy_0.0.14
## [139] tsne_0.1-3                  graph_1.62.0
## [141] CNEr_1.20.0                 rmarkdown_1.16
## [143] leiden_0.3.1                htmlTable_1.13.2
## [145] uwot_0.1.4                  curl_4.2
## [147] shiny_1.4.0                 Rsamtools_2.0.3
## [149] gtools_3.8.1                lifecycle_0.1.0
## [151] nlme_3.1-142                jsonlite_1.6
## [153] desc_1.2.0                  viridisLite_0.3.0
## [155] pillar_1.4.2                lattice_0.20-38
## [157] GGally_1.4.0                fastmap_1.0.1
## [159] KEGGREST_1.24.1             httr_1.4.1
## [161] survival_3.1-7              GO.db_3.8.2
## [163] glue_1.3.1                  png_0.1-7
## [165] bit_1.1-14                  stringi_1.4.3
## [167] blob_1.2.0                  latticeExtra_0.6-28
## [169] caTools_1.17.1.2            memoise_1.1.0
## [171] dplyr_0.8.3                 irlba_2.3.3
## [173] future.apply_1.3.0          ape_5.3