vignettes/pasta_vignette.Rmd
pasta_vignette.Rmd
While scRNA-seq has been widely applied to characterize cellular heterogeneity in gene expression levels, these datasets can also be leveraged to explore variation in transcript structure as well. For example, reads from 3’ scRNA-seq datasets can be leveraged to quantify the relative usage of multiple polyadenylation (polyA) sites within the same gene.
In Kowalski* , Wessels*, Linder* et al, we introduce a statistical framework, based on the Dirichlet multinomial distribution, to characterize heterogeneity in the relative use of polyA sites at single-cell resolution. We calculate polyA-residuals, which indicate - for each polyA site in each cell - the degree of over or underutilization compared to a background model. We use these residuals for both supervised differential analysis (differential polyadenylation between groups of cells), but also unsupervised analysis and visualization.
We have implemented these methods in an R package PASTA (PolyA Site analysis using relative Transcript Abundance) that interfaces directly with Seurat. In this vignette, we demonstrate how to use PASTA and Seurat to analyze a dataset of 49,958 circulating human peripheral blood mononuclear cells (PBMC). This vignette demonstrates how to reproduce results from Figure 6 in our manuscript. While this represents an initial demonstration of PASTA, we will be adding significant functionality and documentation in future releases.
First, we install PASTA and dependencies.
# install remotes and BiocManager if necessary
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# additionally install several Bioconductor dependencies
BiocManager::install(c("BSgenome.Hsapiens.UCSC.hg38", "EnsDb.Hsapiens.v86", "rtracklayer", "GenomicFeatures",
"plyranges", "org.Hs.eg.db", "clusterProfiler", "enrichplot"))
remotes::install_github("satijalab/PASTA")
# loading PASTA also loads Seurat and Signac
library(PASTA)
library(ggplot2)
library(EnsDb.Hsapiens.v86)
library(org.Hs.eg.db)
library(clusterProfiler)
library(enrichplot)
The data for this vignette can be found on the following site: The files include:
# dir = directory where files are downloaded
counts.file <- paste0(dir, "PBMC_pA_counts.tab.gz")
peak.file <- paste0(dir, "PBMC_polyA_peaks.gff")
fragment.file <- paste0(dir, "PBMC_fragments.tsv.gz")
polyAdb.file <- paste0(dir, "human_PAS_hg38.txt")
metadata.file <- paste0(dir, "PBMC_meta_data.csv")
First, we create a Seurat object based on a pre-quantified scRNA-seq expression matrix of human PBMC. We then annotate cell types using our Azimuth reference, as described here.
library(Azimuth)
counts = Read10X(dir)
pbmc <- CreateSeuratObject(counts, min.cells = 3, min.features = 200)
# add meta data
meta_data <- read.csv(file = metadata.file, row.names = 1)
pbmc <- AddMetaData(pbmc, metadata = meta_data)
# Run Azimuth for each Donor
obj.list <- SplitObject(pbmc, split.by = "donor")
obj.list <- lapply(obj.list, FUN = RunAzimuth, reference = "pbmcref")
pbmc <- merge(obj.list[[1]], obj.list[2:length(obj.list)], merge.dr = "ref.umap")
# note that Azimuth provides higher-resolution annotations with pbmc$celltype.l2
Idents(pbmc) <- pbmc$celltype.l1
library(RColorBrewer)
cols <- colorRampPalette(brewer.pal(8, "Dark2"))(30)
cols.l1 <- cols[c(1:4, 9, 11, 25, 29, 30)]
names(cols.l1) <- c("NK", "CD4 T", "CD8 T", "Other T", "B", "Plasma", "Mono", "DC", "Other")
DimPlot(pbmc, reduction = "ref.umap", label = TRUE, cols = cols.l1)
Next, we create a polyA assay that we will add to the Seurat object,
which quantifies the usage (counts) of individual polyA sites in single
cells. PASTA currently accepts polyA site quantifications from the polyApipe
pipeline, but we will be adding support for additional tools in
future releases. The ReadPolyApipe
function takes three
input files
blocks
function in sinto.Once we read in these files, we can create a polyA assay, and add it to the Seurat pbmc object.
options(Seurat.object.assay.calcn = TRUE)
polyA.counts = ReadPolyApipe(counts.file = counts.file, peaks.file = peak.file, filter.chromosomes = TRUE,
min.features = 10, min.cells = 25)
polyA.assay = CreatePolyAAssay(counts = polyA.counts, genome = "hg38", fragments = fragment.file,
validate.fragments = FALSE)
# add annotations to polyA.assay
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
genome(annotations) <- "hg38"
Annotation(polyA.assay) <- annotations
# only use cells in both assays
cells = intersect(colnames(pbmc), colnames(polyA.assay))
polyA.assay = subset(polyA.assay, cells = cells)
pbmc = subset(pbmc, cells = cells)
pbmc[["polyA"]] <- polyA.assay #add polyA assay to Seurat object
DefaultAssay(pbmc) <- "polyA"
pbmc
## An object of class Seurat
## 57907 features across 49958 samples within 5 assays
## Active assay: polyA (41480 features, 0 variable features)
## 2 layers present: counts, data
## 4 other assays present: RNA, prediction.score.celltype.l1, prediction.score.celltype.l2, prediction.score.celltype.l3
## 2 dimensional reductions calculated: integrated_dr, ref.umap
We next annotate each polyA site with information from the PolyA_DB v3
database, which catalogs polyA sites in the human genome based on deep
sequencing data. This GetPolyADbAnnotation
function assigns
each polyA site to a gene, and assigns a location within the transcript
(intron, last exon, etc.). This information is stored in the assay, and
can be accessed using the [[]]
feature metadata accessor
function.
Note that for polyA sites that receive an NA
for the
Gene_Symbol
field, these sites did not map to any site
within 50bp in the polyAdbv3 database, and may be spurious.
# max.dist parameter controls maximum distance from polyAdbv3 site
pbmc <- GetPolyADbAnnotation(pbmc, polyAdb.file = polyAdb.file, max.dist = 50)
meta <- pbmc[["polyA"]][[]]
head(meta[, c(1, 13, 15, 18)])
## strand PAS_ID Intron.exon_location
## 10-100000560-100000859 + <NA> <NA>
## 10-100150094-100150393 - chr10:101909851:- 3'_most_exon
## 10-100181533-100181832 - <NA> <NA>
## 10-100188367-100188666 - chr10:101948124:- 3'_most_exon
## 10-100232298-100232597 - chr10:101992055:- 3'_most_exon
## 10-100237155-100237454 - chr10:101996912:- Intron
## Gene_Symbol
## 10-100000560-100000859 <NA>
## 10-100150094-100150393 ERLIN1
## 10-100181533-100181832 <NA>
## 10-100188367-100188666 CHUK
## 10-100232298-100232597 CWF19L1
## 10-100237155-100237454 CWF19L1
We can now calculate polyA residuals, as described in our manuscript. For this analysis we will focus on tandem
polyadenylation events reflecting the usage of distinct polyA sites
located in the 3’ UTR of the terminal exon. Once we select features, we
can run the CalcPolyAResiduals
function. The residuals are
stored in the scale.data
slot of the polyA assay.
# extract features in 3' UTRs
features.last.exon = rownames(subset(meta, Intron.exon_location == "3'_most_exon"))
length(features.last.exon)
## [1] 14985
# gene names are stored in the Gene_Symbol column of the meta features
pbmc <- CalcPolyAResiduals(pbmc, assay = "polyA", features = features.last.exon, gene.names = "Gene_Symbol",
verbose = TRUE)
Now, we can perform dimension reduction directly on the polyA
residuals. The workflow is similar to scRNA-seq: we use
FindVariableFeatures
to identify polyA sites that vary
across cells, and then run perform dimension reduction on the variable
polyA sites. We see that there is a very strong separation between
plasma cells and other celltypes.
Plasma cells are known to change their polyA site usage at the IGHM gene, which was one of the first described instances of alternative polydaenylation (Alt et al, Cell, 1980). They have also been shown to change their 3’UTR usage more broadly (Cheng et al, Nature Communications, 2020). Next, we will perform transcriptome-wide analysis to characterize genes whose 3’UTR length changes between plasma cells and B cells.
pbmc <- FindVariableFeatures(pbmc, selection.method = "residuals", gene.names = "Gene_Symbol")
pbmc <- RunPCA(pbmc)
pbmc <- RunUMAP(pbmc, dims = 1:30, reduction.name = "polyA.umap", reduction.key = "polyAUMAP_")
DimPlot(pbmc, group.by = "celltype.l1", reduction = "polyA.umap", cols = cols.l1) + ggtitle("Level 1 Annotations")
Analagous to differential expression anlaysis, we can also perform differential polyadenylation analysis. As described in our manuscript, this function aims to identify polyA sites whose relative usage within a gene changes across different groups of cells. Here, we identify differentially polyadenylated sites between Plasma cells and B cells.
The FindDifferentialPolyA function returns the following outputs:
Idents(pbmc) <- pbmc$celltype.l1
m.plasma <- FindDifferentialPolyA(pbmc, ident.1 = "Plasma", ident.2 = "B", covariates = "donor")
head(m.plasma, 10)
## Estimate p.value p_val_adj percent.1 percent.2
## 12-49762706-49763005 1.1754624 0.000000e+00 0.000000e+00 0.7703972 0.1868293
## 17-41690699-41690998 1.1412147 0.000000e+00 0.000000e+00 0.7710967 0.4299406
## 3-57572076-57572375 1.0510184 0.000000e+00 0.000000e+00 0.7683768 0.2023810
## 6-7882814-7883113 0.6945195 0.000000e+00 0.000000e+00 0.3646308 0.1254658
## 3-57571365-57571664 -0.9942404 0.000000e+00 0.000000e+00 0.2316232 0.7976190
## 17-41691345-41691644 -1.1093621 0.000000e+00 0.000000e+00 0.2264282 0.5646424
## 12-49764635-49764934 -1.1270990 0.000000e+00 0.000000e+00 0.2296028 0.8131707
## 6-7881521-7881820 -0.5560617 1.024656e-290 8.419598e-287 0.6177370 0.8621118
## 18-59327823-59328122 -0.5820680 3.663558e-272 3.010346e-268 0.3585233 0.8545455
## 18-59330929-59331228 0.6292856 1.420368e-269 1.167116e-265 0.6414767 0.1454545
## symbol
## 12-49762706-49763005 TMBIM6
## 17-41690699-41690998 EIF1
## 3-57572076-57572375 ARF4
## 6-7882814-7883113 TXNDC5
## 3-57571365-57571664 ARF4
## 17-41691345-41691644 EIF1
## 12-49764635-49764934 TMBIM6
## 6-7881521-7881820 TXNDC5
## 18-59327823-59328122 LMAN1
## 18-59330929-59331228 LMAN1
We can also visualize the read coverage across cell groups, which
demonstrates changes in transcript structure across groups of cells. The
PolyACoveragePlot
function is based off the
CoveragePlot
function in Signac, and uses the previously
computed fragment file to compute local coverage.
Beneath the coverage plot, we also display the location of polyA sites (peaks), as well as gene annotations. Note that in each of these examples, plasma cells exhibit increased utilization of the proximal polyA site (i.e. 3’ UTR shortening).
Idents(pbmc) <- pbmc$celltype.l1
# by default shows all polyAsites where we calculated polyA residuals in a gene
PolyACoveragePlot(pbmc, gene = "EIF1") & scale_fill_manual(values = cols.l1)
PolyACoveragePlot(pbmc, gene = "TMBIM6") & scale_fill_manual(values = cols.l1)
To understand if the genes that change their 3’UTR usage in plasma cells compared to B cells are enriched in any particular biological process, we perform GO enrichment analysis, as implemented in the clusterProfiler package.
First, we convert the gene symbol of these genes to Entrez IDs, then use the enrichGO function to find pathways that are enriched. Gene Ontology analysis revealed a strong enrichment for Golgi vesicle transport and protein localization, which are linked to the core secretory phenotypes of plasma cells.
plasma.genes <- unique(subset(m.plasma, p_val_adj < 0.05)$symbol)
g.go <- bitr(plasma.genes, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = org.Hs.eg.db)
go.results <- enrichGO(g.go$ENTREZID, "org.Hs.eg.db", ont = "BP", pvalueCutoff = 0.05)
go.results <- setReadable(go.results, OrgDb = org.Hs.eg.db)
go.results.simp <- simplify(go.results, cutoff = 0.7, by = "p.adjust", select_fun = min)
p <- dotplot(go.results.simp)
p + scale_colour_gradient(low = "maroon", high = "grey")
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] RColorBrewer_1.1-3 rlang_1.1.4
## [3] enrichplot_1.22.0 clusterProfiler_4.10.1
## [5] org.Hs.eg.db_3.18.0 EnsDb.Hsapiens.v86_2.99.0
## [7] ensembldb_2.26.0 AnnotationFilter_1.26.0
## [9] GenomicFeatures_1.54.4 AnnotationDbi_1.64.1
## [11] Biobase_2.62.0 GenomicRanges_1.54.1
## [13] GenomeInfoDb_1.38.8 IRanges_2.36.0
## [15] S4Vectors_0.40.2 BiocGenerics_0.48.1
## [17] ggplot2_3.5.1 PASTA_0.0.0.9000
## [19] Signac_1.13.0 Seurat_5.1.0
## [21] SeuratObject_5.0.2 sp_2.1-4
##
## loaded via a namespace (and not attached):
## [1] fs_1.6.4 ProtGenerics_1.34.0
## [3] matrixStats_1.3.0 spatstat.sparse_3.1-0
## [5] bitops_1.0-7 HDO.db_0.99.1
## [7] httr_1.4.7 backports_1.5.0
## [9] tools_4.3.2 sctransform_0.4.1
## [11] utf8_1.2.4 R6_2.5.1
## [13] lazyeval_0.2.2 uwot_0.2.2
## [15] withr_3.0.0 prettyunits_1.2.0
## [17] gridExtra_2.3 progressr_0.14.0
## [19] MGLM_0.2.1 cli_3.6.3
## [21] textshaping_0.4.0 formatR_1.14
## [23] spatstat.explore_3.2-7 fastDummies_1.7.3
## [25] scatterpie_0.2.1 sandwich_3.1-0
## [27] labeling_0.4.3 sass_0.4.9
## [29] spatstat.data_3.1-2 ggridges_0.5.6
## [31] pbapply_1.7-2 pkgdown_2.0.7
## [33] Rsamtools_2.18.0 systemfonts_1.1.0
## [35] yulab.utils_0.1.4 foreign_0.8-86
## [37] gson_0.1.0 R.utils_2.12.3
## [39] DOSE_3.28.2 dichromat_2.0-0.1
## [41] parallelly_1.37.1 BSgenome_1.70.2
## [43] rstudioapi_0.16.0 RSQLite_2.3.7
## [45] gridGraphics_0.5-1 generics_0.1.3
## [47] BiocIO_1.12.0 ica_1.0-3
## [49] spatstat.random_3.2-3 car_3.1-2
## [51] dplyr_1.1.4 GO.db_3.18.0
## [53] Matrix_1.6-4 fansi_1.0.6
## [55] abind_1.4-5 R.methodsS3_1.8.2
## [57] lifecycle_1.0.4 yaml_2.3.8
## [59] carData_3.0-5 SummarizedExperiment_1.32.0
## [61] qvalue_2.34.0 SparseArray_1.2.4
## [63] BiocFileCache_2.10.2 Rtsne_0.17
## [65] grid_4.3.2 blob_1.2.4
## [67] promises_1.3.0 crayon_1.5.3
## [69] miniUI_0.1.1.1 lattice_0.22-6
## [71] cowplot_1.1.3 KEGGREST_1.42.0
## [73] pillar_1.9.0 knitr_1.47
## [75] fgsea_1.28.0 rjson_0.2.21
## [77] future.apply_1.11.2 codetools_0.2-19
## [79] fastmatch_1.1-4 leiden_0.4.3.1
## [81] glue_1.7.0 ggfun_0.1.5
## [83] data.table_1.15.4 treeio_1.26.0
## [85] vctrs_0.6.5 png_0.1-8
## [87] spam_2.10-0 gtable_0.3.5
## [89] cachem_1.1.0 xfun_0.45
## [91] S4Arrays_1.2.1 mime_0.12
## [93] tidygraph_1.2.3 survival_3.5-7
## [95] RcppRoll_0.3.0 fitdistrplus_1.1-11
## [97] ROCR_1.0-11 nlme_3.1-165
## [99] ggtree_3.10.1 bit64_4.0.5
## [101] progress_1.2.3 filelock_1.0.2
## [103] RcppAnnoy_0.0.22 rprojroot_2.0.4
## [105] bslib_0.7.0 irlba_2.3.5.1
## [107] rpart_4.1.23 KernSmooth_2.23-22
## [109] Hmisc_5.1-3 colorspace_2.1-0
## [111] DBI_1.2.3 nnet_7.3-19
## [113] tidyselect_1.2.1 bit_4.0.5
## [115] compiler_4.3.2 curl_5.2.1
## [117] htmlTable_2.4.2 xml2_1.3.6
## [119] desc_1.4.2 DelayedArray_0.28.0
## [121] plotly_4.10.4 shadowtext_0.1.2
## [123] rtracklayer_1.62.0 checkmate_2.3.1
## [125] scales_1.3.0 lmtest_0.9-40
## [127] rappdirs_0.3.3 stringr_1.5.1
## [129] digest_0.6.36 goftest_1.2-3
## [131] spatstat.utils_3.0-5 rmarkdown_2.27
## [133] XVector_0.42.0 base64enc_0.1-3
## [135] htmltools_0.5.8.1 pkgconfig_2.0.3
## [137] MatrixGenerics_1.14.0 highr_0.11
## [139] dbplyr_2.5.0 fastmap_1.2.0
## [141] htmlwidgets_1.6.4 shiny_1.8.1.1
## [143] farver_2.1.2 jquerylib_0.1.4
## [145] zoo_1.8-12 jsonlite_1.8.8
## [147] BiocParallel_1.36.0 R.oo_1.26.0
## [149] GOSemSim_2.28.1 VariantAnnotation_1.48.1
## [151] RCurl_1.98-1.14 magrittr_2.0.3
## [153] ggplotify_0.1.2 Formula_1.2-5
## [155] GenomeInfoDbData_1.2.11 dotCall64_1.1-1
## [157] patchwork_1.2.0 munsell_0.5.1
## [159] Rcpp_1.0.12 ape_5.7-1
## [161] viridis_0.6.5 reticulate_1.38.0
## [163] stringi_1.8.4 ggraph_2.1.0
## [165] zlibbioc_1.48.2 MASS_7.3-60
## [167] plyr_1.8.9 parallel_4.3.2
## [169] listenv_0.9.1 ggrepel_0.9.5
## [171] deldir_2.0-4 graphlayouts_1.1.1
## [173] Biostrings_2.70.3 splines_4.3.2
## [175] tensor_1.5 hms_1.1.3
## [177] igraph_2.0.3 spatstat.geom_3.2-9
## [179] RcppHNSW_0.6.0 reshape2_1.4.4
## [181] biomaRt_2.58.2 XML_3.99-0.16.1
## [183] evaluate_0.24.0 biovizBase_1.50.0
## [185] AER_1.2-12 tweenr_2.0.2
## [187] httpuv_1.6.15 RANN_2.6.1
## [189] tidyr_1.3.1 purrr_1.0.2
## [191] polyclip_1.10-6 future_1.33.2
## [193] scattermore_1.2 ggforce_0.4.1
## [195] xtable_1.8-4 restfulr_0.0.15
## [197] tidytree_0.4.6 gplm_0.7-4
## [199] RSpectra_0.16-1 later_1.3.2
## [201] viridisLite_0.4.2 ragg_1.3.2
## [203] tibble_3.2.1 aplot_0.2.2
## [205] memoise_2.0.1 plyranges_1.22.0
## [207] GenomicAlignments_1.38.2 cluster_2.1.6
## [209] globals_0.16.3