While scRNA-seq has been widely applied to characterize cellular heterogeneity in gene expression levels, these datasets can also be leveraged to explore variation in transcript structure as well. For example, reads from 3’ scRNA-seq datasets can be leveraged to quantify the relative usage of multiple polyadenylation (polyA) sites within the same gene.

In Kowalski* , Wessels*, Linder* et al, we introduce a statistical framework, based on the Dirichlet multinomial distribution, to characterize heterogeneity in the relative use of polyA sites at single-cell resolution. We calculate polyA-residuals, which indicate - for each polyA site in each cell - the degree of over or underutilization compared to a background model. We use these residuals for both supervised differential analysis (differential polyadenylation between groups of cells), but also unsupervised analysis and visualization.

We have implemented these methods in an R package PASTA (PolyA Site analysis using relative Transcript Abundance) that interfaces directly with Seurat. In this vignette, we demonstrate how to use PASTA and Seurat to analyze a dataset of 49,958 circulating human peripheral blood mononuclear cells (PBMC). This vignette demonstrates how to reproduce results from Figure 6 in our manuscript. While this represents an initial demonstration of PASTA, we will be adding significant functionality and documentation in future releases.

First, we install PASTA and dependencies. Note: At the time of release of this vignette, there was an issue with GenomeInfoDb. You may have to install as described here, if you do not have the latest version of Bioconductor installed.

# install remotes and BiocManager if necessary
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")

# additionally install several Bioconductor dependencies
BiocManager::install(c("BSgenome.Hsapiens.UCSC.hg38", "EnsDb.Hsapiens.v86", "rtracklayer", "GenomicFeatures",
    "plyranges"))

remotes::install_github("satijalab/PASTA")
# loading PASTA also loads Seurat and Signac
library(PASTA)
library(ggplot2)
library(EnsDb.Hsapiens.v86)

The data for this vignette can be found on the following site: The files include:

# dir = directory where files are downloaded
counts.file <- paste0(dir, "PBMC_pA_counts.tab.gz")
peak.file <- paste0(dir, "PBMC_polyA_peaks.gff")
fragment.file <- paste0(dir, "PBMC_fragments.tsv.gz")
polyAdb.file <- paste0(dir, "human_PAS_hg38.txt")
metadata.file <- paste0(dir, "PBMC_meta_data.csv")

Create Seurat object and annotate cell types

First, we create a Seurat object based on a pre-quantified scRNA-seq expression matrix of human PBMC. We then annotate cell types using our Azimuth reference, as described here.

library(Azimuth)
counts = Read10X(dir)
pbmc <- CreateSeuratObject(counts, min.cells = 3, min.features = 200)

# add meta data
meta_data <- read.csv(file = metadata.file, row.names = 1)
pbmc <- AddMetaData(pbmc, metadata = meta_data)

# Run Azimuth for each Donor
obj.list <- SplitObject(pbmc, split.by = "donor")
obj.list <- lapply(obj.list, FUN = RunAzimuth, reference = "pbmcref")
pbmc <- merge(obj.list[[1]], obj.list[2:length(obj.list)], merge.dr = "ref.umap")

# note that Azimuth provides higher-resolution annotations with pbmc$celltype.l2
Idents(pbmc) <- pbmc$celltype.l1

library(RColorBrewer)
cols <- colorRampPalette(brewer.pal(8, "Dark2"))(30)
cols.l1 <- cols[c(1:4, 9, 11, 25, 29, 30)]
names(cols.l1) <- c("NK", "CD4 T", "CD8 T", "Other T", "B", "Plasma", "Mono", "DC", "Other")
DimPlot(pbmc, reduction = "ref.umap", label = TRUE, cols = cols.l1)

Adding polyA site counts to the Seurat object

Next, we create a polyA assay that we will add to the Seurat object, which quantifies the usage (counts) of individual polyA sites in single cells. PASTA currently accepts polyA site quantifications from the polyApipe pipeline, but we will be adding support for additional tools in future releases. The ReadPolyApipe function takes three input files

  • A polyA site count matrix (produced by polyApipe)
  • A list of polyA site coordinates (or peaks, also produced by polyApipe)
  • A fragment file (produced from the aligned BAM file, we recommend using the blocks function in sinto.

Once we read in these files, we can create a polyA assay, and add it to the Seurat pbmc object.

polyA.counts = ReadPolyApipe(counts.file = counts.file, peaks.file = peak.file, filter.chromosomes = TRUE,
    min.features = 10, min.cells = 25)
polyA.assay = CreatePolyAsiteAssay(counts = polyA.counts, genome = "hg38", fragments = fragment.file,
    validate.fragments = FALSE)
# add annotations to polyA.assay
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
genome(annotations) <- "hg38"
Annotation(polyA.assay) <- annotations

# only use cells in both assays
cells = intersect(colnames(pbmc), colnames(polyA.assay))
polyA.assay = subset(polyA.assay, cells = cells)
pbmc = subset(pbmc, cells = cells)
pbmc[["polyA"]] <- polyA.assay  #add polyA assay to Seurat object 
DefaultAssay(pbmc) <- "polyA"
pbmc
## An object of class Seurat 
## 74239 features across 49958 samples within 6 assays 
## Active assay: polyA (41480 features, 0 variable features)
##  5 other assays present: RNA, refAssay, prediction.score.celltype.l1, prediction.score.celltype.l2, prediction.score.celltype.l3
##  1 dimensional reduction calculated: ref.umap

Annotate polyA sites using polyAdbv3

We next annotate each polyA site with information from the PolyA_DB v3 database, which catalogs polyA sites in the human genome based on deep sequencing data. This GetPolyADbAnnotation function assigns each polyA site to a gene, and assigns a location within the transcript (intron, last exon, etc.). This information is stored in the assay, and can be accessed using the [[]] feature metadata accessor function.

Note that for polyA sites that receive an NA for the Gene_Symbol field, these sites did not map to any site within 50bp in the polyAdbv3 database, and may be spurious.

# max.dist parameter controls maximum distance from polyAdbv3 site
pbmc <- GetPolyADbAnnotation(pbmc, polyAdb.file = polyAdb.file, max.dist = 50)
meta <- pbmc[["polyA"]][[]]
head(meta[, c(1, 13, 15, 18)])
##                        strand            PAS_ID Intron.exon_location
## 10-100000560-100000859      +              <NA>                 <NA>
## 10-100150094-100150393      - chr10:101909851:-         3'_most_exon
## 10-100181533-100181832      -              <NA>                 <NA>
## 10-100188367-100188666      - chr10:101948124:-         3'_most_exon
## 10-100232298-100232597      - chr10:101992055:-         3'_most_exon
## 10-100237155-100237454      - chr10:101996912:-               Intron
##                        Gene_Symbol
## 10-100000560-100000859        <NA>
## 10-100150094-100150393      ERLIN1
## 10-100181533-100181832        <NA>
## 10-100188367-100188666        CHUK
## 10-100232298-100232597     CWF19L1
## 10-100237155-100237454     CWF19L1

Quantify relative polyA site usage at single-cell resolution

We can now calculate polyA residuals, as described in our manuscript. For this analysis we will focus on tandem polyadenylation events reflecting the usage of distinct polyA sites located in the 3’ UTR of the terminal exon. Once we select features, we can run the CalcPolyAResiduals function. The residuals are stored in the scale.data slot of the polyA assay.

# extract 14,985 features in 3' UTRs
features.last.exon = rownames(subset(meta, Intron.exon_location == "3'_most_exon"))
length(features.last.exon)
## [1] 14985
# gene names are stored in the Gene_Symbol column of the meta features
pbmc <- CalcPolyAResiduals(pbmc, assay = "polyA", features = features.last.exon, gene.names = "Gene_Symbol",
    verbose = TRUE)

Perform dimension reduction on polyA residuals

Now, we can perform dimension reduction directly on the polyA residuals. The workflow is similar to scRNA-seq: we use FindVariableFeatures to identify polyA sites that vary across cells, and then run perform dimension reduction on the variable polyA sites. We see that there is a very strong separation between Plasma cell and other celltypes.

pbmc <- FindVariableFeatures(pbmc, selection.method = "residuals", gene.names = "Gene_Symbol")
pbmc <- RunPCA(pbmc)
pbmc <- RunUMAP(pbmc, dims = 1:30, reduction.name = "polyA.umap", reduction.key = "polyAUMAP_")
DimPlot(pbmc, group.by = "celltype.l1", reduction = "polyA.umap", cols = cols.l1) + ggtitle("Level 1 Annotations")

Perform differential polyadenylation analysis

Analagous to differential expression anlaysis, we can also perform differential polyadenylation analysis. As described in our manuscript, this function aims to identify polyA sites whose relative usage within a gene changes across different groups of cells. Here, we identify differentially polyadenylated sites between Plasma cells and B cells.

The FindDifferentialPolyA function returns the following outputs:

  • Estimate: Estimated coefficient from the differential APA linear model (indicates the magnitude of change)
  • p.value : p-value from the linear model
  • p_val_adj: Bonferroni-corrected p-value
  • percent.1: The average percent usage (pseudobulk) of the polyA site in the first group of cells
  • percent.2: The average percent usage (pseudobulk) of the polyA site in the second group of cells
  • symbol: The gene that the polyA site is located in
Idents(pbmc) <- pbmc$celltype.l1
m.plasma <- FindDifferentialPolyA(pbmc, ident.1 = "Plasma", ident.2 = "B", covariates = "donor")
head(m.plasma, 10)
##                         Estimate       p.value     p_val_adj   percent.1
## 17-41690699-41690998   1.1450433  0.000000e+00  0.000000e+00 0.592298722
## 6-7882814-7883113      0.6222369  0.000000e+00  0.000000e+00 0.346950866
## 12-49762706-49763005   0.5352492  0.000000e+00  0.000000e+00 0.770397231
## 4-176331946-176332245 -0.4459568  0.000000e+00  0.000000e+00 0.069836204
## 6-7881521-7881820     -0.4895199  0.000000e+00  0.000000e+00 0.587784615
## 12-49764635-49764934  -0.5325040  0.000000e+00  0.000000e+00 0.229602769
## 17-41691345-41691644  -1.1527642  0.000000e+00  0.000000e+00 0.001901237
## 3-57571365-57571664   -0.5136573 2.588878e-313 2.128058e-309 0.228808536
## 3-57572076-57572375    0.5317318 1.823834e-275 1.499191e-271 0.759039715
## 5-177331562-177331861 -0.4413488 2.480992e-258 2.039376e-254 0.320045711
##                         percent.2 symbol
## 17-41690699-41690998  0.326026942   EIF1
## 6-7882814-7883113     0.117305459 TXNDC5
## 12-49762706-49763005  0.186829268 TMBIM6
## 4-176331946-176332245 0.193798450  SPCS3
## 6-7881521-7881820     0.806039489 TXNDC5
## 12-49764635-49764934  0.813170732 TMBIM6
## 17-41691345-41691644  0.004107767   EIF1
## 3-57571365-57571664   0.722650231   ARF4
## 3-57572076-57572375   0.183359014   ARF4
## 5-177331562-177331861 0.504907975  LMAN2

Visualize polyA site usage across groups of cells

We can also visualize the read coverage across cell groups, which demonstrates changes in transcript structure across groups of cells. The PolyACoveragePlot function is based off the CoveragePlot function in Signac, and uses the previously computed fragment file to compute local coverage. Beneath the coverage plot, we also display the location of polyA sites (peaks), as well as gene annotations. Note that in each of these examples, Plasmablasts exhibit increased utilization of the proximal polyA site (i.e. 3’ UTR shortening)

Idents(pbmc) <- pbmc$celltype.l1
# by default shows all polyAsites where we calculated polyA residuals in a gene
PolyACoveragePlot(pbmc, gene = "EIF1") & scale_fill_manual(values = cols.l1)

PolyACoveragePlot(pbmc, gene = "TMBIM6") & scale_fill_manual(values = cols.l1)

Session Info

## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-3        Azimuth_0.4.6            
##  [3] shinyBS_0.61.1            rlang_1.0.6              
##  [5] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.20.2         
##  [7] AnnotationFilter_1.20.0   GenomicFeatures_1.48.4   
##  [9] AnnotationDbi_1.58.0      Biobase_2.56.0           
## [11] GenomicRanges_1.48.0      GenomeInfoDb_1.35.15     
## [13] IRanges_2.30.0            S4Vectors_0.34.0         
## [15] BiocGenerics_0.42.0       ggplot2_3.4.0            
## [17] PASTA_0.0.0.9000          Signac_1.9.0             
## [19] SeuratObject_4.1.3        Seurat_4.3.0             
## 
## loaded via a namespace (and not attached):
##   [1] rappdirs_0.3.3                  rtracklayer_1.56.1             
##   [3] scattermore_0.8                 R.methodsS3_1.8.2              
##   [5] ragg_1.2.4                      tidyr_1.3.0                    
##   [7] bit64_4.0.5                     knitr_1.42                     
##   [9] irlba_2.3.5.1                   DelayedArray_0.22.0            
##  [11] R.utils_2.12.2                  rpart_4.1.19                   
##  [13] data.table_1.14.6               KEGGREST_1.36.0                
##  [15] RCurl_1.98-1.9                  generics_0.1.3                 
##  [17] cowplot_1.1.1                   RSQLite_2.2.12                 
##  [19] RANN_2.6.1                      future_1.31.0                  
##  [21] bit_4.0.4                       spatstat.data_3.0-0            
##  [23] xml2_1.3.3                      httpuv_1.6.8                   
##  [25] SummarizedExperiment_1.26.1     assertthat_0.2.1               
##  [27] gargle_1.2.0                    xfun_0.37                      
##  [29] hms_1.1.2                       jquerylib_0.1.4                
##  [31] evaluate_0.20                   promises_1.2.0.1               
##  [33] fansi_1.0.4                     restfulr_0.0.15                
##  [35] progress_1.2.2                  dbplyr_2.2.1                   
##  [37] igraph_1.3.5                    DBI_1.1.2                      
##  [39] htmlwidgets_1.6.1               spatstat.geom_3.0-6            
##  [41] googledrive_2.0.0               purrr_1.0.1                    
##  [43] ellipsis_0.3.2                  backports_1.4.1                
##  [45] dplyr_1.1.0                     biomaRt_2.52.0                 
##  [47] deldir_1.0-6                    MatrixGenerics_1.8.0           
##  [49] vctrs_0.5.2                     SeuratDisk_0.0.0.9020          
##  [51] ROCR_1.0-11                     adiposeref.SeuratData_1.0.0    
##  [53] abind_1.4-5                     cachem_1.0.6                   
##  [55] withr_2.5.0                     BSgenome_1.64.0                
##  [57] progressr_0.13.0                checkmate_2.1.0                
##  [59] presto_1.0.0                    sctransform_0.3.5              
##  [61] GenomicAlignments_1.32.1        prettyunits_1.1.1              
##  [63] goftest_1.2-3                   cluster_2.1.4                  
##  [65] lazyeval_0.2.2                  crayon_1.5.2                   
##  [67] hdf5r_1.3.5                     spatstat.explore_3.0-6         
##  [69] labeling_0.4.2                  pkgconfig_2.0.3                
##  [71] nlme_3.1-162                    ProtGenerics_1.28.0            
##  [73] nnet_7.3-18                     globals_0.16.2                 
##  [75] lifecycle_1.0.3                 miniUI_0.1.1.1                 
##  [77] filelock_1.0.2                  BiocFileCache_2.4.0            
##  [79] pbmc3k.SeuratData_3.1.4         SeuratData_0.2.2               
##  [81] dichromat_2.0-0.1               cellranger_1.1.0               
##  [83] rprojroot_2.0.3                 polyclip_1.10-4                
##  [85] matrixStats_0.63.0              lmtest_0.9-40                  
##  [87] hcabm40k.SeuratData_3.0.0       Matrix_1.5-1                   
##  [89] zoo_1.8-11                      base64enc_0.1-3                
##  [91] ggridges_0.5.4                  googlesheets4_1.0.0            
##  [93] png_0.1-8                       viridisLite_0.4.1              
##  [95] rjson_0.2.21                    mousecortexref.SeuratData_1.0.0
##  [97] bitops_1.0-7                    shinydashboard_0.7.2           
##  [99] R.oo_1.25.0                     KernSmooth_2.23-20             
## [101] Biostrings_2.64.0               blob_1.2.3                     
## [103] stringr_1.5.0                   parallelly_1.34.0              
## [105] spatstat.random_3.1-3           jpeg_0.1-10                    
## [107] scales_1.2.1                    memoise_2.0.1                  
## [109] magrittr_2.0.3                  plyr_1.8.8                     
## [111] ica_1.0-3                       zlibbioc_1.42.0                
## [113] compiler_4.2.2                  BiocIO_1.6.0                   
## [115] fitdistrplus_1.1-8              Rsamtools_2.12.0               
## [117] cli_3.6.0                       XVector_0.36.0                 
## [119] listenv_0.9.0                   patchwork_1.1.2                
## [121] pbapply_1.7-0                   htmlTable_2.4.1                
## [123] Formula_1.2-4                   formatR_1.12                   
## [125] MASS_7.3-58.2                   tidyselect_1.2.0               
## [127] stxBrain.SeuratData_0.1.1       stringi_1.7.12                 
## [129] textshaping_0.3.6               highr_0.10                     
## [131] yaml_2.3.7                      latticeExtra_0.6-30            
## [133] ggrepel_0.9.3                   grid_4.2.2                     
## [135] VariantAnnotation_1.42.1        sass_0.4.5                     
## [137] fastmatch_1.1-3                 tools_4.2.2                    
## [139] bmcite.SeuratData_0.3.0         future.apply_1.10.0            
## [141] parallel_4.2.2                  rstudioapi_0.13                
## [143] foreign_0.8-82                  gridExtra_2.3                  
## [145] farver_2.1.1                    plyranges_1.16.0               
## [147] Rtsne_0.16                      digest_0.6.31                  
## [149] rgeos_0.5-9                     shiny_1.7.4                    
## [151] MGLM_0.2.1                      Rcpp_1.0.10                    
## [153] later_1.3.0                     RcppAnnoy_0.0.20               
## [155] httr_1.4.4                      biovizBase_1.44.0              
## [157] panc8.SeuratData_3.0.2          colorspace_2.1-0               
## [159] XML_3.99-0.9                    fs_1.6.0                       
## [161] tensor_1.5                      reticulate_1.28                
## [163] splines_4.2.2                   uwot_0.1.14                    
## [165] RcppRoll_0.3.0                  spatstat.utils_3.0-1           
## [167] pkgdown_2.0.6                   sp_1.6-0                       
## [169] plotly_4.10.1.9000              systemfonts_1.0.4              
## [171] xtable_1.8-4                    jsonlite_1.8.4                 
## [173] pbmcref.SeuratData_1.0.0        R6_2.5.1                       
## [175] Hmisc_4.7-2                     pillar_1.8.1                   
## [177] htmltools_0.5.4                 mime_0.12                      
## [179] glue_1.6.2                      fastmap_1.1.0                  
## [181] DT_0.26                         BiocParallel_1.30.0            
## [183] codetools_0.2-19                ifnb.SeuratData_3.1.0          
## [185] utf8_1.2.3                      lattice_0.20-45                
## [187] bslib_0.4.2                     spatstat.sparse_3.0-0          
## [189] tibble_3.1.8                    curl_5.0.0                     
## [191] leiden_0.4.3                    interp_1.1-3                   
## [193] shinyjs_2.1.0                   survival_3.4-0                 
## [195] rmarkdown_2.20                  desc_1.4.1                     
## [197] munsell_0.5.0                   GenomeInfoDbData_1.2.8         
## [199] thp1.eccite.SeuratData_3.1.5    reshape2_1.4.4                 
## [201] gtable_0.3.1