vignettes/COVID_SCTMapping.Rmd
COVID_SCTMapping.Rmdlibrary(Seurat) library(BPCells) library(dplyr) library(patchwork) library(ggplot2) options(future.globals.maxSize = 10 * 1024^3)
In Seurat v5, we introduce a scalable approach for reference mapping datasets from separate studies or individuals. Reference mapping is a powerful approach to identify consistent labels across studies and perform cross-dataset analysis. We emphasize that while individual datasets are manageable in size, the aggregate of many datasets often amounts to millions of cell which do not fit in-memory. Furthermore, cross-dataset analysis is often challenged by disparate or unique cell type labels. Through reference mapping, we annotate all cells with a common reference for consistent cell type labels. Importantly, we never simultaneously load all of the cells in-memory to maintain low memory usage.
In this vignette, we reference map three publicly available datasets totaling 1,498,064 cells and 277 donors which are available through CZI cellxgene collections: Ahern, et al., Nature 2022, Jin, et al., Science 2021, and Yoshida, et al., Nature 2022. Each dataset consists of PBMCs from both healthy donors and donors diagnosed with COVID-19. Using the harmonized annotations, we demonstrate how to prepare a pseudobulk object to perform differential expression analysis across disease within cell types.
Prior to running this vignette, please install Seurat v5 and see the BPCells vignette to construct the on-disk object used in this vignette. Additionally, we map to our annotated CITE-seq reference containing 162,000 cells and 228 antibodies (Hao, Hao, et al., Cell 2021) which is available for download here.
We first load the reference (available here) and
normalize the query Seurat object prepared in the BPCells interaction
vignette. The query object consists of datasets from three different
studies constructed using the CreateSeuratObject function,
which accepts a list of BPCells matrices as input. Within the Seurat
object, the three datasets reside in the RNA assay in three
separate layers on-disk.
reference <- readRDS("/brahms/hartmana/vignette_data/pbmc_multimodal_2023.rds")
object <- readRDS("/brahms/mollag/seurat_v5/vignette_data/merged_covid_object.rds")
object <- NormalizeData(object, verbose = FALSE)Using the same code from the v4 reference mapping
vignette, we find anchors between the reference and query in the
precomputed supervised PCA. We recommend the use of supervised PCA for
CITE-seq reference datasets, and demonstrate how to compute this
transformation in v4 reference mapping
vignette. In Seurat v5, we only need to call
FindTransferAnchors and MapQuery once to map
all three datasets as they are all contained within the query object.
Furthermore, utilizing the on-disk capabilities of BPCells, we map 1.5 million
cells without ever loading them all into memory.
anchor <- FindTransferAnchors(
reference = reference,
query = object,
reference.reduction = 'spca',
normalization.method = 'SCT',
dims = 1:50)
object <- MapQuery(
anchorset = anchor,
query = object,
reference = reference,
refdata = list(
celltype.l1 = "celltype.l1",
celltype.l2 = "celltype.l2"
),
reduction.model = 'wnn.umap'
)Next, we visualize all cells from the three studies which have been
projected into a UMAP-space defined by the reference. Each cell is
annotated at two levels of granularity
(predicted.celltype.l1 and
predicted.celltype.l2). We can compare the differing
ontologies used in the original annotations (cell_type) to
the now harmonized annotations (predicted.celltype.l2, for
example) that were predicted from reference-mapping. Previously, the
lack of standardization prevented us from directly performing
integrative analysis across studies, but now we can easily compare.
DimPlot(object, reduction = 'ref.umap', group.by = 'cell_type',alpha = 0.1, label = TRUE, split.by = 'publication', ncol = 3, label.size = 3) + NoLegend()
DimPlot(object, reduction = 'ref.umap', group.by = 'predicted.celltype.l2',alpha = 0.1, label = TRUE, split.by = 'publication', ncol = 3, label.size = 3) + NoLegend()
We utilize our harmonized annotations to identify differences in the proportion of different cell types between healthy individuals and COVID-19 patients. For example, we noticed a reduction in MAIT cells as well as an increase in plasmablasts among COVID-19 patients.
df_comp <- as.data.frame.matrix(table(object$donor_id, object$predicted.celltype.l2))
select.donors <- rownames(df_comp)[rowSums(df_comp)> 50]
df_comp <- df_comp[select.donors, ]
df_comp_relative <- sweep(x = df_comp, MARGIN = 1, STATS = rowSums(df_comp), FUN = '/')
df_disease <- as.data.frame.matrix(table(object$donor_id, object$disease))[select.donors, ]
df_comp_relative$disease <- 'other'
df_comp_relative$disease[df_disease$normal!=0] <- 'normal'
df_comp_relative$disease[df_disease$`COVID-19`!=0] <- 'COVID-19'
df_comp_relative$disease <- factor(df_comp_relative$disease, levels = c('normal','COVID-19','other'))
df_comp_relative <- df_comp_relative[df_comp_relative$disease %in% c('normal','COVID-19'),]
p1 <- ggplot(data = df_comp_relative, mapping = aes(x = disease, y = MAIT, fill = disease)) +
geom_boxplot(outlier.shape = NA) +
scale_fill_manual(values = c("#377eb8", "#e41a1c")) +
xlab("") + ylab('relative abundance') +
ggtitle('MAIT') +
geom_jitter(color="black", size=0.4, alpha=0.9 ) +
theme_bw() +
theme( axis.title = element_text(size = 12),
axis.text = element_text(size = 12),
plot.title = element_text(size = 15, hjust = 0.5, face = "bold")
)
p2 <- ggplot(data = df_comp_relative, mapping = aes(x = disease, y = Plasmablast, fill = disease)) +
geom_boxplot(outlier.shape = NA) +
scale_fill_manual(values = c("#377eb8", "#e41a1c")) +
xlab("") + ylab('relative abundance') +
ggtitle('Plasmablast') +
geom_jitter(color="black", size=0.4, alpha=0.9 ) +
theme_bw() +
theme( axis.title = element_text(size = 12),
axis.text = element_text(size = 12),
plot.title = element_text(size = 15, hjust = 0.5, face = "bold")
)
p1 + p2 + plot_layout(ncol = 2)
In addition to composition analysis, we use an aggregation-based
(pseudobulk) workflow to explore differential genes between healthy
individuals and COVID-19 donors. We aggregate all cells within the same
cell type and donor using the AggregateExpression function.
This returns a Seurat object where each ‘cell’ represents the pseudobulk
profile of one cell type in one individual.
bulk <- AggregateExpression(object,
return.seurat = TRUE,
assays = 'RNA',
group.by = c("predicted.celltype.l2", "donor_id", "disease")
)
bulk <- subset(bulk, subset = disease %in% c('normal', 'COVID-19') )
bulk <- subset(bulk, subset = predicted.celltype.l2 != 'Doublet')
bulk$disease <- factor(bulk$disease, levels = c('normal', 'COVID-19'))Once a pseudobulk object is created, we can perform cell type-specific differential expression analysis between healthy individuals and COVID-19 donors. Here, we only visualize certain interferon-stimulated genes which are often upregulated during viral infection.
p1 <- VlnPlot(
object = bulk, features = 'IFI6', group.by = 'predicted.celltype.l2',
split.by = 'disease', cols = c("#377eb8", "#e41a1c"))
p2 <- VlnPlot(
object = bulk, features = c('ISG15'), group.by = 'predicted.celltype.l2',
split.by = 'disease', cols = c("#377eb8", "#e41a1c"))
p3 <- VlnPlot(
object = bulk, features = c('IFIT5'), group.by = 'predicted.celltype.l2',
split.by = 'disease', cols = c("#377eb8", "#e41a1c"))
p1 + p2 + p3 + plot_layout(ncol = 1)
## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] future_1.70.0 ggplot2_4.0.3 patchwork_1.3.2 dplyr_1.2.0
## [5] BPCells_0.3.1 Seurat_5.5.0 testthat_3.3.2 SeuratObject_5.4.0
## [9] sp_2.2-1
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.18.0 jsonlite_2.0.0
## [4] magrittr_2.0.5 ggbeeswarm_0.7.3 spatstat.utils_3.2-2
## [7] farver_2.1.2 rmarkdown_2.30 fs_2.1.0
## [10] ragg_1.5.1 vctrs_0.7.1 ROCR_1.0-12
## [13] memoise_2.0.1 spatstat.explore_3.8-0 htmltools_0.5.9
## [16] usethis_3.2.1 sass_0.4.10 sctransform_0.4.3
## [19] parallelly_1.47.0 KernSmooth_2.23-26 bslib_0.10.0
## [22] htmlwidgets_1.6.4 desc_1.4.3 ica_1.0-3
## [25] plyr_1.8.9 plotly_4.12.0 zoo_1.8-15
## [28] cachem_1.1.0 igraph_2.3.0 mime_0.13
## [31] lifecycle_1.0.5 pkgconfig_2.0.3 Matrix_1.7-5
## [34] R6_2.6.1 fastmap_1.2.0 MatrixGenerics_1.22.0
## [37] fitdistrplus_1.2-6 shiny_1.13.0 digest_0.6.39
## [40] S4Vectors_0.48.1 rprojroot_2.1.1 tensor_1.5.1
## [43] RSpectra_0.16-2 irlba_2.3.7 pkgload_1.5.2
## [46] GenomicRanges_1.62.1 textshaping_1.0.5 labeling_0.4.3
## [49] progressr_0.19.0 spatstat.sparse_3.1-0 httr_1.4.8
## [52] polyclip_1.10-7 abind_1.4-8 compiler_4.5.2
## [55] withr_3.0.2 S7_0.2.1 fastDummies_1.7.6
## [58] pkgbuild_1.4.8 MASS_7.3-65 sessioninfo_1.2.3
## [61] tools_4.5.2 vipor_0.4.7 lmtest_0.9-40
## [64] otel_0.2.0 beeswarm_0.4.0 httpuv_1.6.16
## [67] future.apply_1.20.2 goftest_1.2-3 glue_1.8.0
## [70] nlme_3.1-168 promises_1.5.0 grid_4.5.2
## [73] Rtsne_0.17 cluster_2.1.8.2 reshape2_1.4.5
## [76] generics_0.1.4 gtable_0.3.6 spatstat.data_3.1-9
## [79] tidyr_1.3.2 data.table_1.18.2.1 BiocGenerics_0.56.0
## [82] spatstat.geom_3.7-3 RcppAnnoy_0.0.23 ggrepel_0.9.8
## [85] RANN_2.6.2 pillar_1.11.1 stringr_1.6.0
## [88] spam_2.11-3 RcppHNSW_0.6.0 later_1.4.8
## [91] splines_4.5.2 lattice_0.22-7 survival_3.8-3
## [94] deldir_2.0-4 tidyselect_1.2.1 miniUI_0.1.2
## [97] pbapply_1.7-4 knitr_1.51 gridExtra_2.3
## [100] Seqinfo_1.0.0 IRanges_2.44.0 scattermore_1.2
## [103] stats4_4.5.2 xfun_0.56 brio_1.1.5
## [106] devtools_2.5.1 matrixStats_1.5.0 stringi_1.8.7
## [109] lazyeval_0.2.3 yaml_2.3.12 evaluate_1.0.5
## [112] codetools_0.2-20 tibble_3.3.1 cli_3.6.6
## [115] uwot_0.2.4 xtable_1.8-8 reticulate_1.46.0
## [118] systemfonts_1.3.2 jquerylib_0.1.4 Rcpp_1.1.1-1.1
## [121] globals_0.19.1 spatstat.random_3.4-5 png_0.1-9
## [124] ggrastr_1.0.2 spatstat.univar_3.1-7 parallel_4.5.2
## [127] ellipsis_0.3.3 pkgdown_2.2.0 dotCall64_1.2
## [130] listenv_0.10.1 viridisLite_0.4.3 scales_1.4.0
## [133] ggridges_0.5.7 purrr_1.2.1 rlang_1.2.0
## [136] cowplot_1.2.0