The recent increase in publicly available single-cell datasets poses a significant challenge for integrative analysis. For example, multiple tissues have now been profiled across dozens of studies, representing hundreds of individuals and millions of cells. In Hao et al, 2022, we propose a dictionary learning based method, atomic sketch integration, that can also enable efficient and large-scale integrative analysis. Our procedure enables the integration of large compendiums of datasets without ever needing to load the full scale of data into memory at once. In our manuscript we use atomic sketch integration to integrate millions of scRNA-seq from human lung and human PBMC.

In this vignette, we demonstrate how to use atomic sketch integration to harmonize scRNA-seq experiments from five studies, each profiling of human immune cells (PBMC) from COVID patients. Specifically, we demonstrate how to perform the following steps

  • Sample a representative subset of cells (‘atoms’) from each dataset
  • Integrate the atoms from each dataset
  • Reconstruct (integrate) the full datasets, based on the atoms

First, we install the updated version of Seurat that supports this infrastructure, as well as other packages necessary for this vignette.

if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
}
remotes::install_github("satijalab/seurat", "feat/dictionary")

Downloading datasets

We obtained datasets in h5seurat format from a public resource compiled by the Gottardo Lab. In this analysis, we use the Arunachalam, Combes, Lee, Wilk, and Yao datasets, but you can download additional data from this resource and include it in the vignette below.

Sample representative atoms from each dataset

Inspired by pioneering work aiming to identify ‘sketches’ of scRNA-seq data, our first step is to sample a representative set of cells from each dataset. We compute a leverage score (estimate of ‘statistical leverage’) for each cell, which helps to identify cells that are likely to be member of rare subpopulations and ensure that these are included in our representative sample. Importantly, the estimation of leverage scores only requires data normalization, can be computed efficiently for sparse datasets, and does not require any intensive computation or dimensional reduction steps.

We load each object separately, perform basic preprocessing (normalization and variable feature selection), and select and store 5,000 representative cells (which we call ‘atoms’) from each dataset. We then delete the full dataset from memory, before loading the next one in.

file.dir <- "/brahms/haoy/vignette_data/PBMCVignette/"
files.set <- c("arunachalam_2020_processed.HDF5", "combes_2021_processed.HDF5", "lee_2020_processed.HDF5",
    "wilk_2020_processed.HDF5", "yao_2021_processed.HDF5")

atoms.list <- list()
for (i in 1:length(files.set)) {

    # load in Seurat object
    object <- LoadH5Seurat(file = paste0(file.dir, files.set[i]), assays = "RNA")
    dataset_name <- gsub("_processed.HDF5", "", files.set[i])
    object$dataset <- dataset_name

    # Rename cells to avoid future conflicts
    object <- RenameCells(object = object, add.cell.id = dataset_name)

    # basic preprocessing
    object <- NormalizeData(object)
    object <- FindVariableFeatures(object)

    # calculate leverage score and sample 5000 cells based on leverage score
    atoms.i <- LeverageScoreSampling(object = object, num.cells = 5000)
    atoms.list[[i]] <- atoms.i

    # delete full object from memory note that this is optional, if you can store the full
    # datasets in memory, you dont have to reload them later
    rm(object)
}

Perform integration on the atoms from different datasets

Next we perform integrative analysis on the ‘atoms’ from each of the datasets. Here, we utilize a new wrapper function that takes a list of Seurat object and runs an optimized version of the Fast integration using reciprocal PCA in Seurat workflow. The function performs all corrections in low-dimensional space (rather than on the expression values themselves) to further improve speed and memory usage, and outputs a merged Seurat object where all cells have been placed in an integrated low-dimensional space (stored as integrated_dr). We perform SCTransform normalization prior to performing integration, but this step is optional.

However, we emphasize that you can perform integration here using any analysis technique that places cells across datasets into a shared space. For example, we also demonstrate below how to use Harmony, as an alternative integration approach.

# optional step: SCTransform normalization
for (i in 1:length(atoms.list)) {
    atoms.list[[i]] <- SCTransform(atoms.list[[i]], verbose = FALSE)
}

# perform integration
features <- SelectIntegrationFeatures(object.list = atoms.list)
atoms.merge <- FastRPCAIntegration(object.list = atoms.list, dims = 1:30, normalization.method = "SCT",
    anchor.features = features)

# we can generate a 2D visualization representing the integrated atoms
atom.reduction <- "integrated_dr"
atoms.merge <- RunUMAP(atoms.merge, reduction = atom.reduction, dims = 1:30, return.model = TRUE)
DimPlot(atoms.merge, group.by = "dataset")

Alternative: integrate atoms using Harmony

As an alternative approach to integrate atoms, and to demonstrate the flexibility of our atomic sketch procedure, we can also use the Harmony within the Seurat workflow to integrate the atoms. The integration procedure returns a Seurat object with a low-dimensional space (stored as the harmony dimensional reduction) that jointly represents atoms from all datasets.

library(harmony)
atoms.merge <- merge(atoms.list[[1]], atoms.list[2:length(atoms.list)])
VariableFeatures(atoms.merge) <- SelectIntegrationFeatures(object.list = atoms.list)
atoms.merge <- ScaleData(atoms.merge)
atoms.merge <- RunPCA(atoms.merge)
atoms.merge <- RunHarmony(atoms.merge, project.dim = FALSE, group.by.vars = "dataset")
atom.reduction <- "harmony"

atoms.merge <- RunUMAP(atoms.merge, reduction = atom.reduction, dims = 1:30, return.model = TRUE)
DimPlot(atoms.merge, group.by = "dataset")

Integrate all cells from all datasets

Now that we have integrated the subset of atoms of each dataset, placing them each in an integrated low-dimensional space, we can now place each cell from each dataset in this space as well. We load the full datasets back in individually, and use the IntegrateSketchEmbeddings function to integrate all cells. After this function is run, each cell in the object has a

integrated_objects <- list()
for (i in 1:length(files.set)) {

    # load in Seurat object / basic preprocessing
    object <- LoadH5Seurat(file = paste0(file.dir, files.set[i]), assays = "RNA")
    dataset_name <- gsub("_processed.HDF5", "", files.set[i])
    object$dataset <- dataset_name
    object <- RenameCells(object = object, add.cell.id = dataset_name)
    object <- NormalizeData(object)

    # Integrate all cells into the same space as the atoms
    object <- IntegrateSketchEmbeddings(object = object, atom.sketch.object = atoms.merge, atom.sketch.reduction = atom.reduction,
        features = features)

    # At this point, you can save the results/delete the object Since we want to compute a
    # joint visualization of all cells later, we save the object with the dimensional
    # reduction and just the top 100 variable features
    object <- DietSeurat(object, features = features[1:100], dimreducs = "integrated_dr")
    integrated_objects[[i]] <- object
    rm(object)
}

We perform UMAP visualization on the integrated embeddings.

obj.merge <- merge(integrated_objects[[1]], integrated_objects[2:length(integrated_objects)], merge.dr = "integrated_dr")
obj.merge <- RunUMAP(obj.merge, reduction = "integrated_dr", dims = 1:30)

Now we can visualize the results, plotting the scRNA-seq cells based on dataset batches and pre-annotated labels annotations on the UMAP embedding. We also add pre-computed cell annotations to this object (you can download the cell annotation metadata at this link).

annotation_data <- read.table("/brahms/haoy/vignette_data/PBMCVignette/pbmc_annotations.txt")
obj.merge <- AddMetaData(obj.merge, metadata = annotation_data)
DimPlot(obj.merge, reduction = "umap", group.by = "dataset", shuffle = TRUE, raster = FALSE)

DimPlot(obj.merge, reduction = "umap", group.by = "celltype.l2", raster = FALSE)

Note that Neutrophils are present primarily in a single dataset (Combes), present at very low frequency in two others (Wilk and Lee), and absent in the remaining datasets. Despite the fact that this population is not present in all samples, it is correctly integrated by our atomic sketch procedure.

Session Info
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] patchwork_1.1.1           SeuratDisk_0.0.0.9019    
## [3] sp_1.4-6                  SeuratObject_4.0.999.9011
## [5] Seurat_4.1.0             
## 
## loaded via a namespace (and not attached):
##   [1] systemfonts_1.0.4     plyr_1.8.7            igraph_1.2.11        
##   [4] lazyeval_0.2.2        splines_4.1.3         RcppHNSW_0.3.0       
##   [7] listenv_0.8.0         scattermore_0.8       ggplot2_3.3.5        
##  [10] digest_0.6.29         htmltools_0.5.2       fansi_1.0.3          
##  [13] magrittr_2.0.3        memoise_2.0.1         tensor_1.5           
##  [16] cluster_2.1.3         ROCR_1.0-11           globals_0.14.0       
##  [19] matrixStats_0.61.0    pkgdown_2.0.2         spatstat.sparse_2.1-0
##  [22] colorspace_2.0-3      ggrepel_0.9.1         textshaping_0.3.6    
##  [25] xfun_0.30             dplyr_1.0.8           crayon_1.5.1         
##  [28] jsonlite_1.8.0        progressr_0.10.0      spatstat.data_2.1-4  
##  [31] survival_3.3-1        zoo_1.8-9             glue_1.6.2           
##  [34] polyclip_1.10-0       gtable_0.3.0          leiden_0.3.9         
##  [37] future.apply_1.8.1    abind_1.4-5           scales_1.1.1         
##  [40] DBI_1.1.2             spatstat.random_2.2-0 miniUI_0.1.1.1       
##  [43] Rcpp_1.0.8.3          viridisLite_0.4.0     xtable_1.8-4         
##  [46] reticulate_1.24       spatstat.core_2.4-0   bit_4.0.4            
##  [49] htmlwidgets_1.5.4     httr_1.4.2            RColorBrewer_1.1-2   
##  [52] ellipsis_0.3.2        ica_1.0-2             farver_2.1.0         
##  [55] pkgconfig_2.0.3       sass_0.4.1            uwot_0.1.11          
##  [58] deldir_1.0-6          utf8_1.2.2            labeling_0.4.2       
##  [61] tidyselect_1.1.2      rlang_1.0.2           reshape2_1.4.4       
##  [64] later_1.3.0           munsell_0.5.0         tools_4.1.3          
##  [67] cachem_1.0.6          cli_3.2.0             generics_0.1.2       
##  [70] ggridges_0.5.3        evaluate_0.15         stringr_1.4.0        
##  [73] fastmap_1.1.0         yaml_2.3.5            ragg_1.2.2           
##  [76] goftest_1.2-3         knitr_1.38            bit64_4.0.5          
##  [79] fs_1.5.2              fitdistrplus_1.1-8    purrr_0.3.4          
##  [82] RANN_2.6.1            pbapply_1.5-0         future_1.24.0        
##  [85] nlme_3.1-157          mime_0.12             formatR_1.11         
##  [88] hdf5r_1.3.5           compiler_4.1.3        rstudioapi_0.13      
##  [91] plotly_4.10.0         png_0.1-7             spatstat.utils_2.3-0 
##  [94] tibble_3.1.6          bslib_0.3.1           stringi_1.7.6        
##  [97] highr_0.9             desc_1.4.1            RSpectra_0.16-0      
## [100] rgeos_0.5-9           lattice_0.20-45       Matrix_1.4-1         
## [103] vctrs_0.4.0           pillar_1.7.0          lifecycle_1.0.1      
## [106] spatstat.geom_2.4-0   lmtest_0.9-40         jquerylib_0.1.4      
## [109] RcppAnnoy_0.0.19      data.table_1.14.2     cowplot_1.1.1        
## [112] irlba_2.3.5           httpuv_1.6.5          R6_2.5.1             
## [115] promises_1.2.0.1      KernSmooth_2.23-20    gridExtra_2.3        
## [118] parallelly_1.30.0     codetools_0.2-18      fastDummies_1.6.3    
## [121] MASS_7.3-56           assertthat_0.2.1      rprojroot_2.0.2      
## [124] withr_2.5.0           sctransform_0.3.3     mgcv_1.8-40          
## [127] parallel_4.1.3        grid_4.1.3            rpart_4.1.16         
## [130] tidyr_1.2.0           rmarkdown_2.13        Rtsne_0.15           
## [133] shiny_1.7.1