As described in Stuart*, Butler*, et al. Cell 2019, Seurat v3 introduces new methods for the integration of multiple single-cell datasets. These methods aim to identify shared cell states that are present across different datasets, even if they were collected from different individuals, experimental conditions, technologies, or even species.
Our method aims to first identify ‘anchors’ between pairs of datasets. These represent pairwise correspondences between individual cells (one in each dataset), that we hypothesize originate from the same biological state. These ‘anchors’ are then used to harmonize the datasets, or transfer information from one dataset to another. Below, we demonstrate multiple applications of integrative analysis, and also introduce new functionality beyond what was described in the 2019 manuscript. To help guide users, we briefly introduce these vignettes below:
Standard Workflow
SCTransform
Reference-based
Reciprocal PCA
In this example workflow, we demonstrate two new methods we recently introduced in our paper, Comprehensive Integration of Single Cell Data:
For the purposes of this example, we’ve chosen human pancreatic islet cell datasets produced across four technologies, CelSeq (GSE81076) CelSeq2 (GSE85241), Fluidigm C1 (GSE86469), and SMART-Seq2 (E-MTAB-5061). For convienence, we distribute this dataset through our SeuratData package.
The code for the new methodology is implemented in Seurat v3. You can download and install from CRAN with install.packages
.
install.packages("Seurat")
In addition to new methods, Seurat v3 includes a number of improvements aiming to improve the Seurat object and user interaction. To help users familiarize themselves with these changes, we put together a command cheat sheet for common tasks.
Load in the dataset. The metadata contains the technology (tech
column) and cell type annotations (celltype
column) for each cell in the four datasets.
library(Seurat)
library(SeuratData)
InstallData("panc8")
To construct a reference, we will identify ‘anchors’ between the individual datasets. First, we split the combined object into a list, with each dataset as an element.
data("panc8")
pancreas.list <- SplitObject(panc8, split.by = "tech")
pancreas.list <- pancreas.list[c("celseq", "celseq2", "fluidigmc1", "smartseq2")]
Prior to finding anchors, we perform standard preprocessing (log-normalization), and identify variable features individually for each. Note that Seurat v3 implements an improved method for variable feature selection based on a variance stabilizing transformation ("vst"
)
for (i in 1:length(pancreas.list)) {
pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst",
nfeatures = 2000, verbose = FALSE)
}
Next, we identify anchors using the FindIntegrationAnchors
function, which takes a list of Seurat objects as input. Here, we integrate three of the objects into a reference (we will use the fourth later in this vignette)
reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2")]
pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)
We then pass these anchors to the IntegrateData
function, which returns a Seurat object.
Assay
, which holds an integrated (or ‘batch-corrected’) expression matrix for all cells, enabling them to be jointly analyzed.pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)
After running IntegrateData
, the Seurat
object will contain a new Assay
with the integrated expression matrix. Note that the original (uncorrected values) are still stored in the object in the “RNA” assay, so you can switch back and forth.
We can then use this new integrated matrix for downstream analysis and visualization. Here we scale the integrated data, run PCA, and visualize the results with UMAP. The integrated datasets cluster by cell type, instead of by technology.
library(ggplot2)
library(cowplot)
library(patchwork)
# switch to integrated assay. The variable features of this assay are automatically
# set during IntegrateData
DefaultAssay(pancreas.integrated) <- "integrated"
# Run the standard workflow for visualization and clustering
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
p1 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE,
repel = TRUE) + NoLegend()
p1 + p2
Seurat v3 also supports the projection of reference data (or meta data) onto a query object. While many of the methods are conserved (both procedures begin by identifying anchors), there are two important distinctions between data transfer and integration:
After finding anchors, we use the TransferData
function to classify the query cells based on reference data (a vector of reference cell type labels). TransferData
returns a matrix with predicted IDs and prediction scores, which we can add to the query metadata.
pancreas.query <- pancreas.list[["fluidigmc1"]]
pancreas.anchors <- FindTransferAnchors(reference = pancreas.integrated, query = pancreas.query,
dims = 1:30)
predictions <- TransferData(anchorset = pancreas.anchors, refdata = pancreas.integrated$celltype,
dims = 1:30)
pancreas.query <- AddMetaData(pancreas.query, metadata = predictions)
Because we have the original label annotations from our full integrated analysis, we can evaluate how well our predicted cell type annotations match the full reference. In this example, we find that there is a high agreement in cell type classification, with over 97% of cells being labeled correctly.
pancreas.query$prediction.match <- pancreas.query$predicted.id == pancreas.query$celltype
table(pancreas.query$prediction.match)
##
## FALSE TRUE
## 18 620
To verify this further, we can examine some canonical cell type markers for specific pancreatic islet cell populations. Note that even though some of these cell types are only represented by one or two cells (e.g. epsilon cells), we are still able to classify them correctly.
table(pancreas.query$predicted.id)
##
## acinar activated_stellate alpha beta
## 21 17 248 258
## delta ductal endothelial epsilon
## 22 33 13 1
## gamma macrophage mast schwann
## 17 1 2 5
VlnPlot(pancreas.query, c("REG1A", "PPY", "SST", "GHRL", "VWF", "SOX10"), group.by = "predicted.id")
On the previous tab, we demonstrate how to integrate datasets after each has been pre-processed using standard log-normalization. Here, we modify the workflow to take advantage of our improved pre-processing and normalization workflow: SCTransform. You can read more about SCTransform in our recent preprint, and see how to apply it to a single dataset in a separate vignette. We suggest exploring these resources before proceeding.
Conceptually, this workflow is very similar to what we have previously introduced, where we ‘correct’ (or harmonize) log-normalized expression values across datasets. Here, instead, we will harmonize the Pearson residuals that are output from SCTransform. As demonstrated below, the workflow consists of the following steps:
First, setup the Seurat object list, and run SCTransform on each object separately:
library(Seurat)
library(ggplot2)
library(patchwork)
options(future.globals.maxSize = 4000 * 1024^2)
data("panc8")
pancreas.list <- SplitObject(panc8, split.by = "tech")
pancreas.list <- pancreas.list[c("celseq", "celseq2", "fluidigmc1", "smartseq2")]
for (i in 1:length(pancreas.list)) {
pancreas.list[[i]] <- SCTransform(pancreas.list[[i]], verbose = FALSE)
}
Next, select features for downstream integration, and run PrepSCTIntegration
, which ensures that all necessary Pearson residuals have been calculated.
pancreas.features <- SelectIntegrationFeatures(object.list = pancreas.list, nfeatures = 3000)
pancreas.list <- PrepSCTIntegration(object.list = pancreas.list, anchor.features = pancreas.features,
verbose = FALSE)
Next, identify anchors and integrate the datasets. Commands are identical to the standard workflow, but make sure to set normalization.method = 'SCT'
:
pancreas.anchors <- FindIntegrationAnchors(object.list = pancreas.list, normalization.method = "SCT",
anchor.features = pancreas.features, verbose = FALSE)
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, normalization.method = "SCT",
verbose = FALSE)
Now proceed with downstream analysis (i.e. visualization, clustering) on the integrated dataset. Commands are identical to the standard workflow, but do not run the ScaleData
function after integration. You can see that after integration, cells group by their biological cell type (which has been pre-annotated), instead of by their underlying technology.
pancreas.integrated <- RunPCA(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, dims = 1:30)
plots <- DimPlot(pancreas.integrated, group.by = c("tech", "celltype"))
plots & theme(legend.position = "top") & guides(color = guide_legend(nrow = 3, byrow = TRUE,
override.aes = list(size = 3)))
To further demonstrate this workflow, we next apply it to a series of human PBMC datasets from eight different technologies, produced as a systematic technology benchmark by the Human Cell Atlas. Data was downloaded from the Broad Single Cell Portal. For convienence, we distribute this dataset through our SeuratData package.
InstallData("pbmcsca")
data("pbmcsca")
pbmc.list <- SplitObject(pbmcsca, split.by = "Method")
for (i in names(pbmc.list)) {
pbmc.list[[i]] <- SCTransform(pbmc.list[[i]], verbose = FALSE)
}
pbmc.features <- SelectIntegrationFeatures(object.list = pbmc.list, nfeatures = 3000)
pbmc.list <- PrepSCTIntegration(object.list = pbmc.list, anchor.features = pbmc.features)
pbmc.anchors <- FindIntegrationAnchors(object.list = pbmc.list, normalization.method = "SCT",
anchor.features = pbmc.features)
pbmc.integrated <- IntegrateData(anchorset = pbmc.anchors, normalization.method = "SCT")
pbmc.integrated <- RunPCA(object = pbmc.integrated, verbose = FALSE)
pbmc.integrated <- RunUMAP(object = pbmc.integrated, dims = 1:30)
plots <- DimPlot(pbmc.integrated, group.by = c("Method", "CellType"))
plots & theme(legend.position = "top") & guides(color = guide_legend(nrow = 4, byrow = TRUE,
override.aes = list(size = 2.5)))
Note that again, cells are grouped together by shared biological subtype, instead of by technology. In this case, the authors labeled “CellType” in each dataset independently, and you can see that this is consistent with our integrated analysis:
DimPlot(pbmc.integrated, group.by = "CellType", split.by = "Method", ncol = 3)