Integration and Label Transfer

Intro: Seurat v3 Integration

As described in Stuart*, Butler*, et al. Cell 2019, Seurat v3 introduces new methods for the integration of multiple single-cell datasets. These methods aim to identify shared cell states that are present across different datasets, even if they were collected from different individuals, experimental conditions, technologies, or even species.

Our method first identifies ‘anchors’ between pairs of datasets. These represent pairwise correspondences between individual cells (one in each dataset) that we hypothesize originate from the same biological state. These ‘anchors’ are then used to harmonize the datasets, or to transfer information from one dataset to another. Below, we demonstrate multiple applications of integrative analysis and introduce new functionality beyond what was described in the 2019 manuscript. To help guide users, we briefly introduce these vignettes below:

Standard Workflow

  • Describes the standard Seurat v3 integration workflow, and applies it to integrate multiple datasets collected from human pancreatic islets (across different technologies). We also demonstrate how Seurat v3 can be used as a classifier, transferring cluster labels onto a newly collected dataset.
  • We recommend this vignette for new users.

SCTransform

  • Describes a modification of the v3 integration workflow so that it can be applied to datasets that have been normalized with our new normalization method, SCTransform. We apply this to the same pancreatic islet datasets as described previously, and also integrate human PBMC datasets from eight different technologies, produced as a systematic technology benchmark by the Human Cell Atlas.
  • We recommend this vignette for advanced users who are familiar with our SCTransform normalization method. You can read more about SCTransform in our recent preprint, and see how to apply it to a single dataset in a separate vignette.
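
As a rough sketch of how the SCTransform variant differs from the standard workflow, the helper below chains the relevant Seurat calls. The function name `integrate_sct`, its arguments, and the `nfeatures = 3000` default are assumptions made for illustration; `object.list` is assumed to be a list of Seurat objects holding raw counts. The dedicated vignette remains the authoritative walkthrough.

```r
# Sketch only: `integrate_sct` is a hypothetical helper, not part of Seurat.
# It assumes `object.list` is a list of Seurat objects containing raw counts.
integrate_sct <- function(object.list, nfeatures = 3000, dims = 1:30) {
  # Normalize each dataset separately with SCTransform
  object.list <- lapply(object.list, Seurat::SCTransform, verbose = FALSE)

  # Choose features shared across datasets and prepare the SCT residuals
  features <- Seurat::SelectIntegrationFeatures(object.list = object.list,
                                                nfeatures = nfeatures)
  object.list <- Seurat::PrepSCTIntegration(object.list = object.list,
                                            anchor.features = features)

  # Find anchors and integrate, telling both steps that SCT was used
  anchors <- Seurat::FindIntegrationAnchors(object.list = object.list,
                                            normalization.method = "SCT",
                                            anchor.features = features,
                                            dims = dims)
  Seurat::IntegrateData(anchorset = anchors,
                        normalization.method = "SCT",
                        dims = dims)
}
```

The key differences from the log-normalization workflow are the extra `PrepSCTIntegration` step and the `normalization.method = "SCT"` argument passed to both anchor finding and integration.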

Reference-based

  • Describes a modification of the v3 integration workflow, where a subset of the datasets (or a single dataset) are listed as a ‘reference’. This approach can result in dramatic speed improvements, particularly when there are a large number of datasets to integrate. We apply this to the eight PBMC datasets described above, and observe identical results, despite a substantial reduction in processing time.
  • We recommend this vignette for users who are integrating many datasets, and are looking for speed improvements.
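
A minimal sketch of the reference-based variant is shown below. It relies on the `reference` argument of `FindIntegrationAnchors`, which is available in Seurat releases that support this workflow. The wrapper name `find_anchors_with_reference` and the object name `pbmc.list` in the usage comment are hypothetical; `object.list` is assumed to contain normalized Seurat objects with variable features already identified.

```r
# Sketch only: `find_anchors_with_reference` is a hypothetical helper.
# `reference_index` gives the position(s) in `object.list` of the dataset(s)
# that should act as the reference.
find_anchors_with_reference <- function(object.list, reference_index = 1,
                                        dims = 1:30) {
  Seurat::FindIntegrationAnchors(
    object.list = object.list,
    reference = reference_index,
    dims = dims
  )
}

# Usage (assumes a list of preprocessed objects called `pbmc.list`):
# anchors <- find_anchors_with_reference(pbmc.list, reference_index = 1)
# integrated <- Seurat::IntegrateData(anchorset = anchors, dims = 1:30)
```

With a reference specified, anchors are found between each remaining dataset and the reference rather than between every pair of datasets, which is where the reduction in processing time comes from.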

Reciprocal PCA

  • Describes a modification of the v3 integration workflow, where reciprocal PCA is used in place of canonical correlation analysis for the dimension reduction used in anchor finding. This approach can improve speed and efficiency when working with large datasets.
  • We recommend this vignette for users looking for speed/memory improvements when working with a large number of datasets or cells, for example experimental designs with many experimental conditions, replicates, or patients. However, this workflow may struggle to align highly divergent samples (e.g. cross species, or cross-modality, integration). For a ‘turbo’ mode, consider combining with “reference-based” integration as demonstrated here.
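
The reciprocal PCA variant can be sketched as follows. Each dataset needs its own PCA, computed on a shared feature set, before anchors are found with `reduction = "rpca"`. The helper name `find_rpca_anchors` is an assumption made for illustration, and `object.list` is assumed to contain normalized Seurat objects with variable features already identified.

```r
# Sketch only: `find_rpca_anchors` is a hypothetical helper, not part of Seurat.
find_rpca_anchors <- function(object.list, dims = 1:30) {
  # Pick one feature set that is used for every dataset
  features <- Seurat::SelectIntegrationFeatures(object.list = object.list)

  # Scale and run PCA on each dataset individually using those features
  object.list <- lapply(object.list, function(x) {
    x <- Seurat::ScaleData(x, features = features, verbose = FALSE)
    Seurat::RunPCA(x, features = features, verbose = FALSE)
  })

  # Find anchors in reciprocal PCA space instead of CCA space
  Seurat::FindIntegrationAnchors(object.list = object.list,
                                 anchor.features = features,
                                 reduction = "rpca",
                                 dims = dims)
}
```

The returned anchor set is passed to `IntegrateData` in the same way as in the standard workflow.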

Standard Workflow

In this example workflow, we demonstrate two new methods we recently introduced in our paper, Comprehensive Integration of Single Cell Data:

  • Assembly of multiple distinct scRNA-seq datasets into an integrated reference
  • Transfer of cell type labels from a reference dataset onto a new query dataset

For the purposes of this example, we’ve chosen human pancreatic islet cell datasets produced across four technologies: CelSeq (GSE81076), CelSeq2 (GSE85241), Fluidigm C1 (GSE86469), and SMART-Seq2 (E-MTAB-5061). For convenience, we distribute this dataset through our SeuratData package.

The code for the new methodology is implemented in Seurat v3. You can download and install from CRAN with install.packages.

install.packages("Seurat")

In addition to new methods, Seurat v3 includes a number of improvements aiming to improve the Seurat object and user interaction. To help users familiarize themselves with these changes, we put together a command cheat sheet for common tasks.

Dataset preprocessing

Load in the dataset. The metadata contains the technology (tech column) and cell type annotations (celltype column) for each cell in the four datasets.

library(Seurat)
library(SeuratData)
InstallData("panc8")

To construct a reference, we will identify ‘anchors’ between the individual datasets. First, we split the combined object into a list, with each dataset as an element.

data("panc8")
pancreas.list <- SplitObject(panc8, split.by = "tech")
pancreas.list <- pancreas.list[c("celseq", "celseq2", "fluidigmc1", "smartseq2")]
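
As a toy illustration of what an anchor is, the base R sketch below pairs cells from two small simulated datasets whenever each cell is the other's nearest neighbour. This is a deliberately simplified stand-in: the simulated data and every variable name are assumptions made for illustration, and Seurat's anchor finding involves additional steps (dimension reduction, filtering, and scoring) that are not reproduced here.

```r
# Toy illustration only: mutual nearest neighbours between two simulated
# "datasets". This is NOT Seurat's FindIntegrationAnchors.
set.seed(42)

# Three hypothetical cell states, each a point in a 2-D expression space
centers <- rbind(c(0, 0), c(5, 5), c(0, 5))

# Simulate cells around each state; `shift` mimics a constant batch effect
simulate_cells <- function(n_per_state, shift) {
  states <- rep(seq_len(nrow(centers)), each = n_per_state)
  noise <- matrix(rnorm(2 * length(states), sd = 0.3), ncol = 2)
  coords <- centers[states, ] + noise
  list(coords = sweep(coords, 2, shift, "+"), state = states)
}
a <- simulate_cells(5, shift = c(0, 0))
b <- simulate_cells(5, shift = c(0.5, 0.5))

# Pairwise Euclidean distances: rows are cells in A, columns are cells in B
n_a <- nrow(a$coords)
n_b <- nrow(b$coords)
d <- as.matrix(dist(rbind(a$coords, b$coords)))[seq_len(n_a), n_a + seq_len(n_b)]

nn_a_to_b <- unname(apply(d, 1, which.min))  # nearest B cell for each A cell
nn_b_to_a <- unname(apply(d, 2, which.min))  # nearest A cell for each B cell

# Keep a pair only if the two cells pick each other
is_mutual <- nn_b_to_a[nn_a_to_b] == seq_len(n_a)
anchors <- data.frame(cell_a = which(is_mutual),
                      cell_b = nn_a_to_b[is_mutual])

# Compare the simulated states of the two cells in each pair
anchors$same_state <- a$state[anchors$cell_a] == b$state[anchors$cell_b]
print(anchors)
```

Because the simulated states are well separated relative to the batch shift, the paired cells should come from the same state, which is the property an anchor is meant to capture.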

Prior to finding anchors, we perform standard preprocessing (log-normalization) and identify variable features for each dataset individually. Note that Seurat v3 implements an improved method for variable feature selection based on a variance stabilizing transformation ("vst").

# Normalize each dataset and select its variable features independently
for (i in seq_along(pancreas.list)) {
    pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
    pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst",
        nfeatures = 2000, verbose = FALSE)
}
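
To give some intuition for what a variance-stabilizing feature ranking does, the base R sketch below works on simulated counts: it fits the mean-variance trend on a log scale, standardizes each gene by the variance expected from that trend, and ranks genes by the variance that remains. It is a simplified illustration, not Seurat's implementation; the simulated data, the loess `span`, the clipping threshold, and all variable names are assumptions.

```r
# Simplified sketch of a vst-style feature ranking on simulated counts.
# Not Seurat's implementation; all names and settings are illustrative.
set.seed(1)
n_genes <- 500
n_cells <- 200

# Simulate Poisson counts whose per-gene mean spans a wide range
gene_means <- exp(runif(n_genes, log(0.1), log(50)))
counts <- matrix(rpois(n_genes * n_cells, lambda = rep(gene_means, n_cells)),
                 nrow = n_genes)

# Plant extra variability in the first 20 genes (inflate half of the cells)
counts[1:20, 1:100] <- counts[1:20, 1:100] * 5L

mu <- rowMeans(counts)
v <- apply(counts, 1, var)
keep <- v > 0  # log10 needs a positive variance

# Fit the mean-variance trend on a log10 scale with local regression
trend <- data.frame(log_mu = log10(mu[keep]), log_var = log10(v[keep]))
fit <- loess(log_var ~ log_mu, data = trend, span = 0.3)
expected_sd <- rep(NA_real_, n_genes)
expected_sd[keep] <- sqrt(10^fit$fitted)

# Standardize with the observed mean and trend-expected sd, clip large
# values, then compute the variance of the standardized values
clip_max <- sqrt(n_cells)
z <- pmin((counts - mu) / expected_sd, clip_max)
std_var <- rowSums(z^2) / (n_cells - 1)

# Genes with the largest standardized variance are the "variable features"
top <- order(std_var, decreasing = TRUE)[1:20]
print(sort(top))
```

Genes that follow the overall mean-variance trend end up with a standardized variance near 1 regardless of expression level, so most of the top-ranked genes here should be the ones with planted variability.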

Integration of 3 pancreatic islet cell datasets

Next, we identify anchors using the FindIntegrationAnchors function, which takes a list of Seurat objects as input. Here, we integrate three of the objects into a reference (we will use the fourth later in this vignette).

  • We use all default parameters here for identifying anchors, including the ‘dimensionality’ of the dataset (30; feel free to try varying this parameter over a broad range, for example between 10 and 50).

reference.list <- pancreas.list[c("celseq", "celseq2", "smartseq2")]
pancreas.anchors <- FindIntegrationAnchors(object.list = reference.list, dims = 1:30)

We then pass these anchors to the IntegrateData function, which returns a Seurat object.

  • The returned object will contain a new Assay, which holds an integrated (or ‘batch-corrected’) expression matrix for all cells, enabling them to be jointly analyzed.

pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:30)

After running IntegrateData, the Seurat object will contain a new Assay with the integrated expression matrix. Note that the original (uncorrected) values are still stored in the object in the “RNA” assay, so you can switch back and forth.
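
Switching between the two assays might look like the sketch below. It assumes the `pancreas.integrated` object created above and that Seurat is already loaded; the `exists()` guard is only there so the snippet does nothing if it is run before that object has been built.

```r
# Sketch: assumes `pancreas.integrated` was created by IntegrateData above
if (exists("pancreas.integrated")) {
  # List the assays stored in the object
  print(Assays(pancreas.integrated))

  # Work with the original, uncorrected values
  DefaultAssay(pancreas.integrated) <- "RNA"

  # Return to the batch-corrected values for joint analysis
  DefaultAssay(pancreas.integrated) <- "integrated"
}
```

Changing the default assay does not modify the stored data; it only determines which assay subsequent functions use unless told otherwise.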

We can then use this new integrated matrix for downstream analysis and visualization. Here we scale the integrated data, run PCA, and visualize the results with UMAP. The integrated datasets cluster by cell type, instead of by technology.

library(ggplot2)
library(cowplot)
# switch to integrated assay. The variable features of this assay are automatically
# set during IntegrateData
DefaultAssay(pancreas.integrated) <- "integrated"

# Run the standard workflow for visualization and clustering
pancreas.integrated <- ScaleData(pancreas.integrated, verbose = FALSE)
pancreas.integrated <- RunPCA(pancreas.integrated, npcs = 30, verbose = FALSE)
pancreas.integrated <- RunUMAP(pancreas.integrated, reduction = "pca", dims = 1:30)
p1 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "tech")
p2 <- DimPlot(pancreas.integrated, reduction = "umap", group.by = "celltype", label = TRUE, 
    repel = TRUE) + NoLegend()
plot_grid(p1, p2)