About Install Get Started Frequently Asked Questions Frequently Requested Vignettes Contact
Tutorial: Integrating stimulated vs. control PBMC datasets to learn cell-type specific responses

This tutorial walks through an alignment of two groups of PBMCs from Kang et al, 2017. In this experiment, PBMCs were split into a stimulated and control group and the stimulated group was treated with interferon beta. The response to interferon caused cell type specific gene expression changes that makes a joint analysis of all the data difficult, with cells clustering both by stimulation condition and by cell type. Here, we demonstrate our alignment strategy, as described in Butler et al, 2017, for performing integrated analyses to promote the identification of common cell types and enable comparative analyses. While this example demonstrates the integration of two datasets (conditions), these methods have been extended to multiple datasets. This workflow provides an example of integrating four pancreatic islet datasets (raw data here) generated using different technologies. To view the previously posted tutorial aligning two PBMC datasets generated using 10X and SeqWell, click here.

Integration goals

The following tutorial is designed to give you an overview of the kinds of comparative analyses on complex cell types that are possible using the Seurat integration procedure. Here, we address three main goals:

  • Identify cell types that are present in both datasets
  • Obtain cell type markers that are conserved in both control and stimulated cells
  • Compare the datasets to find cell-type specific responses to stimulation

Setup the Seurat objects

The gene expression matrices can be found here. We first read in the two count matrices and set up the Seurat objects.

Next, we select the genes we want to use in the alignment procedure. Here we take the union of the top 1,000 genes with the highest dispersion (var/mean) from both datasets.

ctrl.data <- read.table("~/Downloads/immune_control_expression_matrix.txt.gz", 
    sep = "\t")
stim.data <- read.table("~/Downloads/immune_stimulated_expression_matrix.txt.gz", 
    sep = "\t")

# Set up control object
ctrl <- CreateSeuratObject(raw.data = ctrl.data, project = "IMMUNE_CTRL", min.cells = 5)
ctrl@meta.data$stim <- "CTRL"
ctrl <- FilterCells(ctrl, subset.names = "nGene", low.thresholds = 500, high.thresholds = Inf)
ctrl <- NormalizeData(ctrl)
ctrl <- ScaleData(ctrl, display.progress = F)
# Set up stimulated object
stim <- CreateSeuratObject(raw.data = stim.data, project = "IMMUNE_STIM", min.cells = 5)
stim@meta.data$stim <- "STIM"
stim <- FilterCells(stim, subset.names = "nGene", low.thresholds = 500, high.thresholds = Inf)
stim <- NormalizeData(stim)
stim <- ScaleData(stim, display.progress = F)

# Gene selection for input to CCA
ctrl <- FindVariableGenes(ctrl, do.plot = F)
stim <- FindVariableGenes(stim, do.plot = F)
g.1 <- head(rownames(ctrl@hvg.info), 1000)
g.2 <- head(rownames(stim@hvg.info), 1000)
genes.use <- unique(c(g.1, g.2))
genes.use <- intersect(genes.use, rownames(ctrl@scale.data))
genes.use <- intersect(genes.use, rownames(stim@scale.data))

Perform a canonical correlation analysis (CCA)

We next run a canonical correlation analysis to identify common sources of variation between the two datasets. RunCCA will also combine the two objects into a single object and store the canonical correlation vectors (the vectors that project each dataset into the maximally correlated subspaces). We also store the original dataset identity as a column in object@meta.data

immune.combined <- RunCCA(ctrl, stim, genes.use = genes.use, num.cc = 30)
# visualize results of CCA plot CC1 versus CC2 and look at a violin plot
p1 <- DimPlot(object = immune.combined, reduction.use = "cca", group.by = "stim", 
    pt.size = 0.5, do.return = TRUE)
p2 <- VlnPlot(object = immune.combined, features.plot = "CC1", group.by = "stim", 
    do.return = TRUE)
plot_grid(p1, p2)