This tutorial walks through an alignment of two groups of PBMCs from Kang et al, 2017. In this experiment, PBMCs were split into a stimulated and control group and the stimulated group was treated with interferon beta. The response to interferon caused cell type specific gene expression changes that makes a joint analysis of all the data difficult, with cells clustering both by stimulation condition and by cell type. Here, we demonstrate our alignment strategy, as described in Butler et al, 2018, for performing integrated analyses to promote the identification of common cell types and enable comparative analyses. While this example demonstrates the integration of two datasets (conditions), these methods have been extended to multiple datasets. This workflow provides an example of integrating four pancreatic islet datasets (raw data here) generated using different technologies. To view the previously posted tutorial aligning two PBMC datasets generated using 10X and SeqWell, click here.

Integration goals

The following tutorial is designed to give you an overview of the kinds of comparative analyses on complex cell types that are possible using the Seurat integration procedure. Here, we address three main goals:

Setup the Seurat objects

The gene expression matrices can be found here. We first read in the two count matrices and set up the Seurat objects.

Next, we select the genes we want to use in the alignment procedure. Here we take the union of the top 1,000 genes with the highest dispersion (var/mean) from both datasets.

library(Seurat)
ctrl.data <- read.table("~/Downloads/immune_control_expression_matrix.txt.gz", 
    sep = "\t")
stim.data <- read.table("~/Downloads/immune_stimulated_expression_matrix.txt.gz", 
    sep = "\t")

# Set up control object
ctrl <- CreateSeuratObject(raw.data = ctrl.data, project = "IMMUNE_CTRL", min.cells = 5)
ctrl@meta.data$stim <- "CTRL"
ctrl <- FilterCells(ctrl, subset.names = "nGene", low.thresholds = 500, high.thresholds = Inf)
ctrl <- NormalizeData(ctrl)
ctrl <- ScaleData(ctrl, display.progress = F)
# Set up stimulated object
stim <- CreateSeuratObject(raw.data = stim.data, project = "IMMUNE_STIM", min.cells = 5)
stim@meta.data$stim <- "STIM"
stim <- FilterCells(stim, subset.names = "nGene", low.thresholds = 500, high.thresholds = Inf)
stim <- NormalizeData(stim)
stim <- ScaleData(stim, display.progress = F)

# Gene selection for input to CCA
ctrl <- FindVariableGenes(ctrl, do.plot = F)
stim <- FindVariableGenes(stim, do.plot = F)
g.1 <- head(rownames(ctrl@hvg.info), 1000)
g.2 <- head(rownames(stim@hvg.info), 1000)
genes.use <- unique(c(g.1, g.2))
genes.use <- intersect(genes.use, rownames(ctrl@scale.data))
genes.use <- intersect(genes.use, rownames(stim@scale.data))

Perform a canonical correlation analysis (CCA)

We next run a canonical correlation analysis to identify common sources of variation between the two datasets. RunCCA will also combine the two objects into a single object and store the canonical correlation vectors (the vectors that project each dataset into the maximally correlated subspaces). We also store the original dataset identity as a column in object@meta.data

immune.combined <- RunCCA(ctrl, stim, genes.use = genes.use, num.cc = 30)
# visualize results of CCA plot CC1 versus CC2 and look at a violin plot
p1 <- DimPlot(object = immune.combined, reduction.use = "cca", group.by = "stim", 
    pt.size = 0.5, do.return = TRUE)
p2 <- VlnPlot(object = immune.combined, features.plot = "CC1", group.by = "stim", 
    do.return = TRUE)
plot_grid(p1, p2)

PrintDim(object = immune.combined, reduction.type = "cca", dims.print = 1:2, 
    genes.print = 10)
## [1] "CC1"
##  [1] "RPS6"   "RPS18"  "RPL21"  "RPL13"  "RPL13A" "RPS14"  "RPL3"  
##  [8] "RPS3"   "RPS2"   "RPL32" 
## [1] ""
##  [1] "TYROBP"   "FCER1G"   "C15orf48" "SOD2"     "FTL"      "ANXA5"   
##  [7] "CST3"     "CD63"     "TIMP1"    "TYMP"    
## [1] ""
## [1] ""
## [1] "CC2"
##  [1] "NKG7"  "GNLY"  "GZMB"  "CCL5"  "PRF1"  "CST7"  "KLRD1" "GZMH" 
##  [9] "CLIC3" "GZMA" 
## [1] ""
##  [1] "RPS9"    "RPL8"    "HLA-DRA" "RPS23"   "RPL11"   "RPS8"    "RPL18A" 
##  [8] "RPL32"   "RPL12"   "RPL26"  
## [1] ""
## [1] ""

You can see, for example, that CC1 and CC2 separate myeloid from lymphoid cells in both datasets, but the values remain ‘shifted’ relative to each other. We now need to choose CCs for downstream analysis and then ‘align them’

The problem of choosing CCs for downstream analysis such as clustering is analogous to that of choosing PCs and we provide similar functions for exploring which CCs to include. When choosing PCs, often we look for a saturation in the relationship between the number of principle components and the percentage of the variance explained. Here, we provide MetageneBicorPlot, which examines a measure of correlation strength for each CC and find that this statistic generally saturates after a reasonable number of CCs. Here, we begin to see drop-off in signal after CC20, so we chose CC1-20 for analysis. You can try modifying this parameter (i.e. 1-15 or 1-25) without significant changes in the results.

p3 <- MetageneBicorPlot(immune.combined, grouping.var = "stim", dims.eval = 1:30, 
    display.progress = FALSE)

As with PC selection, it is often also useful to examine heatmaps of the top genes driving each CC. Here, we visualize the first 9 CCs.

DimHeatmap(object = immune.combined, reduction.type = "cca", cells.use = 500, 
    dim.use = 1:9, do.balanced = TRUE)

Align the CCA subspaces

Now we align the CCA subspaces, which returns a new dimensional reduction called cca.aligned that can be used for downstream analyses such as clustering.

immune.combined <- AlignSubspace(immune.combined, reduction.type = "cca", grouping.var = "stim", 
    dims.align = 1:20)

We can visualize the aligned CCA and perform an integrated analysis.

p1 <- VlnPlot(object = immune.combined, features.plot = "ACC1", group.by = "stim", 
    do.return = TRUE)
p2 <- VlnPlot(object = immune.combined, features.plot = "ACC2", group.by = "stim", 
    do.return = TRUE)
plot_grid(p1, p2)

Perform an integrated analysis

Now we can run a single integrated analysis on all cells!

# t-SNE and Clustering
immune.combined <- RunTSNE(immune.combined, reduction.use = "cca.aligned", dims.use = 1:20, 
    do.fast = T)
immune.combined <- FindClusters(immune.combined, reduction.type = "cca.aligned", 
    resolution = 0.6, dims.use = 1:20)
# Visualization
p1 <- TSNEPlot(immune.combined, do.return = T, pt.size = 0.5, group.by = "stim")
p2 <- TSNEPlot(immune.combined, do.label = T, do.return = T, pt.size = 0.5)
plot_grid(p1, p2)

Identify conserved cell type markers

To identify canonical cell type marker genes that are conserved across conditions, we provide the FindConservedMarkers function. This function performs differential gene expression testing for each dataset/group and combines the p-values using meta-analysis methods from the MetaDE R package. For example, we can calculated the genes that are conserved markers irrespective of stimulation condition in cluster 7 (NK cells).

nk.markers <- FindConservedMarkers(immune.combined, ident.1 = 7, grouping.var = "stim", 
    print.bar = FALSE)
head(nk.markers)
##        CTRL_p_val CTRL_avg_logFC CTRL_pct.1 CTRL_pct.2 CTRL_p_val_adj
## GNLY            0       4.133571      0.943      0.043              0
## NKG7            0       3.173254      0.956      0.082              0
## GZMB            0       2.895198      0.839      0.042              0
## CLIC3           0       2.395846      0.599      0.023              0
## FGFBP2          0       2.188426      0.489      0.020              0
## CTSW            0       2.062551      0.533      0.029              0
##           STIM_p_val STIM_avg_logFC STIM_pct.1 STIM_pct.2 STIM_p_val_adj
## GNLY    0.000000e+00       4.041565      0.950      0.058   0.000000e+00
## NKG7    0.000000e+00       2.887415      0.962      0.078   0.000000e+00
## GZMB    0.000000e+00       3.101812      0.899      0.058   0.000000e+00
## CLIC3   0.000000e+00       2.392933      0.601      0.030   0.000000e+00
## FGFBP2 7.353053e-180       1.531844      0.266      0.015  1.033325e-175
## CTSW    0.000000e+00       2.131213      0.574      0.035   0.000000e+00
##             max_pval minimump_p_val
## GNLY    0.000000e+00              0
## NKG7    0.000000e+00              0
## GZMB    0.000000e+00              0
## CLIC3   0.000000e+00              0
## FGFBP2 7.353053e-180              0
## CTSW    0.000000e+00              0

We can explore these marker genes for each cluster and use them to annotate our clusters as specific cell types.

FeaturePlot(object = immune.combined, features.plot = c("CD3D", "SELL", "CREM", 
    "CD8A", "GNLY", "CD79A", "FCGR3A", "CCL2", "PPBP"), min.cutoff = "q9", cols.use = c("lightgrey", 
    "blue"), pt.size = 0.5)

new.ident <- c("CD14 Mono", "CD4 Naive T", "CD4 Memory T", "B", "CD16 Mono", 
    "T activated", "CD8 T", "NK", "DC", "B activated", "Mk", "pDC", "Eryth")
for (i in 0:12) {
    immune.combined <- RenameIdent(object = immune.combined, old.ident.name = i, 
        new.ident.name = new.ident[i + 1])
}

TSNEPlot(immune.combined, do.label = T, pt.size = 0.5)

The SplitDotPlotGG function can be useful for viewing conserved cell type markers across conditions, showing both the expression level and the percentage of cells in a cluster expressing any given gene. Here we plot 2-3 strong marker genes for each of our 13 clusters.

immune.combined@ident <- factor(immune.combined@ident, levels = (c("pDC", "Eryth", 
    "Mk", "DC", "CD14 Mono", "CD16 Mono", "B activated", "B", "CD8 T", "NK", 
    "T activated", "CD4 Naive T", "CD4 Memory T")))
markers.to.plot <- c("CD3D", "CREM", "HSPH1", "SELL", "GIMAP5", "CACYBP", "GNLY", 
    "NKG7", "CCL5", "CD8A", "MS4A1", "CD79A", "MIR155HG", "NME1", "FCGR3A", 
    "VMO1", "CCL2", "S100A9", "HLA-DQA1", "GPR183", "PPBP", "GNG11", "HBA2", 
    "HBB", "TSPAN13", "IL3RA", "IGJ")
sdp <- SplitDotPlotGG(immune.combined, genes.plot = rev(markers.to.plot), cols.use = c("blue", 
    "red"), x.lab.rot = T, plot.legend = T, dot.scale = 8, do.return = T, grouping.var = "stim")

Identify differential expressed genes across conditions

Now that we’ve aligned the stimulated and control cells, we can start to do comparative analyses and look at the differences induced by stimulation. One way to look broadly at these changes is to plot the average expression of both the stimulated and control cells and look for genes that are visual outliers on a scatter plot. First we define some plotting functions to make labeling genes on the plots a bit easier.

LabelPoint <- function(plot, genes, exp.mat, adj.x.t = 0, adj.y.t = 0, adj.x.s = 0, 
    adj.y.s = 0, text.size = 2.5, segment.size = 0.1) {
    for (i in genes) {
        x1 <- exp.mat[i, 1]
        y1 <- exp.mat[i, 2]
        plot <- plot + annotate("text", x = x1 + adj.x.t, y = y1 + adj.y.t, 
            label = i, size = text.size)
        plot <- plot + annotate("segment", x = x1 + adj.x.s, xend = x1, y = y1 + 
            adj.y.s, yend = y1, size = segment.size)
    }
    return(plot)
}

LabelUR <- function(plot, genes, exp.mat, adj.u.t = 0.1, adj.r.t = 0.15, adj.u.s = 0.05, 
    adj.r.s = 0.05, ...) {
    return(LabelPoint(plot, genes, exp.mat, adj.y.t = adj.u.t, adj.x.t = adj.r.t, 
        adj.y.s = adj.u.s, adj.x.s = adj.r.s, ...))
}

LabelUL <- function(plot, genes, exp.mat, adj.u.t = 0.1, adj.l.t = 0.15, adj.u.s = 0.05, 
    adj.l.s = 0.05, ...) {
    return(LabelPoint(plot, genes, exp.mat, adj.y.t = adj.u.t, adj.x.t = -adj.l.t, 
        adj.y.s = adj.u.s, adj.x.s = -adj.l.s, ...))
}

Then, we take the average expression of both the stimulated and control naive T cells and CD14 monocyte populations and generate the scatter plots, highlighting genes that exhibit dramatic responses to interferon stimulation.

t.cells <- SubsetData(immune.combined, ident.use = "CD4 Naive T", subset.raw = T)
t.cells <- SetAllIdent(t.cells, id = "stim")
avg.t.cells <- log1p(AverageExpression(t.cells, show.progress = FALSE))
avg.t.cells$gene <- rownames(avg.t.cells)

cd14.mono <- SubsetData(immune.combined, ident.use = "CD14 Mono", subset.raw = T)
cd14.mono <- SetAllIdent(cd14.mono, id = "stim")
avg.cd14.mono <- log1p(AverageExpression(cd14.mono, show.progress = FALSE))
avg.cd14.mono$gene <- rownames(avg.cd14.mono)

genes.to.label1 = c("ISG15", "LY6E", "IFI6", "ISG20", "MX1")
genes.to.label2 = c("IFIT2", "IFIT1")
genes.to.label3 = c("CXCL10", "CCL8")
p1 <- ggplot(avg.t.cells, aes(CTRL, STIM)) + geom_point() + ggtitle("CD4 Naive T Cells")
p1 <- LabelUR(p1, genes = c(genes.to.label1, genes.to.label2), avg.t.cells, 
    adj.u.t = 0.3, adj.u.s = 0.23)
p1 <- LabelUL(p1, genes = genes.to.label3, avg.t.cells, adj.u.t = 0.5, adj.u.s = 0.4, 
    adj.l.t = 0.25, adj.l.s = 0.25)
p2 <- ggplot(avg.cd14.mono, aes(CTRL, STIM)) + geom_point() + ggtitle("CD14 Monocytes")
p2 <- LabelUR(p2, genes = c(genes.to.label1, genes.to.label3), avg.cd14.mono, 
    adj.u.t = 0.3, adj.u.s = 0.23)
p2 <- LabelUL(p2, genes = genes.to.label2, avg.cd14.mono, adj.u.t = 0.5, adj.u.s = 0.4, 
    adj.l.t = 0.25, adj.l.s = 0.25)
plot_grid(p1, p2)

As you can see, many of the same genes are upregulated in both of these cell types and likely represent a conserved interferon response pathway.

Because we are confident in having identified common cell types across condition, we can ask what genes change in different conditions for cells of the same type. First, we create a column in the meta.data slot to hold both the cell type and stimulation information and switch the current ident to that column. Then we use FindMarkers to find the genes that are different between stimulated and control B cells. Notice that many of the top genes that show up here are the same as the ones we plotted earlier as core interferon response genes. Additionally, genes like CXCL10 which we saw were specific to monocyte and B cell interferon response show up as highly significant in this list as well.

immune.combined@meta.data$celltype.stim <- paste0(immune.combined@ident, "_", 
    immune.combined@meta.data$stim)
immune.combined <- StashIdent(immune.combined, save.name = "celltype")
immune.combined <- SetAllIdent(immune.combined, id = "celltype.stim")
b.interferon.response <- FindMarkers(immune.combined, ident.1 = "B_STIM", ident.2 = "B_CTRL", 
    print.bar = FALSE)
head(b.interferon.response, 15)
##                 p_val avg_logFC pct.1 pct.2     p_val_adj
## ISG15   1.507485e-166 3.2430775 0.998 0.235 2.118469e-162
## IFIT3   1.197089e-160 3.1181217 0.960 0.051 1.682269e-156
## IFI6    6.168886e-158 2.9295219 0.959 0.078 8.669135e-154
## ISG20   1.915755e-156 2.0416453 1.000 0.664 2.692210e-152
## IFIT1   5.639346e-146 2.8623180 0.905 0.030 7.924973e-142
## MX1     8.264417e-131 2.2634637 0.912 0.122 1.161398e-126
## LY6E    8.874270e-126 2.1880659 0.893 0.143 1.247101e-121
## IFIT2   3.854507e-116 2.5589116 0.793 0.037 5.416739e-112
## TNFSF10 1.053693e-113 2.6153254 0.769 0.021 1.480755e-109
## B2M     4.751051e-104 0.4375721 1.000 1.000 6.676652e-100
## IRF7    1.019016e-101 1.8399457 0.845 0.184  1.432023e-97
## PLSCR1   4.454230e-99 1.9390512 0.793 0.120  6.259529e-95
## CXCL10   9.071726e-94 3.7461707 0.663 0.007  1.274850e-89
## UBE2L6   1.293306e-86 1.4651257 0.855 0.304  1.817484e-82
## HERC5    4.014929e-82 1.9416847 0.625 0.023  5.642180e-78

Another useful way to visualize these changes in gene expression is with the FeatureHeatmap function. This will display FeaturePlots of the list of given genes, split by a grouping variable (stimulation condition here). Genes such as CD3D and GNLY are canonical cell type markers (for T cells and NK/CD8 T cells) that are virtually unaffected by interferon stimulation and display similar gene expression patterns in the control and stimulated group. IFI6 and ISG15, on the other hand, are core interferon response genes and are upregulated accordingly in all cell types. Finally, CD14 and CXCL10 are genes that show a cell type specific interferon response. CD14 expression decreases after stimulation in CD14 monocytes, which could lead to misclassification in a supervised analysis framework, underscoring the value of integrated analysis. CXCL10 shows a distinct upregulation in monocytes and B cells after interferon stimulation but not in other cell types.

FeatureHeatmap(immune.combined, features.plot = c("CD3D", "GNLY", "IFI6", "ISG15", 
    "CD14", "CXCL10"), group.by = "stim", pt.size = 0.25, key.position = "top", 
    max.exp = 3)

saveRDS(immune.combined, file = "~/Projects/datasets/immune_combined.rds")