This tutorial walks through a basic alignment of two different datasets of peripheral blood mononuclear cells (PBMCs): one provided by 10X genomics and one from the recent Seq-well paper Gierahn et. al, 2017. The gene expression matrices can be found here. We first read in the two count matrices and set up the Seurat objects.

Next, we select the genes we want to use in the alignment procedure. Here we take the union of the top 2,000 genes with the highest dispersion (var/mean) from both datasets.

library(Seurat)
# cowplot enables side-by-side ggplots
library(cowplot)

# load data
seqwell.data <- read.table(file = paste0("~/Downloads/IntegratedAnalysis_ExpressionMatrices/", 
    "pbmc_SeqWell.expressionMatrix.txt"))
tenx.data <- read.table(file = paste0("~/Downloads/IntegratedAnalysis_ExpressionMatrices/", 
    "pbmc_10X.expressionMatrix.txt"))

# setup Seurat objects since both count matrices have already filtered
# cells, we do no additional filtering here

seqwell <- CreateSeuratObject(raw.data = seqwell.data)
seqwell <- NormalizeData(object = seqwell)
seqwell <- ScaleData(object = seqwell)
seqwell <- FindVariableGenes(object = seqwell, do.plot = FALSE)

tenx <- CreateSeuratObject(raw.data = tenx.data)
tenx <- NormalizeData(object = tenx)
tenx <- ScaleData(object = tenx)
tenx <- FindVariableGenes(object = tenx, do.plot = FALSE)

# we will take the union of the top 2k variable genes in each dataset for
# alignment note that we use 1k genes in the manuscript examples, you can
# try this here with negligible changes to the overall results
hvg.seqwell <- rownames(x = head(x = seqwell@hvg.info, n = 2000))
hvg.tenx <- rownames(x = head(x = tenx@hvg.info, n = 2000))
hvg.union <- union(x = hvg.seqwell, y = hvg.tenx)

# lastly, we set the 'protocol' in each dataset for easy identification
# later it will be transferred to the merged object in RunCCA
tenx@meta.data[, "protocol"] <- "10X"
seqwell@meta.data[, "protocol"] <- "SeqWell"

We next run a canonical correlation analysis to identify common sources of variation between the two datasets. RunCCA will also combine the two objects into a single object and stores the canonical correlation vectors (the vectors that project each dataset into the maximally correlated subspaces). We also store the original dataset identity as a column in object@meta.data

pbmc <- RunCCA(object = tenx, object2 = seqwell, genes.use = hvg.union)
# visualize results of CCA plot CC1 versus CC2 and look at a violin plot
p1 <- DimPlot(object = pbmc, reduction.use = "cca", group.by = "protocol", pt.size = 0.5, 
    do.return = TRUE)
p2 <- VlnPlot(object = pbmc, features.plot = "CC1", group.by = "protocol", do.return = TRUE)
plot_grid(p1, p2)