This tutorial walks through a basic alignment of two different datasets of peripheral blood mononuclear cells (PBMCs): one provided by 10X genomics and one from the recent Seq-well paper Gierahn et. al, 2017. The gene expression matrices can be found here. We first read in the two count matrices and set up the Seurat objects.

Next, we select the genes we want to use in the alignment procedure. Here we take the union of the top 2,000 genes with the highest dispersion (var/mean) from both datasets.

library(Seurat)
# cowplot enables side-by-side ggplots
library(cowplot)

# load data
seqwell.data <- read.table(file = paste0("~/Downloads/IntegratedAnalysis_ExpressionMatrices/", 
    "pbmc_SeqWell.expressionMatrix.txt"))
tenx.data <- read.table(file = paste0("~/Downloads/IntegratedAnalysis_ExpressionMatrices/", 
    "pbmc_10X.expressionMatrix.txt"))

# setup Seurat objects since both count matrices have already filtered
# cells, we do no additional filtering here

seqwell <- CreateSeuratObject(raw.data = seqwell.data)
seqwell <- NormalizeData(object = seqwell)
seqwell <- ScaleData(object = seqwell)
seqwell <- FindVariableGenes(object = seqwell, do.plot = FALSE)

tenx <- CreateSeuratObject(raw.data = tenx.data)
tenx <- NormalizeData(object = tenx)
tenx <- ScaleData(object = tenx)
tenx <- FindVariableGenes(object = tenx, do.plot = FALSE)

# we will take the union of the top 2k variable genes in each dataset for
# alignment note that we use 1k genes in the manuscript examples, you can
# try this here with negligible changes to the overall results
hvg.seqwell <- rownames(x = head(x = seqwell@hvg.info, n = 2000))
hvg.tenx <- rownames(x = head(x = tenx@hvg.info, n = 2000))
hvg.union <- union(x = hvg.seqwell, y = hvg.tenx)

# lastly, we set the 'protocol' in each dataset for easy identification
# later it will be transferred to the merged object in RunCCA
tenx@meta.data[, "protocol"] <- "10X"
seqwell@meta.data[, "protocol"] <- "SeqWell"

We next run a canonical correlation analysis to identify common sources of variation between the two datasets. RunCCA will also combine the two objects into a single object and stores the canonical correlation vectors (the vectors that project each dataset into the maximally correlated subspaces). We also store the original dataset identity as a column in object@meta.data

pbmc <- RunCCA(object = tenx, object2 = seqwell, genes.use = hvg.union)
# visualize results of CCA plot CC1 versus CC2 and look at a violin plot
p1 <- DimPlot(object = pbmc, reduction.use = "cca", group.by = "protocol", pt.size = 0.5, 
    do.return = TRUE)
p2 <- VlnPlot(object = pbmc, features.plot = "CC1", group.by = "protocol", do.return = TRUE)
plot_grid(p1, p2)

PrintDim(object = pbmc, reduction.type = "cca", dims.print = 1:2, genes.print = 10)
## [1] "CC1"
##  [1] "FTL"    "TYROBP" "AIF1"   "FCER1G" "LYZ"    "FTH1"   "CTSS"  
##  [8] "CST3"   "S100A9" "CD68"  
## [1] ""
##  [1] "RPL31"  "RPS27A" "RPS29"  "RPS27"  "RPS12"  "RPS25"  "RPS6"  
##  [8] "RPL30"  "RPS15A" "RPS14" 
## [1] ""
## [1] ""
## [1] "CC2"
##  [1] "HLA-DRA"  "RPS16"    "HLA-DRB1" "RPS11"    "PABPC1"   "HLA-DRB5"
##  [7] "RPS9"     "HLA-DQA1" "RPS8"     "RPL8"    
## [1] ""
##  [1] "NKG7"   "CCL5"   "CST7"   "GZMB"   "PRF1"   "GZMA"   "GNLY"  
##  [8] "FGFBP2" "KLRD1"  "CTSW"  
## [1] ""
## [1] ""

You can see, for example, that CC1 and CC2 separate myeloid from lymphoid cells in both datasets, but the values remain ‘shifted’ relative to each other. We need to choose CCs for downstream analysis and then ‘align them’

Before this, we search for cells whose expression level cannot be well-explained by low-dimensional CCA, compared to low-dimensional PCA.

The problem of choosing CCs for downstream analysis is similar to choosing PCs for clustering. We are developing resampling based procedures for this, but here explore the CC dimensions as we have previously demonstrated for PCA. We begin to see drop-off in signal after CC13, so we chose CC1-13 for analysis. You can try modifying this parameter (i.e. 1-15 or even 1-20) without significant changes in the results

DimHeatmap(object = pbmc, reduction.type = "cca", cells.use = 500, dim.use = 1:9, 
    do.balanced = TRUE)

DimHeatmap(object = pbmc, reduction.type = "cca", cells.use = 500, dim.use = 10:18, 
    do.balanced = TRUE)

Before we align the subspaces, we first search for cells whose expression profile cannot be well-explained by low-dimensional CCA, compared to low-dimensional PCA.

pbmc <- CalcVarExpRatio(object = pbmc, reduction.type = "pca", grouping.var = "protocol", 
    dims.use = 1:13)

# We discard cells where the variance explained by CCA is <2-fold (ratio <
# 0.5) compared to PCA
pbmc.all.save <- pbmc
pbmc <- SubsetData(object = pbmc, subset.name = "var.ratio.pca", accept.low = 0.5)

For illustrative purposes, we can also look at the discarded cells. Note that discarded cells tend to have lower gene counts, but a subset also express high levels of PF4. This is because megakaryocytes are only present in the 10X dataset, and thus are correctly discarded as ‘dataset-specific’]

pbmc.discard <- SubsetData(object = pbmc.all.save, subset.name = "var.ratio.pca", 
    accept.high = 0.5)
median(x = pbmc@meta.data[, "nGene"])
## [1] 843
median(x = pbmc.discard@meta.data[, "nGene"])
## [1] 624
VlnPlot(object = pbmc.discard, features.plot = "PF4", group.by = "protocol")

Now we align the CCA subspaces, which returns a new dimensional reduction called cca.aligned

pbmc <- AlignSubspace(object = pbmc, reduction.type = "cca", grouping.var = "protocol", 
    dims.align = 1:13)

Visualize the aligned CCA and perform integrated analysis

p1 <- VlnPlot(object = pbmc, features.plot = "ACC1", group.by = "protocol", 
    do.return = TRUE)
p2 <- VlnPlot(object = pbmc, features.plot = "ACC2", group.by = "protocol", 
    do.return = TRUE)
plot_grid(p1, p2)

Now we can run a single integrated analysis on all cells!

pbmc <- RunTSNE(object = pbmc, reduction.use = "cca.aligned", dims.use = 1:13, 
    do.fast = TRUE)
pbmc <- FindClusters(object = pbmc, reduction.type = "cca.aligned", dims.use = 1:13, 
    save.SNN = TRUE)
p1 <- TSNEPlot(object = pbmc, group.by = "protocol", do.return = TRUE, pt.size = 0.5)
p2 <- TSNEPlot(object = pbmc, do.return = TRUE, pt.size = 0.5)
plot_grid(p1, p2)

Now, we annotate the clusters as before based on canonical markers.

FeaturePlot(object = pbmc, features.plot = c("CD3D", "SELL", "S100A4", "CD8A", 
    "GNLY", "MS4A1", "FCGR3A", "HSP90AB1", "CCR7"), min.cutoff = "q9", cols.use = c("lightgrey", 
    "blue"), pt.size = 0.5)

new.ident <- c("CD14 Mono", "Memory CD4 T", "Naive CD4 T", "CD8 T", "NK", "B", 
    "CD16 Mono", "HS_Stress", "DC")
for (i in 0:8) {
    pbmc <- RenameIdent(object = pbmc, old.ident.name = i, new.ident.name = new.ident[i + 
        1])
}

We now find all PBMC clusters we found in both datasets in a single integrated analysis, but since we double our cell number we can identify

  1. A separation between memory/naive T cells
  2. A subpopulation of cells dominated by heat-shock stress in both datasets
  3. If you increase resolution higher (i.e. 1.2), you can also see distinct populations of monocytes (with strong type I IFN response), in both datasets

Find markers of memory vs naive cells in both datasets

mem_vs_naive <- FindConservedMarkers(object = pbmc, ident.1 = "Memory CD4 T", 
    ident.2 = "Naive CD4 T", grouping.var = "protocol")
head(x = mem_vs_naive, n = 10)
##           10X_p_val 10X_avg_logFC 10X_pct.1 10X_pct.2 10X_p_val_adj
## S100A4 7.384647e-77     1.1217560     0.926     0.640  2.424084e-72
## B2M    2.878392e-69     0.3531416     1.000     1.000  9.448611e-65
## ANXA1  2.795491e-40     0.8545913     0.781     0.450  9.176479e-36
## VIM    1.164031e-34     0.5229072     0.939     0.814  3.821048e-30
## IL32   3.245351e-33     0.6011188     0.926     0.758  1.065319e-28
## KLF6   3.599007e-33     0.6823026     0.777     0.452  1.181410e-28
## ANXA2  8.831785e-33     0.8377726     0.462     0.133  2.899122e-28
## HLA-A  3.829306e-31     0.3367590     0.998     0.974  1.257008e-26
## CD52   1.884301e-28     0.4021332     0.975     0.906  6.185407e-24
## CLIC1  1.005830e-27     0.7376683     0.631     0.315  3.301738e-23
##        SeqWell_p_val SeqWell_avg_logFC SeqWell_pct.1 SeqWell_pct.2
## S100A4  3.041466e-05         0.4425637         0.404         0.262
## B2M     1.912835e-15         0.2561213         1.000         1.000
## ANXA1   3.280437e-14         0.5700460         0.714         0.465
## VIM     3.439491e-06         0.3406627         0.832         0.718
## IL32    1.044413e-02         0.2705182         0.619         0.535
## KLF6    3.310565e-07         0.4886604         0.360         0.191
## ANXA2   6.924482e-08         0.5629709         0.289         0.126
## HLA-A   1.623153e-07         0.2511752         0.959         0.906
## CD52    3.264981e-04         0.2885633         0.841         0.756
## CLIC1   7.399546e-04         0.2951819         0.239         0.138
##        SeqWell_p_val_adj     max_pval minimump_p_val
## S100A4      9.983918e-01 3.041466e-05   1.476929e-76
## B2M         6.279071e-11 1.912835e-15   5.756785e-69
## ANXA1       1.076836e-09 3.280437e-14   5.590982e-40
## VIM         1.129047e-01 3.439491e-06   2.328062e-34
## IL32        1.000000e+00 1.044413e-02   6.490702e-33
## KLF6        1.086726e-02 3.310565e-07   7.198015e-33
## ANXA2       2.273031e-03 6.924482e-08   1.766357e-32
## HLA-A       5.328161e-03 1.623153e-07   7.658613e-31
## CD52        1.000000e+00 3.264981e-04   3.768602e-28
## CLIC1       1.000000e+00 7.399546e-04   2.011660e-27

We can also compare proportional shifts in the data. As can be seen in the barplot, the two patients profiled have very different composition

freq_table <- prop.table(x = table(pbmc@ident, pbmc@meta.data[, "protocol"]), 
    margin = 2)
barplot(height = freq_table)

freq_table
##               
##                        10X     SeqWell
##   CD14 Mono    0.179702404 0.386684783
##   Memory CD4 T 0.212132774 0.092119565
##   Naive CD4 T  0.206791301 0.092391304
##   CD8 T        0.121327738 0.104347826
##   NK           0.068294544 0.102989130
##   B            0.125524609 0.060869565
##   CD16 Mono    0.067149943 0.090217391
##   HS_Stress    0.009156810 0.042663043
##   DC           0.009919878 0.027717391
# A useful plotting functions ( valuable to confirm the alignment makes
# sense)
FeatureHeatmap(object = pbmc, features.plot = c("CD3D", "FCGR3A", "MS4A1"), 
    group.by = "protocol", sep.scale = TRUE, pt.size = 0.5, cols.use = c("lightgrey", 
        "blue"))

saveRDS(pbmc, file = "~/Projects/datasets/pbmc_alignment.rds")