Load in the data

This vignette demonstrates new features that allow users to analyze and explore multi-modal data with Seurat. While this is an initial release, we look forward to adding significant new functionality for multi-modal datasets in the future.

Here, we analyze a dataset of 8,617 cord blood mononuclear cells (CBMCs) produced with CITE-seq, in which we simultaneously measure single-cell transcriptomes and the expression of 11 surface proteins, whose levels are quantified with DNA-barcoded antibodies. First, we load two count matrices: one for the RNA measurements and one for the antibody-derived tags (ADT). You can download the ADT file here and the RNA file here.

# Load in the RNA UMI matrix

# Note that this dataset also contains ~5% mouse cells, which we can use
# as negative controls for the protein measurements. For this reason, each
# gene name in the expression matrix is prefixed with HUMAN_ or MOUSE_.
cbmc.rna <- read.csv("~/Downloads/GSE100866_CBMC_8K_13AB_10X-RNA_umi.csv.gz", 
    sep = ",", header = TRUE, row.names = 1)

# To make life a bit easier going forward, we're going to discard all but
# the top 100 most highly expressed mouse genes, and remove the 'HUMAN_'
# prefix from the human gene names
cbmc.rna.collapsed <- CollapseSpeciesExpressionMatrix(cbmc.rna)

# Load in the ADT UMI matrix
cbmc.adt <- read.csv("~/Downloads/GSE100866_CBMC_8K_13AB_10X-ADT_umi.csv.gz", 
    sep = ",", header = TRUE, row.names = 1)

# To avoid any confusion where genes and proteins might have the same name,
# we'll append 'CITE_' to each of the ADT rownames. This is not strictly
# necessary, but it helps for clarity
cbmc.citeseq <- cbmc.adt
rownames(cbmc.citeseq) <- paste0("CITE_", rownames(cbmc.adt))

# Lastly, we observed poor enrichment for CCR5, CCR7, and CD10, so we
# remove them from the matrix (optional)
cbmc.citeseq <- cbmc.citeseq[setdiff(rownames(cbmc.citeseq), c("CITE_CCR5", 
    "CITE_CCR7", "CITE_CD10")), ]
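
To make the species-collapsing step above more concrete, here is a minimal base R sketch on a made-up toy matrix. The helper `collapse_species` and its arguments are hypothetical names introduced for illustration; this is not Seurat's implementation of `CollapseSpeciesExpressionMatrix`. It assumes that "most highly expressed" means the largest total UMI count across all cells.

```r
# Toy UMI matrix (made-up values): rows are species-prefixed genes, columns are cells
toy <- matrix(
  c(5, 0, 3,
    1, 2, 0,
    0, 9, 1,
    0, 1, 0,
    4, 4, 4),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("HUMAN_CD3E", "HUMAN_CD19", "MOUSE_Actb", "MOUSE_Gapdh", "MOUSE_Cd3e"),
    c("cell1", "cell2", "cell3")
  )
)

# Hypothetical helper: keep every gene carrying keep_prefix (with the prefix
# stripped) plus the n_controls control-species genes with the highest total counts
collapse_species <- function(mat, keep_prefix = "HUMAN_",
                             control_prefix = "MOUSE_", n_controls = 100) {
  is_keep <- startsWith(rownames(mat), keep_prefix)
  is_ctrl <- startsWith(rownames(mat), control_prefix)

  # Rank control genes by total UMI count (assumption about the ranking rule)
  ctrl_totals <- sort(rowSums(mat[is_ctrl, , drop = FALSE]), decreasing = TRUE)
  top_ctrl <- names(ctrl_totals)[seq_len(min(n_controls, length(ctrl_totals)))]

  # Strip the prefix from the genes we keep in full
  kept <- mat[is_keep, , drop = FALSE]
  rownames(kept) <- sub(paste0("^", keep_prefix), "", rownames(kept))

  rbind(kept, mat[top_ctrl, , drop = FALSE])
}

collapsed <- collapse_species(toy, n_controls = 2)
print(rownames(collapsed))
```

With `n_controls = 2`, the lowest-count mouse gene in the toy matrix is dropped, while the human genes lose their prefix and the retained mouse genes keep theirs, so the two species remain distinguishable downstream.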

Set up a Seurat object, and cluster cells based on RNA expression

The steps below represent a quick clustering of the CBMCs based on the scRNA-seq data. For more detail on individual steps or more advanced options, see our PBMC clustering guided tutorial here.
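
As a point of reference for the log-normalization step in the workflow, the following base R sketch shows the usual form of the calculation on made-up counts: each cell's counts are divided by that cell's total, multiplied by a scale factor, and passed through `log1p`. The helper `log_normalize` is a hypothetical name, and the scale factor of 10,000 is assumed to match the default used by `NormalizeData`; check the function's documentation for your Seurat version.

```r
# Toy UMI counts (made-up values): rows are genes, columns are cells.
# matrix() fills column by column, so cellA = (1, 3, 0, 6) and cellB = (5, 0, 15, 0)
counts <- matrix(c(1, 3, 0, 6,
                   5, 0, 15, 0),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("cellA", "cellB")))

# Hypothetical helper illustrating log-normalization
log_normalize <- function(counts, scale_factor = 1e4) {
  depth <- colSums(counts)                       # total UMIs per cell
  scaled <- sweep(counts, 2, depth, "/") * scale_factor
  log1p(scaled)                                  # natural log of (1 + x)
}

norm <- log_normalize(counts)
print(round(norm, 3))
```

Dividing by the per-cell total removes differences in sequencing depth between cells, and `log1p` keeps zero counts at zero.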

cbmc <- CreateSeuratObject(raw.data = cbmc.rna.collapsed)

# standard log-normalization
cbmc <- NormalizeData(cbmc)

# choose ~1k variable genes
cbmc <- FindVariableGenes(cbmc, do.plot = FALSE, y.cutoff = 0.5)

# standard scaling (no regression)
cbmc <- ScaleData(cbmc, display.progress = FALSE)

# Run PCA, select 13 PCs for tSNE visualization and graph-based clustering
cbmc <- RunPCA(cbmc, pcs.print = 0)

cbmc <- FindClusters(cbmc, dims.use = 1:13, print.output = FALSE)
cbmc <- RunTSNE(cbmc, dims.use = 1:13)

# Find the markers that define each cluster, and use these to annotate the
# clusters. We use max.cells.per.ident to speed up the process.
cbmc.rna.markers <- FindAllMarkers(cbmc, max.cells.per.ident = 100, logfc.threshold = log(2), 
    only.pos = TRUE, min.diff.pct = 0.3, do.print = FALSE)

current.cluster.ids <- 0:15
# Note, for simplicity we are merging two CD14+ Mono clusters (that differ
# in the expression of HLA-DR genes), and two NK clusters (that differ in
# cell cycle stage)
new.cluster.ids <- c("CD4 T", "CD14+ Mono", "CD14+ Mono", "NK", "Mouse", "B", 
    "CD8 T", "CD16+ Mono", "Unknown", "CD34+", "Mk", "Eryth", "DC", "Mouse", 
    "pDC", "NK")
cbmc@ident <- plyr::mapvalues(x = cbmc@ident, from = current.cluster.ids, to = new.cluster.ids)
TSNEPlot(cbmc, do.label = TRUE, pt.size = 0.5)
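
The relabelling step above maps 16 numeric cluster IDs onto 13 distinct names, so clusters that share a name are merged. A minimal base R sketch of the same idea, using a named lookup vector in place of `plyr::mapvalues`, is shown below. The `cluster_ids` vector is a made-up stand-in for the object's identity factor, and the variable names are hypothetical.

```r
# Made-up cluster assignments for seven cells (stand-in for the real identities)
cluster_ids <- factor(c(0, 1, 2, 3, 15, 4, 1), levels = 0:15)

# Same labels as in the vignette, in cluster order 0 to 15
new_names <- c("CD4 T", "CD14+ Mono", "CD14+ Mono", "NK", "Mouse", "B",
               "CD8 T", "CD16+ Mono", "Unknown", "CD34+", "Mk", "Eryth",
               "DC", "Mouse", "pDC", "NK")

# The lookup only works if there is exactly one label per cluster ID
stopifnot(length(new_names) == nlevels(cluster_ids))
names(new_names) <- levels(cluster_ids)

# Look up each cell's label by its cluster ID; clusters that share a label
# (for example 1 and 2, or 3 and 15) collapse into a single factor level
relabelled <- factor(unname(new_names[as.character(cluster_ids)]))
print(table(relabelled))
```

Checking that the number of labels equals the number of clusters before mapping is a cheap guard: if clustering is rerun and returns a different number of clusters, the hard-coded label vector no longer applies.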