Seurat - Guided Clustering Tutorial
Compiled: March 30, 2017
Setup the Seurat Object
For this tutorial, we will be analyzing a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here.
We start by reading in the data. All features in Seurat have been configured to work with both regular and sparse matrices, but sparse matrices result in significant memory and speed savings for Drop-seq/inDrop/10x data.
library(Seurat)
library(dplyr)
library(Matrix)

# Load the PBMC dataset
pbmc.data <- Read10X("~/Downloads/filtered_gene_bc_matrices/hg19/")

# Examine the memory savings between regular and sparse matrices
dense.size <- object.size(as.matrix(pbmc.data))
dense.size
## 709264728 bytes
sparse.size <- object.size(pbmc.data)
sparse.size
## 38715120 bytes
dense.size/sparse.size
## 18.3200963344554 bytes
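To see where the savings come from without downloading the 10x files, the comparison above can be repeated on synthetic data. This is a minimal sketch: the matrix dimensions, the 5% density, and the names toy.sparse, toy.dense and size.ratio are assumptions chosen for illustration, not properties of the PBMC dataset.

```r
library(Matrix)

set.seed(1)

# Hypothetical toy matrix: 2,000 genes x 500 cells, about 5% non-zero entries.
# rand.x supplies the non-zero values; here they are made-up positive counts.
toy.sparse <- rsparsematrix(nrow = 2000, ncol = 500, density = 0.05,
                            rand.x = function(n) rpois(n, lambda = 2) + 1)

# Dense copy of the same data: every zero is stored explicitly
toy.dense <- as.matrix(toy.sparse)

sparse.bytes <- as.numeric(object.size(toy.sparse))
dense.bytes <- as.numeric(object.size(toy.dense))

# Ratio of dense to sparse storage; values above 1 mean the sparse form is smaller
size.ratio <- dense.bytes / sparse.bytes
size.ratio
```

The sparser the matrix, the larger this ratio becomes, because the sparse representation stores only the non-zero entries and their positions.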
# Initialize the Seurat object with the raw (non-normalized) data
# Note that this is slightly different than the older Seurat workflow, where log-normalized values were passed in directly.
# You can continue to pass in log-normalized values, just set do.logNormalize=F in the next step.
pbmc <- new("seurat", raw.data = pbmc.data)

# Keep all genes expressed in >= 3 cells, keep all cells with >= 200 genes
# Perform log-normalization, first scaling each cell to a total of 1e4 molecules (as in Macosko et al. Cell 2015)
pbmc <- Setup(pbmc, min.cells = 3, min.genes = 200, do.logNormalize = T, total.expr = 1e4, project = "10X_PBMC")
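The filtering and log-normalization described in the comments above can be illustrated on a toy matrix in base R. This is a sketch of the idea rather than Seurat's internal code: the count values, the scaled-down thresholds, and the order in which the two filters are applied are all assumptions made for illustration.

```r
# Toy raw count matrix: rows are genes, columns are cells (values are made up)
counts <- matrix(c(0, 3, 1,
                   2, 0, 0,
                   5, 1, 0,
                   0, 0, 4),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3)))

# Hypothetical thresholds, scaled down to suit the toy matrix
min.cells <- 2  # keep genes detected in at least this many cells
min.genes <- 2  # keep cells with at least this many detected genes

# Assumed order: filter cells first, then genes among the remaining cells
keep.cells <- colSums(counts > 0) >= min.genes
keep.genes <- rowSums(counts[, keep.cells, drop = FALSE] > 0) >= min.cells
filtered <- counts[keep.genes, keep.cells, drop = FALSE]

# Scale each cell (column) to a total of total.expr molecules, then take log(1 + x)
total.expr <- 1e4
scaled <- sweep(filtered, 2, colSums(filtered), "/") * total.expr
log.norm <- log1p(scaled)

# expm1() inverts log1p(), so each cell sums back to total.expr on the linear scale
colSums(expm1(log.norm))
```

Scaling every cell to the same total before taking logs removes differences in sequencing depth between cells, so the remaining variation reflects relative expression.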
Basic QC and selecting cells for further analysis
While the setup function imposes a basic minimum gene cutoff, you may want to filter out cells at this stage based on technical or biological parameters. Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. In the example below, we visualize gene and molecule counts, plot their relationship, and exclude cells with a clear outlier number of genes detected as potential multiplets. Of course, this is not a guaranteed method to exclude cell doublets, but we include it as an example of filtering user-defined outlier cells. We also filter cells based on the percentage of mitochondrial genes present.
# The number of genes and UMIs (nGene and nUMI) are automatically calculated for every object by Seurat.
# For non-UMI data, nUMI represents the sum of the non-normalized values within a cell.
# We calculate the percentage of mitochondrial genes here and store it in percent.mito using AddMetaData.
# The % of UMIs mapping to MT-genes is a common scRNA-seq QC metric.
# NOTE: You must have the Matrix package loaded to calculate the percent.mito values.
mito.genes <- grep("^MT-", rownames(pbmc@data), value = T)
percent.mito <- colSums(expm1(pbmc@data[mito.genes, ]))/colSums(expm1(pbmc@data))

# AddMetaData adds columns to object@data.info, and is a great place to stash QC stats
pbmc <- AddMetaData(pbmc, percent.mito, "percent.mito")
VlnPlot(pbmc, c("nGene", "nUMI", "percent.mito"), nCol = 3)
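The percent.mito calculation can be checked by hand on a small example that does not require a Seurat object. The sketch below uses base R only; the count values, the cell names, and the 0.05 cutoff are hypothetical and chosen so the arithmetic is easy to follow.

```r
# Toy raw counts: two mitochondrial genes (prefix "MT-") and two nuclear genes.
# Every cell has 100 molecules in total, so the fractions are easy to verify.
counts <- matrix(c( 1, 20,  2,
                    1, 10,  0,
                   50, 40, 60,
                   48, 30, 38),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("MT-CO1", "MT-ND1", "CD3E", "LYZ"),
                                 c("cellA", "cellB", "cellC")))

# Log-normalized values, mimicking what is stored after log-normalization
log.norm <- log1p(sweep(counts, 2, colSums(counts), "/") * 1e4)

mito.genes <- grep("^MT-", rownames(log.norm), value = TRUE)

# expm1() undoes log1p(), so the ratio is taken on the linear scale
percent.mito <- colSums(expm1(log.norm[mito.genes, , drop = FALSE])) /
  colSums(expm1(log.norm))
percent.mito

# Hypothetical cutoff for illustration; choose thresholds from your own QC plots
mito.cutoff <- 0.05
cells.keep <- names(percent.mito)[percent.mito < mito.cutoff]
cells.keep
```

Note that, despite its name, percent.mito computed this way is a fraction between 0 and 1, so a cutoff of 0.05 corresponds to 5% of molecules mapping to mitochondrial genes.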