Seurat - Guided Clustering Tutorial
Compiled: July 26, 2017
Setup the Seurat Object
For this tutorial, we will be analyzing a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here.
We start by reading in the data. All features in Seurat have been configured to work with sparse matrices, which results in significant memory and speed savings for Drop-seq/inDrop/10x data.
library(Seurat)
library(dplyr)
library(Matrix)

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "~/Downloads/filtered_gene_bc_matrices/hg19/")

# Examine the memory savings between regular and sparse matrices
dense.size <- object.size(x = as.matrix(x = pbmc.data))
dense.size
## 709264728 bytes
sparse.size <- object.size(x = pbmc.data)
sparse.size
## 38715120 bytes
# Ratio of dense to sparse size
dense.size/sparse.size
## 18.3 bytes
# Initialize the Seurat object with the raw (non-normalized) data. Keep all
# genes expressed in >= 3 cells (~0.1% of the data). Keep all cells with at
# least 200 detected genes.
pbmc <- CreateSeuratObject(raw.data = pbmc.data, min.cells = 3, min.genes = 200,
    project = "10X_PBMC")
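To make the two thresholds concrete, the sketch below applies the same kind of filter to a toy matrix using only the Matrix package. The gene and cell names, the counts, and the scaled-down thresholds are all hypothetical. This only approximates what min.cells and min.genes do: here both filters are computed on the unfiltered matrix with >= comparisons, whereas the order of the filters and the strictness of each comparison inside CreateSeuratObject are assumptions we have not checked against the Seurat source.

```r
library(Matrix)

# Toy count matrix: 5 genes x 4 cells (all names and values are hypothetical)
counts <- Matrix(
  c(1, 0, 0, 2,
    0, 0, 0, 3,
    4, 1, 1, 2,
    0, 0, 0, 0,
    2, 5, 0, 1),
  nrow = 5, ncol = 4, byrow = TRUE, sparse = TRUE,
  dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4))
)

# Hypothetical thresholds, scaled down for the toy data
min.cells <- 2  # keep genes detected in at least 2 cells
min.genes <- 2  # keep cells with at least 2 detected genes

# A gene counts as "detected" in a cell when its count is above zero
cells.per.gene <- rowSums(counts > 0)
genes.per.cell <- colSums(counts > 0)

# Apply both filters; the result stays a sparse matrix
filtered <- counts[cells.per.gene >= min.cells, genes.per.cell >= min.genes,
    drop = FALSE]
dim(filtered)
```

In this toy case gene2 and gene4 are dropped because they are detected in fewer than two cells, and cell3 is dropped because it has only one detected gene.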
Standard pre-processing workflow
The steps below encompass the standard pre-processing workflow for scRNA-seq data in Seurat. These represent the creation of a Seurat object, the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable genes. In previous versions, we grouped many of these steps together in the Setup function, but in v2, we separate these steps into a clear and sequential workflow.
QC and selecting cells for further analysis
While CreateSeuratObject imposes a basic minimum gene cutoff, you may want to filter out cells at this stage based on technical or biological parameters. Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. In the example below, we visualize gene and molecule counts, plot their relationship, and exclude cells with a clear outlier number of genes detected as potential multiplets. Of course, this is not a guaranteed method to exclude cell doublets, but we include this as an example of filtering user-defined outlier cells. We also filter cells based on the percentage of mitochondrial genes present.
# The number of genes and UMIs (nGene and nUMI) are automatically calculated
# for every object by Seurat. For non-UMI data, nUMI represents the sum of
# the non-normalized values within a cell. We calculate the percentage of
# mitochondrial genes here and store it in percent.mito using AddMetaData.
# We use object@raw.data since this represents non-transformed and
# non-log-normalized counts. The % of UMI mapping to MT-genes is a common
# scRNA-seq QC metric. NOTE: You must have the Matrix package loaded to
# calculate the percent.mito values.
mito.genes <- grep(pattern = "^MT-", x = rownames(x = pbmc@data), value = TRUE)
percent.mito <- colSums(pbmc@raw.data[mito.genes, ])/colSums(pbmc@raw.data)

# AddMetaData adds columns to object@meta.data, and is a great place to
# stash QC stats
pbmc <- AddMetaData(object = pbmc, metadata = percent.mito, col.name = "percent.mito")
VlnPlot(object = pbmc, features.plot = c("nGene", "nUMI", "percent.mito"), nCol = 3)
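The percent.mito calculation can be tried without the PBMC data. Below is a minimal sketch on a toy sparse matrix, using only the Matrix package; the gene and cell names, the counts, and the 0.3 cutoff are all hypothetical and chosen only for illustration. Note that, despite its name, the value computed this way is a fraction between 0 and 1 rather than a percentage.

```r
library(Matrix)

# Toy raw counts: 4 genes x 3 cells (hypothetical names and values)
raw.counts <- Matrix(
  c(10, 0, 9,
     5, 5, 9,
     3, 0, 1,
     2, 5, 1),
  nrow = 4, ncol = 3, byrow = TRUE, sparse = TRUE,
  dimnames = list(c("GAPDH", "CD3E", "MT-CO1", "MT-ND1"),
                  c("cellA", "cellB", "cellC"))
)

# Mitochondrial genes are identified by the "MT-" prefix, as in the code above
mito.genes <- grep(pattern = "^MT-", x = rownames(raw.counts), value = TRUE)

# Per-cell fraction of counts that come from mitochondrial genes
percent.mito <- colSums(raw.counts[mito.genes, , drop = FALSE]) / colSums(raw.counts)

# Hypothetical cutoff: flag cells with a high mitochondrial fraction
max.mito <- 0.3
keep.cells <- colnames(raw.counts)[percent.mito < max.mito]
keep.cells
```

Here cellB, where half of all counts map to mitochondrial genes, is the only cell excluded by the toy cutoff.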