Seurat - Guided Clustering Tutorial
Compiled: December 10, 2018
Set up the Seurat Object
For this tutorial, we will be analyzing a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here.
We start by reading in the data. All features in Seurat have been configured to work with sparse matrices, which results in significant memory and speed savings for Drop-seq/inDrop/10x data.
library(dplyr)
library(Seurat)

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "~/Downloads/pbmc3k/filtered_gene_bc_matrices/hg19/")

# Examine the memory savings between regular and sparse matrices
dense.size <- object.size(x = as.matrix(x = pbmc.data))
dense.size
## 709548272 bytes
sparse.size <- object.size(x = pbmc.data)
sparse.size
## 29861992 bytes
dense.size / sparse.size
## 23.8 bytes
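The ratio printed above carries a "bytes" label, presumably because the result of the division keeps the `object_size` class; the value itself is a unitless factor. The sketch below shows one way to get a plain number. It uses a hypothetical toy matrix (`toy.sparse`, built with `Matrix::rsparsematrix`) in place of the PBMC data, so the exact ratio will differ from the tutorial's.

```r
library(Matrix)

# Hypothetical toy count matrix (not the PBMC data):
# 2,000 features x 500 cells with roughly 5% non-zero entries
set.seed(42)
toy.sparse <- rsparsematrix(
  nrow = 2000, ncol = 500, density = 0.05,
  rand.x = function(n) rpois(n, lambda = 2) + 1
)
toy.dense <- as.matrix(toy.sparse)

# as.numeric() drops the object_size class, so the ratio prints as a plain number
dense.bytes <- as.numeric(object.size(toy.dense))
sparse.bytes <- as.numeric(object.size(toy.sparse))
size.ratio <- dense.bytes / sparse.bytes
size.ratio
```

The sparser the matrix, the larger this ratio becomes, which is why the savings are substantial for droplet-based data where most entries are zero.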
# Initialize the Seurat object with the raw (non-normalized) data.
# Keep all features expressed in >= 3 cells (~0.1% of the data).
# Keep all cells with at least 200 detected features.
pbmc <- CreateSeuratObject(counts = pbmc.data, min.cells = 3, min.features = 200, project = "10X_PBMC")
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features)
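To make the `min.cells` and `min.features` thresholds concrete, here is a minimal sketch of equivalent filtering done by hand on a sparse matrix. It is an illustration only, not Seurat's internal code: the toy matrix `toy.counts`, the lowered thresholds, and the order of the two filters (cells first, then features) are all assumptions made for the example.

```r
library(Matrix)

# Hypothetical toy count matrix: 300 features x 50 cells, ~10% non-zero
set.seed(1)
toy.counts <- rsparsematrix(
  nrow = 300, ncol = 50, density = 0.1,
  rand.x = function(n) rpois(n, lambda = 2) + 1
)
rownames(toy.counts) <- paste0("gene", seq_len(nrow(toy.counts)))
colnames(toy.counts) <- paste0("cell", seq_len(ncol(toy.counts)))

# Thresholds scaled down for the toy data (the tutorial uses 3 and 200)
min.cells <- 3
min.features <- 20

# Keep cells in which at least min.features features are detected (count > 0)
features.per.cell <- Matrix::colSums(toy.counts > 0)
filtered <- toy.counts[, features.per.cell >= min.features, drop = FALSE]

# Keep features detected in at least min.cells of the remaining cells
cells.per.feature <- Matrix::rowSums(filtered > 0)
filtered <- filtered[cells.per.feature >= min.cells, , drop = FALSE]

dim(filtered)
```

Both thresholds count detections (non-zero entries), not total molecules, so a feature with a single high count in one cell would still be dropped by `min.cells = 3`.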
Standard pre-processing workflow
The steps below encompass the standard pre-processing workflow for scRNA-seq data in Seurat. These represent the creation of a Seurat object, the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features.
QC and selecting cells for further analysis
While CreateSeuratObject imposes a basic minimum feature cutoff, you may want to filter out cells at this stage based on technical or biological parameters. Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. In the example below, we visualize feature and molecule counts, plot their relationship, and exclude cells with a clear outlier number of detected features as potential multiplets. This is not a guaranteed method for excluding cell doublets, but we include it as an example of filtering user-defined outlier cells. We also filter cells based on the percentage of mitochondrial features present.
# The number of features and UMIs (nFeature_RNA and nCount_RNA) are automatically calculated for every object by Seurat.
# For non-UMI data, nCount_RNA represents the sum of the non-normalized values within a cell.
# We calculate the percentage of mitochondrial features here and store it in object metadata as `percent.mito`.
# We use raw count data since this represents non-transformed and non-log-normalized counts.
# The % of UMI mapping to MT-features is a common scRNA-seq QC metric.
# Note: as computed here, the value is a fraction between 0 and 1, not multiplied by 100.
mito.features <- grep(pattern = "^MT-", x = rownames(x = pbmc), value = TRUE)
percent.mito <- Matrix::colSums(x = GetAssayData(object = pbmc, slot = 'counts')[mito.features, ]) / Matrix::colSums(x = GetAssayData(object = pbmc, slot = 'counts'))

# The [[ operator can add columns to object metadata, and is a great place to stash QC stats
pbmc[['percent.mito']] <- percent.mito
VlnPlot(object = pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mito"), ncol = 3)
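The mitochondrial calculation above can be checked on a small matrix without a Seurat object. The sketch below is a minimal illustration under stated assumptions: the feature names, the cell names, and the counts in `toy.counts` are all made up, and `mito.fraction` is a hypothetical name chosen to stress that the result is a fraction.

```r
library(Matrix)

# Hypothetical toy counts: 6 features x 4 cells (values are made up)
set.seed(7)
feature.names <- c("MT-ND1", "MT-CO1", "CD3D", "MS4A1", "LYZ", "NKG7")
toy.counts <- Matrix(
  matrix(
    rpois(6 * 4, lambda = 5) + 1,
    nrow = 6,
    dimnames = list(feature.names, paste0("cell", 1:4))
  ),
  sparse = TRUE
)

# The "^MT-" pattern is anchored at the start of the name,
# so MS4A1 is not matched even though it begins with "M"
mito.features <- grep(pattern = "^MT-", x = rownames(toy.counts), value = TRUE)

# Per-cell fraction of counts that come from mitochondrial features
mito.fraction <- Matrix::colSums(toy.counts[mito.features, , drop = FALSE]) /
  Matrix::colSums(toy.counts)
mito.fraction
```

Multiplying `mito.fraction` by 100 would give a true percentage; whichever scale is stored, any later filtering threshold has to use the same scale.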