Pan-Human Azimuth

The Azimuth project is a series of computational tools for reference mapping of single-cell data. As a powerful complement to manual annotation and exploration workflows, reference-mapping pipelines use existing knowledge to help automate the annotation and interpretation of new datasets.

We are excited to introduce our initial release of Pan-Human Azimuth, a neural network classifier that aims to annotate human single-cell/single-nucleus RNA-sequencing experiments, across tissues and technologies, into a consistent and hierarchical cell ontology. Our first iteration was trained on data from 23 different tissues, encompassing 380 different cell types at the highest resolution. Cell types are organized into a unified cell ontology based on an adapted version of the DISCO reference.

While further releases and a preprint are coming soon, we are releasing an open-source Python package along with an R interface for users to annotate their datasets. The development of Azimuth is led by the New York Genome Center Mapping Component as part of the NIH Human BioMolecular Atlas Program (HuBMAP).

Quick Start Guide: R interface

For R users, the CloudAzimuth function runs directly on Seurat objects and adds annotations and embeddings to the object. Computation occurs in the cloud.

# Installation
devtools::install_github("satijalab/AzimuthAPI")

# Annotate a Seurat object
pbmc3k <- CloudAzimuth(pbmc3k)

# View results
head(pbmc3k@meta.data)

For more information, please see our Pan-Human Azimuth R API Vignette.

Quick Start Guide: Python package

For Python users, the panhumanpy package enables users to run Pan-Human Azimuth locally. The Python package takes an anndata object as input, and can be run either from the command line or interactively in Python.

# Installation in a fresh conda env with Python 3.9+
# Basic installation (CPU only)
pip install git+https://github.com/satijalab/panhumanpy.git

# Installation with GPU support (quoted so that shells such as zsh do not expand the brackets)
pip install "git+https://github.com/satijalab/panhumanpy.git#egg=panhumanpy[gpu]"

Running from the command line:

# view all command line options
annotate --help

# generates pbmc3k_ANN.h5ad
annotate pbmc3k.h5ad

Usage in Python:

import panhumanpy as ph
import anndata as ad

# View documentation on high-level interface
help(ph.AzimuthNN)

# View documentation on low-level flexible interface
help(ph.AzimuthNN_base)

# Read in anndata object (can also pass filepath directly)
pbmc3k = ad.read_h5ad("pbmc3k.h5ad")

# High-level interface
azimuth = ph.AzimuthNN(pbmc3k)
cell_metadata = azimuth.cells_meta
embeddings = azimuth.azimuth_embed()
umap = azimuth.azimuth_umap()

Support and Resources

For more information and support:

GitHub Repository: panhumanpy
R API Vignette

FAQ

What data types are supported?
- Pan-Human Azimuth was trained on data from 23 human tissues: Adipose, Adrenal gland, Bladder, Blood, Bone marrow, Brain, Breast, Eye, Gut, Heart, Kidney, Liver, Lung, Lymph node, Ovary, Pancreas, Placenta, Skeletal muscle, Skin, Spleen, Stomach, Testis, Thymus.
- The training dataset includes both scRNA-seq and snRNA-seq data, and spans droplet-based, plate-based, and combinatorial indexing-based technologies.
- We excluded cancer datasets from the training process.
- In general, we find that query datasets that fall within these parameters can be well-annotated using Pan-Human Azimuth.

What pre-processing steps are required?
- The R interface takes a Seurat object as input, and the Python interface takes an anndata object.
- Users should perform QC (e.g. filtering cells on UMI and mitochondrial thresholds) prior to annotation.
- R users should perform standard log-normalization with the NormalizeData function prior to annotating, as described in the vignette.
- Python users can either perform standard normalization of their data prior to annotation, or provide an un-normalized object of integer counts (the package will then perform normalization).
- Variable gene selection is not required (users should keep all quantified genes in the object prior to annotation).
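
As a rough illustration of what "standard log-normalization" means, the sketch below mirrors the default behavior of Seurat's NormalizeData: each cell is scaled to 10,000 total counts and then transformed with log1p. It operates on a plain Python list of toy counts. The function name log_normalize is hypothetical and is not part of AzimuthAPI or panhumanpy, and the normalization performed inside those packages may differ in detail.

```python
import math

def log_normalize(counts, scale_factor=10_000):
    """Log-normalize one cell's raw counts: log1p(count / total * scale_factor)."""
    total = sum(counts)
    if total == 0:
        # An empty cell has nothing to normalize; QC should remove it upstream.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * scale_factor) for c in counts]

# Toy cell with four genes (hypothetical integer UMI counts).
cell = [0, 3, 1, 6]
normalized = log_normalize(cell)
print([round(v, 3) for v in normalized])
```

Genes with zero counts stay at zero, and the ordering of genes within a cell is preserved, so only the scale of the values changes.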

What are the outputs of the package?
- full_hierarchical_labels: A hierarchical cell type annotation that includes predictions at multiple levels of the cell type hierarchy. Example: Immune cell|Lymphoid cell|T/NK cell|T cell|CD4 T cell|Naive CD4 T cell
- final_level_softmax_prob: A prediction probability associated with the hierarchical label.
- azimuth_embed: A 128-dimensional embedding of cells that is generated from the encoding layer of the neural network, and is decoded to obtain annotations.
- To aid in interpretation, we also post-process the labels to make them shorter/easier to read, and also to normalize their levels of granularity across cells. These categories include final_level_labels, azimuth_fine, azimuth_medium, and azimuth_broad. These categories are described in more detail in our vignette.
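
To make the structure of full_hierarchical_labels concrete, here is a minimal sketch that splits one label into its levels. It assumes only what the example above shows, namely that levels are separated by "|" and ordered from broadest to most specific. The helper split_hierarchy is hypothetical and is not how the package derives azimuth_fine, azimuth_medium, or azimuth_broad, which are post-processed separately.

```python
def split_hierarchy(label, sep="|"):
    """Split a full hierarchical label into its levels, broadest first."""
    return [level.strip() for level in label.split(sep)]

# Example label taken from the text above.
label = "Immune cell|Lymphoid cell|T/NK cell|T cell|CD4 T cell|Naive CD4 T cell"
levels = split_hierarchy(label)

print(levels[0])    # broadest level: Immune cell
print(levels[-1])   # most specific level: Naive CD4 T cell
print(len(levels))  # depth of this annotation: 6
```

Because different cells may be annotated to different depths, the number of levels should not be assumed constant across a dataset.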

How can I perform QC on my annotations?
- The final_level_softmax_prob is a helpful QC measure. We often exclude cells with a score below 0.5.
- No QC measure is perfect, and we therefore strongly encourage users to identify and visualize differentially expressed genes for each defined cell group, as a manual confirmation of annotation accuracy. To assist users, we include the make_azimuth_QC_heatmaps function in the R API package, and demonstrate how to use it in our vignette.
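
The thresholding step described above can be sketched as follows. To stay self-contained, the example uses plain Python dictionaries as stand-ins for rows of the annotation metadata; the barcodes, labels, and probabilities are made up, and flag_low_confidence is a hypothetical helper rather than a function from either package. On a real metadata table, the same comparison would be applied to the final_level_softmax_prob column.

```python
def flag_low_confidence(cells, threshold=0.5):
    """Split cell records into (kept, flagged) by final_level_softmax_prob."""
    kept, flagged = [], []
    for cell in cells:
        if cell["final_level_softmax_prob"] < threshold:
            flagged.append(cell)
        else:
            kept.append(cell)
    return kept, flagged

# Toy records standing in for annotation metadata (hypothetical values).
cells = [
    {"barcode": "cell_1", "final_level_labels": "Naive CD4 T cell", "final_level_softmax_prob": 0.97},
    {"barcode": "cell_2", "final_level_labels": "Plasma cell", "final_level_softmax_prob": 0.31},
    {"barcode": "cell_3", "final_level_labels": "Naive CD4 T cell", "final_level_softmax_prob": 0.50},
]

kept, flagged = flag_low_confidence(cells)
print([c["barcode"] for c in kept])     # cells at or above the threshold
print([c["barcode"] for c in flagged])  # cells below the threshold
```

Flagging rather than deleting low-scoring cells keeps them available for inspection, which can be useful when a whole cluster scores poorly.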

Can I use Pan-Human Azimuth to annotate disease data?
- Yes. We routinely use Pan-Human Azimuth to annotate disease data. In many if not most cases, malignant cells retain a component of the molecular profile of their normal counterparts. Reference-mapping workflows (including Pan-Human Azimuth) are widely used to annotate both healthy and diseased cells into a common set of cell types, and then to perform within-cell-type differential expression to identify malignant shifts.
- However, the Pan-Human Azimuth reference does not contain disease-specific cell types. Malignant cells will be mapped to the most similar normal cell type. The QC measures we discuss above can be helpful to ensure that the annotations for malignant cells are useful for facilitating comparisons with healthy samples and the identification of disease-specific signatures.
- The Pan-Human Azimuth training dataset does not include cancer samples. Tumor cells often undergo substantial molecular transformation and do not resemble normal cells, and therefore may map poorly with Pan-Human Azimuth.

Are there size limitations to the datasets that can be mapped?
- Both the R and Python workflows are highly scalable. The R workflow runs in the cloud and processes ~50,000-100,000 cells/minute. The Python workflow achieves similar scalability on a standard laptop computer.
- Pan-Human Azimuth annotates and embeds each cell independently, so it's possible to break massive datasets into chunks and map them serially. We used this strategy to map >80M cells from the scBaseCamp repository in under a day.
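
Because each cell is annotated independently, serial chunking only needs a way to walk through the dataset in index ranges and concatenate the per-cell results. The sketch below shows that pattern in plain Python. The names chunk_ranges, annotate_in_chunks, and fake_annotate are hypothetical: fake_annotate is a stand-in for whatever call actually annotates a slice of cells, and this is not the pipeline we used for scBaseCamp.

```python
def chunk_ranges(n_cells, chunk_size):
    """Yield (start, stop) index pairs covering range(n_cells) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_cells, chunk_size):
        yield start, min(start + chunk_size, n_cells)

def annotate_in_chunks(n_cells, chunk_size, annotate_chunk):
    """Call annotate_chunk(start, stop) serially and concatenate the results."""
    results = []
    for start, stop in chunk_ranges(n_cells, chunk_size):
        results.extend(annotate_chunk(start, stop))
    return results

# Stand-in annotator (hypothetical): returns one placeholder label per cell.
def fake_annotate(start, stop):
    return [f"label_{i}" for i in range(start, stop)]

ranges = list(chunk_ranges(250_000, 100_000))
labels = annotate_in_chunks(250_000, 100_000, fake_annotate)
print(ranges)       # [(0, 100000), (100000, 200000), (200000, 250000)]
print(len(labels))  # 250000
```

With an anndata object, each (start, stop) pair could be used to take a row slice such as adata[start:stop] before annotation, so that only one chunk needs to be processed at a time.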

Are future releases planned?
- Yes! We’re excited to include new features soon, including:
- Improved QC: New QC metrics that are better calibrated than softmax probabilities and that capture uncertainty at multiple levels of the hierarchy.
- Identification of low-quality cells: In the current release, low-quality cells (for example, those with <100 UMIs) largely map to other cell types with low amounts of RNA, including Plasma cells and abT (entry) cells. We are working to address this in a future release and to explicitly label these cells as low-quality.
- Higher-resolution annotations: We are continuing to update our reference with higher-resolution annotations, and to include more recently discovered cell types as well.