Pathway and Functional Enrichment Analysis Methods

Mikhail Dozmorov, Ph.D.
mikhail.dozmorov@vcuhealth.org

https://github.com/mdozmorov/presentations

Overview

Why enrichment analysis?
What is enrichment analysis?
Gene ontology and pathways
GENE ontology and pathways enrichment
GENOMIC REGIONS enrichment
Tools and references

Why enrichment analysis?

The human genome contains ~20,000-25,000 genes
Each gene has multiple functions
If 1,000 genes have changed in an experimental condition, it may be difficult to understand what they do

Birds of a feather flock together

Genes with similar expression patterns share similar functions
Similar (common) functions characterize a group of genes

People with similar genetic patterns are likely friends
Christakis NA, Fowler JH. "Friendship and natural selection." PNAS 2014 https://www.ncbi.nlm.nih.gov/pubmed/25024208

Why enrichment analysis?

High-level understanding of the biology behind gene expression – Interpretation!
Translating changes of hundreds/thousands of differentially expressed genes into a few biological processes (reducing dimensionality)

What is enrichment analysis

Enrichment analysis - summarizing common functions associated with a group of objects

What is enrichment analysis? – statistical definition

Enrichment analysis – detecting whether a group of objects has certain properties more (or less) frequently than expected by chance
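
One common way to make this definition concrete is a hypergeometric (one-sided Fisher's exact) test: given a universe of genes, a gene set with some property, and a list of selected genes, how likely is an overlap at least as large as the one observed? The sketch below is an illustration under stated assumptions, not a method taken from these slides; the function name and all numbers are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(universe_size, set_size, selected_size, overlap):
    """Upper-tail probability P(X >= overlap) when selected_size genes are
    drawn without replacement from a universe in which set_size genes
    carry the property of interest."""
    max_overlap = min(set_size, selected_size)
    total = comb(universe_size, selected_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, selected_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical numbers, for illustration only:
# 20,000 genes in total, 200 in the gene set, 1,000 selected, 25 in both.
p = hypergeom_enrichment_p(
    universe_size=20000, set_size=200, selected_size=1000, overlap=25
)
print(f"enrichment p-value: {p:.3g}")
```

Under these made-up numbers about 10 overlapping genes would be expected by chance (1,000 x 200 / 20,000), so an overlap of 25 yields a small p-value. Testing for depletion ("less frequently") would use the lower tail instead.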

Classification of genes

Gene set - a priori classification of genes into biologically relevant groups (sets)

Members of the same biochemical pathways
Genes annotated with the same molecular function
Transcripts expressed in the same cellular compartments
Co-regulated/co-expressed genes
Genes located on the same cytogenetic band
…

Annotation databases and ontologies

An annotation database annotates genes with functions or properties - sets of genes with shared functions
Structured prior knowledge about genes

Gene ontology

An ontology is a formal (hierarchical) representation of concepts and the relationships between them.
The objective of GO is to provide controlled vocabularies of terms for the description of gene products.
These terms are to be used as attributes of gene products, facilitating uniform queries across them.

Gene ontology hierarchy

Terms are related within a hierarchy using "is-a", "part-of" and other connectors
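
The hierarchy can be pictured as a directed graph in which each term points to its parents through a typed relation. The following is a minimal sketch with a toy hierarchy; the term names and edges are simplified for illustration and are not copied from the actual GO, and the `PARENTS` table and `ancestors` function are hypothetical names.

```python
# Toy hierarchy: child term -> list of (relation, parent term).
# Names and edges are simplified for illustration, not copied from GO.
PARENTS = {
    "mitotic cell cycle": [("is_a", "cell cycle")],
    "cell cycle": [("is_a", "cellular process")],
    "cellular process": [("is_a", "biological process")],
    "nuclear chromosome": [("is_a", "chromosome"), ("part_of", "nucleus")],
    "chromosome": [("is_a", "organelle")],
    "nucleus": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular component")],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("nuclear chromosome")))
```

A term can have more than one parent, so the structure is a graph rather than a simple tree. A gene annotated to a specific term is implicitly annotated to all of that term's ancestors, which is why enrichment tools usually propagate annotations upward before testing.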

Gene ontology structure

Gene ontology describes multiple levels of detail of gene function.

Molecular Function - the tasks performed by individual gene products; examples are transcription factor and DNA helicase

Biological Process - broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions

Cellular Component - subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex

Gene ontology database

http://geneontology.org/

https://www.ebi.ac.uk/QuickGO/

Gene ontologies are not created equal

Different levels of evidence:
- Experimental
- Computational analysis
- Author Statement
- Curator Statement
- Inferred from electronic annotation

http://geneontology.org/page/evidence-code-decision-tree
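
Because evidence levels differ, an analysis may restrict annotations to selected evidence categories before testing. The sketch below groups a subset of GO evidence codes under the five categories listed above; the grouping is abbreviated and should be checked against the GO documentation, and the annotation triples and term labels are hypothetical.

```python
# Abbreviated grouping of GO evidence codes by category (an assumption to
# verify against the GO evidence code documentation before real use).
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "author_statement": {"TAS", "NAS"},
    "curator_statement": {"IC", "ND"},
    "electronic": {"IEA"},
}

def filter_annotations(annotations, allowed_groups):
    """Keep (gene, term, evidence_code) triples whose code belongs to one
    of the allowed evidence groups."""
    allowed = set().union(*(EVIDENCE_GROUPS[g] for g in allowed_groups))
    return [a for a in annotations if a[2] in allowed]

# Hypothetical annotations: (gene, term label, evidence code)
annotations = [
    ("GENE_A", "TERM_1", "IDA"),
    ("GENE_A", "TERM_2", "IEA"),
    ("GENE_B", "TERM_1", "TAS"),
    ("GENE_C", "TERM_3", "IEA"),
]

curated = filter_annotations(annotations, ["experimental", "author_statement"])
print(curated)
```

Dropping electronically inferred (IEA) annotations trades coverage for reliability: fewer genes remain annotated, but each remaining annotation has been reviewed by a person.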

Gene ontologies are not created equal

http://amigo.geneontology.org/amigo/base_statistics

Gene ontologies for model organisms

Mouse Genome Database (MGD) and Gene Expression Database (GXD) (Mus musculus) http://www.informatics.jax.org/
Rat Genome Database (RGD) (Rattus norvegicus) http://rgd.mcw.edu/
FlyBase (Drosophila melanogaster) http://flybase.org/
Berkeley Drosophila Genome Project (BDGP) http://www.fruitfly.org/
WormBase (Caenorhabditis elegans) http://www.wormbase.org/
Zebrafish Information Network (ZFIN) (Danio rerio) http://zfin.org/
Saccharomyces Genome Database (SGD) (Saccharomyces cerevisiae) http://www.yeastgenome.org/
The Arabidopsis Information Resource (TAIR) (Arabidopsis thaliana) https://www.arabidopsis.org/
Gramene (grains, including rice, Oryza) http://www.gramene.org/
dictyBase (Dictyostelium discoideum) http://dictybase.org/
GeneDB (Schizosaccharomyces pombe, Plasmodium falciparum, Leishmania major and Trypanosoma brucei) http://www.genedb.org/

MSigDB - Molecular Signatures Database

http://software.broadinstitute.org/gsea/msigdb/

MSigDB - Molecular Signatures Database

https://github.com/stephenturner/msigdf

H, hallmark gene sets are coherently expressed signatures derived by aggregating many MSigDB gene sets to represent well-defined biological states or processes.
C1, positional gene sets for each human chromosome and cytogenetic band.
C2, curated gene sets from online pathway databases, publications in PubMed, and knowledge of domain experts.
C3, motif gene sets based on conserved cis-regulatory motifs from a comparative analysis of the human, mouse, rat, and dog genomes.
C4, computational gene sets defined by mining large collections of cancer-oriented microarray data.
C5, GO gene sets consist of genes annotated by the same GO terms.
C6, oncogenic signatures defined directly from microarray gene expression data from cancer gene perturbations.
C7, immunologic signatures defined directly from microarray gene expression data from immunologic studies.
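
MSigDB collections can be downloaded as GMT files: one gene set per line, tab-separated, with the set name first, a description second, and the member genes after that. A minimal reader might look like the sketch below; the `read_gmt` helper and the example set names and genes are hypothetical.

```python
import io

def read_gmt(handle):
    """Parse GMT-formatted lines into {set_name: set_of_genes}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets

# Hypothetical GMT content; a real file would be opened with open(path).
example = io.StringIO(
    "TOY_SET_A\thypothetical description\tGENE1\tGENE2\tGENE3\n"
    "TOY_SET_B\thypothetical description\tGENE2\tGENE4\n"
)
gene_sets = read_gmt(example)
for name, genes in gene_sets.items():
    print(name, len(genes))
```

Storing each gene set as a Python set makes the overlap with a list of selected genes a single intersection.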

Pathways

An ordered series of molecular events that leads to the creation of a new molecular product, or a change in a cellular state or process.
Genes often participate in multiple pathways – think about genes having multiple functions

http://biochemical-pathways.com/#/map/1
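
To see how much pathways share genes, pathway membership can be inverted into a gene-to-pathways lookup. This is a small sketch with hypothetical pathway and gene names, not data from any pathway database.

```python
from collections import defaultdict

# Hypothetical pathway membership
pathways = {
    "PATHWAY_X": {"GENE1", "GENE2", "GENE3"},
    "PATHWAY_Y": {"GENE2", "GENE4"},
    "PATHWAY_Z": {"GENE2", "GENE3", "GENE5"},
}

def invert(pathways):
    """Build {gene: set_of_pathways} from {pathway: set_of_genes}."""
    gene_to_pathways = defaultdict(set)
    for pathway, genes in pathways.items():
        for gene in genes:
            gene_to_pathways[gene].add(pathway)
    return dict(gene_to_pathways)

gene_to_pathways = invert(pathways)

# Genes that belong to more than one pathway
shared = {g: sorted(p) for g, p in gene_to_pathways.items() if len(p) > 1}
print(shared)
```

Shared genes mean that pathway-level results are not independent: one changed gene can contribute to the enrichment of several pathways at once.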

KEGG pathway database

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a curated database: a collection of biological information compiled from published material.
Includes information on genes, proteins, metabolic pathways, molecular interactions, and biochemical reactions associated with specific organisms
Provides a relationship (map) for how these components are organized in a cellular structure or reaction pathway.

http://www.genome.jp/kegg/

KEGG pathway diagram

Reactome

Curated human pathways encompassing metabolism, signaling, and other biological processes.
Every pathway is traceable to primary literature.

http://www.reactome.org/

Reactome pathway diagram

Other pathway databases

PathwayCommons, version 8 has over 42,000 pathways from 22 data sources, http://www.pathwaycommons.org/
PathGuide, lists ~550 pathway related databases, http://www.pathguide.org/
WikiPathways, community-curated pathways, http://wikipathways.org/

Genes to networks

GeneMANIA, networks based on different properties, http://genemania.org
STRING, protein-protein interaction networks, http://string-db.org
Genes2Networks, protein-protein interaction networks, http://amp.pharm.mssm.edu/X2K/#g2n

Enrichment analysis

Null hypothesis

Self-contained \(H_0\): genes in the gene set do not have any association with the phenotype
Problem: restrictive, uses information only from the gene set
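
A self-contained null hypothesis is typically tested by permuting the phenotype labels of the samples, which breaks any gene-phenotype association while leaving the gene set itself untouched. The sketch below is one possible illustration: the set-level statistic (mean absolute difference in group means) is an assumption chosen for simplicity, and the expression values, gene names, and function names are hypothetical.

```python
import random
from statistics import mean

def set_statistic(expression, labels, gene_set):
    """Mean absolute difference in group means over the genes in the set."""
    diffs = []
    for gene in gene_set:
        values = expression[gene]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        diffs.append(abs(mean(group1) - mean(group0)))
    return mean(diffs)

def self_contained_p(expression, labels, gene_set, n_perm=1000, seed=0):
    """Permutation p-value: shuffle sample labels, recompute the statistic
    using only genes in the set, and count values >= the observed one."""
    rng = random.Random(seed)
    observed = set_statistic(expression, labels, gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if set_statistic(expression, shuffled, gene_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical expression values: 3 genes x 8 samples (4 per group)
expression = {
    "GENE1": [5.1, 4.9, 5.3, 5.0, 7.2, 7.0, 6.8, 7.1],
    "GENE2": [3.0, 3.2, 2.9, 3.1, 4.8, 5.1, 5.0, 4.9],
    "GENE3": [6.0, 6.1, 5.9, 6.2, 6.1, 6.0, 5.8, 6.1],
}
labels = [0, 0, 0, 0, 1, 1, 1, 1]
p = self_contained_p(expression, labels, ["GENE1", "GENE2", "GENE3"])
print(f"self-contained permutation p-value: {p:.3f}")
```

Genes outside the set never enter the calculation, which is the restriction noted above: the test cannot say whether the set stands out relative to the rest of the genome.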