Overview

  • Why enrichment analysis?
  • What is enrichment analysis?
  • Gene ontology and pathways
  • GENE ontology and pathways enrichment
  • GENOMIC REGIONS enrichment
  • Tools and references

Why enrichment analysis?

  • The human genome contains ~20,000-25,000 genes
  • Each gene has multiple functions
  • If 1,000 genes have changed in an experimental condition, it may be difficult to understand what they do

Birds of a feather flock together

  • Genes with similar expression patterns share similar functions
  • Similar (common) functions characterize a group of genes

Why enrichment analysis?

  • High-level understanding of the biology behind gene expression – Interpretation!
  • Translating changes of hundreds/thousands of differentially expressed genes into a few biological processes (reducing dimensionality)

What is enrichment analysis?

  • Enrichment analysis - summarizing common functions associated with a group of objects

What is enrichment analysis? – statistical definition

Enrichment analysis – detecting whether a group of objects has certain properties more (or less) frequently than expected by chance

Classification of genes

Gene set - a priori classification of genes into biologically relevant groups (sets)

  • Members of the same biochemical pathways
  • Genes annotated with the same molecular function
  • Transcripts expressed in the same cellular compartments
  • Co-regulated/co-expressed genes
  • Genes located on the same cytogenetic band

Annotation databases and ontologies

  • An annotation database annotates genes with functions or properties - sets of genes with shared functions
  • Structured prior knowledge about genes

Gene ontology

  • An ontology is a formal (hierarchical) representation of concepts and the relationships between them.

  • The objective of GO is to provide controlled vocabularies of terms for the description of gene products.

  • These terms are to be used as attributes of gene products, facilitating uniform queries across them.

Gene ontology hierarchy

  • Terms are related within a hierarchy using "is-a", "part-of" and other connectors
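
Because terms are linked to their parents, a gene annotated to a specific term is implicitly annotated to every ancestor of that term. A minimal sketch of walking such a hierarchy, assuming a toy ontology; the term names and the `ancestors` helper are hypothetical placeholders, not real GO identifiers or a real GO API:

```python
# Toy ontology: each term maps to a list of (relation, parent) pairs.
# Term names are hypothetical placeholders, not real GO identifiers.
PARENTS = {
    "term_leaf": [("is-a", "term_mid")],
    "term_mid": [("is-a", "term_root"), ("part-of", "term_other")],
    "term_other": [("is-a", "term_root")],
    "term_root": [],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


leaf_ancestors = ancestors("term_leaf", PARENTS)
print(sorted(leaf_ancestors))  # ['term_mid', 'term_other', 'term_root']
```

A term can have more than one parent, so the structure is a directed acyclic graph rather than a tree; the `seen` set keeps shared ancestors from being visited twice.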

Gene ontology structure

Gene ontology describes gene function along three aspects, each at multiple levels of detail.

  • Molecular Function - the tasks performed by individual gene products; examples are transcription factor and DNA helicase
  • Biological Process - broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
  • Cellular Component - subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex

Gene ontology database

Gene ontologies are not created equal

Gene ontologies for model organisms

MSigDB - Molecular Signatures Database

https://github.com/stephenturner/msigdf

  • H, hallmark gene sets are coherently expressed signatures derived by aggregating many MSigDB gene sets to represent well-defined biological states or processes.
  • C1, positional gene sets for each human chromosome and cytogenetic band.
  • C2, curated gene sets from online pathway databases, publications in PubMed, and knowledge of domain experts.
  • C3, motif gene sets based on conserved cis-regulatory motifs from a comparative analysis of the human, mouse, rat, and dog genomes.
  • C4, computational gene sets defined by mining large collections of cancer-oriented microarray data.
  • C5, GO gene sets consist of genes annotated by the same GO terms.
  • C6, oncogenic signatures defined directly from microarray gene expression data from cancer gene perturbations.
  • C7, immunologic signatures defined directly from microarray gene expression data from immunologic studies.

Pathways

  • An ordered series of molecular events that leads to the creation of a new molecular product, or a change in a cellular state or process.
  • Genes often participate in multiple pathways – think about genes having multiple functions

http://biochemical-pathways.com/#/map/1

KEGG pathway database

  • KEGG: Kyoto Encyclopedia of Genes and Genomes is a collection of biological information compiled from published material = curated database.
  • Includes information on genes, proteins, metabolic pathways, molecular interactions, and biochemical reactions associated with specific organisms
  • Provides a relationship (map) for how these components are organized in a cellular structure or reaction pathway.

http://www.genome.jp/kegg/

KEGG pathway diagram

Reactome

  • Curated human pathways encompassing metabolism, signaling, and other biological processes.
  • Every pathway is traceable to primary literature.

http://www.reactome.org/

Reactome pathway diagram

Other pathway databases

Genes to networks

Enrichment analysis

Null hypothesis

  • Self-contained \(H_0\): genes in the gene set do not have any association with the phenotype

  • Problem: restrictive; uses information only from the gene set

Enrichment analysis

Null hypothesis

  • Competitive \(H_0\): genes in the gene set have the same level of association with a given phenotype as genes in the complement gene set

  • Problem: assumes genes are sampled independently, which does not hold because genes are correlated

Approach 1

Overrepresentation analysis, Hypergeometric test

  • \(m\) is the total number of genes
  • \(j\) is the number of genes in the functional category
  • \(n\) is the number of differentially expressed genes
  • \(k\) is the number of differentially expressed genes in the category

If differential expression were unrelated to the category, the expected value of \(k\) would be \(k_e = (n/m) \cdot j\).

If \(k > k_e\), the functional category is said to be enriched, with an enrichment ratio \(r = k/k_e\).
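
A minimal sketch of the expected count and the enrichment ratio, using the symbols defined above. The function names and the numbers are hypothetical, chosen only for illustration:

```python
def expected_overlap(m, j, n):
    """Expected number of DE genes in the category: k_e = (n / m) * j."""
    return n * j / m


def enrichment_ratio(m, j, n, k):
    """Ratio of observed to expected overlap: r = k / k_e."""
    return k / expected_overlap(m, j, n)


# Hypothetical counts: 20,000 genes, 200 in the category,
# 1,000 differentially expressed, 30 of those in the category.
m, j, n, k = 20000, 200, 1000, 30
k_e = expected_overlap(m, j, n)
r = enrichment_ratio(m, j, n, k)
print(k_e, r)  # 10.0 3.0
```

A ratio above 1 only says the overlap is larger than expected; whether it is larger than chance would produce requires a test.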

The same counts arranged as a 2×2 contingency table:

                     Diff. exp. genes   Not diff. exp. genes   Total
  In gene set        k                  j-k                    j
  Not in gene set    n-k                m-n-j+k                m-j
  Total              n                  m-n                    m
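
The table above can be built directly from the four counts. A small sketch, assuming the same symbols \(m\), \(j\), \(n\), \(k\); the helper name `contingency_table` and the example numbers are hypothetical:

```python
def contingency_table(m, j, n, k):
    """Return the 2x2 table as [[in set & DE, in set & not DE],
                                [not in set & DE, not in set & not DE]]."""
    return [
        [k, j - k],
        [n - k, m - n - j + k],
    ]


# Hypothetical counts for illustration only.
table = contingency_table(m=20000, j=200, n=1000, k=30)
row_totals = [sum(row) for row in table]        # should be [j, m - j]
col_totals = [sum(col) for col in zip(*table)]  # should be [n, m - n]
print(table, row_totals, col_totals)
```

Checking the margins against \(j\), \(m-j\), \(n\), and \(m-n\) is a cheap way to catch a mislabeled cell before running a test on the table.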

What is the probability of having \(k\) or more genes from the category in the selected \(n\) genes?

\[P = \sum_{i=k}^{n}\frac{\binom{m-j}{n-i}\binom{j}{i}}{\binom{m}{n}}\]
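
The sum above can be evaluated exactly with integer binomial coefficients. A minimal sketch using only the Python standard library; the function name is hypothetical and the small counts are made up so the result can be checked by hand:

```python
from math import comb


def hypergeom_upper_tail(m, j, n, k):
    """P(X >= k): probability of k or more category genes among n selected.

    m: total genes, j: genes in the category, n: selected (DE) genes.
    Terms with i > j are zero, so the sum stops at min(n, j).
    """
    total = comb(m, n)
    numerator = sum(
        comb(m - j, n - i) * comb(j, i) for i in range(k, min(n, j) + 1)
    )
    return numerator / total


# Hand-checkable toy case: 10 genes, 4 in the category, 5 selected, 3 overlap.
# P = (C(6,2)*C(4,3) + C(6,1)*C(4,4)) / C(10,5) = 66 / 252
p = hypergeom_upper_tail(m=10, j=4, n=5, k=3)
print(round(p, 4))  # 0.2619
```

For \(k = 0\) the sum covers every possible overlap, so the function returns 1.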

If \(k < k_e\), the category is underrepresented. What is the probability of having \(k\) or fewer genes from the category in the selected \(n\) genes?

\[P = \sum_{i=0}^{k}\frac{\binom{m-j}{n-i}\binom{j}{i}}{\binom{m}{n}}\]

Approach 1

Overrepresentation analysis (ORA)

  1. Find a set of differentially expressed genes (DEGs)
  2. Are DEGs more common among genes in the set than among genes not in the set?
  • Fisher's exact test: stats::fisher.test()
  • Conditional hypergeometric test, to account for the directed hierarchy of GO: GOstats::hyperGTest()
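
The two ORA steps can be sketched end to end on gene lists. The slide names R functions; the sketch below is a plain-Python stand-in that computes the one-sided (enrichment) Fisher's exact p-value as a hypergeometric upper tail. The function name `ora_pvalue`, the gene IDs, and the gene sets are all hypothetical toy data:

```python
from math import comb


def ora_pvalue(universe, degs, gene_set):
    """Return (overlap, p-value) for enrichment of `gene_set` among `degs`.

    Only genes present in `universe` are counted, so the universe should be
    the list of genes that could have been called differentially expressed.
    """
    universe = set(universe)
    degs = set(degs) & universe
    gene_set = set(gene_set) & universe
    m, j, n = len(universe), len(gene_set), len(degs)
    k = len(degs & gene_set)
    tail = sum(comb(j, i) * comb(m - j, n - i) for i in range(k, min(n, j) + 1))
    return k, tail / comb(m, n)


# Toy data: 20 genes, the first 5 are "differentially expressed".
universe = [f"g{i}" for i in range(1, 21)]
degs = ["g1", "g2", "g3", "g4", "g5"]
gene_sets = {
    "set_A": {"g1", "g2", "g3", "g4"},     # all four members are DEGs
    "set_B": {"g5", "g10", "g15", "g20"},  # one member is a DEG
}

results = {name: ora_pvalue(universe, degs, s) for name, s in gene_sets.items()}
for name, (k, p) in results.items():
    print(name, k, round(p, 4))
```

The choice of universe matters: restricting it to genes that were actually measured changes \(m\) and \(j\) and therefore the p-value.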

 

Example: https://github.com/mdozmorov/MDmisc/blob/master/R/gene_enrichment.R

Approach 1

Problems with Fisher's exact test

  • The outcome of the overrepresentation test depends on the significance threshold used to declare genes differentially expressed.

  • Functional categories in which many genes exhibit small changes may go undetected.

  • Genes are not independent, so a key assumption of Fisher's exact test is violated.

Many GO enrichment tools

Approach 2

Functional Class Scoring (FCS)

  • Gene set analysis (GSA). Mootha et al., 2003; modified by Subramanian, et al. "Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles." PNAS 2005 http://www.pnas.org/content/102/43/15545.abstract

  • Main rationale – functionally related genes often display a coordinated expression to accomplish their roles in the cells

  • Aims to identify gene sets with "subtle but coordinated" expression changes that would be missed by threshold-based selection of DEGs

GSEA: Gene set enrichment analysis

  • The null hypothesis is that the rank ordering of the genes in a given comparison is random with regard to the case-control assignment.
  • The alternative hypothesis is that the rank ordering of genes sharing functional/pathway membership is associated with the case-control assignment.

GSEA: Gene set enrichment analysis

  1. Sort genes by log fold change
  2. Calculate a running sum: increment when a gene is in the set, decrement when it is not
  3. The maximum of the running sum is the enrichment score; a larger score means genes in the set are toward the top of the sorted list
  4. Permute subject labels to calculate the significance p-value

GSEA: Gene set enrichment analysis

  • Compute a statistic (difference between 2 clinical groups) for each gene that measures the degree of differential expression between treatments.
  • Create a list \(L\) of all genes ordered according to these statistics.
  • Given a set of genes \(S\) we can see if these genes are non-randomly distributed in our list \(L\)
  • If the experiment produced random results, we don’t expect gene order to have biological coherence

GSEA: Gene set enrichment analysis

  • Calculate an enrichment score (\(ES\)) that reflects the degree to which a set \(S\) is overrepresented at the extremes (top or bottom) of the entire ranked list \(L\).
  • The score is calculated by walking down the list \(L\) and …
    • Increase a running-sum statistic when we encounter a gene in \(S\)
    • Decrease it when we encounter genes not in \(S\).
  • The magnitude of the increment depends on the correlation of the gene with the phenotype.
  • The final enrichment score is the maximum deviation from zero encountered in the random walk
    • Corresponds to a weighted Kolmogorov–Smirnov-like statistic
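
The walk described above can be sketched as follows. This assumes the weighting commonly attributed to Subramanian et al. (2005): a hit adds \(|r_i|^p\) divided by the sum of \(|r|^p\) over the set, and a miss subtracts one over the number of genes outside the set. The function name, the exponent default, and the scores are hypothetical:

```python
def weighted_enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Signed maximum deviation from zero of the weighted running sum.

    ranked_genes: genes sorted by their score, highest first.
    scores: dict mapping gene -> correlation with the phenotype.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    norm = sum(abs(scores[g]) ** p for g, h in zip(ranked_genes, hits) if h)
    if n_miss == 0 or norm == 0:
        raise ValueError("need genes outside the set and nonzero set scores")

    running, es = 0.0, 0.0
    for gene, is_hit in zip(ranked_genes, hits):
        if is_hit:
            running += abs(scores[gene]) ** p / norm  # weighted increment
        else:
            running -= 1.0 / n_miss                   # uniform decrement
        if abs(running) > abs(es):
            es = running
    return es


# Toy scores, made up for illustration.
scores = {"g1": 3.0, "g2": 2.5, "g3": 1.0, "g4": 0.5,
          "g5": -0.5, "g6": -1.0, "g7": -2.0, "g8": -3.0}
ranked = sorted(scores, key=scores.get, reverse=True)

es_top = weighted_enrichment_score(ranked, scores, {"g1", "g2", "g3"})
es_bottom = weighted_enrichment_score(ranked, scores, {"g6", "g7", "g8"})
print(round(es_top, 3), round(es_bottom, 3))
```

A set concentrated at the top of the list gives a score near +1, and a set concentrated at the bottom gives a score near -1; with `p=0` every hit counts equally and the walk reduces to the unweighted version.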

GSEA: Gene set enrichment analysis

Enrichment Score

  • Consider genes \(R_1, ..., R_N\) ordered by the difference metric
  • Consider a gene set \(S\) of size \(G\), containing functionally similar genes or pathway members.
  • If \(R_i\) is not a member of \(S\), define

\[X_{Ri}=-\sqrt{\frac{G}{N-G}}\]

  • If \(R_i\) is a member of \(S\), define

\[X_{Ri}=\sqrt{\frac{N-G}{G}}\]

GSEA: Gene set enrichment analysis

Enrichment Score

  • Compute running sum across all \(N\) genes. The \(ES\) is defined as

\[\max_{1 \le j \le N} \sum_{i=1}^j{X_{Ri}}\]

  • or the maximum observed positive deviation of the running sum.
  • \(ES\) is measured for every gene set considered. To determine whether any of the given gene sets shows association with the class phenotype distinction, permute the class labels 1,000 times, each time recording the maximum \(ES\) over all gene sets.
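
A sketch of the enrichment score with the \(X_{Ri}\) steps defined above, followed by the label-permutation step. Several choices here are assumptions rather than anything stated on the slides: the per-gene statistic is a difference in group means, the p-value is the share of permutations whose maximum \(ES\) reaches the observed score (with an add-one correction), and all function names and expression values are hypothetical:

```python
import random
from math import sqrt


def group_difference(values, labels):
    """Difference in mean expression: group 1 minus group 0."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g0 = [v for v, lab in zip(values, labels) if lab == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)


def enrichment_score(ranked, gene_set):
    """Maximum over j of the running sum of X_{Ri} for i = 1..j."""
    n_total = len(ranked)                       # N
    n_set = sum(g in gene_set for g in ranked)  # G
    if not 0 < n_set < n_total:
        raise ValueError("gene set must be a proper, non-empty subset")
    hit = sqrt((n_total - n_set) / n_set)
    miss = -sqrt(n_set / (n_total - n_set))
    running, best = 0.0, float("-inf")
    for gene in ranked:
        running += hit if gene in gene_set else miss
        best = max(best, running)
    return best


def all_scores(expr, labels, gene_sets):
    """Rank genes by the group difference and score every gene set."""
    ranked = sorted(expr, key=lambda g: group_difference(expr[g], labels),
                    reverse=True)
    return {name: enrichment_score(ranked, s) for name, s in gene_sets.items()}


def permutation_test(expr, labels, gene_sets, n_perm=1000, seed=0):
    """Compare observed ES to the null distribution of the maximum ES."""
    rng = random.Random(seed)
    observed = all_scores(expr, labels, gene_sets)
    shuffled = list(labels)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute class labels, keep group sizes
        null_max.append(max(all_scores(expr, shuffled, gene_sets).values()))
    pvals = {name: (1 + sum(m >= es for m in null_max)) / (n_perm + 1)
             for name, es in observed.items()}
    return observed, pvals


# Toy expression data: 6 genes x 6 samples, 3 cases (1) and 3 controls (0).
labels = [1, 1, 1, 0, 0, 0]
expr = {
    "a1": [5, 6, 5, 1, 1, 2], "a2": [4, 5, 6, 2, 1, 1],
    "b1": [1, 2, 1, 2, 1, 2], "b2": [3, 3, 2, 3, 2, 3],
    "c1": [2, 1, 3, 1, 2, 2], "c2": [1, 1, 2, 4, 5, 4],
}
gene_sets = {"up": {"a1", "a2"}, "mixed": {"b1", "c1"}}

observed, pvals = permutation_test(expr, labels, gene_sets, n_perm=200)
print(observed, pvals)
```

Taking the maximum over all gene sets in each permutation builds a single null distribution that accounts for testing many sets at once. With only six samples there are few distinct labelings, so p-values from this toy example are coarse.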

Other approaches

Linear model-based

  • CAMERA (Wu and Smyth 2012)
  • Correlation-Adjusted MEan RAnk gene set test
  • Estimating the variance inflation factor associated with inter-gene correlation, and incorporating this into parametric or rank-based test procedures
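
A rough illustration of why inter-gene correlation matters, assuming the commonly cited form of the variance inflation factor, \(1 + (G - 1)\bar{\rho}\), where \(G\) is the gene set size and \(\bar{\rho}\) is the mean pairwise correlation within the set. The function name and the numbers are hypothetical, and this is not the CAMERA implementation itself:

```python
def variance_inflation_factor(set_size, mean_correlation):
    """VIF for the variance of a set-level mean statistic.

    Assumes the form 1 + (G - 1) * mean pairwise correlation.
    """
    return 1.0 + (set_size - 1) * mean_correlation


# Hypothetical example: 50 genes with a mean pairwise correlation of 0.05.
vif = variance_inflation_factor(50, 0.05)
print(round(vif, 2))  # 3.45
```

Even a small average correlation inflates the variance several-fold for a moderately sized set, which is why tests that assume independent genes can be badly anti-conservative.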

Other approaches

Linear model-based

  • ROAST (Wu et al. 2010)
  • Under the null hypothesis (and assuming a linear model) the residuals are independent and identically distributed \(N(0,\sigma_g^2)\).
  • We can rotate the residual vector for each gene in a gene set, such that gene-gene expression correlations are preserved.

Other approaches

Impact analysis - incorporates topology of the pathway.

  • Gene's fold change
  • Classical enrichment statistics
  • The topology of the signaling pathway

Other approaches

Gene enrichment vs. genome enrichment

  • Gene set enrichment analysis - summarizing many genes of interest, such as differentially expressed genes, with a few common gene annotations (molecular functions, canonical pathways)

 

  • Epigenomic enrichment analysis - summarizing many genomic regions of interest, such as disease-associated genomic variants, with a few common genome annotations (chromatin states, transcription factor binding sites)

Genomic regions

  • Gene/exon boundaries, promoters
  • Single Nucleotide Polymorphisms (SNPs)
  • Transcription Factor Binding Sites (TFBS)
  • Differentially methylated regions
  • CpG islands

Each genomic region has coordinates (unique IDs):

Chromosome, Start, End
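
A minimal sketch of representing regions by their coordinates and testing two regions for overlap. It assumes 0-based, half-open intervals (start included, end excluded), as in BED files; the `Region` type, the `overlaps` helper, and the coordinates are hypothetical:

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive (assumed convention)


def overlaps(a, b):
    """True if two regions on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


# Made-up coordinates for illustration.
snp = Region("chr1", 150, 151)
enhancer = Region("chr1", 100, 200)
neighbor = Region("chr1", 200, 300)

print(overlaps(snp, enhancer))       # True
print(overlaps(enhancer, neighbor))  # False: adjacent, no shared base
```

Under the half-open convention, adjacent regions do not overlap; with 1-based closed coordinates the comparison would use `<=` instead.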

Annotations of genomic regions

  • Epigenomic (regulatory) regions - genomic regions annotated as carrying functional and/or regulatory potential

  • DNaseI hypersensitive sites
  • Histone modification marks
  • Transcription Factor Binding Sites
  • DNA methylation
  • Enhancers

Genome annotation consortia

Why "genomic region enrichment analysis"?

Enrichment = functional impact

  • Hypothesis: SNPs in epigenomic regions may disrupt regulation
  • More significant enrichment = more SNPs in epigenomic regions = more regulation is disrupted (SNP burden)

 

Statistics of epigenomic enrichments

 

  • 6 out of 7 disease-associated SNPs overlap with epigenomic marks
  • How likely is this to be observed by chance? (Chi-square test/Binomial test/Permutation test)
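
The binomial option can be sketched for the 6-out-of-7 example. The background probability that a random SNP overlaps an epigenomic mark is not given on the slide; the value 0.3 below is a made-up assumption, as is the function name:

```python
from math import comb


def binomial_upper_tail(successes, trials, p0):
    """P(X >= successes) for X ~ Binomial(trials, p0)."""
    return sum(
        comb(trials, i) * p0 ** i * (1 - p0) ** (trials - i)
        for i in range(successes, trials + 1)
    )


# 6 of 7 SNPs overlap a mark; assume a random SNP overlaps with probability 0.3.
p_value = binomial_upper_tail(successes=6, trials=7, p0=0.3)
print(round(p_value, 5))  # 0.00379
```

The result depends heavily on the background probability, so in practice it has to be estimated from an appropriate set of background SNPs or regions.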

Gene set enrichment analysis

Web

Gene set enrichment analysis

DIY

Gene annotation databases

Genomic regions enrichment analysis

Learn more

 
