Functional Enrichment Analysis
Author: Ashley Schwartz
Date: Originally Developed September 2023, Last Revised March 2024
Purpose and Background
This tutorial goes over how to retrieve the genes in a particular pathway from the Kyoto Encyclopedia of Genes and Genomes (KEGG), retrieve the genes in a particular function from Gene Ontology (GO), and perform gene enrichment analysis on these gene sets using a variety of different methods.
Databases Supported
Database |
Abbreviation |
Description |
Link |
---|---|---|---|
Kyoto Encyclopedia of Genes and Genomes Pathway Database |
KEGG Pathway |
Molecular pathways and interactions |
|
Kyoto Encyclopedia of Genes and Genomes Disease Database |
KEGG Disease |
Genes and pathways associated with diseases |
|
Gene Ontology Biological Process Database |
GO BP |
Molecular events in biological processes |
|
Gene Ontology Cellular Component Database |
GO CC |
Cellular structures and locations |
|
Gene Ontology Molecular Function Database |
GO MF |
Specific activities or functions of gene products |
Database Naming Conventions
Each database follows naming conventions for their pathways and functions. For a quick overview of how each database names a particular pathway or function, we provide some examples here:
Database |
ID Format |
Example ID |
Pathway/Function Name |
Notes |
---|---|---|---|---|
KEGG Pathway |
Numeric with Prefix |
dre00120 (Zebrafish), hsa00120 (Human) |
Primary bile acid biosynthesis |
KEGG pathway IDs are species-specific. |
KEGG Disease |
Alphanumeric |
H00001 (Human) |
B-cell acute lymphoblastic leukemia |
KEGG disease IDs are identified for human and are not defined for zebrafish. |
GO BP |
Alphanumeric |
GO:0007165 |
Signal transduction |
GO BP IDs are unique to each species. |
GO CC |
Alphanumeric |
GO:0005634 |
Nucleus |
GO CC IDs are unique to each species. |
GO MF |
Alphanumeric |
GO:0003824 |
Catalytic activity |
GO MF IDs are unique to each species. |
Enrichment Methods Supported
Enrichment Method |
Description |
---|---|
Fisher’s Exact Test |
A statistical method for assessing the significance of gene enrichment within predefined functional categories or pathways using categorical gene set data. |
Logistic Regression |
A modeling technique for predicting gene enrichment in specific functional categories, offering flexibility to handle diverse data types within gene sets. This method does not depend on a p-value cutoff for gene significance. |
Organisms Supported
danRerLib
is built for zebrafish and supports three organism types:
The table you’ve provided is clear and informative, but there appears to be a minor typographical error in the descriptions for “zebrafish” and “mapped zebrafish.” Here’s a corrected version:
Organisms Supported
danRerLib
is built for zebrafish and supports three organism types:
Organism |
Abbreviation |
Description |
---|---|---|
Zebrafish |
‘dre’ |
The zebrafish taxonomy |
Human |
‘hsa’ |
The human taxonomy |
Mapped Zebrafish |
‘dreM’ |
An organism defined through orthology |
Requirements
In this tutorial we will be utilizing:
the required python package
see install notes if not currently installed.
# IMPORT PYTHON PACKAGE
# ---------------------
from danrerlib import KEGG, utils, GO
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
Download Gene Sets
If you are interested in identifying the genes within a gene set, you can download genes for a given KEGG pathway id, KEGG disease id, or GO id (BP, CC, or MF). A key benefit and advantage to danRerLib
is we can also generate the gene list for a pathway that might not exist for zebrafish, but does exist for human. We can then map the genes to zebrafish genes if desired.
KEGG Pathway
Purpose: Given a KEGG pathway id, retrieve a list of all genes in said pathway.
As a reminder, danRerLib
supports three organism types (see background for more information): hsa, dre, and dreM. Of course, you are likely most interested in the zebrafish genes, but human genes are also provided for comparisons. Lets look at the KEGG pathway 00400
which is defined as phenylalanine, tyrosine and tryptophan biosynthesis. This pathway exists for human hsa00400
and zebrafish dre00400
. Here are a few examples of gathering the genes in this pathway for different organisms:
kegg_id = '00400'
human_genes = KEGG.get_genes_in_pathway(kegg_id, 'hsa')
human_genes
Human NCBI Gene ID | |
---|---|
0 | 137362 |
1 | 259307 |
2 | 2805 |
3 | 2806 |
4 | 5053 |
5 | 6898 |
dre_genes = KEGG.get_genes_in_pathway(kegg_id, 'dre')
dre_genes
NCBI Gene ID | |
---|---|
0 | 335974 |
1 | 337166 |
2 | 378962 |
3 | 406330 |
4 | 406688 |
5 | 561410 |
6 | 791730 |
mapped_genes = KEGG.get_genes_in_pathway(kegg_id, 'dreM')
mapped_genes
NCBI Gene ID | |
---|---|
0 | 561410 |
1 | 335974 |
2 | 337166 |
3 | 378962 |
4 | 406330 |
5 | 406688 |
6 | 791730 |
It is also possible to include the organism identifier in the KEGG id sent to the function and omit the organism identifier. As an example of this, see below where we only provide the full KEGG id desired.
KEGG.get_genes_in_pathway('hsa00400')
Human NCBI Gene ID | |
---|---|
0 | 137362 |
1 | 259307 |
2 | 2805 |
3 | 2806 |
4 | 5053 |
5 | 6898 |
KEGG Disease
Purpose: Given a KEGG disease id, retrieve a list of all genes in said disease.
We can also get the list of genes for a particular disease listed in the KEGG database. This function takes the disease id and the organism. Not that the zebrafish organism in this case comes from mapped values as the disease gene lists are not annotated for zebrafish within the KEGG database.
# human organism
KEGG.get_genes_in_disease('H00001', 'hsa')
Human NCBI Gene ID | |
---|---|
0 | 25 |
1 | 4297 |
2 | 4299 |
3 | 6929 |
4 | 5087 |
5 | 861 |
6 | 4609 |
7 | 64109 |
8 | 5079 |
# zebrafish organism
KEGG.get_genes_in_disease('H00001', 'dre')
NCBI Gene ID | |
---|---|
0 | 100000720 |
1 | 557048 |
2 | 100537394 |
3 | 664768 |
4 | 30310 |
5 | 58138 |
6 | 570960 |
7 | 58126 |
8 | 393141 |
9 | 30686 |
10 | 100294510 |
11 | 60636 |
It is expected, as described in the mapping tutorial, that the number of genes in the human gene set may not always be equal to the number of genes in the zebrafish set. This is because there is not always a 1:1 mapping in orthology between human ans zebrafish taxonomy.
Gene Ontology
Purpose: Given a Go concept id, retrieve a list of all genes in said GO concept.
danRerLib
supports Gene Ontology Biological Processes (BP), Cellular Components (CC), and Molecular Functions (MF). There is one primary function for the GO module that will retrieve the genes in any given concept, regardless if it comes from BP, CC, or MF. This can be done for either organism (hsa, dre, or dreM). Retrieval of the GO id GO:0033554
, cellular response to stress, is shown below for the zebrafish.
# zebrafish organism
GO.get_genes_in_GO_concept('GO:0033554', 'dre')
ZFIN ID | |
---|---|
0 | ZDB-GENE-030826-18 |
1 | ZDB-GENE-050320-35 |
2 | ZDB-GENE-040718-255 |
3 | ZDB-GENE-040825-4 |
4 | ZDB-GENE-091204-246 |
5 | ZDB-GENE-030827-3 |
6 | ZDB-GENE-030131-8638 |
7 | ZDB-GENE-030326-2 |
8 | ZDB-GENE-030131-9531 |
9 | ZDB-GENE-030131-1096 |
10 | ZDB-GENE-030131-6701 |
11 | A0A2R8S050 |
Conduct Functional Enrichment Analysis
Functional enrichment analysis is a powerful method for uncovering biological insights from a set of genes. Whether you have a specific gene set of interest or an entire dataset, you are able to perform functional enrichment analysis on pathways from KEGG and Gene Ontology. In this section, we’ll explore different scenarios, from analyzing a small gene set to conducting enrichment analysis on a larger scale using the danRerLib package.
Enrichment Module Overview
Import Enrichment Module
from danrerlib.enrichment import enrich_fishers, enrich_logistic
# can also import as:
# from danrerlib import enrichment
# import a module to help with plotting
from danrerlib import enrichplots
# just a notebook display option
from IPython.display import Markdown
Recommended Data Format: The functional enrichment package is designed to take in your differential expression results and output the pathways significantly altered. Therefore, your data should include every gene detected from your study, the associated p-value that defines significance of differential expression (Recommended to be an adjusted p-value), and the log2FC (or another metric that defines up/down regulation of the gene). This minimizes the work from the user with regard to data filtering for directional tests and background gene sets. The required data format is:
Gene ID |
p-value |
log2FC |
---|---|---|
any supported gene id type: NCBI Gene ID, ZFIN ID, Ensembl ID, Symbol |
0 < p-value < 1 |
positive or negative |
The data is required to have the gene id in the first column, the pvalue in the second, and the log2FC in the third. The column names can be defined by the user._
The data I will be using in this tutorial is a publicly available differential expression dataset from the Gene Expression Omnibus (GSE:…). The study investigated the effects of the chemical TPP on the developing zebrafish and obtained whole-embryo RNA-sequencing.
Required Parameters
Parameter Name |
Parameter Type |
Parameter Description |
---|---|---|
|
|
A DataFrame containing gene information, including gene IDs, p-values, and log2FC. |
|
|
The functional annotation database(s). Options include:’KEGG Pathway’, ‘KEGG Disease’, ‘GO BP’, ‘GO CC’, ‘GO MF’, ‘GO’ (all GO databases), ‘all’ ( all databases). Can also input a list of databases, as an example: [‘KEGG Pathway’, KEGG Disease’] |
|
|
The type of gene ID in the gene universe. The recommended gene id type is NCBI Gene ID (NCBI_ID). Must be one of: NCBI Gene ID, ZFIN ID, Ensembl ID, Symbol, or for human: Human NCBI Gene ID. |
Advanced, Optional Parameters
Parameter Name |
Parameter Type |
Parameter Description |
---|---|---|
|
|
The organism code (‘dre’ for zebrafish, ‘dreM’ for mapped zebrafish, ‘hsa’ for human, and ‘vairbale’ for the hybrid approach). Choose |
|
|
The direction of statistical test for enrichment. Unique parameter for enrichment testing options. See individual explanations below for more information. |
|
|
The significance cutoff for gene inclusion based on p-values. Default is 0.05. |
|
|
The log2 fold change cutoff value for gene inclusion. Default is 0. |
|
|
A list of concept IDs (e.g., pathway IDs or disease IDs) to analyze. Include this if you are interested in testing a specific amount of concept IDs. Default is None. |
|
|
A DataFrame representing a background gene set. Include if your gene universe does not contain all genes detected. Default is None. |
|
|
The significance cutoff for concept IDs based on p-values that will be returned to you. Default is 0.05. |
|
|
Whether to order the results by p-value. Default is True. |
|
|
The minimum number of genes in a concept for it to be considered. Default is 10. |
|
|
Include all results without filtering based on significance. Default is False. |
|
|
The base organism code for orthology comparisons. Default is ‘dre’. |
Functional Enrichment with Fisher’s Exact Test
Purpose: Conduct functional enrichment analysis using Fisher’s Exact Test to identify overrepresented pathways associated with a set of genes.
This method is ideal for scenarios where you have a gene set of interest and want to explore the significant enrichment of pathways from databases like KEGG and Gene Ontology. The Fisher’s Exact Test helps reveal statistically significant relationships between the gene set and specific pathways.
Overview of Fisher’s Exact Test for Overrepresentation
Contingency Table:
Consider a scenario where we are assessing the enrichment of a specific KEGG Pathway (let’s call it “Pathway A”) in a set of differentially expressed genes. The contingency table could look like this:
In this table:
\(a\) is the number of genes both in the pathway and significantly expressed.
\(b\) is the number of genes not in the pathway but significantly expressed.
\(c\) is the number of genes in the pathway but not significantly expressed.
\(d\) is the number of genes neither in the pathway nor significantly expressed.
Statistical Formulas:
Odds Ratio:
\[\text{Odds Ratio} = \frac{ad}{bc}\]Fisher’s Exact Test:
\[P(X \leq a) = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{N}{a+c}}\]where \(N = a+b+c+d\) is the total number of genes.
P-value Calculation:
\[P(\text{Observed or More Extreme}) = \sum_{I=0}^{a} \frac{\binom{a+b}{i} \binom{c+d}{a-i}}{\binom{N}{a+c}}\]A low p-value suggests that the observed distribution is unlikely to occur by random chance, indicating potential enrichment or depletion.
Primary Function and Unique Parameters
The primary function is enrich_fishers
with all the required and advanced parameters discussed above in addition to one unique parameter, direction
.
Parameter Name |
Parameter Type |
Parameter Description |
---|---|---|
|
|
The direction of statistical test for enrichment (‘up’, ‘down’, or ‘non-directional’). If you choose |
Conduct Fishers Exact Test for Enrichment Analysis (enriched/depleted)
Step 1: Loading Data. Data should be loaded into a Pandas Dataframe with the structure as describes above. This is defined as your gene_universe_tpp
as is contains all genes detected in your study.
gene_universe_tpp = pd.read_csv('data/test_data/TPP_DE.txt', sep = '\t')
To get a sneak peak about what the data looks like, we can look at the first 5 elements:
gene_universe_tpp.head(5)
NCBI Gene ID | PValue | logFC | |
---|---|---|---|
0 | 100000006 | 0.792615 | 0.115009 |
1 | 100000009 | 0.607285 | -0.144714 |
2 | 100000026 | 0.021338 | 0.603871 |
3 | 100000030 | 0.007880 | -2.083141 |
4 | 100000044 | 0.015286 | 0.803879 |
Step 2: Define necessary parameters and choices. To run the enrichment function, there are some key parameters you have the option of using (see above).
# the gene id type you have in your gene universe.
# In this case, we have NCBI Gene IDs.
# This is the name of the column that holds your gene ids.
ncbi_id = 'NCBI Gene ID'
# the database you wish to test enrichment for
database_choice = 'KEGG Pathway'
Step 3: Run enrichment function.
result = enrich_fishers(gene_universe = gene_universe_tpp,
database = database_choice,
gene_id_type = ncbi_id)
Step 4 Visualize results. In this case, we assessed whether or not a KEGG Pathway was enriched or depleted. This means was there significant up and/or down regulation of a pathway.
# count the number of significantly altered pathways
num_altered = len(result)
num_enriched = len(result[result['Direction'] == 'enriched'])
num_depleted = len(result[result['Direction'] == 'depleted'])
print(f'The total number of altered pathways is {num_altered}, the number enriched is {num_enriched}, and the number depleted is {num_depleted}.')
The total number of altered pathways is 13, the number enriched is 9, and the number depleted is 4.
%%capture
# visualize the top 5 most significantly altered KEGG pathways
# Markdown(result.head(5).to_markdown())
result.head(5)
Concept Type |
Concept ID |
Concept Name |
# Genes in Concept in Universe |
# Sig Genes Belong to Concept |
Proportion of Sig Genes in Set |
Odds Ratio |
P-value |
Direction |
Bonferroni |
BH FDR |
|
---|---|---|---|---|---|---|---|---|---|---|---|
0 |
KEGG Pathway |
dre00190 |
Oxidative phosphorylation |
128 |
10 |
0.078125 |
0.326766 |
0.000137991 |
depleted |
0.0227685 |
0.0149447 |
1 |
KEGG Pathway |
dre04080 |
Neuroactive ligand-receptor interaction |
417 |
56 |
0.134293 |
0.590132 |
0.000197834 |
depleted |
0.0326427 |
0.0149447 |
2 |
KEGG Pathway |
dre00030 |
Pentose phosphate pathway |
30 |
15 |
0.5 |
3.94897 |
0.000271721 |
enriched |
0.044834 |
0.0149447 |
3 |
KEGG Pathway |
dre04512 |
ECM-receptor interaction |
101 |
35 |
0.346535 |
2.10563 |
0.000684854 |
enriched |
0.113001 |
0.0282502 |
4 |
KEGG Pathway |
dre04142 |
Lysosome |
160 |
19 |
0.11875 |
0.520786 |
0.00532119 |
depleted |
0.877996 |
0.175599 |
# create a bar chart to visualize the enriched results
enriched_results = result[result['Direction'] == 'enriched']
enrichplots.bar_chart(enriched_results,
title = 'Top Enriched KEGG Pathways',
xlab = 'KEGG Pathway',
max_num_pathways=9)
Step 5 Interpretation. Interpret the results based on the direction of enrichment. Consider the biological context of the pathways and their relevance to your study.
For example, the Pentose phosphate pathway is significantly enriched in the set of genes that are differentially expressed in zebrafish exposed to TPP. The odds ratio of 3.94 and a very low p-value (0.0002) suggest a strong association between the pathway and the significant genes. The term “enriched” indicates that there is an overrepresentation of significant genes in this pathway compared to what would be expected by chance.
Testing for Upregulated Pathways
Contingency Table for Upregulated Genes:
Assume we are interested in assessing whether a KEGG Pathway (let’s call it “Pathway A”) is enriched with upregulated genes. The contingency table for this scenario would be:
Here:
\(a\) is the number of genes both in the pathway and upregulated.
\(b\) is the number of genes not in the pathway but upregulated.
\(c\) is the number of genes in the pathway but not upregulated.
\(d\) is the number of genes neither in the pathway nor upregulated.
Step 1: Load the data. We will use the same dataset as before.
gene_universe_tpp = pd.read_csv('data/test_data/TPP_DE.txt', sep = '\t')
Step 2: Define necessary parameters and choices. The key difference here is the test direction.
# same as before
ncbi_id = 'NCBI Gene ID'
database_choice = 'KEGG Pathway'
# define the directional test
test_direction = 'up'
Step 3 Run enrichment function.
upregulated_results = enrich_fishers(gene_universe = gene_universe_tpp,
gene_id_type = ncbi_id,
database = database_choice,
direction = test_direction
)
Step 4 Visualize results. In this case, we assessed whether or not a KEGG Pathway was upregulated.
# count the number of significantly altered pathways
num_upregulated = len(upregulated_results)
print(f'The total number of up-regulated pathways is {num_upregulated}.')
The total number of up-regulated pathways is 13.
%%capture
# visualize the top 5 most significantly altered KEGG pathways
# Markdown(result.head(5).to_markdown())
result.head(5)
Concept Type |
Concept ID |
Concept Name |
# Genes in Concept in Universe |
# Sig Genes Belong to Concept |
Proportion of Sig Genes in Set |
Odds Ratio |
P-value |
Direction |
Bonferroni |
BH FDR |
|
---|---|---|---|---|---|---|---|---|---|---|---|
0 |
KEGG Pathway |
dre00190 |
Oxidative phosphorylation |
128 |
10 |
0.078125 |
0.326766 |
0.000137991 |
depleted |
0.0227685 |
0.0149447 |
1 |
KEGG Pathway |
dre04080 |
Neuroactive ligand-receptor interaction |
417 |
56 |
0.134293 |
0.590132 |
0.000197834 |
depleted |
0.0326427 |
0.0149447 |
2 |
KEGG Pathway |
dre00030 |
Pentose phosphate pathway |
30 |
15 |
0.5 |
3.94897 |
0.000271721 |
enriched |
0.044834 |
0.0149447 |
3 |
KEGG Pathway |
dre04512 |
ECM-receptor interaction |
101 |
35 |
0.346535 |
2.10563 |
0.000684854 |
enriched |
0.113001 |
0.0282502 |
4 |
KEGG Pathway |
dre04142 |
Lysosome |
160 |
19 |
0.11875 |
0.520786 |
0.00532119 |
depleted |
0.877996 |
0.175599 |
# create a bar chart to visualize the enriched results
enrichplots.bar_chart(upregulated_results,
title = 'Upregulated KEGG Pathways',
xlab = 'KEGG Pathways',
max_num_pathways=13)
Testing for Downregulated Pathways
Contingency Table for Downregulated Genes:
Assume we are interested in assessing whether a KEGG Pathway (let’s call it “Pathway A”) is enriched with downregulated genes. The contingency table for this scenario would be:
Here:
\(a\) is the number of genes both in the pathway and downregulated.
\(b\) is the number of genes not in the pathway but downregulated.
\(c\) is the number of genes in the pathway but not downregulated.
\(d\) is the number of genes neither in the pathway nor downregulated.
Steps 1-6: You can repeat the steps above to test for down-regulated genes.
downregulated_results = enrich_fishers(gene_universe = gene_universe_tpp,
gene_id_type = ncbi_id,
database = database_choice,
direction = 'down'
)
# count the number of significantly altered pathways
num_downregulated = len(downregulated_results)
print(f'The total number of downregulated pathways is {num_downregulated}.')
The total number of downregulated pathways is 19.
enrichplots.bar_chart(downregulated_results,
title = 'Top 15 Down-Regulated KEGG Pathways',
xlab = 'KEGG Pathways',
max_num_pathways=15)
Advanced Options
The Fisher’s Exact Method, as described in the overview, is greatly dependent on the significance threshold. The definition of a gene that is condisered to be significant can depend on a variety of different factors related to your study. A common significane threshold for determining if a gene is considered to be significantly differentially expressed is a combination of both the adjusted p-value (0.05) and the log2FC (|log2FC|>1). If you would like to incorpoorate this into the analysis, you can!
enrichment_results = enrich_fishers(gene_universe_tpp,
database='KEGG Pathway',
gene_id_type='NCBI Gene ID',
sig_gene_cutoff_pvalue=0.05,
log2FC_cutoff_value=1
)
# count the number of significantly altered pathways
num_altered = len(enrichment_results)
print(f'The total number of altered pathways is {num_altered}.')
The total number of altered pathways is 11.
As you can see, we have a fewer significant pathways (11 here vs 13 above) as the significance threshold is stricter.
Note: there are a lot more GO Biological Processes then there are KEGG Pathways, therefore, testing for those will take a bit longer.
Functional Enrichment with Logistic Regression
Purpose: Conduct functional enrichment analysis using logistic regression to identify overrepresented pathways associated with a set of genes.
This method is ideal for scenarios where you have a gene set of interest and want to explore the significant enrichment of pathways from databases like KEGG and Gene Ontology. Logistic Regression helps reveal statistically significant relationships between the gene set and specific pathways. The logistic regression method is known to have better statistical power than fishers exact test (Sartor et. al., 2009.)
Overview of Logistic Regression for Enrichment Analysis
Logistic Regression Model:
The logistic regression method for gene set enrichment analysis involves modeling the log-odds of a gene belonging to a specific category (gene set) as a linear function of the statistical significance (measured by -log(P-values)).
Statistical Formulas:
Here’s a breakdown of the key components mentioned in the explanation:
Logistic Regression Model:
\[\log\left(\frac{p}{1-p}\right) = \alpha + \beta \cdot x\]where:
\(\log\) is the base 10 logarithm.
\(p\) is the probability of a gene belonging to the specific category.
\(\alpha\) is the intercept.
\(\beta\) is the slope parameter.
\(x\) is the -log(P-value) used as the explanatory variable.
Odds Ratio: The odds ratio indicates the change in the odds of a gene belonging to the specific category for a unit increase in the -log(P-value) or a 10-fold decrease in P-value.
\[ \text{Odds Ratio} = \exp(\beta) \]Wald Test:
The Wald test is used to assess the evidence in the data that \( \beta \) is different from zero.
The test statistic \( W \) follows a \(\chi^2\)-distribution with one degree of freedom.
The P-value is calculated based on this null distribution.
Estimation and Testing:
Maximum likelihood estimates \( \hat{\beta} \) and \( \hat{\alpha} \) are obtained using the iteratively weighted least squares (IWLS) algorithm.
Testing involves assessing whether \( \beta \) is significantly different from zero.
Ranking:
Gene sets are ranked based on P-values or odds ratios.
This method aims to identify enriched or depleted gene sets based on the statistical significance of their association with differential expression. The odds ratio and P-value provide insights into the strength and significance of this association.
Primary Function and Unique Parameters
The primary function is enrich_logistic
with all the required and advanced parameters discussed above in addition to one unique parameter, directional_test
.
Parameter Name |
Parameter Type |
Parameter Description |
---|---|---|
|
|
Signifying if you would like to preform a directional test ( |
Logistic Regression for the Identification of Up and Down Regulated Pathways
This is the recommended method for analyzing your gene expression data.
Step 1: Loading Data. Data should be loaded into a Pandas Dataframe with the structure as describes above. This is defined as your gene_universe_tpp
as is contains all genes detected in your study.
gene_universe_tpp = pd.read_csv('data/test_data/TPP_DE.txt', sep = '\t')
To get a sneak peak about what the data looks like, we can look at the first 5 elements:
gene_universe_tpp.head(5)
NCBI Gene ID | PValue | logFC | |
---|---|---|---|
0 | 100000006 | 0.792615 | 0.115009 |
1 | 100000009 | 0.607285 | -0.144714 |
2 | 100000026 | 0.021338 | 0.603871 |
3 | 100000030 | 0.007880 | -2.083141 |
4 | 100000044 | 0.015286 | 0.803879 |
Step 2: Define necessary parameters and choices. To run the enrichment function, there are some key parameters you have the option of using (see above).
# the gene id type you have in your gene universe.
# this is the name of the column holding your gene ids
ncbi_id = 'NCBI Gene ID'
# the database you wish to test enrichment for
database_choice = 'KEGG Pathway'
Step 3: Run enrichment function.
result = enrich_logistic(gene_universe = gene_universe_tpp,
database = database_choice,
gene_id_type = ncbi_id
)
Step 4: Visualize results. In this case, we assessed whether or not a KEGG Pathway was up or down regulated. This means was there significant up and/or down regulation of a pathway.
# count the number of significantly altered pathways
num_altered = len(result)
num_upregulated = len(result[result['Direction'] == 'upregulated'])
num_downregulated = len(result[result['Direction'] == 'downregulated'])
print(f'The total number of altered pathways is {num_altered}, the number upregulated is {num_upregulated}, and the number downregulated is {num_downregulated}.')
The total number of altered pathways is 47, the number upregulated is 20, and the number downregulated is 27.
%%capture
# visualize the top 5 most significantly altered KEGG pathways
# Markdown(result.head(5).to_markdown())
result.head(5)
Concept Type |
Concept ID |
Concept Name |
# Genes in Concept in Universe |
# Sig Genes Belong to Concept |
Proportion of Sig Genes in Set |
Odds Ratio |
P-value |
Direction |
Bonferroni |
BH FDR |
|
---|---|---|---|---|---|---|---|---|---|---|---|
0 |
KEGG Pathway |
dre03013 |
Nucleocytoplasmic transport |
100 |
26 |
0.26 |
0.734754 |
1.96187e-07 |
downregulated |
3.23709e-05 |
2.12791e-05 |
1 |
KEGG Pathway |
dre01200 |
Carbon metabolism |
124 |
30 |
0.241935 |
0.752328 |
2.57929e-07 |
downregulated |
4.25582e-05 |
2.12791e-05 |
2 |
KEGG Pathway |
dre00970 |
Aminoacyl-tRNA biosynthesis |
51 |
12 |
0.235294 |
0.724611 |
4.59772e-05 |
downregulated |
0.00758624 |
0.00252875 |
3 |
KEGG Pathway |
dre04010 |
MAPK signaling pathway |
373 |
61 |
0.163539 |
1.1604 |
6.46906e-05 |
upregulated |
0.010674 |
0.00256423 |
4 |
KEGG Pathway |
dre04060 |
Cytokine-cytokine receptor interaction |
153 |
20 |
0.130719 |
1.2292 |
7.7704e-05 |
upregulated |
0.0128212 |
0.00256423 |
# create a bar chart to visualize the enriched results
upregulated_results = result[result['Direction'] == 'upregulated']
enrichplots.bar_chart(upregulated_results,
title = 'Top 10 Upregulated KEGG Pathways',
xlab = 'KEGG Pathway',
max_num_pathways=10)
# create a bar chart to visualize the enriched results
downregulated_results = result[result['Direction'] == 'downregulated']
enrichplots.bar_chart(downregulated_results,
title = 'Top 10 Downregulated KEGG Pathways',
xlab = 'KEGG Pathway',
max_num_pathways=10)
Conduct Logistic Regression for Enrichment Analysis (enriched/depleted)
Step 1: Loading Data. Data should be loaded into a Pandas Dataframe with the structure as describes above. This is defined as your gene_universe_tpp
as is contains all genes detected in your study.
gene_universe_tpp = pd.read_csv('data/test_data/TPP_DE.txt', sep = '\t')
Step 2: Define necessary parameters and choices. To run the enrichment function, there are some key parameters you have the option of using (see above).
# the gene id type you have in your gene universe.
# In this case, we have NCBI Gene IDs.
ncbi_id = 'NCBI Gene ID'
# the database you wish to test enrichment for
database_choice = 'KEGG Pathway'
Step 3: Run enrichment function.
result = enrich_logistic(gene_universe = gene_universe_tpp,
database = database_choice,
gene_id_type = ncbi_id,
directional_test = False
)
Step 4: Visualize results. In this case, we assessed whether or not a KEGG Pathway was enriched or depleted. This whenther or not a pathway was overrepresented with differentially expressed genes or underrepresented.
# count the number of significantly altered pathways
num_altered = len(result)
num_enriched = len(result[result['Direction'] == 'enriched'])
num_depleted = len(result[result['Direction'] == 'depleted'])
print(f'The total number of altered pathways is {num_altered}, the number enriched is {num_enriched}, and the number depleted is {num_depleted}.')
The total number of altered pathways is 20, the number enriched is 15, and the number depleted is 5.
%%capture
# visualize the top 5 most significantly altered KEGG pathways
# Markdown(result.head(5).to_markdown())
result.head(5)
Concept Type |
Concept ID |
Concept Name |
# Genes in Concept in Universe |
# Sig Genes Belong to Concept |
Proportion of Sig Genes in Set |
Odds Ratio |
P-value |
Direction |
Bonferroni |
BH FDR |
|
---|---|---|---|---|---|---|---|---|---|---|---|
0 |
KEGG Pathway |
dre00190 |
Oxidative phosphorylation |
128 |
10 |
0.078125 |
0.485144 |
2.12733e-05 |
depleted |
0.00351009 |
0.00351009 |
1 |
KEGG Pathway |
dre04080 |
Neuroactive ligand-receptor interaction |
417 |
56 |
0.134293 |
0.781985 |
0.000103387 |
depleted |
0.0170589 |
0.00675259 |
2 |
KEGG Pathway |
dre03010 |
Ribosome |
124 |
15 |
0.120968 |
0.543902 |
0.000132295 |
depleted |
0.0218287 |
0.00675259 |
3 |
KEGG Pathway |
dre00030 |
Pentose phosphate pathway |
30 |
15 |
0.5 |
1.38619 |
0.000163699 |
enriched |
0.0270104 |
0.00675259 |
4 |
KEGG Pathway |
dre04512 |
ECM-receptor interaction |
101 |
35 |
0.346535 |
1.23605 |
0.000941206 |
enriched |
0.155299 |
0.0310598 |
# create a bar chart to visualize the enriched results
enriched_results = result[result['Direction'] == 'enriched']
enrichplots.bar_chart(enriched_results,
title = 'Top 10 Enriched KEGG Pathways',
xlab = 'KEGG Pathway',
max_num_pathways=10)
If you were to compare these results with those of Fisher’s Exact Test above, you would notice many of the significant pathways show up in both methods.
Functional Enrichment via Orthology
Purpose: Conduct functional enrichment analysis via orthology to identify overrepresented pathways in zebrafish associated with a set of genes that have been originally annotated for human.
This method is ideal for scenarios where you want to leverage existing knowledge from the well-annotated species, humans, and apply it to the less-studied organism, zebrafish. Orthology-based functional enrichment allows you to transfer annotations and pathway information from humamns, where extensive functional genomics studies have been conducted, to the target species of zebrafish. In the context of zebrafish, which may have fewer experimentally validated annotations, leveraging orthologous relationships helps bridge the gap in functional understanding.
In danRerLib
, the organism dreM
stands for Danio rerio mapped, meaning the zebrafish genome has been mapped to human via orthology. Therefore, we can conduct the same steps as shown above but choose the dreM
organism to conduct functional enrichment via orthology.
Conduct Logistic Regression for Enrichment Analysis
Step 1: Loading Data. Data should be loaded into a Pandas Dataframe with the structure as describes above. This is defined as your gene_universe_tpp
as is contains all genes detected in your study.
gene_universe_tpp = pd.read_csv('data/test_data/TPP_DE.txt', sep = '\t')
Step 2: Define necessary parameters and choices. To run the enrichment function, there are some key parameters you have the option of using (see above).
# the gene id type you have in your gene universe.
# In this case, we have NCBI Gene IDs.
ncbi_id = 'NCBI Gene ID'
# the database you wish to test enrichment for
database_choice = 'KEGG Pathway'
# organism - for orthology we want to choose 'dreM'
organism_for_orthology = 'dreM'
Step 3: Run enrichment function.
result = enrich_logistic(gene_universe = gene_universe_tpp,
gene_id_type = ncbi_id,
database = database_choice,
org = organism_for_orthology
)
Step 4: Visualize results. In this case, we assessed whether or not a KEGG Pathway was up or downregulated.
# count the number of significantly altered pathways
num_altered = len(result)
num_upregulated = len(result[result['Direction'] == 'upregulated'])
num_downregulated = len(result[result['Direction'] == 'downregulated'])
print(f'The total number of altered pathways is {num_altered}, the number upregulated is {num_upregulated}, and the number downregulated is {num_downregulated}.')
The total number of altered pathways is 67, the number upregulated is 35, and the number downregulated is 32.
As you can see based on the analysis done above with the default organism of dre
, we have detected more significanly altered pathways via orthology. This is because the number of KEGG Pathways that are annotated for human greatly out numbers those of zebrafish.
fun = KEGG._get_total_number_kegg
print(f'Total Number dre: {fun("dre", "pathway")}')
print(f'Total Number dreM: {fun("dreM", "pathway")}')
print(f'Total Number hsa: {fun("hsa", "pathway")}')
Total Number dre: 179
Total Number dreM: 357
Total Number hsa: 357
As you can see, Danio rerio (dre
) only has 179 annotated pathways while human (hsa
) has 357; therefore, mapped zebrafish via orthology (dreM
) has 357 annotated pathways as well.
%%capture
# visualize the top 5 most significantly altered KEGG pathways
# Markdown(result.head(5).to_markdown())
result.head(5)
Concept Type |
Concept ID |
Concept Name |
# Genes in Concept in Universe |
# Sig Genes Belong to Concept |
Proportion of Sig Genes in Set |
Odds Ratio |
P-value |
Direction |
Bonferroni |
BH FDR |
|
---|---|---|---|---|---|---|---|---|---|---|---|
0 |
KEGG Pathway |
dreM01200 |
Carbon metabolism |
117 |
30 |
0.25641 |
0.761874 |
5.50136e-07 |
downregulated |
0.000184295 |
0.000184295 |
1 |
KEGG Pathway |
dreM03013 |
Nucleocytoplasmic transport |
94 |
22 |
0.234043 |
0.769903 |
1.64493e-05 |
downregulated |
0.00551052 |
0.00275526 |
2 |
KEGG Pathway |
dreM00970 |
Aminoacyl-tRNA biosynthesis |
38 |
10 |
0.263158 |
0.718016 |
8.14351e-05 |
downregulated |
0.0272808 |
0.00909359 |
3 |
KEGG Pathway |
dreM05202 |
Transcriptional misregulation in cancer |
192 |
28 |
0.145833 |
1.20917 |
0.000208636 |
upregulated |
0.0698932 |
0.0158109 |
4 |
KEGG Pathway |
dreM00010 |
Glycolysis / Gluconeogenesis |
68 |
14 |
0.205882 |
0.771128 |
0.000235983 |
downregulated |
0.0790543 |
0.0158109 |
# create a bar chart to visualize the enriched results
upregulated_results = result[result['Direction'] == 'upregulated']
enrichplots.bar_chart(upregulated_results,
title = 'Top 10 Upregulated KEGG Pathways',
xlab = 'KEGG Pathway',
max_num_pathways=10)
# create a bar chart to visualize the enriched results
downregulated_results = result[result['Direction'] == 'downregulated']
enrichplots.bar_chart(downregulated_results,
title = 'Top 10 Down Regulated KEGG Pathways',
xlab = 'KEGG Pathway',
max_num_pathways=10)
Glossary
Enriched Pathway: An enriched pathway refers to a biological pathway that contains a statistically significant number of genes that show altered expression under a particular experimental condition, irregardless of the gene being up or down regulated. For example, if you perform gene expression analysis comparing two conditions (e.g., control vs. treatment) and find that a specific biological pathway has a higher-than-expected number of genes with altered expression, you would say that pathway is enriched.
Depleted Pathway: A depleted pathway is the opposite of an enriched pathway. It refers to a biological pathway that contains a statistically significant number of genes with lower expression than expected under the given experimental condition, irregardless of the gene being up or down regulated. For example, if a biological pathway shows a lower-than-expected number of genes with altered expression in your analysis, that pathway is considered depleted.
Upregulated Pathway: An upregulated pathway refers to a biological pathway where the majority of genes have increased expression levels under a specific experimental condition. For example, if you find that most genes in a particular pathway have higher expression levels in the treatment group compared to the control group, that pathway is considered upregulated.
Downregulated Pathway: A downregulated pathway is the opposite of an upregulated pathway. It indicates a biological pathway where the majority of genes have decreased expression levels under the experimental condition being studied. For example, if most genes in a pathway exhibit lower expression levels in the treatment group compared to the control group, that pathway is considered downregulated.
Conclusion
This concludes the functional enrichment tutorial! In summary, the key functions in this library for enrichment are:
function |
purpose |
---|---|
enrich_fishers |
conduct functional enrichment using the fishers exact method |
enrich_logistic |
conduct functional enrichment using the logistic regression method |
For more information about the full functionality of each function, please refer to the API Reference.
References
Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: a logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics (Oxford, England), 25(2), 211–217.