Functional Enrichment Analysis

Author: Ashley Schwartz

Date: Originally Developed September 2023, Last Revised March 2024

Purpose and Background

This tutorial goes over how to retrieve the genes in a particular pathway from the Kyoto Encyclopedia of Genes and Genomes (KEGG), retrieve the genes in a particular function from Gene Ontology (GO), and perform gene enrichment analysis on these gene sets using a variety of different methods.

Databases Supported

Database

Abbreviation

Description

Link

Kyoto Encyclopedia of Genes and Genomes Pathway Database

KEGG Pathway

Molecular pathways and interactions

link

Kyoto Encyclopedia of Genes and Genomes Disease Database

KEGG Disease

Genes and pathways associated with diseases

link

Gene Ontology Biological Process Database

GO BP

Molecular events in biological processes

link

Gene Ontology Cellular Component Database

GO CC

Cellular structures and locations

link

Gene Ontology Molecular Function Database

GO MF

Specific activities or functions of gene products

link

Database Naming Conventions

Each database follows naming conventions for their pathways and functions. For a quick overview of how each database names a particular pathway or function, we provide some examples here:

Database

ID Format

Example ID

Pathway/Function Name

Notes

KEGG Pathway

Numeric with Prefix

dre00120 (Zebrafish), hsa00120 (Human)

Primary bile acid biosynthesis

KEGG pathway IDs are species-specific.

KEGG Disease

Alphanumeric

H00001 (Human)

B-cell acute lymphoblastic leukemia

KEGG disease IDs are identified for human and are not defined for zebrafish.

GO BP

Alphanumeric

GO:0007165

Signal transduction

GO BP IDs are unique to each species.

GO CC

Alphanumeric

GO:0005634

Nucleus

GO CC IDs are unique to each species.

GO MF

Alphanumeric

GO:0003824

Catalytic activity

GO MF IDs are unique to each species.

Enrichment Methods Supported

Enrichment Method

Description

Fisher’s Exact Test

A statistical method for assessing the significance of gene enrichment within predefined functional categories or pathways using categorical gene set data.

Logistic Regression

A modeling technique for predicting gene enrichment in specific functional categories, offering flexibility to handle diverse data types within gene sets. This method does not depend on a p-value cutoff for gene significance.

Organisms Supported

danRerLib is built for zebrafish and supports three organism types:

The table you’ve provided is clear and informative, but there appears to be a minor typographical error in the descriptions for “zebrafish” and “mapped zebrafish.” Here’s a corrected version:

Organisms Supported

danRerLib is built for zebrafish and supports three organism types:

Organism

Abbreviation

Description

Zebrafish

‘dre’

The zebrafish taxonomy

Human

‘hsa’

The human taxonomy

Mapped Zebrafish

‘dreM’

An organism defined through orthology

Requirements

In this tutorial we will be utilizing:

  • the required python package

    • see install notes if not currently installed.

# IMPORT PYTHON PACKAGE
# ---------------------
from danrerlib import KEGG, utils, GO
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Download Gene Sets

If you are interested in identifying the genes within a gene set, you can download genes for a given KEGG pathway id, KEGG disease id, or GO id (BP, CC, or MF). A key benefit and advantage to danRerLib is we can also generate the gene list for a pathway that might not exist for zebrafish, but does exist for human. We can then map the genes to zebrafish genes if desired.

KEGG Pathway

Purpose: Given a KEGG pathway id, retrieve a list of all genes in said pathway.

As a reminder, danRerLib supports three organism types (see background for more information): hsa, dre, and dreM. Of course, you are likely most interested in the zebrafish genes, but human genes are also provided for comparisons. Lets look at the KEGG pathway 00400 which is defined as phenylalanine, tyrosine and tryptophan biosynthesis. This pathway exists for human hsa00400 and zebrafish dre00400. Here are a few examples of gathering the genes in this pathway for different organisms:

kegg_id = '00400'
human_genes = KEGG.get_genes_in_pathway(kegg_id, 'hsa')
human_genes
Human NCBI Gene ID
0 137362
1 259307
2 2805
3 2806
4 5053
5 6898
dre_genes = KEGG.get_genes_in_pathway(kegg_id, 'dre')
dre_genes
NCBI Gene ID
0 335974
1 337166
2 378962
3 406330
4 406688
5 561410
6 791730
mapped_genes = KEGG.get_genes_in_pathway(kegg_id, 'dreM')
mapped_genes
NCBI Gene ID
0 561410
1 335974
2 337166
3 378962
4 406330
5 406688
6 791730

It is also possible to include the organism identifier in the KEGG id sent to the function and omit the organism identifier. As an example of this, see below where we only provide the full KEGG id desired.

KEGG.get_genes_in_pathway('hsa00400')
Human NCBI Gene ID
0 137362
1 259307
2 2805
3 2806
4 5053
5 6898

KEGG Disease

Purpose: Given a KEGG disease id, retrieve a list of all genes in said disease.

We can also get the list of genes for a particular disease listed in the KEGG database. This function takes the disease id and the organism. Not that the zebrafish organism in this case comes from mapped values as the disease gene lists are not annotated for zebrafish within the KEGG database.

# human organism
KEGG.get_genes_in_disease('H00001', 'hsa')
Human NCBI Gene ID
0 25
1 4297
2 4299
3 6929
4 5087
5 861
6 4609
7 64109
8 5079
# zebrafish organism
KEGG.get_genes_in_disease('H00001', 'dre')
NCBI Gene ID
0 100000720
1 557048
2 100537394
3 664768
4 30310
5 58138
6 570960
7 58126
8 393141
9 30686
10 100294510
11 60636

It is expected, as described in the mapping tutorial, that the number of genes in the human gene set may not always be equal to the number of genes in the zebrafish set. This is because there is not always a 1:1 mapping in orthology between human ans zebrafish taxonomy.

Gene Ontology

Purpose: Given a Go concept id, retrieve a list of all genes in said GO concept.

danRerLib supports Gene Ontology Biological Processes (BP), Cellular Components (CC), and Molecular Functions (MF). There is one primary function for the GO module that will retrieve the genes in any given concept, regardless if it comes from BP, CC, or MF. This can be done for either organism (hsa, dre, or dreM). Retrieval of the GO id GO:0033554, cellular response to stress, is shown below for the zebrafish.

# zebrafish organism
GO.get_genes_in_GO_concept('GO:0033554', 'dre')
ZFIN ID
0 ZDB-GENE-030826-18
1 ZDB-GENE-050320-35
2 ZDB-GENE-040718-255
3 ZDB-GENE-040825-4
4 ZDB-GENE-091204-246
5 ZDB-GENE-030827-3
6 ZDB-GENE-030131-8638
7 ZDB-GENE-030326-2
8 ZDB-GENE-030131-9531
9 ZDB-GENE-030131-1096
10 ZDB-GENE-030131-6701
11 A0A2R8S050

Conduct Functional Enrichment Analysis

Functional enrichment analysis is a powerful method for uncovering biological insights from a set of genes. Whether you have a specific gene set of interest or an entire dataset, you are able to perform functional enrichment analysis on pathways from KEGG and Gene Ontology. In this section, we’ll explore different scenarios, from analyzing a small gene set to conducting enrichment analysis on a larger scale using the danRerLib package.

Enrichment Module Overview

Import Enrichment Module

from danrerlib.enrichment import enrich_fishers, enrich_logistic

# can also import as:
# from danrerlib import enrichment

# import a module to help with plotting
from danrerlib import enrichplots

# just a notebook display option
from IPython.display import Markdown

Recommended Data Format: The functional enrichment package is designed to take in your differential expression results and output the pathways significantly altered. Therefore, your data should include every gene detected from your study, the associated p-value that defines significance of differential expression (Recommended to be an adjusted p-value), and the log2FC (or another metric that defines up/down regulation of the gene). This minimizes the work from the user with regard to data filtering for directional tests and background gene sets. The required data format is:

Gene ID

p-value

log2FC

any supported gene id type: NCBI Gene ID, ZFIN ID, Ensembl ID, Symbol

0 < p-value < 1

positive or negative

The data is required to have the gene id in the first column, the pvalue in the second, and the log2FC in the third. The column names can be defined by the user._

The data I will be using in this tutorial is a publicly available differential expression dataset from the Gene Expression Omnibus (GSE:…). The study investigated the effects of the chemical TPP on the developing zebrafish and obtained whole-embryo RNA-sequencing.

Required Parameters

Parameter Name

Parameter Type

Parameter Description

gene_universe

pd.DataFrame

A DataFrame containing gene information, including gene IDs, p-values, and log2FC.

database

str or list[str]

The functional annotation database(s). Options include:’KEGG Pathway’, ‘KEGG Disease’, ‘GO BP’, ‘GO CC’, ‘GO MF’, ‘GO’ (all GO databases), ‘all’ ( all databases). Can also input a list of databases, as an example: [‘KEGG Pathway’, KEGG Disease’]

gene_id_type

str

The type of gene ID in the gene universe. The recommended gene id type is NCBI Gene ID (NCBI_ID). Must be one of: NCBI Gene ID, ZFIN ID, Ensembl ID, Symbol, or for human: Human NCBI Gene ID.

Advanced, Optional Parameters

Parameter Name

Parameter Type

Parameter Description

org

str

The organism code (‘dre’ for zebrafish, ‘dreM’ for mapped zebrafish, ‘hsa’ for human, and ‘vairbale’ for the hybrid approach). Choose variable if you would like to preform enrichment via orthology with zebrafish ids as ground truth. Default is dre

direction

str

The direction of statistical test for enrichment. Unique parameter for enrichment testing options. See individual explanations below for more information.

sig_gene_cutoff_pvalue

float

The significance cutoff for gene inclusion based on p-values. Default is 0.05.

log2FC_cutoff_value

float

The log2 fold change cutoff value for gene inclusion. Default is 0.

concept_ids

list

A list of concept IDs (e.g., pathway IDs or disease IDs) to analyze. Include this if you are interested in testing a specific amount of concept IDs. Default is None.

background_gene_set

pd.DataFrame

A DataFrame representing a background gene set. Include if your gene universe does not contain all genes detected. Default is None.

sig_conceptID_cutoff_pvalue

float

The significance cutoff for concept IDs based on p-values that will be returned to you. Default is 0.05.

order_by_p_value

bool

Whether to order the results by p-value. Default is True.

min_num_genes_in_concept

int

The minimum number of genes in a concept for it to be considered. Default is 10.

include_all

bool

Include all results without filtering based on significance. Default is False.

orthology_base

str

The base organism code for orthology comparisons. Default is ‘dre’.

Functional Enrichment with Fisher’s Exact Test

Purpose: Conduct functional enrichment analysis using Fisher’s Exact Test to identify overrepresented pathways associated with a set of genes.

This method is ideal for scenarios where you have a gene set of interest and want to explore the significant enrichment of pathways from databases like KEGG and Gene Ontology. The Fisher’s Exact Test helps reveal statistically significant relationships between the gene set and specific pathways.

Overview of Fisher’s Exact Test for Overrepresentation

Contingency Table:

Consider a scenario where we are assessing the enrichment of a specific KEGG Pathway (let’s call it “Pathway A”) in a set of differentially expressed genes. The contingency table could look like this:

\[\begin{split} \begin{array}{|c|c|c|} \hline & \text{In Pathway A} & \text{Not In Pathway A} \\ \hline \text{Significantly Expressed} & a & b \\ \hline \text{Not Significantly Expressed} & c & d \\ \hline \end{array} \end{split}\]

In this table:

  • \(a\) is the number of genes both in the pathway and significantly expressed.

  • \(b\) is the number of genes not in the pathway but significantly expressed.

  • \(c\) is the number of genes in the pathway but not significantly expressed.

  • \(d\) is the number of genes neither in the pathway nor significantly expressed.

Statistical Formulas:

  1. Odds Ratio:

    \[\text{Odds Ratio} = \frac{ad}{bc}\]
  2. Fisher’s Exact Test:

    \[P(X \leq a) = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{N}{a+c}}\]

    where \(N = a+b+c+d\) is the total number of genes.

  3. P-value Calculation:

    \[P(\text{Observed or More Extreme}) = \sum_{I=0}^{a} \frac{\binom{a+b}{i} \binom{c+d}{a-i}}{\binom{N}{a+c}}\]

    A low p-value suggests that the observed distribution is unlikely to occur by random chance, indicating potential enrichment or depletion.

Primary Function and Unique Parameters

The primary function is enrich_fishers with all the required and advanced parameters discussed above in addition to one unique parameter, direction.

Parameter Name

Parameter Type

Parameter Description

direction

str

The direction of statistical test for enrichment (‘up’, ‘down’, or ‘non-directional’). If you choose up for example, you will be able to test if the gene set is significantly up-regulated. Default is ‘non-directional’.

Conduct Fishers Exact Test for Enrichment Analysis (enriched/depleted)

Step 1: Loading Data. Data should be loaded into a Pandas Dataframe with the structure as describes above. This is defined as your gene_universe_tpp as is contains all genes detected in your study.

gene_universe_tpp = pd.read_csv('data/test_data/TPP_DE.txt', sep = '\t')

To get a sneak peak about what the data looks like, we can look at the first 5 elements:

gene_universe_tpp.head(5)
NCBI Gene ID PValue logFC
0 100000006 0.792615 0.115009
1 100000009 0.607285 -0.144714
2 100000026 0.021338 0.603871
3 100000030 0.007880 -2.083141
4 100000044 0.015286 0.803879

Step 2: Define necessary parameters and choices. To run the enrichment function, there are some key parameters you have the option of using (see above).

# the gene id type you have in your gene universe. 
# In this case, we have NCBI Gene IDs. 
# This is the name of the column that holds your gene ids.
ncbi_id = 'NCBI Gene ID'

# the database you wish to test enrichment for 
database_choice = 'KEGG Pathway'

Step 3: Run enrichment function.

result = enrich_fishers(gene_universe = gene_universe_tpp, 
                database = database_choice, 
                gene_id_type = ncbi_id)

Step 4 Visualize results. In this case, we assessed whether or not a KEGG Pathway was enriched or depleted. This means was there significant up and/or down regulation of a pathway.

# count the number of significantly altered pathways
num_altered = len(result)
num_enriched = len(result[result['Direction'] == 'enriched'])
num_depleted = len(result[result['Direction'] == 'depleted'])

print(f'The total number of altered pathways is {num_altered}, the number enriched is {num_enriched}, and the number depleted is {num_depleted}.')
The total number of altered pathways is 13, the number enriched is 9, and the number depleted is 4.
%%capture
# visualize the top 5 most significantly altered KEGG pathways
# Markdown(result.head(5).to_markdown())
result.head(5)

Concept Type

Concept ID

Concept Name

# Genes in Concept in Universe

# Sig Genes Belong to Concept

Proportion of Sig Genes in Set

Odds Ratio

P-value

Direction

Bonferroni

BH FDR

0

KEGG Pathway

dre00190

Oxidative phosphorylation

128

10

0.078125

0.326766

0.000137991

depleted

0.0227685

0.0149447

1

KEGG Pathway

dre04080

Neuroactive ligand-receptor interaction

417

56

0.134293

0.590132

0.000197834

depleted

0.0326427

0.0149447

2

KEGG Pathway

dre00030

Pentose phosphate pathway

30

15

0.5

3.94897

0.000271721

enriched

0.044834

0.0149447

3

KEGG Pathway

dre04512

ECM-receptor interaction

101

35

0.346535

2.10563

0.000684854

enriched

0.113001

0.0282502

4

KEGG Pathway

dre04142

Lysosome

160

19

0.11875

0.520786

0.00532119

depleted

0.877996

0.175599

# create a bar chart to visualize the enriched results
enriched_results = result[result['Direction'] == 'enriched']

enrichplots.bar_chart(enriched_results, 
                      title = 'Top Enriched KEGG Pathways', 
                      xlab = 'KEGG Pathway',
                      max_num_pathways=9)
../_images/0b72f70e05ce22cbcde0198862c6a03b9932acc3b59dd79b0770ae9fa7f076fa.png

Step 5 Interpretation. Interpret the results based on the direction of enrichment. Consider the biological context of the pathways and their relevance to your study.

For example, the Pentose phosphate pathway is significantly enriched in the set of genes that are differentially expressed in zebrafish exposed to TPP. The odds ratio of 3.94 and a very low p-value (0.0002) suggest a strong association between the pathway and the significant genes. The term “enriched” indicates that there is an overrepresentation of significant genes in this pathway compared to what would be expected by chance.

Testing for Upregulated Pathways

Contingency Table for Upregulated Genes:

Assume we are interested in assessing whether a KEGG Pathway (let’s call it “Pathway A”) is enriched with upregulated genes. The contingency table for this scenario would be:

\[\begin{split} \begin{array}{|c|c|c|} \hline & \text{In Pathway A} & \text{Not In Pathway A} \\ \hline \text{Upregulated} & a & b \\ \hline \text{Not Upregulated} & c & d \\ \hline \end{array} \end{split}\]

Here:

  • \(a\) is the number of genes both in the pathway and upregulated.

  • \(b\) is the number of genes not in the pathway but upregulated.

  • \(c\) is the number of genes in the pathway but not upregulated.

  • \(d\) is the number of genes neither in the pathway nor upregulated.

Step 1: Load the data. We will use the same dataset as before.

gene_universe_tpp = pd.read_csv('data/test_data/TPP_DE.txt', sep = '\t')

Step 2: Define necessary parameters and choices. The key difference here is the test direction.

# same as before
ncbi_id = 'NCBI Gene ID'
database_choice = 'KEGG Pathway'

# define the directional test
test_direction = 'up'

Step 3 Run enrichment function.

upregulated_results = enrich_fishers(gene_universe = gene_universe_tpp, 
                        gene_id_type = ncbi_id,
                        database = database_choice, 
                        direction = test_direction   
                        )

Step 4 Visualize results. In this case, we assessed whether or not a KEGG Pathway was upregulated.

# count the number of significantly altered pathways
num_upregulated = len(upregulated_results)

print(f'The total number of up-regulated pathways is {num_upregulated}.')
The total number of up-regulated pathways is 13.
%%capture
# visualize the top 5 most significantly altered KEGG pathways
# Markdown(result.head(5).to_markdown())
result.head(5)

Concept Type

Concept ID

Concept Name

# Genes in Concept in Universe

# Sig Genes Belong to Concept

Proportion of Sig Genes in Set

Odds Ratio

P-value

Direction

Bonferroni

BH FDR

0

KEGG Pathway

dre00190

Oxidative phosphorylation

128

10

0.078125

0.326766

0.000137991

depleted

0.0227685

0.0149447

1

KEGG Pathway

dre04080

Neuroactive ligand-receptor interaction

417

56

0.134293

0.590132

0.000197834

depleted

0.0326427

0.0149447

2

KEGG Pathway

dre00030

Pentose phosphate pathway

30

15

0.5

3.94897

0.000271721

enriched

0.044834

0.0149447

3

KEGG Pathway

dre04512

ECM-receptor interaction

101

35

0.346535

2.10563

0.000684854

enriched

0.113001

0.0282502

4

KEGG Pathway

dre04142

Lysosome

160

19

0.11875

0.520786

0.00532119

depleted

0.877996

0.175599

# create a bar chart to visualize the enriched results
enrichplots.bar_chart(upregulated_results,
                      title = 'Upregulated KEGG Pathways',
                      xlab = 'KEGG Pathways',
                      max_num_pathways=13)
../_images/0f4b74f6c929685dbb48caaa6b930829bf98ea385b69697822357bbc5bd36e19.png

Testing for Downregulated Pathways

Contingency Table for Downregulated Genes:

Assume we are interested in assessing whether a KEGG Pathway (let’s call it “Pathway A”) is enriched with downregulated genes. The contingency table for this scenario would be:

\[\begin{split} \begin{array}{|c|c|c|} \hline & \text{In Pathway A} & \text{Not In Pathway A} \\ \hline \text{Downregulated} & a & b \\ \hline \text{Not Downregulated} & c & d \\ \hline \end{array} \end{split}\]

Here:

  • \(a\) is the number of genes both in the pathway and downregulated.

  • \(b\) is the number of genes not in the pathway but downregulated.

  • \(c\) is the number of genes in the pathway but not downregulated.

  • \(d\) is the number of genes neither in the pathway nor downregulated.

Steps 1-6: You can repeat the steps above to test for down-regulated genes.

downregulated_results = enrich_fishers(gene_universe = gene_universe_tpp, 
                        gene_id_type = ncbi_id,
                        database = database_choice, 
                        direction = 'down'   
                        )
# count the number of significantly altered pathways
num_downregulated = len(downregulated_results)
print(f'The total number of downregulated pathways is {num_downregulated}.')
The total number of downregulated pathways is 19.
enrichplots.bar_chart(downregulated_results,
                       title = 'Top 15 Down-Regulated KEGG Pathways',
                       xlab = 'KEGG Pathways', 
                       max_num_pathways=15)
../_images/1c8ac9058287cb5e3470b84ea2ac6aeb0f32b37cfbb522ba539d2dad9b57cff0.png

Advanced Options

The Fisher’s Exact Method, as described in the overview, is greatly dependent on the significance threshold. The definition of a gene that is condisered to be significant can depend on a variety of different factors related to your study. A common significane threshold for determining if a gene is considered to be significantly differentially expressed is a combination of both the adjusted p-value (0.05) and the log2FC (|log2FC|>1). If you would like to incorpoorate this into the analysis, you can!

enrichment_results = enrich_fishers(gene_universe_tpp,
                        database='KEGG Pathway',
                        gene_id_type='NCBI Gene ID',
                        sig_gene_cutoff_pvalue=0.05,
                        log2FC_cutoff_value=1
                        )
# count the number of significantly altered pathways
num_altered = len(enrichment_results)

print(f'The total number of altered pathways is {num_altered}.')
The total number of altered pathways is 11.

As you can see, we have a fewer significant pathways (11 here vs 13 above) as the significance threshold is stricter.

Note: there are a lot more GO Biological Processes then there are KEGG Pathways, therefore, testing for those will take a bit longer.

Functional Enrichment with Logistic Regression

Purpose: Conduct functional enrichment analysis using logistic regression to identify overrepresented pathways associated with a set of genes.

This method is ideal for scenarios where you have a gene set of interest and want to explore the significant enrichment of pathways from databases like KEGG and Gene Ontology. Logistic Regression helps reveal statistically significant relationships between the gene set and specific pathways. The logistic regression method is known to have better statistical power than fishers exact test (Sartor et. al., 2009.)

Overview of Logistic Regression for Enrichment Analysis

Logistic Regression Model:

The logistic regression method for gene set enrichment analysis involves modeling the log-odds of a gene belonging to a specific category (gene set) as a linear function of the statistical significance (measured by -log(P-values)).

Statistical Formulas:

Here’s a breakdown of the key components mentioned in the explanation:

  1. Logistic Regression Model:

    \[\log\left(\frac{p}{1-p}\right) = \alpha + \beta \cdot x\]

    where:

    • \(\log\) is the base 10 logarithm.

    • \(p\) is the probability of a gene belonging to the specific category.

    • \(\alpha\) is the intercept.

    • \(\beta\) is the slope parameter.

    • \(x\) is the -log(P-value) used as the explanatory variable.

  2. Odds Ratio: The odds ratio indicates the change in the odds of a gene belonging to the specific category for a unit increase in the -log(P-value) or a 10-fold decrease in P-value.

    \[ \text{Odds Ratio} = \exp(\beta) \]
  3. Wald Test:

    • The Wald test is used to assess the evidence in the data that \( \beta \) is different from zero.

    • The test statistic \( W \) follows a \(\chi^2\)-distribution with one degree of freedom.

    • The P-value is calculated based on this null distribution.

  4. Estimation and Testing:

    • Maximum likelihood estimates \( \hat{\beta} \) and \( \hat{\alpha} \) are obtained using the iteratively weighted least squares (IWLS) algorithm.

    • Testing involves assessing whether \( \beta \) is significantly different from zero.

  5. Ranking:

    • Gene sets are ranked based on P-values or odds ratios.

This method aims to identify enriched or depleted gene sets based on the statistical significance of their association with differential expression. The odds ratio and P-value provide insights into the strength and significance of this association.

Primary Function and Unique Parameters

The primary function is enrich_logistic with all the required and advanced parameters discussed above in addition to one unique parameter, directional_test.

Parameter Name

Parameter Type

Parameter Description

directional_test

bool

Signifying if you would like to preform a directional test (True) to test for up/down regulation or a non-directional test (False) to test for enrichment/depletion.

Logistic Regression for the Identification of Up and Down Regulated Pathways

This is the recommended method for analyzing your gene expression data.

Step 1: Loading Data. Data should be loaded into a Pandas Dataframe with the structure as describes above. This is defined as your gene_universe_tpp as is contains all genes detected in your study.

gene_universe_tpp = pd.read_csv('data/test_data/TPP_DE.txt', sep = '\t')

To get a sneak peak about what the data looks like, we can look at the first 5 elements:

gene_universe_tpp.head(5)
NCBI Gene ID PValue logFC
0 100000006 0.792615 0.115009
1 100000009 0.607285 -0.144714
2 100000026 0.021338 0.603871
3 100000030 0.007880 -2.083141
4 100000044 0.015286 0.803879

Step 2: Define necessary parameters and choices. To run the enrichment function, there are some key parameters you have the option of using (see above).

# the gene id type you have in your gene universe. 
# this is the name of the column holding your gene ids
ncbi_id = 'NCBI Gene ID'

# the database you wish to test enrichment for 
database_choice = 'KEGG Pathway'

Step 3: Run enrichment function.

result = enrich_logistic(gene_universe = gene_universe_tpp, 
                database = database_choice,
                gene_id_type = ncbi_id                
                )

Step 4: Visualize results. In this case, we assessed whether or not a KEGG Pathway was up or down regulated. This means was there significant up and/or down regulation of a pathway.

# count the number of significantly altered pathways
num_altered = len(result)
num_upregulated = len(result[result['Direction'] == 'upregulated'])
num_downregulated = len(result[result['Direction'] == 'downregulated'])

print(f'The total number of altered pathways is {num_altered}, the number upregulated is {num_upregulated}, and the number downregulated is {num_downregulated}.')
The total number of altered pathways is 47, the number upregulated is 20, and the number downregulated is 27.
%%capture
# visualize the top 5 most significantly altered KEGG pathways
# Markdown(result.head(5).to_markdown())
result.head(5)

Concept Type

Concept ID

Concept Name

# Genes in Concept in Universe

# Sig Genes Belong to Concept

Proportion of Sig Genes in Set

Odds Ratio

P-value

Direction

Bonferroni

BH FDR

0

KEGG Pathway

dre03013

Nucleocytoplasmic transport

100

26

0.26

0.734754

1.96187e-07

downregulated

3.23709e-05

2.12791e-05

1

KEGG Pathway

dre01200

Carbon metabolism

124

30

0.241935

0.752328

2.57929e-07

downregulated

4.25582e-05

2.12791e-05

2

KEGG Pathway

dre00970

Aminoacyl-tRNA biosynthesis

51

12

0.235294

0.724611

4.59772e-05

downregulated

0.00758624

0.00252875

3

KEGG Pathway

dre04010

MAPK signaling pathway

373

61

0.163539

1.1604

6.46906e-05

upregulated

0.010674

0.00256423

4

KEGG Pathway

dre04060

Cytokine-cytokine receptor interaction

153

20

0.130719

1.2292

7.7704e-05

upregulated

0.0128212

0.00256423

# create a bar chart to visualize the enriched results
upregulated_results = result[result['Direction'] == 'upregulated']

enrichplots.bar_chart(upregulated_results, 
                      title = 'Top 10 Upregulated KEGG Pathways', 
                      xlab = 'KEGG Pathway',
                      max_num_pathways=10)
../_images/dde6be34262e2e8ac642d54d9dcd6eb79daf43c58c698b9c2fa987be4e7e8e7d.png
# create a bar chart to visualize the enriched results
downregulated_results = result[result['Direction'] == 'downregulated']

enrichplots.bar_chart(downregulated_results, 
                      title = 'Top 10 Downregulated KEGG Pathways', 
                      xlab = 'KEGG Pathway',
                      max_num_pathways=10)
../_images/17de98a97bdffba8293181d056299a48470faf1da02897ea6afad83a1c5277a2.png

Conduct Logistic Regression for Enrichment Analysis (enriched/depleted)

Step 1: Loading Data. Data should be loaded into a Pandas Dataframe with the structure as describes above. This is defined as your gene_universe_tpp as is contains all genes detected in your study.

gene_universe_tpp = pd.read_csv('data/test_data/TPP_DE.txt', sep = '\t')

Step 2: Define necessary parameters and choices. To run the enrichment function, there are some key parameters you have the option of using (see above).

# the gene id type you have in your gene universe. 
# In this case, we have NCBI Gene IDs. 
ncbi_id = 'NCBI Gene ID'

# the database you wish to test enrichment for 
database_choice = 'KEGG Pathway'

Step 3: Run enrichment function.

result = enrich_logistic(gene_universe = gene_universe_tpp, 
                database = database_choice,
                gene_id_type = ncbi_id,
                directional_test = False
                )

Step 4: Visualize results. In this case, we assessed whether or not a KEGG Pathway was enriched or depleted. This whenther or not a pathway was overrepresented with differentially expressed genes or underrepresented.

# count the number of significantly altered pathways
num_altered = len(result)
num_enriched = len(result[result['Direction'] == 'enriched'])
num_depleted = len(result[result['Direction'] == 'depleted'])

print(f'The total number of altered pathways is {num_altered}, the number enriched is {num_enriched}, and the number depleted is {num_depleted}.')
The total number of altered pathways is 20, the number enriched is 15, and the number depleted is 5.
%%capture
# visualize the top 5 most significantly altered KEGG pathways
# Markdown(result.head(5).to_markdown())
result.head(5)

Concept Type

Concept ID

Concept Name

# Genes in Concept in Universe

# Sig Genes Belong to Concept

Proportion of Sig Genes in Set

Odds Ratio

P-value

Direction

Bonferroni

BH FDR

0

KEGG Pathway

dre00190

Oxidative phosphorylation

128

10

0.078125

0.485144

2.12733e-05

depleted

0.00351009

0.00351009

1

KEGG Pathway

dre04080

Neuroactive ligand-receptor interaction

417

56

0.134293

0.781985

0.000103387

depleted

0.0170589

0.00675259

2

KEGG Pathway

dre03010

Ribosome

124

15

0.120968

0.543902

0.000132295

depleted

0.0218287

0.00675259

3

KEGG Pathway

dre00030

Pentose phosphate pathway

30

15

0.5

1.38619

0.000163699

enriched

0.0270104

0.00675259

4

KEGG Pathway

dre04512

ECM-receptor interaction

101

35

0.346535

1.23605

0.000941206

enriched

0.155299

0.0310598

# create a bar chart to visualize the enriched results
enriched_results = result[result['Direction'] == 'enriched']

enrichplots.bar_chart(enriched_results, 
                      title = 'Top 10 Enriched KEGG Pathways', 
                      xlab = 'KEGG Pathway',
                      max_num_pathways=10)
../_images/532a61274fcb748ae89a1f2753f109f43042d9bd7ebaa12606587886ecb0849e.png

If you were to compare these results with those of Fisher’s Exact Test above, you would notice many of the significant pathways show up in both methods.

Functional Enrichment via Orthology

Purpose: Conduct functional enrichment analysis via orthology to identify overrepresented pathways in zebrafish associated with a set of genes that have been originally annotated for human.

This method is ideal for scenarios where you want to leverage existing knowledge from the well-annotated species, humans, and apply it to the less-studied organism, zebrafish. Orthology-based functional enrichment allows you to transfer annotations and pathway information from humamns, where extensive functional genomics studies have been conducted, to the target species of zebrafish. In the context of zebrafish, which may have fewer experimentally validated annotations, leveraging orthologous relationships helps bridge the gap in functional understanding.

In danRerLib, the organism dreM stands for Danio rerio mapped, meaning the zebrafish genome has been mapped to human via orthology. Therefore, we can conduct the same steps as shown above but choose the dreM organism to conduct functional enrichment via orthology.

Conduct Logistic Regression for Enrichment Analysis

Step 1: Loading Data. Data should be loaded into a Pandas Dataframe with the structure as describes above. This is defined as your gene_universe_tpp as is contains all genes detected in your study.

gene_universe_tpp = pd.read_csv('data/test_data/TPP_DE.txt', sep = '\t')

Step 2: Define necessary parameters and choices. To run the enrichment function, there are some key parameters you have the option of using (see above).

# the gene id type you have in your gene universe. 
# In this case, we have NCBI Gene IDs. 
ncbi_id = 'NCBI Gene ID'

# the database you wish to test enrichment for 
database_choice = 'KEGG Pathway'

# organism - for orthology we want to choose 'dreM'
organism_for_orthology = 'dreM'

Step 3: Run enrichment function.

result = enrich_logistic(gene_universe = gene_universe_tpp, 
                gene_id_type = ncbi_id,
                database = database_choice, 
                org = organism_for_orthology 
                )

Step 4: Visualize results. In this case, we assessed whether or not a KEGG Pathway was up or downregulated.

# count the number of significantly altered pathways
num_altered = len(result)
num_upregulated = len(result[result['Direction'] == 'upregulated'])
num_downregulated = len(result[result['Direction'] == 'downregulated'])

print(f'The total number of altered pathways is {num_altered}, the number upregulated is {num_upregulated}, and the number downregulated is {num_downregulated}.')
The total number of altered pathways is 67, the number upregulated is 35, and the number downregulated is 32.

As you can see based on the analysis done above with the default organism of dre, we have detected more significanly altered pathways via orthology. This is because the number of KEGG Pathways that are annotated for human greatly out numbers those of zebrafish.

fun = KEGG._get_total_number_kegg

print(f'Total Number dre: {fun("dre", "pathway")}')
print(f'Total Number dreM: {fun("dreM", "pathway")}')
print(f'Total Number hsa: {fun("hsa", "pathway")}')
Total Number dre: 179
Total Number dreM: 357
Total Number hsa: 357

As you can see, Danio rerio (dre) only has 179 annotated pathways while human (hsa) has 357; therefore, mapped zebrafish via orthology (dreM) has 357 annotated pathways as well.

%%capture
# visualize the top 5 most significantly altered KEGG pathways
# Markdown(result.head(5).to_markdown())
result.head(5)

Concept Type

Concept ID

Concept Name

# Genes in Concept in Universe

# Sig Genes Belong to Concept

Proportion of Sig Genes in Set

Odds Ratio

P-value

Direction

Bonferroni

BH FDR

0

KEGG Pathway

dreM01200

Carbon metabolism

117

30

0.25641

0.761874

5.50136e-07

downregulated

0.000184295

0.000184295

1

KEGG Pathway

dreM03013

Nucleocytoplasmic transport

94

22

0.234043

0.769903

1.64493e-05

downregulated

0.00551052

0.00275526

2

KEGG Pathway

dreM00970

Aminoacyl-tRNA biosynthesis

38

10

0.263158

0.718016

8.14351e-05

downregulated

0.0272808

0.00909359

3

KEGG Pathway

dreM05202

Transcriptional misregulation in cancer

192

28

0.145833

1.20917

0.000208636

upregulated

0.0698932

0.0158109

4

KEGG Pathway

dreM00010

Glycolysis / Gluconeogenesis

68

14

0.205882

0.771128

0.000235983

downregulated

0.0790543

0.0158109

# create a bar chart to visualize the enriched results
upregulated_results = result[result['Direction'] == 'upregulated']

enrichplots.bar_chart(upregulated_results, 
                      title = 'Top 10 Upregulated KEGG Pathways', 
                      xlab = 'KEGG Pathway',
                      max_num_pathways=10)
../_images/7af344bd29b626e5913baae37b4659844ccca4c44373bd409fff008e018e743d.png
# create a bar chart to visualize the enriched results
downregulated_results = result[result['Direction'] == 'downregulated']

enrichplots.bar_chart(downregulated_results, 
                      title = 'Top 10 Down Regulated KEGG Pathways', 
                      xlab = 'KEGG Pathway',
                      max_num_pathways=10)
../_images/9bee635b0e520d7ac275fb4d432438247e6c5a86f3ce6fb3a88f403daf93b01c.png

Glossary

Enriched Pathway: An enriched pathway refers to a biological pathway that contains a statistically significant number of genes that show altered expression under a particular experimental condition, irregardless of the gene being up or down regulated. For example, if you perform gene expression analysis comparing two conditions (e.g., control vs. treatment) and find that a specific biological pathway has a higher-than-expected number of genes with altered expression, you would say that pathway is enriched.

Depleted Pathway: A depleted pathway is the opposite of an enriched pathway. It refers to a biological pathway that contains a statistically significant number of genes with lower expression than expected under the given experimental condition, irregardless of the gene being up or down regulated. For example, if a biological pathway shows a lower-than-expected number of genes with altered expression in your analysis, that pathway is considered depleted.

Upregulated Pathway: An upregulated pathway refers to a biological pathway where the majority of genes have increased expression levels under a specific experimental condition. For example, if you find that most genes in a particular pathway have higher expression levels in the treatment group compared to the control group, that pathway is considered upregulated.

Downregulated Pathway: A downregulated pathway is the opposite of an upregulated pathway. It indicates a biological pathway where the majority of genes have decreased expression levels under the experimental condition being studied. For example, if most genes in a pathway exhibit lower expression levels in the treatment group compared to the control group, that pathway is considered downregulated.

Conclusion

This concludes the functional enrichment tutorial! In summary, the key functions in this library for enrichment are:

function

purpose

enrich_fishers

conduct functional enrichment using the fishers exact method

enrich_logistic

conduct functional enrichment using the logistic regression method

For more information about the full functionality of each function, please refer to the API Reference.

References

  • Sartor, M. A., Leikauf, G. D., & Medvedovic, M. (2009). LRpath: a logistic regression approach for identifying enriched biological groups in gene expression data. Bioinformatics (Oxford, England), 25(2), 211–217.