Replication Workflow for danRerLib Analysis

Introduction

This Jupyter Notebook contains the reproducible results presented in the associated publication available in this GitHub Repository (see README for more information on the publication). The primary objectives of this notebook are: (1) to showcase the applicability and novelty of the danRerLib Python package and (2) to offer users a step-by-step, reproducible workflow that demonstrates how to effectively utilize danRerLib for transcriptomic analysis in zebrafish research. Through this notebook, users can replicate the analyses performed in the associated publication and gain insights into the capabilities of danRerLib for their own research.

Requirements

To run this workflow, ensure that the following Python packages are installed on your system. Most of these packages are commonly included in Python distributions, but if any are missing, you can install them using the following commands:

!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install statsmodels

Additionally, for the danRerLib package, it can be easily installed via pip. Execute the following command to install the danRerLib package:

!pip install danrerlib

If you prefer a clean and isolated environment for this workflow, consider creating a virtual environment. Here’s an example of creating a virtual environment named “myenv”:

# Create a virtual environment
python -m venv myenv

# Activate the virtual environment
# On Windows:
.\myenv\Scripts\activate
# On macOS/Linux:
source myenv/bin/activate

Make sure to activate your Python environment or virtual environment before running these commands. This ensures that the packages are installed in the correct environment. If you are not familiar with virtual environments, you can refer to the official Python documentation on virtual environments for more detailed information.

Data Preparation

The differential expression data used in this workflow was generated in a previous publication using RNA sequencing to identify genes differentially expressed following exposure to tris(4-chlorophenyl)methanol (TCPMOH) in zebrafish embryos [1]. The data is publicly available on the Gene Expression Omnibus (GEO) under the accession number GSE165920.

Accessing the Data

To access the data, you can visit the GEO database and search for the accession number GSE165920 and download the supplementary file. The detailed description of the data can be found in the associated publication. Alternatively, for your convenience, the downloaded data used in this analysis has also been included in this repository here.

Analysis Workflow

In this workflow, we will perform functional enrichment analysis using the logistic regression method provided by the danrerlib Python package. This method leverages multiple modules within the package, streamlining the analysis process for user convenience. We will conduct the analysis using zebrafish annotations alone and leverage the orthology capabilities of danRerLib. If you were to utilize danRerLib, we recommend using the hybrid orthology method via logistic regression as shown here.

Step 1: Import Libraries

To utilize danrerlib, you will need to import it directly into the workspace. We will be utilizing the enrich_logistic function from the enrichment module and some visualization functions from the enrichplots module. To import these functions and modules only, you can do so as:

from danrerlib.enrichment import enrich_logistic
import danrerlib.enrichplots as ep

It is also necessary to import useful packages to help us handle the data. We will be using the pandas package for this:

import pandas as pd

# just a notebook display option
from IPython.display import Markdown

Step 2: Loading Data

Begin by loading the differential expression data into your Jupyter Notebook. You can use the following code as an example:

data_path = 'data/GSE165920_DMSO_vs_TCPMOH_DESeq2_Result_GeneSymbol_Verbose.txt'
df = pd.read_csv(data_path, sep = '\t')

It’s crucial to examine the loaded data to ensure that the necessary information for enrichment testing is available. In the context of danRerLib, the required data includes the Gene ID, the associated log2 fold change (log2FC) and the associated p-value of differential expression. To inspect the data provided in the public dataset, we can take a closer look at the column headers:

# Display column headers
print(df.columns)
Index(['LLgeneID', 'LLgeneSymbol (V4.3.2.gtf)', 'LLchr', 'LLstart', 'LLend',
       'LLstrand', 'geneName', 'Ens99geneIDversion', 'EntrezGeneID',
       'ZFINgeneID', 'Base mean', 'log2(FC)', 'StdErr', 'Wald-Stats',
       'P-value', 'P-adj'],
      dtype='object')

We can see that we have the following columns that are required by danRerLib:

  • EntrezGeneID: The Gene ID in NCBI format. This is an integer Gene ID.

  • log2(FC): The log 2 fold change for the associated genes.

  • P-adj: The adjusted p-value for the differential expression. We chosse P-adj instead of P-value as P-adj correct for the multiple testing.

We can get the necessary data by filtering the original dataframe:

df_necessary_data = df[['EntrezGeneID', 'P-adj', 'log2(FC)']].dropna()

To inspect what the data looks like, we can print the first few rows of the data.

df_necessary_data.head(5)
EntrezGeneID P-adj log2(FC)
0 564531.0 2.490000e-36 -1.135762
1 560210.0 1.230000e-25 -1.050871
2 83776.0 2.500000e-22 -0.702291
3 100003999.0 1.810000e-21 -1.121619
4 541386.0 7.470000e-21 -0.830242

We can see that the Gene ID is an is reading in as a float, not an integer. This is a problem because the Entrez Gene ID as defined by NCBI is an integer. To make sure we don’t run into any other issues, we can force the Gene ID column to be of integer format using the following code:

df_necessary_data['EntrezGeneID'] = df_necessary_data['EntrezGeneID'].astype(int)
df_necessary_data.head(5)
EntrezGeneID P-adj log2(FC)
0 564531 2.490000e-36 -1.135762
1 560210 1.230000e-25 -1.050871
2 83776 2.500000e-22 -0.702291
3 100003999 1.810000e-21 -1.121619
4 541386 7.470000e-21 -0.830242

Step 3: Perform danRerLib Analysis

We are interested in investigating the enrichment results using the Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathways that have been annotated for zebrafish directly verses those that have been annotated for humans as well. A reminder that danRerLib defines the zebrafish organism as dre and the mapped zebrafish organism (defined via orthology from humans) as dreM. To use both the zebrafish and mapped zebrafish annotations with zebrafish annotations as the default, the organism should be variable.

Enrichment Analysis for Zebrafish (dre) Annotated Pathways:

results_dre = enrich_logistic(
             gene_universe = df_necessary_data,
             gene_id_type = 'Entrez Gene ID',
             database = 'KEGG Pathway',
             org = 'dre')

To get a quick peek at these results, we can look at the first two entries:

%%capture 
# ^ suppresses output
# Markdown(results_dre.head(2).to_markdown())
results_dre.head(2)

Concept Type

Concept ID

Concept Name

# Genes in Concept in Universe

# Sig Genes Belong to Concept

Proportion of Sig Genes in Set

Odds Ratio

P-value

Direction

Bonferroni

BH FDR

0

KEGG Pathway

dre01100

Metabolic pathways

1638

57

0.0347985

0.735718

2.2975e-09

downregulated

3.79087e-07

3.79087e-07

1

KEGG Pathway

dre03008

Ribosome biogenesis in eukaryotes

74

12

0.162162

0.796797

5.36501e-06

downregulated

0.000885227

0.000442613

Enrichment Analysis for Variable (zebrafish and human) Annotated Pathways:

results_variable = enrich_logistic(
             gene_universe = df_necessary_data,
             gene_id_type = 'Entrez Gene ID',
             database = 'KEGG Pathway',
             org = 'variable')

Again, to get a quick peek at these results, we can look at the first two entries:

%%capture 
# ^ suppresses output
# Markdown(results_variable.head(2).to_markdown())
results_variable.head(2)

Concept Type

Concept ID

Concept Name

# Genes in Concept in Universe

# Sig Genes Belong to Concept

Proportion of Sig Genes in Set

Odds Ratio

P-value

Direction

Bonferroni

BH FDR

0

KEGG Pathway

dre01100

Metabolic pathways

1638

57

0.0347985

0.735718

2.2975e-09

downregulated

3.79087e-07

3.79087e-07

1

KEGG Pathway

dreM05321

Inflammatory bowel disease

27

2

0.0740741

1.90326

3.07117e-06

upregulated

0.000534384

0.000534384

Step 4: Interpretation

To provide a quick overview of the results and output, the following describes each column in the enrichment analysis output:

Column Name

Description

Concept Type

The database tested from, in this case it is ‘KEGG Pathway’.

Concept ID

The ID for the concept, or pathway, from the associated Concept Type, or database. You’ll notice that the concepts from using zebrafish only have a prefix of dre and those from the variable case sometimes have dreM. This means that pathway was mapped via orthology.

Concept Name

The name of the concept.

# Genes in Concept in Universe

The number of genes in the concept that are found in the dataset provided.

# Sig Genes Belong to Concept

The number of significant genes in the dataset provided that are in the concept.

Proportion of Sig Genes in Set

The number of significant genes in the concept divided by the total number of genes in the concept

Odds Ratio

The odds ratio measures the likelihood of finding significant genes in the analyzed concept compared to the rest of the genes in the dataset.

P-value

The p-value indicating the significance of the enrichment.

Direction

Indicates whether the genes in the concept are upregulated or downregulated in the dataset.

Bonferrnoi

Adjusted p-value using the Bonferroni method.

BH FDR

Adjusted p-value using the BH FDR method.

We can take a look further into the top 10 upregulated and top 10 downregulated genes per methodology.

To get the top 10 upregulated for the dre case, we can execute the following code:

%%capture 
# ^ suppresses output
# Markdown(results_dre[ results_dre['Direction'] == 'upregulated' ].head(10).to_markdown())
results_dre[ results_dre['Direction'] == 'upregulated' ].head(10)

Concept Type

Concept ID

Concept Name

# Genes in Concept in Universe

# Sig Genes Belong to Concept

Proportion of Sig Genes in Set

Odds Ratio

P-value

Direction

Bonferroni

BH FDR

4

KEGG Pathway

dre04510

Focal adhesion

264

3

0.0113636

1.50245

7.87898e-05

upregulated

0.0130003

0.00251133

6

KEGG Pathway

dre04514

Cell adhesion molecules

136

4

0.0294118

1.56559

0.000172863

upregulated

0.0285224

0.00375606

8

KEGG Pathway

dre04744

Phototransduction

44

4

0.0909091

1.73158

0.000204876

upregulated

0.0338045

0.00375606

10

KEGG Pathway

dre04512

ECM-receptor interaction

96

2

0.0208333

1.53093

0.000509433

upregulated

0.0840564

0.00764149

11

KEGG Pathway

dre03460

Fanconi anemia pathway

44

2

0.0454545

1.55945

0.000790771

upregulated

0.130477

0.0104595

13

KEGG Pathway

dre03030

DNA replication

36

1

0.0277778

1.55237

0.00101622

upregulated

0.167677

0.0119769

14

KEGG Pathway

dre04371

Apelin signaling pathway

176

4

0.0227273

1.40889

0.00124234

upregulated

0.204985

0.0136657

17

KEGG Pathway

dre04520

Adherens junction

136

2

0.0147059

1.40642

0.00208362

upregulated

0.343797

0.0190998

18

KEGG Pathway

dre04810

Regulation of actin cytoskeleton

279

4

0.0143369

1.32796

0.00261093

upregulated

0.430804

0.0214994

23

KEGG Pathway

dre04020

Calcium signaling pathway

293

3

0.0102389

1.31456

0.00321531

upregulated

0.530527

0.0221053

Similarly for the variable case, we can execute a similar code:

%%capture 
# ^ suppresses output
# Markdown(results_variable[ results_variable['Direction'] == 'upregulated' ].head(10).to_markdown())
results_variable[ results_variable['Direction'] == 'upregulated' ].head(10)

Concept Type

Concept ID

Concept Name

# Genes in Concept in Universe

# Sig Genes Belong to Concept

Proportion of Sig Genes in Set

Odds Ratio

P-value

Direction

Bonferroni

BH FDR

1

KEGG Pathway

dreM05321

Inflammatory bowel disease

27

2

0.0740741

1.90326

3.07117e-06

upregulated

0.000534384

0.000534384

4

KEGG Pathway

dreM04062

Chemokine signaling pathway

162

4

0.0246914

1.50238

5.90899e-05

upregulated

0.0102816

0.00379117

6

KEGG Pathway

dreM04659

Th17 cell differentiation

82

3

0.0365854

1.57993

6.5365e-05

upregulated

0.0113735

0.00379117

7

KEGG Pathway

dre04510

Focal adhesion

264

3

0.0113636

1.50245

7.87898e-05

upregulated

0.0130003

0.00251133

9

KEGG Pathway

dreM04014

Ras signaling pathway

247

4

0.0161943

1.41846

0.000151207

upregulated

0.0263101

0.00512741

10

KEGG Pathway

dre04514

Cell adhesion molecules

136

4

0.0294118

1.56559

0.000172863

upregulated

0.0285224

0.00375606

11

KEGG Pathway

dreM04724

Glutamatergic synapse

138

2

0.0144928

1.478

0.000181587

upregulated

0.0315961

0.00512741

13

KEGG Pathway

dre04744

Phototransduction

44

4

0.0909091

1.73158

0.000204876

upregulated

0.0338045

0.00375606

14

KEGG Pathway

dreM04725

Cholinergic synapse

131

2

0.0152672

1.47913

0.000205036

upregulated

0.0356763

0.00512741

15

KEGG Pathway

dreM05412

Arrhythmogenic right ventricular cardiomyopathy

103

2

0.0194175

1.50175

0.000225511

upregulated

0.0392388

0.00512741

For downregulated pathways, you can run similar code again:

%%capture 
# ^ suppresses output
# Markdown(results_dre[ results_dre['Direction'] == 'downregulated' ].head(10).to_markdown())
results_dre[ results_dre['Direction'] == 'downregulated' ].head(10)

Concept Type

Concept ID

Concept Name

# Genes in Concept in Universe

# Sig Genes Belong to Concept

Proportion of Sig Genes in Set

Odds Ratio

P-value

Direction

Bonferroni

BH FDR

0

KEGG Pathway

dre01100

Metabolic pathways

1638

57

0.0347985

0.735718

2.2975e-09

downregulated

3.79087e-07

3.79087e-07

1

KEGG Pathway

dre03008

Ribosome biogenesis in eukaryotes

74

12

0.162162

0.796797

5.36501e-06

downregulated

0.000885227

0.000442613

2

KEGG Pathway

dre03040

Spliceosome

139

3

0.0215827

0.812757

1.74526e-05

downregulated

0.00287968

0.000959892

3

KEGG Pathway

dre04141

Protein processing in endoplasmic reticulum

196

10

0.0510204

0.830596

6.15407e-05

downregulated

0.0101542

0.00251133

5

KEGG Pathway

dre01240

Biosynthesis of cofactors

177

14

0.079096

0.834689

9.13211e-05

downregulated

0.015068

0.00251133

7

KEGG Pathway

dre02010

ABC transporters

44

7

0.159091

0.821231

0.000194014

downregulated

0.0320122

0.00375606

9

KEGG Pathway

dre00830

Retinol metabolism

69

8

0.115942

0.835146

0.000392985

downregulated

0.0648425

0.00648425

12

KEGG Pathway

dre00980

Metabolism of xenobiotics by cytochrome P450

55

5

0.0909091

0.836655

0.000824083

downregulated

0.135974

0.0104595

15

KEGG Pathway

dre00860

Porphyrin metabolism

56

5

0.0892857

0.840365

0.00132582

downregulated

0.21876

0.0136725

16

KEGG Pathway

dre00982

Drug metabolism - cytochrome P450

53

4

0.0754717

0.842789

0.00197357

downregulated

0.32564

0.0190998

%%capture 
# ^ suppresses output
# Markdown(results_variable[ results_variable['Direction'] == 'downregulated' ].head(10).to_markdown())
results_variable[ results_variable['Direction'] == 'downregulated' ].head(10)

Concept Type

Concept ID

Concept Name

# Genes in Concept in Universe

# Sig Genes Belong to Concept

Proportion of Sig Genes in Set

Odds Ratio

P-value

Direction

Bonferroni

BH FDR

0

KEGG Pathway

dre01100

Metabolic pathways

1638

57

0.0347985

0.735718

2.2975e-09

downregulated

3.79087e-07

3.79087e-07

2

KEGG Pathway

dre03008

Ribosome biogenesis in eukaryotes

74

12

0.162162

0.796797

5.36501e-06

downregulated

0.000885227

0.000442613

3

KEGG Pathway

dre03040

Spliceosome

139

3

0.0215827

0.812757

1.74526e-05

downregulated

0.00287968

0.000959892

5

KEGG Pathway

dre04141

Protein processing in endoplasmic reticulum

196

10

0.0510204

0.830596

6.15407e-05

downregulated

0.0101542

0.00251133

8

KEGG Pathway

dre01240

Biosynthesis of cofactors

177

14

0.079096

0.834689

9.13211e-05

downregulated

0.015068

0.00251133

12

KEGG Pathway

dre02010

ABC transporters

44

7

0.159091

0.821231

0.000194014

downregulated

0.0320122

0.00375606

17

KEGG Pathway

dreM05204

Chemical carcinogenesis - DNA adducts

23

2

0.0869565

0.775179

0.000265211

downregulated

0.0461467

0.00512741

19

KEGG Pathway

dre00830

Retinol metabolism

69

8

0.115942

0.835146

0.000392985

downregulated

0.0648425

0.00648425

23

KEGG Pathway

dre00980

Metabolism of xenobiotics by cytochrome P450

55

5

0.0909091

0.836655

0.000824083

downregulated

0.135974

0.0104595

28

KEGG Pathway

dre00860

Porphyrin metabolism

56

5

0.0892857

0.840365

0.00132582

downregulated

0.21876

0.0136725

You will notice that the results in the variable case are slightly different than the dre case alone. This is because we also tested for pathways that are not annotated for zebrafish using orthology built into danRerLib. For a detailed description of the biological underpinnings of these findings, please refer to the publication.

Step 5: Visualization

We can create some plots to visualize the up and downregulated pathways for each methodology. To do so, we use the enrich_plots module which we imported earlier as ep. There are a couple of different plotting options available. I will begin by showing the plots utilized in the publication:

ep.dotplot(results_dre, 20, 'both')
../_images/fc4ab39ec5b400cac71a0616cd27086a55c26a44f88d8696b2f8ea246e004be6.png
ep.dotplot(results_variable, 20, 'both')
../_images/26564386fbb88714526d406ad642a4da55acd5f60826b38727f659ffac4a6943.png

To show some of the other plotting options, here are some other useful plots you can generate using danRerLib:

A volcano plot showing the significantly upregulated and downregulated pathways:

ep.volcano(results_variable, legend_loc='best')
../_images/5a58e45ba523da773db82c4b5a7966015b0f9127b5505f5a68070514f7ead673.png

A bar chart showing the top 10 upregulated pathways and the strength of the significance:

upregulated_pathways_dre = results_dre[ results_dre['Direction']=='upregulated' ]
ep.bar_chart(data = upregulated_pathways_dre, 
             title = 'Top 10 Upregulated Pathways', 
             xlab = 'KEGG Pathways', 
             max_num_pathways=10)
../_images/c63a13d2b3fb97c5a9ed89bfc0f1a6ee83601de387b2fd7ad14213b56e54372e.png
dre_up = (results_dre[ (results_dre['Direction'] == 'upregulated') & (results_dre['P-value'] < 0.05) ] )
dre_down = (results_dre[ (results_dre['Direction'] == 'downregulated') & (results_dre['P-value'] < 0.05) ] )
dre_sig = (results_dre[ (results_dre['P-value'] < 0.05) ] )
variable_up = (results_variable[ (results_variable['Direction'] == 'upregulated') & (results_variable['P-value'] < 0.05) ] )
variable_down = (results_variable[ (results_variable['Direction'] == 'downregulated') & (results_variable['P-value'] < 0.05) ] )
variable_sig = (results_variable[ (results_variable['P-value'] < 0.05) ] )
print(f'Number dre upregulated: {len(dre_up)}')
print(f'Number dre downregulated: {len(dre_down)}')
print(f'Number dre total significant: {len(dre_sig)}')
print(f'Number variable upregulated: {len(variable_up)}')
print(f'Number variable downregulated: {len(variable_down)}')
print(f'Number variable total significant: {len(variable_sig)}')
Number dre upregulated: 29
Number dre downregulated: 26
Number dre total significant: 55
Number variable upregulated: 60
Number variable downregulated: 35
Number variable total significant: 95
results_dreM = enrich_logistic(
             gene_universe = df_necessary_data,
             gene_id_type = 'Entrez Gene ID',
             database = 'KEGG Pathway',
             org = 'dreM')
results_dreM['Concept ID'] = results_dreM['Concept ID'].apply(lambda x: x.replace('dreM', 'dre'))
dreM_sig = (results_dreM[ (results_dreM['P-value'] < 0.05) ] )
# Find common pathway names
common_pathways = set(dre_sig['Concept ID']).intersection(dreM_sig['Concept ID'])
print(f"Number of common significant pathways: {len(common_pathways)}")

pathways_not_in_dreM = set(dre_sig['Concept ID']).difference(dreM_sig['Concept ID'])

# Print the number of pathways and the actual names
print(f"Number of significant pathways in dre not significant in dreM: {len(pathways_not_in_dreM)}")

pathways_not_in_dre = set(dreM_sig['Concept ID']).difference(dre_sig['Concept ID'])

# Print the number of pathways and the actual names
print(f"Number of significant pathways in dreM not significant in dre: {len(pathways_not_in_dre)}")
Number of common significant pathways: 42
Number of significant pathways in dre not significant in dreM: 13
Number of significant pathways in dreM not significant in dre: 43

Conclusion

danRerLib is a specialized and efficient tool tailored for zebrafish researchers engaging in functional enrichment analysis within Python. Through this tutorial and the associated publication, danRerLib can overcome the limitations of the lack of direct zebrafish annotations. Through the integration of orthology, the package enables researchers to delve into unannotated pathways, providing valuable insights.

References

[1]: Navarrete, J., et al. The ecotoxicological contaminant tris(4-chlorophenyl)methanol (TCPMOH) impacts embryonic development in zebrafish (Danio rerio). Aquatic Toxicology. 2021;235:105815.