Orthology Mapping

Author: Ashley Schwartz

Date: Originally Developed September 2023, Last Revised March 2024

Purpose and Background

This tutorial goes over how to map zebrafish genes to their human orthologs and vice versa. Gene orthology is useful when you want to determine which genes in zebrafish correspond to genes in the human genome, or the reverse. As this is a zebrafish library, there are four different zebrafish Gene ID options available. Human Gene IDs are only supported as NCBI Entrez Gene IDs. The Gene ID options are:

| Gene ID Name | Description | Example | Notes |
| --- | --- | --- | --- |
| ZFIN ID | ZFIN gene id: always starts with ‘ZDB’ for zebrafish database | ZDB-GENE-011219-1 | used as the “master” gene id (link) |
| NCBI Gene ID | integer gene id managed by NCBI: also known as Entrez Gene ID | 140634 | link |
| Symbol | descriptive symbol/name: RefSeq symbol used in RefSeq genome build | cyp1a | nomenclature defined by ZFIN |
| Ensembl Gene ID | Ensembl database gene id: always starts with ‘ENSDAR’ | ENSDARG00000098315 | link |
| Human NCBI Gene ID | integer gene id managed by NCBI: also known as Entrez Gene ID | 1543 | link |
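Since each Gene ID type has a recognizable format, a quick format check can catch a mislabeled list before any conversion is attempted. The sketch below is only an illustration based on the prefixes listed above: guess_gene_id_type is a hypothetical helper, not a function of the library, and an integer ID cannot be told apart as zebrafish or human from its format alone.

```python
import re

def guess_gene_id_type(gene_id):
    """Guess a Gene ID type from its format (hypothetical helper)."""
    text = str(gene_id).strip()
    if text.startswith('ZDB'):
        return 'ZFIN ID'
    if text.startswith('ENSDAR'):
        return 'Ensembl Gene ID'
    if re.fullmatch(r'\d+', text):
        # integer IDs are NCBI (Entrez) IDs; the organism is not encoded
        return 'NCBI Gene ID'
    # anything else is assumed to be a gene symbol
    return 'Symbol'

# examples taken from the table above
for example in ['ZDB-GENE-011219-1', 140634, 'cyp1a', 'ENSDARG00000098315']:
    print(example, '->', guess_gene_id_type(example))
```

Because the check is case sensitive, an ID typed as zdb-gene-011219-1 would fall through to the Symbol branch, which is a useful hint that something was mistyped.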

Requirements

In this tutorial we will be utilizing two key elements:

  • a sample Gene ID list (format: .csv, .tsv, .txt) for reading in the Gene IDs; typing or copy/pasting Gene IDs is also supported. The Gene ID list can contain either zebrafish Gene IDs in any of the supported formats or human Gene IDs in NCBI Entrez format.

    • the human gene list we will be using is located in the data/test_data subdirectory of this current working directory with relative path data/test_data/hsa04911.txt

      • if you would like to download this dataset you can find it on our GitHub here

    • the zebrafish gene list we will be using is located in the same subdirectory with relative path data/test_data/dre04910.txt

      • if you would like to download this dataset you can find it on our GitHub here

  • the required Python package, danrerlib

    • See installation instructions if not already installed.

In general, while you do not need a large foundation in Python to execute the code listed in this tutorial, a general understanding of absolute and relative paths is useful.
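For readers new to paths, here is a minimal sketch using Python's standard pathlib module. It assumes the tutorial's relative path is interpreted from the current working directory, so the same file can also be written as an absolute path.

```python
from pathlib import Path

# relative path: interpreted from the current working directory
relative_path = Path('data/test_data/hsa04911.txt')

# absolute path: the same file, spelled out from the filesystem root
absolute_path = Path.cwd() / relative_path

print(relative_path.is_absolute())
print(absolute_path.is_absolute())

# checking that the file exists before reading it gives a clearer error
print(absolute_path.exists())
```

If exists() prints False, the most common cause is that Python was launched from a different folder than the one containing the data directory.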

Note: Gene IDs are spelling- and case-sensitive.

# IMPORT PYTHON PACKAGE
# ---------------------
from danrerlib import mapping, utils
import pandas as pd

Execute Orthology Mappings

Gathering human and zebrafish orthologs is especially useful when a mechanism has been characterized in humans but has yet to be examined in zebrafish. We know that approximately 70% of human genes have zebrafish orthologs, meaning roughly 70% of human genes have a counterpart in zebrafish. We can use this to our advantage in various research scenarios. In some cases, you might just be interested in gathering the orthologs for a few genes. In other scenarios, you might want to get the orthologs for a large list of genes. We will go through a variety of those scenarios here.

Simple Case: Get orthologs

Purpose: given a small list of Gene IDs of type A in organism X, convert them to Gene IDs of type B in organism Y.

The simple case is useful for a quick ortholog investigation if you have a small list you can easily paste.

Zebrafish to Human

Step 1: Define your list of Gene IDs. In this case, you would have a list of Gene IDs in any of the supported zebrafish formats. I named the python list list_of_zfish_ids and included all the Gene IDs I want to find human orthologs for. I chose to use ZFIN IDs here, but any of the supported zebrafish Gene ID types will suffice.

list_of_zfish_ids = ['ZDB-GENE-081113-5', 'ZDB-GENE-031002-50', 
                     'ZDB-GENE-120104-7', 'ZDB-GENE-000607-16', 
                     'ZDB-GENE-060503-934', 'ZDB-GENE-050320-72', 
                     'ZDB-GENE-030131-3147', 'ZDB-GENE-040426-2551']

Step 2: Tell the program which gene type you currently have.

zfish_gene_type = 'ZFIN ID'

Step 3: Launch the conversion function to get your ids converted to human Gene IDs. Note that the only supported Gene ID type for humans is the Human NCBI Gene ID, aka Entrez Gene ID.

human_ids = mapping.convert_to_human(list_of_zfish_ids, zfish_gene_type)

Step 4: To view the human ids that have been converted from zebrafish ids, you can either print them to the Python shell or save them to a file. If you would like to print them, you can use the pretty_print_series function in the utils module.

utils.pretty_print_series(human_ids)
4293
81553
26207
605
4861
55022
221143

If you would rather save the data to a file, you can save human_ids to a file named human_ids.txt in an output data directory, here data/out_data. For some default options, you can use the save_data function in the utils module. Feel free to change the output directory to any folder of your choice.

file_name = 'data/out_data/human_ids.txt'
utils.save_data(human_ids, file_name)
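If you prefer to control the output format yourself, plain pandas can do the same job. The following is a minimal sketch, not the library's save_data implementation. It assumes the converted ids are a pandas Series, uses three ids from the output above, and writes into a temporary folder standing in for data/out_data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# a few converted ids from the example above, held as a pandas Series
human_ids = pd.Series(['4293', '81553', '26207'], name='Human NCBI Gene ID')

# a temporary folder stands in for data/out_data in this sketch
out_dir = Path(tempfile.mkdtemp()) / 'out_data'
out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing

out_file = out_dir / 'human_ids.txt'
human_ids.to_csv(out_file, sep='\t', index=False)

# read the file back as text so the ids are not converted to numbers
read_back = pd.read_csv(out_file, sep='\t', dtype=str)
print(read_back)
```

Creating the output folder first avoids the error pandas raises when asked to write into a directory that does not exist.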

Human to Zebrafish

To convert from human Gene IDs to zebrafish, you follow the same steps, except that we will use the convert_to_zebrafish function in the mapping module.

Step 1: Define your list of Gene IDs. As a reminder, the only supported human Gene ID type is currently NCBI Gene ID. These are also known as Entrez Gene IDs and are of integer format. My list of Gene IDs is named list_of_human_ids.

list_of_human_ids = [55585, 23191, 4192, 5686, 
                     197021, 390, 344805, 2623]

Step 2: Tell the program which zebrafish Gene ID type you would like to convert to. I will choose the Symbol Gene ID type for this example.

zfish_desired_gene_id_type = 'Symbol'

Step 3: Launch the conversion function for human to zebrafish.

zebrafish_ids = mapping.convert_to_zebrafish(list_of_human_ids, zfish_desired_gene_id_type)

Step 4: Print results to shell or save the results to a file. I will just print the results here using the pretty_print_series function in the utils module.

utils.pretty_print_series(zebrafish_ids)
mdkb
cyfip1
psma5
lctla
rnd3b
ube2q1
gata1a
mdka
lctlb
gata1b
tmprss7

If you take a closer look, you will notice that the number of genes we started with does not always match the number we end up with: here, 8 human Gene IDs produced 11 zebrafish symbols. There is not a 1:1 mapping of genes between human and zebrafish. In fact, zebrafish often have duplicated genes (e.g. an a and a b gene where humans have just one). Therefore, it is often useful to keep the mapping so you know which genes have orthologs and which do not. To do this, it is better to return the converted IDs as a column alongside the original IDs in a DataFrame.
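To make the one-to-many behavior concrete, here is a small pandas sketch. The ortholog_table below is a toy lookup table built by hand from a few of the pairs reported in this tutorial; it is not the library's database, and the merge is only an illustration of why the output can be longer than the input.

```python
import pandas as pd

# toy lookup table: one row per (human gene, zebrafish ortholog) pair
ortholog_table = pd.DataFrame({
    'Human NCBI Gene ID': ['4192', '4192', '2623', '2623', '5686'],
    'Symbol': ['mdkb', 'mdka', 'gata1b', 'gata1a', 'psma5'],
})

# three human genes to look up
query = pd.DataFrame({'Human NCBI Gene ID': ['4192', '2623', '5686']})

# a left merge returns one row per matching pair, so genes with
# two zebrafish orthologs appear twice in the result
mapped = query.merge(ortholog_table, on='Human NCBI Gene ID', how='left')
print(len(query), 'human genes ->', len(mapped), 'zebrafish genes')
print(mapped)
```

Keeping the human column in the result is what lets you tell afterwards that mdkb and mdka both came from the same human gene.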

Simple Case: Get Orthologs and Keep Mapping

Purpose: given a small list of Gene IDs of type A in organism X, convert them to Gene IDs of type B in organism Y and keep the mapping.

Zebrafish to Human

Step 1: Define your list of Gene IDs. In this case, you would have a list of Gene IDs in any of the supported zebrafish formats. I am using the same list_of_zfish_ids as before.

list_of_zfish_ids = ['ZDB-GENE-081113-5', 'ZDB-GENE-031002-50', 
                     'ZDB-GENE-120104-7', 'ZDB-GENE-000607-16', 
                     'ZDB-GENE-060503-934', 'ZDB-GENE-050320-72', 
                     'ZDB-GENE-030131-3147', 'ZDB-GENE-040426-2551']

Step 2: Tell the program which gene type you currently have and that we want to keep the mapping.

zfish_gene_type = 'ZFIN ID'
keep_mapping = True

Step 3: Launch the conversion function to get your ids converted to human Gene IDs. Note that the only supported Gene ID type for humans is the Human NCBI Gene ID, aka Entrez Gene ID. The keep_mapping variable is added here as well.

human_ids = mapping.convert_to_human(list_of_zfish_ids, zfish_gene_type, keep_mapping)

Step 4: Print results to shell or save the results to a file. Both options are shown here. Notice the file name is specified.

human_ids
ZFIN ID Human NCBI Gene ID
0 ZDB-GENE-081113-5 4293
1 ZDB-GENE-031002-50 81553
2 ZDB-GENE-120104-7 26207
3 ZDB-GENE-000607-16 605
4 ZDB-GENE-060503-934 4861
5 ZDB-GENE-050320-72 55022
6 ZDB-GENE-030131-3147 NaN
7 ZDB-GENE-040426-2551 221143

file_path = 'data/out_data/human_ids_with_mapping.txt'
utils.save_data(human_ids, file_path)

Notice that in this case row 6 has a NaN value in the Human NCBI Gene ID column. This is because the zebrafish gene ZDB-GENE-030131-3147 has no ortholog in humans. If this is not important for your downstream analysis, you can drop that row from the dataset.

human_ids.dropna()
ZFIN ID Human NCBI Gene ID
0 ZDB-GENE-081113-5 4293
1 ZDB-GENE-031002-50 81553
2 ZDB-GENE-120104-7 26207
3 ZDB-GENE-000607-16 605
4 ZDB-GENE-060503-934 4861
5 ZDB-GENE-050320-72 55022
6 ZDB-GENE-040426-2551 221143

Alternatively, you could have run the convert_to_human function with the keep_missing_orthos option set to False (the default is True).

keep_mapping = True
keep_missing_orthos = False
mapping.convert_to_human(list_of_zfish_ids, zfish_gene_type, 
                         keep_mapping, keep_missing_orthos)
ZFIN ID Human NCBI Gene ID
0 ZDB-GENE-081113-5 4293
1 ZDB-GENE-031002-50 81553
2 ZDB-GENE-120104-7 26207
3 ZDB-GENE-000607-16 605
4 ZDB-GENE-060503-934 4861
5 ZDB-GENE-050320-72 55022
6 ZDB-GENE-040426-2551 221143
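Before dropping rows, it can be worth recording which zebrafish genes had no human ortholog. The sketch below uses plain pandas on a hand-built three-row DataFrame that mirrors the columns above; the missing value for ZDB-GENE-030131-3147 follows the description in this section.

```python
import pandas as pd

# hand-built stand-in for the DataFrame returned with keep_mapping=True
human_ids = pd.DataFrame({
    'ZFIN ID': ['ZDB-GENE-050320-72', 'ZDB-GENE-030131-3147',
                'ZDB-GENE-040426-2551'],
    'Human NCBI Gene ID': ['55022', None, '221143'],
})

# boolean mask marking the rows without a human ortholog
missing = human_ids['Human NCBI Gene ID'].isna()
genes_without_ortholog = human_ids.loc[missing, 'ZFIN ID'].tolist()
print(genes_without_ortholog)

# dropna keeps the original row labels; reset_index renumbers them
cleaned = human_ids.dropna(subset=['Human NCBI Gene ID']).reset_index(drop=True)
print(cleaned)
```

Restricting dropna to the ortholog column with subset avoids accidentally dropping rows that are only missing values in unrelated columns.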

Human to Zebrafish

The steps for human to zebrafish are nearly the same. The repeated steps are shown below.

list_of_human_ids = [55585, 23191, 4192, 5686, 
                     197021, 390, 344805, 2623]

zfish_desired_gene_id_type = 'Symbol'
keep_mapping = True
zebrafish_ids = mapping.convert_to_zebrafish(list_of_human_ids, 
                                             zfish_desired_gene_id_type, 
                                             keep_mapping)
zebrafish_ids
Human NCBI Gene ID Symbol
0 55585 ube2q1
1 23191 cyfip1
2 4192 mdkb
3 4192 mdka
4 5686 psma5
5 197021 lctla
6 197021 lctlb
7 390 rnd3b
8 344805 tmprss7
9 2623 gata1b
10 2623 gata1a

In this case, every human gene has an ortholog in zebrafish. In fact, some of the human genes here have more than one zebrafish ortholog. For example, the 2623 gene has the zebrafish orthologs gata1b and gata1a. All orthologs are kept in this method.
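With a longer list, spotting the duplicated genes by eye becomes impractical. A short pandas sketch like the following can count orthologs per human gene; the DataFrame is typed in by hand from the output above, so no library call is needed to run it.

```python
import pandas as pd

# the mapping shown above, entered by hand
zebrafish_ids = pd.DataFrame({
    'Human NCBI Gene ID': ['55585', '23191', '4192', '4192', '5686', '197021',
                           '197021', '390', '344805', '2623', '2623'],
    'Symbol': ['ube2q1', 'cyfip1', 'mdkb', 'mdka', 'psma5', 'lctla',
               'lctlb', 'rnd3b', 'tmprss7', 'gata1b', 'gata1a'],
})

# number of distinct zebrafish symbols per human gene
counts = zebrafish_ids.groupby('Human NCBI Gene ID')['Symbol'].nunique()

# human genes with more than one zebrafish ortholog
multi = counts[counts > 1]
print(multi)
```

For this example the result contains the three human genes 197021, 2623, and 4192, each with two zebrafish orthologs, which accounts for the growth from 8 input genes to 11 output rows.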

Get Orthologs for Genes from File

Purpose: given a list of Gene IDs from a file that are of type A in organism X, convert them to Gene IDs of type B in organism Y and save them to a file.

Zebrafish to Human

Step 1: Read in your list of Gene IDs. The data in the test_data sub-directory is in a tab-separated (tsv) format with a .txt extension. The pandas package can read this without an issue; we just need to specify the separator, which is \t for this type of data. A csv file can be read the same way with a comma separator, and an Excel file can be read with pd.read_excel instead. I am using a file named dre04910.txt, which is a list of zebrafish genes in the KEGG pathway 04910.

data_file_path = 'data/test_data/dre04910.txt'
data = pd.read_csv(data_file_path, sep='\t')
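The effect of the separator is easiest to see on a tiny example. In this sketch, in-memory text stands in for files on disk, and the PValue numbers are made up for illustration. Passing dtype=str is my own suggestion rather than something the library requires: it keeps Gene IDs as text so they are not silently converted to numbers.

```python
import io

import pandas as pd

# in-memory stand-ins for a tab-separated and a comma-separated file
tsv_text = 'NCBI Gene ID\n140634\n64272\n'
csv_text = 'NCBI Gene ID,PValue\n140634,0.01\n64272,0.20\n'

# tab-separated: sep='\t'
tsv_data = pd.read_csv(io.StringIO(tsv_text), sep='\t', dtype=str)

# comma-separated: sep=',' and only the ID column is forced to text
csv_data = pd.read_csv(io.StringIO(csv_text), sep=',',
                       dtype={'NCBI Gene ID': str})

print(tsv_data.shape)
print(csv_data.dtypes)
```

With real files, replace the io.StringIO(...) argument with the path to the file.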

To get a quick look at the data, we can print the first three table entries and some data stats:

# print first three lines
data.head(3)
rows, cols = data.shape
print(f'There are {rows} rows and {cols} columns')
There are 180 rows and 1 columns

As you can see, this list of Gene IDs has 180 entries.

Step 2: Tell the program which Gene ID type you currently have.

zfish_gene_type = 'NCBI Gene ID'

Step 3: Launch the conversion function to get your ids converted to human Gene IDs. Note that the only supported Gene ID type for humans is the Human NCBI Gene ID, aka Entrez Gene ID.

human_ids = mapping.convert_to_human(data, zfish_gene_type)

Step 4: Print results to shell or save the results to a file. Both options are shown here. Notice the file name is specified.

human_ids.head(3)
m = len(human_ids)
print(f'There are {m} rows in this dataset.')
There are 117 rows in this dataset.

By printing some of the stats as shown above, we see that we are left with 117 human genes. That is, the 180 zebrafish genes we started with yield only 117 human orthologs. This is not unexpected since, as noted earlier, orthology between the two species is incomplete: only approximately 70% of human genes have zebrafish orthologs.

file_path = 'data/out_data/dre04910_human_genes.txt'
utils.save_data(human_ids, file_path)
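As a rough summary of the zebrafish to human conversion above, the two row counts can be turned into a ratio. Note the assumption in this sketch: it compares output rows to input genes, which is only an approximation of how many input genes were mapped, since one gene can map to several orthologs.

```python
# row counts reported in the zebrafish to human example above
n_zebrafish_genes = 180  # rows read from dre04910.txt
n_human_rows = 117       # rows returned by the conversion

# ratio of output rows to input genes
fraction = n_human_rows / n_zebrafish_genes
print(f'{fraction:.0%} as many human rows as zebrafish genes')
```

To count exactly how many input genes were mapped, run the conversion with keep_mapping set to True and count the distinct zebrafish IDs that have a non-missing human ID.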

Human to Zebrafish

Below is a quick run-through of the human to zebrafish case. The dataset used is a file named hsa04911.txt, which contains all human genes in the KEGG pathway 04911.

# read in data and print some stats
data_file_path = 'data/test_data/hsa04911.txt'
data = pd.read_csv(data_file_path, sep='\t')
m,n = data.shape
print(f'There are {m} rows in this dataset.')
There are 86 rows in this dataset.
# launch conversion
desired_zfish_gene_type = 'NCBI Gene ID'
zfish_ids = mapping.convert_to_zebrafish(data, desired_zfish_gene_type)
# print some results stats
zfish_ids.head(3)
0    64272
1    64269
2    64267
Name: NCBI Gene ID, dtype: object
m = len(zfish_ids)
print(f"There are {m} rows in this dataset.")
There are 112 rows in this dataset.

Get Orthologs for a Column of a Larger Dataset

Purpose: you have a dataset with columns x, y, z. Column x has Gene IDs from organism \(\alpha\) in type A. You would like to get orthologs for these Gene IDs in organism \(\beta\) in type B while maintaining the information in columns y, z.

In this scenario, you might have some data that looks like:

| NCBI Gene ID | PValue | logFC |
| --- | --- | --- |
| 100002263 | 2.3 | 0.03 |

and you might want to get something like:

| NCBI Gene ID | Human NCBI Gene ID | PValue | logFC |
| --- | --- | --- | --- |
| 140615 | 9415 | 2.3 | 0.03 |

The information in the logFC and PValue columns must stay in order with the NCBI Gene ID column. In this scenario you will often have an entire gene set and will be dealing with a lot more data. Let's look at a test dataset for this case.

Step 1: Read in the data as done previously.

data_file_path = 'data/test_data/example_diff_express_data.txt'
data = pd.read_csv(data_file_path, sep='\t')

To get a quick look at the data, we can print the first three table entries and some data stats:

# print first three lines
data.head(3)
NCBI Gene ID PValue logFC
0 100000006 0.792615 0.115009
1 100000044 0.015286 0.803879
2 100000085 0.264762 0.267360
rows, cols = data.shape
print(f'There are {rows} rows and {cols} columns')
There are 5464 rows and 3 columns

Step 2: Get Human Orthologs. To execute the mapping and add the column, we will use the add_mapped_ortholog_column function from the mapping module. We will give this function the data read in above, and we need to specify the id_from and id_to as before.

id_from = 'NCBI Gene ID'
id_to = 'Human NCBI Gene ID'
new_data = mapping.add_mapped_ortholog_column(data, id_from, id_to)
new_data.head(3)
NCBI Gene ID Human NCBI Gene ID PValue logFC
0 100000006 84561 0.792615 0.115009
1 100000044 NaN 0.015286 0.803879
2 100000085 968 0.264762 0.267360
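Conceptually, adding an ortholog column while keeping the other columns aligned is a left join. The sketch below reproduces the three rows above with plain pandas. It is only an illustration of the idea, not how add_mapped_ortholog_column is implemented, and the orthologs lookup table is a hypothetical stand-in built from the two pairs shown in the output.

```python
import pandas as pd

# the first three rows of the example dataset shown above
data = pd.DataFrame({
    'NCBI Gene ID': ['100000006', '100000044', '100000085'],
    'PValue': [0.792615, 0.015286, 0.264762],
    'logFC': [0.115009, 0.803879, 0.267360],
})

# hypothetical lookup table holding the two known pairs
orthologs = pd.DataFrame({
    'NCBI Gene ID': ['100000006', '100000085'],
    'Human NCBI Gene ID': ['84561', '968'],
})

# a left merge keeps every input row; genes without an ortholog get NaN
new_data = data.merge(orthologs, on='NCBI Gene ID', how='left')

# place the new column directly after the ID column
new_data = new_data[['NCBI Gene ID', 'Human NCBI Gene ID', 'PValue', 'logFC']]
print(new_data)
```

Because the join is keyed on the Gene ID, the PValue and logFC values stay attached to the correct gene even if a gene maps to several orthologs and its row is repeated.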

Step 3: Save to file. You can save this dataframe to a file as done previously.

file_name = 'data/out_data/ortholog_dataframe.txt'
utils.save_data(new_data, file_name)

Conclusion

This concludes the mapping tutorial! In summary, the key functions in this library for mapping are:

| function | purpose |
| --- | --- |
| convert_to_human | convert a list of Zebrafish Gene IDs to Human Gene IDs |
| convert_to_zebrafish | convert a list of Human Gene IDs to Zebrafish Gene IDs |
| add_mapped_ortholog_column | add an ortholog column of Gene IDs to an existing DataFrame |

In the human to zebrafish file example above, we went from 86 Human NCBI Gene IDs to 112 zebrafish NCBI Gene IDs, because some human genes have more than one zebrafish ortholog.