Orthology Mapping

Author: Ashley Schwartz

Date: Originally Developed September 2023, Last Revised March 2024

Purpose and Background

This tutorial shows how to map genes between zebrafish and human using gene orthology. Gene orthology is useful when you want to determine which genes in zebrafish map to genes in the human genome and vice versa. As this is a zebrafish library, there are four different Gene ID options available for zebrafish. Human Gene IDs are only reported as NCBI Entrez Gene IDs. The Gene ID options are:

Gene ID Name	Description	Example	Notes
ZFIN ID	ZFIN gene id: always starts with ‘ZDB’ for zebrafish database	ZDB-GENE-011219-1	used as the “master” gene id (link)
NCBI Gene ID	integer gene id managed by NCBI: also known as Entrez Gene ID	140634	link
Symbol	descriptive symbol/name: RefSeq symbol used in RefSeq genome build	cyp1a	nomenclature defined by ZFIN
Ensembl Gene ID	Ensembl database gene id: always starts with ‘ENSDAR’	ENSDARG00000098315	link
Human NCBI Gene ID	integer gene id managed by NCBI: also known as Entrez Gene ID	1543	link
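Because each Gene ID type in the table has a recognizable shape, a quick check before mapping can catch mixed or mistyped lists. The sketch below is a hypothetical helper (guess_gene_id_type is not part of danrerlib) that relies only on the prefixes and formats listed in the table. It cannot tell a zebrafish NCBI Gene ID from a human one, since both are plain integers.

```python
def guess_gene_id_type(gene_id):
    """Guess the Gene ID type from its shape (hypothetical helper, not part of danrerlib)."""
    text = str(gene_id).strip()
    if text.startswith('ZDB'):
        return 'ZFIN ID'
    if text.startswith('ENSDAR'):
        return 'Ensembl Gene ID'
    if text.isdigit():
        # zebrafish and human NCBI Gene IDs are both integers,
        # so the organism cannot be inferred from the ID alone
        return 'NCBI Gene ID'
    # anything else is assumed to be a gene symbol
    return 'Symbol'


# the example values are taken from the table above
for example_id in ['ZDB-GENE-011219-1', 140634, 'cyp1a', 'ENSDARG00000098315']:
    print(example_id, '->', guess_gene_id_type(example_id))
```

The prefix checks are case sensitive on purpose, in line with the case sensitivity of the Gene IDs themselves.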

Requirements

In this tutorial we will be utilizing two key elements:

- a sample Gene ID list (format: .csv, .tsv, .txt) for reading in the Gene IDs; typing or copy/pasting Gene IDs is also supported. The Gene ID list can contain either zebrafish Gene IDs in any of the supported formats or human Gene IDs in NCBI Entrez format.
  - the human gene list we will be using is located in the data/test_data subdirectory of the current working directory, with relative path data/test_data/hsa04911.txt
    - if you would like to download this dataset you can find it on our GitHub here
  - the zebrafish gene list we will be using is located in the same subdirectory, with relative path data/test_data/dre04910.txt
    - if you would like to download this dataset you can find it on our GitHub here
- the required Python package, danrerlib (imported below)
  - See installation instructions if not already installed.

In general, while you do not need a large foundation in Python to execute the code listed in this tutorial, a general understanding of absolute and relative paths is useful.
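As a short illustration of relative versus absolute paths, the sketch below uses Python's standard pathlib module with the test file path from this tutorial. The absolute path it prints depends on the directory you run Python from, so your output will differ, and the file will only be found if you run it from the directory that contains data/test_data.

```python
from pathlib import Path

# relative path used in this tutorial (relative to the current working directory)
relative_path = Path('data/test_data/hsa04911.txt')

# resolve() turns it into an absolute path based on where Python is running
absolute_path = relative_path.resolve()

print('working directory:', Path.cwd())
print('is the relative path absolute?', relative_path.is_absolute())
print('absolute path:', absolute_path)
print('file found?', absolute_path.exists())
```

If the last line prints False, the working directory is probably not the one that holds the data folder.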

note: the Gene IDs are spelling and case sensitive

# IMPORT PYTHON PACKAGE
# ---------------------
from danrerlib import mapping, utils
import pandas as pd

Execute Orthology Mappings

Gathering human and zebrafish orthologs is especially useful when a mechanism has been characterized in humans but has yet to be studied in zebrafish. Approximately 70% of human genes have at least one zebrafish ortholog. We can use this to our advantage in various research scenarios. In some cases, you might just be interested in gathering the orthologs for a few genes. In other scenarios, you might want to get the orthologs for a large list of genes. We will go through a variety of those scenarios here.

Simple Case: Get orthologs

Purpose: given a small list of Gene IDs of type A in organism X, convert to Gene IDs of type B in organism Y.

The simple case is useful for a quick ortholog investigation if you have a small list you can easily paste.

Zebrafish to Human

Step 1: Define your list of Gene IDs. In this case, you would have a list of Gene IDs in any of the supported zebrafish formats. I named the python list list_of_zfish_ids and included all the Gene IDs I want to find human orthologs for. I chose to use ZFIN IDs here, but any of the supported zebrafish Gene ID types will suffice.

list_of_zfish_ids = ['ZDB-GENE-081113-5', 'ZDB-GENE-031002-50', 
                     'ZDB-GENE-120104-7', 'ZDB-GENE-000607-16', 
                     'ZDB-GENE-060503-934', 'ZDB-GENE-050320-72', 
                     'ZDB-GENE-030131-3147', 'ZDB-GENE-040426-2551']

Step 2: Tell the program which gene type you currently have.

zfish_gene_type = 'ZFIN ID'

Step 3: Launch the conversion function to get your ids converted to human Gene IDs. Note that the only supported Gene ID type for humans is the Human NCBI Gene ID, aka Entrez Gene ID.

human_ids = mapping.convert_to_human(list_of_zfish_ids, zfish_gene_type)

Step 4: To view the human ids that have been converted from zebrafish ids, you can either print them to the python shell or save them to a file. If you would like to print them, you can use the pretty_print_series function in the utils module.

utils.pretty_print_series(human_ids)

If you would rather save the data to a file, you can save human_ids to a file named human_ids.txt in the data/out_data directory. For some default options, you can use the save_data function in the utils module. Feel free to change the output directory to any folder of your choice.

file_name = 'data/out_data/human_ids.txt'
utils.save_data(human_ids, file_name)
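If you change the output directory, it can help to make sure the folder exists before saving. Whether save_data creates missing folders is not covered here, so the sketch below treats that as unknown and creates the folder first with the standard pathlib module. A temporary folder and a plain text write stand in for your project directory and for the save step, and the three ids are made-up placeholders taken from later output in this tutorial.

```python
import tempfile
from pathlib import Path

# a temporary folder stands in for your project directory in this sketch
project_dir = Path(tempfile.mkdtemp())
file_name = project_dir / 'data' / 'out_data' / 'human_ids.txt'

# create data/out_data (and any missing parents); no error if it already exists
file_name.parent.mkdir(parents=True, exist_ok=True)

# stand-in for the save step: write one Gene ID per line
example_ids = ['4293', '81553', '26207']
file_name.write_text('\n'.join(example_ids) + '\n')

print(file_name.read_text())
```

In your own project, the equivalent would be a single Path('data/out_data').mkdir(parents=True, exist_ok=True) call placed before the save.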

Human to Zebrafish

To convert from human Gene IDs to zebrafish, you follow the same steps, except we will use the convert_to_zebrafish function in the mapping module.

Step 1: Define your list of Gene IDs. As a reminder, the only supported human Gene ID type is currently NCBI Gene ID. These are also known as Entrez Gene IDs and are integers. My list of Gene IDs is named list_of_human_ids.

list_of_human_ids = [55585, 23191, 4192, 5686, 
                     197021, 390, 344805, 2623]

Step 2: Tell the program which zebrafish Gene ID type you would like to convert to. I will choose the Symbol Gene ID type for this example.

zfish_desired_gene_id_type = 'Symbol'

Step 3: Launch the conversion function for human to zebrafish.

zebrafish_ids = mapping.convert_to_zebrafish(list_of_human_ids, zfish_desired_gene_id_type)

Step 4: Print results to shell or save the results to a file. I will just print the results here using the pretty_print_series function in the utils module.

utils.pretty_print_series(zebrafish_ids)

mdkb
cyfip1
psma5
lctla
rnd3b
ube2q1
gata1a
mdka
lctlb
gata1b
tmprss7

If you take a closer look, you will notice that the number of genes we started with does not always match the number we end up with: here, 8 human Gene IDs returned 11 zebrafish symbols. There is not a 1:1 mapping of genes between human and zebrafish. In fact, zebrafish often have duplicate genes (e.g. an a and b gene where a human just has one). Therefore, it is often useful to keep the mapping so you know which genes have orthologs and which don’t. To do this, it is better to keep the original Gene IDs alongside the orthologs in a DataFrame.

Simple Case: Get Orthologs and Keep Mapping

Purpose: given a small list of Gene IDs of type A in organism X, convert to Gene IDs of type B in organism Y and keep the mapping.

Zebrafish to Human

Step 1: Define your list of Gene IDs. In this case, you would have a list of Gene IDs in any of the supported zebrafish formats. I am using the same list_of_zfish_ids as before.

list_of_zfish_ids = ['ZDB-GENE-081113-5', 'ZDB-GENE-031002-50', 
                     'ZDB-GENE-120104-7', 'ZDB-GENE-000607-16', 
                     'ZDB-GENE-060503-934', 'ZDB-GENE-050320-72', 
                     'ZDB-GENE-030131-3147', 'ZDB-GENE-040426-2551']

Step 2: Tell the program which gene type you currently have and that we want to keep the mapping.

zfish_gene_type = 'ZFIN ID'
keep_mapping = True

Step 3: Launch the conversion function to get your ids converted to human Gene IDs. Note that the only supported Gene ID type for humans is the Human NCBI Gene ID, aka Entrez Gene ID. The keep_mapping variable is added here as well.

human_ids = mapping.convert_to_human(list_of_zfish_ids, zfish_gene_type, keep_mapping)

Step 4: Print results to shell or save the results to a file. Both options are shown here. Notice the file name is specified.

human_ids

	ZFIN ID	Human NCBI Gene ID
0	ZDB-GENE-081113-5	4293
1	ZDB-GENE-031002-50	81553
2	ZDB-GENE-120104-7	26207
3	ZDB-GENE-000607-16	605
4	ZDB-GENE-060503-934	4861
5	ZDB-GENE-050320-72	55022
6	ZDB-GENE-030131-3147	NaN
7	ZDB-GENE-040426-2551	221143

file_path = 'data/out_data/human_ids_with_mapping.txt'
utils.save_data(human_ids, file_path)

Notice that row 6 has a NaN value in the Human NCBI Gene ID column. This is because the zebrafish gene ZDB-GENE-030131-3147 has no human ortholog. If this is not important for your downstream analysis, you can drop that row from the dataset.

human_ids.dropna()

	ZFIN ID	Human NCBI Gene ID
0	ZDB-GENE-081113-5	4293
1	ZDB-GENE-031002-50	81553
2	ZDB-GENE-120104-7	26207
3	ZDB-GENE-000607-16	605
4	ZDB-GENE-060503-934	4861
5	ZDB-GENE-050320-72	55022
6	ZDB-GENE-040426-2551	221143
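As a small standalone sketch of what dropna does, the example below builds a three-row mapping by hand, using Gene IDs from this tutorial and including the gene without a human ortholog. It is built directly with pandas rather than returned by danrerlib, and the ids are stored as strings for simplicity. Note that dropna returns a new DataFrame and leaves the original untouched, so the result has to be assigned if you want to keep it.

```python
import pandas as pd

# hand-built stand-in for a mapping table returned with keep_mapping = True
example = pd.DataFrame({
    'ZFIN ID': ['ZDB-GENE-050320-72', 'ZDB-GENE-030131-3147', 'ZDB-GENE-040426-2551'],
    'Human NCBI Gene ID': ['55022', float('nan'), '221143'],
})

# dropna returns a new DataFrame; assign it to keep the result
with_orthologs = example.dropna(subset=['Human NCBI Gene ID'])

# the original row labels are kept, so reset them if you want 0, 1, 2, ...
with_orthologs = with_orthologs.reset_index(drop=True)

print(with_orthologs)
print(f'{len(example) - len(with_orthologs)} gene(s) had no human ortholog')
```

Passing subset limits the check to the ortholog column, which matters once the table carries other columns that may legitimately contain missing values.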

Alternatively, you could have run the convert_to_human function with the keep_missing_orthos option set to False (the default is True).

keep_mapping = True
keep_missing_orthos = False
mapping.convert_to_human(list_of_zfish_ids, zfish_gene_type, 
                         keep_mapping, keep_missing_orthos)

	ZFIN ID	Human NCBI Gene ID
0	ZDB-GENE-081113-5	4293
1	ZDB-GENE-031002-50	81553
2	ZDB-GENE-120104-7	26207
3	ZDB-GENE-000607-16	605
4	ZDB-GENE-060503-934	4861
5	ZDB-GENE-050320-72	55022
6	ZDB-GENE-040426-2551	221143

Human to Zebrafish

The steps for human to zebrafish are nearly the same. The repeated steps are shown below.

list_of_human_ids = [55585, 23191, 4192, 5686, 
                     197021, 390, 344805, 2623]

zfish_desired_gene_id_type = 'Symbol'
keep_mapping = True
zebrafish_ids = mapping.convert_to_zebrafish(list_of_human_ids, 
                                             zfish_desired_gene_id_type, 
                                             keep_mapping)
zebrafish_ids

	Human NCBI Gene ID	Symbol
0	55585	ube2q1
1	23191	cyfip1
2	4192	mdkb
3	4192	mdka
4	5686	psma5
5	197021	lctla
6	197021	lctlb
7	390	rnd3b
8	344805	tmprss7
9	2623	gata1b
10	2623	gata1a

In this case, every human gene has an ortholog in zebrafish. In fact, some of the human genes here have more than one zebrafish ortholog. For example, the 2623 gene has the zebrafish orthologs gata1b and gata1a. All orthologs are kept in this method.
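To list which human genes map to more than one zebrafish gene, one option is to group the mapping by the human Gene ID. The sketch below copies the output table above into a DataFrame by hand so that it runs on its own; with the real result you would apply the same two lines to zebrafish_ids directly, assuming it has the column names shown above.

```python
import pandas as pd

# hand-copied from the output table above
mapping_table = pd.DataFrame({
    'Human NCBI Gene ID': [55585, 23191, 4192, 4192, 5686, 197021,
                           197021, 390, 344805, 2623, 2623],
    'Symbol': ['ube2q1', 'cyfip1', 'mdkb', 'mdka', 'psma5', 'lctla',
               'lctlb', 'rnd3b', 'tmprss7', 'gata1b', 'gata1a'],
})

# collect all zebrafish symbols for each human gene
orthologs_per_gene = mapping_table.groupby('Human NCBI Gene ID')['Symbol'].agg(list)

# keep only the human genes with more than one zebrafish ortholog
one_to_many = orthologs_per_gene[orthologs_per_gene.apply(len) > 1]
print(one_to_many)
```

Here the genes 4192, 197021 and 2623 each have two zebrafish orthologs, which accounts for the three extra rows in the output.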

Get Orthologs for Genes from File

Purpose: given a list of Gene IDs from a file that are of type A in organism X, convert to Gene IDs of type B in organism Y and save to file.

Zebrafish to Human

Step 1: Read in your list of Gene IDs. The data in the test_data sub-directory is in a tab-separated (tsv) format with a .txt extension. The pandas package can read this without an issue; we just need to specify the separator, which is \t for tab-separated data. A csv file can be read the same way with a comma separator, and an Excel file can be read with pd.read_excel. I am using a file named dre04910.txt, which is a list of zebrafish genes in the KEGG pathway 04910.

data_file_path = 'data/test_data/dre04910.txt'
data = pd.read_csv(data_file_path, sep='\t')

To get a quick look at the data, we can print the first three table entries and some data stats:

# print first three lines
data.head(3)
rows, cols = data.shape
print(f'There are {rows} rows and {cols} columns')

There are 180 rows and 1 columns

As you can see, this list of Gene IDs has 180 entries.
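The effect of the separator from Step 1 is easiest to see on a file with more than one column. The sketch below reads a small in-memory table (three rows borrowed from the differential expression example later in this tutorial) twice: once with sep='\t' and once with the pandas default, a comma. It is only an illustration and does not read the actual test files.

```python
import io
import pandas as pd

# tab-separated text standing in for a file on disk
tab_separated_text = (
    'NCBI Gene ID\tPValue\tlogFC\n'
    '100000006\t0.792615\t0.115009\n'
    '100000044\t0.015286\t0.803879\n'
    '100000085\t0.264762\t0.267360\n'
)

# correct: tell pandas the columns are separated by tabs
with_tabs = pd.read_csv(io.StringIO(tab_separated_text), sep='\t')

# default separator is a comma, so each line ends up in a single column
with_default = pd.read_csv(io.StringIO(tab_separated_text))

print('sep="\\t":', with_tabs.shape)
print('default :', with_default.shape)
```

Checking data.shape right after reading, as done above, is a quick way to catch a wrong separator.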

Step 2: Tell the program which Gene ID type you currently have.

zfish_gene_type = 'NCBI Gene ID'

Step 3: Launch the conversion function to get your ids converted to human Gene IDs. Note that the only supported Gene ID type for humans is the Human NCBI Gene ID, aka Entrez Gene ID.

human_ids = mapping.convert_to_human(data, zfish_gene_type)

Step 4: Print results to shell or save the results to a file. Both options are shown here. Notice the file name is specified.

human_ids.head(3)
m = len(human_ids)
print(f'There are {m} rows in this dataset.')

There are 117 rows in this dataset.

By printing some of the stats as shown above, we see that we are left with 117 human genes. This means that out of the 180 zebrafish genes we had, we only have 117 human orthologs. This is expected as, if you recall, only approximately 70% of human genes have zebrafish orthologs.

file_path = 'data/out_data/dre04910_human_genes.txt'
utils.save_data(human_ids, file_path)
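As a quick arithmetic check of the counts above, the ratio of returned human ids to input zebrafish genes can be computed directly. This is only a rough figure: rows in the output are orthologs and not necessarily distinct zebrafish genes, and the 70% estimate is stated for human genes rather than for zebrafish genes.

```python
zebrafish_genes_in = 180   # rows read from dre04910.txt
human_ids_out = 117        # rows returned by convert_to_human

ratio = human_ids_out / zebrafish_genes_in
print(f'{human_ids_out} of {zebrafish_genes_in} = {ratio:.0%}')
```

With your own data, len(human_ids) / len(data) gives the same ratio without typing the counts by hand.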

Human to Zebrafish

Below is a quick run-through of the human to zebrafish case. The dataset used is a file named hsa04911.txt, which contains all human genes in the KEGG pathway 04911.

# read in data and print some stats
data_file_path = 'data/test_data/hsa04911.txt'
data = pd.read_csv(data_file_path, sep='\t')
m,n = data.shape
print(f'There are {m} rows in this dataset.')

There are 86 rows in this dataset.

# launch conversion
desired_zfish_gene_type = 'NCBI Gene ID'
zfish_ids = mapping.convert_to_zebrafish(data, desired_zfish_gene_type)

# print some results stats
zfish_ids.head(3)

  64272
  64269
  64267
Name: NCBI Gene ID, dtype: object

m = len(zfish_ids)
print(f"There are {m} rows in this dataset.")

There are 112 rows in this dataset.

Get Orthologs for a Column of a Larger Dataset

Purpose: you have a dataset with columns x, y, z. Column x has Gene IDs from organism \(\alpha\) in type A. You would like to get orthologs for these Gene IDs in organism \(\beta\) in type B while maintaining the information in columns y and z.

In this scenario, you might have some data that looks like:

NCBI Gene ID	PValue	logFC
100002263	2.3	0.03
…	…	…

and you might want to get something like:

NCBI Gene ID	Human NCBI Gene ID	PValue	logFC
140615	9415	2.3	0.03
…	…	…	…

The information in the logFC and PValue columns must stay aligned with the Gene ID column. In this scenario you will often have an entire gene set and will be dealing with a lot more data. Let’s look at a test dataset for this case.

Step 1: Read in the data as done previously.

data_file_path = 'data/test_data/example_diff_express_data.txt'
data = pd.read_csv(data_file_path, sep='\t')

To get a quick look at the data, we can print the first three table entries and some data stats:

# print first three lines
data.head(3)

	NCBI Gene ID	PValue	logFC
0	100000006	0.792615	0.115009
1	100000044	0.015286	0.803879
2	100000085	0.264762	0.267360

rows, cols = data.shape
print(f'There are {rows} rows and {cols} columns')

There are 5464 rows and 3 columns

Step 2: Get Human Orthologs. To execute the mapping and add the column, we will use the add_mapped_ortholog_column function from the mapping module. We will give this function the data read in above, and we need to specify id_from and id_to as before.

id_from = 'NCBI Gene ID'
id_to = 'Human NCBI Gene ID'
new_data = mapping.add_mapped_ortholog_column(data, id_from, id_to)
new_data.head(3)

	NCBI Gene ID	Human NCBI Gene ID	PValue	logFC
0	100000006	84561	0.792615	0.115009
1	100000044	NaN	0.015286	0.803879
2	100000085	968	0.264762	0.267360
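To see why the other columns stay aligned, it can help to think of this step as a left join between your data and an ortholog lookup table. The sketch below reproduces the three rows above with plain pandas. It illustrates the idea and is not necessarily how add_mapped_ortholog_column is implemented; the lookup table is hand-built from the output shown above, and the ids are stored as strings for simplicity.

```python
import pandas as pd

# first three rows of the example dataset shown above
example_data = pd.DataFrame({
    'NCBI Gene ID': ['100000006', '100000044', '100000085'],
    'PValue': [0.792615, 0.015286, 0.264762],
    'logFC': [0.115009, 0.803879, 0.267360],
})

# hand-built lookup table; 100000044 is absent because it has no human ortholog
ortholog_lookup = pd.DataFrame({
    'NCBI Gene ID': ['100000006', '100000085'],
    'Human NCBI Gene ID': ['84561', '968'],
})

# a left join keeps every row of example_data, in order, and fills gaps with NaN
merged = example_data.merge(ortholog_lookup, on='NCBI Gene ID', how='left')

# move the new column next to the original Gene ID column
merged = merged[['NCBI Gene ID', 'Human NCBI Gene ID', 'PValue', 'logFC']]
print(merged)
```

A left join can also add rows when one gene has several orthologs in the lookup table, so it is worth comparing the row count before and after.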

Step 3: Save to file. You can save this dataframe to a file as done previously.

file_name = 'data/out_data/ortholog_dataframe.txt'
utils.save_data(new_data, file_name)

Conclusion

This concludes the mapping tutorial! In summary, the key functions in this library for mapping are:

function	purpose
convert_to_human	convert a list of Zebrafish Gene IDs to Human Gene IDs
convert_to_zebrafish	convert a list of Human Gene IDs to Zebrafish Gene IDs
add_mapped_ortholog_column	add an ortholog column of Gene IDs to an existing DataFrame

In the human to zebrafish file example, we saw that we went from 86 Human NCBI Gene IDs to 112 zebrafish NCBI Gene IDs.