Orthology Mapping
Author: Ashley Schwartz
Date: Originally Developed September 2023, Last Revised March 2024
Purpose and Background
This tutorial shows how to map genes between zebrafish and human by retrieving their orthologs. Gene orthology is useful when you want to determine which zebrafish genes correspond to genes in the human genome and vice versa. As this is a zebrafish library, there are four different zebrafish Gene ID options available. Human Gene IDs are only reported as NCBI Entrez Gene IDs. The Gene ID options are:
Gene ID Name | Description | Example | Notes |
---|---|---|---|
ZFIN ID | ZFIN gene id: always starts with ‘ZDB’ for zebrafish database | ZDB-GENE-011219-1 | used as the “master” gene id (link) |
NCBI Gene ID | integer gene id managed by NCBI: also known as Entrez Gene ID | 140634 | |
Symbol | descriptive symbol/name: RefSeq symbol used in RefSeq genome build | cyp1a | nomenclature defined by ZFIN |
Ensembl Gene ID | Ensembl database gene id: always starts with ‘ENSDAR’ | ENSDARG00000098315 | |
Human NCBI Gene ID | integer gene id managed by NCBI: also known as Entrez Gene ID | 1543 | |
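Because each Gene ID type in the table has a recognizable format, the type of an unlabeled ID can often be guessed from its shape. The sketch below is only an illustration of that idea: `guess_gene_id_type` is a hypothetical helper, not a function of the library, and it cannot tell a zebrafish NCBI Gene ID from a human one because both are plain integers.

```python
import re


def guess_gene_id_type(gene_id):
    """Guess a Gene ID type from its format (hypothetical helper, not part of the library)."""
    text = str(gene_id).strip()
    if text.startswith('ZDB'):
        return 'ZFIN ID'            # ZFIN IDs always start with 'ZDB'
    if text.startswith('ENSDAR'):
        return 'Ensembl Gene ID'    # zebrafish Ensembl IDs start with 'ENSDAR'
    if re.fullmatch(r'\d+', text):
        return 'NCBI Gene ID'       # integer only; zebrafish vs human is ambiguous
    return 'Symbol'                 # anything else is treated as a gene symbol


# Examples taken from the table above
for example in ['ZDB-GENE-011219-1', 140634, 'cyp1a', 'ENSDARG00000098315']:
    print(example, '->', guess_gene_id_type(example))
```

A guess like this is only a convenience for sanity checks; the conversion functions in this tutorial still expect you to state the Gene ID type explicitly.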
Requirements
In this tutorial we will be utilizing two key elements:
a sample Gene ID list (format: .csv, .tsv, .txt) for reading in the Gene IDs; typing or copy/pasting Gene IDs is also supported. The Gene ID list can contain either zebrafish Gene IDs in any of the supported formats or human Gene IDs in NCBI Entrez format.
the human gene list we will be using is located in the data/test_data subdirectory of the current working directory, with relative path data/test_data/hsa04911.txt
if you would like to download this dataset you can find it on our GitHub here
the zebrafish gene list we will be using is located in the same subdirectory, with relative path data/test_data/dre04910.txt
if you would like to download this dataset you can find it on our GitHub here
the required Python package, danrerlib
See installation instructions if not already installed.
In general, while you do not need a large foundation in Python to execute the code listed in this tutorial, a general understanding of absolute and relative paths is useful.
note: the Gene IDs are spelling and case sensitive
# IMPORT PYTHON PACKAGE
# ---------------------
from danrerlib import mapping, utils
import pandas as pd
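Since the file paths in this tutorial are relative, they are interpreted against the directory Python is running from. The short sketch below, using only the standard library, shows how the relative path mentioned above turns into an absolute one; it makes no assumption about whether the file is actually present on your machine.

```python
from pathlib import Path

# Relative path used in this tutorial (relative to the current working directory)
relative_path = Path('data/test_data/hsa04911.txt')

# The same location written as an absolute path
absolute_path = Path.cwd() / relative_path

print('relative:', relative_path, '| is absolute?', relative_path.is_absolute())
print('absolute:', absolute_path, '| is absolute?', absolute_path.is_absolute())

# If this prints False, Python was started from a different directory than expected
print('file found?', absolute_path.exists())
```

If the last line reports that the file is not found, either change the working directory or pass the absolute path when reading the file.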
Execute Orthology Mappings
Gathering human and zebrafish orthologs is especially useful when a mechanism has been characterized in humans but has yet to be studied in zebrafish. Approximately 70% of human genes have at least one zebrafish ortholog, meaning roughly 70% of human genes have a counterpart in zebrafish. We can use this to our advantage in various research scenarios. In some cases, you might just be interested in gathering the orthologs for a few genes. In other scenarios, you might want to get the orthologs for a large list of genes. We will go through a variety of those scenarios here.
Simple Case: Get orthologs
Purpose: given a small list of Gene IDs of type A in organism X, convert them to Gene IDs of type B in organism Y.
The simple case is useful for a quick ortholog investigation if you have a small list you can easily paste.
Zebrafish to Human
Step 1: Define your list of Gene IDs. In this case, you would have a list of Gene IDs in any of the supported zebrafish formats. I named the Python list list_of_zfish_ids and included all the Gene IDs I want to find human orthologs for. I chose to use ZFIN IDs here, but any of the supported zebrafish Gene ID types will suffice.
list_of_zfish_ids = ['ZDB-GENE-081113-5', 'ZDB-GENE-031002-50',
'ZDB-GENE-120104-7', 'ZDB-GENE-000607-16',
'ZDB-GENE-060503-934', 'ZDB-GENE-050320-72',
'ZDB-GENE-030131-3147', 'ZDB-GENE-040426-2551']
Step 2: Tell the program which gene type you currently have.
zfish_gene_type = 'ZFIN ID'
Step 3: Launch the conversion function to get your ids converted to human Gene IDs. Note that the only supported Gene ID type for humans is the Human NCBI Gene ID, aka Entrez Gene ID.
human_ids = mapping.convert_to_human(list_of_zfish_ids, zfish_gene_type)
Step 4: To view the human ids that have been converted from zebrafish ids, you can either print them to the Python shell or save them to a file. If you would like to print them, you can use the pretty_print_series function in the utils module.
utils.pretty_print_series(human_ids)
4293
81553
26207
605
4861
55022
221143
If you would rather save the data to a file, you can save human_ids to a file called human_ids.txt in an output data directory, here data/out_data. For some default options, you can use the save_data function in the utils module. Feel free to change the output directory to any folder of your choice.
file_name = 'data/out_data/human_ids.txt'
utils.save_data(human_ids, file_name)
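If the Gene IDs for Step 1 arrive as pasted text rather than as a Python list, they first need to be split into individual IDs. The following is a minimal sketch of one way to do that; `parse_pasted_ids` is a hypothetical helper, not part of the library, and it keeps every ID as a string, so whether integer IDs need converting back with `int` is left as an assumption to check.

```python
import re


def parse_pasted_ids(pasted_text):
    """Split pasted text on commas and whitespace, dropping blanks and repeats (order kept)."""
    tokens = re.split(r'[,\s]+', pasted_text.strip())
    unique_ids = []
    for token in tokens:
        if token and token not in unique_ids:
            unique_ids.append(token)
    return unique_ids


# Pasted text mixing commas, new lines, and one repeated ID
pasted = """ZDB-GENE-081113-5, ZDB-GENE-031002-50
ZDB-GENE-120104-7
ZDB-GENE-081113-5"""

list_of_zfish_ids = parse_pasted_ids(pasted)
print(list_of_zfish_ids)
```

Because the Gene IDs are spelling and case sensitive, the helper deliberately leaves the text of each ID untouched and only removes surrounding separators.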
Human to Zebrafish
To convert from human Gene IDs to zebrafish, you follow the same steps, except we will use the convert_to_zebrafish function in the mapping module.
Step 1: Define your list of Gene IDs. Reminder that the only supported human Gene ID type is currently NCBI Gene ID. These are also known as Entrez Gene IDs and are integers. My list of Gene IDs is named list_of_human_ids.
list_of_human_ids = [55585, 23191, 4192, 5686,
197021, 390, 344805, 2623]
Step 2: Tell the program which zebrafish Gene ID type you would like to convert to. I will choose the Symbol Gene ID type for this example.
zfish_desired_gene_id_type = 'Symbol'
Step 3: Launch the conversion function for human to zebrafish.
zebrafish_ids = mapping.convert_to_zebrafish(list_of_human_ids, zfish_desired_gene_id_type)
Step 4: Print results to shell or save the results to a file. I will just print the results here using the pretty_print_series function in the utils module.
utils.pretty_print_series(zebrafish_ids)
mdkb
cyfip1
psma5
lctla
rnd3b
ube2q1
gata1a
mdka
lctlb
gata1b
tmprss7
If you take a closer look, you will notice that the number of genes we started with does not always match the number we end up with. There is not a 1:1 mapping of genes between human and zebrafish. In fact, zebrafish often have duplicated genes (e.g., an a and a b gene where a human has just one). Therefore, it is often useful to keep the mapping to know which genes have orthologs and which do not. To do this, it is better to keep the original and converted Gene IDs side by side in a DataFrame.
Simple Case: Get Orthologs and Keep Mapping
Purpose: given a small list of Gene IDs of type A in organism X, convert them to Gene IDs of type B in organism Y and keep the mapping.
Zebrafish to Human
Step 1: Define your list of Gene IDs. In this case, you would have a list of Gene IDs in any of the supported zebrafish formats. I am using the same list_of_zfish_ids as before.
list_of_zfish_ids = ['ZDB-GENE-081113-5', 'ZDB-GENE-031002-50',
'ZDB-GENE-120104-7', 'ZDB-GENE-000607-16',
'ZDB-GENE-060503-934', 'ZDB-GENE-050320-72',
'ZDB-GENE-030131-3147', 'ZDB-GENE-040426-2551']
Step 2: Tell the program which gene type you currently have and that we want to keep the mapping.
zfish_gene_type = 'ZFIN ID'
keep_mapping = True
Step 3: Launch the conversion function to get your ids converted to human Gene IDs. Note that the only supported Gene ID type for humans is the Human NCBI Gene ID, aka Entrez Gene ID. The keep_mapping variable is added here as well.
human_ids = mapping.convert_to_human(list_of_zfish_ids, zfish_gene_type, keep_mapping)
Step 4: Print results to shell or save the results to a file. Both options are shown here. Notice the file name is specified.
human_ids
ZFIN ID | Human NCBI Gene ID | |
---|---|---|
0 | ZDB-GENE-081113-5 | 4293 |
1 | ZDB-GENE-031002-50 | 81553 |
2 | ZDB-GENE-120104-7 | 26207 |
3 | ZDB-GENE-000607-16 | 605 |
4 | ZDB-GENE-060503-934 | 4861 |
5 | ZDB-GENE-050320-72 | 55022 |
6 | ZDB-GENE-040426-2551 | 221143 |
file_path = 'data/out_data/human_ids_with_mapping.txt'
utils.save_data(human_ids, file_path)
Note that the zebrafish gene ZDB-GENE-030131-3147 has no ortholog in humans. When missing orthologs are kept, its row carries a NaN value in the Human NCBI Gene ID column. If this is not important for your downstream analysis, you can drop such rows from the dataset.
human_ids.dropna()
ZFIN ID | Human NCBI Gene ID | |
---|---|---|
0 | ZDB-GENE-081113-5 | 4293 |
1 | ZDB-GENE-031002-50 | 81553 |
2 | ZDB-GENE-120104-7 | 26207 |
3 | ZDB-GENE-000607-16 | 605 |
4 | ZDB-GENE-060503-934 | 4861 |
5 | ZDB-GENE-050320-72 | 55022 |
6 | ZDB-GENE-040426-2551 | 221143 |
Alternatively, you could have run the convert_to_human function with the keep_missing_orthos option set to False (the default is True).
keep_mapping = True
keep_missing_orthos = False
mapping.convert_to_human(list_of_zfish_ids, zfish_gene_type,
keep_mapping, keep_missing_orthos)
ZFIN ID | Human NCBI Gene ID | |
---|---|---|
0 | ZDB-GENE-081113-5 | 4293 |
1 | ZDB-GENE-031002-50 | 81553 |
2 | ZDB-GENE-120104-7 | 26207 |
3 | ZDB-GENE-000607-16 | 605 |
4 | ZDB-GENE-060503-934 | 4861 |
5 | ZDB-GENE-050320-72 | 55022 |
6 | ZDB-GENE-040426-2551 | 221143 |
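One detail of the dropna approach is worth a small illustration: in pandas, dropna returns a new DataFrame rather than changing the original, and the surviving rows keep their old index labels. The sketch below builds a three-row table by hand from values shown in this section; storing the human IDs as strings and the variable names `mapped` and `cleaned` are assumptions made for the example, not the library's output format.

```python
import pandas as pd

# Hand-built stand-in for a mapping table with one missing ortholog
mapped = pd.DataFrame({
    'ZFIN ID': ['ZDB-GENE-000607-16', 'ZDB-GENE-030131-3147', 'ZDB-GENE-040426-2551'],
    'Human NCBI Gene ID': ['605', None, '221143'],
})

# dropna returns a new DataFrame; 'mapped' itself is left unchanged
cleaned = mapped.dropna(subset=['Human NCBI Gene ID'])
print(len(mapped), 'rows before,', len(cleaned), 'rows after')

# The remaining rows keep their original index labels until the index is reset
print('index after dropna:', list(cleaned.index))
cleaned = cleaned.reset_index(drop=True)
print('index after reset: ', list(cleaned.index))
```

Assigning the result to a variable, as done here with `cleaned`, is what makes the filtered table available for later steps.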
Human to Zebrafish
The steps for human to zebrafish are nearly the same. The repeated steps are shown below.
list_of_human_ids = [55585, 23191, 4192, 5686,
197021, 390, 344805, 2623]
zfish_desired_gene_id_type = 'Symbol'
keep_mapping = True
zebrafish_ids = mapping.convert_to_zebrafish(list_of_human_ids,
zfish_desired_gene_id_type,
keep_mapping)
zebrafish_ids
Human NCBI Gene ID | Symbol | |
---|---|---|
0 | 55585 | ube2q1 |
1 | 23191 | cyfip1 |
2 | 4192 | mdkb |
3 | 4192 | mdka |
4 | 5686 | psma5 |
5 | 197021 | lctla |
6 | 197021 | lctlb |
7 | 390 | rnd3b |
8 | 344805 | tmprss7 |
9 | 2623 | gata1b |
10 | 2623 | gata1a |
In this case, every human gene has an ortholog in zebrafish. In fact, some of the human genes here have more than one zebrafish ortholog. For example, the 2623 gene has the zebrafish orthologs gata1b and gata1a. All orthologs are kept in this method.
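To see the one-to-many relationship at a glance, the pairs from the table above can be grouped by human gene. This is a plain-Python sketch over values copied from that table; it does not call the library, and the variable names are chosen just for the illustration.

```python
from collections import defaultdict

# (Human NCBI Gene ID, zebrafish Symbol) pairs copied from the table above
pairs = [
    (55585, 'ube2q1'), (23191, 'cyfip1'), (4192, 'mdkb'), (4192, 'mdka'),
    (5686, 'psma5'), (197021, 'lctla'), (197021, 'lctlb'), (390, 'rnd3b'),
    (344805, 'tmprss7'), (2623, 'gata1b'), (2623, 'gata1a'),
]

# Group the zebrafish symbols under each human gene
orthologs_by_human_gene = defaultdict(list)
for human_id, symbol in pairs:
    orthologs_by_human_gene[human_id].append(symbol)

# Human genes with more than one zebrafish ortholog
multiple = {h: s for h, s in orthologs_by_human_gene.items() if len(s) > 1}

print(len(orthologs_by_human_gene), 'human genes map to', len(pairs), 'zebrafish symbols')
for human_id, symbols in multiple.items():
    print(human_id, '->', ', '.join(symbols))
```

This explains why 8 human Gene IDs produce 11 rows: three of the human genes each contribute two zebrafish orthologs.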
Get Orthologs for Genes from File
Purpose: given a list of Gene IDs from a file that are of type A in organism X, convert them to Gene IDs of type B in organism Y and save to file.
Zebrafish to Human
Step 1: Read in your list of Gene IDs. The data in the test_data sub-directory is in a tab-separated (tsv) format with a .txt extension. The pandas package can read this without issue; we just need to specify the separator, which is \t for this type of data. A csv file can be read the same way with a comma separator, and an Excel file can be read with pd.read_excel. I am using a file named dre04910.txt, which is a list of zebrafish genes in the KEGG pathway 04910.
data_file_path = 'data/test_data/dre04910.txt'
data = pd.read_csv(data_file_path, sep='\t')
To get a quick look at the data, we can print the first three table entries and some data stats:
# print first three lines
data.head(3)
rows, cols = data.shape
print(f'There are {rows} rows and {cols} columns')
There are 180 rows and 1 columns
As you can see, this list of Gene IDs has 180 entries.
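The reading step can be tried without any file on disk by feeding pandas an in-memory string. The sketch below assumes a layout with one header line followed by one Gene ID per line; the actual contents of the test files are not reproduced here, so the three IDs are simply examples taken from elsewhere in this tutorial.

```python
import io

import pandas as pd

# Hypothetical file contents: a header line, then one NCBI Gene ID per line
example_text = "NCBI Gene ID\n140634\n100002263\n140615\n"

# io.StringIO lets read_csv treat the string like an open file
data = pd.read_csv(io.StringIO(example_text), sep='\t')

rows, cols = data.shape
print(f'There are {rows} rows and {cols} columns')
print('Column names:', list(data.columns))
```

Checking the column names right after reading is a quick way to confirm that the separator was chosen correctly: a wrong separator typically leaves everything in a single, oddly named column.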
Step 2: Tell the program which Gene ID type you currently have.
zfish_gene_type = 'NCBI Gene ID'
Step 3: Launch the conversion function to get your ids converted to human Gene IDs. Note that the only supported Gene ID type for humans is the Human NCBI Gene ID, aka Entrez Gene ID.
human_ids = mapping.convert_to_human(data, zfish_gene_type)
Step 4: Print results to shell or save the results to a file. Both options are shown here. Notice the file name is specified.
human_ids.head(3)
m = len(human_ids)
print(f'There are {m} rows in this dataset.')
There are 117 rows in this dataset.
By printing some of the stats as shown above, we see that we are left with 117 human genes. This means that the 180 zebrafish genes we started with yielded only 117 human orthologs. This is expected because not every gene has an ortholog in the other organism; recall that only approximately 70% of human genes have zebrafish orthologs.
file_path = 'data/out_data/dre04910_human_genes.txt'
utils.save_data(human_ids, file_path)
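The row counts reported in this section can be turned into a quick ratio. This is simple arithmetic on the two numbers printed above; note that it compares rows, not unique genes, so it is only a rough indication of how much of the input list was mapped.

```python
# Row counts reported above for the zebrafish to human conversion
zebrafish_rows_in = 180
human_rows_out = 117

ratio = human_rows_out / zebrafish_rows_in
print(f'{human_rows_out} of {zebrafish_rows_in} rows remain ({ratio:.0%})')
```

A ratio well below what you expect for your gene list can be a hint that the wrong Gene ID type was specified in Step 2.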
Human to Zebrafish
Below is a quick run-through of the human to zebrafish case. The dataset used is a file named hsa04911.txt, which contains all human genes in the KEGG pathway 04911.
# read in data and print some stats
data_file_path = 'data/test_data/hsa04911.txt'
data = pd.read_csv(data_file_path, sep='\t')
m,n = data.shape
print(f'There are {m} rows in this dataset.')
There are 86 rows in this dataset.
# launch conversion
desired_zfish_gene_type = 'NCBI Gene ID'
zfish_ids = mapping.convert_to_zebrafish(data, desired_zfish_gene_type)
# print some results stats
zfish_ids.head(3)
0 64272
1 64269
2 64267
Name: NCBI Gene ID, dtype: object
m = len(zfish_ids)
print(f"There are {m} rows in this dataset.")
There are 112 rows in this dataset.
Get Orthologs for a Column of a Larger Dataset
Purpose: you have a dataset with columns x, y, z. Column x has Gene IDs from organism \(\alpha\) in type A. You would like to get orthologs for these Gene IDs in organism \(\beta\) in type B while maintaining the information in columns y and z.
In this scenario, you might have some data that looks like:
NCBI Gene ID | PValue | logFC |
---|---|---|
100002263 | 2.3 | 0.03 |
… | … | … |
and you might want to get something like:
NCBI Gene ID | Human NCBI Gene ID | PValue | logFC |
---|---|---|---|
140615 | 9415 | 2.3 | 0.03 |
… | … | … | … |
The information in the logFC and PValue columns must stay aligned with the Gene ID column. In this scenario you will often have an entire gene set and will be dealing with a lot more data. Let's look at a test dataset for this case.
Step 1: Read in the data as done previously.
data_file_path = 'data/test_data/example_diff_express_data.txt'
data = pd.read_csv(data_file_path, sep='\t')
To get a quick look at the data, we can print the first three table entries and some data stats:
# print first three lines
data.head(3)
NCBI Gene ID | PValue | logFC | |
---|---|---|---|
0 | 100000006 | 0.792615 | 0.115009 |
1 | 100000044 | 0.015286 | 0.803879 |
2 | 100000085 | 0.264762 | 0.267360 |
rows, cols = data.shape
print(f'There are {rows} rows and {cols} columns')
There are 5464 rows and 3 columns
Step 2: Get Human Orthologs. To execute the mapping and add the column, we will use the add_mapped_ortholog_column function from the mapping module. We give this function the data read in above and specify id_from and id_to, the Gene ID types to convert from and to.
id_from = 'NCBI Gene ID'
id_to = 'Human NCBI Gene ID'
new_data = mapping.add_mapped_ortholog_column(data, id_from, id_to)
new_data.head(3)
NCBI Gene ID | Human NCBI Gene ID | PValue | logFC | |
---|---|---|---|---|
0 | 100000006 | 84561 | 0.792615 | 0.115009 |
1 | 100000044 | NaN | 0.015286 | 0.803879 |
2 | 100000085 | 968 | 0.264762 | 0.267360 |
Step 3: Save to file. You can save this dataframe to a file as done previously.
file_name = 'data/out_data/ortholog_dataframe.txt'
utils.save_data(new_data, file_name)
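Conceptually, adding an ortholog column while keeping the other columns aligned resembles a left join between the dataset and a two-column ortholog table. The sketch below illustrates that idea with plain pandas, using the three rows shown in Step 2; it is an assumption about the general technique, not a description of how add_mapped_ortholog_column is implemented, and `ortholog_table` is a hand-built stand-in.

```python
import pandas as pd

# First three rows of the example dataset shown above
data = pd.DataFrame({
    'NCBI Gene ID': [100000006, 100000044, 100000085],
    'PValue': [0.792615, 0.015286, 0.264762],
    'logFC': [0.115009, 0.803879, 0.267360],
})

# Hand-built stand-in for an ortholog lookup table (values from the output above)
ortholog_table = pd.DataFrame({
    'NCBI Gene ID': [100000006, 100000085],
    'Human NCBI Gene ID': [84561, 968],
})

# A left join keeps every row of 'data'; genes without an ortholog get NaN
merged = data.merge(ortholog_table, on='NCBI Gene ID', how='left')

# Place the new column next to the original Gene ID column
merged = merged[['NCBI Gene ID', 'Human NCBI Gene ID', 'PValue', 'logFC']]
print(merged)
```

With a join like this, a gene that has several orthologs would appear on several rows, each repeating its PValue and logFC, which is worth keeping in mind before counting rows downstream.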
Conclusion
This concludes the mapping tutorial! In summary, the key functions in this library for mapping are:
function | purpose |
---|---|
convert_to_human | convert a list of Zebrafish Gene IDs to Human Gene IDs |
convert_to_zebrafish | convert a list of Human Gene IDs to Zebrafish Gene IDs |
add_mapped_ortholog_column | add an ortholog column of Gene IDs to an existing DataFrame |
In the human to zebrafish case, we see that we went from 86 Human NCBI Gene IDs to 112 zebrafish NCBI Gene IDs.