Gene ID Mapping

Author: Ashley Schwartz

Date: Originally Developed September 2023, Last Revised March 2024

Purpose and Background

This tutorial shows how to convert a list of zebrafish Gene IDs from one Gene ID type to another. Gene IDs come in different forms depending on the database or genome build you are using, which can get confusing! The Gene ID options are:

Gene ID Name	Description	Example	Notes
ZFIN ID	ZFIN gene id: always starts with ‘ZDB’ for zebrafish database	ZDB-GENE-011219-1	used as the “master” gene id (link)
NCBI Gene ID	integer gene id managed by NCBI: also known as Entrez Gene ID	140634	link
Symbol	descriptive symbol/name: RefSeq symbol used in RefSeq genome build	cyp1a	nomenclature defined by ZFIN
Ensembl Gene ID	Ensembl database gene id: always starts with ‘ENSDAR’	ENSDARG00000098315	link
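
The prefixes in the table above are enough to make a rough guess at which ID type a given value is. The following is a minimal sketch and is not part of danrerlib: the helper name guess_gene_id_type is hypothetical, and anything that is not ZDB-prefixed, ENSDAR-prefixed, or all digits is simply assumed to be a Symbol.

```python
import re


def guess_gene_id_type(gene_id):
    """Guess the Gene ID type from its format (hypothetical helper)."""
    text = str(gene_id).strip()
    if text.startswith("ZDB"):
        return "ZFIN ID"
    if text.startswith("ENSDAR"):
        return "Ensembl Gene ID"
    if re.fullmatch(r"\d+", text):
        return "NCBI Gene ID"
    # Assumption: anything else is treated as a symbol.
    return "Symbol"


# The four examples from the table above.
for example in ["ZDB-GENE-011219-1", 140634, "cyp1a", "ENSDARG00000098315"]:
    print(example, "->", guess_gene_id_type(example))
```

A guess like this is only a convenience for spotting mixed-up columns; it does not confirm that an ID actually exists in any database.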

Requirements

In this tutorial we will be utilizing two key elements:

a sample Gene ID list (format: .csv, .tsv, .txt) for reading in the Gene IDs, otherwise typing or copy/pasting Gene IDs is also supported
- the gene list we will be using is located in the data/test_data subdirectory of this current working directory with relative path data/test_data/example_diff_express_data.txt
the required Python package, danrerlib (imported below)
- see install notes if not currently installed.

In general, while you do not need a large foundation in Python to execute the code listed in this tutorial, a general understanding of absolute and relative paths is useful.

Note: the Gene ID type names are spelling and case sensitive.

# IMPORT PYTHON PACKAGE
# ---------------------
from danrerlib import mapping, utils
import pandas as pd

Execute Mappings

There are a variety of scenarios in which you might need to map Gene IDs. In the simplest case, you might have a few IDs you would like to map to NCBI (Entrez) Gene IDs, since that is a common Gene ID type used in pathway databases. Other times, you might want to convert an entire column in an Excel file you have. We will go through a few different options.

Simple Case: Convert a list of Gene IDs

Purpose: Given a small list of Gene IDs that are of Gene ID type A, convert to Gene ID type B.

You would most likely use the simple case if you have a small list of Gene IDs that you need to convert. It is especially useful if you just want to copy and paste your IDs and retrieve the converted ones!

Step 1: Define your list of ids. In this case, I have NCBI Gene IDs. I named the python list list_of_gene_ids and include all the Gene IDs I want to convert.

list_of_gene_ids = [ 
    100000252, 100000750, 100001198, 100001260, 100002225, 100002263, 
    100002756, 100003223, 100007521, 100149273, 100149794, 100170795,
    100321746, 100329897, 100330617,
]

Step 2: Tell the program which ID you currently have and which ID you would like to convert to. I currently have NCBI Gene IDs and I want to convert to ZFIN Gene IDs. Note that Gene ID options are spelling and case sensitive. Options are listed at the beginning of this document. (don’t worry, the program will let you know if you have made a mistake when you launch the program!)

current_gene_id_type = 'NCBI Gene ID'
desired_gene_id_type = 'ZFIN ID'
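
Because the option names are spelling and case sensitive, it can help to check them yourself before launching a long conversion. This is a minimal sketch, assuming only the four option names listed at the beginning of this document; the helper check_id_type is hypothetical and is not how the library performs its own check.

```python
# The four Gene ID type options listed at the beginning of this document.
VALID_ID_TYPES = ("NCBI Gene ID", "ZFIN ID", "Symbol", "Ensembl Gene ID")


def check_id_type(id_type):
    """Return id_type unchanged if it is a valid option, else raise ValueError."""
    if id_type in VALID_ID_TYPES:
        return id_type
    # Look for a near miss that differs only by case or surrounding spaces.
    for option in VALID_ID_TYPES:
        if option.lower() == id_type.strip().lower():
            raise ValueError(f"{id_type!r} is not valid; did you mean {option!r}?")
    raise ValueError(f"{id_type!r} is not valid; options are {VALID_ID_TYPES}")


current_gene_id_type = check_id_type("NCBI Gene ID")
desired_gene_id_type = check_id_type("ZFIN ID")
```

Passing something like 'zfin id' raises a ValueError that suggests 'ZFIN ID'.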

Step 3: Launch the conversion function to get your converted ids. This means we want to run the convert_ids function in the mapping module of our library. Once executed, the converted ids will be stored in the converted_ids variable.

# do conversion
converted_ids = mapping.convert_ids(list_of_gene_ids, current_gene_id_type, desired_gene_id_type)

Step 4: To view your converted ids, you can either print them to the Python shell or save them to a file. Printing is a fine idea if you only have a few; to do so, use the pretty_print_series function in the utils module of the library.

utils.pretty_print_series(converted_ids)

ZDB-GENE-030131-1904
ZDB-GENE-030131-3404
ZDB-GENE-030325-1
ZDB-GENE-030616-609
ZDB-GENE-040426-743
ZDB-GENE-050309-246
ZDB-GENE-071009-6
ZDB-GENE-080219-34
ZDB-GENE-080723-44
ZDB-GENE-081223-2
ZDB-GENE-090313-141
ZDB-GENE-091117-28
ZDB-GENE-110309-3
ZDB-GENE-120215-92

If you would rather save the data to a file, you can save converted_ids to a file named converted_ids.txt in an output data directory (here, data/out_data). The save_data function in the utils module does this with some default options. Feel free to change the output directory to any folder of your choice.

file_name = 'data/out_data/converted_ids.txt'
utils.save_data(converted_ids, file_name)
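
If you prefer full control over the output, a plain-Python alternative is easy to write. The sketch below is an assumption about a reasonable output format (one ID per line) and does not describe what save_data itself writes; write_ids is a hypothetical helper.

```python
from pathlib import Path


def write_ids(ids, file_name):
    """Write one ID per line, creating the output folder if it does not exist."""
    path = Path(file_name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(str(gene_id) for gene_id in ids) + "\n")
    return path
```

For example, write_ids(converted_ids, 'data/out_data/converted_ids.txt') would create data/out_data if needed and write the ids there.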

Simple Case: Convert a list of Gene IDs From a File

Purpose: Given a list of Gene IDs from a file that are of Gene ID type A, convert to Gene ID type B.

If you have a file that contains a list of Gene IDs, you can easily repeat these steps by reading in that file first. Check it out:

If you are interested in downloading this dataset to try it yourself, you can find it on our GitHub here.

data_file_path = 'data/test_data/ncbi_gene_list.txt'
list_of_gene_ids = pd.read_csv(data_file_path, sep='\t')
list_of_gene_ids

	NCBI Gene ID
0	30301
1	30459
2	57924
3	564772
4	794572

When we use the pandas package to read in the data, it organizes it into what is called a Pandas DataFrame. All other steps can be executed the same way.

current_gene_id_type = 'NCBI Gene ID'
desired_gene_id_type = 'ZFIN ID'

# do conversion
converted_ids = mapping.convert_ids(list_of_gene_ids, current_gene_id_type, desired_gene_id_type)

utils.pretty_print_series(converted_ids)

ZDB-GENE-000616-10
ZDB-GENE-050419-85
ZDB-GENE-080415-1
ZDB-GENE-991229-12

You could also save the data to a file in the same manner. As you can see, we now have our converted ids. In the outputs shown in this tutorial, the converted ids are listed in sorted order of the new ID, so their positions do not necessarily line up with the list you supplied. If you would like to keep the mapping, adding a column to the Gene IDs you currently have might be useful (see below).

A limitation of this method is that the mapping between Gene ID types is not always 1:1. It is quite common for one Gene ID to correspond to more than one Ensembl Gene ID. This function returns all mappings, but since it returns only the list, you cannot tell which gene in your original set mapped to two different genes in the new set. This may not be an issue for some use cases, but sometimes it is important to know.
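
To see why a flat list hides one-to-many mappings, consider the toy example below. The IDs are made-up placeholders (not real genes), and the dictionary lookup is only an illustration, not the library's implementation.

```python
# Toy mapping with placeholder IDs, chosen so that one input maps to two outputs.
toy_mapping = {
    "gene_a": ["new_1"],
    "gene_b": ["new_2", "new_3"],  # one-to-many
    "gene_c": ["new_4"],
}

input_ids = ["gene_a", "gene_b", "gene_c"]

# Flat list: every mapped ID is present, but the link back to its input is lost.
flat = [new for old in input_ids for new in toy_mapping[old]]

# Pairs: each mapped ID stays next to the ID it came from.
pairs = [(old, new) for old in input_ids for new in toy_mapping[old]]

print(flat)
print(pairs)
```

Three inputs produce four outputs, and only the paired form shows that gene_b is the one responsible for the extra entry.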

If you would like to keep the old Gene IDs along with the mapping, follow the next set of instructions!

Simple Case: Convert a list of Gene IDs and Keep Mapping

Purpose: Given a list of Gene IDs that are of Gene ID type A, convert to Gene ID type B and keep both Gene ID A and Gene ID B in a table.

Step 1: Define your list of ids. This is the same list I used above, and remember they are NCBI Gene IDs. I am also defining my current Gene ID type and the Gene ID type I would like to convert to here.

list_of_gene_ids = [ 
    100000252, 100000750, 100001198, 100001260, 100002225, 100002263, 
    100002756, 100003223, 100007521, 100149273, 100149794, 100170795,
    100321746, 100329897, 100330617,
]

current_gene_id_type = 'NCBI Gene ID'
desired_gene_id_type = 'ZFIN ID'

Step 2: Do the conversion. To keep the mapping, we again use the convert_ids function in the mapping module of our library, but this time we activate the keep_mapping parameter. By default, as used earlier, keep_mapping = False.

# do conversion
converted_id_table = mapping.convert_ids(list_of_gene_ids, current_gene_id_type, desired_gene_id_type, keep_mapping=True)
converted_id_table

	NCBI Gene ID	ZFIN ID
0	100002263	ZDB-GENE-030131-1904
1	100330617	ZDB-GENE-030131-3404
2	100001198	ZDB-GENE-030325-1
3	100003223	ZDB-GENE-030616-609
4	100000252	ZDB-GENE-040426-743
5	100001260	ZDB-GENE-050309-246
6	100321746	ZDB-GENE-071009-6
7	100002225	ZDB-GENE-080219-34
8	100170795	ZDB-GENE-080723-44
9	100000750	ZDB-GENE-081223-2
10	100007521	ZDB-GENE-090313-141
11	100149273	ZDB-GENE-091117-28
12	100149794	ZDB-GENE-110309-3
13	100329897	ZDB-GENE-120215-92

You can save the data in the same way. In this case, since we have a Pandas DataFrame with column headings, the column names will be saved automatically.
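
Keeping the mapping also makes it easy to spot input IDs that did not map to anything: 15 IDs went in above, but only 14 rows came back. The sketch below hard-codes the NCBI Gene ID column from the table above so that it runs on its own; with the real table you would presumably take the values from converted_id_table['NCBI Gene ID'] (standard pandas column access) instead.

```python
list_of_gene_ids = [
    100000252, 100000750, 100001198, 100001260, 100002225, 100002263,
    100002756, 100003223, 100007521, 100149273, 100149794, 100170795,
    100321746, 100329897, 100330617,
]

# NCBI Gene IDs that appear in the converted table shown above.
mapped_ids = [
    100002263, 100330617, 100001198, 100003223, 100000252, 100001260,
    100321746, 100002225, 100170795, 100000750, 100007521, 100149273,
    100149794, 100329897,
]

# Keep the input order while filtering out everything that was mapped.
mapped = set(mapped_ids)
unmapped_ids = [gene_id for gene_id in list_of_gene_ids if gene_id not in mapped]
print(unmapped_ids)  # [100002756]
```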

If you would like to read the Gene IDs in from a file, all steps are the same besides the initialization of the Gene IDs:

data_file_path = 'data/test_data/ncbi_gene_list.txt'
list_of_gene_ids = pd.read_csv(data_file_path, sep='\t')

current_gene_id_type = 'NCBI Gene ID'
desired_gene_id_type = 'ZFIN ID'

# do conversion
converted_id_table = mapping.convert_ids(list_of_gene_ids, current_gene_id_type, desired_gene_id_type, keep_mapping=True)
converted_id_table

	NCBI Gene ID	ZFIN ID
0	57924	ZDB-GENE-000616-10
1	794572	ZDB-GENE-050419-85
2	564772	ZDB-GENE-080415-1
3	30459	ZDB-GENE-991229-12

The above methods are great if you have a list of Gene IDs you would like to convert. There are also cases where you have a large dataset and one column in that dataset holds the Gene IDs you would like to convert. In this scenario, keeping every row's values together is extremely important.

Convert Gene IDs in a Column of a Larger Dataset

Purpose: you have a dataset with columns x, y, z. Column x has Gene IDs in type A. You would like to convert these Gene IDs to type B while maintaining the information of columns y, z.

In this scenario, you might have some data that looks like:

NCBI Gene ID	PValue	logFC
100002263	2.3	0.03
…	…	…

The information in the logFC and PValue columns must stay ‘in order’ with the NCBI Gene ID column. In this scenario you will often have an entire gene set and therefore much more data. Let’s look at a test dataset for this case.

Step 1: Read in the data. The data in the test_data sub-directory is tab-separated, with a .txt extension. pandas can read it with read_csv as long as we specify the separator; for tab-separated data that is sep='\t'. A comma-separated file works the same way with sep=',', and an Excel file can be read with pd.read_excel instead.

If you are interested, the raw data can be downloaded from our GitHub here.

data_file_path = 'data/test_data/example_diff_express_data.txt'
data = pd.read_csv(data_file_path, sep='\t')

To get a quick look at the data, we can print the first three table entries and some data stats:

# print first three lines
data.head(3)
rows, cols = data.shape
print(f'There are {rows} rows and {cols} columns')

There are 5464 rows and 3 columns

Step 2: Convert Gene IDs. To execute the mapping and add the column, we will use the add_mapped_column function from the mapping module. We will give this function the data read in above, and we need to specify the id_from and id_to as before.

id_from = 'NCBI Gene ID'
id_to = 'ZFIN ID'
new_data = mapping.add_mapped_column(data, id_from, id_to)
new_data.head(3)

	NCBI Gene ID	ZFIN ID	PValue	logFC
0	100000006	ZDB-GENE-030131-5654	0.792615	0.115009
1	100000044	ZDB-GENE-121214-60	0.015286	0.803879
2	100000085	ZDB-GENE-030131-1312	0.264762	0.267360
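
Conceptually, adding a mapped column is a lookup per row, which is why PValue and logFC stay attached to their gene. The sketch below imitates that with plain dictionaries, using the three rows shown above; it is an illustration only and says nothing about how add_mapped_column is implemented internally.

```python
# The first three rows of the example dataset, as shown in the output above.
rows = [
    {"NCBI Gene ID": 100000006, "PValue": 0.792615, "logFC": 0.115009},
    {"NCBI Gene ID": 100000044, "PValue": 0.015286, "logFC": 0.803879},
    {"NCBI Gene ID": 100000085, "PValue": 0.264762, "logFC": 0.267360},
]

# NCBI Gene ID -> ZFIN ID, taken from the same output.
id_map = {
    100000006: "ZDB-GENE-030131-5654",
    100000044: "ZDB-GENE-121214-60",
    100000085: "ZDB-GENE-030131-1312",
}

new_rows = []
for row in rows:
    new_rows.append({
        "NCBI Gene ID": row["NCBI Gene ID"],
        # .get returns None when an ID has no entry in the mapping.
        "ZFIN ID": id_map.get(row["NCBI Gene ID"]),
        "PValue": row["PValue"],
        "logFC": row["logFC"],
    })

for new_row in new_rows:
    print(new_row)
```

Because the new ID is looked up row by row, no row is ever reordered relative to its own PValue and logFC.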

Step 3: Save to file. You can save this dataframe to a file as done previously.

file_name = 'data/out_data/converted_dataframe.txt'
utils.save_data(new_data, file_name)

Special Cases

As a reminder, id_from and id_to must match one of the specified options: NCBI Gene ID, Symbol, ZFIN ID, or Ensembl Gene ID. If the column name in your dataset does not match one of the ID type options, that is fine. The add_mapped_column function has an option to deal with that.

If you are interested in downloading this dataset, you can do so from our GitHub here.

data_file_path = 'data/test_data/test_set_invalid_col_name.txt'
data = pd.read_csv(data_file_path, sep='\t')
data.head(3)

	NCBI ID	PValue	logFC
0	100000006	0.792615	0.115009
1	100000009	0.607285	-0.144714
2	100000026	0.021338	0.603871

As you can see here, the column name we have is NCBI ID, which is not one of the options. We can use the column_name_with_ids parameter to get around this. Note that the id_from parameter must be one of the options and the column_name_with_ids must match the column name with the IDs in the dataset exactly. Of course, we would want id_from and column_name_with_ids to be of the same Gene ID type, likely with just a different spelling or naming convention as seen here.

id_from = 'NCBI Gene ID'
id_to = 'ZFIN ID'
column_name_with_ids = 'NCBI ID'
new_data = mapping.add_mapped_column(data, id_from, id_to, column_name_with_ids)
new_data.head(3)

	NCBI ID	ZFIN ID	PValue	logFC
0	100000006	ZDB-GENE-030131-5654	0.792615	0.115009
1	100000009	ZDB-GENE-130530-778	0.607285	-0.144714
2	100000026	ZDB-GENE-120823-1	0.021338	0.603871

We can see that the mapping was executed successfully even though the column name is not one of our specified options.

In the final special case, we can also drop the old ID column if we don’t need it anymore.

new_data = mapping.add_mapped_column(data, id_from, id_to, 
                                     column_name_with_ids, keep_old_ids=False)
new_data.head(3)

	ZFIN ID	PValue	logFC
0	ZDB-GENE-030131-5654	0.792615	0.115009
1	ZDB-GENE-130530-778	0.607285	-0.144714
2	ZDB-GENE-120823-1	0.021338	0.603871

In this run I set the keep_old_ids parameter to False; therefore, the original NCBI ID column is no longer included in the output.

Conclusion

This concludes the mapping tutorial! In summary, the key functions in this library for mapping are:

function	purpose
convert_ids	convert a list of Gene IDs
add_mapped_column	add a converted list of Gene IDs to an existing DataFrame

For more information about the full functionality of each function, please refer to the API Reference.