Gene ID Mapping
Author: Ashley Schwartz
Date: Originally Developed September 2023, Last Revised March 2024
Purpose and Background
This tutorial goes over how to simply convert a list of zebrafish Gene IDs to another Gene ID type. Gene IDs come in very different forms depending on the database or genome build you are using. This can get confusing! The Gene ID options are:
Gene ID Name | Description | Example | Notes |
---|---|---|---|
ZFIN ID | ZFIN gene id: always starts with ‘ZDB’ for zebrafish database | ZDB-GENE-011219-1 | used as the “master” gene id (link) |
NCBI Gene ID | integer gene id managed by NCBI: also known as Entrez Gene ID | 140634 | |
Symbol | descriptive symbol/name: RefSeq symbol used in RefSeq genome build | cyp1a | nomenclature defined by ZFIN |
Ensembl Gene ID | Ensembl database gene id: always starts with ‘ENSDAR’ | ENSDARG00000098315 | |
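The prefixes in the table above are enough to make a rough guess at which Gene ID type a value belongs to. The sketch below is only a heuristic: guess_gene_id_type is a hypothetical helper, not part of the library, and it assumes that anything that is not a ZFIN, Ensembl, or integer id is a Symbol.

```python
import re


def guess_gene_id_type(gene_id):
    """Guess the Gene ID type from its format (hypothetical helper, heuristic only)."""
    text = str(gene_id).strip()
    if text.startswith("ZDB"):
        return "ZFIN ID"
    if text.startswith("ENSDAR"):
        return "Ensembl Gene ID"
    if re.fullmatch(r"\d+", text):
        return "NCBI Gene ID"
    # Assumption: anything else is treated as a gene symbol
    return "Symbol"


# The four example ids from the table
for example in ["ZDB-GENE-011219-1", 140634, "cyp1a", "ENSDARG00000098315"]:
    print(example, "->", guess_gene_id_type(example))
```

A check like this can catch a mislabelled column before any conversion is attempted, but it cannot confirm that an id actually exists in a database.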
Requirements
In this tutorial we will be utilizing two key elements:
1. a sample Gene ID list (format: .csv, .tsv, .txt) for reading in the Gene IDs; typing or copy/pasting Gene IDs is also supported. The gene list we will be using is located in the data/test_data subdirectory of the current working directory, with relative path data/test_data/example_diff_express_data.txt
2. the required Python package, danrerlib (the package imported below); see the install notes if it is not currently installed.
In general, while you do not need a large foundation in Python to execute the code listed in this tutorial, a general understanding of absolute and relative paths is useful.
Note: the Gene IDs are spelling and case sensitive.
# IMPORT PYTHON PACKAGE
# ---------------------
from danrerlib import mapping, utils
import pandas as pd
Execute Mappings
There are a variety of scenarios in which you might need to map Gene IDs. In the simplest case, you might have a few IDs you would like to map to NCBI (Entrez) Gene IDs, since that is a common Gene ID type used in pathway databases. Other times, you might want to convert an entire column in an Excel file you have. We will go through a few different options.
Simple Case: Convert a list of Gene IDs
Purpose: Given a small list of Gene IDs that are of Gene ID type A, convert to Gene ID type B.
You would most likely use the simple case if you have a small list of gene ids that you need to convert. It is especially useful if you just want to copy and paste your ids and retrieve the converted ones!
Step 1: Define your list of ids. In this case, I have NCBI Gene IDs. I named the Python list list_of_gene_ids and included all the Gene IDs I want to convert.
list_of_gene_ids = [
    100000252, 100000750, 100001198, 100001260, 100002225, 100002263,
    100002756, 100003223, 100007521, 100149273, 100149794, 100170795,
    100321746, 100329897, 100330617,
]
Step 2: Tell the program which ID type you currently have and which ID type you would like to convert to. I currently have NCBI Gene IDs and I want to convert to ZFIN IDs. Note that the Gene ID options are spelling and case sensitive; they are listed at the beginning of this document. (Don’t worry, the program will let you know if you have made a mistake when you launch it!)
current_gene_id_type = 'NCBI Gene ID'
desired_gene_id_type = 'ZFIN ID'
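Because the option names are spelling and case sensitive, it can help to check them before launching a conversion. The sketch below is a stand-alone illustration: check_id_type and VALID_ID_TYPES are hypothetical names, not part of the library, and the four option strings are taken from the table at the beginning of this document.

```python
# The four Gene ID options listed at the beginning of this document
VALID_ID_TYPES = ("NCBI Gene ID", "ZFIN ID", "Symbol", "Ensembl Gene ID")


def check_id_type(id_type):
    """Return id_type if it exactly matches one of the options, else raise."""
    if id_type in VALID_ID_TYPES:
        return id_type
    # Suggest the option that matches once case and spaces are ignored
    normalized = id_type.replace(" ", "").lower()
    hints = [v for v in VALID_ID_TYPES if v.replace(" ", "").lower() == normalized]
    hint = f" Did you mean '{hints[0]}'?" if hints else ""
    raise ValueError(f"'{id_type}' is not a valid Gene ID type.{hint}")


current_gene_id_type = check_id_type("NCBI Gene ID")
desired_gene_id_type = check_id_type("ZFIN ID")
```

With this helper, a value such as 'zfin id' raises a ValueError that suggests 'ZFIN ID' instead of failing later in the workflow.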
Step 3: Launch the conversion function to get your converted ids. This means we want to run the convert_ids function in the mapping module of our library. Once executed, the converted ids will be stored in the converted_ids variable.
# do conversion
converted_ids = mapping.convert_ids(list_of_gene_ids, current_gene_id_type, desired_gene_id_type)
Step 4: To view your converted ids, you can either print them to the Python shell or save them to a file. Printing is a fine idea if you only have a few; for that you can use the pretty_print_series function in the utils module of the library.
utils.pretty_print_series(converted_ids)
ZDB-GENE-030131-1904
ZDB-GENE-030131-3404
ZDB-GENE-030325-1
ZDB-GENE-030616-609
ZDB-GENE-040426-743
ZDB-GENE-050309-246
ZDB-GENE-071009-6
ZDB-GENE-080219-34
ZDB-GENE-080723-44
ZDB-GENE-081223-2
ZDB-GENE-090313-141
ZDB-GENE-091117-28
ZDB-GENE-110309-3
ZDB-GENE-120215-92
If you would rather save the data to a file, you can save converted_ids to a file called converted_ids.txt in an output data directory, here data/out_data. For some default options, you can use the save_data function in the utils module. Feel free to change the output directory to any folder of your choice.
file_name = 'data/out_data/converted_ids.txt'
utils.save_data(converted_ids, file_name)
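Writing to data/out_data assumes that folder exists. Whether save_data creates missing folders is not stated here, so creating the folder first is a safe habit. The sketch below uses only the standard library; the plain one-id-per-line write is a stand-in for illustration and is not necessarily the format save_data produces.

```python
import tempfile
from pathlib import Path

# A few of the converted ids printed above
converted_ids = ["ZDB-GENE-030131-1904", "ZDB-GENE-030131-3404", "ZDB-GENE-030325-1"]

# A temporary folder stands in for the project directory in this sketch
project_dir = Path(tempfile.mkdtemp())
out_file = project_dir / "data" / "out_data" / "converted_ids.txt"

# Create data/out_data (and any missing parent folders) before writing
out_file.parent.mkdir(parents=True, exist_ok=True)

# Plain-Python stand-in for saving: one id per line
out_file.write_text("\n".join(converted_ids) + "\n")

print(out_file.read_text())
```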
Simple Case: Convert a list of Gene IDs From a File
Purpose: Given a list of Gene IDs from a file that are of Gene ID type A, convert to Gene ID type B.
If you have a file that contains a list of Gene IDs, you can easily repeat these steps by reading in that file first. Check it out:
If you are interested in downloading this dataset to try it yourself, you can find it on our GitHub here.
data_file_path = 'data/test_data/ncbi_gene_list.txt'
list_of_gene_ids = pd.read_csv(data_file_path, sep='\t')
list_of_gene_ids
NCBI Gene ID | |
---|---|
0 | 30301 |
1 | 30459 |
2 | 57924 |
3 | 564772 |
4 | 794572 |
When we use the pandas package to read in the data, it organizes it into what is called a Pandas DataFrame. All other steps can be executed the same way.
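The DataFrame is passed to the conversion function as is in this tutorial. If a plain Python list is ever preferred, for example to inspect or filter the ids first, the column can be pulled out by its header. A minimal sketch, with the five ids from the table above held in memory rather than read from the file:

```python
import io

import pandas as pd

# Same five NCBI Gene IDs as the table above, held in memory instead of a file
file_contents = "NCBI Gene ID\n30301\n30459\n57924\n564772\n794572\n"
gene_id_table = pd.read_csv(io.StringIO(file_contents), sep="\t")

# Select the single column by its header and turn it into a plain Python list
list_of_gene_ids = gene_id_table["NCBI Gene ID"].tolist()
print(list_of_gene_ids)
```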
current_gene_id_type = 'NCBI Gene ID'
desired_gene_id_type = 'ZFIN ID'
# do conversion
converted_ids = mapping.convert_ids(list_of_gene_ids, current_gene_id_type, desired_gene_id_type)
utils.pretty_print_series(converted_ids)
ZDB-GENE-000616-10
ZDB-GENE-050419-85
ZDB-GENE-080415-1
ZDB-GENE-991229-12
You could also save the data to a file in the same manner. As you can see, we now have our converted ids. Note that, in the output shown here, the ids are listed in sorted order of the new Gene ID type rather than in the order of the original list, and the five NCBI Gene IDs read from the file produced only four ZFIN IDs. If you would like to keep track of which original Gene ID produced which converted id, keeping both in a table is useful (see below). A limitation of this method is that the mapping between Gene ID types is not always 1:1. It is quite common for one Gene ID of another type to correspond to more than one Ensembl Gene ID. This function returns all mappings but, since it returns only the list of converted ids, you cannot tell which gene in your original set mapped to two different genes in the new set. This may not matter for some use cases, but sometimes it is important to know.
If you would like to keep the old Gene IDs along with the mapping, follow the next set of instructions!
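Once the old and new ids are kept side by side, one-to-many mappings are easy to spot by grouping the new ids under the old id they came from. The sketch below uses made-up placeholder ids, not real genes, and plain Python rather than any library function:

```python
from collections import defaultdict

# Made-up (old id, new id) pairs for illustration only; these are not real genes
mapping_pairs = [
    ("gene_A", "ENSDARG_EXAMPLE_1"),
    ("gene_A", "ENSDARG_EXAMPLE_2"),
    ("gene_B", "ENSDARG_EXAMPLE_3"),
]

# Group the new ids under the old id they came from
new_ids_by_old_id = defaultdict(list)
for old_id, new_id in mapping_pairs:
    new_ids_by_old_id[old_id].append(new_id)

# Keep only the old ids that map to more than one new id
one_to_many = {old: new for old, new in new_ids_by_old_id.items() if len(new) > 1}
print(one_to_many)
```

Here only gene_A is reported, because it is the only old id with two new ids.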
Simple Case: Convert a list of Gene IDs and Keep Mapping
Purpose: Given a list of Gene IDs that are of Gene ID type A, convert to Gene ID type B and keep both Gene ID A and Gene ID B in a table.
Step 1: Define your list of ids. This is the same list I used above, and remember they are NCBI Gene IDs. I am also defining my current Gene ID type and the Gene ID type I would like to convert to here.
list_of_gene_ids = [
    100000252, 100000750, 100001198, 100001260, 100002225, 100002263,
    100002756, 100003223, 100007521, 100149273, 100149794, 100170795,
    100321746, 100329897, 100330617,
]
current_gene_id_type = 'NCBI Gene ID'
desired_gene_id_type = 'ZFIN ID'
Step 2: Do the conversion. If we would like to keep the mapping, we use the convert_ids function in the mapping module of our library but set the keep_mapping parameter to True. By default, as used earlier, keep_mapping = False.
# do conversion
converted_id_table = mapping.convert_ids(list_of_gene_ids, current_gene_id_type, desired_gene_id_type, keep_mapping=True)
converted_id_table
NCBI Gene ID | ZFIN ID | |
---|---|---|
0 | 100002263 | ZDB-GENE-030131-1904 |
1 | 100330617 | ZDB-GENE-030131-3404 |
2 | 100001198 | ZDB-GENE-030325-1 |
3 | 100003223 | ZDB-GENE-030616-609 |
4 | 100000252 | ZDB-GENE-040426-743 |
5 | 100001260 | ZDB-GENE-050309-246 |
6 | 100321746 | ZDB-GENE-071009-6 |
7 | 100002225 | ZDB-GENE-080219-34 |
8 | 100170795 | ZDB-GENE-080723-44 |
9 | 100000750 | ZDB-GENE-081223-2 |
10 | 100007521 | ZDB-GENE-090313-141 |
11 | 100149273 | ZDB-GENE-091117-28 |
12 | 100149794 | ZDB-GENE-110309-3 |
13 | 100329897 | ZDB-GENE-120215-92 |
You can save the data in the same way. In this case, since we have a Pandas DataFrame with column headings, the column names will be saved automatically.
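The table above has 14 rows although 15 NCBI Gene IDs were supplied, so one id has no ZFIN ID in the output. With both columns kept, the missing id can be found by comparing the input list with the NCBI Gene ID column. A minimal sketch in plain Python, with both lists copied from this section:

```python
# The 15 NCBI Gene IDs given to convert_ids in this section
list_of_gene_ids = [
    100000252, 100000750, 100001198, 100001260, 100002225, 100002263,
    100002756, 100003223, 100007521, 100149273, 100149794, 100170795,
    100321746, 100329897, 100330617,
]

# The NCBI Gene ID column of the converted table shown above
mapped_ncbi_ids = [
    100002263, 100330617, 100001198, 100003223, 100000252, 100001260,
    100321746, 100002225, 100170795, 100000750, 100007521, 100149273,
    100149794, 100329897,
]

# Keep the input ids that never appear in the converted table
mapped = set(mapped_ncbi_ids)
unmapped_ids = [gene_id for gene_id in list_of_gene_ids if gene_id not in mapped]
print(unmapped_ids)  # [100002756]
```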
If you would like to read the Gene IDs in from a file, all steps are the same besides the initialization of the Gene IDs:
data_file_path = 'data/test_data/ncbi_gene_list.txt'
list_of_gene_ids = pd.read_csv(data_file_path, sep='\t')
current_gene_id_type = 'NCBI Gene ID'
desired_gene_id_type = 'ZFIN ID'
# do conversion
converted_id_table = mapping.convert_ids(list_of_gene_ids, current_gene_id_type, desired_gene_id_type, keep_mapping=True)
converted_id_table
NCBI Gene ID | ZFIN ID | |
---|---|---|
0 | 57924 | ZDB-GENE-000616-10 |
1 | 794572 | ZDB-GENE-050419-85 |
2 | 564772 | ZDB-GENE-080415-1 |
3 | 30459 | ZDB-GENE-991229-12 |
The above methodologies are great if you have a list of Gene IDs you would like to convert. There are cases where you might have a large dataset and one column in that dataset has the Gene IDs that you would like to convert. In this scenario, keeping all columns properly sorted is extremely important.
Convert Gene IDs in a Column of a Larger Dataset
Purpose: you have a dataset with columns x, y, z. Column x has Gene IDs in type A. You would like to convert these Gene IDs to type B while maintaining the information of columns y, z.
In this scenario, you might have some data that looks like:
NCBI Gene ID | PValue | logFC |
---|---|---|
100002263 | 2.3 | 0.03 |
… | … | … |
The information in the logFC and PValue columns must stay aligned, row by row, with the NCBI Gene ID column. In this scenario you will often have an entire gene set and will be dealing with a lot more data. Let's look at a test dataset for this case.
Step 1: Read in the data. The data in the test_data sub-directory is in a tab-separated (tsv) format with a .txt extension. The pandas package can read this without an issue; we just need to specify the separator, which is \t (tab) for this type of data. A csv file can be read the same way with a comma as the separator, and pandas also provides pd.read_excel for Excel files.
If you are interested, the raw data can be downloaded from our GitHub here.
data_file_path = 'data/test_data/example_diff_express_data.txt'
data = pd.read_csv(data_file_path, sep='\t')
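Only the sep argument changes between a tab-separated and a comma-separated file. The sketch below reads the same two rows of the example data from in-memory text in both layouts and checks that the resulting tables are identical:

```python
import io

import pandas as pd

# Two rows of the example data, written out as tab-separated text
tab_text = (
    "NCBI Gene ID\tPValue\tlogFC\n"
    "100000006\t0.792615\t0.115009\n"
    "100000044\t0.015286\t0.803879\n"
)
# The same rows as comma-separated text
comma_text = tab_text.replace("\t", ",")

from_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")
from_comma = pd.read_csv(io.StringIO(comma_text))  # sep defaults to ","

# Both layouts produce the same DataFrame
print(from_tab.equals(from_comma))
```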
To get a quick look at the data, we can print the first three table entries and some data stats:
# print first three lines
data.head(3)
rows, cols = data.shape
print(f'There are {rows} rows and {cols} columns')
There are 5464 rows and 3 columns
Step 2: Convert Gene IDs. To execute the mapping and add the column, we will use the add_mapped_column function from the mapping module. We will give this function the data read in above, and we need to specify id_from and id_to as before.
id_from = 'NCBI Gene ID'
id_to = 'ZFIN ID'
new_data = mapping.add_mapped_column(data, id_from, id_to)
new_data.head(3)
NCBI Gene ID | ZFIN ID | PValue | logFC | |
---|---|---|---|---|
0 | 100000006 | ZDB-GENE-030131-5654 | 0.792615 | 0.115009 |
1 | 100000044 | ZDB-GENE-121214-60 | 0.015286 | 0.803879 |
2 | 100000085 | ZDB-GENE-030131-1312 | 0.264762 | 0.267360 |
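Conceptually, adding a mapped column resembles a left join between the dataset and a lookup table of id pairs: every row of the dataset keeps its position and its other columns. The sketch below illustrates that idea in plain pandas with the three rows shown above. It is an illustration only, not a description of how add_mapped_column is implemented.

```python
import pandas as pd

# Three rows of the example data (values copied from the output above)
data = pd.DataFrame({
    "NCBI Gene ID": [100000006, 100000044, 100000085],
    "PValue": [0.792615, 0.015286, 0.264762],
    "logFC": [0.115009, 0.803879, 0.267360],
})

# A small lookup table, deliberately listed in a different row order
lookup = pd.DataFrame({
    "NCBI Gene ID": [100000085, 100000006, 100000044],
    "ZFIN ID": [
        "ZDB-GENE-030131-1312",
        "ZDB-GENE-030131-5654",
        "ZDB-GENE-121214-60",
    ],
})

# A left join keeps every row of `data` in its original order
merged = data.merge(lookup, on="NCBI Gene ID", how="left")
print(merged)
```

Because the join is keyed on the Gene ID rather than on row position, the PValue and logFC values stay attached to the correct gene even when the lookup table is ordered differently.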
Step 3: Save to file. You can save this dataframe to a file as done previously.
file_name = 'data/out_data/converted_dataframe.txt'
utils.save_data(new_data, file_name)
Special Cases
Reminder that id_from and id_to must match the specified options: NCBI Gene ID, Symbol, ZFIN ID, or Ensembl Gene ID. If the column name in your dataset does not match one of the ID type options, that is fine. The add_mapped_column function has an option to deal with that.
If you are interested in downloading this dataset, you can do so from our GitHub here.
data_file_path = 'data/test_data/test_set_invalid_col_name.txt'
data = pd.read_csv(data_file_path, sep='\t')
data.head(3)
NCBI ID | PValue | logFC | |
---|---|---|---|
0 | 100000006 | 0.792615 | 0.115009 |
1 | 100000009 | 0.607285 | -0.144714 |
2 | 100000026 | 0.021338 | 0.603871 |
As you can see here, the column name we have is NCBI ID, which is not one of the options. We can use the column_name_with_ids parameter to get around this. Note that the id_from parameter must be one of the options and column_name_with_ids must match the column name with the IDs in the dataset exactly. Of course, we would want id_from and column_name_with_ids to refer to the same Gene ID type, likely with just a different spelling or naming convention as seen here.
id_from = 'NCBI Gene ID'
id_to = 'ZFIN ID'
column_name_with_ids = 'NCBI ID'
new_data = mapping.add_mapped_column(data, id_from, id_to, column_name_with_ids)
new_data.head(3)
NCBI ID | ZFIN ID | PValue | logFC | |
---|---|---|---|---|
0 | 100000006 | ZDB-GENE-030131-5654 | 0.792615 | 0.115009 |
1 | 100000009 | ZDB-GENE-130530-778 | 0.607285 | -0.144714 |
2 | 100000026 | ZDB-GENE-120823-1 | 0.021338 | 0.603871 |
We can see that the mapping was executed successfully even though the column name is not one of our specified options.
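Since column_name_with_ids must match the header exactly, a quick look at the file header before mapping can save a failed run. The sketch below uses only the standard library and reads the header from in-memory text that mirrors the dataset shown above:

```python
import csv
import io

# Header and first row of the dataset shown above, held in memory
file_contents = "NCBI ID\tPValue\tlogFC\n100000006\t0.792615\t0.115009\n"
header = next(csv.reader(io.StringIO(file_contents), delimiter="\t"))

column_name_with_ids = "NCBI ID"

# Stop early with a readable message if the name is not an exact match
if column_name_with_ids not in header:
    raise KeyError(f"'{column_name_with_ids}' not found; columns are {header}")
print("found column:", column_name_with_ids)
```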
In the final special case, we can also drop the column of old ids if we don’t need it anymore.
new_data = mapping.add_mapped_column(data, id_from, id_to,
                                     column_name_with_ids, keep_old_ids=False)
new_data.head(3)
ZFIN ID | PValue | logFC | |
---|---|---|---|
0 | ZDB-GENE-030131-5654 | 0.792615 | 0.115009 |
1 | ZDB-GENE-130530-778 | 0.607285 | -0.144714 |
2 | ZDB-GENE-120823-1 | 0.021338 | 0.603871 |
In this run I set the keep_old_ids parameter to False; therefore, the original NCBI ID column is no longer included in the output.
Conclusion
This concludes the mapping tutorial! In summary, the key functions in this library for mapping are:
function | purpose |
---|---|
convert_ids | convert a list of Gene IDs |
add_mapped_column | add a converted list of Gene IDs to an existing DataFrame |
For more information about the full functionality of each function, please refer to the API Reference.