Updating the good old Homologene database

Homologene

I am the author of a small package called homologene. The package wraps the identically named homologene database released by NCBI. This database includes information about genes that are homologues/orthologues of each other. It is structured as a simple table. One can simply access it by doing

library(readr)
library(magrittr)
library(dplyr)
dir.create('homologene-files',showWarnings = FALSE)
download.file(url = "ftp://ftp.ncbi.nih.gov/pub/HomoloGene/current/homologene.data", 
              destfile = 'homologene-files/homologene.data')

homologene = read_tsv('homologene-files/homologene.data',
                      col_names = c('HID',
                                  'Taxonomy',
                                  'Gene.ID',
                                  'Gene.Symbol',
                                  'Protein.GI',
                                  'Protein.Accession'))

homologene
## # A tibble: 275,237 x 6
##      HID Taxonomy Gene.ID Gene.Symbol     Protein.GI Protein.Accession
##    <dbl>    <dbl>   <dbl> <chr>                <dbl> <chr>            
##  1     3     9606      34 ACADM              4557231 NP_000007.1      
##  2     3     9598  469356 ACADM            160961497 NP_001104286.1   
##  3     3     9544  705168 ACADM            109008502 XP_001101274.1   
##  4     3     9615  490207 ACADM            545503811 XP_005622188.1   
##  5     3     9913  505968 ACADM            115497690 NP_001068703.1   
##  6     3    10090   11364 Acadm              6680618 NP_031408.1      
##  7     3    10116   24158 Acadm            292494885 NP_058682.2      
##  8     3     7955  406283 acadm            390190229 NP_998175.2      
##  9     3     7227   38864 CG12262           24660351 NP_648149.1      
## 10     3     7165 1276346 AgaP_AGAP005662   58387602 XP_315683.2      
## # ... with 275,227 more rows

In the homologene package, a subset of this data is available for querying.

head(homologene::homologeneData)
##   HID Gene.Symbol Taxonomy Gene.ID
## 1   3      acdh-8     6239  173979
## 2   3      acdh-7     6239  181758
## 3   5     acdh-12     6239  174180
## 4   6       kat-1     6239  174161
## 5   6     T02G5.4     6239  187992
## 6  13     R04B3.2     6239  180550

At the time I removed the Protein.GI and Protein.Accession since my work had no use for them and limited the information to 7 popular model organisms to so that package wouldn’t be too bloated.

The package includes functions to translate orthologues between species with a one liner

homologene::homologene(c('ENO2','MOG','GZMH'),
                       inTax = 9606, # human
                       outTax = 10090) # mouse
##   9606 10090 9606_ID 10090_ID
## 1 ENO2  Eno2    2026    13807
## 2  MOG   Mog    4340    17441
## 3 GZMH  Gzmd    2999    14941
## 4 GZMH  Gzme    2999    14942
## 5 GZMH  Gzmg    2999    14944
## 6 GZMH  Gzmf    2999    14943

In general I like using this method because it’s fast, the syntax is simple and it’s easy to extend it at a whim. A single set of gene symbols and IDs are used to identify genes and organism level data is tied to taxonomy identifiers which makes things easier. The obvious alternative to this would be to use biomart1 which to me, is less intuitive and clunky.

What is wrong with homologene?

Not everyone is happy with using homologene for orthology detection. The database has not been updated since 2014. This means it is not created using the latest annotations. The reference genomes have been getting updates and gene symbols change frequently. For example here you can see a list of gene name changes that happened in a 15 day period at February 9th 2019.

That list above is generated by parsing gene_info.gz file provided by NCBI. I use this file to keep an up to date list of gene symbols and their synonyms. It comes in handy when I use data released by other researchers that only include gene symbols as identifiers. Using the data here, we can identify how many genes in homologene have changed their names, assuming the NCBI ids are the same. I will be using the data in the form it is made available in the geneSyonym package.

library(geneSynonym)
library(homologene)
library(purrr)

# get mouse gene IDs from the mouse synonym data shipped with geneSynonym
gene_IDs = names(syno10090)

# get mouse genes from the homologene data
homologene::homologeneData %>%
    filter(Gene.ID %in% gene_IDs) ->
    mouse_genes

# get fisrt names (official symbol) from the mouse synonym data
syno10090[as.character(mouse_genes$Gene.ID)] %>%
    strsplit('\\|') %>%
    map_chr(1) ->
    first_names

# set new names as a column
mouse_genes$new_names = first_names

mistmatched_new_names = 
    mouse_genes %>% filter(new_names != Gene.Symbol)

# how many mismatches
nrow(mistmatched_new_names)
## [1] 1418
# take a peek at the first 10
head(mistmatched_new_names, 10)
##     HID   Gene.Symbol Taxonomy   Gene.ID new_names
## 1    93        Epb4.2    10090     13828     Epb42
## 2   479          Daf2    10090     13137     Cd55b
## 3   660        Gm3934    10090 100042625   Gstp-ps
## 4   664       Gucy1b3    10090     54195   Gucy1b1
## 5   984         Clca3    10090     23844     Clca1
## 6  1137          Gbas    10090     14467  Nipsnap2
## 7  1185          Ict1    10090     68572    Mrpl58
## 8  1189 I830012O16Rik    10090    667370    Ifit3b
## 9  1194         Cyr61    10090     16007      Ccn1
## 10 1223        Adrbk1    10090    110355      Grk2

That is 1418 genes that have a different official symbol than they used to in 2014. And that is only in mouse. Considering this only corresponds to 6.7% of all gene symbols, this number isn’t too much but it does mean that we will be missing a few genes if we were using gene symbols for matching. Alas, since I wasn’t wise enough 4 years ago when I started my projects, there are parts of my pipelines that still rely on gene symbol matching.

Most common transformation I make is translating my lists of brain cell type markers that were created using mouse data2 to their human orthologues. I can see which markers are lost due to differences in gene symbols when making this transformation.

# this package includes my marker gene list
# github.com/pavlidisLab/markerGeneProfile
library(markerGeneProfile)
data("mouseMarkerGenesCombined")

mouseMarkerGenesCombined %>% 
    unlist %>%
    unique %>% 
    {.[. %in% mistmatched_new_names$new_names]} 
##  [1] "Adgrg1"   "Hacd2"    "Msmo1"    "Nat8f4"   "Pdzph1"   "Arhgap45"
##  [7] "Cd300c2"  "Fam91a1"  "Hacd4"    "Siglecf"  "Washc2"   "Adgre1"  
## [13] "Stk26"    "Vsir"     "Mcub"     "Plpp3"    "Patj"     "Scn2a"   
## [19] "Diaph2"   "Epb41l2"  "Grk2"     "Nat8f5"   "Naprt"    "Nat8f1"  
## [25] "Adgre5"   "Adgrf5"   "Adgrl4"   "Map3k20"  "Rflnb"    "Tns2"    
## [31] "Nectin3"  "Adgre4"   "Cntrl"    "Hcar2"    "Rp2"      "Spata46" 
## [37] "Cnmd"     "Plpp2"    "Ankub1"   "Mfsd13b"  "Phf24"    "Cbarp"   
## [43] "Cops9"    "Ints6l"   "Epb41l1"  "Adgrb1"

That is 46 genes which corresponds to a 1.8%. This is probably a more realistic image of real world implications since not all genes are equally important. Many genes that are prone to change are pseudogenes, non-coding transcripts etc. Such genes are less likely to come up in an analysis.

Updating homologene gene IDs

One small thing we can do to improve this situation is to manually replace the old gene symbols with new ones. Assuming the gene IDs are the same, we could just replace the old names with the new. However, gene IDs are also prone to changes3 and there is a whole different file that keeps track of these changes. We need to look into that file and update the gene IDs too.

download.file(url = "ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_history.gz", 
              destfile = 'homologene-files/gene_history.gz')

gene_history = read_tsv('homologene-files/gene_history.gz',
                        col_names = c('tax_id', 
                                      'GeneID',
                                      'Discontinued_GeneID',
                                      'Discontinued_Symbol',
                                      'Discontinue_Date'))

gene_history
## # A tibble: 11,165,348 x 5
##    tax_id GeneID Discontinued_GeneID Discontinued_Symbol Discontinue_Date
##     <dbl> <chr>                <dbl> <chr>                          <dbl>
##  1      9 -                  1246494 repA1                       20031113
##  2      9 -                  1246495 repA2                       20031113
##  3      9 -                  1246496 leuA                        20031113
##  4      9 -                  1246497 leuB                        20031113
##  5      9 -                  1246498 leuC                        20031113
##  6      9 -                  1246499 leuD                        20031113
##  7      9 -                  1246506 yqhA                        20031113
##  8      9 -                  1246507 repA2                       20031113
##  9      9 -                  1246508 repA1                       20031113
## 10     24 -                 24952829 jef                         20150716
## # ... with 11,165,338 more rows

One unverified assumption I am making is that once an ID is discontinued, it won’t be used for another gene. Let’s make sure that is actually the case before doing anything else

# combine all the modern IDs from the geneSynonym package (same species as the homologene package)
modern_IDs  = list(syno10090,
                   syno10116,
                   syno6239,
                   syno7227,
                   syno7955,
                   syno9544,
                   syno9606) %>%
    lapply(names) %>%
    do.call(c,.)

gene_history %>% filter(Discontinued_GeneID %in% modern_IDs)
## # A tibble: 0 x 5
## # ... with 5 variables: tax_id <dbl>, GeneID <chr>,
## #   Discontinued_GeneID <dbl>, Discontinued_Symbol <chr>,
## #   Discontinue_Date <dbl>

No current ID was ever discontinued. That’s a good sign. I can just trace the history of discontinued IDs and find the current IDs. Note that when I first tried this here were 20 IDs here. 20 IDs that were discontinued since I last updated the geneSynonym package. Since all of these files area updated nightly it is important to update the databases at the same time.

discontinued_ids = homologeneData %>% 
    filter(Gene.ID %in% gene_history$Discontinued_GeneID)

dim(discontinued_ids)
## [1] 3169    4
unchanged_ids = homologeneData %>%  
    filter(!Gene.ID %in% gene_history$Discontinued_GeneID)

It’s good to see that only a relatively small number of IDs were discontinued. We can trace their history to get their current IDs (if it exists). To make this process faster, we will also take a relevant subset of the gene_history frame.

# get the earliest date where we see one of our ID change events from homologene
# since homologene was updated in 2014, this should be no earlier that that date
earlierst_date = gene_history %>%
    filter(Discontinued_GeneID %in% homologeneData$Gene.ID) %$%
    Discontinue_Date %>% 
    min

earlierst_date
## [1] 20100919

This is interesting. Why does homologene includes gene IDs that were discontinued 4 years before it’s creation? How many such IDs are out there

# get all discontinuation events earlier than 2014. Events are coded as integers
# in YYYYMMDD format
gene_history %>%
    filter(Discontinue_Date < 20140000 & Discontinued_GeneID %in% homologeneData$Gene.ID)
## # A tibble: 9 x 5
##   tax_id GeneID    Discontinued_GeneID Discontinued_Symbol Discontinue_Date
##    <dbl> <chr>                   <dbl> <chr>                          <dbl>
## 1   9544 700788                 702240 DOCK1                       20100919
## 2   9544 -                      707717 LOC707717                   20110304
## 3   9544 -                      708767 TNNI3K                      20110308
## 4   9544 100423619              711097 MAP4                        20130430
## 5   9544 -                      714619 LOC714619                   20110326
## 6   9544 -                      714812 MATR3                       20110401
## 7   9544 -                      715750 LOC715750                   20110325
## 8   9544 -                      720334 MEF2B                       20110329
## 9   9544 712686              100424395 LOC100424395                20100928

Ok it’s not so bad. Not sure what’s the reason behind this but it doesn’t seem like a catastrophic failure.

So now we can filter the gene_history to only include species that we are interested in and events that happened after the earliest discontinuation date.

relevant_gene_history = gene_history %>% filter(Discontinue_Date >= earlierst_date & 
                                                    tax_id %in% homologene::taxData$tax_id)

dim(relevant_gene_history)
## [1] 96776     5

That’s a much smaller search space. Now we can just trace the history of every single gene with a different ID than before. I will probably end up including a version of this function to the geneSynonym package as pretty much all gene lists are at least a little out of date.

traceID = function(id){
    print(id)
    event = relevant_gene_history %>% filter(Discontinued_GeneID == id)
    if(nrow(event)>1){
        # just in case. if the same ID is discontinued twice, there is a problem...
        return("multiple events")
    }
    while(TRUE){
        # see if the new ID is discontinued as well
        next_event = relevant_gene_history %>%
            filter(Discontinued_GeneID == event$GeneID)
        if(nrow(next_event)==0){
            # if not, previous ID is the right one
            return(event$GeneID)
        } else if(nrow(next_event)>1){
            # just in case, if the same ID is discontinued twice, there is a problem...
            return("multiple events")
        } else if(nrow(next_event) == 1){
            # if the new IDs is discontinued, continue the loop and check if it has a parent
            event = next_event
        }
    }
}
discontinued_ids$Gene.ID %>%
    sapply(traceID) ->
    new_ids

Not the most efficient code probably but it’ll be run as a CRON job in the office machine at around 6 am on Sundays so I don’t really care that much about efficiency.

Lets do a few sanity checks

# are there any ids that had multiple events
any(new_ids == "multiple events")
## [1] FALSE
# are there any new ids that the geneSynonym package doesn't know about?
all(new_ids[new_ids != '-'] %in% modern_IDs)
## [1] TRUE
# how many of the discontinued IDs end up having new IDs instead of getting removed
length(new_ids[new_ids != '-'])
## [1] 1401
# how many IDs are lost forever
length(new_ids[new_ids == '-'])
## [1] 1768

Good.

Updating homologene gene symbols

Since we have the new ids now, we can start building our new homologeneData2 frame that will be added to the homologene package. I don’t want to overwrite the original data because people might use it expecting a perfect match with the NCBI database. Also if I messed up here I don’t want people to suffer without explicitly choosing to trust me instead of the actual database.

# create a frame with new ids
discontinued_fix = data.frame(HID = discontinued_ids$HID,
                              Gene.Symbol = discontinued_ids$Gene.Symbol,
                              Taxonomy = discontinued_ids$Taxonomy,
                              Gene.ID = new_ids,
                              stringsAsFactors = FALSE)

# remove symbols that are discontinued
discontinued_fix %<>% filter(Gene.ID != '-')

head(discontinued_fix)
##   HID  Gene.Symbol Taxonomy Gene.ID
## 1 194 LOC100538014     7955  445292
## 2 402 LOC100535497     7955  550518
## 3 522        ltbp1     7955  562072
## 4 592    LOC563696     7955  556619
## 5 595 LOC101882975     7955  561906
## 6 639 LOC100537604     7955  557240

Before going further let’s make sure we aren’t doing anything crazy by peeking at the NCBI website for the id of Itbp1 gene we see there.

All looks fine, so we can put everything together and replace the gene symbols as well

# recombine the two lists sort by homology group
homologeneData2 = 
    rbind(discontinued_fix,unchanged_ids) %>% 
    arrange(HID)

# change the names with the new names
modern_symbols = list(syno10090,
                      syno10116,
                      syno6239,
                      syno7227,
                      syno7955,
                      syno9544,
                      syno9606) %>% 
    lapply(function(x){
        strsplit(x,split = "\\|") %>% map_chr(1)
    }) %>% do.call(c,.)

modern_frame = tibble(modern_IDs,
                      modern_symbols)

new_symbols = 
    modern_frame$modern_symbols[match(homologeneData2$Gene.ID, modern_frame$modern_IDs)]

# a sanity check. all genes with different symbols
sum(homologeneData2$Gene.Symbol != new_symbols)
## [1] 14964
# another sanity check, differences in mouse. 
# we did this in the beginningas well
# note that the number is higher than before 
# since the new IDs allow more genes to be changed
sum((homologeneData2$Gene.Symbol != new_symbols)[homologeneData2$Taxonomy ==10090])
## [1] 1443
homologeneData2 %<>% 
    mutate(Gene.Symbol = modern_frame$modern_symbols[match(Gene.ID,modern_frame$modern_IDs)])

head(homologeneData2)
##   HID Gene.Symbol Taxonomy Gene.ID
## 1   3      acdh-8     6239  173979
## 2   3      acdh-7     6239  181758
## 3   3     CG12262     7227   38864
## 4   3       acadm     7955  406283
## 5   3       ACADM     9544  705168
## 6   3       ACADM     9606      34

The new homologene dataset and associated functions to use them should be on Github in a few days. Then I will set up a CRON job to update it weekly. The CRAN version will have to be perpetually out of date which is somewhat problematic, but it can’t be helped. I may include a function that builds the current version and let users specify the version they want to use. The new syntax will probably look like this:

# translate using the old database
homologene::homologene(c('ENO2','MOG','GZMH'),
                       inTax = 9606,
                       outTax = 10090) 


# translate using the new database
homologene::homologene(c('ENO2','MOG','GZMH'),
                       inTax = 9606,
                       outTax = 10090,
                       db = homologene2) 

# translate using the latest database
homologene::update_homologene(path)
homologene::homologene(c('ENO2','MOG','GZMH'),
                       inTax = 9606,
                       outTax = 10090,
                       db = path)

Before the next CRAN update I want to include a wrapper for biomaRt that uses the same syntax as homologene to make queries in ensembl so that’ll likely be the next post.

Bibliography

1 Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A et al. BioMart and bioconductor: A powerful link between biological databases and microarray data analysis. Bioinformatics 2005; 21: 3439–3440.

2 Mancarci BO, Toker L, Tripathy SJ, Li B, Rocco B, Sibille E et al. Cross-laboratory analysis of brain cell type transcriptomes with applications to interpretation of bulk tissue data. eNeuro 2017;: ENEURO–0212.

3 Information NC for B, Pike USNL of M8R, MD B, Usa 2. Gene Frequently Asked Questions. National Center for Biotechnology Information (US), 2018.