Retrieving Orthologs

Rohit Satyam

King Abdullah University of Science & Technology, Saudi Arabia

Alberto Maillo

King Abdullah University of Science & Technology, Saudi Arabia

David Gomez-Cabrero

King Abdullah University of Science & Technology, Saudi Arabia

Arnab Pain

King Abdullah University of Science & Technology, Saudi Arabia

22 August, 2025

Source: vignettes/retrieving_orthologs.Rmd

retrieving_orthologs.Rmd

Abstract

This is a quick and dirty tutorial on how to fetch orthologs from OrthoMCL and InparanoiDB databases.

Introduction

Users might require Ortholog mapping to:

to assess phyletic profile of a set of genes and see if they have orthologs in other related organism or not.
integrate datasets obtained from two different species. Often such datasets are scRNASeq (eg: (Tebben, Dia, and Serre 2022))
borrow annotations such as localization for hypothesis generation.
perform co-expression analysis between two different species.

Whatever the use case maybe, we will see how we can exploit getpairedOrthologs() to fetch orthologs of multiple organisms of interest in batch manner and integrate them.

# Load package and some other useful packages by using
suppressPackageStartupMessages(
  suppressWarnings({
    library(plasmoRUtils)
    library(dplyr)
    library(plyr)}))

Here, we will try to get all Plasmodium falciparum 3D7 (1742855) orthologs present in Toxoplasma gondii ME49 (1747281), Plasmodium berghei ANKA (1742951) and Plasmodium vivax Sal1(1744988). The numbers mentioned alongside are vocabulary or unique IDs that OrthoMCL uses to fetch paired orthologs.

#?getpairedOrthologs
# listOrthomcl()

res <- lapply(c(1744988,1747281,1742951), function(x){
  getpairedOrthologs(from = 1742855,
                     to=x,
                     db="orthomcl",transform = FALSE)
}) %>% setNames(c("Pvivax","Tgme49","Pb"))

merged_df <- Reduce(function(x, y) merge(x, y, by = "Accession", all = TRUE), res)
merged_df %>% head()
#>            Accession
#> 1 pfal|PF3D7_0100100
#> 2 pfal|PF3D7_0100300
#> 3 pfal|PF3D7_0102500
#> 4 pfal|PF3D7_0102600
#> 5 pfal|PF3D7_0102700
#> 6 pfal|PF3D7_0102800
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Target ID.x
#> 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 pviv|PVX_110810
#> 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 pviv|PVX_110810
#> 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 pviv|PVX_110810
#> 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 pviv|PVX_088265
#> 5 pviv|PVX_112665,pviv|PVX_090250,pviv|PVX_097577,pviv|PVX_094305,pviv|PVX_101510,pviv|PVX_002500,pviv|PVX_090275,pviv|PVX_092995,pviv|PVX_090270,pviv|PVX_112675,pviv|PVX_096950,pviv|PVX_125730,pviv|PVX_101525,pviv|PVX_112670,pviv|PVX_083550,pviv|PVX_088825,pviv|PVX_092990,pviv|PVX_115465,pviv|PVX_090260,pviv|PVX_112690,pviv|PVX_088810,pviv|PVX_121897,pviv|PVX_090255,pviv|PVX_101515,pviv|PVX_112705,pviv|PVX_088850,pviv|PVX_112660,pviv|PVX_125728,pviv|PVX_109280,pviv|PVX_088820,pviv|PVX_097575,pviv|PVX_096995,pviv|PVX_090265,pviv|PVX_112655
#> 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 pviv|PVX_081615
#>          Target ID.y
#> 1               <NA>
#> 2               <NA>
#> 3               <NA>
#> 4 tgon|TGME49_289050
#> 5               <NA>
#> 6               <NA>
#>                                                                                                                 Target ID
#> 1                                                                                                     pber|PBANKA_1332700
#> 2                                                                                                     pber|PBANKA_1332700
#> 3                                                                                                     pber|PBANKA_1332700
#> 4                                                                                                     pber|PBANKA_1225000
#> 5 pber|PBANKA_0623100,pber|PBANKA_1146100,pber|PBANKA_0524300,pber|PBANKA_0524400,pber|PBANKA_1146000,pber|PBANKA_0623300
#> 6                                                                                                     pber|PBANKA_0210300

## Tidying up the dataframe by removing organisms prefixes
strip_prefix <- function(x) sub("^[^|]+\\|", "", x)     # per-token
clean_list   <- function(s) {
    toks <- trimws(strsplit(s, ",", fixed = TRUE)[[1]])
    paste(vapply(toks, strip_prefix, character(1)), collapse = ",")
}

merged_df[] <- lapply(merged_df, function(col) vapply(col, clean_list, character(1)))

## Changing column names
colnames(merged_df) <- c("P falciparum 3D7","P vivax Sal1","T gondii ME49","Plasmodium berghei ANKA")

merged_df %>% head()
#>   P falciparum 3D7
#> 1    PF3D7_0100100
#> 2    PF3D7_0100300
#> 3    PF3D7_0102500
#> 4    PF3D7_0102600
#> 5    PF3D7_0102700
#> 6    PF3D7_0102800
#>                                                                                                                                                                                                                                                                                                                                                                            P vivax Sal1
#> 1                                                                                                                                                                                                                                                                                                                                                                            PVX_110810
#> 2                                                                                                                                                                                                                                                                                                                                                                            PVX_110810
#> 3                                                                                                                                                                                                                                                                                                                                                                            PVX_110810
#> 4                                                                                                                                                                                                                                                                                                                                                                            PVX_088265
#> 5 PVX_112665,PVX_090250,PVX_097577,PVX_094305,PVX_101510,PVX_002500,PVX_090275,PVX_092995,PVX_090270,PVX_112675,PVX_096950,PVX_125730,PVX_101525,PVX_112670,PVX_083550,PVX_088825,PVX_092990,PVX_115465,PVX_090260,PVX_112690,PVX_088810,PVX_121897,PVX_090255,PVX_101515,PVX_112705,PVX_088850,PVX_112660,PVX_125728,PVX_109280,PVX_088820,PVX_097575,PVX_096995,PVX_090265,PVX_112655
#> 6                                                                                                                                                                                                                                                                                                                                                                            PVX_081615
#>   T gondii ME49
#> 1            NA
#> 2            NA
#> 3            NA
#> 4 TGME49_289050
#> 5            NA
#> 6            NA
#>                                                                     Plasmodium berghei ANKA
#> 1                                                                            PBANKA_1332700
#> 2                                                                            PBANKA_1332700
#> 3                                                                            PBANKA_1332700
#> 4                                                                            PBANKA_1225000
#> 5 PBANKA_0623100,PBANKA_1146100,PBANKA_0524300,PBANKA_0524400,PBANKA_1146000,PBANKA_0623300
#> 6                                                                            PBANKA_0210300

Since OrthoMCL and other VEuPathDB database are not updated simultaneously, chances are that you might be using old OrthoMCL ID. If that’s the case above function will not be much helpful. The workaround is to fetch the old OrthoMCL IDs for each organisms and their respective databases and use it as an anchor to combine and collapse gene IDs from all the organisms. The code snippet below demonstrate the same.

## List all organisms
list <- listVeupathdb(customFields = c("primary_key","project_id","species",'species_ncbi_tax_id'))

## Subset organisms I am interested in 
dbs <- list[grep("3D7|ME49$|Sal-1", list$Organism),]
dbs
#> # A tibble: 3 × 4
#>   Organism                  `VEuPathDB Project` Species   Species NCBI taxon I…¹
#>   <chr>                     <chr>               <chr>                      <dbl>
#> 1 Plasmodium vivax Sal-1    PlasmoDB            Plasmodi…                   5855
#> 2 Plasmodium falciparum 3D7 PlasmoDB            Plasmodi…                   5833
#> 3 Toxoplasma gondii ME49    ToxoDB              Toxoplas…                   5811
#> # ℹ abbreviated name: ¹`Species NCBI taxon ID`
df <- lapply(1:nrow(dbs), function(x){
  plasmoRUtils::getTable(org=dbs[x,]$Organism, db=tolower(dbs[x,]$`VEuPathDB Project`),customFields = c("primary_key" ,"gene_orthomcl_name"))
})

## Retain protein coding genes

df2 <- lapply(df, function(x){
  x[!(stringr::str_detect(pattern = "N/A",string = x$`Ortholog Group`)),]
})

## Combine all the tables
merged_df2 <- Reduce(function(x, y) merge(x, y, by = "Ortholog Group", all = TRUE), df2)

merged_df2 %>% head(n = 10)
#>    Ortholog Group  Gene ID.x     Gene ID.y       Gene ID
#> 1      OG6_100000       <NA>          <NA> TGME49_231880
#> 2      OG6_100004 PVX_086920          <NA>          <NA>
#> 3      OG6_100006 PVX_083360 PF3D7_1349300 TGME49_237210
#> 4      OG6_100006 PVX_118355 PF3D7_1349300 TGME49_237210
#> 5      OG6_100007 PVX_001655          <NA>          <NA>
#> 6      OG6_100012 PVX_001960 PF3D7_1021900 TGME49_261590
#> 7      OG6_100012 PVX_001960 PF3D7_1021900 TGME49_308810
#> 8      OG6_100012 PVX_001960 PF3D7_1423100 TGME49_261590
#> 9      OG6_100012 PVX_001960 PF3D7_1423100 TGME49_308810
#> 10     OG6_100012 PVX_085325 PF3D7_1021900 TGME49_261590
## Combining all the gene IDs in each column so that we have 1 row per orthogroup
merged_df2 <- merged_df2 %>% 
  dplyr::group_by( `Ortholog Group`) %>% 
  dplyr::summarise(dplyr::across(dplyr::everything(), 
                                 ~paste(unique(.x), collapse = ",")), .groups = "drop")

## Changing column names
colnames(merged_df2) <- c("Orthogroup ID","P vivax Sal1","P falciparum 3D7","T gondii ME49")
merged_df2 %>% head(n = 10)
#> # A tibble: 10 × 4
#>    `Orthogroup ID` `P vivax Sal1`             `P falciparum 3D7` `T gondii ME49`
#>    <chr>           <chr>                      <chr>              <chr>          
#>  1 OG6_100000      NA                         NA                 TGME49_231880  
#>  2 OG6_100004      PVX_086920                 NA                 NA             
#>  3 OG6_100006      PVX_083360,PVX_118355      PF3D7_1349300      TGME49_237210  
#>  4 OG6_100007      PVX_001655                 NA                 NA             
#>  5 OG6_100012      PVX_001960,PVX_085325      PF3D7_1021900,PF3… TGME49_261590,…
#>  6 OG6_100025      PVX_081792,PVX_005565,PVX… PF3D7_1465800,PF3… TGME49_294550,…
#>  7 OG6_100026      PVX_096045,PVX_124085,PVX… PF3D7_1229100,PF3… TGME49_320460  
#>  8 OG6_100029      NA                         NA                 TGME49_207210  
#>  9 OG6_100034      PVX_089960                 NA                 NA             
#> 10 OG6_100045      PVX_095250                 PF3D7_0319700      NA

Fetching orthologs from InParanoiDB

InParanoiDB uses NCBI taxon IDs as unique identifiers. For P. falciparum (36329), T. gondii (5811), P. vivax (126793) and P. berghei (5823), we will fetch pairwise orthologs by keeping P. falciparum as constant species. We can recycle the code above as follows.

#?getpairedOrthologs
# listipdb()

res <- lapply(c(126793,5811,5823), function(x){
  getpairedOrthologs(from = 36329,
                     to=x,
                     db="ipdb",transform = TRUE)
}) %>% setNames(c("Pvivax","Tgme49","Pb"))

res[[1]] %>% head()
#> # A tibble: 6 × 11
#>   `query_Group-id` query_Bitscore query_Species           query_Inparalog-scor…¹
#>              <dbl> <chr>          <chr>                   <chr>                 
#> 1                1 8359           Plasmodium falciparum … 1                     
#> 2                2 8119           Plasmodium falciparum … 1                     
#> 3                3 7587           Plasmodium falciparum … 1                     
#> 4                4 7155           Plasmodium falciparum … 1                     
#> 5                5 6891           Plasmodium falciparum … 1                     
#> 6                6 6507           Plasmodium falciparum … 1                     
#> # ℹ abbreviated name: ¹`query_Inparalog-score`
#> # ℹ 7 more variables: `query_Protein-name` <chr>, `query_Seed-score` <chr>,
#> #   target_Bitscore <chr>, target_Species <chr>,
#> #   `target_Inparalog-score` <chr>, `target_Protein-name` <chr>,
#> #   `target_Seed-score` <chr>

Combining the results across InParanoiDB, is not as straightforward as for OrthoMCL because of lack of unique Orthogroup ID. However, user can use Query Uniprot ID as anchor, combine the results, convert Uniprot IDs to gene IDs and then collapse rows since two or more Uniprot IDs might correspond to same gene IDs.

Session Info

utils::sessionInfo()
#> R version 4.4.1 (2024-06-14 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
#> [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
#> [5] LC_TIME=English_India.utf8    
#> 
#> time zone: Asia/Riyadh
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] plyr_1.8.9         dplyr_1.1.4        plasmoRUtils_1.1.0 rlang_1.1.6       
#> [5] readr_2.1.5        janitor_2.2.1      BiocStyle_2.32.1  
#> 
#> loaded via a namespace (and not attached):
#>   [1] segmented_2.1-4             fs_1.6.6                   
#>   [3] ProtGenerics_1.36.0         matrixStats_1.5.0          
#>   [5] bitops_1.0-9                lubridate_1.9.4            
#>   [7] pRoloc_1.44.1               httr_1.4.7                 
#>   [9] RColorBrewer_1.1-3          doParallel_1.0.17          
#>  [11] ggsci_3.2.0                 tools_4.4.1                
#>  [13] MSnbase_2.30.1              backports_1.5.0            
#>  [15] utf8_1.2.6                  R6_2.6.1                   
#>  [17] lazyeval_0.2.2              withr_3.0.2                
#>  [19] prettyunits_1.2.0           gridExtra_2.3              
#>  [21] preprocessCore_1.66.0       cli_3.6.5                  
#>  [23] Biobase_2.64.0              textshaping_1.0.1          
#>  [25] gt_1.0.0                    sass_0.4.10                
#>  [27] topGO_2.56.0                mvtnorm_1.3-3              
#>  [29] randomForest_4.7-1.2        proxy_0.4-27               
#>  [31] pkgdown_2.1.3               Rsamtools_2.20.0           
#>  [33] systemfonts_1.2.3           txdbmaker_1.0.1            
#>  [35] AnnotationForge_1.46.0      dichromat_2.0-0.1          
#>  [37] parallelly_1.45.1           limma_3.60.6               
#>  [39] rstudioapi_0.17.1           impute_1.78.0              
#>  [41] RSQLite_2.4.1               FNN_1.1.4.1                
#>  [43] generics_0.1.4              BiocIO_1.14.0              
#>  [45] vroom_1.6.5                 gtools_3.9.5               
#>  [47] car_3.1-3                   dendextend_1.19.1          
#>  [49] GO.db_3.19.1                Matrix_1.7-1               
#>  [51] MALDIquant_1.22.3           drawProteins_1.24.0        
#>  [53] S4Vectors_0.42.1            abind_1.4-8                
#>  [55] lifecycle_1.0.4             yaml_2.3.10                
#>  [57] snakecase_0.11.1            carData_3.0-5              
#>  [59] SummarizedExperiment_1.34.0 recipes_1.3.1              
#>  [61] SparseArray_1.4.8           BiocFileCache_2.12.0       
#>  [63] grid_4.4.1                  blob_1.2.4                 
#>  [65] promises_1.3.3              crayon_1.5.3               
#>  [67] PSMatch_1.8.0               lattice_0.22-6             
#>  [69] beachmat_2.20.0             annotate_1.82.0            
#>  [71] GenomicFeatures_1.56.0      chromote_0.5.1             
#>  [73] mzR_2.38.0                  KEGGREST_1.44.1            
#>  [75] pillar_1.11.0               knitr_1.50                 
#>  [77] GenomicRanges_1.56.2        rjson_0.2.23               
#>  [79] lpSolve_5.6.23              future.apply_1.20.0        
#>  [81] codetools_0.2-20            mgsub_1.7.3                
#>  [83] glue_1.8.0                  pcaMethods_1.96.0          
#>  [85] data.table_1.17.8           MultiAssayExperiment_1.30.3
#>  [87] vctrs_0.6.5                 png_0.1-8                  
#>  [89] gtable_0.3.6                kernlab_0.9-33             
#>  [91] cachem_1.1.0                gower_1.0.2                
#>  [93] xfun_0.52                   prodlim_2025.04.28         
#>  [95] S4Arrays_1.4.1              polyglotr_1.7.0            
#>  [97] coda_0.19-4.1               survival_3.8-3             
#>  [99] ncdf4_1.24                  timeDate_4041.110          
#> [101] SingleCellExperiment_1.26.0 iterators_1.0.14           
#> [103] hardhat_1.4.1               lava_1.8.1                 
#> [105] statmod_1.5.0               MLInterfaces_1.84.0        
#> [107] ipred_0.9-15                nlme_3.1-166               
#> [109] bit64_4.6.0-1               progress_1.2.3             
#> [111] filelock_1.0.3              LaplacesDemon_16.1.6       
#> [113] GenomeInfoDb_1.40.1         bslib_0.9.0                
#> [115] affyio_1.74.0               irlba_2.3.5.1              
#> [117] rpart_4.1.23                colorspace_2.1-1           
#> [119] BiocGenerics_0.50.0         DBI_1.2.3                  
#> [121] nnet_7.3-19                 tidyselect_1.2.1           
#> [123] processx_3.8.6              bit_4.6.0                  
#> [125] compiler_4.4.1              curl_6.4.0                 
#> [127] rvest_1.0.4                 httr2_1.2.0                
#> [129] graph_1.82.0                SparseM_1.84-2             
#> [131] xml2_1.3.8                  desc_1.4.3                 
#> [133] DelayedArray_0.30.1         plotly_4.11.0              
#> [135] bookdown_0.43               rtracklayer_1.64.0         
#> [137] scales_1.4.0                hexbin_1.28.5              
#> [139] affy_1.82.0                 rappdirs_0.3.3             
#> [141] stringr_1.5.1               digest_0.6.37              
#> [143] mixtools_2.0.0.1            rmarkdown_2.29             
#> [145] XVector_0.44.0              htmltools_0.5.8.1          
#> [147] pkgconfig_2.0.3             SingleR_2.6.0              
#> [149] sparseMatrixStats_1.16.0    MatrixGenerics_1.16.0      
#> [151] dbplyr_2.5.0                fastmap_1.2.0              
#> [153] htmlwidgets_1.6.4           UCSC.utils_1.0.0           
#> [155] DelayedMatrixStats_1.26.0   farver_2.1.2               
#> [157] jquerylib_0.1.4             jsonlite_2.0.0             
#> [159] mclust_6.1.1                BiocParallel_1.38.0        
#> [161] mzID_1.42.0                 ModelMetrics_1.2.2.2       
#> [163] BiocSingular_1.20.0         RCurl_1.98-1.17            
#> [165] magrittr_2.0.3              scuttle_1.14.0             
#> [167] Formula_1.2-5               GenomeInfoDbData_1.2.12    
#> [169] Rcpp_1.1.0                  viridis_0.6.5              
#> [171] MsCoreUtils_1.16.1          vsn_3.72.0                 
#> [173] pROC_1.18.5                 stringi_1.8.7              
#> [175] zlibbioc_1.50.0             MASS_7.3-61                
#> [177] listenv_0.9.1               parallel_4.4.1             
#> [179] Biostrings_2.72.1           splines_4.4.1              
#> [181] hms_1.1.3                   ps_1.9.1                   
#> [183] igraph_2.1.4                ggpubr_0.6.1               
#> [185] QFeatures_1.14.2            ggsignif_0.6.4             
#> [187] reshape2_1.4.4              biomaRt_2.60.1             
#> [189] stats4_4.4.1                ScaledMatrix_1.12.0        
#> [191] XML_3.99-0.18               evaluate_1.0.4             
#> [193] BiocManager_1.30.26         tzdb_0.5.0                 
#> [195] foreach_1.5.2               tidyr_1.3.1                
#> [197] purrr_1.1.0                 future_1.67.0              
#> [199] clue_0.3-66                 bio3d_2.4-5                
#> [201] ggplot2_3.5.2               rsvd_1.0.5                 
#> [203] xtable_1.8-4                broom_1.0.8                
#> [205] restfulr_0.0.16             AnnotationFilter_1.28.0    
#> [207] easyPubMed_2.13             e1071_1.7-16               
#> [209] rstatix_0.7.2               later_1.4.2                
#> [211] class_7.3-22                viridisLite_0.4.2          
#> [213] ragg_1.4.0                  tibble_3.3.0               
#> [215] websocket_1.4.4             memoise_2.0.1              
#> [217] AnnotationDbi_1.66.0        GenomicAlignments_1.40.0   
#> [219] IRanges_2.38.1              cluster_2.1.8              
#> [221] globals_0.18.0              timechange_0.3.0           
#> [223] caret_7.0-1                 sampling_2.11

References

Tebben, Kieran, Aliou Dia, and David Serre. 2022. “Determination of the Stage Composition of Plasmodium Infections from Bulk Gene Expression Data.” Edited by Charles R. Langelier. mSystems 7 (4). https://doi.org/10.1128/msystems.00258-22.