
Retrieving Orthologs
Rohit Satyam
King Abdullah University of Science & Technology, Saudi ArabiaAlberto Maillo
King Abdullah University of Science & Technology, Saudi ArabiaDavid Gomez-Cabrero
King Abdullah University of Science & Technology, Saudi ArabiaArnab Pain
King Abdullah University of Science & Technology, Saudi Arabia22 August, 2025
Source:vignettes/retrieving_orthologs.Rmd
retrieving_orthologs.Rmd
Abstract
This is a quick and dirty tutorial on how to fetch orthologs from OrthoMCL and InparanoiDB databases.
Introduction
Users might require Ortholog mapping to:
to assess phyletic profile of a set of genes and see if they have orthologs in other related organism or not.
integrate datasets obtained from two different species. Often such datasets are scRNASeq (eg: (Tebben, Dia, and Serre 2022))
borrow annotations such as localization for hypothesis generation.
-
perform co-expression analysis between two different species.
Whatever the use case maybe, we will see how we can exploit
getpairedOrthologs()
to fetch orthologs of multiple organisms of interest in batch manner and integrate them.
# Load package and some other useful packages by using
suppressPackageStartupMessages(
suppressWarnings({
library(plasmoRUtils)
library(dplyr)
library(plyr)}))
Here, we will try to get all Plasmodium falciparum 3D7 (1742855) orthologs present in Toxoplasma gondii ME49 (1747281), Plasmodium berghei ANKA (1742951) and Plasmodium vivax Sal1(1744988). The numbers mentioned alongside are vocabulary or unique IDs that OrthoMCL uses to fetch paired orthologs.
#?getpairedOrthologs
# listOrthomcl()
res <- lapply(c(1744988,1747281,1742951), function(x){
getpairedOrthologs(from = 1742855,
to=x,
db="orthomcl",transform = FALSE)
}) %>% setNames(c("Pvivax","Tgme49","Pb"))
merged_df <- Reduce(function(x, y) merge(x, y, by = "Accession", all = TRUE), res)
merged_df %>% head()
#> Accession
#> 1 pfal|PF3D7_0100100
#> 2 pfal|PF3D7_0100300
#> 3 pfal|PF3D7_0102500
#> 4 pfal|PF3D7_0102600
#> 5 pfal|PF3D7_0102700
#> 6 pfal|PF3D7_0102800
#> Target ID.x
#> 1 pviv|PVX_110810
#> 2 pviv|PVX_110810
#> 3 pviv|PVX_110810
#> 4 pviv|PVX_088265
#> 5 pviv|PVX_112665,pviv|PVX_090250,pviv|PVX_097577,pviv|PVX_094305,pviv|PVX_101510,pviv|PVX_002500,pviv|PVX_090275,pviv|PVX_092995,pviv|PVX_090270,pviv|PVX_112675,pviv|PVX_096950,pviv|PVX_125730,pviv|PVX_101525,pviv|PVX_112670,pviv|PVX_083550,pviv|PVX_088825,pviv|PVX_092990,pviv|PVX_115465,pviv|PVX_090260,pviv|PVX_112690,pviv|PVX_088810,pviv|PVX_121897,pviv|PVX_090255,pviv|PVX_101515,pviv|PVX_112705,pviv|PVX_088850,pviv|PVX_112660,pviv|PVX_125728,pviv|PVX_109280,pviv|PVX_088820,pviv|PVX_097575,pviv|PVX_096995,pviv|PVX_090265,pviv|PVX_112655
#> 6 pviv|PVX_081615
#> Target ID.y
#> 1 <NA>
#> 2 <NA>
#> 3 <NA>
#> 4 tgon|TGME49_289050
#> 5 <NA>
#> 6 <NA>
#> Target ID
#> 1 pber|PBANKA_1332700
#> 2 pber|PBANKA_1332700
#> 3 pber|PBANKA_1332700
#> 4 pber|PBANKA_1225000
#> 5 pber|PBANKA_0623100,pber|PBANKA_1146100,pber|PBANKA_0524300,pber|PBANKA_0524400,pber|PBANKA_1146000,pber|PBANKA_0623300
#> 6 pber|PBANKA_0210300
## Tidying up the dataframe by removing organisms prefixes
strip_prefix <- function(x) sub("^[^|]+\\|", "", x) # per-token
clean_list <- function(s) {
toks <- trimws(strsplit(s, ",", fixed = TRUE)[[1]])
paste(vapply(toks, strip_prefix, character(1)), collapse = ",")
}
merged_df[] <- lapply(merged_df, function(col) vapply(col, clean_list, character(1)))
## Changing column names
colnames(merged_df) <- c("P falciparum 3D7","P vivax Sal1","T gondii ME49","Plasmodium berghei ANKA")
merged_df %>% head()
#> P falciparum 3D7
#> 1 PF3D7_0100100
#> 2 PF3D7_0100300
#> 3 PF3D7_0102500
#> 4 PF3D7_0102600
#> 5 PF3D7_0102700
#> 6 PF3D7_0102800
#> P vivax Sal1
#> 1 PVX_110810
#> 2 PVX_110810
#> 3 PVX_110810
#> 4 PVX_088265
#> 5 PVX_112665,PVX_090250,PVX_097577,PVX_094305,PVX_101510,PVX_002500,PVX_090275,PVX_092995,PVX_090270,PVX_112675,PVX_096950,PVX_125730,PVX_101525,PVX_112670,PVX_083550,PVX_088825,PVX_092990,PVX_115465,PVX_090260,PVX_112690,PVX_088810,PVX_121897,PVX_090255,PVX_101515,PVX_112705,PVX_088850,PVX_112660,PVX_125728,PVX_109280,PVX_088820,PVX_097575,PVX_096995,PVX_090265,PVX_112655
#> 6 PVX_081615
#> T gondii ME49
#> 1 NA
#> 2 NA
#> 3 NA
#> 4 TGME49_289050
#> 5 NA
#> 6 NA
#> Plasmodium berghei ANKA
#> 1 PBANKA_1332700
#> 2 PBANKA_1332700
#> 3 PBANKA_1332700
#> 4 PBANKA_1225000
#> 5 PBANKA_0623100,PBANKA_1146100,PBANKA_0524300,PBANKA_0524400,PBANKA_1146000,PBANKA_0623300
#> 6 PBANKA_0210300
Since OrthoMCL and other VEuPathDB database are not updated simultaneously, chances are that you might be using old OrthoMCL ID. If that’s the case above function will not be much helpful. The workaround is to fetch the old OrthoMCL IDs for each organisms and their respective databases and use it as an anchor to combine and collapse gene IDs from all the organisms. The code snippet below demonstrate the same.
## List all organisms
list <- listVeupathdb(customFields = c("primary_key","project_id","species",'species_ncbi_tax_id'))
## Subset organisms I am interested in
dbs <- list[grep("3D7|ME49$|Sal-1", list$Organism),]
dbs
#> # A tibble: 3 × 4
#> Organism `VEuPathDB Project` Species Species NCBI taxon I…¹
#> <chr> <chr> <chr> <dbl>
#> 1 Plasmodium vivax Sal-1 PlasmoDB Plasmodi… 5855
#> 2 Plasmodium falciparum 3D7 PlasmoDB Plasmodi… 5833
#> 3 Toxoplasma gondii ME49 ToxoDB Toxoplas… 5811
#> # ℹ abbreviated name: ¹`Species NCBI taxon ID`
df <- lapply(1:nrow(dbs), function(x){
plasmoRUtils::getTable(org=dbs[x,]$Organism, db=tolower(dbs[x,]$`VEuPathDB Project`),customFields = c("primary_key" ,"gene_orthomcl_name"))
})
## Retain protein coding genes
df2 <- lapply(df, function(x){
x[!(stringr::str_detect(pattern = "N/A",string = x$`Ortholog Group`)),]
})
## Combine all the tables
merged_df2 <- Reduce(function(x, y) merge(x, y, by = "Ortholog Group", all = TRUE), df2)
merged_df2 %>% head(n = 10)
#> Ortholog Group Gene ID.x Gene ID.y Gene ID
#> 1 OG6_100000 <NA> <NA> TGME49_231880
#> 2 OG6_100004 PVX_086920 <NA> <NA>
#> 3 OG6_100006 PVX_083360 PF3D7_1349300 TGME49_237210
#> 4 OG6_100006 PVX_118355 PF3D7_1349300 TGME49_237210
#> 5 OG6_100007 PVX_001655 <NA> <NA>
#> 6 OG6_100012 PVX_001960 PF3D7_1021900 TGME49_261590
#> 7 OG6_100012 PVX_001960 PF3D7_1021900 TGME49_308810
#> 8 OG6_100012 PVX_001960 PF3D7_1423100 TGME49_261590
#> 9 OG6_100012 PVX_001960 PF3D7_1423100 TGME49_308810
#> 10 OG6_100012 PVX_085325 PF3D7_1021900 TGME49_261590
## Combining all the gene IDs in each column so that we have 1 row per orthogroup
merged_df2 <- merged_df2 %>%
dplyr::group_by( `Ortholog Group`) %>%
dplyr::summarise(dplyr::across(dplyr::everything(),
~paste(unique(.x), collapse = ",")), .groups = "drop")
## Changing column names
colnames(merged_df2) <- c("Orthogroup ID","P vivax Sal1","P falciparum 3D7","T gondii ME49")
merged_df2 %>% head(n = 10)
#> # A tibble: 10 × 4
#> `Orthogroup ID` `P vivax Sal1` `P falciparum 3D7` `T gondii ME49`
#> <chr> <chr> <chr> <chr>
#> 1 OG6_100000 NA NA TGME49_231880
#> 2 OG6_100004 PVX_086920 NA NA
#> 3 OG6_100006 PVX_083360,PVX_118355 PF3D7_1349300 TGME49_237210
#> 4 OG6_100007 PVX_001655 NA NA
#> 5 OG6_100012 PVX_001960,PVX_085325 PF3D7_1021900,PF3… TGME49_261590,…
#> 6 OG6_100025 PVX_081792,PVX_005565,PVX… PF3D7_1465800,PF3… TGME49_294550,…
#> 7 OG6_100026 PVX_096045,PVX_124085,PVX… PF3D7_1229100,PF3… TGME49_320460
#> 8 OG6_100029 NA NA TGME49_207210
#> 9 OG6_100034 PVX_089960 NA NA
#> 10 OG6_100045 PVX_095250 PF3D7_0319700 NA
Fetching orthologs from InParanoiDB
InParanoiDB uses NCBI taxon IDs as unique identifiers. For P. falciparum (36329), T. gondii (5811), P. vivax (126793) and P. berghei (5823), we will fetch pairwise orthologs by keeping P. falciparum as constant species. We can recycle the code above as follows.
#?getpairedOrthologs
# listipdb()
res <- lapply(c(126793,5811,5823), function(x){
getpairedOrthologs(from = 36329,
to=x,
db="ipdb",transform = TRUE)
}) %>% setNames(c("Pvivax","Tgme49","Pb"))
res[[1]] %>% head()
#> # A tibble: 6 × 11
#> `query_Group-id` query_Bitscore query_Species query_Inparalog-scor…¹
#> <dbl> <chr> <chr> <chr>
#> 1 1 8359 Plasmodium falciparum … 1
#> 2 2 8119 Plasmodium falciparum … 1
#> 3 3 7587 Plasmodium falciparum … 1
#> 4 4 7155 Plasmodium falciparum … 1
#> 5 5 6891 Plasmodium falciparum … 1
#> 6 6 6507 Plasmodium falciparum … 1
#> # ℹ abbreviated name: ¹`query_Inparalog-score`
#> # ℹ 7 more variables: `query_Protein-name` <chr>, `query_Seed-score` <chr>,
#> # target_Bitscore <chr>, target_Species <chr>,
#> # `target_Inparalog-score` <chr>, `target_Protein-name` <chr>,
#> # `target_Seed-score` <chr>
Combining the results across InParanoiDB, is not as straightforward as for OrthoMCL because of lack of unique Orthogroup ID. However, user can use Query Uniprot ID as anchor, combine the results, convert Uniprot IDs to gene IDs and then collapse rows since two or more Uniprot IDs might correspond to same gene IDs.
Session Info
utils::sessionInfo()
#> R version 4.4.1 (2024-06-14 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=English_India.utf8 LC_CTYPE=English_India.utf8
#> [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C
#> [5] LC_TIME=English_India.utf8
#>
#> time zone: Asia/Riyadh
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] plyr_1.8.9 dplyr_1.1.4 plasmoRUtils_1.1.0 rlang_1.1.6
#> [5] readr_2.1.5 janitor_2.2.1 BiocStyle_2.32.1
#>
#> loaded via a namespace (and not attached):
#> [1] segmented_2.1-4 fs_1.6.6
#> [3] ProtGenerics_1.36.0 matrixStats_1.5.0
#> [5] bitops_1.0-9 lubridate_1.9.4
#> [7] pRoloc_1.44.1 httr_1.4.7
#> [9] RColorBrewer_1.1-3 doParallel_1.0.17
#> [11] ggsci_3.2.0 tools_4.4.1
#> [13] MSnbase_2.30.1 backports_1.5.0
#> [15] utf8_1.2.6 R6_2.6.1
#> [17] lazyeval_0.2.2 withr_3.0.2
#> [19] prettyunits_1.2.0 gridExtra_2.3
#> [21] preprocessCore_1.66.0 cli_3.6.5
#> [23] Biobase_2.64.0 textshaping_1.0.1
#> [25] gt_1.0.0 sass_0.4.10
#> [27] topGO_2.56.0 mvtnorm_1.3-3
#> [29] randomForest_4.7-1.2 proxy_0.4-27
#> [31] pkgdown_2.1.3 Rsamtools_2.20.0
#> [33] systemfonts_1.2.3 txdbmaker_1.0.1
#> [35] AnnotationForge_1.46.0 dichromat_2.0-0.1
#> [37] parallelly_1.45.1 limma_3.60.6
#> [39] rstudioapi_0.17.1 impute_1.78.0
#> [41] RSQLite_2.4.1 FNN_1.1.4.1
#> [43] generics_0.1.4 BiocIO_1.14.0
#> [45] vroom_1.6.5 gtools_3.9.5
#> [47] car_3.1-3 dendextend_1.19.1
#> [49] GO.db_3.19.1 Matrix_1.7-1
#> [51] MALDIquant_1.22.3 drawProteins_1.24.0
#> [53] S4Vectors_0.42.1 abind_1.4-8
#> [55] lifecycle_1.0.4 yaml_2.3.10
#> [57] snakecase_0.11.1 carData_3.0-5
#> [59] SummarizedExperiment_1.34.0 recipes_1.3.1
#> [61] SparseArray_1.4.8 BiocFileCache_2.12.0
#> [63] grid_4.4.1 blob_1.2.4
#> [65] promises_1.3.3 crayon_1.5.3
#> [67] PSMatch_1.8.0 lattice_0.22-6
#> [69] beachmat_2.20.0 annotate_1.82.0
#> [71] GenomicFeatures_1.56.0 chromote_0.5.1
#> [73] mzR_2.38.0 KEGGREST_1.44.1
#> [75] pillar_1.11.0 knitr_1.50
#> [77] GenomicRanges_1.56.2 rjson_0.2.23
#> [79] lpSolve_5.6.23 future.apply_1.20.0
#> [81] codetools_0.2-20 mgsub_1.7.3
#> [83] glue_1.8.0 pcaMethods_1.96.0
#> [85] data.table_1.17.8 MultiAssayExperiment_1.30.3
#> [87] vctrs_0.6.5 png_0.1-8
#> [89] gtable_0.3.6 kernlab_0.9-33
#> [91] cachem_1.1.0 gower_1.0.2
#> [93] xfun_0.52 prodlim_2025.04.28
#> [95] S4Arrays_1.4.1 polyglotr_1.7.0
#> [97] coda_0.19-4.1 survival_3.8-3
#> [99] ncdf4_1.24 timeDate_4041.110
#> [101] SingleCellExperiment_1.26.0 iterators_1.0.14
#> [103] hardhat_1.4.1 lava_1.8.1
#> [105] statmod_1.5.0 MLInterfaces_1.84.0
#> [107] ipred_0.9-15 nlme_3.1-166
#> [109] bit64_4.6.0-1 progress_1.2.3
#> [111] filelock_1.0.3 LaplacesDemon_16.1.6
#> [113] GenomeInfoDb_1.40.1 bslib_0.9.0
#> [115] affyio_1.74.0 irlba_2.3.5.1
#> [117] rpart_4.1.23 colorspace_2.1-1
#> [119] BiocGenerics_0.50.0 DBI_1.2.3
#> [121] nnet_7.3-19 tidyselect_1.2.1
#> [123] processx_3.8.6 bit_4.6.0
#> [125] compiler_4.4.1 curl_6.4.0
#> [127] rvest_1.0.4 httr2_1.2.0
#> [129] graph_1.82.0 SparseM_1.84-2
#> [131] xml2_1.3.8 desc_1.4.3
#> [133] DelayedArray_0.30.1 plotly_4.11.0
#> [135] bookdown_0.43 rtracklayer_1.64.0
#> [137] scales_1.4.0 hexbin_1.28.5
#> [139] affy_1.82.0 rappdirs_0.3.3
#> [141] stringr_1.5.1 digest_0.6.37
#> [143] mixtools_2.0.0.1 rmarkdown_2.29
#> [145] XVector_0.44.0 htmltools_0.5.8.1
#> [147] pkgconfig_2.0.3 SingleR_2.6.0
#> [149] sparseMatrixStats_1.16.0 MatrixGenerics_1.16.0
#> [151] dbplyr_2.5.0 fastmap_1.2.0
#> [153] htmlwidgets_1.6.4 UCSC.utils_1.0.0
#> [155] DelayedMatrixStats_1.26.0 farver_2.1.2
#> [157] jquerylib_0.1.4 jsonlite_2.0.0
#> [159] mclust_6.1.1 BiocParallel_1.38.0
#> [161] mzID_1.42.0 ModelMetrics_1.2.2.2
#> [163] BiocSingular_1.20.0 RCurl_1.98-1.17
#> [165] magrittr_2.0.3 scuttle_1.14.0
#> [167] Formula_1.2-5 GenomeInfoDbData_1.2.12
#> [169] Rcpp_1.1.0 viridis_0.6.5
#> [171] MsCoreUtils_1.16.1 vsn_3.72.0
#> [173] pROC_1.18.5 stringi_1.8.7
#> [175] zlibbioc_1.50.0 MASS_7.3-61
#> [177] listenv_0.9.1 parallel_4.4.1
#> [179] Biostrings_2.72.1 splines_4.4.1
#> [181] hms_1.1.3 ps_1.9.1
#> [183] igraph_2.1.4 ggpubr_0.6.1
#> [185] QFeatures_1.14.2 ggsignif_0.6.4
#> [187] reshape2_1.4.4 biomaRt_2.60.1
#> [189] stats4_4.4.1 ScaledMatrix_1.12.0
#> [191] XML_3.99-0.18 evaluate_1.0.4
#> [193] BiocManager_1.30.26 tzdb_0.5.0
#> [195] foreach_1.5.2 tidyr_1.3.1
#> [197] purrr_1.1.0 future_1.67.0
#> [199] clue_0.3-66 bio3d_2.4-5
#> [201] ggplot2_3.5.2 rsvd_1.0.5
#> [203] xtable_1.8-4 broom_1.0.8
#> [205] restfulr_0.0.16 AnnotationFilter_1.28.0
#> [207] easyPubMed_2.13 e1071_1.7-16
#> [209] rstatix_0.7.2 later_1.4.2
#> [211] class_7.3-22 viridisLite_0.4.2
#> [213] ragg_1.4.0 tibble_3.3.0
#> [215] websocket_1.4.4 memoise_2.0.1
#> [217] AnnotationDbi_1.66.0 GenomicAlignments_1.40.0
#> [219] IRanges_2.38.1 cluster_2.1.8
#> [221] globals_0.18.0 timechange_0.3.0
#> [223] caret_7.0-1 sampling_2.11