Machine learning models accurately predict clades of proteocephalidean tapeworms (Onchoproteocephalidea) based on host and zoogeographical data
Description
This presentation was delivered at the ASP 99th Annual Meeting (Denver, Colorado, USA) by Dr. Denis Jacob Machado (UNC Charlotte's Dept. of Bioinformatics and Genomics and CIPHER center), the head of the Phyloinformatics Lab (phyloinformatics.com).
AUTHORS
Philippe Vieira Alves (1,2), Reinaldo J. da Silva (1), Alain de Chambrier (3), José L. Luque (4), Anastasiia Duchenko (2), Daniel Janies (2), Denis Jacob Machado (2).
INSTITUTIONS
(1) São Paulo State University, Botucatu, São Paulo, Brazil. (2) University of North Carolina at Charlotte (UNC Charlotte), Center for Computational Intelligence to Predict Health and Environmental Risks (CIPHER), Charlotte, NC, USA. (3) Natural History Museum, Geneva, Switzerland. (4) Federal Rural University of Rio de Janeiro, Seropédica, Rio de Janeiro, Brazil.
TITLE
Machine learning models accurately predict clades of proteocephalidean tapeworms (Onchoproteocephalidea) based on host and zoogeographical data.
ABSTRACT
Proteocephalids are a diverse group of tapeworms that have colonized vertebrate hosts in freshwater and terrestrial environments. Despite the group's worldwide ubiquity, the key macroevolutionary processes that have driven its evolution remain poorly understood. Here, we reviewed the phylogenetic relationships of proteocephalid tapeworms (Onchoproteocephalidea I) using 671 publicly available and 90 newly generated sequences of the large subunit nuclear ribosomal RNA (28S rRNA) and the mitochondrial cytochrome c oxidase subunit I (MT-CO1) for 537 terminals. The tree search was conducted under the parsimony optimality criterion using a total evidence approach. Interestingly, we did not recover Proteocephalidae as monophyletic. In addition, the tree was difficult to reconcile with individual biological, ecological, and zoogeographical traits of the hosts and parasites using traditional character optimization strategies. To test the predictive potential of combined (not individual) host and zoogeographical data in the context of the proteocephalid tree, we trained Random Forest machine learning models and demonstrated that they correctly place 87% of the terminals into eight representative clades. Furthermore, we interactively perturbed the tree at increasing levels of perturbation probability and observed that model accuracy correlates negatively with the degree of clade perturbation. Our analyses show that host and biogeographical data, which individually correlate poorly with the tree, can nonetheless be used to accurately predict proteocephalid clades when analyzed together. This is to be expected if the evolution of proteocephalidean clades depends on host and biogeographical attributes. We discuss how these machine learning models may provide external support for resolving the proteocephalid tree.
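The perturbation experiment described in the abstract can be illustrated with a minimal sketch. This is not the authors' script or data (those are available on request only). It assumes four hypothetical clades, each tied to a made-up host group and region, plus one uninformative habitat trait. As a stand-in for perturbing the tree, it moves each terminal to a different clade with probability p. The small random forest is written with the Python standard library only, so that the example is self-contained.

```python
import random
from collections import Counter

# Hypothetical clades, each tied to an invented (host group, region) pair.
CLADES = {
    "clade_A": ("fish", "Neotropical"),
    "clade_B": ("fish", "Palearctic"),
    "clade_C": ("reptile", "Neotropical"),
    "clade_D": ("reptile", "Palearctic"),
}

def make_terminals(rng, per_clade=50):
    """Synthetic terminals: (host, region, habitat) traits plus a clade label."""
    rows, labels = [], []
    for clade, (host, region) in CLADES.items():
        for _ in range(per_clade):
            habitat = rng.choice(["lentic", "lotic"])  # uninformative trait
            rows.append((host, region, habitat))
            labels.append(clade)
    return rows, labels

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(rows, labels, rng, n_try=2, depth=0, max_depth=6):
    """Grow one tree with binary 'trait == value' splits on a random trait subset."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority  # leaf: a clade name
    best = None
    for f in rng.sample(range(len(rows[0])), n_try):
        for v in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] == v]
            right = [i for i, r in enumerate(rows) if r[f] != v]
            if not left or not right:
                continue
            score = sum(len(s) * gini([labels[i] for i in s]) for s in (left, right))
            if best is None or score < best[0]:
                best = (score, f, v, left, right)
    if best is None:
        return majority  # no usable split among the sampled traits
    _, f, v, left, right = best
    kids = [
        grow_tree([rows[i] for i in s], [labels[i] for i in s],
                  rng, n_try, depth + 1, max_depth)
        for s in (left, right)
    ]
    return (f, v, kids[0], kids[1])

def predict_tree(node, row):
    while isinstance(node, tuple):  # internal nodes are tuples, leaves are strings
        f, v, yes, no = node
        node = yes if row[f] == v else no
    return node

def fit_forest(rows, labels, rng, n_trees=25):
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest

def predict(forest, row):
    """Majority vote over all trees in the forest."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]

def perturb(labels, p, rng):
    """With probability p, move a terminal to a different, randomly chosen clade."""
    names = sorted(CLADES)
    return [
        rng.choice([c for c in names if c != lab]) if rng.random() < p else lab
        for lab in labels
    ]

def accuracy_after_perturbation(p, seed=0):
    """Train on 70% of the perturbed terminals, score on the remaining 30%."""
    rng = random.Random(seed)
    rows, labels = make_terminals(rng)
    labels = perturb(labels, p, rng)
    order = list(range(len(rows)))
    rng.shuffle(order)
    cut = int(0.7 * len(order))
    train, test = order[:cut], order[cut:]
    forest = fit_forest([rows[i] for i in train], [labels[i] for i in train], rng)
    hits = sum(predict(forest, rows[i]) == labels[i] for i in test)
    return hits / len(test)

if __name__ == "__main__":
    for p in (0.0, 0.3, 0.6):
        print(f"perturbation p={p:.1f}  accuracy={accuracy_after_perturbation(p):.2f}")
```

In this toy setting, neither host group nor region alone identifies a clade, but the two together do, which mirrors the abstract's point about combined versus individual data. Held-out accuracy should fall as p grows, because the clade assignments become less consistent with the traits. The numbers it prints say nothing about the 87% reported for the real data set.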
UPLOADED FILES
This project contains the original slides and additional information in separate compressed directories. Each of them contains a READ-ME.txt file describing its contents.
- slides.pdf: Slide deck for this presentation
- appendix_s1.tar.gz: Supporting information, detailing samples used
- appendix_s2.tar.gz: Supporting information, including results from character categorization
- itol.tar.gz: Data needed to reconstruct our final image using iTOL v6
OTHER MATERIALS
The following materials are available only upon direct request, as they are being reviewed and prepared for publication:
- The data used to search for the most parsimonious trees and to calculate Jackknife and relative Goodman-Bremer values.
- The script and data needed to reproduce our Random Forest experiments.
FUNDING
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), proc. no. 2023/00714-5.
Additional details
Funding
- Fundação de Amparo à Pesquisa do Estado de São Paulo
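- Mitogenome organization and diversity of proteocephalid cestodes (Cestoda) revealed by genome skimming (original title: Organização do mitogenoma e diversidade de cestoides proteocefalídeos (Cestoda) revelados por genome skimming), proc. no. 2023/00714-5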