Published June 16, 2024 | Version v1
Presentation Open

Machine learning models accurately predict clades of proteocephalidean tapeworms (Onchoproteocephalidea) based on host and zoogeographical data

  • 1. Universidade Estadual Paulista (Unesp)
  • 2. Natural History Museum of Geneva
  • 3. Universidade Federal Rural do Rio de Janeiro
  • 4. University of North Carolina at Charlotte

Description

This presentation was delivered at the ASP 99th Annual Meeting (Denver, Colorado, USA) by Dr. Denis Jacob Machado (UNC Charlotte's Dept. of Bioinformatics and Genomics and CIPHER center), the head of the Phyloinformatics Lab (phyloinformatics.com).

AUTHORS

Philippe Vieira Alves (1,2), Reinaldo J. da Silva (1), Alain de Chambrier (3), José L. Luque (4), Anastasiia Duchenko (2), Daniel Janies (2), Denis Jacob Machado (2).

INSTITUTIONS

(1) São Paulo State University, Botucatu, São Paulo, Brazil. (2) University of North Carolina at Charlotte (UNC Charlotte), Center for Computational Intelligence to Predict Health and Environmental Risks (CIPHER), Charlotte, NC, USA. (3) Natural History Museum, Geneva, Switzerland. (4) Federal Rural University of Rio de Janeiro, Seropédica, Rio de Janeiro, Brazil.

TITLE

Machine learning models accurately predict clades of proteocephalidean tapeworms (Onchoproteocephalidea) based on host and zoogeographical data.

ABSTRACT

Proteocephalids are a diverse group of tapeworms that have colonized vertebrate hosts in freshwater and terrestrial environments. Despite the group's worldwide ubiquity, the key macroevolutionary processes that have driven its evolution remain poorly understood. Here, we reviewed the phylogenetic relationships of proteocephalid tapeworms (Onchoproteocephalidea I) using publicly available (671) and newly generated (90) sequences of the large subunit nuclear ribosomal RNA (28S rRNA) and the mitochondrial cytochrome c oxidase subunit I (MT-CO1) for 537 terminals. The tree search was conducted under the parsimony optimality criterion using a total evidence approach. Interestingly, we did not recover Proteocephalidae as monophyletic. In addition, it was difficult to reconcile the tree with data on individual biological, ecological, and zoogeographical traits of the hosts and parasites using traditional character optimization strategies. To test the predictive potential of combined (not individual) host and zoogeographical data in the context of the proteocephalid tree, we trained Random Forest machine learning models and demonstrated that they correctly place 87% of the terminals into eight representative clades. Furthermore, we interactively perturbed the tree at increasing levels of perturbation probability and observed that model accuracy correlates negatively with the degree of clade perturbation. Our analyses show that host and biogeographical data, even though they individually correlate poorly with the tree, can be used to accurately predict proteocephalid clades when analyzed together. This is to be expected if the evolution of proteocephalidean clades depends on host and biogeographical attributes. We discuss how these machine learning models may provide external support for unpuzzling the proteocephalid tree.
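The Random Forest and perturbation experiments described in the abstract can be illustrated with a minimal sketch. It is not the authors' script, which is available only on request. Everything here is an assumption made for illustration: the synthetic data, the trait names (host_a, region_a, ...), the use of scikit-learn, the five-fold cross-validation, and the choice to mimic tree perturbation by reassigning clade labels at random with a given probability. In the toy data, each host group and each region is shared by two clades, so neither trait identifies a clade on its own, while the combination does.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categories; the real host and zoogeographical traits are not listed here.
N_CLADES = 8
HOSTS = np.array(["host_a", "host_b", "host_c", "host_d"])
REGIONS = np.array(["region_a", "region_b", "region_c", "region_d"])


def make_toy_data(n_per_clade=40, fidelity=0.85, seed=0):
    """Build synthetic terminals: each clade prefers one host group and one region."""
    rng = np.random.default_rng(seed)
    rows, clades = [], []
    for clade in range(N_CLADES):
        for _ in range(n_per_clade):
            # With probability `fidelity` use the preferred value, otherwise draw at random.
            host = HOSTS[clade % 4] if rng.random() < fidelity else rng.choice(HOSTS)
            region = REGIONS[clade // 2] if rng.random() < fidelity else rng.choice(REGIONS)
            rows.append([host, region])
            clades.append(clade)
    return np.array(rows), np.array(clades)


def perturb_labels(clades, probability, rng):
    """Reassign each terminal to a random clade with the given probability (assumed scheme)."""
    perturbed = clades.copy()
    mask = rng.random(perturbed.size) < probability
    perturbed[mask] = rng.integers(0, N_CLADES, size=int(mask.sum()))
    return perturbed


def accuracy_under_perturbation(traits, clades, probability, seed=0):
    """Mean 5-fold cross-validated accuracy of a Random Forest on perturbed clade labels."""
    rng = np.random.default_rng(seed)
    labels = perturb_labels(clades, probability, rng)
    features = OneHotEncoder(handle_unknown="ignore").fit_transform(traits)
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(model, features, labels, cv=5).mean()


if __name__ == "__main__":
    traits, clades = make_toy_data()
    for probability in (0.0, 0.2, 0.4, 0.8):
        accuracy = accuracy_under_perturbation(traits, clades, probability)
        print(f"perturbation probability {probability:.1f}: accuracy {accuracy:.2f}")
```

On such toy data, accuracy should fall as the perturbation probability rises, which is the qualitative pattern the abstract reports. The numbers themselves say nothing about the real dataset.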

UPLOADED FILES

This project contains the original slides and additional information in separate compressed directories. Each of them contains a READ-ME.txt file describing its contents.

  • slides.pdf: Slide deck for this presentation
  • appendix_s1.tar.gz: Supporting information, detailing samples used
  • appendix_s2.tar.gz: Supporting information, including results from character categorization
  • itol.tar.gz: Data needed to reconstruct our final image using iTOL v6
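As a convenience, the READ-ME.txt file of a downloaded archive can be read without unpacking it. The sketch below uses only the Python standard library. It assumes the archive is gzip-compressed, as the .tar.gz extension suggests, and that the file is named exactly READ-ME.txt somewhere inside it. The helper name read_readme is hypothetical.

```python
import tarfile
from pathlib import PurePosixPath


def read_readme(archive_path, readme_name="READ-ME.txt"):
    """Return the text of the first file called `readme_name` in a .tar.gz archive, or None."""
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Compare only the final path component so nested directories also match.
            if member.isfile() and PurePosixPath(member.name).name == readme_name:
                with archive.extractfile(member) as handle:
                    return handle.read().decode("utf-8", errors="replace")
    return None


if __name__ == "__main__":
    # Assumes the archive has been downloaded to the current directory.
    text = read_readme("appendix_s1.tar.gz")
    print(text if text is not None else "No READ-ME.txt found")
```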

OTHER MATERIALS

The following materials are available only upon direct request, as they are being reviewed and prepared for publication:

  • The data used to search for the most parsimonious trees and to calculate Jackknife and relative Goodman-Bremer values.
  • The script and data needed to reproduce our Random Forest experiments.

FUNDING

Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), proc. no. 2023/00714-5.

Files

slides.pdf

Files (5.2 MB)

Checksum Size
md5:8e630883ed75ac8ddb636f813ea59eb4 20.5 kB
md5:a17e8d3ad755e431cc800adc5a94d441 655.8 kB
md5:e1da1486d79a2ecc1324db528819ceba 462.7 kB
md5:eab6591837ceef839c30bce07136fe2b 4.1 MB
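The md5 values in the file listing can be used to check that a download is intact. The sketch below uses only the Python standard library. The listing does not say which checksum belongs to which file, so the file name and checksum pairing in the usage example is a placeholder to be replaced by the reader. The helper names md5_of_file and matches_checksum are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hexadecimal md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum written as 'md5:<hex>' or as plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


if __name__ == "__main__":
    # Placeholder pairing: substitute the file you downloaded and its listed checksum.
    print(matches_checksum("slides.pdf", "md5:eab6591837ceef839c30bce07136fe2b"))
```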

Additional details

Funding

Fundação de Amparo à Pesquisa do Estado de São Paulo
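Organização do mitogenoma e diversidade de cestoides proteocefalídeos (Cestoda) revelados por genome skimming [Mitogenome organization and diversity of proteocephalid cestodes (Cestoda) revealed by genome skimming] 2023/00714-5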