ML/NLP in the biological domain - adapting to large, complex datasets

Neely, Christopher; Joachimiak, Marcin; Park, Gilchan; Kishore, Dileep; Cashman, Mikaela; Dehal, Paramvir; Riehl, William; Yang, Ziming; Gupta, Prachi; Soto, Carlos; Mutalik, Vivek; Yoo, Shinjae; Priya, Ranjan; Arkin, Adam

doi:10.5281/zenodo.13989009

Published October 25, 2024 | Version v1

Poster Open

1. Lawrence Berkeley National Laboratory
2. E O Lawrence Berkeley National Laboratory
3. Brookhaven National Laboratory
4. Oak Ridge National Laboratory

The Department of Energy Systems Biology Knowledgebase (KBase) explores novel machine learning and natural language processing (ML/NLP) use cases in the biological domain. Because of the scale, complexity, and non-uniformity of the data within the KBase platform, existing ML/NLP pipelines must be adjusted to meet these challenges. This work details several of these concerns, and their solutions, in the context of biological research in DOE-centric scientific focus areas; many of them are likely shared by domains outside of biology.

While mining training data from numerous and diverse sources is a common first step in ML projects, mixed data sources can complicate the interpretation of results. In particular, KBase has developed a model that classifies annotated metagenomes by the environment from which they were extracted, using gradient-boosted decision trees (CatBoost [1]). The model was trained on the MGnify dataset, a complex data source that combines many kinds of annotation, including taxonomic labels and protein-domain and full-length functional annotations from Gene Ontology [2,3] and InterPro [4]. Deriving a meaningful interpretation of the data relies on downstream analysis after model training and evaluation, such as permutation and feature importance analysis.
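As a rough illustration of the permutation analysis mentioned above, the sketch below measures how much a classifier's accuracy drops when one feature column is shuffled. It is a minimal, standard-library example under stated assumptions, not the KBase pipeline: `toy_model`, the two-feature data, and the labels "soil" and "marine" are hypothetical stand-ins for a trained CatBoost model and MGnify-derived features.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return hits / len(labels)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only; all other columns stay in place.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


# Hypothetical stand-in for a trained classifier: only feature 0 matters.
def toy_model(row):
    return "soil" if row[0] > 0.5 else "marine"


data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [toy_model(row) for row in rows]

baseline, importances = permutation_importance(toy_model, rows, labels)
print(baseline, importances)
```

In this toy setting, shuffling the feature the model ignores leaves accuracy unchanged, while shuffling the informative feature lowers it. With mixed annotation sources, the same procedure can be run per feature group (taxonomic, Gene Ontology, InterPro) to see which source the model leans on.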

Results from NLP classification tasks, including those of biological relevance, are often improved by augmenting the input with domain-specific context (for example, retrieval-augmented generation, RAG [5]). This requirement can preclude the use of models with small context windows. Model size also constrains a lab’s ability to use a model, as very large models may require prohibitively expensive pre-training or fine-tuning. In our work, we compare models with varying context window sizes and their ability to improve classification metrics for NLP models trained for genetic tool development and biomanufacturing tasks.
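To make the context-window constraint concrete, the sketch below packs ranked context passages into a fixed token budget ahead of a classification query. Everything here is an illustrative assumption: tokens are approximated by whitespace-separated words, and the function names, passages, and window sizes are hypothetical rather than taken from the poster.

```python
def count_tokens(text):
    """Crude token count: whitespace-separated words (an assumption)."""
    return len(text.split())


def build_prompt(query, passages, context_window, reserved_for_answer=0):
    """Greedily keep the top-ranked passages that fit in the token budget.

    `passages` is assumed to be ordered from most to least relevant.
    Returns the assembled prompt and the number of passages kept.
    """
    budget = context_window - reserved_for_answer - count_tokens(query)
    kept = []
    for passage in passages:
        cost = count_tokens(passage)
        if cost > budget:
            break  # stop at the first passage that no longer fits
        kept.append(passage)
        budget -= cost
    prompt = "\n\n".join(kept + [query])
    return prompt, len(kept)


# Hypothetical retrieved passages, already ranked by relevance.
passages = [
    "promoter strength depends on the host strain",
    "ribosome binding sites tune translation initiation rates",
    "terminators reduce transcriptional read-through",
]
query = "Classify this abstract as genetic tool development or not"

_, used_tiny = build_prompt(query, passages, context_window=12)
_, used_small = build_prompt(query, passages, context_window=20)
prompt_large, used_large = build_prompt(query, passages, context_window=40)
print(used_tiny, used_small, used_large)
```

Under these assumptions, the smallest window leaves no room for any added context, which is the situation the paragraph above describes: the benefit of augmentation is only available to models whose window can hold both the query and the retrieved material.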

Sampling bias may arise from slight variations in experimental protocols and data sources. These issues limit a model’s ability to generalize to data on which it was not trained. In biological applications such as phenotype classification, these variations can degrade the predictive performance of machine learning classifiers on out-of-clade predictions [6]. Addressing these effects calls for sampling techniques that go beyond basic random and stratified splitting. We are developing a sampling technique based on the similarity of the most important and predictive features, which uses phenotypic similarity as a proxy.
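The feature-similarity sampling technique itself is not specified here, so the sketch below shows a simpler, well-known alternative to random and stratified splitting: holding out whole groups (for example, clades) so that evaluation measures out-of-group generalization. The `clade` field, sample records, and function name are hypothetical.

```python
import random
from collections import defaultdict


def group_holdout_split(samples, group_of, test_fraction=0.25, seed=0):
    """Hold out whole groups so no group appears in both train and test."""
    groups = defaultdict(list)
    for index, sample in enumerate(samples):
        groups[group_of(sample)].append(index)
    names = sorted(groups)
    random.Random(seed).shuffle(names)
    target = test_fraction * len(samples)
    train, test = [], []
    for name in names:
        # Fill the test set group by group until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(groups[name])
    return train, test


# Hypothetical samples: four clades with five genomes each.
samples = [
    {"id": f"{clade}-{i}", "clade": clade}
    for clade in ["A", "B", "C", "D"]
    for i in range(5)
]
train_idx, test_idx = group_holdout_split(samples, lambda s: s["clade"])
train_clades = {samples[i]["clade"] for i in train_idx}
test_clades = {samples[i]["clade"] for i in test_idx}
print(sorted(train_clades), sorted(test_clades))
```

A similarity-based variant could replace the clade label with cluster assignments computed from the most predictive features, which is closer in spirit to the approach described above, though the details would depend on the similarity measure chosen.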

While the issues in developing a useful ML pipeline or NLP model are shared across many domains, the exact means of mitigating them are field-dependent. KBase’s research, which focuses on biological applications, is subject to these same issues and often requires insight into the biological domain to guide improvements in model results.

Files

Files (1.3 MB)

Name	Size
RSE 2024 Poster.pdf md5:4b0bb8589711355a68bd36b8bc555d8d	1.3 MB

Additional details

Dates

Available: 2024-10-15

References

Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). CatBoost: unbiased boosting with categorical features. Proceedings of the 32nd International Conference on Neural Information Processing Systems.
Ashburner et al. (2000). Gene ontology: tool for the unification of biology. Nat Genet. 25(1):25-9. DOI: 10.1038/75556.
The Gene Ontology Consortium. (2023). The Gene Ontology knowledgebase in 2023. Genetics. May 4;224(1):iyad031. DOI: 10.1093/genetics/iyad031.
Paysan-Lafosse, T., et al. (2023). InterPro in 2022. Nucleic Acids Research, 51(D1). https://doi.org/10.1093/nar/gkac993.
"Shap." n.d. PyPI. Accessed September 13, 2024. https://pypi.org/project/shap/.
PubMed Central (PMC) [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; 2000 - [cited 2024 Oct 8]. Available from: https://www.ncbi.nlm.nih.gov/pmc/
Lewis, Patrick. (2023, November 15). What Is Retrieval-Augmented Generation, aka RAG? NVIDIA. https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/.
Li Z, Selim A, Kuehn S. (2023). Statistical prediction of microbial metabolic traits from genomes. PLoS Comput Biol 19(12): e1011705. https://doi.org/10.1371/journal.pcbi.1011705.

	All versions	This version
Views	97	97
Downloads	110	110
Data volume	161.1 MB	161.1 MB
