Machine Learning Validation via Rational Dataset Sampling with astartes

Burns, Jackson; Spiekermann, Kevin; Bhattacharjee, Himaghna; Vlachos, Dionisios; Green, William

doi:10.5281/zenodo.7884532

Published May 1, 2023 | Version 1.0.0

Software Open

1. Center for Computational Science and Engineering, Massachusetts Institute of Technology
2. Department of Chemical Engineering, Massachusetts Institute of Technology, United States
3. Department of Chemical and Biomolecular Engineering, University of Delaware, United States

Machine Learning (ML) has become an increasingly popular tool to accelerate traditional workflows. Critical to the use of ML is the process of splitting datasets into training, validation, and testing subsets that are used to develop and evaluate models. Common practice in the literature is to assign these subsets randomly. Although this approach is fast and efficient, it only measures a model's capacity to interpolate. Testing errors from random splits may be overly optimistic when the model is later given new data that is dissimilar to the training set; thus, there is a growing need to easily measure performance for extrapolation tasks. To address this issue, we report astartes, an open-source Python package that implements many existing similarity- and distance-based algorithms to partition data into more challenging splits that can better assess out-of-sample performance. This publication focuses on use cases within cheminformatics. However, astartes operates on arbitrary vector inputs, so its principles and workflow are generalizable to other ML domains as well. astartes is available via the Python package manager `pip` and is publicly hosted on GitHub (github.com/JacksonBurns/astartes).
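The contrast between a random split and a distance-based split can be illustrated with a minimal, standard-library-only sketch. This is not the astartes API or any of its samplers: the function names `random_split` and `distance_split`, the toy grid data, and the "hold out the points farthest from the centroid" rule are all assumptions chosen for illustration.

```python
import math
import random


def random_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices and cut: the test set resembles the training set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def distance_split(X, test_fraction=0.2):
    """Hold out the samples farthest from the dataset centroid.

    Hypothetical illustration only; not one of the astartes samplers.
    """
    n_samples, n_features = len(X), len(X[0])
    centroid = [sum(row[j] for row in X) / n_samples for j in range(n_features)]
    distances = [math.dist(row, centroid) for row in X]
    # Sort indices from nearest to farthest from the centroid.
    order = sorted(range(n_samples), key=lambda i: distances[i])
    n_test = round(n_samples * test_fraction)
    n_train = n_samples - n_test
    return sorted(order[:n_train]), sorted(order[n_train:])


# Toy feature vectors (an assumption): a 10 x 10 grid of 2-D points.
X = [(float(i), float(j)) for i in range(10) for j in range(10)]

rand_train, rand_test = random_split(len(X))
dist_train, dist_test = distance_split(X)
```

In the distance-based split, every held-out point lies at least as far from the centroid as any training point, so a model evaluated on `dist_test` is asked to predict at the edge of, or beyond, the region it was trained on. The random split gives no such guarantee.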

Files

Files (19.3 MB)

Name	Size	Checksum
astartes-1.0.0.zip	19.3 MB	md5:031d6d42cd776774b17b7a6cbf5bfe86

	All versions	This version
Views	315	88
Downloads	86	35
Data volume	1.7 GB	674.8 MB
