Published May 1, 2023 | Version 1.0.0
Software
Open
Machine Learning Validation via Rational Dataset Sampling with astartes
Authors/Creators
- 1. Center for Computational Science and Engineering, Massachusetts Institute of Technology
- 2. Department of Chemical Engineering, Massachusetts Institute of Technology, United States
- 3. Department of Chemical and Biomolecular Engineering, University of Delaware, United States
Description
Machine learning (ML) has become an increasingly popular tool for accelerating traditional workflows. Critical to the use of ML is the process of splitting datasets into training, validation, and testing subsets that are used to develop and evaluate models. Common practice in the literature is to assign these subsets randomly. Although this approach is fast and efficient, it only measures a model's capacity to interpolate. Testing errors from random splits may therefore be overly optimistic when the model is later applied to new data that is dissimilar to the training set; thus, there is a growing need to easily measure performance on extrapolation tasks. To address this issue, we report astartes, an open-source Python package that implements many existing similarity- and distance-based algorithms to partition data into more challenging splits that can better assess out-of-sample performance. This publication focuses on use cases within cheminformatics. However, astartes operates on arbitrary vector inputs, so its principles and workflow are generalizable to other ML domains as well. astartes is available via the Python package manager `pip` and is publicly hosted on GitHub (github.com/JacksonBurns/astartes).
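The contrast between a random split and a distance-based split can be illustrated with a minimal, standard-library-only sketch. It does not use astartes and does not reproduce any of its algorithms: `centroid_distance_split` and `mean_nearest_train_distance` are hypothetical helpers written for this illustration, the data are made up, and holding out the points farthest from the centroid is only a simple stand-in for the more careful similarity- and distance-based samplers the package provides.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def random_split(X, train_size=0.8, seed=0):
    """Baseline: shuffle the indices and cut. Returns (train_idx, test_idx)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_train = int(round(train_size * len(X)))
    return idx[:n_train], idx[n_train:]


def centroid_distance_split(X, train_size=0.8):
    """Toy extrapolative split (hypothetical helper, not an astartes sampler):
    the points farthest from the dataset centroid are held out for testing."""
    n = len(X)
    centroid = [sum(col) / n for col in zip(*X)]
    # Sort indices from closest to farthest from the centroid.
    order = sorted(range(n), key=lambda i: euclidean(X[i], centroid))
    n_train = int(round(train_size * n))
    return order[:n_train], order[n_train:]


def mean_nearest_train_distance(X, train_idx, test_idx):
    """Average distance from each test point to its closest training point.
    Larger values mean the test set lies further outside the training data."""
    nearest = [min(euclidean(X[t], X[r]) for r in train_idx) for t in test_idx]
    return sum(nearest) / len(nearest)


# Made-up 1-D feature vectors: a dense region plus two outlying points.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [20.0], [25.0]]

splits = {
    "random": random_split(X),
    "centroid-distance": centroid_distance_split(X),
}
for name, (train_idx, test_idx) in splits.items():
    gap = mean_nearest_train_distance(X, train_idx, test_idx)
    print(f"{name}: test={sorted(test_idx)} mean nearest-train distance={gap:.2f}")
```

On this toy data the centroid-distance split holds out the two outlying points (20 and 25), whose nearest training neighbours are 13 and 18 units away, so a model evaluated on them must extrapolate. No random split of the same data can place its test points further from the training set. For real work, install the package with `pip` and consult its documentation for the actual samplers and interface.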
Files
| Name | Size |
|---|---|
| astartes-1.0.0.zip (md5:031d6d42cd776774b17b7a6cbf5bfe86) | 19.3 MB |