There is a newer version of the record available.

Published May 1, 2023 | Version 1.0.0
Software Open

Machine Learning Validation via Rational Dataset Sampling with astartes

  • 1. Center for Computational Science and Engineering, Massachusetts Institute of Technology
  • 2. Department of Chemical Engineering, Massachusetts Institute of Technology, United States
  • 3. Department of Chemical and Biomolecular Engineering, University of Delaware, United States

Description

Machine learning (ML) has become an increasingly popular tool to accelerate traditional workflows. Critical to the use of ML is the process of splitting datasets into training, validation, and testing subsets that are used to develop and evaluate models. Common practice in the literature is to assign these subsets randomly. Although this approach is fast and efficient, it only measures a model's capacity to interpolate. Test errors from random splits may be overly optimistic when the model is later applied to new data that is dissimilar to the training set; thus, there is a growing need to easily measure performance for extrapolation tasks. To address this issue, we report astartes, an open-source Python package that implements many existing similarity- and distance-based algorithms to partition data into more challenging splits that can better assess out-of-sample performance. This publication focuses on use cases within cheminformatics. However, astartes operates on arbitrary vector inputs, so its principles and workflow are generalizable to other ML domains as well. astartes is available via the Python package manager `pip` and is publicly hosted on GitHub (github.com/JacksonBurns/astartes).
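To make the contrast between random and distance-based splitting concrete, the following is a minimal sketch in plain Python. It does not use the astartes API, and it is not one of the algorithms the package implements: the function names (`random_split`, `farthest_from_centroid_split`, `mean_nearest_train_distance`) and the "hold out the samples farthest from the centroid" rule are hypothetical, chosen only to illustrate how a split based on distances in feature space can place test samples farther from the training data than a random split typically would.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def random_split(n_samples, test_fraction=0.2, seed=0):
    """Baseline: shuffle the indices and take the first n_test as the test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(test_fraction * n_samples)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def farthest_from_centroid_split(vectors, test_fraction=0.2):
    """Illustrative (hypothetical) distance-based split: hold out the samples
    farthest from the dataset centroid, so the test set sits at the edge of
    the region covered by the data."""
    n_samples = len(vectors)
    n_features = len(vectors[0])
    centroid = [sum(v[j] for v in vectors) / n_samples for j in range(n_features)]
    # Rank sample indices from farthest to nearest to the centroid.
    ranked = sorted(
        range(n_samples),
        key=lambda i: euclidean(vectors[i], centroid),
        reverse=True,
    )
    n_test = round(test_fraction * n_samples)
    return sorted(ranked[n_test:]), sorted(ranked[:n_test])


def mean_nearest_train_distance(vectors, train_idx, test_idx):
    """Average distance from each test sample to its closest training sample."""
    nearest = [
        min(euclidean(vectors[t], vectors[r]) for r in train_idx)
        for t in test_idx
    ]
    return sum(nearest) / len(nearest)


if __name__ == "__main__":
    # Synthetic 2-D data, used only for illustration.
    rng = random.Random(42)
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    splits = {
        "random": random_split(len(data)),
        "farthest-from-centroid": farthest_from_centroid_split(data),
    }
    for name, (train, test) in splits.items():
        score = mean_nearest_train_distance(data, train, test)
        print(f"{name}: mean nearest-train distance = {score:.3f}")
```

The mean nearest-train distance is one simple way to gauge how far a test set lies from the training data; a larger value suggests the evaluation leans more toward extrapolation. Because the sketch operates on plain numeric vectors, the same idea carries over to any featurization, which mirrors the description's point that the approach is not limited to cheminformatics.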

Files

astartes-1.0.0.zip (19.3 MB)
md5:031d6d42cd776774b17b7a6cbf5bfe86