
Published July 12, 2023 | Version v1
Journal article | Open Access

Training Data Alchemy: Balancing Quality and Quantity in Machine Learning Training

  • 1. Associate Professor, Ashoka Women's Engineering College, Kurnool
  • 2. Student, Ashoka Women's Engineering College, Kurnool
  • 3. Assistant Professor, Ashoka Women's Engineering College, Kurnool
  • 4. Professor, Ashoka Women's Engineering College, Kurnool
  • 5. Student, Alliance University, Bangalore
  • 6. Student, G. Pullaiah College of Engineering and Technology, Kurnool

Description

Determining the optimal amount of training data for machine learning algorithms is critical to building successful and accurate models. This paper surveys the research surrounding this question and provides insight into the factors that determine how much training data is required for effective machine learning. It explores the balance between data quality and quantity, the concept of overfitting, and the importance of representative and diverse datasets. It also discusses techniques for estimating the minimum amount of training data needed to reach a desired level of performance. By understanding how training data size affects model performance, researchers and practitioners can make informed decisions when selecting training datasets, maximizing the efficiency and effectiveness of machine learning algorithms.
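The description mentions techniques for estimating the minimum training data needed. One widely used approach (offered here as an illustration, not as the specific method from this paper) is the learning curve: train the model on progressively larger subsets and watch where validation performance plateaus. Below is a minimal scikit-learn sketch; the dataset and model are placeholders.

```python
# Learning-curve sketch (illustrative only; the dataset and model are
# placeholders, not taken from the paper).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Score the model with 5-fold cross-validation at 10 training-set sizes,
# from 10% to 100% of the available data.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="accuracy",
)

for n, tr, va in zip(train_sizes,
                     train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"n={n:5d}  train acc={tr:.3f}  val acc={va:.3f}")
```

When the validation curve flattens and the train/validation gap is small, additional data is unlikely to help much; a persistent gap instead suggests overfitting and a need for more, or more diverse, training data.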

Files

Training Data Alchemy -Formatted Paper.pdf (323.8 kB)
md5:27de4932018e76b974ae2ab35b911f99
