Designing MLOps Pipelines for Distributed Training: Architectures, Automation, and Best Practices

Tharakesavulu Vangalapat

doi:10.5281/zenodo.17384284

Published February 28, 2023 | Version v1

Journal article (open access)

Machine Learning Operations (MLOps) is a critical enabler of modern AI, supporting end-to-end automation, scalability, and reliability for machine learning at production scale. With the explosive growth of data and deep learning models, distributed training has become essential for reducing training times and leveraging heterogeneous resources. This paper presents a comprehensive review and practical guide for designing MLOps pipelines that support distributed training. We synthesize best practices, toolchains, real-world case studies, code examples, and performance benchmarks. Key topics include pipeline orchestration, data management, resource scheduling, experiment tracking, monitoring, security, compliance, and future directions for robust, scalable, and reproducible distributed ML systems.

Files

EJAET-10-2-104-112.pdf
