Big Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure

DOI: 10.5281/zenodo.3550777
URL: https://zenodo.org/records/3550777
OAI identifier: oai:zenodo.org:3550777
Author: Michał Bień
Publisher: Zenodo, 2019
Subject: CERN openlab summer-student programme
Dates: 2019-11-22; 2023-07-14
Related identifier: 10.5281/zenodo.3550776
Community: https://zenodo.org/communities/cernopenlab
License: Creative Commons Attribution 4.0 International

This work deployed two use cases of interest for High Energy Physics on cloud resources:

- CMS Big Data reduction: This use case runs a data reduction workload on physics data. The code and implementation were originally developed by CERN openlab in collaboration with CMS and Intel in 2017-2018. It aims to demonstrate the scalability of a data reduction workflow by processing ROOT files with Apache Spark.
- Spark DL Trigger: This use case deploys a full data preparation and machine learning pipeline, from data ingestion (4.5 TB of ROOT data) to the training of a classifier based on neural networks. It is implemented using Apache Spark and the Keras API, following previous work in collaboration with CERN openlab.

Resources for this work were deployed using Oracle Cloud Infrastructure (OCI). In particular, the project completed:

- Setup of the project using Oracle Container Engine for Kubernetes (OKE) and Oracle Cloud resources
- Troubleshooting of the oci-hdfs-connector to run Apache Spark at scale on OCI Object Storage
- Measurements of OCI Object Storage performance for the selected use cases
- Investigations and performance measurements of resource utilisation on OKE when running TensorFlow/Keras neural network model training at scale, both on CPU resources and on GPUs

Notable results of this project:

- Produced several key improvements to the oci-hdfs-connector. The improvements are necessary to run the latest Spark version (Spark 2.4.x) on Oracle Cloud. The connector is distributed by Oracle under an open source license, and the improvements will be fed back to Oracle.
- Improved the instrumentation infrastructure for measuring Spark workloads on cloud resources, by streamlining the deployment of the Spark performance dashboard on Kubernetes and developing a Helm chart.
- Produced a solution for direct measurement of I/O latency for Spark workloads reading from OCI or S3 storage. The results are of general interest for Spark users, notably including the Spark service at CERN.
- Developed methods to parallelize TensorFlow/Keras on Kubernetes using the new tf.distribute features of TensorFlow 2.0. These are of general interest for ML practitioners, notably including the users of CERN cloud services.
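
The idea behind the direct I/O latency measurement mentioned among the results can be illustrated with a minimal sketch. This is not the project's instrumentation, which the record does not describe in detail. It assumes only that a storage read can be wrapped and timed per call. The names `TimedReader` and `read_all` are hypothetical, and an in-memory buffer stands in for an object-storage stream.

```python
import io
import time


class TimedReader:
    """Wrap a binary file-like object and record the latency of each read().

    Hypothetical helper for illustration; not part of Spark or any OCI library.
    """

    def __init__(self, raw, clock=time.perf_counter):
        self._raw = raw
        self._clock = clock      # injectable clock, useful for testing
        self.latencies = []      # seconds spent in each read() call
        self.bytes_read = 0

    def read(self, size=-1):
        start = self._clock()
        data = self._raw.read(size)
        self.latencies.append(self._clock() - start)
        self.bytes_read += len(data)
        return data

    def summary(self):
        n = len(self.latencies)
        total = sum(self.latencies)
        return {
            "reads": n,
            "bytes": self.bytes_read,
            "total_seconds": total,
            "mean_seconds": total / n if n else 0.0,
            "max_seconds": max(self.latencies, default=0.0),
        }


def read_all(reader, chunk_size):
    """Drain the reader in fixed-size chunks; return the number of non-empty chunks."""
    chunks = 0
    while reader.read(chunk_size):
        chunks += 1
    return chunks


if __name__ == "__main__":
    # An in-memory buffer stands in for a remote object (assumption).
    reader = TimedReader(io.BytesIO(b"x" * 1_000_000))
    read_all(reader, chunk_size=64 * 1024)
    print(reader.summary())
```

Timing at the level of individual read calls separates time spent waiting on storage from time spent computing, which is the distinction a latency measurement for remote object storage needs to make.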