Big Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure
Creators
Description
This work deployed two use cases of interest for High Energy Physics on cloud resources:
- CMS Big Data reduction: this use case runs a data reduction workload on physics data. The
  code and implementation were originally developed by CERN openlab in collaboration with
  CMS and Intel in 2017-2018. It aims to demonstrate the scalability of a data reduction
  workflow by processing ROOT files with Apache Spark.
- Spark DL Trigger: this use case deploys a full data preparation and machine learning
  pipeline, from data ingestion (4.5 TB of ROOT data) to the training of a classifier using
  neural networks. It is implemented using Apache Spark and the Keras API, following
  previous work in collaboration with CERN openlab.
Resources for this work were deployed on Oracle Cloud Infrastructure (OCI). In particular,
this project completed:
- Setup of the project using Oracle Container Engine for Kubernetes (OKE) and Oracle Cloud
  resources.
- Troubleshooting of the oci-hdfs-connector to run Apache Spark at scale on OCI Object
  Storage.
- Measurements of OCI Object Storage performance for the selected use cases.
- Investigations and performance measurements of resource utilisation on OKE when running
  TensorFlow/Keras neural network model training at scale, both on CPU resources and on GPUs.
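As an illustration of what running Spark against OCI Object Storage through the oci-hdfs-connector involves, the following is a minimal sketch that assembles the Spark properties and the `oci://` path for a job. It is not taken from the project: the helper names `oci_spark_conf` and `oci_uri`, and all values in the usage lines, are hypothetical. The `fs.oci.client.*` property names and the `oci://<bucket>@<namespace>/<path>` URI form are given as the author of this sketch recalls them from the connector documentation and should be checked against the connector version in use.

```python
def oci_spark_conf(region, tenancy_ocid, user_ocid, fingerprint, pem_path):
    """Return Spark properties that forward OCI credentials to Hadoop.

    Spark copies every property prefixed with 'spark.hadoop.' into the
    Hadoop configuration, which is where the connector reads its settings.
    The fs.oci.client.* key names are assumptions to verify against the
    documentation of the connector version in use.
    """
    hadoop_props = {
        "fs.oci.client.hostname": f"https://objectstorage.{region}.oraclecloud.com",
        "fs.oci.client.auth.tenantId": tenancy_ocid,
        "fs.oci.client.auth.userId": user_ocid,
        "fs.oci.client.auth.fingerprint": fingerprint,
        "fs.oci.client.auth.pemfilepath": pem_path,
    }
    return {f"spark.hadoop.{key}": value for key, value in hadoop_props.items()}


def oci_uri(bucket, namespace, path):
    """Build an object path in the oci://<bucket>@<namespace>/<path> form."""
    return f"oci://{bucket}@{namespace}/{path.lstrip('/')}"


# Hypothetical values, for illustration only.
conf = oci_spark_conf(
    region="eu-frankfurt-1",
    tenancy_ocid="ocid1.tenancy.oc1..example",
    user_ocid="ocid1.user.oc1..example",
    fingerprint="aa:bb:cc",
    pem_path="/etc/oci/key.pem",
)
for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
print(oci_uri("physics-data", "mynamespace", "/reduction/input.root"))
```

The printed `--conf` pairs can be passed to `spark-submit`, or the dictionary can be applied to a `SparkConf` object. The connector jar itself must also be on the Spark classpath.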
Notable results of this project:
- Produced several key improvements to the oci-hdfs-connector. The improvements are
  necessary to run the latest Spark version (Spark 2.4.x) on Oracle Cloud. The connector is
  distributed by Oracle under an open source license, and the improvements will be fed back
  to Oracle.
- Improved the instrumentation infrastructure for measuring Spark workloads on cloud
  resources, by streamlining the deployment of the Spark performance dashboard on Kubernetes
  and developing a Helm chart.
- Produced a solution for direct measurement of I/O latency for Spark workloads reading from
  OCI or S3 storage. The results are of general interest for Spark users, notably including
  the Spark service at CERN.
- Developed methods to parallelize TensorFlow/Keras on Kubernetes using the new
  tf.distribute features of TensorFlow 2.0. These are of general interest for ML
  practitioners, notably including the users of CERN cloud services.
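The record does not say which tf.distribute strategy was used. As a hedged sketch of one common pattern for multi-worker training on Kubernetes, the code below builds the `TF_CONFIG` environment variable that TensorFlow's multi-worker strategies read to discover the cluster. The helper `tf_config_for`, the pod host names, and the port are hypothetical and are not taken from the report.

```python
import json


def tf_config_for(worker_hosts, index, port=2222):
    """Return the TF_CONFIG JSON string for one worker of a training job.

    worker_hosts: host names of all workers, in the same order on every pod.
    index: position of the current worker in that list.
    """
    if not 0 <= index < len(worker_hosts):
        raise ValueError("index must refer to an entry of worker_hosts")
    cluster = {"worker": [f"{host}:{port}" for host in worker_hosts]}
    task = {"type": "worker", "index": index}
    return json.dumps({"cluster": cluster, "task": task})


# Hypothetical pod DNS names, e.g. from a Kubernetes StatefulSet.
hosts = ["tf-worker-0.tf-workers", "tf-worker-1.tf-workers"]
print(tf_config_for(hosts, index=0))

# On each pod, the string would be exported before the strategy is created:
#   os.environ["TF_CONFIG"] = tf_config_for(hosts, index=<this pod's index>)
# In TensorFlow 2.0 a multi-worker strategy is then created with
#   tf.distribute.experimental.MultiWorkerMirroredStrategy()
# and the Keras model is built and compiled inside strategy.scope().
```

Every worker must see the same ordered host list and differ only in `index`; otherwise the workers will not agree on the cluster layout.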
Files
Report_Michal_Bien.pdf (1.9 MB)
md5:f4ba7f91816350a8bcd872a4ba138c71