Report Open Access

Big Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure

Michał Bień

This work successfully deployed two use cases of interest for High Energy Physics on cloud
resources:
• CMS Big Data reduction: This use case consists of running a data reduction workload on
physics data. The code and implementation were originally developed by CERN openlab
in collaboration with CMS and Intel in 2017-2018. It aims to demonstrate the scalability of a
data reduction workflow by processing ROOT files with Apache Spark.
• Spark DL Trigger: This use case consists of deploying a full data preparation and
machine learning pipeline, from data ingestion (4.5 TB of ROOT data) to the training
of a neural network classifier. This use case is implemented using Apache Spark and
the Keras API, following previous work in collaboration with CERN openlab.
Resources for this work were deployed on Oracle Cloud Infrastructure (OCI). In particular,
this project completed:
• Setup of the project using Oracle Container Engine for Kubernetes and Oracle Cloud
• Troubleshooting of the oci-hdfs-connector to run Apache Spark at scale on OCI Object
• Measurements of OCI Object Storage performance for the selected use cases
• Investigation and performance measurements of resource utilisation on Oracle
Container Engine for Kubernetes (OKE) when running TensorFlow/Keras neural network
model training at scale, on both CPU and GPU resources.
Notable results of this project: 
• Produced several key improvements to the oci-hdfs-connector. The improvements are
necessary to run the latest Spark version (Spark 2.4.x) on Oracle Cloud. The connector is
distributed by Oracle under an open source license, and the improvements will be fed back to
• Improved the instrumentation infrastructure for measuring Spark workloads on cloud
resources by streamlining the deployment of the Spark performance dashboard on Kubernetes
and developing a Helm chart for it.
• Produced a solution for direct measurement of I/O latency for Spark workloads reading from
OCI or S3 storage. The results are of general interest to Spark users, notably including users
of the Spark service at CERN.
• Developed methods to parallelize TensorFlow/Keras on Kubernetes using the new
tf.distribute features of TensorFlow 2.0. These are of general interest to ML practitioners,
notably including the users of CERN cloud services.
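
As a rough illustration of the last point, multi-worker training with tf.distribute is usually configured through a TF_CONFIG environment variable set on each worker. The sketch below is not taken from the report: the helper name build_tf_config, the pod host names and the port are hypothetical, and it only shows how such a configuration could be assembled for workers running as Kubernetes pods.

```python
import json


def build_tf_config(worker_hosts, task_index, port=2222):
    """Return a TF_CONFIG JSON string for one worker of a multi-worker job.

    Hypothetical helper: worker_hosts lists the pod host names, task_index
    selects the worker this configuration is meant for.
    """
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index must identify one of the workers")
    cluster = {"worker": [f"{host}:{port}" for host in worker_hosts]}
    task = {"type": "worker", "index": task_index}
    return json.dumps({"cluster": cluster, "task": task})


# Assumed pod DNS names, for example from a Kubernetes StatefulSet.
hosts = [f"trainer-{i}.trainer" for i in range(3)]
config = build_tf_config(hosts, task_index=1)
print(config)
```

Each pod would export its own string as TF_CONFIG before the training script creates a multi-worker strategy (in TensorFlow 2.0, tf.distribute.experimental.MultiWorkerMirroredStrategy) and builds the Keras model inside the strategy scope.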
