Big Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure

Michał Bień

doi:10.5281/zenodo.3550777

Published November 22, 2019 | Version v1

Report Open

This work deployed two use cases of interest for High Energy Physics on cloud resources:
• CMS Big Data reduction: This use case runs a data reduction workload on physics data.
The code was originally developed by CERN openlab in collaboration with CMS and Intel in
2017-2018. It aims to demonstrate the scalability of a data reduction workflow by
processing ROOT files with Apache Spark.
• Spark DL Trigger: This use case deploys a full data preparation and machine learning
pipeline, from data ingestion (4.5 TB of ROOT data) to the training of a classifier using
neural networks. It is implemented with Apache Spark and the Keras API, following previous
work in collaboration with CERN openlab.
Resources for this work were deployed on Oracle Cloud Infrastructure (OCI). In particular,
the project completed:
• Setup of the project using Oracle Container Engine for Kubernetes (OKE) and Oracle Cloud
resources
• Troubleshooting of the oci-hdfs-connector to run Apache Spark at scale on OCI Object
Storage
• Measurements of OCI Object Storage performance for the selected use cases
• Investigations and performance measurements of resource utilisation on OKE when running
TensorFlow/Keras neural network model training at scale, using CPU resources and using
GPUs.
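As an illustration of what a storage performance measurement involves, the sketch below times individual read calls on a byte stream. It is a minimal, generic example and not the tooling used in the report: `TimedReader` and `read_all` are hypothetical names, and an in-memory buffer stands in for an object-store stream.

```python
import io
import time


class TimedReader:
    """Wrap a binary file-like object and record the latency of each read()."""

    def __init__(self, raw):
        self._raw = raw
        self.latencies = []  # seconds spent in each read() call
        self.bytes_read = 0

    def read(self, size=-1):
        start = time.perf_counter()
        data = self._raw.read(size)
        self.latencies.append(time.perf_counter() - start)
        self.bytes_read += len(data)
        return data

    def summary(self):
        total = sum(self.latencies)
        calls = len(self.latencies)
        return {
            "calls": calls,
            "bytes": self.bytes_read,
            "total_seconds": total,
            "mean_seconds": total / calls if calls else 0.0,
        }


def read_all(reader, chunk_size):
    """Drain the reader in fixed-size chunks, as a streaming client would."""
    while reader.read(chunk_size):
        pass


# Stand-in for an object-store stream (assumption): 1 MiB of in-memory bytes.
reader = TimedReader(io.BytesIO(b"\0" * (1024 * 1024)))
read_all(reader, chunk_size=256 * 1024)
stats = reader.summary()
```

With a real storage client, the same wrapper would sit around the client's input stream, so that time spent waiting on storage can be separated from time spent processing the data.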
Notable results of this project:
• Produced several key improvements to the oci-hdfs-connector. The improvements are
necessary to run the latest Spark version (Spark 2.4.x) on Oracle Cloud. The connector is
distributed by Oracle under an open source license, and the improvements will be fed back
to Oracle.
• Improved the instrumentation infrastructure for measuring Spark workloads on cloud
resources, by streamlining the deployment of the Spark performance dashboard on Kubernetes
and developing a Helm chart.
• Produced a solution for direct measurement of I/O latency for Spark workloads reading
from OCI or S3 storage. The results are of general interest for Spark users, notably
including the Spark service at CERN.
• Developed methods to parallelize TensorFlow/Keras on Kubernetes using the new
tf.distribute features of TensorFlow 2.0. These are of general interest for ML
practitioners, notably including the users of CERN cloud services.
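One common way to run tf.distribute across several Kubernetes pods is multi-worker training, where each pod learns about its peers through the `TF_CONFIG` environment variable. The sketch below only builds that variable. It is an assumption about a typical setup, not a description of the method in the report: the helper `make_tf_config`, the pod hostnames, and the port are hypothetical.

```python
import json


def make_tf_config(worker_hosts, task_index, port=2222):
    """Build the TF_CONFIG JSON string for one worker of a multi-worker job."""
    if not 0 <= task_index < len(worker_hosts):
        raise ValueError("task_index out of range")
    cluster = {"worker": [f"{host}:{port}" for host in worker_hosts]}
    task = {"type": "worker", "index": task_index}
    return json.dumps({"cluster": cluster, "task": task})


# Hypothetical pod hostnames, e.g. from a StatefulSet behind a headless service.
hosts = ["trainer-0.trainer", "trainer-1.trainer", "trainer-2.trainer"]
configs = [make_tf_config(hosts, i) for i in range(len(hosts))]

# Each pod would export its own string as TF_CONFIG before creating the
# strategy, for example (TensorFlow 2.0 API, not executed in this sketch):
#   strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
#   with strategy.scope():
#       model = build_model()  # hypothetical Keras model factory
#       model.compile(...)
```

Every worker receives the same cluster description and differs only in its task index, which is what lets identical container images be used for all pods.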

Files

Report_Michal_Bien.pdf (1.9 MB, md5:f4ba7f91816350a8bcd872a4ba138c71)
