Report Open Access

Big Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure

Michał Bień

This work deployed two use cases of interest for High Energy Physics using cloud resources:

• CMS Big Data reduction: this use case runs a data reduction workload on physics data. The code and implementation were originally developed by CERN openlab in collaboration with CMS and Intel in 2017-2018. It aims to demonstrate the scalability of a data reduction workflow by processing ROOT files with Apache Spark.
• Spark DL Trigger: this use case deploys a full data preparation and machine learning pipeline, from data ingestion (4.5 TB of ROOT data) to the training of a classifier using neural networks. It is implemented with Apache Spark and the Keras API, following previous work in collaboration with CERN openlab.

Resources for this work were deployed on Oracle Cloud Infrastructure (OCI). In particular, the project completed:

• Setup of the project using Oracle Container Engine for Kubernetes and Oracle Cloud resources
• Troubleshooting of the oci-hdfs-connector to run Apache Spark at scale on OCI Object Storage
• Measurements of OCI Object Storage performance for the selected use cases
• Investigations and performance measurements of resource utilisation on Oracle Container Engine for Kubernetes (OKE) when running TensorFlow/Keras neural network model training at scale, on CPU and on GPU resources

Notable results of this project:

• Produced several key improvements to the oci-hdfs-connector. The improvements are necessary to run the latest Spark version (Spark 2.4.x) on Oracle Cloud. The connector is distributed by Oracle with open source licensing, and the improvements will be fed back to Oracle.
• Improved the instrumentation infrastructure for measuring Spark workloads on cloud resources, by streamlining the deployment of the Spark performance dashboard on Kubernetes and developing a Helm chart
• Produced a solution for direct measurement of I/O latency for Spark workloads reading from OCI or S3 storage. The results are of general interest for Spark users, notably including the Spark service at CERN.
• Developed methods to parallelize TensorFlow/Keras on Kubernetes using the new tf.distribute features of TensorFlow 2.0. These are of general interest for ML practitioners, notably including the users of CERN cloud services.

Publication date: 2019-11-22
Keywords: CERN openlab, summer student programme