Report Open Access

Big Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure

Michał Bień


JSON-LD (schema.org) Export

{
  "description": "<p>This work deployed two use cases of interest for High Energy Physics using cloud resources:</p><ul><li>CMS Big Data reduction: this use case runs a data reduction workload for physics data. The code and implementation were originally developed by CERN openlab in collaboration with CMS and Intel in 2017-2018. It aims to demonstrate the scalability of a data reduction workflow by processing ROOT files using Apache Spark.</li><li>Spark DL Trigger: this use case deploys a full data preparation and machine learning pipeline, from data ingestion (4.5 TB of ROOT data) to the training of a classifier using neural networks. It is implemented using Apache Spark and the Keras API, following previous work in collaboration with CERN openlab.</li></ul><p>Resources for this work were deployed using Oracle Cloud Infrastructure (OCI). In particular, this project completed:</p><ul><li>Setup of the project using Oracle Container Engine for Kubernetes and Oracle Cloud resources</li><li>Troubleshooting of the oci-hdfs-connector to run Apache Spark at scale on OCI Object Storage</li><li>Measurements of OCI Object Storage performance for the selected use cases</li><li>Investigations and performance measurements of resource utilisation on Oracle Container Engine for Kubernetes (OKE) when running TensorFlow/Keras neural network model training at scale, using CPU resources and using GPUs</li></ul><p>Notable results of this project:</p><ul><li>Produced several key improvements to the oci-hdfs-connector. The improvements are necessary to run the latest Spark version (Spark 2.4.x) on Oracle Cloud. The connector is distributed by Oracle with open source licensing, and the improvements will be fed back to Oracle.</li><li>Improved the instrumentation infrastructure for measuring Spark workloads on cloud resources by streamlining the deployment of the Spark performance dashboard on Kubernetes and developing a Helm chart.</li><li>Produced a solution for direct measurement of I/O latency for Spark workloads reading from OCI or S3 storage. The results are of general interest for Spark users, notably including the Spark service at CERN.</li><li>Developed methods to parallelize TensorFlow/Keras on Kubernetes using the new tf.distribute features of TensorFlow 2.0. These are of general interest for ML practitioners, notably including the users of CERN cloud services.</li></ul>", 
  "license": "https://creativecommons.org/licenses/by/4.0/legalcode", 
  "creator": [
    {
      "@type": "Person", 
      "name": "Micha\u0142 Bie\u0144"
    }
  ], 
  "headline": "Big Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure", 
  "image": "https://zenodo.org/static/img/logos/zenodo-gradient-round.svg", 
  "datePublished": "2019-11-22", 
  "url": "https://zenodo.org/record/3550777", 
  "keywords": [
    "CERN openlab", 
    "summer student programme"
  ], 
  "@context": "https://schema.org/", 
  "identifier": "https://doi.org/10.5281/zenodo.3550777", 
  "@id": "https://doi.org/10.5281/zenodo.3550777", 
  "@type": "ScholarlyArticle", 
  "name": "Big Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure"
}