Report Open Access

Big Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure

Michał Bień


Citation Style Language JSON Export

{
  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.3550777", 
  "author": [
    {
      "family": "Bie\u0144",
      "given": "Micha\u0142"
    }
  ], 
  "issued": {
    "date-parts": [
      [
        2019, 
        11, 
        22
      ]
    ]
  }, 
  "abstract": "This work successfully deployed two use cases of interest for High Energy Physics using cloud resources:\n- CMS Big data reduction: This use case consists of running data reduction workloads for physics data. The code and implementation were originally developed by CERN openlab in collaboration with CMS and Intel in 2017-2018. It aims to demonstrate the scalability of a data reduction workflow by processing ROOT files using Apache Spark.\n- Spark DL Trigger: This use case consists of deploying a full data preparation and machine learning pipeline, from data ingestion (4.5 TB of ROOT data) to the training of a classifier using neural networks. It is implemented using Apache Spark and the Keras API, following previous work in collaboration with CERN openlab.\nResources for this work were deployed using Oracle Cloud Infrastructure (OCI). In particular, this project completed:\n- Setup of the project using Oracle Container Engine for Kubernetes and Oracle Cloud resources\n- Troubleshooting of the oci-hdfs-connector to run Apache Spark at scale on OCI Object Storage\n- Measurements of OCI Object Storage performance for the selected use cases\n- Investigations and performance measurements of resource utilisation on Oracle Container Engine for Kubernetes (OKE) when running TensorFlow/Keras neural network model training at scale, using CPU resources and when using GPU.\nNotable results of this project:\n- Produced several key improvements to the oci-hdfs-connector. The improvements are necessary to run the latest Spark version (Spark 2.4.x) on Oracle Cloud. The connector is distributed by Oracle with open source licensing, and the improvements will be fed back to Oracle.\n- Improved instrumentation infrastructure for measuring Spark workloads on cloud resources, by streamlining the deployment of the Spark performance dashboard on Kubernetes and developing a Helm chart\n- Produced a solution for direct measurement of I/O latency for Spark workloads reading from OCI or S3 storage. The results are of general interest for Spark users, notably including the Spark service at CERN\n- Developed methods to parallelize TensorFlow/Keras on Kubernetes using the new tf.distribute features of TensorFlow 2.0. These are of general interest for ML practitioners, notably including the users of CERN cloud services.", 
  "title": "Big Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure", 
  "type": "article", 
  "id": "3550777"
}