Report Open Access

Big Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure

Michał Bień

JSON Export

  "files": [
    {
      "links": {
        "self": ""
      }, 
      "checksum": "md5:f4ba7f91816350a8bcd872a4ba138c71", 
      "bucket": "adbefc2f-1477-4832-9f51-8ecb01c08a1e", 
      "key": "Report_Michal_Bien.pdf", 
      "type": "pdf", 
      "size": 1943637
    }
  ], 
  "owners": [
  "doi": "10.5281/zenodo.3550777", 
  "stats": {
    "version_unique_downloads": 446.0, 
    "unique_views": 358.0, 
    "views": 384.0, 
    "version_views": 385.0, 
    "unique_downloads": 446.0, 
    "version_unique_views": 359.0, 
    "volume": 911565753.0, 
    "version_downloads": 469.0, 
    "downloads": 469.0, 
    "version_volume": 911565753.0
  }, 
  "links": {
    "doi": "", 
    "conceptdoi": "", 
    "bucket": "", 
    "conceptbadge": "", 
    "html": "", 
    "latest_html": "", 
    "badge": "", 
    "latest": ""
  }, 
  "conceptdoi": "10.5281/zenodo.3550776", 
  "created": "2019-11-22T13:32:52.899852+00:00", 
  "updated": "2020-01-20T17:32:34.849641+00:00", 
  "conceptrecid": "3550776", 
  "revision": 3, 
  "id": 3550777, 
  "metadata": {
    "access_right_category": "success", 
    "doi": "10.5281/zenodo.3550777", 
    "description": "<p>This work has successfully deployed two different use cases of interest for High Energy Physics using cloud resources:</p>\n<ul>\n<li>CMS Big data reduction: This use case consists of running a data reduction workload for physics data. The code and implementation were originally developed by CERN openlab in collaboration with CMS and Intel in 2017-2018. It aims to demonstrate the scalability of a data reduction workflow by processing ROOT files using Apache Spark.</li>\n<li>Spark DL Trigger: This use case consists of deploying a full data preparation and machine learning pipeline, from data ingestion (4.5 TB of ROOT data) to the training of a classifier using neural networks. This use case is implemented using Apache Spark and the Keras API, following previous work in collaboration with CERN openlab.</li>\n</ul>\n<p>Resources for this work have been deployed using Oracle Cloud Infrastructure (OCI). In particular, this project made it possible to complete:</p>\n<ul>\n<li>Setup of the project using Oracle Container Engine for Kubernetes and Oracle Cloud resources</li>\n<li>Troubleshooting of the oci-hdfs-connector to run Apache Spark at scale on OCI Object Storage</li>\n<li>Measurements of OCI Object Storage performance for the selected use cases</li>\n<li>Investigations and performance measurements of resource utilisation on Oracle Container Engine for Kubernetes (OKE) when running TensorFlow/Keras neural network model training at scale, using CPU resources and using GPU</li>\n</ul>\n<p>Notable results of this project:</p>\n<ul>\n<li>Produced several key improvements to the oci-hdfs-connector. The improvements are necessary to run the latest Spark version (Spark 2.4.x) on Oracle Cloud. The connector is distributed by Oracle with open source licensing, and the improvements will be fed back to Oracle.</li>\n<li>Improved instrumentation infrastructure for measuring Spark workloads on cloud resources, by streamlining the deployment of the Spark performance dashboard on Kubernetes and developing a Helm chart</li>\n<li>Produced a solution for direct measurement of I/O latency for Spark workloads reading from OCI or S3 storage. The results are of general interest for Spark users, notably including the Spark service at CERN.</li>\n<li>Developed methods to parallelize TensorFlow/Keras on Kubernetes using the new tf.distribute features of TensorFlow 2.0. These are of general interest for ML practitioners, notably including the users of CERN cloud services.</li>\n</ul>", 
    "license": {
      "id": "CC-BY-4.0"
    }, 
    "title": "Big Data Analysis and Machine Learning at Scale with Oracle Cloud Infrastructure", 
    "relations": {
      "version": [
        {
          "count": 1, 
          "index": 0, 
          "parent": {
            "pid_type": "recid", 
            "pid_value": "3550776"
          }, 
          "is_last": true, 
          "last_child": {
            "pid_type": "recid", 
            "pid_value": "3550777"
          }
        }
      ]
    }, 
    "communities": [
      {
        "id": "cernopenlab"
      }
    ], 
    "keywords": [
      "CERN openlab", 
      "summer student programme"
    ], 
    "publication_date": "2019-11-22", 
    "creators": [
      {
        "name": "Micha\u0142 Bie\u0144"
      }
    ], 
    "access_right": "open", 
    "resource_type": {
      "subtype": "report", 
      "type": "publication", 
      "title": "Report"
    }, 
    "related_identifiers": [
      {
        "scheme": "doi", 
        "identifier": "10.5281/zenodo.3550776", 
        "relation": "isVersionOf"
      }
    ]
  }
                  All versions   This version
Views             385            384
Downloads         469            469
Data volume       911.6 MB       911.6 MB
Unique views      359            358
Unique downloads  446            446

