Dataset Open Access

Antarex HPC Fault Dataset

Alessio Netti; Zeynep Kiziltan; Ozalp Babaoglu; Alina Sirbu; Andrea Bartolini; Andrea Borghesi


JSON-LD (schema.org) Export

{
  "inLanguage": {
    "alternateName": "eng", 
    "@type": "Language", 
    "name": "English"
  }, 
  "description": "<p>The Antarex dataset contains trace data collected from the homonymous experimental HPC system located at ETH Zurich while it was subjected to fault injection, for the purpose of conducting machine learning-based fault detection studies for HPC systems. Acquiring our own dataset was made necessary by the fact that commercial HPC system operators are very reluctant to share trace data containing information about faults in their systems.</p>\n\n<p>In order to acquire data, we executed benchmark applications and at the same time injected faults in the system at specific times via dedicated programs, so as to trigger anomalies in the behaviour of the applications. A wide range of faults is covered in our dataset, from hardware faults, to misconfiguration faults, and finally to performance anomalies caused by interference from other processes. This was achieved through the FINJ fault injection tool, developed by the authors.</p>\n\n<p>The dataset contains two types of data. The first is a series of CSV files, each containing a set of system performance metrics sampled through the LDMS HPC monitoring framework. The second consists of log files detailing the status of the system (i.e., currently running benchmark applications or injected fault programs) at each time point in the dataset. Such a structure enables researchers to perform a wide range of studies on the dataset. Moreover, since we collected the dataset by streaming continuous data, any study based on it will easily be reproducible on a real HPC system, in an online way. The dataset is divided into two parts: the first includes only the CPU and memory-related benchmark applications and fault programs, while the second is strictly hard drive-related. We executed each part in both single-core and multi-core variants, resulting in a total of 4 dataset blocks for 32 days of data acquisition, and 20 GB of uncompressed data.</p>\n\n<p>For a detailed analysis of the structure and features of the Antarex dataset, please refer to the research paper &quot;Online Fault Classification in HPC Systems through Machine Learning&quot;, by Netti et al. Additional details can be found in the research paper &quot;FINJ: A Fault Injection Tool for HPC Systems&quot; by Netti et al., whereas all source code can be found on the GitHub repository of the FINJ tool.</p>\n\n<p>When using this dataset, please cite the two reference papers above as follows:</p>\n\n<p>&quot;Netti A., Kiziltan Z., Babaoglu O., S&icirc;rbu A., Bartolini A., Borghesi A. (2019) FINJ: A Fault Injection Tool for HPC Systems. In: Mencagli G. et al. (eds) Euro-Par 2018: Parallel Processing Workshops. Euro-Par 2018. Lecture Notes in Computer Science, vol 11339. Springer, Cham&quot;</p>\n\n<p>&quot;Netti A., Kiziltan Z., Babaoglu O., S&icirc;rbu A., Bartolini A., Borghesi A. (2019) Online Fault Classification in HPC Systems through Machine Learning. arXiv:1810.11208&quot;</p>", 
  "license": "https://creativecommons.org/licenses/by/4.0/legalcode", 
  "creator": [
    {
      "affiliation": "Department of Computer Science and Engineering, University of Bologna", 
      "@type": "Person", 
      "name": "Alessio Netti"
    }, 
    {
      "affiliation": "Department of Computer Science and Engineering, University of Bologna", 
      "@type": "Person", 
      "name": "Zeynep Kiziltan"
    }, 
    {
      "affiliation": "Department of Computer Science and Engineering, University of Bologna", 
      "@type": "Person", 
      "name": "Ozalp Babaoglu"
    }, 
    {
      "affiliation": "Department of Computer Science, University of Pisa", 
      "@type": "Person", 
      "name": "Alina Sirbu"
    }, 
    {
      "affiliation": "Department of Electrical, Electronic and Information Engineering, University of Bologna", 
      "@type": "Person", 
      "name": "Andrea Bartolini"
    }, 
    {
      "affiliation": "Department of Electrical, Electronic and Information Engineering, University of Bologna", 
      "@type": "Person", 
      "name": "Andrea Borghesi"
    }
  ], 
  "url": "https://zenodo.org/record/2553224", 
  "datePublished": "2018-10-10", 
  "keywords": [
    "High-performance computing", 
    "Exascale systems", 
    "Monitoring", 
    "Fault Detection", 
    "Machine Learning"
  ], 
  "version": "1.0", 
  "@type": "Dataset", 
  "contributor": [
    {
      "affiliation": "Department of Computer Science and Engineering, University of Bologna", 
      "@type": "Person", 
      "name": "Alessio Netti"
    }, 
    {
      "affiliation": "Department of Computer Science and Engineering, University of Bologna", 
      "@type": "Person", 
      "name": "Zeynep Kiziltan"
    }, 
    {
      "affiliation": "Department of Computer Science and Engineering, University of Bologna", 
      "@type": "Person", 
      "name": "Ozalp Babaoglu"
    }, 
    {
      "affiliation": "Department of Computer Science, University of Pisa", 
      "@type": "Person", 
      "name": "Alina Sirbu"
    }, 
    {
      "affiliation": "Department of Electrical, Electronic and Information Engineering, University of Bologna", 
      "@type": "Person", 
      "name": "Andrea Bartolini"
    }, 
    {
      "affiliation": "Department of Electrical, Electronic and Information Engineering, University of Bologna", 
      "@type": "Person", 
      "name": "Andrea Borghesi"
    }
  ], 
  "@context": "https://schema.org/", 
  "distribution": [
    {
      "contentUrl": "https://zenodo.org/api/files/c4ab6885-4b9e-4600-9884-f2cba9ab2b71/Antarex.zip", 
      "encodingFormat": "zip", 
      "@type": "DataDownload"
    }, 
    {
      "contentUrl": "https://zenodo.org/api/files/c4ab6885-4b9e-4600-9884-f2cba9ab2b71/Readme_Antarex_Dataset.pdf", 
      "encodingFormat": "pdf", 
      "@type": "DataDownload"
    }
  ], 
  "identifier": "https://doi.org/10.5281/zenodo.2553224", 
  "@id": "https://doi.org/10.5281/zenodo.2553224", 
  "workFeatured": {
    "alternateName": "Euro-Par 2019", 
    "@type": "Event", 
    "name": "International European Conference on Parallel Processing"
  }, 
  "name": "Antarex HPC Fault Dataset"
}
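
The export above can be read with any JSON parser. As a minimal sketch, the snippet below pulls the dataset name, DOI, creator names, and download URLs out of a trimmed copy of the record. The `summarize` helper and the `RECORD` constant are illustrative names, not part of Zenodo or schema.org, and the creator list is shortened to two entries for brevity.

```python
import json

# Trimmed copy of the JSON-LD record shown above; only the fields used
# below are kept, and the creator list is shortened for brevity.
RECORD = """
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "Antarex HPC Fault Dataset",
  "identifier": "https://doi.org/10.5281/zenodo.2553224",
  "creator": [
    {"@type": "Person", "name": "Alessio Netti"},
    {"@type": "Person", "name": "Zeynep Kiziltan"}
  ],
  "distribution": [
    {"@type": "DataDownload", "encodingFormat": "zip",
     "contentUrl": "https://zenodo.org/api/files/c4ab6885-4b9e-4600-9884-f2cba9ab2b71/Antarex.zip"},
    {"@type": "DataDownload", "encodingFormat": "pdf",
     "contentUrl": "https://zenodo.org/api/files/c4ab6885-4b9e-4600-9884-f2cba9ab2b71/Readme_Antarex_Dataset.pdf"}
  ]
}
"""


def summarize(record_text):
    """Hypothetical helper: extract a few fields from a Dataset record."""
    record = json.loads(record_text)
    if record.get("@type") != "Dataset":
        raise ValueError("not a schema.org Dataset record")
    return {
        "name": record["name"],
        "doi": record["identifier"],
        "creators": [person["name"] for person in record.get("creator", [])],
        # Map each encoding format (e.g. "zip") to its download URL.
        "downloads": {
            item["encodingFormat"]: item["contentUrl"]
            for item in record.get("distribution", [])
        },
    }


summary = summarize(RECORD)
for fmt, url in summary["downloads"].items():
    print(fmt, url)
```

Keying the downloads by `encodingFormat` works here because the record lists one file per format; a record with several files of the same format would need a list instead.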

                   All versions   This version
Views              710            321
Downloads          275            109
Data volume        367.3 GB       99.6 GB
Unique views       598            283
Unique downloads   177            83
