Dataset (Open Access)

Antarex HPC Fault Dataset

Alessio Netti; Zeynep Kiziltan; Ozalp Babaoglu; Alina Sirbu; Andrea Bartolini; Andrea Borghesi


Citation Style Language JSON Export

{
  "publisher": "Zenodo", 
  "DOI": "10.5281/zenodo.2553224", 
  "language": "eng", 
  "title": "Antarex HPC Fault Dataset", 
  "issued": {
    "date-parts": [
      [
        2018, 
        10, 
        10
      ]
    ]
  }, 
  "abstract": "<p>The Antarex dataset contains trace data collected from the&nbsp;homonymous experimental HPC system located at ETH Zurich&nbsp;while it was subjected to fault injection,&nbsp;for the purpose of conducting machine learning-based fault detection studies for&nbsp;HPC systems. Acquiring our own dataset was made necessary&nbsp;by the fact that commercial HPC system operators are very&nbsp;reluctant to share trace data containing information about&nbsp;faults in their systems.</p>\n\n<p>In order to acquire data, we executed benchmark applications and at the same time injected faults in the system&nbsp;at specific times via dedicated programs, so as to trigger&nbsp;anomalies in the behaviour of the applications. A wide range of faults is covered in our dataset, from hardware faults, to misconfiguration faults, and finally to performance anomalies cause by interference from other processes.&nbsp;This was achieved through the FINJ fault injection tool, developed by the authors.</p>\n\n<p>The dataset contains two types of data: one type of&nbsp;data&nbsp;refers to a series of CSV files, each containing a set of system performance metrics sampled through the LDMS&nbsp;HPC monitoring framework. Another type refers to the log&nbsp;files detailing the status of the system (i.e., currently running&nbsp;benchmark applications or injected fault programs) at each&nbsp;time point in the dataset. Such a structure enables researchers&nbsp;to perform a wide range of studies on the dataset. Moreover,&nbsp;since we collected the dataset by streaming continuous data,&nbsp;any study based on it will easily be reproducible on a real&nbsp;HPC system, in an online way.&nbsp;The dataset is divided in two parts:&nbsp;the first&nbsp;includes&nbsp;only the CPU and memory-related benchmark applications&nbsp;and fault programs, while the second is strictly hard drive-related. We executed each part in both single-core and multi-core variants, resulting in a total of 4 dataset blocks for 32 days of data acquisition, and 20GB of uncompressed data.</p>\n\n<p>For a detailed analysis on&nbsp;the structure and features of the Antarex dataset, please refer to the&nbsp;research paper&nbsp;&quot;Online Fault Classification in HPC System through Machine Learning&quot;, by Netti et al. Additional details can be found in the research paper &quot;FINJ: a Fault Injection Tool for HPC System&quot; by Netti et al., whereas all source code can be found on the GitHub repository of the FINJ tool.</p>\n\n<p>When using this dataset, please cite the two reference papers above as follows:</p>\n\n<p>&quot;&nbsp;Netti A., Kiziltan Z., Babaoglu O., S&icirc;rbu A., Bartolini A., Borghesi A. (2019) FINJ: A Fault Injection Tool for HPC Systems. In: Mencagli G. et al. (eds) Euro-Par 2018: Parallel Processing Workshops. Euro-Par 2018. Lecture Notes in Computer Science, vol 11339. Springer, Cham&quot;</p>\n\n<p>&quot; Netti A., Kiziltan Z., Babaoglu O., S&icirc;rbu A., Bartolini A., Borghesi A. (2019) Online Fault Classification in HPC Systems through Machine Learning. arXiv:1810.11208&quot;</p>", 
  "author": [
    {
      "family": "Alessio Netti"
    }, 
    {
      "family": "Zeynep Kiziltan"
    }, 
    {
      "family": "Ozalp Babaoglu"
    }, 
    {
      "family": "Alina Sirbu"
    }, 
    {
      "family": "Andrea Bartolini"
    }, 
    {
      "family": "Andrea Borghesi"
    }
  ], 
  "id": "2553224", 
  "note": "The archive contains 4 directories, one for each block of the dataset - namely CPU/Memory and HDD, in single-core and multi-core variants. In each of these directories, you will find the following: a 7z archive containing the LDMS CSV files for each of the 7 used plugins; FINJ workloads and execution logs; the histograms for the durations and inter-arrival times of fault tasks in PDF format; launch scripts, if any. Source code for all of the injected fault programs and additional details can be found on the GitHub repository of the FINJ tool.", 
  "version": "1.0", 
  "type": "dataset", 
  "event": "International European Conference on Parallel Processing (Euro-Par 2019)"
}
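
As a rough illustration of how one block of the dataset might be consumed, the following Python sketch loads a single LDMS plugin CSV and attaches a fault label to each sample using the FINJ execution log. The file paths, column names, and log layout used here are assumptions for illustration only; the actual schema is documented in the reference papers and in the FINJ GitHub repository.

import pandas as pd

# Load one LDMS plugin CSV (hypothetical path; the per-sample timestamp
# column is assumed here to be named "#Time" and expressed in epoch seconds).
metrics = pd.read_csv("cpu_mem_singlecore/meminfo.csv")
metrics = metrics.rename(columns={"#Time": "timestamp"})

# Load the FINJ execution log listing the injected fault tasks
# (hypothetical columns: task_name, start_time, end_time, in epoch seconds).
faults = pd.read_csv("cpu_mem_singlecore/execution_log.csv")

def active_fault(ts, fault_df):
    """Return the fault program active at epoch timestamp ts, or 'none'."""
    hits = fault_df[(fault_df["start_time"] <= ts) & (ts <= fault_df["end_time"])]
    return hits["task_name"].iloc[0] if not hits.empty else "none"

# Attach a fault label to every metric sample, yielding a labelled table
# suitable for fault detection/classification experiments.
metrics["fault_label"] = metrics["timestamp"].apply(lambda ts: active_fault(ts, faults))
print(metrics[["timestamp", "fault_label"]].head())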
Statistics:

                    All versions    This version
Views                        711             321
Downloads                    275             109
Data volume             367.3 GB         99.6 GB
Unique views                 599             283
Unique downloads             177              83
