{
  "DOI": "10.5281/zenodo.2553224",
  "abstract": "The Antarex dataset contains trace data collected from the\u00a0homonymous experimental HPC system located at ETH Zurich\u00a0while it was subjected to fault injection,\u00a0for the purpose of conducting machine learning-based fault detection studies for\u00a0HPC systems. Acquiring our own dataset was made necessary\u00a0by the fact that commercial HPC system operators are very\u00a0reluctant to share trace data containing information about\u00a0faults in their systems.\n\n\nIn order to acquire data, we executed benchmark applications and at the same time injected faults in the system\u00a0at specific times via dedicated programs, so as to trigger\u00a0anomalies in the behaviour of the applications. A wide range of faults is covered in our dataset, from hardware faults, to misconfiguration faults, and finally to performance anomalies cause by interference from other processes.\u00a0This was achieved through the FINJ fault injection tool, developed by the authors.\n\n\nThe dataset contains two types of data: one type of\u00a0data\u00a0refers to a series of CSV files, each containing a set of system performance metrics sampled through the LDMS\u00a0HPC monitoring framework. Another type refers to the log\u00a0files detailing the status of the system (i.e., currently running\u00a0benchmark applications or injected fault programs) at each\u00a0time point in the dataset. Such a structure enables researchers\u00a0to perform a wide range of studies on the dataset. Moreover,\u00a0since we collected the dataset by streaming continuous data,\u00a0any study based on it will easily be reproducible on a real\u00a0HPC system, in an online way.\u00a0The dataset is divided in two parts:\u00a0the first\u00a0includes\u00a0only the CPU and memory-related benchmark applications\u00a0and fault programs, while the second is strictly hard drive-related. We executed each part in both single-core and multi-core variants, resulting in a total of 4 dataset blocks for 32 days of data acquisition, and 20GB of uncompressed data.\n\n\nFor a detailed analysis on\u00a0the structure and features of the Antarex dataset, please refer to the\u00a0research paper\u00a0\"Online Fault Classification in HPC System through Machine Learning\", by Netti et al. Additional details can be found in the research paper \"FINJ: a Fault Injection Tool for HPC System\" by Netti et al., whereas all source code can be found on the GitHub repository of the FINJ tool.\n\n\nWhen using this dataset, please cite the two reference papers above as follows:\n\n\n\"\u00a0Netti A., Kiziltan Z., Babaoglu O., S\u00eerbu A., Bartolini A., Borghesi A. (2019) FINJ: A Fault Injection Tool for HPC Systems. In: Mencagli G. et al. (eds) Euro-Par 2018: Parallel Processing Workshops. Euro-Par 2018. Lecture Notes in Computer Science, vol 11339. Springer, Cham\"\n\n\n\" Netti A., Kiziltan Z., Babaoglu O., S\u00eerbu A., Bartolini A., Borghesi A. (2019) Online Fault Classification in HPC Systems through Machine Learning. arXiv:1810.11208\"",
  "author": [
    {
      "family": "Alessio Netti"
    },
    {
      "family": "Zeynep Kiziltan"
    },
    {
      "family": "Ozalp Babaoglu"
    },
    {
      "family": "Alina Sirbu"
    },
    {
      "family": "Andrea Bartolini"
    },
    {
      "family": "Andrea Borghesi"
    }
  ],
  "event": "International European Conference on Parallel Processing (Euro-Par 2019)",
  "id": "2553224",
  "issued": {
    "date-parts": [
      [
        "2018",
        "10",
        "10"
      ]
    ]
  },
  "language": "eng",
  "publisher": "Zenodo",
  "title": "Antarex HPC Fault Dataset",
  "type": "dataset",
  "version": "1.0"
}