Dataset for the paper "Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset"
Authors/Creators
Description
We present a large-scale anomaly detection dataset collected from the IBM Cloud Console over approximately 4.5 months. This high-dimensional dataset captures telemetry data from multiple data centers and is designed to help researchers develop and benchmark anomaly detection methods in large-scale cloud environments. It contains 39,365 entries, each representing a 5-minute interval, with 117,448 features (the pivoted file has 117,449 columns, one of which, interval_start, serves as the index). The features cover request counts, HTTP response codes, and various aggregated statistics. The dataset also includes labeled anomaly events identified through IBM's internal monitoring tools, making it a resource for real-world anomaly detection research and evaluation.
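As a quick sanity check on the figures above, the following sketch converts the entry count into a time span and estimates the in-memory size of the pivoted table. The 8 bytes per cell is an assumption (every feature held as a 64-bit float); the description does not state the actual column types, so treat the memory figure as a rough upper-end guide only.

```python
from datetime import timedelta

# Figures taken from the dataset description.
N_INTERVALS = 39_365
N_FEATURES = 117_448
INTERVAL = timedelta(minutes=5)

# Total time covered by the 5-minute intervals.
span = N_INTERVALS * INTERVAL
span_days = span.total_seconds() / 86_400
span_months = span_days / 30.44  # average month length, an approximation

# ASSUMPTION: every feature is stored as a dense 64-bit float (8 bytes).
cells = N_INTERVALS * N_FEATURES
dense_bytes_float64 = cells * 8

print(f"{span_days:.1f} days (~{span_months:.1f} months)")
print(f"{cells:,} cells, ~{dense_bytes_float64 / 1e9:.1f} GB as dense float64")
```

Under that assumption the full pivoted table would need about 37 GB of RAM, so loading only a subset of columns at a time is likely the more practical route on a typical workstation.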
File Descriptions
- location_downtime.csv: Details planned and unplanned downtimes for IBM Cloud data centers, including start and end times in ISO 8601 format.
- unpivoted_data.parquet: Contains raw telemetry data with 413 million+ rows, covering details like location, HTTP status codes, request types, and aggregated statistics (min, max, median response times).
- anomaly_windows.csv: Ground truth for anomalies, listing start and end times of recorded anomalies, categorized by source (Issue Tracker, Instant Messenger, Test Log).
- pivoted_data_all.parquet: Pivoted version of the telemetry dataset with 39,365 rows and 117,449 columns, including aggregated statistics across multiple metrics and intervals.
- demo/demo.[ipynb|html]: Examples of how to access data in the Parquet files, available in Jupyter Notebook (.ipynb) and HTML (.html) formats, respectively.
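One common way to use the ground-truth file is to turn the anomaly windows into a 0/1 label for each 5-minute interval. The sketch below illustrates that idea with the standard library only. The column names `start` and `end`, the timestamp format, and the sample rows are all assumptions made for illustration; check the real header of anomaly_windows.csv (or the demo notebook) before adapting it.

```python
import csv
import io
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=5)  # interval length stated in the description


def load_windows(csv_file, start_col="start", end_col="end"):
    """Read (start, end) pairs from an open CSV file.

    ASSUMPTION: the column names and ISO-style timestamps are hypothetical.
    """
    reader = csv.DictReader(csv_file)
    return [
        (datetime.fromisoformat(row[start_col]), datetime.fromisoformat(row[end_col]))
        for row in reader
    ]


def label_intervals(interval_starts, windows):
    """Return 1 for each interval that overlaps any anomaly window, else 0."""
    labels = []
    for start in interval_starts:
        end = start + INTERVAL
        # Half-open overlap test: [start, end) intersects [w_start, w_end).
        hit = any(w_start < end and start < w_end for w_start, w_end in windows)
        labels.append(int(hit))
    return labels


# Synthetic example data, not taken from the dataset.
sample = io.StringIO(
    "start,end,source\n"
    "2024-01-01T00:07:00,2024-01-01T00:12:00,Issue Tracker\n"
)
windows = load_windows(sample)
starts = [datetime(2024, 1, 1, 0, 0) + i * INTERVAL for i in range(4)]
labels = label_intervals(starts, windows)
print(labels)
```

The linear scan over windows is fine for a short list of anomaly windows; for many windows, sorting them and using an interval index would scale better.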
Further details of the dataset can be found in Appendix B: Dataset Characteristics of the paper titled "Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset." Sample code for training anomaly detectors using this data is provided in this package.
When using the dataset, please cite it as follows:
@misc{islam2024anomaly,
  title={Anomaly Detection in Large-Scale Cloud Systems: An Industry Case and Dataset},
  author={Mohammad Saiful Islam and Mohamed Sami Rakha and William Pourmajidi and Janakan Sivaloganathan and John Steinbacher and Andriy Miranskyy},
  year={2024},
  eprint={2411.09047},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2411.09047}
}
Files
Total size: 5.5 GB
| MD5 checksum | Size |
|---|---|
| d87f9add1d127c66dbe6158ae56c2a64 | 1.5 kB |
| d5986896bcac642441a744ce7f725805 | 296.0 kB |
| 3e6974c149ca5ef56fb83cc14ea2e33f | 13.4 kB |
| 64ea8969d1158aca34f54cc5a20d87e7 | 6.1 kB |
| d2b515002dadd0a339f682f4b53c6ba1 | 2.7 GB |
| e4805ad9a870696771e0453c8763a026 | 2.7 GB |
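Because the two Parquet files are about 2.7 GB each, it is worth checking a download against its listed MD5 checksum before use. The following is a minimal sketch using Python's `hashlib`; the helper names `md5_of_file` and `verify_download` are hypothetical, and the listing here does not show which checksum belongs to which file, so pair them up from the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that multi-gigabyte files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

A call such as `verify_download("pivoted_data_all.parquet", "<checksum from the listing>")` would then return `True` for an intact download; the path is whatever local name the file was saved under.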
Additional details
Related works
- Is required by
  - Software: 10.5281/zenodo.14598119 (DOI)
- Is supplement to
  - Preprint: arXiv:2411.09047 (arXiv)