syslrn: Learning What to Monitor for Efficient Anomaly Detection [Dataset]
Authors/Creators
- NEC Laboratories Europe
Description
This repository includes the dataset for the paper:
D. Sanvito, G. Siracusano, S. Santhanam, R. Gonzalez, R. Bifulco
syslrn: Learning What to Monitor for Efficient Anomaly Detection
ACM EuroMLSys 2022
The dataset contains two directories at the root level:
- raw_dataset
- processed_dataset
Each folder in the raw_dataset directory contains the raw monitoring data used to generate the graph associated with a single experiment, together with additional metadata files.
Each folder in the processed_dataset directory contains the graph associated with a single experiment as a set of three CSV files: two for the graph edges (pid_childof_pid_df.csv and pid_speakswith_pid_df.csv) and one for the graph nodes (proc_df.csv).
A code snippet that parses a graph from the processed_dataset directory is provided below.
In both directories, the name of each sub-folder follows the schema [SCENARIO]_[W]wl/test_[TEST_ID], where:
- [SCENARIO] reports the target component of the failure injection (cinder_failure, neutron_failure, nova_failure), or ff for a failure-free execution
- [W] reports the number of concurrent workloads
- [TEST_ID] reports the ID of the specific failure scenario injected (the same ID selected by the OpenStack failure injection framework [1])
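The naming schema above can be decoded programmatically. The following is a minimal sketch and not part of the original dataset tooling: parse_experiment_name is a hypothetical helper, and it assumes that [W] is an integer and that path components are separated by forward slashes. [TEST_ID] is kept as a string because its exact format is not specified here.

```python
import re

# Pattern for "[SCENARIO]_[W]wl/test_[TEST_ID]". The scenario names are the
# ones listed above; treating [W] as an integer is an assumption.
_EXPERIMENT_RE = re.compile(
    r"(?:^|/)(?P<scenario>cinder_failure|neutron_failure|nova_failure|ff)"
    r"_(?P<workloads>\d+)wl/test_(?P<test_id>[^/]+)/?$"
)


def parse_experiment_name(path):
    """Split an experiment sub-folder path into its schema fields (hypothetical helper)."""
    match = _EXPERIMENT_RE.search(str(path))
    if match is None:
        raise ValueError("not an experiment folder: %r" % (path,))
    return {
        "scenario": match.group("scenario"),
        "failure_free": match.group("scenario") == "ff",
        "workloads": int(match.group("workloads")),
        "test_id": match.group("test_id"),  # kept as a string
    }


print(parse_experiment_name("processed_dataset/ff_1wl/test_1/"))
```

Anchoring the pattern at the end of the path lets the same helper work for sub-folders of both raw_dataset and processed_dataset.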
Each experiment includes the following data in the raw_dataset sub-folders:
- audit_raw_logs_[TEST_ID]/: raw audit monitoring data
- bpf_tools_[TEST_ID]/: raw eBPF tools monitoring data
- instance-[INSTANCE_ID]/: workload-specific metadata files, e.g. stdout/stderr (generated by the OpenStack failure injection framework [1])
- logs_workload_[TEST_ID]/: OpenStack application logs
- perf_tools_[TEST_ID]/: raw perf tools monitoring data
- audit_filtered_[TEST_ID].log: audit data pre-processed by ausearch (e.g. numerical entities are resolved to symbols)
- failure_[TEST_ID].info: metadata about the specific failure scenario (generated by the OpenStack failure injection framework [1])
- timestamps_[TEST_ID]: timing information
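As a quick sanity check after unpacking the archive, the list above can be turned into a set of glob patterns and compared against an experiment folder. This is only a sketch: raw_entry_patterns and missing_raw_entries are hypothetical helpers, and it is an assumption that the [TEST_ID] in the entry names matches the one in the folder name. Whether every experiment (for example a failure-free run) contains every entry is not stated here, so a non-empty result is not necessarily an error.

```python
from pathlib import Path


def raw_entry_patterns(test_id):
    """Glob patterns for the entries listed above, for a given [TEST_ID]."""
    return [
        "audit_raw_logs_%s" % test_id,
        "bpf_tools_%s" % test_id,
        "instance-*",  # [INSTANCE_ID] is not known in advance
        "logs_workload_%s" % test_id,
        "perf_tools_%s" % test_id,
        "audit_filtered_%s.log" % test_id,
        "failure_%s.info" % test_id,
        "timestamps_%s" % test_id,
    ]


def missing_raw_entries(experiment_dir, test_id):
    """Return the patterns that match nothing inside experiment_dir."""
    experiment_dir = Path(experiment_dir)
    return [
        pattern
        for pattern in raw_entry_patterns(test_id)
        if not any(experiment_dir.glob(pattern))
    ]
```

For example, missing_raw_entries('raw_dataset/ff_1wl/test_1', '1') would return the patterns with no match in that folder.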
[1] D. Cotroneo, L. De Simone, P. Liguori, R. Natella, N. Bidokhti - How Bad Can a Bug Get? An Empirical Analysis of Software Failures in the OpenStack Cloud Computing Platform [ACM ESEC/FSE 2019]
Example: parsing a graph from the processed_dataset directory
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

def parse_csv(path):
    # path must end with '/' because the file names are appended directly to it
    processes_df = pd.read_csv('%sproc_df.csv' % path, index_col=0).reset_index(drop=True)
    # Tag each edge list with its relation type before merging them
    speakswith_edges_df = pd.read_csv('%spid_speakswith_pid_df.csv' % path, index_col=0)
    speakswith_edges_df['type'] = 'speaksWith'
    childof_edges_df = pd.read_csv('%spid_childof_pid_df.csv' % path, index_col=0)
    childof_edges_df['type'] = 'childOf'
    return processes_df, pd.concat([speakswith_edges_df, childof_edges_df], ignore_index=True)

def make_graph(nodes_df, edges_df):
    # MultiGraph keeps parallel edges, so a pair of processes may carry both edge types
    G = nx.MultiGraph()
    for _, node in nodes_df.iterrows():
        # Store every column of the node row as a node attribute
        G.add_node(node['pid'], **node.to_dict())
    for _, edge in edges_df.iterrows():
        G.add_edge(edge['pid1'], edge['pid2'], type=edge['type'])
    return G

PATH = 'processed_dataset/ff_1wl/test_1/'
nodes_df, edges_df = parse_csv(PATH)
G = make_graph(nodes_df, edges_df)
nx.draw_networkx(G, node_size=10, with_labels=False)
plt.show()
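Once a graph is built, a first inspection step is often to count how many edges of each type it contains. The sketch below is an illustration rather than part of the original example: count_edge_types is a hypothetical helper that accepts any iterable of (u, v, attributes) triples, which is the shape returned by G.edges(data=True) on the graph built above. The stand-in triples are made up so that the snippet runs without the dataset.

```python
from collections import Counter


def count_edge_types(edges):
    """Count edges per 'type' attribute from (u, v, attributes) triples."""
    return Counter(attrs.get("type") for _, _, attrs in edges)


# With the graph built above: count_edge_types(G.edges(data=True))
# Stand-in triples (not taken from the dataset) so the sketch runs on its own:
example_edges = [
    (1, 2, {"type": "childOf"}),
    (1, 3, {"type": "childOf"}),
    (2, 3, {"type": "speaksWith"}),
]
print(count_edge_types(example_edges))
```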
If you use this dataset for your research, please cite the following paper:
@inproceedings{sanvito2022syslrn,
  title = {syslrn: Learning What to Monitor for Efficient Anomaly Detection},
  author = {Sanvito, Davide and Siracusano, Giuseppe and Santhanam, Sharan and Gonzalez, Roberto and Bifulco, Roberto},
  booktitle = {2nd European Workshop on Machine Learning and Systems (EuroMLSys '22)},
  year = {2022},
  address = {Rennes, France},
  publisher = {ACM},
  month = apr,
}
Files
(7.2 GB)
| Name | Size | MD5 |
|---|---|---|
| dataset.zip | 7.2 GB | 6461ead82f833a0c3e21c6252a075457 |
| | 7.3 kB | 34ba7dcd79d1c269a5da51e3f0d61fdd |
Additional details
Related works
- References: Conference paper, 10.1145/3517207.3526979 (DOI)