Process Behavior Corpus and Benchmarking Datasets
Description
A corpus of process behaviors and benchmarking datasets for semantics-aware process mining tasks.
Files:
- process_behavior_corpus.csv: the text corpus, which contains the behavior allowed by process models as sequences of activities (column string_traces).
- T_SAD.csv: A benchmark dataset generated from the corpus to assess the following task: Given a trace σ, decide whether σ is a valid execution of the underlying process, without knowing the behavior allowed in the process.
Each row contains a trace (column trace) with a corresponding label (column anomalous) indicating whether the trace represents a valid execution of the underlying process. The set of activities that can occur in the process is also given (column unique_activities).
- A_SAD.csv: A benchmark dataset generated from the corpus to assess the following task: Given an eventually-follows relation ef = a ≺ b of a trace σ, decide whether ef represents a valid execution order of the two activities a and b that are executed in a process, without knowing the behavior allowed in the process.
Each row contains an eventually-follows relation (column eventually_follows) with a corresponding label (column out_of_order) indicating whether the two activities of the relation were executed in an invalid order (TRUE) or in a valid order (FALSE) according to the underlying process model. The set of activities that can occur in the process is also given (column unique_activities).
- S_NAP.csv: A benchmark dataset generated from the corpus to assess the following task: Given an event log L and a prefix p_k of length k, with 1 < k, predict the next activity a_{k+1}.
Each row contains a trace prefix (column prefix) with a corresponding next activity (column next), that is, the activity that follows the last activity of the prefix in the trace from which the prefix was generated. The set of activities that can occur in the process is also given (column unique_activities).
- S-PMD.csv: A benchmark dataset generated from the corpus to assess the following tasks:
  - Given a set of possible activities (column unique_activities), generate a directly-follows graph (column dfg) that captures the trace semantics of the process model.
  - Given a set of possible activities (column unique_activities), generate a simple process tree (column pt) that captures the trace semantics of the process model.
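The relations these tasks are built on can be illustrated with a minimal Python sketch. It operates on in-memory lists of activity names and does not parse the CSV files, since the serialization of the trace, eventually_follows, prefix, and dfg columns is not described here. The function names and the example activities are hypothetical and are not taken from the dataset or the authors' code.

```python
from itertools import combinations


def eventually_follows(trace):
    """Return all pairs (a, b) such that b occurs somewhere after a in the trace."""
    return {(trace[i], trace[j]) for i, j in combinations(range(len(trace)), 2)}


def prefix_next_pairs(trace, min_len=2):
    """Return (prefix, next activity) pairs for every prefix length k >= min_len.

    min_len=2 mirrors the condition 1 < k stated for the next-activity task.
    """
    return [(trace[:k], trace[k]) for k in range(min_len, len(trace))]


def directly_follows(traces):
    """Return edges (a, b) where b immediately follows a in at least one trace."""
    return {(a, b) for t in traces for a, b in zip(t, t[1:])}


# Hypothetical example traces (activity names are invented for illustration).
trace = ["receive order", "check stock", "ship goods", "send invoice"]
other = ["receive order", "check stock", "reject order"]

ef_pairs = eventually_follows(trace)
nap_pairs = prefix_next_pairs(trace)
dfg_edges = directly_follows([trace, other])
```

Here `ef_pairs` contains ("receive order", "send invoice") but not the reversed pair, `nap_pairs` holds one entry per prefix of length 2 or 3, and `dfg_edges` is the edge set of a directly-follows graph over both traces.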
Reference and legal info:
The corpus and the benchmark datasets are generated using the SAP-SAM dataset:
Kampik, T., Warmuth, C., Sola, D., Schäfer, B., Axworthy, L., Ivarsson, E., Ouda, K., & Eickhoff, D. (2022). SAP Signavio Academic Models (0.5.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7012043
The SAP-SAM dataset is published with a specific license (see "Rights"), which, therefore, also applies to the data published in this record.
The datasets and associated evaluation experiments are described in this paper.
In this repository you find the code and raw results of evaluation experiments using various open-source LLMs to solve the tasks.
Files
Total size: 1.3 GB

| MD5 checksum | Size |
|---|---|
| 736c4072cd44a343648707afb6201e6c | 125.1 MB |
| 7ce5ded7f481ce248e9262fd64e89738 | 58.8 MB |
| f7507e3649bb453a5fdf8e3fa5334f81 | 11.2 MB |
| c3f9c377a6e86123864571050d882648 | 998.8 MB |
| e06b38f9c88bf5d5155714c3389523e2 | 155.5 MB |
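A downloaded file can be checked against its listed MD5 checksum with a short Python sketch like the one below. It uses only the standard library `hashlib` module. The helper names and the file path in the usage comment are hypothetical, and the listing above does not say which checksum belongs to which file.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with an expected value such as 'md5:<hex>' or '<hex>'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (the path is a placeholder for a locally downloaded file):
# ok = matches_checksum("downloads/some_file.csv", "md5:<checksum from the table>")
```

Reading in chunks matters here because the largest file is close to 1 GB.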