Process Behavior Corpus and Benchmarking Datasets

Rebmann, Adrian; Schmidt, Fabian David; Glavaš, Goran; van der Aa, Han

doi:10.5281/zenodo.14273161

Published December 4, 2024 | Version v2

Dataset Open

A corpus of process behaviors and benchmarking datasets for semantics-aware process mining tasks.

Files:

process_behavior_corpus.csv: the text corpus, which contains the behavior allowed by process models as sequences of activities (column string_traces).
T_SAD.csv: A benchmark dataset generated from the corpus to assess the following task: Given a trace σ, decide if σ is a valid execution of the underlying process or not, without knowing the behavior allowed in the process.
Each row contains a trace (column trace) with a corresponding label (column anomalous) indicating whether the trace represents a valid execution of the underlying process. The set of activities that can occur in the process is also given (column unique_activities).
A_SAD.csv: A benchmark dataset generated from the corpus to assess the following task: Given an eventually-follows relation ef = a ≺ b of a trace σ, decide if ef represents a valid execution order of the two activities a and b that are executed in a process or not, without knowing the behavior allowed in the process.
Each row contains an eventually-follows relation (column eventually_follows) with a corresponding label (column out_of_order) indicating whether the two activities of the relation were executed in an invalid order (TRUE) or in a valid order (FALSE) according to the underlying process (model). The set of activities that can occur in the process is also given (column unique_activities).
S_NAP.csv: A benchmark dataset generated from the corpus to assess the following task: Given an event log L and a prefix p_k of length k, with 1 < k, predict the next activity a_{k+1}.
Each row contains a trace prefix (column prefix) with a corresponding next activity (column next) indicating the activity that should be performed next after the last activity of the prefix according to the trace from which the prefix was generated. The set of activities that can occur in the process is also given (column unique_activities).
S-PMD.csv: A benchmark dataset generated from the corpus to assess the following tasks:
- Given a set of possible activities (column unique_activities), generate a directly-follows graph (column dfg) that captures the trace semantics of the process model.
- Given a set of possible activities (column unique_activities), generate a simple process tree (column pt) that captures the trace semantics of the process model.
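
The task descriptions above rely on three notions derived from traces: eventually-follows relations (A_SAD), prefix and next-activity pairs (S_NAP), and directly-follows graphs (S-PMD). The following Python sketch illustrates these notions on a single trace. It assumes a trace is already available as a list of activity labels; how the CSV columns serialize traces is not specified here, so parsing is left out. The function names and the example activities are hypothetical, and this is not the code used to generate the datasets.

```python
# Illustrative sketch only: function names and the example trace are
# hypothetical, not taken from the datasets or the authors' code.

def eventually_follows(trace):
    """Return all pairs (a, b) where a occurs somewhere before b in the trace."""
    return {
        (trace[i], trace[j])
        for i in range(len(trace))
        for j in range(i + 1, len(trace))
    }


def prefix_next_pairs(trace, min_len=2):
    """Return (prefix, next activity) pairs for prefixes of length k >= min_len.

    min_len=2 mirrors the condition 1 < k in the S_NAP task description.
    """
    return [(trace[:k], trace[k]) for k in range(min_len, len(trace))]


def directly_follows_edges(traces):
    """Return the edge set of a directly-follows graph over a collection of traces."""
    edges = set()
    for trace in traces:
        # Pair each activity with its immediate successor.
        edges.update(zip(trace, trace[1:]))
    return edges


trace = ["receive order", "check stock", "ship goods", "send invoice"]

ef_pairs = eventually_follows(trace)
snap_pairs = prefix_next_pairs(trace)
dfg_edges = directly_follows_edges([trace])
```

For this example trace, ("receive order", "send invoice") is an eventually-follows pair but not a directly-follows edge, since two other activities occur in between.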

Reference and legal info:

The corpus and the benchmark datasets are generated using the SAP-SAM dataset:

Kampik, T., Warmuth, C., Sola, D., Schäfer, B., Axworthy, L., Ivarsson, E., Ouda, K., & Eickhoff, D. (2022). SAP Signavio Academic Models (0.5.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7012043

The SAP-SAM dataset is published with a specific license (see "Rights"), which, therefore, also applies to the data published in this record.

THE DATASETS AND ASSOCIATED EVALUATION EXPERIMENTS ARE DESCRIBED IN THIS PAPER.

IN THIS REPOSITORY YOU FIND THE CODE AND RAW RESULTS OF EVALUATION EXPERIMENTS USING VARIOUS OPEN SOURCE LLMs TO SOLVE THE TASKS.

Files

Files (1.3 GB)

Name	MD5	Size
A_SAD.csv	736c4072cd44a343648707afb6201e6c	125.1 MB
process_behavior_corpus.csv	7ce5ded7f481ce248e9262fd64e89738	58.8 MB
S-PMD.csv	f7507e3649bb453a5fdf8e3fa5334f81	11.2 MB
S_NAP.csv	c3f9c377a6e86123864571050d882648	998.8 MB
T_SAD.csv	e06b38f9c88bf5d5155714c3389523e2	155.5 MB
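
The MD5 checksums listed above can be used to check that a download is complete. The following Python sketch is one way to do so with the standard library. The checksum values are copied from the file listing; the function names and the idea of passing a local download directory are assumptions made for illustration. Files are read in chunks so that the larger files (S_NAP.csv is close to 1 GB) do not have to fit in memory.

```python
import hashlib
from pathlib import Path

# MD5 checksums as given in the file listing of this record.
EXPECTED_MD5 = {
    "A_SAD.csv": "736c4072cd44a343648707afb6201e6c",
    "process_behavior_corpus.csv": "7ce5ded7f481ce248e9262fd64e89738",
    "S-PMD.csv": "f7507e3649bb453a5fdf8e3fa5334f81",
    "S_NAP.csv": "c3f9c377a6e86123864571050d882648",
    "T_SAD.csv": "e06b38f9c88bf5d5155714c3389523e2",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory):
    """Return a dict mapping each file name to 'ok', 'mismatch', or 'missing'.

    `directory` is a hypothetical local folder holding the downloaded files.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling verify_downloads on the download folder returns one status per file, so a partial or corrupted download shows up as "mismatch" rather than failing later during parsing.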

