Published December 4, 2024 | Version v2
Dataset Open

Process Behavior Corpus and Benchmarking Datasets

Description

A corpus of process behaviors and benchmarking datasets for semantics-aware process mining tasks.

Files:

  • process_behavior_corpus.csv: the text corpus, which contains the behavior allowed by process models as sequences of activities (column string_traces).
  • T_SAD.csv: A benchmark dataset generated from the corpus to assess the following task: Given a trace σ, decide if σ is a valid execution of the underlying process or not, without knowing the behavior allowed in the process.
    Each row contains a trace (column trace) with a corresponding label (column anomalous) indicating whether the trace represents a valid execution of the underlying process. The set of activities that can occur in the process is also given (column unique_activities).
  • A_SAD.csv: A benchmark dataset generated from the corpus to assess the following task: Given an eventually-follows relation ef = a ≺ b of a trace σ, decide if ef represents a valid execution order of the two activities a and b that are executed in a process or not, without knowing the behavior allowed in the process.
    Each row contains an eventually-follows relation (column eventually_follows) with a corresponding label (column out_of_order) indicating whether the two activities of the relation were executed in an invalid order (TRUE) or in a valid order (FALSE) according to the underlying process model. The set of activities that can occur in the process is also given (column unique_activities).
  • S_NAP.csv: A benchmark dataset generated from the corpus to assess the following task: Given an event log L and a prefix p_k of length k, with 1 < k, predict the next activity a_{k+1}.
    Each row contains a trace prefix (column prefix) with a corresponding next activity (column next), i.e., the activity that follows the last activity of the prefix in the trace from which the prefix was generated. The set of activities that can occur in the process is also given (column unique_activities).
  • S-PMD.csv: A benchmark dataset generated from the corpus to assess the following tasks:
    • Given a set of possible activities (column unique_activities), generate a directly-follows graph (column dfg) that captures the trace semantics of the process model.
    • Given a set of possible activities (column unique_activities), generate a simple process tree (column pt) that captures the trace semantics of the process model.
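
The notions used in the task descriptions above (eventually-follows relations, prefixes with a next activity, and directly-follows graphs) can be illustrated with a minimal Python sketch. It is not the procedure used to generate the benchmark files; the function names and the example activities are hypothetical, and a trace is assumed to be a plain list of activity labels, which may differ from how the CSV columns are serialized.

```python
from itertools import combinations


def eventually_follows(trace):
    """Return all pairs (a, b) such that a occurs somewhere before b in the trace."""
    # combinations() keeps positional order, so each pair (a, b) has a before b.
    return set(combinations(trace, 2))


def prefixes_with_next(trace, min_k=2):
    """Return (prefix, next_activity) pairs for every prefix length k >= min_k.

    min_k=2 mirrors the condition 1 < k stated for S_NAP (an assumption about
    how that condition is meant to be applied).
    """
    return [(trace[:k], trace[k]) for k in range(min_k, len(trace))]


def directly_follows(traces):
    """Return the edge set of a directly-follows graph over a collection of traces."""
    edges = set()
    for trace in traces:
        # zip(trace, trace[1:]) pairs each activity with its immediate successor.
        edges.update(zip(trace, trace[1:]))
    return edges


# Hypothetical example trace; the labels are not taken from the corpus.
example = ["receive order", "check stock", "ship goods", "send invoice"]

ef_pairs = eventually_follows(example)
nap_rows = prefixes_with_next(example)
dfg_edges = directly_follows([example])
```

For the example trace, ("receive order", "send invoice") is an eventually-follows pair but not a directly-follows edge, which is the distinction between the A_SAD relation and the dfg column.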

Reference and legal info:

The corpus and the benchmark datasets are generated using the SAP-SAM dataset:

Kampik, T., Warmuth, C., Sola, D., Schäfer, B., Axworthy, L., Ivarsson, E., Ouda, K., & Eickhoff, D. (2022). SAP Signavio Academic Models (0.5.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.7012043

The SAP-SAM dataset is published with a specific license (see "Rights"), which, therefore, also applies to the data published in this record.

THE DATASETS AND ASSOCIATED EVALUATION EXPERIMENTS ARE DESCRIBED IN THIS PAPER.

IN THIS REPOSITORY YOU FIND THE CODE AND RAW RESULTS OF EVALUATION EXPERIMENTS USING VARIOUS OPEN SOURCE LLMs TO SOLVE THE TASKS.

Files

Files (1.3 GB)

Checksum                              Size
md5:736c4072cd44a343648707afb6201e6c  125.1 MB
md5:7ce5ded7f481ce248e9262fd64e89738  58.8 MB
md5:f7507e3649bb453a5fdf8e3fa5334f81  11.2 MB
md5:c3f9c377a6e86123864571050d882648  998.8 MB
md5:e06b38f9c88bf5d5155714c3389523e2  155.5 MB
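
The MD5 checksums listed above can be used to check a download for corruption. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and since the listing above does not state which checksum belongs to which file, the matching value has to be taken from the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks.

    Chunked reading avoids loading a file of several hundred MB into memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.removeprefix("md5:").strip().lower()
    return md5_of_file(path) == expected_hex
```

A call such as matches_checksum("A_SAD.csv", "md5:...") returns True only if the local file is byte-identical to the published one; the local path here is an assumption about where the file was saved.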