Published January 21, 2026 | Version 1.1
Dataset Open

Process Model Forecasting Datasets

  • 1. KU Leuven

Description

This repository provides benchmark datasets for process model forecasting and directly-follows relation (DF) time series analysis. The data are designed to bridge the process mining and time series forecasting communities by transforming event logs into structured multivariate temporal sequences with explicit control-flow and temporal dependency semantics.

The dataset is organized into four hierarchical levels:

1. raw_logs (XES format)

This directory contains the original event logs from publicly available process mining benchmarks, stored in standard XES format. Each dataset is accompanied by its original DOI (see below). No modification is applied at this level.

2. processed_logs (XES format)

This directory contains pre-processed versions of the raw event logs, following a standardized three-stage pipeline:

  1. variant filtering and case completion augmentation (including start and end activities),

  2. temporal trimming to remove initial warm-up and terminal cool-down periods, and

  3. extraction of directly-follows (DF) relations and their aggregation for time series construction.

These steps reduce noise, stabilize temporal dynamics, and preserve the core process behavior for forecasting. The pre-processing procedure, parameter settings, and assumptions are described in detail in the corresponding research article (https://doi.org/10.1007/s44311-025-00031-7).
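To make steps 1 and 3 concrete, the following is a minimal sketch on a toy in-memory log. It is not the pipeline from the article: the log layout, the artificial `<start>`/`<end>` labels, the `min_count` threshold, and all function names are assumptions made for illustration, and temporal trimming (step 2) is omitted.

```python
from collections import Counter
from datetime import datetime

# Hypothetical toy log: case id -> list of (activity, timestamp) events.
toy_log = {
    "c1": [("register", datetime(2024, 1, 1, 9)),
           ("check", datetime(2024, 1, 1, 11)),
           ("pay", datetime(2024, 1, 2, 10))],
    "c2": [("register", datetime(2024, 1, 1, 10)),
           ("check", datetime(2024, 1, 2, 9)),
           ("pay", datetime(2024, 1, 2, 15))],
    "c3": [("register", datetime(2024, 1, 2, 8)),
           ("pay", datetime(2024, 1, 3, 8))],
}

# Assumed labels for the artificial start and end activities.
START, END = "<start>", "<end>"


def variant_of(events):
    """The variant of a case is its ordered sequence of activity names."""
    return tuple(activity for activity, _ in events)


def filter_variants(log, min_count):
    """Keep only cases whose variant occurs at least min_count times."""
    counts = Counter(variant_of(events) for events in log.values())
    return {cid: events for cid, events in log.items()
            if counts[variant_of(events)] >= min_count}


def add_start_end(events):
    """Wrap a time-ordered case in artificial start and end activities."""
    return [(START, events[0][1])] + events + [(END, events[-1][1])]


def directly_follows(log):
    """Return ((a, b), timestamp) for every pair of consecutive events.

    Each pair is stamped with the timestamp of its second event.
    """
    pairs = []
    for events in log.values():
        ordered = sorted(events, key=lambda event: event[1])
        augmented = add_start_end(ordered)
        for (a, _), (b, ts) in zip(augmented, augmented[1:]):
            pairs.append(((a, b), ts))
    return pairs


filtered = filter_variants(toy_log, min_count=2)
df_pairs = directly_follows(filtered)
df_counts = Counter(pair for pair, _ in df_pairs)
```

Stamping each DF pair with the timestamp of its second event is one possible convention; the article should be consulted for the convention actually used.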

3. time_series (Parquet format)

This directory contains multivariate time series constructed from the directly-follows (DF) relations extracted from the processed event logs. Each variable is the daily aggregated frequency of one DF relation, so each multivariate series captures the temporal evolution of the process interaction structure and is suitable for sequence modeling and forecasting. All files are stored in Parquet format for efficient storage and large-scale processing.
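As an illustration of the daily aggregation described above, the sketch below turns a handful of DF observations into one count series per relation over a shared daily index. The observation layout and the function name `daily_df_series` are hypothetical; the Parquet files themselves would typically be read with a Parquet-capable library (for example `pandas.read_parquet`), which is not shown here.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical DF observations: (source activity, target activity, day).
observations = [
    ("register", "check", date(2024, 1, 1)),
    ("register", "check", date(2024, 1, 1)),
    ("check", "pay", date(2024, 1, 2)),
    ("register", "check", date(2024, 1, 3)),
]


def daily_df_series(observations):
    """Aggregate DF observations into one daily count series per relation."""
    counts = Counter(((a, b), day) for a, b, day in observations)
    relations = sorted({(a, b) for a, b, _ in observations})
    first = min(day for _, _, day in observations)
    last = max(day for _, _, day in observations)
    # Continuous daily index, so days without observations become zeros.
    days = [first + timedelta(days=i) for i in range((last - first).days + 1)]
    series = {rel: [counts[(rel, day)] for day in days] for rel in relations}
    return days, series


days, series = daily_df_series(observations)
```

Each key of `series` corresponds to one variable of the multivariate time series, and all variables share the same daily index.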

4. metadata (JSON format)

This directory contains structured metadata and configuration files describing the event logs, preprocessing pipeline, and resulting DF-based time series. It includes:

  • Event log statistics (number of cases, events, activities, variants, time span, DF relations before and after preprocessing)

  • Preprocessing parameters (variant filtering threshold, trimming percentages, start/end activity insertion, aggregation frequency)

  • Time series metadata (time range, sampling frequency, and dimensionality)

  • Descriptive statistics of each DF time series (mean, standard deviation, minimum, maximum, and total counts)
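The descriptive statistics listed above can be computed per DF series along the following lines. This is a sketch only: the JSON key names, the use of the population standard deviation, and the toy series are assumptions and do not reflect the actual schema of the metadata files.

```python
import json
import statistics


def describe_series(values):
    """Summarize one DF time series (a list of daily counts)."""
    return {
        "mean": statistics.fmean(values),
        # Population standard deviation; the dataset may use the sample form.
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
        "total": sum(values),
    }


# Hypothetical DF series keyed by a readable relation label.
toy_series = {"register->check": [2, 0, 1], "check->pay": [0, 1, 0]}

summary = {name: describe_series(vals) for name, vals in toy_series.items()}

# Hypothetical metadata layout, serialized as JSON.
metadata_json = json.dumps(
    {
        "sampling_frequency": "daily",
        "dimensionality": len(toy_series),
        "df_statistics": summary,
    },
    indent=2,
)
```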

Included Benchmark Event Logs

Research Applications

  • Process model and control-flow forecasting

  • Directly-follows dynamics and concept drift detection

  • Multivariate time series modeling of business processes

Files

Files (58.4 MB)

Name: pmf_data_v1.1.zip
Size: 58.4 MB
Checksum: md5:b0ddbd3a7964b8c8d3692f3817988174
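After downloading the archive, its integrity can be checked against the MD5 checksum listed above. The following is a minimal sketch using Python's standard `hashlib` module; the helper names and the local file path passed to them are assumptions.

```python
import hashlib

# Checksum as listed for pmf_data_v1.1.zip in this record.
EXPECTED_MD5 = "b0ddbd3a7964b8c8d3692f3817988174"


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at path matches the expected checksum."""
    return file_md5(path) == expected
```

For example, `verify_download("pmf_data_v1.1.zip")` should return `True` for an intact download saved under that name in the working directory.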

Additional details

Related works

Is supplement to
Publication: 10.1007/s44311-025-00031-7 (DOI)