Published May 26, 2025 | Version v1
Dataset Open

Object-Centric Event Log (OCEL) of the Enron Email Dataset

Description

This dataset provides an object-centric event log (OCEL) representation of the publicly available Enron email corpus. The OCEL format allows for a richer analysis of interconnected processes and objects, making it particularly suitable for advanced process mining techniques, communication pattern analysis, and social network exploration.

The event logs were generated from a pre-processed CSV version of the Enron emails by a custom Python script built on the PM4Py library. The script parses each email and extracts the following information:

  • Timestamps: Derived from the 'Date' field of emails, parsed into timezone-aware datetime objects.
  • Activities: Inferred from email subject prefixes (e.g., "Re:" becomes "Response", "Fw:" becomes "Forwarding", "Invitation:" becomes "Invitation"). Emails without recognized prefixes are assigned a "Default" activity.
  • Objects: Two primary object types are identified:
    • EMAILADDRESS: Extracted from 'From', 'To', and 'Cc' fields.
    • MESSAGEID: Extracted from 'Message-ID', 'In-Reply-To', and 'References' fields, prefixed with "MID_" in the OCEL to ensure unique object identifiers across types.
  • Attributes: Event attributes include the original cleaned subject and content of the email.
  • Relationships: Events (emails) are linked to EMAILADDRESS objects with qualifiers 'FROM', 'TO', or 'CC'. Events are linked to MESSAGEID objects with qualifiers 'MESSAGEID' (for the email's own ID), 'INREPLYTO', or 'REFERENCES' to trace conversational threads.
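
The extraction rules listed above can be sketched roughly as follows. This is an illustrative reconstruction, not the original script: the function names, the case-insensitive prefix matching, the comma splitting of address fields, and the stripping of angle brackets from message IDs are all assumptions. Only the three prefix mappings named in the description are included.

```python
from email.utils import parsedate_to_datetime

# Prefix-to-activity mappings named in the description (matched case-insensitively here, an assumption).
PREFIX_ACTIVITIES = (
    ("re:", "Response"),
    ("fw:", "Forwarding"),
    ("invitation:", "Invitation"),
)


def infer_activity(subject):
    """Map a subject line to an activity label, falling back to "Default"."""
    normalized = subject.strip().lower()
    for prefix, activity in PREFIX_ACTIVITIES:
        if normalized.startswith(prefix):
            return activity
    return "Default"


def build_event(headers):
    """Turn a dict of email headers into an event-like dict (hypothetical shape)."""
    relationships = []

    # EMAILADDRESS objects, qualified by the header they came from.
    for field, qualifier in (("From", "FROM"), ("To", "TO"), ("Cc", "CC")):
        for address in headers.get(field, "").split(","):
            address = address.strip()
            if address:
                relationships.append((address, qualifier))

    # MESSAGEID objects, prefixed with "MID_" so they cannot collide with addresses.
    for field, qualifier in (
        ("Message-ID", "MESSAGEID"),
        ("In-Reply-To", "INREPLYTO"),
        ("References", "REFERENCES"),
    ):
        for message_id in headers.get(field, "").split():
            relationships.append(("MID_" + message_id.strip("<>"), qualifier))

    return {
        "activity": infer_activity(headers.get("Subject", "")),
        # A 'Date' header with a numeric UTC offset parses to a timezone-aware datetime.
        "time": parsedate_to_datetime(headers["Date"]),
        "relationships": relationships,
    }
```

A reply with one sender, two recipients, and a two-entry 'References' header would yield seven relationships under this sketch: three EMAILADDRESS links and four MESSAGEID links.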

To accommodate various analytical needs and computational resources, the dataset is provided in three distinct checkpoints:

  1. Top 10,000 Emails: An OCEL generated from the first 10,000 emails processed.
  2. Top 100,000 Emails: An OCEL generated from the first 100,000 emails processed.
  3. All Emails: An OCEL generated from every email that the script processed from the input emails.csv file.

Each checkpoint is available in the .jsonocel format (OCEL 2.0 standard), ready for use with PM4Py and other OCEL-compatible process mining tools. This dataset can be valuable for researchers and practitioners seeking to apply object-centric process discovery, conformance checking, and enhancement techniques to a large, real-world communication log.
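
As a quick sanity check before loading a checkpoint into a process mining tool, the sketch below tallies activities and relationship qualifiers using only the Python standard library. It assumes the files follow the OCEL 2.0 JSON layout, with a top-level "events" list whose entries carry a "type" and a "relationships" list of objects with a "qualifier" key. The function names are hypothetical. PM4Py users would normally load a checkpoint through the library's own OCEL 2.0 JSON reader instead.

```python
import json
from collections import Counter


def summarize_ocel2(document):
    """Count event types and relationship qualifiers in a parsed OCEL 2.0 JSON document."""
    activities = Counter()
    qualifiers = Counter()
    for event in document.get("events", []):
        activities[event.get("type")] += 1
        for relationship in event.get("relationships", []):
            qualifiers[relationship.get("qualifier")] += 1
    return activities, qualifiers


def summarize_file(path):
    """Load a checkpoint from disk and summarize it (reads the whole file into memory)."""
    with open(path, encoding="utf-8") as handle:
        return summarize_ocel2(json.load(handle))
```

Because `summarize_file` parses the entire file at once, it is best tried on the smallest checkpoint first. If the description holds, the qualifier counts should contain only 'FROM', 'TO', 'CC', 'MESSAGEID', 'INREPLYTO', and 'REFERENCES'.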

Keywords: Object-Centric Event Log, OCEL, Process Mining, Enron Dataset, Email Analysis, Communication Networks, Social Network Analysis, PM4Py

Files

Files (1.4 GB)

Size      Checksum
1.1 GB    md5:7e39e9121770c2f9afedf92f184bf2fc
19.4 MB   md5:86b9a2e8e03f56d8b4dca615b8cf0fb8
250.5 MB  md5:2c02725803dcf45363860dea609fab20
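
The md5 values listed with the files can be used to verify a download. The following is a minimal sketch using Python's standard hashlib module. The function names are hypothetical, and the local file path is whatever name the downloaded checkpoint was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or as bare hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

Calling `matches_listed_checksum` with the local path and one of the listed "md5:..." strings returns True only when the file arrived intact.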