Enron Email Dataset

Description

This dataset provides an object-centric event log (OCEL) representation of the publicly available Enron email corpus. The OCEL format allows for a richer analysis of interconnected processes and objects, making it particularly suitable for advanced process mining techniques, communication pattern analysis, and social network exploration.

Object-Centric Event Log (OCEL) of the Enron Email Dataset

The event logs were generated from a pre-processed CSV version of the Enron emails using a custom Python script leveraging the PM4Py library. The script parses individual emails to extract key information, including:

  • Timestamps: Derived from the ‘Date’ field of emails, parsed into timezone-aware datetime objects.
  • Activities: Inferred from email subject prefixes (e.g., “Re:” becomes “Response”, “Fw:” becomes “Forwarding”, “Invitation:” becomes “Invitation”). Emails without recognized prefixes are assigned a “Default” activity.
  • Objects: Two primary object types are identified:
    • EMAILADDRESS: Extracted from ‘From’, ‘To’, and ‘Cc’ fields.
    • MESSAGEID: Extracted from the ‘Message-ID’, ‘In-Reply-To’, and ‘References’ fields. Each identifier is prefixed with “MID_” in the OCEL so that object identifiers stay unique across object types.
  • Attributes: Event attributes include the original cleaned subject and content of the email.
  • Relationships: Events (emails) are linked to EMAILADDRESS objects with qualifiers ‘FROM’, ‘TO’, or ‘CC’. Events are linked to MESSAGEID objects with qualifiers ‘MESSAGEID’ (for the email’s own ID), ‘INREPLYTO’, or ‘REFERENCES’ to trace conversational threads.
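The extraction steps listed above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the script used to build the dataset: the function names, the plain-dict event layout, the lower-casing of addresses, and the stripping of angle brackets from message IDs are all assumptions made for illustration. Only the three prefix mappings, the “Default” fallback, the “MID_” prefix, and the qualifier names come from the description.

```python
import re
from email.utils import getaddresses, parsedate_to_datetime

# Prefix -> activity. Only these three mappings are named in the description;
# the real script may recognize more prefixes.
PREFIX_TO_ACTIVITY = {
    "re:": "Response",
    "fw:": "Forwarding",
    "invitation:": "Invitation",
}

# Assumption: message IDs appear as <...> tokens; brackets are dropped here.
MESSAGE_ID_PATTERN = re.compile(r"<([^<>\s]+)>")


def infer_activity(subject):
    """Map a subject line to an activity label via its prefix."""
    lowered = (subject or "").strip().lower()
    for prefix, activity in PREFIX_TO_ACTIVITY.items():
        if lowered.startswith(prefix):
            return activity
    return "Default"


def extract_addresses(header_value):
    """Return the e-mail addresses found in a From/To/Cc header value."""
    if not header_value:
        return []
    return [addr.lower() for _, addr in getaddresses([header_value]) if addr]


def extract_message_ids(header_value):
    """Return message IDs from a header value, each prefixed with MID_."""
    return ["MID_" + mid for mid in MESSAGE_ID_PATTERN.findall(header_value or "")]


def email_to_event(headers):
    """Build a plain-dict event (hypothetical layout) from e-mail headers."""
    relationships = []
    for field, qualifier in (("From", "FROM"), ("To", "TO"), ("Cc", "CC")):
        for addr in extract_addresses(headers.get(field)):
            relationships.append(
                {"objectId": addr, "objectType": "EMAILADDRESS", "qualifier": qualifier}
            )
    for field, qualifier in (
        ("Message-ID", "MESSAGEID"),
        ("In-Reply-To", "INREPLYTO"),
        ("References", "REFERENCES"),
    ):
        for mid in extract_message_ids(headers.get(field)):
            relationships.append(
                {"objectId": mid, "objectType": "MESSAGEID", "qualifier": qualifier}
            )
    return {
        # Timezone-aware when the Date header carries a numeric UTC offset.
        "time": parsedate_to_datetime(headers["Date"]),
        "activity": infer_activity(headers.get("Subject")),
        "relationships": relationships,
    }
```

Calling email_to_event on a dict with ‘Date’, ‘From’, ‘To’, ‘Subject’, and ‘Message-ID’ keys returns one event with its timestamp, activity, and qualified object links. Turning such events into an actual OCEL (for example through PM4Py) is a separate step that this sketch does not cover.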

To accommodate various analytical needs and computational resources, the dataset is provided in three distinct checkpoints:

  1. Top 10,000 Emails: An OCEL generated from the first 10,000 emails processed.
  2. Top 100,000 Emails: An OCEL generated from the first 100,000 emails processed.
  3. All Emails: An OCEL generated from every email in the input emails.csv file that the script processed.
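One way such checkpoints could be produced in a single pass is sketched below. This is an assumption about the mechanics, not a description of the actual script: the function name and the labels “top_10000”, “top_100000”, and “all” are hypothetical, and the sketch only shows how prefixes of the processing order can be snapshotted at fixed counts.

```python
def checkpointed_prefixes(items, checkpoints=(10_000, 100_000)):
    """Yield (label, items_so_far) at each checkpoint size, then once for all items.

    The labels are hypothetical names chosen for this sketch.
    """
    seen = []
    pending = sorted(checkpoints)
    for item in items:
        seen.append(item)
        if pending and len(seen) == pending[0]:
            # Snapshot a copy so later appends do not alter this checkpoint.
            yield f"top_{pending.pop(0)}", list(seen)
    yield "all", list(seen)
```

Each yielded list could then be converted to an OCEL and written to its own file. Note that “first N emails” here means the first N in processing order, which depends on the row order of the input file.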

Each checkpoint is available in the .jsonocel format (OCEL 2.0 standard), ready for use with PM4Py and other OCEL-compatible process mining tools. This dataset can be valuable for researchers and practitioners seeking to apply object-centric process discovery, conformance checking, and enhancement techniques to a large, real-world communication log.
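As a rough illustration of working with a checkpoint file without any process mining library, the sketch below reads the JSON directly and derives two simple summaries. It assumes the files follow the OCEL 2.0 JSON layout, with a top-level “events” list whose entries carry “type” and a “relationships” list of “objectId”/“qualifier” pairs; the actual files should be inspected before relying on this. The function names are hypothetical.

```python
import json
from collections import Counter


def load_ocel_json(path):
    """Load a .jsonocel file as a plain Python dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_events_by_type(ocel):
    """Count events per event type (activity)."""
    return Counter(event["type"] for event in ocel.get("events", []))


def reply_edges(ocel):
    """Return (own_message_id, replied_to_message_id) pairs, one per reply link."""
    edges = []
    for event in ocel.get("events", []):
        by_qualifier = {}
        for rel in event.get("relationships", []):
            by_qualifier.setdefault(rel["qualifier"], []).append(rel["objectId"])
        for own in by_qualifier.get("MESSAGEID", []):
            for parent in by_qualifier.get("INREPLYTO", []):
                edges.append((own, parent))
    return edges
```

The pairs returned by reply_edges can serve as edges of a thread graph. With PM4Py installed, loading the same file through pm4py.read_ocel2_json is likely the more convenient route, provided the installed version supports OCEL 2.0.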

Keywords: Object-Centric Event Log, OCEL, Process Mining, Enron Dataset, Email Analysis, Communication Networks, Social Network Analysis, PM4Py