LRMs Reasoning Steps
Description
This dataset provides an object-centric event log (OCEL) detailing the reasoning processes of various Large Reasoning Models (LRMs) when tackling tasks from the PMLRM-Bench benchmark. The PMLRM-Bench is an extension of the PM-LLM-Benchmark, designed to evaluate both the correctness of LRM outputs and the robustness of their reasoning processes in the domain of process mining.
Link to the benchmark’s repository
Link to the benchmark’s (pre-print) paper
The OCEL is generated from the textual “chain-of-thought” outputs of LRMs. Each reasoning step within these traces has been extracted and classified by its type (e.g., Deductive Reasoning, Hypothesis Generation) and its effect on the overall reasoning correctness (Positive, Indifferent, or Negative). This classification was performed using a judge LLM (Gemini-2.5-Pro-Preview-03-25), as detailed in the source paper.
Structure of the OCEL
The event log is structured with the following object types and event attributes:
- Objects:
  - MOD: Represents a specific Large Reasoning Model evaluated in the benchmark.
  - QUE: Represents a unique question or prompt from the PM-LLM-Benchmark dataset that the LRM responded to.
  - MODQUE: Represents a unique instance of a specific model (MOD) answering a specific question (QUE).
- Events: Each event corresponds to a single reasoning step identified in the LRM’s output.
  - ocel:activity: Stores the classified reasoning step, combining its type (e.g., PR, DR, HG) and its effect (PE, IND, NE). For example, “Deductive Reasoning - PE”.
  - ocel:timestamp: A synthetically generated timestamp to preserve the order of reasoning steps within a trace.
  - text: Contains the actual text snippet from the LRM’s reasoning trace that corresponds to this specific step.
  - ocel:eid: A unique identifier for the event.
- Relations: Each event is linked to:
  - The MOD object that produced the reasoning step.
  - The QUE object that the reasoning step is addressing.
  - The MODQUE object representing the specific answer instance.
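To make this layout concrete, the sketch below shows how a single reasoning-step event and its object references could look once the log is loaded into Python; all identifiers and the text snippet are invented placeholders rather than values taken from the actual file.

```python
# Hypothetical illustration of one event and its related objects.
# Identifiers and the text snippet are invented for demonstration only.
event = {
    "ocel:eid": "e_000001",                       # unique event identifier
    "ocel:activity": "Deductive Reasoning - PE",  # step type combined with its effect
    "ocel:timestamp": "2025-01-01T00:00:03",      # synthetic timestamp preserving step order
    "text": "Since every trace starts with 'Register', the first activity must be ...",
}

related_objects = [
    {"ocel:oid": "model_A",             "ocel:type": "MOD"},     # the LRM that produced the step
    {"ocel:oid": "question_07",         "ocel:type": "QUE"},     # the benchmark question addressed
    {"ocel:oid": "model_A#question_07", "ocel:type": "MODQUE"},  # the specific answer instance
]

print(event["ocel:activity"], "->", [obj["ocel:oid"] for obj in related_objects])
```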
Purpose and Potential Use
This OCEL allows for in-depth analysis of LRM reasoning behaviors using process mining techniques. Researchers can explore:
- Common reasoning patterns across different models or question types.
- The sequence and frequency of various reasoning steps (e.g., how often Hypothesis Generation is followed by Validation).
- The impact of different reasoning strategies on task performance and correctness.
- Differences in reasoning approaches between high-performing and lower-performing LRMs.
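As a starting point for such analyses, the sketch below flattens the OCEL onto the MODQUE object type (so that each model-question instance becomes one case) and inspects step frequencies and the directly-follows relation with pm4py. It assumes a pm4py version with OCEL 2.0 JSON support and the file name given under File Information below; it is an illustrative sketch, not part of the dataset.

```python
import pm4py

# Load the OCEL 2.0 JSON log (file described in the File Information section).
ocel = pm4py.read_ocel2_json("reasoning_benchmark.jsonocel")

# Flatten onto the MODQUE object type: every model-question instance becomes a case,
# and its reasoning steps become the events of that case.
flat_log = pm4py.ocel_flattening(ocel, "MODQUE")

# Frequency of each classified reasoning step (type + effect).
print(flat_log["concept:name"].value_counts().head(10))

# Directly-follows relation over reasoning steps, e.g. to check how often a
# hypothesis-generation step is immediately followed by a validation step.
dfg, start_activities, end_activities = pm4py.discover_dfg(flat_log)
for (a, b), freq in sorted(dfg.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{a} -> {b}: {freq}")
```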
The dataset is intended to complement the research paper “Configuring Large Reasoning Models using Process Mining: A Benchmark and a Case Study” by Berti et al., providing the structured data used to analyze and benchmark LRM reasoning capabilities.
File Information
The dataset contains one file: reasoning_benchmark.jsonocel. This file is an object-centric event log formatted according to the OCEL 2.0 standard.
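Assuming a pm4py version with OCEL 2.0 JSON support, the file can be read and inspected roughly as follows (a minimal sketch, not an official loader):

```python
import pm4py

# Read the OCEL 2.0 JSON file into pm4py's object-centric event log structure.
ocel = pm4py.read_ocel2_json("reasoning_benchmark.jsonocel")

# The resulting OCEL object exposes events, objects, and event-to-object relations
# as pandas DataFrames.
print(ocel.events.head())                        # ocel:eid, ocel:activity, ocel:timestamp, text, ...
print(ocel.objects["ocel:type"].value_counts())  # number of MOD, QUE, and MODQUE objects
print(len(ocel.relations), "event-to-object relations")
```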
This dataset was generated using a Python script that parses the JSON files containing the classified reasoning steps (from the prel/final_abstract_steps folder mentioned in the script, which corresponds to the outputs of the reasoning trace extraction and classification pipeline described in Section 3.1 of the paper).
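The generation script itself is not included in this dataset. Purely to illustrate the kind of conversion involved, the sketch below reads hypothetical per-answer JSON files of classified steps and assigns synthetic, strictly increasing timestamps so that the original step order is preserved; the glob pattern and all field names (steps, type_effect, text, model, question) are assumptions, not the actual schema used by the authors.

```python
import glob
import json
from datetime import datetime, timedelta

import pandas as pd

# Hypothetical input: one JSON file per model-question instance, each holding a list of
# classified reasoning steps. All field names below are placeholders, not the real schema.
rows = []
base_time = datetime(2025, 1, 1)

for path in sorted(glob.glob("prel/final_abstract_steps/*.json")):
    with open(path, encoding="utf-8") as f:
        answer = json.load(f)
    for i, step in enumerate(answer["steps"]):
        rows.append({
            "ocel:eid": f"{answer['model']}|{answer['question']}|{i}",
            "ocel:activity": step["type_effect"],                # e.g. "Deductive Reasoning - PE"
            "ocel:timestamp": base_time + timedelta(seconds=i),  # synthetic, preserves step order
            "text": step["text"],
        })

events = pd.DataFrame(rows)
print(events.head())
```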