Published March 2, 2026 | Version 1.0
Dataset Restricted

MozillaJIT: A Dataset for Just-in-Time Defect Prediction and Culprit Localization Simulation in Mozilla

  • 1. Concordia University

Description

MozillaJIT is a Mozilla-specific just-in-time defect prediction dataset built from Mozilla Bugzilla and Mozilla’s autoland Mercurial repository. Each example represents a Bugzilla bug together with the net code change that landed for that bug. The extraction pipeline first collects Bugzilla bug metadata and exports autoland commit metadata. For each bug, it then identifies the newest contiguous block of commits whose messages begin with “Bug <id>”, skips backout/revert commits, and computes a single unified diff from the first parent of the oldest landing commit to the newest landing commit. The resulting joined dataset contains bug IDs, regression-link metadata (regressed_by, regressions), net diffs, cleaned commit messages, landing revisions, timestamps, and binary indicators for whether a bug is itself a regression and whether it later regressed another bug.
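The landing-block selection described above can be sketched roughly as follows. This is a minimal illustration, not the code from the companion repository: the commit field names (node, desc), the backout pattern, and the choice to let backout commits pass without breaking contiguity are all assumptions, as are the helper names newest_landing_block and net_diff_command.

```python
import re
from typing import Dict, List

# Assumed patterns; the real pipeline may match commit messages differently.
BUG_PREFIX = re.compile(r"^Bug (\d+)\b")
BACKOUT = re.compile(r"^(backed out|back out|backout|revert)", re.IGNORECASE)


def newest_landing_block(commits: List[Dict[str, str]], bug_id: int) -> List[Dict[str, str]]:
    """Return the newest contiguous run of commits for bug_id, oldest first.

    `commits` is assumed to be ordered oldest to newest, each with
    hypothetical keys "node" (revision hash) and "desc" (commit message).
    """
    block: List[Dict[str, str]] = []
    for commit in reversed(commits):  # walk from newest to oldest
        desc = commit["desc"]
        first_line = desc.splitlines()[0] if desc else ""
        if BACKOUT.match(first_line):
            continue  # assumption: backouts are skipped and do not end the block
        match = BUG_PREFIX.match(first_line)
        if match is not None and int(match.group(1)) == bug_id:
            block.append(commit)
        elif block:
            break  # an unrelated commit ends the contiguous block
    block.reverse()
    return block


def net_diff_command(block: List[Dict[str, str]]) -> List[str]:
    """Build an hg command diffing p1(oldest landing commit) against the newest."""
    oldest, newest = block[0]["node"], block[-1]["node"]
    return ["hg", "diff", "-r", f"p1({oldest})", "-r", newest]


if __name__ == "__main__":
    history = [
        {"node": "a", "desc": "Bug 1 - first attempt"},
        {"node": "b", "desc": "Backed out changeset a for test failures"},
        {"node": "c", "desc": "Bug 2 - unrelated change"},
        {"node": "d", "desc": "Bug 1 - part 1"},
        {"node": "f", "desc": "Bug 1 - part 2"},
    ]
    landing = newest_landing_block(history, 1)
    print([c["node"] for c in landing])
    print(net_diff_command(landing))
```

In this toy history the earlier attempt (a) is excluded because an unrelated commit separates it from the later landing, so the net diff covers only the final block.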

This Zenodo release includes both raw intermediate artifacts and derived model-ready files. The raw artifacts are all_bugs.jsonl (705,750 Bugzilla bugs) and all_commits.jsonl (822,595 autoland changesets). The main joined dataset, mozilla_jit_2022.jsonl, contains 78,338 bug-plus-diff examples with non-empty landed diffs. The release also includes jit_llm_struc_2022.jsonl, an LLM-ready version of the same 78,338 examples in which each unified diff is converted into a structured XML-like representation and paired with a binary regressor label. In this snapshot, 9,504 joined examples are labeled as regression bugs, 6,871 are labeled as regressor bugs, and 883 are both. Although two filenames retain a historical “2022” suffix, the current snapshot spans bug creation times from 2022-01-01T12:45:19+00:00 to 2025-10-28T15:38:48+00:00 and landing commit times from 2022-01-03T06:27:03+00:00 to 2025-12-27T15:37:21+00:00.
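For readers with file access, the label counts quoted above could be checked with a short tally like the one below. It is a sketch under assumptions: it derives the two indicators from the regressed_by and regressions fields named in the description (treating a non-empty list as a positive label), whereas the released files may instead store the indicators under their own keys. The function name tally_labels is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def tally_labels(lines: Iterable[str]) -> Counter:
    """Count regression / regressor / both labels over JSONL records."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumption: a non-empty link list implies the corresponding label.
        is_regression = bool(record.get("regressed_by"))
        is_regressor = bool(record.get("regressions"))
        counts["total"] += 1
        counts["regression"] += is_regression
        counts["regressor"] += is_regressor
        counts["both"] += is_regression and is_regressor
    return counts


if __name__ == "__main__":
    sample = [
        '{"regressed_by": [10], "regressions": []}',
        '{"regressed_by": [], "regressions": [11, 12]}',
        '{"regressed_by": [13], "regressions": [14]}',
        '{"regressed_by": [], "regressions": []}',
    ]
    print(dict(tally_labels(sample)))
```

Passing an open file handle for mozilla_jit_2022.jsonl in place of the sample list would tally the full dataset. From the figures reported above, inclusion-exclusion gives 9,504 + 6,871 − 883 = 15,492 examples carrying at least one of the two labels.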

MozillaJIT is intended for research on just-in-time defect prediction, regressor prediction, commit-risk modeling, LLM-based code-change classification, and debugging-policy simulation. The extraction and preparation scripts are available in the companion repository: https://github.com/Ali-Sayed-Salehi/jit-dp-llm. Relevant code is located under data_extraction/bugzilla, data_extraction/mercurial, data_extraction/data_preparation.py, and data_extraction/utils.py.

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

Additional details