# JML-BugDB Dataset & JavaMLBugDetective Replication Package

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18161123.svg)](https://doi.org/10.5281/zenodo.18161123)
[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)

## Overview

This repository contains the complete replication package for the paper:

> **"Beyond the Gold Standard: Validating Evolutionary Context Modeling for Defect Prediction under Quantified Label Noise in Enterprise Java Systems"**
> 
> Turgay Taymaz & Kökten Ulaş Birant
> 
> *Information and Software Technology* (Under Review)

## Contents

```
zenodo_package/
├── README.md                    # This file
├── LICENSE                      # CC-BY-4.0 License
│
├── datasets/                    # JML-BugDB v1.0 Dataset
│   ├── kafka-dataset.csv        # Apache Kafka (69,840 instances)
│   ├── gson-dataset.csv         # Google Gson (6,278 instances)
│   └── commons-io-dataset.csv   # Apache Commons-IO (15,515 instances)
│
├── validation/                  # Manual Validation Results
│   ├── manual_validation_final.csv      # 398 manually reviewed samples
│   └── VALIDATION_METHODOLOGY.md        # Validation protocol description
│
├── results/                     # Experimental Results
│   ├── cross_project_results.csv        # Main experimental results
│   ├── ablation_study_results.csv       # Static vs Process vs Hybrid
│   └── feature_importance.csv           # Random Forest feature rankings
│
├── replication/                 # Replication Materials
│   ├── REPLICATION_GUIDE.md             # Step-by-step instructions
│   └── config.properties                # Configuration used in experiments
│
└── source_code/                 # Framework Source Code
    └── JavaMLBugDetective_Source_Snapshot.zip  # Complete source snapshot
```

## Dataset Statistics

| Project | Domain | Instances | Bug-Introducing (SZZ) |
|---------|--------|-----------|----------------------|
| Apache Kafka | Distributed Systems | 69,840 | 31.2% |
| Google Gson | JSON Library | 6,278 | 28.4% |
| Apache Commons-IO | I/O Utilities | 15,515 | 25.7% |
| **Total** | — | **91,633** | — |

**Total Commits Analyzed:** 25,480  
**Time Span:** 17 years (2002-2024)
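
The instance counts above can be checked against the CSV files once the package is unpacked. The following is a minimal sketch, assuming each file has a single header row followed by one instance per row; the helper names `count_instances` and `check_counts` are illustrative and not part of the package.

```python
import csv
from pathlib import Path

# Instance counts as listed in the Dataset Statistics table.
EXPECTED = {
    "kafka-dataset.csv": 69_840,
    "gson-dataset.csv": 6_278,
    "commons-io-dataset.csv": 15_515,
}


def count_instances(path: Path) -> int:
    """Count data rows, assuming one header row (an assumption about the files)."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the assumed header row
        return sum(1 for _ in reader)


def check_counts(dataset_dir: Path) -> dict[str, bool]:
    """For each expected file present in dataset_dir, report whether its count matches."""
    results = {}
    for name, expected in EXPECTED.items():
        path = dataset_dir / name
        if path.exists():
            results[name] = count_instances(path) == expected
    return results


print(sum(EXPECTED.values()))          # total across the three projects
print(check_counts(Path("datasets")))  # empty dict if the files are not present
```

Run from the package root so that the relative `datasets/` path resolves.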

## Label Quality Assessment

We manually validated a sample of 398 instances that the automated SZZ labeling had flagged as bug-introducing:

| Metric | Value | 95% CI |
|--------|-------|--------|
| Sample Size | 398 | — |
| True Positives | 116 (29.2%) | 24.7% - 33.8% |
| False Positives | 282 (70.8%) | — |
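
The README does not state which method produced the 95% confidence interval. As a hedged illustration, the sketch below computes a Wilson score interval from the counts in the table; the choice of Wilson is an assumption, so its bounds may differ slightly from the reported ones.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts taken from the Label Quality Assessment table.
true_positives, sample_size = 116, 398
low, high = wilson_interval(true_positives, sample_size)
print(f"precision = {true_positives / sample_size:.3f}")
print(f"95% CI (Wilson, assumed method) = [{low:.3f}, {high:.3f}]")
```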

### False Positive Breakdown
- Refactoring operations: 42.3%
- Feature additions: 18.1%
- Documentation/comments: 10.4%
- Other: 29.2%

## Feature Set (14 Dimensions)

### Static Metrics (7)
| Metric | Description |
|--------|-------------|
| WMC | Weighted Methods per Class |
| TCC | Tight Class Cohesion |
| RFC | Response For Class |
| LCOM | Lack of Cohesion in Methods |
| CBO | Coupling Between Objects |
| NCSS_CLASS | Non-Commenting Source Statements |
| CYCLO_SUM | Total Cyclomatic Complexity |

### Process Metrics (7)
| Metric | Description |
|--------|-------------|
| NR | Number of Revisions |
| NDEV | Number of Developers |
| AGE | File Age (days) |
| EXP | Developer Experience |
| LINES_ADDED | Lines Added in Commit |
| LINES_DELETED | Lines Deleted in Commit |
| HUNK_COUNT | Number of Diff Hunks |
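
The three model variants reported in this package (Static, Process, Hybrid) correspond to subsets of these 14 metrics. The sketch below shows one way to select each subset from a dataset row. It assumes the CSV column headers match the metric names in the tables above, which the README does not confirm; `FEATURE_SETS` and `select_features` are illustrative names.

```python
STATIC_METRICS = ["WMC", "TCC", "RFC", "LCOM", "CBO", "NCSS_CLASS", "CYCLO_SUM"]
PROCESS_METRICS = ["NR", "NDEV", "AGE", "EXP", "LINES_ADDED", "LINES_DELETED", "HUNK_COUNT"]

FEATURE_SETS = {
    "static": STATIC_METRICS,
    "process": PROCESS_METRICS,
    "hybrid": STATIC_METRICS + PROCESS_METRICS,
}


def select_features(row: dict, feature_set: str) -> list[float]:
    """Project one row (e.g. from csv.DictReader) onto the chosen feature set."""
    return [float(row[name]) for name in FEATURE_SETS[feature_set]]


# Demonstration with a made-up row; real rows would come from the dataset CSVs.
example_row = {name: "0" for name in FEATURE_SETS["hybrid"]}
print(len(select_features(example_row, "static")))
print(len(select_features(example_row, "hybrid")))
```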

## Key Results

### Cross-Project Validation (F1-Scores)
| Model | Kafka | Gson | Commons-IO |
|-------|-------|------|------------|
| Static-Only | 0.531 | 0.489 | 0.250 |
| Process-Only | 0.698 | 0.641 | 0.512 |
| **Hybrid** | **0.742** | **0.685** | **0.570** |

### Key Finding
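> Process metrics provide a **noise-tolerant** signal under high label noise: adding them to static metrics improves F1 by up to **128%** (Hybrid vs. Static-Only on Commons-IO, 0.570 vs. 0.250), and Process-Only alone beats Static-Only by up to about 105% (0.512 vs. 0.250).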
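
The percentage gains can be recomputed from the F1 table above. The sketch below does that arithmetic; the `F1` dictionary simply transcribes the table, and `relative_gain` is an illustrative helper, not part of the package.

```python
# F1-scores transcribed from the Cross-Project Validation table.
F1 = {
    "Static-Only": {"Kafka": 0.531, "Gson": 0.489, "Commons-IO": 0.250},
    "Process-Only": {"Kafka": 0.698, "Gson": 0.641, "Commons-IO": 0.512},
    "Hybrid": {"Kafka": 0.742, "Gson": 0.685, "Commons-IO": 0.570},
}


def relative_gain(model: str, baseline: str, project: str) -> float:
    """Relative F1 improvement of model over baseline, in percent."""
    base = F1[baseline][project]
    return (F1[model][project] - base) / base * 100


for project in F1["Static-Only"]:
    hybrid = relative_gain("Hybrid", "Static-Only", project)
    process = relative_gain("Process-Only", "Static-Only", project)
    print(f"{project}: Hybrid +{hybrid:.1f}%, Process-Only +{process:.1f}% over Static-Only")
```

The largest gains occur on Commons-IO, where the Static-Only baseline is weakest.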

## Usage

### Quick Start
```bash
# 1. Unzip source code
unzip source_code/JavaMLBugDetective_Source_Snapshot.zip -d JavaMLBugDetective

cd JavaMLBugDetective

# 2. Build framework
mvn clean package -DskipTests

# 3. Run analysis
./clean_and_run.sh
```

### Detailed Instructions
See `replication/REPLICATION_GUIDE.md` for complete step-by-step instructions.

## Citation

If you use this dataset or framework, please cite the accompanying paper. Citation details will be updated upon publication.

## License

- **Dataset (JML-BugDB):** [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)
- **Source Code:** [MIT License](https://opensource.org/licenses/MIT)

## Contact

- **Turgay Taymaz** (Corresponding Author) - turgay@taymaz.org
- **Kökten Ulaş Birant** - ulas.birant@deu.edu.tr

**Institution:** Dokuz Eylül University, Department of Computer Engineering, İzmir, Turkey

## Acknowledgments

This work was conducted as part of doctoral research at Dokuz Eylül University Graduate School of Natural and Applied Sciences.

---

*Package Version: 1.0*  
*Last Updated: January 2026*
