
Published February 27, 2026 | Version v5

# Automated Quantum Issue Labeling in Qiskit: Large Language Models and Fine-Tuned Transformers

This repository contains the implementation code for the paper **"Automated Quantum Issue Labeling in Qiskit: Large Language Models and Fine-Tuned Transformers"** submitted to EMSE (Empirical Software Engineering, Springer). The code includes experiments comparing fine-tuned transformer models (DistilBERT, RoBERTa), Large Language Models (GPT-4o Mini, GPT-5 Mini, GPT-5 Nano), and RAG-enhanced LLM pipelines on quantum computing software issue classification.

## Table of Contents
- [Study Overview](#study-overview)
- [Repository Structure](#repository-structure)
- [Installation](#installation)
- [Configuration](#configuration)
- [Usage](#usage)
- [Input Data Format](#input-data-format)
- [Output and Results](#output-and-results)
- [Analysis Tools](#analysis-tools)
- [GPU Acceleration](#gpu-acceleration-notes)
- [Computational Requirements](#computational-requirements)
- [Hyperparameters](#hyperparameters)
- [Troubleshooting](#troubleshooting)
- [Citation](#citation)
- [License](#license)

## Study Overview

This study compares the performance of multiple approaches for classifying GitHub issues in quantum computing repositories:

**Fine-tuned Transformer Models:**
- DistilBERT (F1=0.95)
- RoBERTa (F1=0.94)

**Large Language Models (direct prompting, 242-issue test set):**
- GPT-5 Mini — zero-shot (F1=0.77) and few-shot (F1=0.82)
- GPT-5 Nano — zero-shot (F1=0.59) and few-shot (F1=0.65)
- GPT-4o Mini — zero-shot (F1=0.62) and few-shot (F1=0.64)

**RAG-Enhanced LLM Pipelines (with threshold tuning, 721-issue test set):**
- Agentic RAG + GPT-4o Mini: zero-shot F1=0.606, few-shot F1=0.682
- Adaptive RAG + GPT-4o Mini: zero-shot F1=0.613, few-shot F1=0.744
- Adaptive RAG + GPT-5 Mini: F1=0.836
- Direct GPT-5 Mini few-shot (no RAG, same 721-issue test set): F1=0.843

We evaluate models on their ability to automatically classify Qiskit GitHub repository issues across 12 quantum module labels (e.g., `mod: circuit`, `mod: transpiler`).
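All F1 scores above are multi-label metrics. As a minimal illustration of how a micro-averaged F1 over label sets can be computed (the exact averaging variant used in the paper is an assumption here, and the two sample issues are toy data):

```python
# Micro-averaged F1 for multi-label predictions, computed from scratch.
# Each issue carries a *set* of module labels; counts are pooled across
# all issues before computing precision and recall.

def micro_f1(y_true, y_pred):
    """y_true, y_pred: lists of label sets, one set per issue."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

true_labels = [{"mod: circuit"}, {"mod: transpiler", "mod: visualization"}]
pred_labels = [{"mod: circuit"}, {"mod: transpiler"}]
print(round(micro_f1(true_labels, pred_labels), 3))  # one missed label lowers recall
```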

## Repository Structure

```
.
├── Data/
│   └── qiskit_repo_quantum_issues.json        # 2,415 labeled Qiskit issues

├── Analysis/
│   ├── predictions/
│   ├── results/
│   ├── bot_labeling_analysis.py
│   ├── config.py
│   ├── ml_baselines.py
│   ├── quantum_term_analysis.py
│   ├── statisticalanalysis.py
│   └── analysis_requirements.txt

├── Fine_tuned_Experiments/
│   ├── output/
│   ├── distilbert_final.py
│   ├── finetunedconfig.py
│   ├── roberta_distilbert_requirements.txt
│   └── roberta_final.py

├── Gpt_Experiments/
│   ├── scripts/
│   │   ├── gpt4o_mini_fewshot_gridsearch.py
│   │   ├── gpt4o_mini_zeroshot_gridsearch.py
│   │   ├── gpt_5_mini_fewshot_gridsearch.py
│   │   ├── gpt_5_mini_zeroshot_gridsearch.py
│   │   ├── gpt_5_nano_fewshot.py
│   │   └── gpt_5_nano_zeroshot.py
│   ├── config.py
│   └── gpt_requirements.txt

├── RAG_Experiments/
│   ├── code/
│   │   ├── 01_agentic_rag_zeroshot.py         # Agentic RAG, GPT-4o Mini, zero-shot
│   │   ├── 02_agentic_rag_fewshot.py          # Agentic RAG, GPT-4o Mini, few-shot
│   │   ├── 03_adaptive_rag_zeroshot.py        # Adaptive RAG, GPT-4o Mini, zero-shot
│   │   ├── 04_adaptive_rag_fewshot.py         # Adaptive RAG, GPT-4o Mini, few-shot
│   │   ├── 05_adaptive_rag_gpt5mini.py        # Adaptive RAG, GPT-5 Mini, few-shot
│   │   ├── 06_threshold_tuning.py             # Per-label threshold tuning utility
│   │   ├── 07_gpt5mini_fewshot_direct.py      # Direct GPT-5 Mini baseline (no RAG)
│   │   └── config.py                          # RAG-specific configuration
│   ├── predictions/                           # Saved prediction JSON files
│   ├── results/                               # Threshold tuning result JSONs
│   ├── rag_requirements.txt
│   └── README.md

├── finetuned_distilbert_results/
│   └── DistilBERT_Results/

├── finetuned_roberta_results/
│   └── RoBERTa_Results/

├── gpt_4o_mini_results/
│   ├── few_shot_results/
│   ├── visuals_insights/
│   └── zero_shot_results/

├── gpt_5_mini_results/
│   ├── few_shot_results/
│   └── zero_shot_results/

├── gpt_5_nano_results/
│   ├── few_shot_results/
│   └── zero_shot_results/

├── .gitignore
├── LICENSE
└── README.md
```

## Installation

### Requirements
- Python 3.8+
- CUDA-compatible GPU (recommended for fine-tuning experiments)
- PyTorch
- OpenAI API key (for GPT and RAG experiments)

### Setup

1. **Clone the repository:**
```bash
git clone [repository URL]
cd quantum-bug-labeling-main
```

2. **Create and activate a virtual environment:**
```bash
# For Windows
python -m venv venv
venv\Scripts\activate

# For macOS and Linux
python3 -m venv venv
source venv/bin/activate
```

3. **Install Fine-tuned experiments requirements:**
```bash
cd Fine_tuned_Experiments
pip install -r roberta_distilbert_requirements.txt
```

**Note**: The `roberta_distilbert_requirements.txt` includes:
- `torch>=1.12.0` — PyTorch deep learning framework
- `transformers>=4.30.0` — Hugging Face transformers
- `numpy>=1.24.0`, `pandas>=2.0.0` — Data handling
- `scikit-learn>=1.2.0` — Machine learning metrics
- `matplotlib>=3.7.0`, `seaborn>=0.12.0` — Visualizations
- `tqdm>=4.65.0` — Progress bars
- `datasets>=2.12.0` — Dataset utilities

4. **Install PyTorch (for Fine-tuned experiments):**

Option 1: CPU-only (simpler but slower for training)
```bash
pip install torch torchvision torchaudio
cd ..
```

Option 2: GPU with CUDA 11.8 (recommended)
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
cd ..
```

5. **Install GPT experiments requirements:**
```bash
cd Gpt_Experiments
pip install -r gpt_requirements.txt
cd ..
```

6. **Install RAG experiments requirements:**
```bash
cd RAG_Experiments
pip install -r rag_requirements.txt
cd ..
```

**Note**: The `rag_requirements.txt` includes:
- `openai>=1.13.3` — OpenAI API client (GPT-4o Mini + GPT-5 Mini + embeddings)
- `numpy>=1.24.0`, `pandas>=2.0.0` — Data handling
- `scikit-learn>=1.2.0` — NearestNeighbors retrieval index and metrics
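As a sketch of how a `NearestNeighbors` retrieval index like the one in these pipelines might work (the vectors below are toy stand-ins for the OpenAI embeddings the real scripts produce):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins for issue-text embeddings; in the actual pipeline these
# would come from the OpenAI embeddings API and be much higher-dimensional.
corpus_embeddings = np.array([
    [1.0, 0.0, 0.0],   # a circuit-related issue
    [0.0, 1.0, 0.0],   # a transpiler-related issue
    [0.9, 0.1, 0.0],   # another circuit-related issue
])

# Cosine distance is the usual similarity measure for text embeddings.
index = NearestNeighbors(n_neighbors=2, metric="cosine")
index.fit(corpus_embeddings)

query = np.array([[0.95, 0.05, 0.0]])  # embedding of a new, unlabeled issue
distances, neighbor_ids = index.kneighbors(query)
print(neighbor_ids[0])  # indices of the two most similar labeled issues
```

The retrieved neighbors' labels can then be passed to the LLM as context.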

7. **Install Analysis requirements:**
```bash
cd Analysis
pip install -r analysis_requirements.txt
cd ..
```

## Configuration

All data paths are **auto-resolved relative to the project root** — no manual path editing is needed as long as the repository structure is intact.

### API Keys (GPT and RAG experiments only)

```python
# Set your OpenAI API key in:
# - Gpt_Experiments/scripts/config.py   (for GPT experiments)
# - RAG_Experiments/code/config.py      (for RAG experiments)
OPENAI_API_KEY = "your-api-key-here"
```

### GitHub Token (optional, for bot analysis only)

```python
# Set in Analysis/config.py (only needed for bot_labeling_analysis.py):
GITHUB_TOKEN = "ghp_..."
```

## Usage

### Fine-tuned Experiments

```bash
cd Fine_tuned_Experiments
python distilbert_final.py
python roberta_final.py
```

### GPT Experiments

```bash
cd Gpt_Experiments/scripts
python gpt4o_mini_zeroshot_gridsearch.py   # GPT-4o Mini zero-shot
python gpt4o_mini_fewshot_gridsearch.py    # GPT-4o Mini few-shot
python gpt_5_mini_zeroshot_gridsearch.py   # GPT-5 Mini zero-shot
python gpt_5_mini_fewshot_gridsearch.py    # GPT-5 Mini few-shot
python gpt_5_nano_zeroshot.py              # GPT-5 Nano zero-shot
python gpt_5_nano_fewshot.py               # GPT-5 Nano few-shot
```

### RAG Experiments

Run from the project root directory. Scripts are self-contained and use `RAG_Experiments/code/config.py`.

**Step 1 — Run any RAG pipeline:**
```bash
python RAG_Experiments/code/01_agentic_rag_zeroshot.py
python RAG_Experiments/code/02_agentic_rag_fewshot.py
python RAG_Experiments/code/03_adaptive_rag_zeroshot.py
python RAG_Experiments/code/04_adaptive_rag_fewshot.py
python RAG_Experiments/code/05_adaptive_rag_gpt5mini.py
```

**Step 2 — Run per-label threshold tuning on saved predictions:**
```bash
python RAG_Experiments/code/06_threshold_tuning.py agentic_rag_zeroshot_predictions.json
python RAG_Experiments/code/06_threshold_tuning.py agentic_rag_fewshot_predictions.json
python RAG_Experiments/code/06_threshold_tuning.py adaptive_rag_zeroshot_predictions.json
python RAG_Experiments/code/06_threshold_tuning.py adaptive_rag_fewshot_predictions.json
python RAG_Experiments/code/06_threshold_tuning.py adaptive_rag_gpt5mini_predictions.json
```

**Step 3 — Run direct GPT-5 Mini baseline (no RAG, same 721-issue test set):**
```bash
python RAG_Experiments/code/07_gpt5mini_fewshot_direct.py
python RAG_Experiments/code/06_threshold_tuning.py gpt5mini_fewshot_direct_predictions.json
```

> **Note on runtime**: Each RAG script makes ~1–3 API calls per test issue (721 issues total). Expect 30–120 minutes per script depending on model and network latency.

### Analysis

```bash
cd Analysis
python ml_baselines.py              # Classical ML baselines (LR, SVM)
python statisticalanalysis.py       # McNemar's significance tests
python quantum_term_analysis.py     # Quantum terminology analysis
python bot_labeling_analysis.py     # Bot labeling analysis (requires GitHub token)
```

## Input Data Format

The experiments expect a JSON file with GitHub issues in the following format:

```json
[
  {
    "ID": "issue-123",
    "Title": "Fix barrier label position when bits are reversed",
    "Body": "Issue description text here...",
    "Labels": ["mod: visualization", "bug"]
  }
]
```

### Dataset Details
- **Source**: Qiskit GitHub repository issues
- **Total**: 2,415 issues across 12 quantum-specific categories
- **Labels**: `mod: algorithms`, `mod: circuit`, `mod: opflow`, `mod: primitives`, `mod: pulse`, `mod: qasm2`, `mod: qasm3`, `mod: qpy`, `mod: quantum info`, `mod: transpiler`, `mod: visualization`, `qamp`
- **Label type**: Multi-label (issues can have multiple labels)
- **Few-shot examples**: 13 curated examples excluded from all evaluation sets
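A minimal sketch of parsing an issue in this format into text plus a multi-hot label vector (field names follow the example above; the filtering rule mirrors the `mod: *`/`qamp` convention, but the exact preprocessing in the scripts may differ):

```python
import json

# The 12 module labels listed above, in a fixed order for multi-hot encoding.
QUANTUM_LABELS = [
    "mod: algorithms", "mod: circuit", "mod: opflow", "mod: primitives",
    "mod: pulse", "mod: qasm2", "mod: qasm3", "mod: qpy",
    "mod: quantum info", "mod: transpiler", "mod: visualization", "qamp",
]

def binarize(issue):
    """Map one issue dict to (text, multi-hot label vector).
    Non-module labels such as `bug` are dropped, following the convention
    that only `mod: *` labels and `qamp` count as categories."""
    kept = {l for l in issue["Labels"] if l.startswith("mod:") or l == "qamp"}
    text = issue["Title"] + "\n\n" + (issue.get("Body") or "")
    return text, [1 if label in kept else 0 for label in QUANTUM_LABELS]

sample = json.loads("""
{"ID": "issue-123",
 "Title": "Fix barrier label position when bits are reversed",
 "Body": "Issue description text here...",
 "Labels": ["mod: visualization", "bug"]}
""")
text, vector = binarize(sample)
print(vector)  # only `mod: visualization` survives the filter
```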

## Output and Results

### Fine-tuned Models
- `finetuned_distilbert_results/DistilBERT_Results/`
- `finetuned_roberta_results/RoBERTa_Results/`

### GPT Models
- `gpt_4o_mini_results/` — GPT-4o Mini results (grid search)
- `gpt_5_mini_results/` — GPT-5 Mini results
- `gpt_5_nano_results/` — GPT-5 Nano results

### RAG Experiments
- `RAG_Experiments/predictions/` — JSON predictions for all 5 RAG variants + direct baseline
- `RAG_Experiments/results/` — Threshold tuning results (global sweep + per-label)

### Analysis Results
- `Analysis/predictions/` — Model predictions (`.npy` files for McNemar's test)
- `Analysis/results/` — Baseline metrics, quantum term analysis, bot labeling analysis

## Analysis Tools

### ML Baselines (`ml_baselines.py`)
- Trains Logistic Regression and Linear SVM with TF-IDF features
- **Outputs**: `baseline_results.csv`, `per_category_baseline_results.csv`, prediction `.npy` files

### Statistical Analysis (`statisticalanalysis.py`)
- McNemar's tests on 242 held-out test issues
- **Outputs**: `predictions/mcnemar_results.json`
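A minimal sketch of the underlying McNemar computation on paired per-issue correctness vectors (the exact-binomial variant shown here is an assumption; `statisticalanalysis.py` may use a different implementation):

```python
from math import comb

def mcnemar_exact(model_a_correct, model_b_correct):
    """Exact McNemar test on paired correctness indicators (one per issue).
    b = issues only model A gets right, c = issues only model B gets right."""
    b = sum(1 for x, y in zip(model_a_correct, model_b_correct) if x and not y)
    c = sum(1 for x, y in zip(model_a_correct, model_b_correct) if not x and y)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree
    # Two-sided exact binomial p-value under H0: discordant flips are 50/50.
    k = min(b, c)
    p = sum(comb(n, i) for i in range(k + 1)) * 2 / 2 ** n
    return min(1.0, p)

# Toy example: 10 paired predictions; A alone is right on 3, B alone on 1.
a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(round(mcnemar_exact(a, b), 3))
```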

### Quantum Terminology Analysis (`quantum_term_analysis.py`)
- Validates quantum-specific nature of dataset (hybrid TF-IDF + documentation approach)
- **Outputs**: Console statistics, domain specificity ratios

### Bot Labeling Analysis (`bot_labeling_analysis.py`)
- Compares bot labeling patterns across 10 classical + 10 quantum repositories
- Requires GitHub API token in `config.py`
- **Outputs**: Excel report, visualization PNG

## GPU Acceleration Notes

Fine-tuned models benefit from GPU acceleration:
- **Memory**: At least 8 GB GPU memory recommended
- **Training time**: 60–90 min with GPU (vs. days on CPU)
- Install PyTorch with CUDA as shown in the Installation section

## Computational Requirements

### Fine-tuned Models
- GPU: 8 GB+ VRAM
- Training: ~60 min (DistilBERT), ~90 min (RoBERTa) on RTX 3080
- Disk: ~300 MB (DistilBERT), ~500 MB (RoBERTa)

### GPT Experiments
- API costs: see `Gpt_Experiments/` for per-configuration cost breakdown
- Runtime: 30–60 min per configuration; several hours for full grid search

### RAG Experiments
- API costs: ~$0.50–2.00 per full run (embeddings + GPT calls for 721 issues)
- Runtime: 30–120 min per script

### Analysis Scripts
- RAM: 4 GB+
- Runtime: minutes

## Hyperparameters

### Fine-tuned Models (best configurations)
- **DistilBERT**: lr=8e-5, epochs=10, batch=12, weight_decay=0.005, cosine schedule
- **RoBERTa**: lr=3e-5, epochs=18, batch=32, weight_decay=0.15, cosine schedule

### GPT Models (Grid Search)
- Temperature: 0.0–1.0 (0.1 increments), Top-p: [0.8, 0.9, 1.0], Seed: 42

### GPT-5 Models (Responses API)
- Reasoning effort: [minimal, medium, high], Verbosity: low, Max output tokens: 300

### RAG — Threshold Tuning
- Global sweep: τ ∈ [0.05, 0.95] step 0.05
- Per-label tuning: independently optimized for labels with support < 50
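The global sweep above can be sketched as follows (the score and truth arrays are toy stand-ins; the real `06_threshold_tuning.py` reads saved prediction JSONs, and the micro-F1 objective is an assumption):

```python
import numpy as np

def best_global_threshold(scores, truth, taus=None):
    """Sweep one decision threshold tau over [0.05, 0.95] in 0.05 steps
    and return the (micro-F1, tau) pair that maximizes F1."""
    if taus is None:
        taus = np.arange(0.05, 0.951, 0.05)
    best = (0.0, None)
    for tau in taus:
        pred = scores >= tau
        tp = np.logical_and(pred, truth).sum()
        fp = np.logical_and(pred, ~truth).sum()
        fn = np.logical_and(~pred, truth).sum()
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best[0]:
            best = (f1, tau)
    return best

# Toy per-label confidence scores for 4 issues x 2 labels.
scores = np.array([[0.9, 0.2], [0.6, 0.7], [0.3, 0.1], [0.8, 0.4]])
truth = np.array([[1, 0], [1, 1], [0, 0], [1, 0]], dtype=bool)
f1, tau = best_global_threshold(scores, truth)
print(round(float(f1), 3), round(float(tau), 2))
```

Per-label tuning repeats the same sweep independently on each low-support label's score column.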

## Troubleshooting

**OpenAI API Errors**
- `No API key found` → Set `OPENAI_API_KEY` in the relevant `config.py`
- Rate limit exceeded → Reduce concurrency or upgrade API tier
- `unsupported_parameter` with GPT-5 models → Scripts automatically retry without unsupported params

**GPU/CUDA Issues**
- `CUDA out of memory` → Reduce batch size in config or switch to CPU
- `CUDA not available` → Verify: `python -c "import torch; print(torch.cuda.is_available())"`

**Data Issues**
- JSON parsing errors → Ensure input file is valid UTF-8 JSON
- Missing labels in results → Confirm labels start with `mod:` or equal `qamp`

## Citation

If you use this work in your research, please cite:

```bibtex
@article{thatamsetty2026quantum,
  title={Automated Quantum Issue Labeling in Qiskit: Large Language Models and Fine-Tuned Transformers},
  author={Thatamsetty, Poojitha and Zhang, Lei},
  journal={Empirical Software Engineering},
  publisher={Springer},
  year={2026},
  note={Under review}
}
```

## Paper Status

This work has been submitted to **EMSE (Empirical Software Engineering, Springer)**. The replication package is publicly available on Zenodo (DOI: [10.5281/zenodo.17946234](https://doi.org/10.5281/zenodo.17946234)).

## Acknowledgments

This research is funded by the Strategic Awards for Research Transitions (START) at the University of Maryland, Baltimore County.

## Contact

For questions or issues, please open an issue in this repository or contact pthatam1@umbc.edu.

## License

This project is licensed under the MIT License — see the [LICENSE](LICENSE) file for details.

---

**Note**: Ensure your OpenAI API key is configured in the appropriate `config.py` before running GPT or RAG experiments. API keys and large model files are excluded from version control via `.gitignore`.
