Published November 21, 2024 | Version v1
Dataset | Open Access

Exploring Fine-Grained Bug Report Categorization with Large Language Models and Prompt Engineering: An Empirical Study


Description

Replication Package for: Exploring Fine-Grained Bug Report Categorization with Large Language Models and Prompt Engineering: An Empirical Study

This repository contains the scripts and data needed to replicate the experiments reported in the paper.

Zenodo

The replication package is available at Zenodo.

Dependencies

The following dependencies are required to run the experiments:

  • Ollama: Ollama must be installed, and the models under test must be pulled locally (e.g., `ollama pull gemma2:9b-instruct-q4_0`) before executing the pipeline.

  • Anaconda / Miniconda: Download and configure Anaconda or Miniconda for managing Python environments.

Create a Python environment using the environment file:

conda env create -f environment.yml

How to Run

The main script to run the experiments is prompt.py. To execute the experiment:

conda activate llm-fine-grained-bug-categorization
python prompt.py -c config.ini

config.ini contains the configurations for the experiments. Below is an example of the configuration file:

[Model]
name = gemma2:9b-instruct-q4_0
temperature = 0
set_system = True
num_ctx = 8192
num_runs = 1
;diid = XERCESC-211
;host = http://localhost:11435

[Prompt]
reverse = True
type = 83215003

Configuration Explanation:

  • name: Specifies the model name. You can list multiple models separated by commas.

    name = llama3.1:8b-instruct-q4_0, llama3:8b-instruct-q4_0, qwen2:7b-instruct-q4_0, gemma:7b-instruct-v1.1-q4_0, gemma2:9b-instruct-q4_0, starling-lm:7b-beta-q4_0, aya:8b-23-q4_0
    
  • temperature: Defines the temperature for the LLM.

  • set_system: Flag that controls whether a system message is set. When enabled, the generate endpoint is invoked with the system message.

  • num_ctx: LLM context size.

  • num_runs: The number of runs to perform (currently only one run is supported).

  • diid: ID of a single bug report to run (for debugging; useful to test the prompt and LLM on one report).

  • host: URL of the Ollama server to send requests to (see the commented-out example in the configuration above).

  • reverse: Flag to reverse the execution order of the dataset.

  • type: Specifies the prompt type. The number corresponds to the prompt template in the prompts/ folder.
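As an illustration, the options above can be read with Python's standard configparser, and the model options map naturally onto a request payload for Ollama's /api/generate endpoint. The sketch below is not the actual prompt.py implementation; the build_generate_request helper and the prompt/system strings are hypothetical, and only the first comma-separated model is used.

```python
import configparser

EXAMPLE_CONFIG = """
[Model]
name = gemma2:9b-instruct-q4_0
temperature = 0
set_system = True
num_ctx = 8192
num_runs = 1

[Prompt]
reverse = True
type = 83215003
"""

def build_generate_request(cfg, prompt, system=None):
    """Build a JSON payload for Ollama's /api/generate endpoint.

    Only the first model listed under [Model] name is used here;
    prompt.py may instead iterate over all comma-separated models.
    """
    model = cfg.get("Model", "name").split(",")[0].strip()
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": cfg.getfloat("Model", "temperature"),
            "num_ctx": cfg.getint("Model", "num_ctx"),
        },
    }
    # set_system toggles whether a system message is sent with the request.
    if cfg.getboolean("Model", "set_system") and system is not None:
        payload["system"] = system
    return payload

cfg = configparser.ConfigParser()
cfg.read_string(EXAMPLE_CONFIG)
payload = build_generate_request(
    cfg, "Categorize this bug report ...", system="You are a bug triager."
)
print(payload["model"])               # gemma2:9b-instruct-q4_0
print(payload["options"]["num_ctx"])  # 8192
```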

Project Structure

  |--- README.md                           :  User guidance.
  |--- analysis                            :  Folder to save the extracted categories and classification metrics.
  |--- base                                :  Folder containing Ollama API implementation to interact with the LLM.
  |--- corrected_dataset                   :  Folder containing the corrected dataset of bug report categories, refined through validation methods to improve human annotations.
  |--- experimentResults                   :  Folder to save the LLM responses.
  |--- export                              :  Folder containing the dataset of 221,184 LLM-generated bug report categories, created using six LLMs, nine prompt types, and four output configurations for 1,024 bug reports, exported in HTML format.
  |--- prompts                             :  Folder with prompt templates.
  |--- reports                             :  Folder containing the bug report XML files (zipped with `.gz`).
  |--- all_non_sensical.xlsx               :  Contains all LLM responses for which no valid bug report category could be extracted.
  |--- all_non_sensical-odd-analysis.xlsx  :  Manual analysis of Out-of-Dictionary bug report categories generated by the LLMs.
  |--- dataset_additional_analysis.xlsx    :  Additional analysis of the Catolino dataset.
  |--- config.ini                          :  Example configuration file.
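The bug reports in reports/ are XML files compressed with `.gz`. A minimal sketch of decompressing and parsing one with the Python standard library follows; the file name, tag names, and content here are hypothetical, not the repository's actual schema.

```python
import gzip
import xml.etree.ElementTree as ET

def load_bug_report(path):
    """Decompress a .gz bug report and parse it as XML."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return ET.fromstring(fh.read())

# Hypothetical example: write a tiny gzipped report, then read it back.
example = "<bug><id>XERCESC-211</id><summary>Parser crash</summary></bug>"
with gzip.open("XERCESC-211.xml.gz", "wt", encoding="utf-8") as fh:
    fh.write(example)

root = load_bug_report("XERCESC-211.xml.gz")
print(root.findtext("id"))       # XERCESC-211
print(root.findtext("summary"))  # Parser crash
```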

Analysis Scripts

The following scripts are used to analyze the results of the experiments. You can run them to replicate various parts of the analysis:

  1. analyze.py: Calculates the classification metrics for bug report categorization.

  2. analyzeCore.py: Contains the core logic of the classification-metric analysis.

  3. gainLoss.py: Analyzes the gain and loss of unique categories across different models and configurations.

  4. plotDiscardDist.py: Analyzes LLM responses and generates a plot of the distribution of common output deviations.

  5. labelCorrectness.py: Analyzes the correctness of labels using agreements and disagreements from LLM-generated responses via voting.
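labelCorrectness.py aggregates LLM-generated responses by voting. The plain majority vote below is an illustrative sketch of that idea, not the script's actual agreement/disagreement analysis; the category names are made up.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label and whether a strict majority agrees.

    `labels` holds the category each LLM/configuration assigned to one
    bug report; ties are broken by first occurrence (Counter order).
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes > len(labels) / 2

# Six hypothetical model votes for one bug report.
votes = ["network", "network", "ui", "network", "performance", "network"]
label, agreed = majority_vote(votes)
print(label, agreed)  # network True
```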

 

Files

llm-fine-grained-bug-categorization.zip (487.4 MB)
md5:176c8a8cc8f0c9f601fbd98a203958cc