Artifact of SSBSE '25 Challenge: HotCat: Green and Effective Feature Selection toward Hotfix Bug Taxonomy
Authors/Creators
Description
This is the artefact accompanying the paper; it contains the code and data for the clustering of hotfixes. See the README below for further details.
Code of Clustering
In the `clustering` folder, the base code is cloned from https://github.com/rashadulrakib/short-text-clustering-enhancement.git.
`clustering.py` is an edited version of its `main.py`; `kmeans_sim.py` is taken from PatchCat.
README
The README file contains all the additional data and exact information. We reproduce its content below.
# HotCat: Green and Effective Feature Selection for HotFix Bug Taxonomy
This repository contains the artefact of the paper, covering the clustering of hotfixes as part of the SSBSE 2025 Challenge Track.
### 💡 Research Context
We followed the recommendation in the SSBSE 2025 Challenge Track:
> *We encourage approaches that combine Search-Based Software Engineering (SBSE) with Large Language Models (LLMs) to enhance effectiveness across any of these tasks.*
>
### 📊 Benchmark Dataset
- The benchmark dataset we used is available on GitHub:
[HotBugs (SSBSE 2025 Challenge)](https://github.com/carolhanna01/HotBugs-dot-jar/tree/v1-ssbse25challenge)
### 🔑 Code of Clustering
- Located in the `clustering/` folder.
- We utilise **previously published** clustering algorithms, connecting several of them together.
- The base code is cloned from: [short-text-clustering-enhancement](https://github.com/rashadulrakib/short-text-clustering-enhancement.git).
- `clustering.py` is an edited version of `main.py`.
- `kmeans_sim.py` is taken from [**PatchCat**](https://doi.org/10.5281/zenodo.15834984).
## Table of Contents
- [1. Prompt Template](#1️⃣-prompt-template)
- [2. Data](#2️⃣-data)
- [3. Bug Taxonomy](#3️⃣-bug-taxonomy)
- [4. Installation Instructions](#4️⃣-installation-instructions)
## 1️⃣ Prompt Template
Summarisation prompt templates: the following prompts are used as part of our methodology.
> **System Prompt — Setting the LLM’s Role**
>
> ```
> You are a summarization engine for SE engine with no human reading it.
> Output ONLY the summary as plain text.
> No preamble, no explanations, no phrases like 'Here is a summary'.
> ```
---
> **User Prompt (Template)**
>
> ```
> Summarize the following text using around {max_words} words.
> Please do not start with phrases like:
> - "Here is a summary of the text in around ..."
> - "This code ..."
> - "The summary is ..."
> - "The text appears to be ..."
> - "This text is about ..."
>
> Output only the summary, nothing else.
>
> Text:
> {record}
> ```
In the user prompt template, `{max_words}` is substituted with the target length (e.g., 15 or 25), and `{record}` is replaced by the concatenated hotfix record.
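As an illustration, the substitution can be sketched in Python. This is a hypothetical helper, not the artefact's actual code, and the template is abbreviated to its first and last lines:

```python
# Hypothetical sketch of filling the user prompt template; the artefact's
# actual code may assemble the prompt differently.

USER_TEMPLATE = (
    "Summarize the following text using around {max_words} words.\n"
    "Output only the summary, nothing else.\n\n"
    "Text:\n{record}"
)

def build_user_prompt(record: str, max_words: int = 25) -> str:
    """Substitute the two placeholders for one concatenated hotfix record."""
    return USER_TEMPLATE.format(max_words=max_words, record=record)

print(build_user_prompt("NullPointerException in parser; fixed null check.", 15))
```

The resulting string is sent together with the system prompt to the LLM, one hotfix record at a time.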
## 2️⃣ Data
### Dataset, Projected Dataset and Chromosome Definition
We utilise the SSBSE Hotfix dataset [HotBugs](https://github.com/carolhanna01/HotBugs-dot-jar/tree/v1-ssbse25challenge).
We use [PatchCat](https://zenodo.org/records/15834984) for assessing a selection of features to describe a patch.
The original dataset contains, for each hotfix instance, the corresponding project name, a reference to the initiating Jira ticket, the targeted Java version, build configuration details, the assigned bug category, and the rationale provided by the dataset creators for classifying the issue as a hotfix. We extended this to 18 distinct features, collecting the additional information from multiple repository files to construct our dataset.
The enriched dataset comprises **155 records** (final dataset size), each with **18 features** (bitmask vector size, listed in the table below) across **17 categories** (cluster classes, see Bug Taxonomy in 3️⃣).
In general, the 18 features cover four main groups:
- **Project Info.**
Hotfix project metadata, such as library details and setup information.
- **Hotfix Classification**
Manually annotated ground truth labelling the hotfix type with additional context (e.g., rationale for inclusion in the dataset).
- **Bug Information**
Extracted from the hotfix’s Jira report: number of contributors involved in bug resolution, time-to-fix, comments, and priority.
- **Code Information**
Code-level changes recorded in GitHub commits related to the bug, along with stack traces from associated test results.
The following table summarises the projected dataset columns that are actually passed into **PatchCat**:
| Index | Column Name | Description |
|-------|-------------------|-----------------------------------------------------------------------------|
| 1 | `setup` | Build and environment setup details |
| 2 | `details` | Additional metadata or contextual details |
| 3 | `customer-facing` | Boolean/flag for whether the bug was visible to customers |
| 4 | `hotfix-reason` | Reason provided for marking the bug as a hotfix |
| 5 | `bug.title` | Title of the bug report |
| 6 | `bug.description` | Full description of the bug |
| 7 | `bug.summary` | Short summary of the bug |
| 8 | `bug.priority` | Priority level assigned to the bug |
| 9 | `bug.resolution` | Resolution field (e.g., fixed, won’t fix, duplicate) |
| 10 | `bug.type` | Type/category of the bug (e.g., crash, UI, performance) |
| 11 | `bug.votes` | Number of votes the bug received |
| 12 | `bug.watches` | Number of users watching the bug |
| 13 | `bug.component` | Affected component/module of the software |
| 14 | `bug.duration` | Time duration metric associated with the bug |
| 15 | `bug.comments` | Number of comments in the bug discussion |
| 16 | `bug.user_count` | Number of users impacted by the bug |
| 17 | `developer-patch` | Developer-provided patch details |
| **18** | `test-results` | Associated test results verifying the patch |
For example: a configuration of `000000000100000010` means that the original feature set is projected onto only `bug.type` and `developer-patch`.
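Decoding such a chromosome into the selected column names takes a few lines of Python. This is a sketch only: the column list follows the table above, and the function name `decode_chromosome` is hypothetical, not from the artefact code:

```python
# Columns in table order (indices 1-18); a '1' bit selects the column.
COLUMNS = [
    "setup", "details", "customer-facing", "hotfix-reason",
    "bug.title", "bug.description", "bug.summary", "bug.priority",
    "bug.resolution", "bug.type", "bug.votes", "bug.watches",
    "bug.component", "bug.duration", "bug.comments", "bug.user_count",
    "developer-patch", "test-results",
]

def decode_chromosome(bits: str) -> list:
    """Map an 18-bit selection mask onto the projected column names."""
    assert len(bits) == len(COLUMNS), "chromosome must have exactly 18 bits"
    return [name for bit, name in zip(bits, COLUMNS) if bit == "1"]

# The example mask selects exactly bug.type and developer-patch.
print(decode_chromosome("000000000100000010"))  # → ['bug.type', 'developer-patch']
```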
#### Dataset Summary
To recap, we work with four different datasets derived from *HotBugs.jar*, each progressively enriched or augmented for analysis.
#### Original Dataset
- **Size:** 88 rows
- **Categories:** 9
- **Features:** a few
- **Description:** The base dataset collected from *HotBugs.jar*.
#### Enriched Dataset
- **Size:** 88 rows
- **Categories:** 17
- **Features:** 18
- **Description:** Extended feature set incorporating hotfix metadata, bug details, and code commit diffs.
#### Balanced Augmented Dataset (RQ1)
- **Size:** 155 rows
- **Categories:** 17
- **Features:** 18
- **Description:** Two-stage augmentation ensuring each cluster had at least three records. Used primarily for RQ1 analysis.
#### Training Accuracy Augmented Dataset
- **Size:** 155 + (50 × 17) = 1,005 rows
- **Categories:** 17
- **Features:** 18
- **Description:** Added 50 records per category (post-optimization) to test whether augmentation improves model training and evaluation.
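The dataset sizes above follow from simple arithmetic; a quick sanity check (hypothetical variable names, not from the artefact):

```python
# Sanity-check the dataset sizes stated above.
balanced = 155                 # balanced augmented dataset (RQ1)
per_category, categories = 50, 17

# Training-accuracy augmented dataset: 50 extra records per category.
training = balanced + per_category * categories
print(training)  # → 1005
```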
---
## 3️⃣ Bug Taxonomy
Below you can find the bug taxonomy, its cluster descriptions, and the data sources for clusters 1–17.
### 📊 Bug Taxonomy (Clusters 1–17)

| # | Description | Hotfix | Centroids | Le Chat | PatchCat |
|----|-------------|--------|-----------|---------|----------|
| 1 | Test suite, tests, test folder | ✔ | ✔ | | |
| 2 | Crash or Hang | ✔ | ✔ | | |
| 3 | Missing Code or Components, or Incomplete Implementation | ✔ | ✔ | ✔ | 14,3 |
| 4 | Start, Access, or Availability of Service Issues | ✔ | ✔ | | |
| 5 | Security Vulnerability or Permission Issues | ✔ | ✔ | | |
| 6 | Configuration Dependency, Versioning or Deprecation Issues | ✔ | ✔ | ✔ | |
| 7 | Configuration Build or CI Failures | ✔ | ✔ | ✔ | |
| 8 | Buggy Configuration or Broken Config Files | ✔ | ✔ | | |
| 9 | Database | ✔ | ✔ | ✔ | |
| 10 | API / Parsing / Syntax errors | ✔ | ✔ | | |
| 11 | Exceptions, Error Handling, or Missing Checks | ✔✔ | ✔ | | |
| 12 | Unsupported, Undefined or unspecified behaviour | ✔ | ✔ | ✔ | 18 |
| 13 | Network | ✔ | ✔ | | |
| 14 | Performance | ✔ | ✔ | ✔ | 11,14 |
| 15 | Permission Deprecation, Access Control or Policy Issues | ✔ | ✔ | ✔ | |
| 16 | Functionality issue (Logical Bugs) | ✔✔✔ | ✔ | | |
| **17** | Concurrency or Race Conditions | ✔ | ✔ | ✔ | 11 |
Further, the cluster centroids are:
### 🧪 Clusters 1–17 Centroids Setup

| # | Description |
|----|-------------|
| 1 | Test suite, tests, test folder, flaky or unstable test; broken/incomplete test code; source code changes without updated tests; build/CI disruptions; misconfiguration or permission issues |
| 2 | Crash or hang, fatal error, failure, freeze |
| 3 | Missing code, function, or components; incomplete implementation |
| 4 | Service failed to start or access; service availability issues |
| 5 | Potential harm, threats, or security vulnerabilities; permission/privilege issues |
| 6 | Configuration dependency or versioning issues; deprecation-related configuration problems |
| 7 | Configuration build or CI failures |
| 8 | Buggy configuration or broken configuration files |
| 9 | Database query, schema, or selection issues |
| 10 | API / parsing / syntax errors |
| 11 | Thrown exception; error handling or missing checks (e.g., null checks, guards) |
| 12 | Unsupported, undefined, or unspecified behaviour |
| 13 | Network-related issues |
| 14 | Performance problems |
| 15 | Permission deprecation, access control, or policy issues |
| 16 | Functionality issues; wrong functionality; logical bugs |
| **17** | Concurrency or race conditions; deadlocks |
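To illustrate how a centroid-based assignment of records to clusters could work, here is a minimal bag-of-words cosine-similarity sketch. This is *not* the artefact's `kmeans_sim.py`: the function names are hypothetical, and only three of the centroid descriptions above are used:

```python
# Minimal sketch: assign a record to the nearest cluster centroid by
# cosine similarity over bag-of-words term counts.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# Three of the centroid descriptions above, keyed by cluster number.
CENTROIDS = {
    2: "crash or hang fatal error failure freeze",
    13: "network related issues",
    14: "performance problems",
}

def assign_cluster(record: str) -> int:
    """Return the cluster whose centroid is most similar to the record."""
    vec = Counter(record.lower().split())
    return max(CENTROIDS, key=lambda c: cosine(vec, Counter(CENTROIDS[c].split())))

print(assign_cluster("fatal crash on startup"))  # → 2
```

In practice, one would use all 17 centroids and a richer text representation (e.g., TF-IDF), but the nearest-centroid principle is the same.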
All data is located in the `Dataset` folder.
## 4️⃣ Installation Instructions
```
sudo apt update
sudo apt install python3.10-venv python3.10-distutils python3-pip
pip3 install -r requirements.txt
python3 -m nltk.downloader punkt
python3 -m nltk.downloader punkt_tab
```
Install perf:
```
sudo apt-get update
sudo apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)
```
Install Ollama:
```
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2
```
See here: https://ollama.com/download/linux and https://ollama.com/library/llama3.1
Then, clone the clustering algorithm we work with:
```
cd clustering
git clone https://github.com/rashadulrakib/short-text-clustering-enhancement.git
cp short-text-clustering-enhancement/*.py .
cp short-text-clustering-enhancement/stopWords.txt .
```