Published January 23, 2023 | Version 1.001
Dataset Open

Technical Debt Classification in Issue Trackers using Natural Language Processing based on Transformers

  • 1. University of Oslo, Norway
  • 2. Norwegian Computing Center

Description

To ensure transparency and reproducibility, we have made the code, models, datasets, and supporting material publicly available here. The README.md file explains each file used in the paper and its purpose.

Background: Technical Debt (TD) needs to be controlled and tracked during software development. Support for automatically tracking TD in issue trackers is limited.

Aim: We explore the use of a large dataset of developer-labeled TD issues, in combination with cutting-edge Natural Language Processing (NLP) approaches, to automatically classify TD in issue trackers.

Method: We mine and analyze more than 160 GB of textual data from GitHub projects, collecting over 55,600 TD issues and consolidating them into a large dataset (the GTD dataset). We use this dataset to train and test Transformer ML models. We then evaluate the models' generalization ability on six unseen projects. Finally, we re-train the models with part of the TD issues from the target project included, to test their adaptability.
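The two evaluation settings in the Method (unseen project versus adaptation to the target project) can be illustrated with a small data-splitting sketch. This is not the authors' code: the `Issue` tuple layout, the function name `split_for_adaptation`, and the `adapt_fraction` default are assumptions made for illustration only; the actual procedure is documented in README.md.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical record layout: (project, issue_text, label), label 1 = TD, 0 = non-TD.
Issue = Tuple[str, str, int]


def split_for_adaptation(
    issues: List[Issue],
    target_project: str,
    adapt_fraction: float = 0.2,  # assumed value, not taken from the paper
    seed: int = 0,
) -> Dict[str, List[Issue]]:
    """Build training and test sets for one target project.

    'train_unseen'  : all issues from other projects (generalization setting).
    'train_adapted' : the same issues plus a random share of the target
                      project's issues (adaptation setting).
    'test'          : the remaining target-project issues, never used in training.
    """
    source = [i for i in issues if i[0] != target_project]
    target = [i for i in issues if i[0] == target_project]

    # Shuffle a copy of the target issues reproducibly before splitting.
    rng = random.Random(seed)
    rng.shuffle(target)

    n_adapt = int(len(target) * adapt_fraction)
    adapt, test = target[:n_adapt], target[n_adapt:]

    return {
        "train_unseen": source,
        "train_adapted": source + adapt,
        "test": test,
    }
```

Evaluating both settings on the same held-out `test` split keeps the comparison fair, since the adapted model never sees the issues it is scored on.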

Results and Conclusion: (i) We create and release the GTD dataset, a comprehensive dataset of TD issues from 6,401 public repositories with various contexts; (ii) Transformers trained on the GTD dataset achieve promising performance; (iii) our results are a significant step towards supporting the automatic classification of TD in issue trackers, especially when the models are adapted to the context of unseen projects through fine-tuning.

Files

README.md

Files (41.9 GB)

Checksum                               Size
md5:04b567b8dd2a195a263b410cbb7e46ad   25.1 kB
md5:c05a7fb502770246d20e03c37743cc93   41.9 GB
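The md5 checksums listed above can be used to confirm that a download completed intact, which matters for a 41.9 GB archive. The following is a minimal sketch using only the Python standard library; the helper names and the file path in the usage note are placeholders, not names taken from this record.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that very large files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected: str) -> bool:
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum("downloaded_file", "md5:c05a7fb502770246d20e03c37743cc93")` would return `True` only if the local file (path is a placeholder) matches the checksum of the larger file in this record.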