Technical Debt identification in Issue Trackers using Natural Language Processing based on Transformers
To ensure transparency and reproducibility, we have made everything publicly available here, including the code, models, datasets and more. The README.md file explains every file used in this paper and what it does.
Background: Technical Debt (TD) needs to be controlled and tracked during software development. Current support, such as static analysis tools and even ML-based automatic tagging, is still ineffective, especially for context-dependent TD.
Aim: We study the usage of a large TD dataset in combination with cutting-edge Natural Language Processing (NLP) approaches to classify TD automatically in issue trackers, allowing the identification and tracking of informal TD conversations.
Method: We mine and analyse more than 160GB of textual data from GitHub projects, collecting over 55,600 TD issues and consolidating them into a large dataset (GTD-dataset). We then use our dataset to train state-of-the-art Transformer ML models, before performing a quantitative case study on three projects and evaluating the performance metrics during inference. Additionally, we study the adaptation of our model to classify context-dependent TD in an unseen project, by retraining the model including different percentages of the TD issues in the target project.
Results: (i) We provide the GTD-dataset, the most comprehensive dataset of TD issues to date, including issues from 6,401 unique public repositories with various contexts;
(ii) By training state-of-the-art Transformers using the GTD-dataset, we achieve performance metrics that outperform previous approaches;
(iii) We show that our model can provide a relatively reliable tool for automatically classifying TD in issue trackers, especially when it is adapted to an unseen project by including a small portion of that project's TD issues in the training data.
Conclusion: Our results indicate that we have taken significant steps towards closing the gap to practical, semi-automatic tracking of TD issues in issue trackers.
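The adaptation step described under Method (retraining with different percentages of the target project's TD issues) can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `split_for_adaptation`, the seed, and the example fractions are assumptions made here for illustration, and the actual retraining of the Transformer model is left out.

```python
import random
from typing import List, Sequence, Tuple


def split_for_adaptation(
    td_issues: Sequence[str], fraction: float, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split a target project's TD issues into an adaptation part and a held-out part.

    Hypothetical helper: `fraction` of the issues would be added to the
    training data before retraining, and the rest kept for evaluation.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    shuffled = list(td_issues)
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Placeholder issue texts; real inputs would come from the unseen project.
    issues = [f"issue text {i}" for i in range(10)]
    for fraction in (0.0, 0.1, 0.5):  # assumed percentages, not the paper's
        adapt, held_out = split_for_adaptation(issues, fraction)
        print(fraction, len(adapt), len(held_out))
```

Keeping the held-out part disjoint from the adaptation part means the evaluation still measures performance on issues the retrained model has not seen.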