This project consists of two independent processes: "Dataset Creation" and "Technical Debt (TD) Model training and testing".
Dataset Creation: This folder contains the code and raw CSV files required to create all the Technical Debt (TD) datasets used in this study. It is divided into several subfolders; each subfolder and its files are responsible for creating one or more of the datasets described below.
Step1 - GH archrive_mining: This subfolder contains the two Python scripts required for mining "Issue event" records from the GitHub Archive (GH Archive). If desired, practitioners may select a custom date interval for the issue events by changing the date fields in these scripts.
GH_all_issue.py
: This script extracts all "Issue event" records in a specified date range from GH Archive.
GHarchive_debt.py
: This script extracts TD issues only, by keeping "Issue event" records whose payload contains the keyword "debt".
all_issue_event: This folder contains all the "Issue event" records mined from GH Archive, stored as multiple CSV files covering 1 January 2021 to 21 July 2022. The script used to mine this data is GH_all_issue.py, which can be found in the folder "Step1 - GH archrive_mining".
TD_dataset_events: This folder contains all the mined "Issue event" records with the keyword "debt" (case-insensitive) in their GH Archive payload, stored as CSV files covering 1 January 2015 to 21 July 2022 (more than 7.5 years). The GHarchive_debt.py script is used to mine this data and can be found in the folder "Step1 - GH archrive_mining".
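For orientation, the mining idea behind these two scripts can be sketched as follows (this is a minimal illustration, not the scripts' exact code; the GH Archive URL format is real, but the output columns are assumptions):

```python
import gzip
import json
import urllib.request

import pandas as pd

# GH Archive publishes one gzipped JSON-lines file per hour, e.g.
# https://data.gharchive.org/2021-01-01-0.json.gz
URL = "https://data.gharchive.org/2021-01-01-0.json.gz"  # one hour of events

rows = []
with urllib.request.urlopen(URL) as resp, gzip.open(resp, "rt", encoding="utf-8") as fh:
    for line in fh:
        event = json.loads(line)
        if event.get("type") != "IssuesEvent":
            continue  # keep only issue events
        payload = json.dumps(event.get("payload", {}))
        if "debt" in payload.lower():  # keyword filter, as in the "debt" mining script
            rows.append({"repo": event["repo"]["name"],
                         "created_at": event["created_at"],
                         "payload": payload})

# Column names here are illustrative, not the scripts' exact schema
pd.DataFrame(rows).to_csv("debt_issue_events_2021-01-01-0.csv", index=False)
```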
TD_Dataset_creation: In this folder you will find all the important code and CSV files used to create the "Main Ground TD dataset". Most of the exploratory data analysis and cleaning is also included in this folder. The main code files are as follows:
TD_dataset_creation.ipynb
: This Notebook reads all the "Issue event" CSV files with the keyword "debt" from the "TD_dataset_events" folder and concatenates them into one large CSV file. The TD issues in this dataset are selected based on their "Label" field containing the keyword "debt" in the payload, which was mined using the script GHarchive_debt.py.
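A rough illustration of this concatenation and label filtering (the "labels" column name and the output file name are assumptions, not the notebook's exact schema):

```python
import glob

import pandas as pd

# Read every per-day CSV of "debt" issue events and stack them into one frame
files = glob.glob("TD_dataset_events/*.csv")
events = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Keep issues whose label field mentions "debt" (case-insensitive)
td_issues = events[events["labels"].str.contains("debt", case=False, na=False)]
td_issues.to_csv("main_TD_dataset_raw.csv", index=False)  # illustrative output name
```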
Extract_issues_by_type.ipynb
: This Notebook reads all the other "Issue event" CSV files from the folder "all_issue_event", which were mined using the script GH_all_issue.py. It then concatenates all non-TD issues with label types such as "bug", "improvement", "question" and "documentation" (excluding "debt") to create the non-TD collection of our dataset.
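A comparable sketch of the non-TD selection, again with an assumed "labels" column and an assumed label list:

```python
import glob

import pandas as pd

NON_TD_LABELS = ["bug", "improvement", "question", "documentation"]  # assumed list

issues = pd.concat(
    (pd.read_csv(f) for f in glob.glob("all_issue_event/*.csv")), ignore_index=True
)
labels = issues["labels"].fillna("").str.lower()

# Keep common non-TD label types and explicitly exclude anything mentioning "debt"
keep = labels.apply(lambda s: any(l in s for l in NON_TD_LABELS)) & ~labels.str.contains("debt")
issues[keep].to_csv("non_TD_collection.csv", index=False)  # illustrative output name
```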
TD_Data_combining.ipynb
: This Notebook combines the TD and non-TD collections of CSV files to create the TD dataset used for binary classification of TD and non-TD issues.
TD_Dataset_EDA_and_filter_Repos.ipynb
: This Notebook serves two purposes. The first half performs exploratory data analysis, data cleaning and information extraction, including the number of unique repositories (repos) and the top 25 repos from which TD issues were collected during data mining. It also displays other metadata about the repos and randomly samples non-TD issues to roughly match the number of TD issues for the final TD datasets. The second part extracts and filters out all the TD issues from the "microsoft/vscode", "apache/trafficcontrol", "UBC-Thunderbots/Software", "owncloud/core" and "department-of-veterans-affairs/va.gov-team" repositories, so that datasets can be created to classify TD and non-TD issues for these specific projects.
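The balancing and per-repository filtering might look roughly like the sketch below (the "repo" and "label" column names are assumptions; of the file names, only omitted_vscode_TD_dataset.csv is taken from this repository):

```python
import pandas as pd

data = pd.read_csv("TD_dataset.csv")  # combined TD / non-TD dataset

# Sample roughly as many non-TD issues as there are TD issues
td = data[data["label"] == 1]
non_td = data[data["label"] == 0].sample(n=len(td), random_state=42)
balanced = pd.concat([td, non_td]).sample(frac=1, random_state=42)

# Hold out one project's TD issues to build a project-specific dataset
target = "microsoft/vscode"
balanced[balanced["repo"] == target].to_csv("vscode_TD_issues.csv", index=False)
balanced[balanced["repo"] != target].to_csv("omitted_vscode_TD_dataset.csv", index=False)
```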
TD_Dataset_statistics.ipynb
: This Notebook reports all the important TD dataset statistics, such as the number of TD and non-TD issues, the mean and median text lengths and other summary statistics. It also includes word clouds for TD and non-TD issues and an N-gram analysis of the dataset before splitting.
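The statistics notebooks in this and the other dataset folders follow the same recipe; a compact sketch, assuming "text" and "label" columns and the wordcloud and scikit-learn packages, is:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

df = pd.read_csv("TD_dataset.csv")

print(df["label"].value_counts())                    # number of TD vs non-TD issues
print(df["text"].str.len().agg(["mean", "median"]))  # text-length statistics

# Word cloud for the TD class (repeat with label == 0 for non-TD issues)
td_text = " ".join(df.loc[df["label"] == 1, "text"].astype(str))
WordCloud(width=800, height=400).generate(td_text).to_file("td_wordcloud.png")

# Top bigrams over the whole dataset, before any splitting
vec = CountVectorizer(ngram_range=(2, 2), stop_words="english", max_features=20)
counts = vec.fit_transform(df["text"].astype(str)).sum(axis=0).tolist()[0]
print(sorted(zip(vec.get_feature_names_out(), counts), key=lambda x: -x[1]))
```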
Chromium_Dataset: This folder contains all the data and code required for pre-processing the Chromium dataset, together with the code for its dataset statistics. We obtained the data from https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=549541.
Chromium_data_processing.ipynb
: This Notebook cleans the Chromium TD dataset (including lowercasing text, removing text in square brackets, links, punctuation and words containing numbers).
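A minimal sketch of these cleaning steps (the same steps are reused for the JIRA dataset below; the exact regular expressions in the notebooks may differ):

```python
import re

import pandas as pd

def clean_text(text: str) -> str:
    text = str(text).lower()                            # lowercase
    text = re.sub(r"\[.*?\]", " ", text)                # remove text in square brackets
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove links
    text = re.sub(r"[^\w\s]", " ", text)                # remove punctuation
    text = re.sub(r"\w*\d\w*", " ", text)               # remove words containing numbers
    return re.sub(r"\s+", " ", text).strip()

chromium = pd.read_csv("chromium_dataset.csv")          # file/column names are assumptions
chromium["text"] = chromium["text"].apply(clean_text)
```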
Chromium_Dataset_statistics.ipynb
: This Notebook reports important statistics about the Chromium TD dataset, such as the number of TD and non-TD issues, the mean and median text lengths and other summary statistics. It also includes word clouds for TD and non-TD issues and an N-gram analysis of the dataset before splitting.
Jira_dataset: This folder contains all the data and code required to extract the TD and non-TD issues from the public JIRA repository. The JIRA data was extracted from the MongoDB data dump available on Zenodo at https://zenodo.org/record/5901956.
Data_extract_mongodb.ipynb
: This Notebook extracts all TD and non-TD issues from the MongoDB data dump, which we loaded into a local MongoDB instance (~60 GB). Using the MongoDB schema we identified projects with TD issues and extracted around 375 TD issues across four collections, namely "Apache", "MongoDB", "Sonatype" and "JiraEcosystem". Alongside these TD issues we randomly sampled approximately an equal number of non-TD issues across the four projects and stored all the data in an H5 file called jira_all.h5.
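A sketch of this extraction pattern, assuming a local MongoDB instance and illustrative database, collection and field names (the dump's real schema may differ):

```python
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["jira_dump"]  # assumed database name

frames = []
for collection in ["Apache", "MongoDB", "Sonatype", "JiraEcosystem"]:
    # Project only the fields needed for classification (field names are assumptions)
    docs = db[collection].find(
        {}, {"_id": 0, "fields.summary": 1, "fields.description": 1, "fields.labels": 1}
    )
    frames.append(pd.json_normalize(list(docs)))

jira_all = pd.concat(frames, ignore_index=True)
jira_all.to_hdf("jira_all.h5", key="issues", mode="w")  # stored as an H5 file, as above
```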
Jira_data_processing.ipynb
: This Notebook extracts the TD and non-TD issues from the jira_all.h5 file and performs the data pre-processing and cleaning steps (lowercasing text, removing text in square brackets, links, punctuation and words containing numbers) to create the final JIRA TD datasets.
Jira_Dataset_stastistics.ipynb
: This Notebook reports the important JIRA TD dataset statistics, such as the number of TD and non-TD issues, the mean and median text lengths and other summary statistics. It also includes word clouds for TD and non-TD issues and an N-gram analysis of the dataset before splitting.
VS_code_dataset: This folder contains all the data and code required to build the VS Code TD and non-TD issue dataset extracted from the "microsoft/vscode" public repository, which we mined and filtered using our GH Archive mining scripts.
VS_Code_dataset_creation.ipynb
: This file is used to extract non-TD issues from the "microsoft/vscode" repository using the CSV file that we mined and stored it in the folder "all_issue_event". With this we combine the VS Code TD issues that we extracted and filtered from the main TD dataset, using the file TD_Dataset_EDA_VS_code_filter.ipynb
, in order to create a dataset that contains both TD- and non-TD issues for the VS Code dataset. The filtered main TD dataset, which we removed "microsoft/vscode" TD issues from, is also stored as omitted_vscode_TD_dataset.csv
and used to both train and test the VS Code TD dataset.
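The project-specific dataset construction used here, and for the other repositories below, can be sketched as follows; column names and most file names are illustrative assumptions:

```python
import glob

import pandas as pd

# Non-TD issues mined for the target project
all_events = pd.concat(
    (pd.read_csv(f) for f in glob.glob("all_issue_event/*.csv")), ignore_index=True
)
vscode_non_td = all_events[all_events["repo"] == "microsoft/vscode"].assign(label=0)

# TD issues for the same project, previously filtered out of the main TD dataset
vscode_td = pd.read_csv("vscode_TD_issues.csv").assign(label=1)

# Combined project dataset with both classes
pd.concat([vscode_td, vscode_non_td], ignore_index=True).to_csv(
    "vscode_TD_dataset.csv", index=False
)
```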
TD_Dataset_No_VScode_statistics.ipynb
: This Notebook reports the important TD dataset statistics after removing the VS Code project, such as the number of TD and non-TD issues, the mean and median text lengths and other summary statistics. It also includes word clouds for TD and non-TD issues and an N-gram analysis of the dataset before splitting.
VSCODE_Dataset_statistics.ipynb
: This Notebook reports the important VS Code TD dataset statistics, such as the number of TD and non-TD issues, the mean and median text lengths and other summary statistics. It also includes word clouds for TD and non-TD issues and an N-gram analysis of the dataset before splitting.
va_gov_dataset: This folder contains all the data and code required to build the VA Gov TD and non-TD issue dataset extracted from the "department-of-veterans-affairs/va.gov-team" public repository, which we mined and filtered using our GH Archive mining scripts.
va_gov_dataset_creation.ipynb
: This file is used to extract non-TD issues from the "department-of-veterans-affairs/va.gov-team" repository using the CSV file that we mined and stored it in the folder "all_issue_event". With this we combine the VA_Gov TD issues that we extracted and filtered from the main TD dataset, using the file TD_Dataset_EDA_and_filter_Repos.ipynb
, in order to create a dataset that contains both TD- and non-TD issues for the VA Gov dataset ('va_gov_debt_TD_dataset.csv'). The filtered main TD dataset which we removed "department-of-veterans-affairs/va.gov-team" TD issues from, is also stored as omit_va_gov_TD_dataset.csv
and used to both train and test the VA Gov TD dataset.apache_trafficcontrol_dataset : This folder contains all the data and code required to build the extracted Apache trafficcontrol TD- and non-TD issues from the "apache/trafficcontrol" public repository, which we mined and filtered using our GH Archive mining script.
apache_traffic_dataset_creation.ipynb
: This file is used to extract non-TD issues from the "apache/trafficcontrol" repository using the CSV file that we mined and stored it in the folder "all_issue_event". With this we combine the Apache trafficcontrol TD issues that we extracted and filtered from the main TD dataset, using the file TD_Dataset_EDA_and_filter_Repos.ipynb
, in order to create a dataset that contains both TD- and non-TD issues for the Apache trafficcontrol dataset ('apache_traffic_TD_dataset.csv'). The filtered main TD dataset which we removed "apache/trafficcontrol" TD issues from, is also stored as omit_apache_traffic_TD_dataset.csv
and used to both train and test the Apache trafficcontrol TD dataset.owncloud_dataset : This folder contains all the data and code required to build the extracted Owncloud TD- and non-TD issues from the "owncloud/core" public repository, which we mined and filtered using our GH Archive mining script.
owncloud_dataset_creation.ipynb
: This file is used to extract non-TD issues from the "owncloud/core" repository using the CSV file that we mined and stored it in the folder "all_issue_event". With this we combine the Owncloud TD issues that we extracted and filtered from the main TD dataset, using the file TD_Dataset_EDA_and_filter_Repos.ipynb
, in order to create a dataset that contains both TD- and non-TD issues for the Owncloud dataset ('owncloud_TD_dataset.csv'). The filtered main TD dataset which we removed "owncloud/core" TD issues from, is also stored as omit_owncloud_TD_dataset.csv
and used to both train and test the Owncloud TD dataset.UBC_thunder_dataset : This folder contains all the data and code required to build the extracted UBC-Thunderbots TD- and non-TD issues from the "UBC-Thunderbots/Software" public repository, which we mined and filtered using our GH Archive mining script.
UBC_thunder_dataset_creation.ipynb
: This file is used to extract non-TD issues from the "UBC-Thunderbots/Software" repository using the CSV file that we mined and stored it in the folder "all_issue_event". With this we combine the Owncloud TD issues that we extracted and filtered from the main TD dataset, using the file TD_Dataset_EDA_and_filter_Repos.ipynb
, in order to create a dataset that contains both TD- and non-TD issues for the UBC_thunder dataset ('UBC_thunder_TD_dataset.csv'). The filtered main TD dataset which we removed "UBC-Thunderbots/Software" TD issues from, is also stored as omit_ubc_thunder_TD_dataset.csv
and used to both train and test the UBC_thunder TD dataset.TD_model : This folder contains all the required code to train and test the model performance, as well as the metadata required to replicate these experiments. We have saved the dataset used in each experiment and model in a set of binary files format, all the notebooks used in the model training and testing inference have been seeded for reproducibility.
TD_Dataset_train: This folder contains all the files required to fine-tune a Pre-trained Language Model (PLM), in our case the DeBERTaV3 transformer model, on the main TD dataset, and to run inference both on the test split of the main TD dataset and on entirely different datasets (such as JIRA and Chromium). This allows the generalisability and adaptability of our model to different projects and to TD discussions on other platforms to be explored.
Train_TD_derberta.ipynb
: This notebook contains the code to fine-tune the DeBERTa PLM for text classification of TD and non-TD issues using the main TD dataset. We split this dataset 85/15, using 85% for training and 15% for testing, and applied a grouped K-fold cross-validation technique with 3 folds for all training runs. In each fold a different subsection of the 85% train split is used for training and validation, and the model performance metrics are logged per fold. The resulting data, such as training loss, validation loss, validation accuracy and validation F1-score, is automatically stored in Weights & Biases and made publicly available via the link associated with each run. The trained model is then used for inference on the 15% test split of the main TD dataset; these performance metrics are reported both in this notebook and in the paper.
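A condensed sketch of this setup is shown below. The checkpoint name is the public DeBERTaV3 base model, while the column names ("text", "label", "repo"), the grouping key and the hyperparameters are assumptions rather than the notebook's exact values:

```python
import pandas as pd
from datasets import Dataset
from sklearn.model_selection import GroupKFold, train_test_split
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

df = pd.read_csv("TD_dataset.csv")
train_df, test_df = train_test_split(df, test_size=0.15, random_state=42)  # 85/15 split

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

gkf = GroupKFold(n_splits=3)  # grouped 3-fold cross-validation on the 85% train split
for fold, (tr_idx, va_idx) in enumerate(gkf.split(train_df, groups=train_df["repo"])):
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/deberta-v3-base", num_labels=2
    )
    train_ds = Dataset.from_pandas(train_df.iloc[tr_idx]).map(tokenize, batched=True)
    val_ds = Dataset.from_pandas(train_df.iloc[va_idx]).map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir=f"td_deberta_fold{fold}",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        evaluation_strategy="epoch",
        report_to="wandb",  # fold-level metrics are logged to Weights & Biases
        seed=42,
    )
    Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds).train()
```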
Inference_on_Chromium_dataset.ipynb
: The trained model from the Train_TD_derberta.ipynb Notebook is used for inference testing on the Chromium dataset. This makes it possible to test the generalisability and adaptability of our model to different projects and to TD discussions on other platforms.
Inference_on_Jira_dataset.ipynb
: The trained model from the Train_TD_derberta.ipynb Notebook is used for inference testing on the JIRA dataset. This makes it possible to test the generalisability and adaptability of our model to different projects and to TD discussions on other platforms.
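Both inference notebooks follow the same pattern; a minimal sketch (the checkpoint path, file names and column names are assumptions) is:

```python
import numpy as np
import pandas as pd
from datasets import Dataset
from sklearn.metrics import classification_report
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForSequenceClassification.from_pretrained("td_deberta_fold0")  # a saved checkpoint

jira = pd.read_csv("jira_dataset.csv")
ds = Dataset.from_pandas(jira).map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256),
    batched=True,
)

# Predict on the unseen dataset and report precision/recall/F1
preds = Trainer(model=model).predict(ds)
y_pred = np.argmax(preds.predictions, axis=-1)
print(classification_report(jira["label"], y_pred))
```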
Chromium15: This folder contains the code, saved model, other metadata and results from fine-tuning the model on 85% of the TD dataset plus 15% of the Chromium TD dataset. We tested the generalisability and adaptability of our model by adding 15% of randomly sampled Chromium issues to 85% of the TD dataset, then running inference on the remaining 85% of the Chromium dataset.
Chromium15.ipynb
: This notebook fine-tunes the DeBERTaV3 PLM on 85% of the TD dataset plus 15% of the Chromium dataset for training, then runs inference on the remaining 85% of the Chromium issues.
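The same percentage-mixing recipe recurs in the Chromium, JIRA and VS Code folders below; a minimal sketch of the split, with assumed file and column names, is:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

td_train = pd.read_csv("td_train_85.csv")        # the 85% training split of the main TD dataset
chromium = pd.read_csv("chromium_dataset.csv")   # the target dataset

# Move 15% of the target dataset into training; keep the remaining 85% for inference
chromium_seed, chromium_heldout = train_test_split(chromium, train_size=0.15, random_state=42)
mixed_train = pd.concat([td_train, chromium_seed], ignore_index=True)

mixed_train.to_csv("train_td85_plus_chromium15.csv", index=False)
chromium_heldout.to_csv("chromium_inference_85.csv", index=False)
```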
Chromium30: This folder contains the code, saved model, other metadata and results from fine-tuning the model on 85% of the TD dataset plus 30% of randomly sampled Chromium issues, with inference run on the remaining 70% of the Chromium dataset to test the generalisability and adaptability of our model.
Chromium30.ipynb
: This notebook fine-tunes the DeBERTaV3 PLM on 85% of the TD dataset plus 30% of the Chromium dataset for training, then runs inference on the remaining 70% of the Chromium issues.
Chromium50: This folder contains the code, saved model, other metadata and results from fine-tuning the model on 85% of the TD dataset plus 50% of randomly sampled Chromium issues, with inference run on the remaining 50% of the Chromium dataset to test the generalisability and adaptability of our model.
Chromium50.ipynb
: This notebook fine-tunes the DeBERTaV3 PLM on 85% of the TD dataset plus 50% of the Chromium dataset for training, then runs inference on the remaining 50% of the Chromium issues.
JIRA_15_percent: This folder contains the code, saved model, other metadata and results from fine-tuning the model on 85% of the TD dataset plus 15% of randomly sampled JIRA issues, with inference run on the remaining 85% of the JIRA dataset to test the generalisability and adaptability of our model.
jira_15.ipynb
: This notebook fine-tunes the DeBERTaV3 PLM on 85% of the TD dataset plus 15% of the JIRA dataset for training, then runs inference on the remaining 85% of the JIRA issues.
JIRA_30_percent: This folder contains the code, saved model, other metadata and results from fine-tuning the model on 85% of the TD dataset plus 30% of randomly sampled JIRA issues, with inference run on the remaining 70% of the JIRA dataset to test the generalisability and adaptability of our model.
jira_30.ipynb
: This notebook fine-tunes the DeBERTaV3 PLM on 85% of the TD dataset plus 30% of the JIRA dataset for training, then runs inference on the remaining 70% of the JIRA issues.
JIRA_50_percent: This folder contains the code, saved model, other metadata and results from fine-tuning the model on 85% of the TD dataset plus 50% of randomly sampled JIRA issues, with inference run on the remaining 50% of the JIRA dataset to test the generalisability and adaptability of our model.
jira_50.ipynb
: This notebook fine-tunes the DeBERTaV3 PLM on 85% of the TD dataset plus 50% of the JIRA dataset for training, then runs inference on the remaining 50% of the JIRA issues.
VS_code_project: This folder contains all the files required to fine-tune a Pre-trained Language Model (PLM), in our case the DeBERTaV3 transformer model, on the TD_no_vscode dataset (with the VS Code entries removed). This allows inference testing on the VS Code TD dataset to assess the generalisability and adaptability of our model to a specific project, in this case VS Code.
VS_code_TD.ipynb
: This Notebook trains on 100% of the TD_no_vscode dataset (VS Code entries removed), then runs inference on 100% of the VS Code TD dataset (an unseen dataset).
VS_CODE_15_TD: This folder contains the code, saved model, other metadata and results from fine-tuning the model on 85% of the "TD_no_vscode" dataset (VS Code entries removed) plus 15% of randomly sampled VS Code issues, with inference run on the remaining 85% of the VS Code dataset.
VS_code_Train_15TD.ipynb
: This notebook is used to fine-tune the DeBERTAv3 PLM with 85% "TD_no_vscode" + 15% VS Code dataset for training, then testing in inference the remaining 85% of the VS Code issues.VS_CODE_30_TD: This folder contains the code, saved model, other metadata and results after fine-tuning the model and training it with 85% "TD_no_vscode" (removed VS code entry) dataset and 30% of the VS Code dataset. We experimented and tested the generalisability and adaptability of our model after adding 30% of randomly sampled VS Code dataset issues to 85% of the "TD_no_vscode", then tested during inference the remaining 70% of the VS Code dataset.
VS_CODE_50_TD: This folder contains the code, saved model, other metadata and results from fine-tuning the model on 85% of the "TD_no_vscode" dataset (VS Code entries removed) plus 50% of randomly sampled VS Code issues, with inference run on the remaining 50% of the VS Code dataset.
VS_code_Train_50TD.ipynb
: This notebook is used to fine-tune the DeBERTAv3 PLM with 85% "TD_no_vscode" + 50% VS Code dataset for training, then testing in inference the remaining 50% of the VS Code issues.Different_organiation_size&type: This folder contains training and testing before and after fine-tuning of different custom created TD dataset based on differrent organisation size and type , VA_Gov is a government organisation and Apache trafficcontrol , Owncloud and UBC thunder software are from organisation of varying type and size ( small , medium and large organisation) and all these repository have different programming language and other github stats like number of stars , fork counts , watchers and so on. We used these diferent mix to test the generalisability of our Model before and after finetunning.
Apache_traffic_control: This folder contains two main notebooks and the Apache trafficcontrol datasets.
- "Apache_traffic_TD_output.ipynb": This Notebook trains on 100% of omit_apache_traffic_TD_dataset.csv (Apache trafficcontrol entries removed), then runs inference on 100% of the Apache trafficcontrol TD dataset apache_traffic_TD_dataset.csv (an unseen dataset).
- "Apache_traffic_30TD_output.ipynb": This notebook fine-tunes the DeBERTaV3 PLM on 85% of "omit_apache_traffic_TD_dataset.csv" (Apache trafficcontrol entries removed) plus 30% of the Apache trafficcontrol TD dataset "apache_traffic_TD_dataset.csv" for training, then runs inference on the remaining 70% of the Apache trafficcontrol TD issues.
owncloud_project: This folder contains two main notebooks and the Owncloud datasets.
- "Owncloud_TD_output.ipynb": This Notebook trains on 100% of omit_owncloud_TD_dataset.csv (Owncloud entries removed), then runs inference on 100% of the Owncloud TD dataset owncloud_TD_dataset.csv (an unseen dataset).
- "Owncloud_30TD_output.ipynb": This notebook fine-tunes the DeBERTaV3 PLM on 85% of "omit_owncloud_TD_dataset.csv" (Owncloud entries removed) plus 30% of the Owncloud TD dataset "owncloud_TD_dataset.csv" for training, then runs inference on the remaining 70% of the Owncloud TD issues.
UBC_thunder_project: This folder contains two main notebooks and the UBC_thunder datasets.
- "UBC_thunder_Train_TD_deberta.ipynb": This Notebook trains on 100% of omit_ubc_thunder_TD_dataset.csv (UBC_thunder entries removed), then runs inference on 100% of the UBC_thunder TD dataset ubc_thunder_TD_dataset.csv (an unseen dataset).
- "UBC_thunder_30TD_output.ipynb": This notebook fine-tunes the DeBERTaV3 PLM on 85% of "omit_ubc_thunder_TD_dataset.csv" (UBC_thunder entries removed) plus 30% of the UBC_thunder TD dataset "ubc_thunder_TD_dataset.csv" for training, then runs inference on the remaining 70% of the UBC_thunder TD issues.
Va_gov_project: This folder contains two main notebooks and the Va_gov datasets.
- "UBC_thunder_Train_TD_deberta.ipynb": This Notebook trains on 100% of omit_va_gov_TD_dataset.csv (Va_gov entries removed), then runs inference on 100% of the Va_gov TD dataset va_gov_TD_dataset.csv (an unseen dataset).
- "UBC_thunder_30TD_output.ipynb": This notebook fine-tunes the DeBERTaV3 PLM on 85% of "omit_va_gov_TD_dataset.csv" (Va_gov entries removed) plus 30% of the Va_gov TD dataset "va_gov_TD_dataset.csv" for training, then runs inference on the remaining 70% of the Va_gov TD issues.
All_requirements.txt
: This file lists the packages used throughout this study; you can adapt it to your needs and use it to install the required packages.
The code is licensed under MIT. The license is included in this repository and further information can be found at https://opensource.org/licenses/MIT.
The data is licensed under CC BY 4.0. The license is included in this repository and further information can be found at https://creativecommons.org/licenses/by/4.0/.