Published October 20, 2020 | Version 0.1
Dataset Open

Exploring Design Smells for Smell-Based Defect Prediction

  • 1. Ben-Gurion University of the Negev, Beer-Sheva, Israel
  • 2. Faculty of Engineering of University of Porto, Portugal


The archived file contains the datasets that support the conclusions of the article "Exploring Design Smells for Smell-Based Defect Prediction".

In this paper, we answer two research questions:

RQ1. Do Design code smells contribute to the performance of defect prediction models trained with Traditional code smells?

RQ2. How do the different categories of Design smells impact the performance of the defect prediction models?

Accordingly, after extracting the archive, you will find two sub-directories, named "RQ1" and "RQ2". They contain the results obtained for each research question and thus support our conclusions.

(You will also find a README.pdf file with these same instructions regarding the datasets.)

Inside "RQ1", you will find two directories, named "configuration_1" and "configuration_2", one for each experimental configuration. "configuration_1" contains the datasets with the results for the ten classifier configurations with the highest scores. "configuration_2" contains the datasets with the results for the classifier configuration with the best overall results: Support Vector Machine with C=0.1. Within each of these two directories, there are three sub-directories, named "designite", "designite_traditional", and "traditional", which hold the datasets for each of the smell sets considered in our study.

Inside "RQ2", you will find four directories, each corresponding to one category of design smells for the "designite_traditional" dataset. These datasets were built with the same configuration as "configuration_2".

Each of these smell-set and category directories, in turn, contains 97 sub-directories, one for each of the 97 projects analyzed in this study.
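
As a quick sanity check after extraction, the layout described above can be walked with a short script. This is a minimal sketch, not part of the archive: the extraction root passed to the functions and the function names themselves are hypothetical, and it assumes the directories are nested exactly as described. The four category names under "RQ2" are not listed here, so the script discovers them instead of hard-coding them.

```python
from pathlib import Path

# Directory names quoted from the description above.
RQ1_CONFIGURATIONS = ("configuration_1", "configuration_2")
SMELL_SETS = ("designite", "designite_traditional", "traditional")


def rq1_smell_set_dirs(root):
    """Return the six RQ1 smell-set directories expected under `root`."""
    root = Path(root)
    return [root / "RQ1" / cfg / smells
            for cfg in RQ1_CONFIGURATIONS
            for smells in SMELL_SETS]


def subdirectories(directory):
    """Return the immediate sub-directories of `directory`, sorted by name."""
    return sorted(p for p in Path(directory).iterdir() if p.is_dir())


def project_counts(root):
    """Map each smell-set or category directory to its number of project folders."""
    root = Path(root)
    counts = {}
    for d in rq1_smell_set_dirs(root):
        if d.is_dir():  # skip anything that is not present on disk
            counts[d.relative_to(root).as_posix()] = len(subdirectories(d))
    rq2 = root / "RQ2"
    if rq2.is_dir():
        # Category names are not given in the description, so list them.
        for d in subdirectories(rq2):
            counts[d.relative_to(root).as_posix()] = len(subdirectories(d))
    return counts
```

If the extracted archive matches the description, every value returned by `project_counts` should be 97.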

Every project folder has the same structure:

  • The "dataset" directory contains the original training and testing dataset used.
  • The "oversamples" directory contains the training dataset after oversampling for each of the feature selection approaches.
  • The "score_summary" directory contains all classifier configurations considered, not only the 10 with the highest scores.
  • The "scores.csv" file contains all the scores for the main classifier configurations studied in the particular experiment.
  • The "selected_features" directory contains the selected features' information and the selected features dataset for each feature selection method.
  • The "selected_testing_X" directory contains the testing datasets.
  • The "top_scores_summary" directory contains the classifier configurations and hyper-parameter scores for the 10 highest scores.
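
The structure listed above can be checked, and "scores.csv" loaded, with a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical, and "scores.csv" is assumed to be a comma-delimited file with a header row, which this description does not confirm. No column names are assumed; rows are returned keyed by whatever header the file provides.

```python
import csv
from pathlib import Path

# Entry names taken from the list above.
EXPECTED_DIRECTORIES = (
    "dataset", "oversamples", "score_summary",
    "selected_features", "selected_testing_X", "top_scores_summary",
)
EXPECTED_FILES = ("scores.csv",)


def missing_entries(project_dir):
    """Return the expected directory and file names absent from `project_dir`."""
    project_dir = Path(project_dir)
    missing = [n for n in EXPECTED_DIRECTORIES if not (project_dir / n).is_dir()]
    missing += [n for n in EXPECTED_FILES if not (project_dir / n).is_file()]
    return missing


def read_scores(project_dir):
    """Read scores.csv into a list of dicts keyed by the file's own header.

    Assumption: the file is comma-delimited and its first row is a header.
    """
    path = Path(project_dir) / "scores.csv"
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

An empty list from `missing_entries` indicates that a project folder contains every entry named above.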


Files (43.1 MB)
