10.5281/zenodo.3906023
https://zenodo.org/records/3906023
oai:zenodo.org:3906023
Stefano Dalla Palma
0000-0002-5611-0546
Tilburg University
Defect Prediction Tool Validation Dataset 1
Zenodo
2020
ansible
metrics
supervised models
2020-06-24
eng
10.5281/zenodo.3906022
https://zenodo.org/communities/eu
v1.0
Creative Commons Attribution 4.0 International
The dataset provides IaC-oriented, delta and process metrics extracted from open-source GitHub repositories based on the Ansible language, as well as the corresponding validated defect-prediction models.
Context
Infrastructure-as-code (IaC) is the DevOps strategy that allows management and provisioning of infrastructure through the definition of machine-readable files and automation around them, rather than physical hardware configuration or interactive configuration tools.
On the one hand, although IaC is an increasingly adopted practice, little is still known about how to best maintain, speedily evolve, and continuously improve the code behind the IaC strategy in a measurable fashion.
On the other hand, source code measurements are often computed and analyzed to evaluate the different quality aspects of the software developed.
In particular, Infrastructure-as-Code is simply "code" and, as such, is as prone to defects as code written in any other programming language.
This dataset targets the YAML-based Ansible language to devise **defect prediction** approaches for IaC based on machine learning.
Content
The dataset contains metrics extracted from 85 open-source GitHub repositories based on the Ansible language that satisfied the following criteria:
* The repository has at least one push event to its master branch in the last six months;
* The repository has at least 2 releases;
* At least 11% of the files in the repository are IaC scripts;
* The repository has at least 2 core contributors;
* The repository has evidence of continuous integration practice, such as the presence of a .travis.yml file;
* The repository has a comments ratio of at least 0.2%;
* The repository has a commit frequency of at least 2 per month on average;
* The repository has an issue frequency of at least 0.023 events per month on average;
* The repository has evidence of a license, such as the presence of a LICENSE.md file;
* The repository has at least 190 source lines of code.
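The selection criteria above can be read as a single conjunctive filter. The following is a minimal sketch of that reading, not the authors' actual mining code: the `RepoStats` class, its field names, and the approximation of "six months" as 183 days are all assumptions made for illustration. Percentages are expressed as fractions (11% as 0.11, 0.2% as 0.002).

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta


@dataclass(frozen=True)
class RepoStats:
    """Hypothetical per-repository summary; field names are illustrative only."""
    last_push_to_master: date
    releases: int
    iac_file_ratio: float          # fraction of files that are IaC scripts (0..1)
    core_contributors: int
    has_ci: bool                   # evidence of continuous integration
    comments_ratio: float          # fraction of comment lines (0..1)
    commits_per_month: float
    issue_events_per_month: float
    has_license: bool
    sloc: int                      # source lines of code


def meets_criteria(repo: RepoStats, today: date) -> bool:
    """Return True only if every listed selection criterion holds."""
    six_months_ago = today - timedelta(days=183)  # assumption: six months ~ 183 days
    return (
        repo.last_push_to_master >= six_months_ago
        and repo.releases >= 2
        and repo.iac_file_ratio >= 0.11
        and repo.core_contributors >= 2
        and repo.has_ci
        and repo.comments_ratio >= 0.002
        and repo.commits_per_month >= 2
        and repo.issue_events_per_month >= 0.023
        and repo.has_license
        and repo.sloc >= 190
    )


# Made-up example values, not taken from the dataset.
example = RepoStats(
    last_push_to_master=date(2020, 5, 1),
    releases=3,
    iac_file_ratio=0.40,
    core_contributors=4,
    has_ci=True,
    comments_ratio=0.05,
    commits_per_month=12.0,
    issue_events_per_month=1.5,
    has_license=True,
    sloc=2500,
)
too_few_releases = replace(example, releases=1)

print(meets_criteria(example, today=date(2020, 6, 24)))
print(meets_criteria(too_few_releases, today=date(2020, 6, 24)))
```

Because the criteria are combined with `and`, a repository that fails any single threshold is excluded, which is why the second example (one release only) is rejected.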
Metrics are grouped into three categories:
* IaC-oriented: metrics of structural properties derived from the source code of infrastructure scripts;
* Delta: metrics that capture the amount of change in a file between two successive releases, collected for each IaC-oriented metric;
* Process: metrics that capture aspects of the development process rather than aspects of the code itself. Descriptions of the process metrics in this dataset can be found [here](https://pydriller.readthedocs.io/en/latest/processmetrics.html).
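To illustrate the delta category, the sketch below computes one delta value per IaC-oriented metric for a file measured at two successive releases. It assumes the delta is the signed difference (later release minus earlier release); the dataset description does not state the exact formula, and the metric names and the `delta_` prefix are hypothetical.

```python
from typing import Dict


def delta_metrics(previous: Dict[str, float], current: Dict[str, float]) -> Dict[str, float]:
    """Signed change of each metric between two successive releases of one file."""
    # Assumption: only metrics measured in both releases are compared.
    shared = previous.keys() & current.keys()
    return {f"delta_{name}": current[name] - previous[name] for name in sorted(shared)}


# Made-up measurements of the same file at two successive releases.
release_1 = {"lines_code": 40, "num_tasks": 5}
release_2 = {"lines_code": 52, "num_tasks": 4}

print(delta_metrics(release_1, release_2))
```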
In addition to the metrics, the dataset contains the pre-trained models (one per repository) in the folder models.zip. The folder contains the following two files for each repository:
* a pickle file containing the pre-trained model. It has the following naming convention: <repoowner__reponame>.pkl
To load the model in Python, use the following:
```
from joblib import load

# Load the pre-trained model; mmap_mode='r' memory-maps any large arrays read-only
model = load('path/to/the/pickle/file.pkl', mmap_mode='r')
```
* a JSON file containing the attributes used to train the model. Those attributes are needed to use the corresponding model correctly. It has the following naming convention: <repoowner__reponame>.json. To load them, after loading the model as described above, use:
```
import json

# Read the attributes (features) the model was trained on
with open('path/to/the/json/file.json', 'r') as in_file:
    model_features = json.load(in_file)
```
You can then collect and pass those features to the model for the final prediction.
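A minimal sketch of that last step follows. It rests on two assumptions that are not confirmed by the description: that the JSON file holds a flat list of feature names, and that the loaded model exposes a scikit-learn-style `predict` method accepting a list of rows. The helper names, the metric names, and the `ThresholdModel` stand-in (used only so the example runs without a pickle file) are hypothetical.

```python
from typing import Any, Dict, List, Sequence


def build_feature_row(metrics: Dict[str, float], model_features: Sequence[str]) -> List[float]:
    """Order the collected metrics exactly as listed in model_features."""
    missing = [name for name in model_features if name not in metrics]
    if missing:
        raise KeyError(f"missing metrics for features: {missing}")
    # Metrics not listed in model_features are ignored.
    return [float(metrics[name]) for name in model_features]


def predict_defect(model: Any, metrics: Dict[str, float], model_features: Sequence[str]) -> Any:
    """Build a single-row input and return the model's prediction for it."""
    row = build_feature_row(metrics, model_features)
    return model.predict([row])[0]  # assumption: scikit-learn-style predict()


class ThresholdModel:
    """Stand-in for a loaded model; NOT one of the dataset's models."""

    def predict(self, rows: List[List[float]]) -> List[int]:
        return [int(row[0] > 100) for row in rows]


# Made-up feature list and metrics for one script.
features = ["lines_code", "num_tasks"]
script_metrics = {"num_tasks": 3, "lines_code": 150, "num_comments": 7}

print(predict_defect(ThresholdModel(), script_metrics, features))
```

With a real model, `ThresholdModel()` would be replaced by the object returned by `load(...)` and `features` by the content of the matching JSON file. Raising on missing metrics, rather than silently filling in a default, avoids feeding the model a row that does not match what it was trained on.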
Acknowledgements
Thanks to the open-source community :)
Inspiration
Which properties of the source code and of the development process are good predictors of defects in Infrastructure-as-Code scripts?
Note
Further information about the dataset and the validation results can be found at: https://github.com/stefanodallapalma/Defect-Prediction-IaC-TSE2020-replication-package
European Commission
10.13039/501100000780
825040
Rational decomposition and orchestration for serverless computing