Dataset Open Access
The dataset provides IaC-oriented, delta and process metrics extracted from open-source GitHub repositories based on the Ansible language, as well as the corresponding validated defect-prediction models.
Infrastructure-as-code (IaC) is the DevOps strategy that allows management and provisioning of infrastructure through the definition of machine-readable files and automation around them, rather than physical hardware configuration or interactive configuration tools.
On the one hand, although IaC is an increasingly widely adopted practice, little is still known about how best to maintain, speedily evolve, and continuously improve the code behind the IaC strategy in a measurable fashion.
On the other hand, source code measurements are often computed and analyzed to evaluate the different quality aspects of the software developed.
In particular, Infrastructure-as-Code is simply "code" and, as such, is as prone to defects as code written in any other programming language.
This dataset targets the YAML-based Ansible language to devise **defect prediction** approaches for IaC based on machine learning.
The dataset contains metrics extracted from 85 open-source GitHub repositories based on the Ansible language that satisfied the following criteria:
* The repository has at least one push event to its master branch in the last six months;
* The repository has at least 2 releases;
* At least 11% of the files in the repository are IaC scripts;
* The repository has at least 2 core contributors;
* The repository has evidence of continuous integration practice, such as the presence of a .travis.yml file;
* The repository has a comments ratio of at least 0.2%;
* The repository has a commit frequency of at least 2 per month on average;
* The repository has an issue frequency of at least 0.023 events per month on average;
* The repository has evidence of a license, such as the presence of a LICENSE.md file;
* The repository has at least 190 source lines of code.
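The selection criteria above can be read as a single boolean filter over per-repository statistics. The following is a minimal sketch of that reading, not the tooling used to build the dataset: the `RepoStats` class and all of its field names are hypothetical, and percentages are assumed to be stored as fractions (11% as 0.11, 0.2% as 0.002).

```python
from dataclasses import dataclass, replace


@dataclass
class RepoStats:
    """Hypothetical per-repository summary; field names are illustrative only."""
    months_since_last_push: float   # time since the last push to master
    releases: int
    iac_file_ratio: float           # fraction of files that are IaC scripts
    core_contributors: int
    has_ci: bool                    # e.g. a CI configuration file is present
    comment_ratio: float            # fraction of comment lines
    commits_per_month: float
    issue_events_per_month: float
    has_license: bool               # e.g. a LICENSE.md file is present
    source_lines: int


def meets_selection_criteria(repo: RepoStats) -> bool:
    """Apply the ten thresholds listed above; all of them must hold."""
    return (
        repo.months_since_last_push <= 6
        and repo.releases >= 2
        and repo.iac_file_ratio >= 0.11
        and repo.core_contributors >= 2
        and repo.has_ci
        and repo.comment_ratio >= 0.002
        and repo.commits_per_month >= 2
        and repo.issue_events_per_month >= 0.023
        and repo.has_license
        and repo.source_lines >= 190
    )


# Made-up example values, chosen only to exercise the filter.
example = RepoStats(
    months_since_last_push=1, releases=3, iac_file_ratio=0.4,
    core_contributors=2, has_ci=True, comment_ratio=0.05,
    commits_per_month=6, issue_events_per_month=1.5,
    has_license=True, source_lines=1200,
)
print(meets_selection_criteria(example))
print(meets_selection_criteria(replace(example, releases=1)))
```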
Metrics are grouped into three categories:
* IaC-oriented: metrics of structural properties derived from the source code of infrastructure scripts;
* Delta: metrics that capture the amount of change in a file between two successive releases, collected for each IaC-oriented metric;
* Process: metrics that capture aspects of the development process rather than aspects of the code itself. A description of the process metrics in this dataset can be found [here](https://pydriller.readthedocs.io/en/latest/processmetrics.html).
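To illustrate the delta category, the sketch below derives one delta per IaC-oriented metric for each pair of successive releases of a file. It assumes that the "amount of change" is the plain difference between the newer and the older value; the function name, the `delta_` prefix, and the metric names are hypothetical and may not match the column names in the dataset.

```python
from typing import Dict, List


def delta_metrics(releases: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Return one dict of deltas per pair of successive releases.

    `releases` holds the IaC-oriented metrics of a single file, ordered from
    the oldest release to the newest. Assumption: delta = newer - older.
    """
    deltas = []
    for previous, current in zip(releases, releases[1:]):
        deltas.append({
            f"delta_{name}": current[name] - previous[name]
            for name in current
            if name in previous  # skip metrics missing from the older release
        })
    return deltas


# Made-up metric values for three successive releases of one script.
history = [
    {"lines_code": 10, "num_tasks": 2},
    {"lines_code": 14, "num_tasks": 2},
    {"lines_code": 9, "num_tasks": 3},
]
print(delta_metrics(history))
```

A file with n releases yields n - 1 delta records under this reading, so the first release of a file has no delta values.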
In addition to the metrics, the dataset contains the pre-trained models (one per repository) in the archive models.zip. The archive contains the following two files for each repository:
* a pickle file containing the pre-trained model. It has the following naming convention: <repoowner__reponame>.pkl
To load the model in Python, use the following:
from joblib import load
model = load('path/to/the/pickle/file.pkl', mmap_mode='r')
* a JSON file containing the attributes used to train the model. Those attributes are needed to use the corresponding model correctly. It has the following naming convention: <repoowner__reponame>.json. After loading the model as described above, read its attributes with:
import json
with open('path/to/the/json/file.json', 'r') as in_file:
    model_features = json.load(in_file)
You can then collect those features and pass them to the model to obtain the final prediction.
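The last step could look like the sketch below. It rests on two assumptions that the description does not state explicitly: that the JSON file holds a flat list of attribute names, and that the unpickled model exposes a scikit-learn style `predict` method accepting a 2D array-like. The helper names and the metric names used with them are hypothetical.

```python
from typing import Dict, List, Sequence


def build_feature_row(model_features: Sequence[str],
                      metrics: Dict[str, float]) -> List[List[float]]:
    """Order the measured metrics to match the attributes used at training time.

    Raises KeyError if a required attribute is missing, rather than guessing
    a default value. Metrics not listed in `model_features` are ignored.
    """
    missing = [name for name in model_features if name not in metrics]
    if missing:
        raise KeyError(f"missing attributes: {missing}")
    # A single sample, shaped as one row with one column per attribute.
    return [[float(metrics[name]) for name in model_features]]


def predict_defect(model, model_features: Sequence[str],
                   metrics: Dict[str, float]):
    """Return the model's prediction for one script.

    Assumption: `model.predict` takes a 2D array-like and returns one
    label per row.
    """
    row = build_feature_row(model_features, metrics)
    return model.predict(row)[0]
```

With `model` and `model_features` loaded as described above, `predict_defect(model, model_features, metrics)` would return the predicted label for one script, where `metrics` maps each attribute name to the value measured for that script.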
Thanks to the open-source community :)
What source code properties and properties about the development process are good predictors of defects in Infrastructure-as-Code scripts?
Further information about the dataset and the validation results can be found at: https://github.com/stefanodallapalma/Defect-Prediction-IaC-TSE2020-replication-package