There is a newer version of this record available.

Dataset Open Access

# Defect Prediction Tool Validation Dataset 1

Stefano Dalla Palma

The dataset provides IaC-oriented, delta and process metrics extracted from open-source GitHub repositories based on the Ansible language, as well as the corresponding validated defect-prediction models.

Context

Infrastructure-as-code (IaC) is the DevOps strategy that allows management and provisioning of infrastructure through the definition of machine-readable files and automation around them, rather than physical hardware configuration or interactive configuration tools.

On the one hand, although IaC is increasingly widely adopted, little is known about how best to maintain, rapidly evolve, and continuously improve the code behind it in a measurable fashion.
On the other hand, source code measurements are often computed and analyzed to evaluate different quality aspects of the software being developed.
In particular, Infrastructure-as-Code is simply "code" and, as such, is as prone to defects as code written in any other programming language.

This dataset targets the YAML-based Ansible language to devise **defect prediction** approaches for IaC based on machine learning.

Content

The dataset contains metrics extracted from 85 open-source GitHub repositories based on the Ansible language that satisfied the following criteria:

* The repository has at least one push event to its master branch in the last six months;
* The repository has at least 2 releases;
* At least 11% of the files in the repository are IaC scripts;
* The repository has at least 2 core contributors;
* The repository has evidence of continuous integration practice, such as the presence of a .travis.yml file;
* The repository has a comments ratio of at least 0.2%;
* The repository has a commit frequency of at least 2 per month on average;
* The repository has an issue frequency of at least 0.023 events per month on average;
* The repository has evidence of a license, such as the presence of a LICENSE.md file;
* The repository has at least 190 source lines of code.
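
The criteria above can be read as a single conjunctive filter. The following is a minimal sketch of that filter, not the author's selection script: `RepoStats`, `meets_criteria`, and every field name are hypothetical, and how each statistic is measured (for example, what counts as a core contributor) is left open here. Ratios are expressed as fractions, so 11% becomes 0.11 and 0.2% becomes 0.002.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    """Hypothetical per-repository statistics; field names are illustrative."""
    months_since_last_push: float  # time since the last push to master
    releases: int
    iac_ratio: float               # fraction of files that are IaC scripts
    core_contributors: int
    has_ci: bool                   # e.g. a Travis CI configuration file exists
    comments_ratio: float          # fraction of comment lines
    commits_per_month: float
    issues_per_month: float
    has_license: bool              # e.g. a LICENSE.md file exists
    sloc: int                      # source lines of code


def meets_criteria(repo: RepoStats) -> bool:
    """Return True only if every threshold listed above is satisfied."""
    return (
        repo.months_since_last_push <= 6
        and repo.releases >= 2
        and repo.iac_ratio >= 0.11
        and repo.core_contributors >= 2
        and repo.has_ci
        and repo.comments_ratio >= 0.002
        and repo.commits_per_month >= 2
        and repo.issues_per_month >= 0.023
        and repo.has_license
        and repo.sloc >= 190
    )
```

Because the checks are joined with `and`, failing any single threshold excludes the repository.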

Metrics are grouped into three categories:

* IaC-oriented: metrics of structural properties derived from the source code of infrastructure scripts;

* Delta: metrics that capture the amount of change in a file between two successive releases, collected for each IaC-oriented metric;

* Process: metrics that capture aspects of the development process rather than aspects of the code itself. A description of the process metrics in this dataset can be found [here](https://pydriller.readthedocs.io/en/latest/processmetrics.html).
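
To make the delta category concrete, here is a small sketch of how a per-metric change between two successive releases of one file could be computed. It is an illustration under assumptions: the helper `delta_metrics`, the `delta_` prefix, the metric names in the usage note, and the sign convention (current minus previous) are all hypothetical and may differ from how the dataset's columns were produced.

```python
def delta_metrics(previous, current):
    """Return the change of each shared metric between two releases of a file.

    Both arguments map metric names to numeric values. Metrics that are
    missing from either release are skipped.
    """
    return {
        f"delta_{name}": current[name] - previous[name]
        for name in current
        if name in previous
    }
```

For instance, `delta_metrics({"lines_code": 10}, {"lines_code": 14})` yields a single entry recording a change of 4 for that hypothetical metric.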

In addition to the metrics, the dataset contains the pre-trained models (one per repository) in the archive models.zip. The archive contains the following two files for each repository:

* a pickle file containing the pre-trained model. It has the following naming convention: <repoowner__reponame>.pkl
To load the model in Python, use the following:
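
A minimal sketch of the loading step, assuming the .pkl files are standard Python pickles. If they were written with joblib instead, `joblib.load(path)` is the counterpart. `load_model` is a hypothetical helper name, and the path is a placeholder.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Unpickle a model file such as <repoowner__reponame>.pkl.

    Assumption: the file is a plain pickle. Unpickling can execute arbitrary
    code, so only load files from a source you trust.
    """
    with Path(path).open("rb") as in_file:
        return pickle.load(in_file)


# Usage (placeholder path):
# model = load_model("path/to/the/pickle/file.pkl")
```

The library that defines the pickled model class must be importable in the environment where the file is loaded.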




* a JSON file containing the attributes used to train the model. Those attributes are needed to use the corresponding model correctly. It has the following naming convention: <repoowner__reponame>.json. To read them after loading the model as described above, use:

import json
with open('path/to/the/pickle/file.json', 'r') as in_file:
    features = json.load(in_file)


You can then collect and pass those features to the model for the final prediction.
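
The final step can be sketched as follows, under two assumptions that this record does not confirm: the JSON file holds a flat list of feature names, and the loaded model exposes a scikit-learn-style `predict` method that accepts a list of rows. `predict_one` and the `metrics` dictionary are hypothetical names used for illustration.

```python
def predict_one(model, features, metrics):
    """Predict for one script, given its metrics as a name-to-value mapping.

    `features` is the list read from the JSON file. The values are ordered
    exactly as in that list, and metrics not used in training are ignored.
    """
    missing = [name for name in features if name not in metrics]
    if missing:
        raise KeyError(f"missing metrics: {missing}")
    row = [metrics[name] for name in features]
    # Assumption: the model follows the scikit-learn predict() convention.
    return model.predict([row])[0]
```

Ordering the values by the stored feature list matters because a model trained on positional columns has no way to detect columns supplied in a different order.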

Acknowledgements

Thanks to the open-source community :)

Inspiration

What source code properties and properties about the development process are good predictors of defects in Infrastructure-as-Code scripts?

Note

Further information about the dataset and the validation results can be found at: https://github.com/stefanodallapalma/Defect-Prediction-IaC-TSE2020-replication-package

Files (251.9 MB)

| Name | Size | MD5 |
| --- | --- | --- |
| ansible-metrics.csv | 251.0 MB | f7b3df28d6b778d0157af43e1a768956 |
| models.zip | 857.9 kB | c139b1318128b884b20340af48e92514 |
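
The MD5 checksums listed for the files can be used to check a download. A minimal sketch using only the Python standard library follows; `md5_of` is a hypothetical helper name, and the file is read in chunks so that the larger CSV does not have to fit in memory.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as in_file:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: in_file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, assuming models.zip is in the working directory:
# md5_of("models.zip") == "c139b1318128b884b20340af48e92514"
```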