There is a newer version of this record available.

Dataset Open Access

Defect Prediction Tool Validation Dataset 1

Stefano Dalla Palma

The dataset provides IaC-oriented, delta and process metrics extracted from open-source GitHub repositories based on the Ansible language, as well as the corresponding validated defect-prediction models.


Infrastructure-as-code (IaC) is the DevOps strategy that allows management and provisioning of infrastructure through the definition of machine-readable files and automation around them, rather than physical hardware configuration or interactive configuration tools.

On the one hand, although IaC is an increasingly widely adopted practice, little is known about how best to maintain, speedily evolve, and continuously improve the code behind the IaC strategy in a measurable fashion.
On the other hand, source code measurements are often computed and analyzed to evaluate different quality aspects of the software being developed.
In particular, Infrastructure-as-Code is simply "code" and, as such, it is as prone to defects as code written in any other programming language.

This dataset targets the YAML-based Ansible language to devise **defect prediction** approaches for IaC based on machine learning.


The dataset contains metrics extracted from 85 open-source GitHub repositories based on the Ansible language that satisfied the following criteria:

* The repository has at least one push event to its master branch in the last six months;
* The repository has at least 2 releases;
* At least 11% of the files in the repository are IaC scripts;
* The repository has at least 2 core contributors;
* The repository has evidence of continuous integration practice, such as the presence of a .travis.yml file;
* The repository has a comments ratio of at least 0.2%;
* The repository has a commit frequency of at least 2 per month on average;
* The repository has an issue frequency of at least 0.023 events per month on average;
* The repository has evidence of a license, such as the presence of a file;
* The repository has at least 190 source lines of code.
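
The criteria above can be read as a single predicate over per-repository statistics. The following is a minimal sketch of that reading, not the tool used to build the dataset: the `RepoStats` class and all of its field names are hypothetical, and only the thresholds are taken from the list above.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    """Hypothetical per-repository statistics (field names are illustrative)."""
    months_since_last_push: float   # time since the last push to master
    releases: int
    iac_file_ratio: float           # fraction of files that are IaC scripts (0..1)
    core_contributors: int
    has_ci: bool                    # e.g. a CI configuration file is present
    comments_ratio: float           # fraction of comment lines (0..1)
    commits_per_month: float        # average commit frequency
    issue_events_per_month: float   # average issue frequency
    has_license: bool
    sloc: int                       # source lines of code


def meets_selection_criteria(repo: RepoStats) -> bool:
    """Return True only if every threshold from the list above is satisfied."""
    return all([
        repo.months_since_last_push <= 6,
        repo.releases >= 2,
        repo.iac_file_ratio >= 0.11,          # at least 11%
        repo.core_contributors >= 2,
        repo.has_ci,
        repo.comments_ratio >= 0.002,         # at least 0.2%
        repo.commits_per_month >= 2,
        repo.issue_events_per_month >= 0.023,
        repo.has_license,
        repo.sloc >= 190,
    ])
```

Percentages are expressed as fractions here so that every comparison is unit-free; how each statistic is actually measured is outside the scope of this sketch.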

Metrics are grouped into three categories:

* IaC-oriented: metrics of structural properties derived from the source code of infrastructure scripts;

* Delta: metrics that capture the amount of change in a file between two successive releases, collected for each IaC-oriented metric;

* Process: metrics that capture aspects of the development process rather than aspects of the code itself. A description of the process metrics in this dataset can be found [here](
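
To illustrate the delta category, the sketch below computes one delta value per IaC-oriented metric for a file across two successive releases. It assumes that "amount of change" means the plain difference (later release minus earlier release); the dataset may define it differently. The `delta_` prefix and the metric names in the usage example are hypothetical.

```python
def delta_metrics(previous: dict, current: dict) -> dict:
    """Difference of each metric between two successive releases of one file.

    Assumption: delta = value in the later release - value in the earlier one.
    Metrics missing from either release are skipped.
    """
    return {
        f"delta_{name}": current[name] - previous[name]
        for name in current
        if name in previous
    }


# Usage with made-up metric values for the same file in two releases.
release_1 = {"lines_code": 40, "num_tasks": 5}
release_2 = {"lines_code": 55, "num_tasks": 4}
print(delta_metrics(release_1, release_2))
# → {'delta_lines_code': 15, 'delta_num_tasks': -1}
```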

In addition to the metrics, the dataset contains the pre-trained models (one per repository) in a separate folder. The folder contains the following two files for each repository:

* a pickle file containing the pre-trained model. It has the following naming convention: <repoowner__reponame>.pkl
  To load the model in Python, use the following:

from joblib import load
model = load('path/to/the/pickle/file.pkl', mmap_mode='r')

* a JSON file containing the attributes used to train the model. Those attributes are needed to use the corresponding model correctly. It has the following naming convention: <repoowner__reponame>.json. To load the attributes, use:

import json
with open('path/to/the/pickle/file.json', 'r') as in_file:
    model_features = json.load(in_file)

You can then collect and pass those features to the model for the final prediction.
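
As a rough illustration of that last step, the sketch below arranges the collected metrics in the order given by the JSON file and asks the model for a prediction. It rests on two assumptions that the description does not confirm: that the JSON file holds a flat list of feature names, and that the loaded model exposes a scikit-learn-style `predict` method. The helper names are hypothetical.

```python
import json


def load_model_and_features(pkl_path, json_path):
    """Load a pre-trained model and its feature list (paths are placeholders)."""
    # Imported here so the helpers below remain usable without joblib installed.
    from joblib import load
    model = load(pkl_path, mmap_mode='r')
    with open(json_path, 'r') as in_file:
        model_features = json.load(in_file)  # assumed: a list of feature names
    return model, model_features


def build_feature_row(model_features, metrics):
    """Order the metric values of one script as the model expects them."""
    missing = [name for name in model_features if name not in metrics]
    if missing:
        raise ValueError(f"missing metrics: {missing}")
    # Metrics not listed in model_features are ignored.
    return [metrics[name] for name in model_features]


def predict_defect(model, model_features, metrics):
    """Return the model's prediction for a single script."""
    row = build_feature_row(model_features, metrics)
    # Assumed scikit-learn convention: predict takes a 2D array-like of samples.
    return model.predict([row])[0]
```

Here `metrics` is a dictionary mapping metric names to the values collected for the script under analysis. Failing early on missing metrics avoids silently feeding the model a misaligned row.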



Thanks to the open-source community :)


What source code properties and properties about the development process are good predictors of defects in Infrastructure-as-Code scripts?



Further information about the dataset and the validation results can be found at:
