There is a newer version of the record available.

Published June 24, 2020 | Version v1.0
Dataset Open

Defect Prediction Tool Validation Dataset 1

  • 1. Tilburg University

Description

The dataset provides IaC-oriented, delta and process metrics extracted from open-source GitHub repositories based on the Ansible language, as well as the corresponding validated defect-prediction models.

Context

Infrastructure-as-code (IaC) is the DevOps strategy that allows management and provisioning of infrastructure through the definition of machine-readable files and automation around them, rather than physical hardware configuration or interactive configuration tools.

On the one hand, although IaC is an increasingly widely adopted practice, little is known about how best to maintain, speedily evolve, and continuously improve the code behind the IaC strategy in a measurable fashion.
On the other hand, source code measurements are often computed and analyzed to evaluate different quality aspects of the software being developed.
In particular, Infrastructure-as-Code is simply "code" and, as such, is as prone to defects as code written in any other programming language.

This dataset targets the YAML-based Ansible language to devise **defect prediction** approaches for IaC based on machine learning.

Content

The dataset contains metrics extracted from 85 open-source GitHub repositories based on the Ansible language that satisfied the following criteria:

* The repository has at least one push event to its master branch in the last six months;
* The repository has at least 2 releases;
* At least 11% of the files in the repository are IaC scripts;
* The repository has at least 2 core contributors;
* The repository has evidence of continuous integration practice, such as the presence of a .travis.yml file;
* The repository has a comments ratio of at least 0.2%;
* The repository has commit frequency of at least 2 per month on average;
* The repository has an issue frequency of at least 0.023 events per month on average;
* The repository has evidence of a license, such as the presence of a LICENSE.md file;
* The repository has at least 190 source lines of code.
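
The selection criteria above can be read as a single pass/fail check per repository. The following is a minimal sketch of such a check, not the procedure used to build the dataset: the `RepoStats` class and all of its field names are hypothetical, and it assumes the statistics have already been collected for each repository.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    """Hypothetical summary of one repository; field names are illustrative."""
    months_since_last_push: float
    releases: int
    iac_file_ratio: float          # fraction of files that are IaC scripts (0..1)
    core_contributors: int
    has_ci: bool                   # e.g. a CI configuration file is present
    comments_ratio: float          # fraction of comment lines (0..1)
    commits_per_month: float
    issue_events_per_month: float
    has_license: bool              # e.g. a LICENSE.md file is present
    source_lines: int


def meets_criteria(repo: RepoStats) -> bool:
    """Return True if the repository passes every threshold listed above."""
    return (
        repo.months_since_last_push <= 6
        and repo.releases >= 2
        and repo.iac_file_ratio >= 0.11      # at least 11% IaC scripts
        and repo.core_contributors >= 2
        and repo.has_ci
        and repo.comments_ratio >= 0.002     # at least 0.2%
        and repo.commits_per_month >= 2
        and repo.issue_events_per_month >= 0.023
        and repo.has_license
        and repo.source_lines >= 190
    )
```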

Metrics are grouped into three categories:

* IaC-oriented: metrics of structural properties derived from the source code of infrastructure scripts;

* Delta: metrics that capture the amount of change in a file between two successive releases, collected for each IaC-oriented metric;

* Process: metrics that capture aspects of the development process rather than aspects of the code itself. Descriptions of the process metrics in this dataset can be found [here](https://pydriller.readthedocs.io/en/latest/processmetrics.html).
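
To make the delta category concrete, here is a small sketch that derives one delta value per IaC-oriented metric for a file in two successive releases. It assumes the delta is the plain difference between the two values; the exact definition used in the dataset is not stated here. The metric names (`lines_code`, `num_tasks`) and the `delta_` prefix are hypothetical.

```python
def delta_metrics(previous: dict, current: dict) -> dict:
    """Per-metric change between two successive releases of one file.

    Assumption: the delta is the plain difference current - previous;
    metrics missing from either release are skipped.
    """
    return {
        f"delta_{name}": current[name] - previous[name]
        for name in current
        if name in previous
    }


# Hypothetical IaC-oriented metric values for the same file in two releases.
release_1 = {"lines_code": 40, "num_tasks": 5}
release_2 = {"lines_code": 52, "num_tasks": 4}
print(delta_metrics(release_1, release_2))
# prints {'delta_lines_code': 12, 'delta_num_tasks': -1}
```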

In addition to the metrics, the dataset contains the pre-trained models (one per repository) in the archive models.zip. The archive contains the following two files for each repository:

* a pickle file containing the pre-trained model. It has the following naming convention: <repoowner__reponame>.pkl
  To load the model in Python, use the following:

```
from joblib import load

# mmap_mode='r' memory-maps stored arrays read-only instead of copying them into memory
model = load('path/to/the/pickle/file.pkl', mmap_mode='r')
```

* a JSON file containing the attributes used to train the model. Those attributes are needed to use the corresponding model correctly. It has the following naming convention: <repoowner__reponame>.json. To load the attributes, use:
```
import json

# Read the attributes (features) the model was trained on
with open('path/to/the/json/file.json', 'r') as in_file:
    model_features = json.load(in_file)
```

You can then collect and pass those features to the model for the final prediction.
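
As a rough illustration of that last step, the sketch below orders the collected metric values according to the loaded attributes and asks the model for a prediction. It rests on several assumptions: that the JSON file decodes to a list of feature names, that the loaded model exposes a scikit-learn style `predict()` method, and that a metric that was not collected can default to 0. The helper names `build_feature_row` and `predict_defect` are hypothetical.

```python
def build_feature_row(metrics: dict, feature_names: list) -> list:
    """Order the metric values as listed in feature_names.

    Assumption: a metric that was not collected defaults to 0.
    """
    return [float(metrics.get(name, 0)) for name in feature_names]


def predict_defect(model, metrics: dict, feature_names: list):
    """Return the model's prediction for a single script.

    Assumption: `model` exposes a scikit-learn style predict() that takes
    a 2D array-like with one row per sample.
    """
    row = build_feature_row(metrics, feature_names)
    return model.predict([row])[0]
```

With `model` and `model_features` loaded as described above, a call would look like `predict_defect(model, collected_metrics, model_features)`, where `collected_metrics` is a dictionary of metric values for the script under analysis.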

 

Acknowledgements

Thanks to the open-source community :)


Inspiration

Which properties of the source code and of the development process are good predictors of defects in Infrastructure-as-Code scripts?

 

Note

Further information about the dataset and the validation results can be found at: https://github.com/stefanodallapalma/Defect-Prediction-IaC-TSE2020-replication-package

Files

ansible-metrics.csv

Files (251.9 MB)

Size, checksum
251.0 MB, md5:f7b3df28d6b778d0157af43e1a768956
857.9 kB, md5:c139b1318128b884b20340af48e92514

Additional details

Funding

RADON – Rational decomposition and orchestration for serverless computing 825040
European Commission