Published March 16, 2018 | Version v1
Dataset Open

Structured Information on State and Evolution of Dockerfiles on GitHub

  • 1. University of Zurich

Description

Docker containers are standardized, self-contained units of applications, packaged with their dependencies and execution environment. The environment is defined in a Dockerfile, which specifies as infrastructure code the steps needed to reach a certain system state, with the aim of enabling reproducible builds of the container. To lay the groundwork for research on infrastructure code, we collected structured information about the state and evolution of Dockerfiles on GitHub and release it as a PostgreSQL database archive covering over 100,000 unique Dockerfiles in over 15,000 GitHub projects. The dataset supports research questions about different kinds of software evolution behavior in the Docker ecosystem.

Notes

Detailed information on the dataset can be found in the paper "Structured Information on State and Evolution of Dockerfiles on GitHub" accepted at the Data Showcase Track of the International Conference on Mining Software Repositories 2018 (MSR 2018). The software used to collect the dataset and instructions on how to use the dataset can be found in the paper's online appendix: https://github.com/sealuzh/msr18-docker-dataset

Files

Files (1.5 GB)

md5:cf7583effc1d699a3f6dcccf34f1e2e0 (1.5 GB)
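
Because the archive is large, it can be worth checking a download against the MD5 checksum listed above before restoring it. The following is a minimal sketch, not part of the dataset's tooling: the function names `md5_of_file` and `verify_archive` are illustrative, and the path to the downloaded file is whatever name it was saved under locally (the record does not state a file name here).

```python
import hashlib

# Checksum as listed in the Files section of this record.
EXPECTED_MD5 = "cf7583effc1d699a3f6dcccf34f1e2e0"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_archive` with the local path of the downloaded archive returns `True` when the file matches the published checksum. A mismatch usually indicates an incomplete or corrupted download.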