A dataset of GitHub Actions workflow histories
Description
This replication package accompanies the dataset and exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories", published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)
Important notice: Zenodo appears to compress gzipped files a second time without notice, so they end up "double compressed". When you download them, they may be named `x.gz.gz` instead of `x.gz`. Note that the provided MD5 refers to the original file.
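To cope with the double compression, one gunzip pass recovers the original `.gz`, whose MD5 should then match the checksum published with the dataset. The snippet below is a minimal sketch of that check; the tiny payload stands in for a real downloaded file such as `workflows.csv.gz.gz`.

```python
import gzip
import hashlib

def undo_double_gzip(data: bytes) -> bytes:
    """Strip one gzip layer from the downloaded bytes."""
    return gzip.decompress(data)

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

# Simulate the situation with a tiny payload standing in for a real file.
original_gz = gzip.compress(b"repository,commit_hash\noctocat/hello,abc123\n")
downloaded = gzip.compress(original_gz)  # what Zenodo serves: x.gz.gz

recovered = undo_double_gzip(downloaded)
# md5_hex(recovered) should equal the MD5 listed for the original x.gz
```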
2024-10-25 update: updated the repository list and observation period. The filters relying on dates were also updated.
2024-07-09 update: fixed an occasionally invalid `valid_yaml` flag.
The dataset was created as follows:
- First, we used GitHub SEART (on October 7th, 2024) to get a list of all non-fork repositories created before January 1st, 2024, having at least 300 commits and at least 100 stars, and in which at least one commit was made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.)
- We checked whether a `.github/workflows` folder existed. We filtered out repositories that did not contain this folder and pulled the others (between October 9th and 10th, 2024).
- We applied the tool `gigawork` (version 1.4.2) to extract every file from this folder. The exact command used is `python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries`. (The script `batch.py` can be found on GitHub.)
- We concatenated every file in `/ourDataFolder/output` into a CSV (using `cat headers.csv output/*.csv > workflows_auxiliaries.csv` in `/ourDataFolder`) and compressed it.
- We added the column `uid` via a script available on GitHub.
- Finally, we archived the folder `/ourDataFolder/workflows` with pigz (`tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows`).
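The concatenation step above (`cat headers.csv output/*.csv` followed by compression) can be sketched in Python on synthetic data; the header columns and per-repository rows below are illustrative stand-ins for the real `headers.csv` and `output/*.csv` files.

```python
import csv
import gzip
import io

header = ["repository", "commit_hash", "file_path"]  # stands in for headers.csv
chunks = [  # stand-ins for the header-less per-repository CSVs in output/
    [["octocat/hello", "abc123", ".github/workflows/ci.yml"]],
    [["torvalds/linux", "def456", ".github/workflows/build.yml"]],
]

# Write the header once, then append every chunk, mirroring the cat command.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
for rows in chunks:
    writer.writerows(rows)

# Compress the concatenated CSV, mirroring the final gzip step.
compressed = gzip.compress(buf.getvalue().encode())
```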
Using the extracted data, the following files were created:
- `workflows.tar.gz` contains the dataset of GitHub Actions workflow file histories.
- `workflows_auxiliaries.tar.gz` is a similar archive that also contains auxiliary files.
- `workflows.csv.gz` contains the metadata for the extracted workflow files.
- `workflows_auxiliaries.csv.gz` is a similar file that also contains metadata for auxiliary files.
- `repositories.csv.gz` contains metadata about the GitHub repositories containing the workflow files. This metadata was extracted using the SEART Search tool.
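The gzipped CSV files can be read directly with the standard library. The sketch below uses a tiny in-memory payload in place of a downloaded file; in practice you would pass the path to, e.g., `workflows.csv.gz` to `gzip.open`.

```python
import csv
import gzip
import io

# Illustrative payload standing in for the downloaded workflows.csv.gz.
payload = gzip.compress(
    b"repository,commit_hash,file_path\n"
    b"octocat/hello,abc123,.github/workflows/ci.yml\n"
)

# gzip.open accepts either a path or a file object; "rt" decodes to text
# so csv.DictReader can consume it row by row.
with gzip.open(io.BytesIO(payload), mode="rt", newline="") as fh:
    records = list(csv.DictReader(fh))
```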
The metadata is separated into different columns:
- `repository`: The repository (author and repository name) from which the workflow was extracted. The "/" separator distinguishes the author from the repository name.
- `commit_hash`: The commit hash returned by git.
- `author_name`: The name of the author who changed this file.
- `author_email`: The email of the author who changed this file.
- `committer_name`: The name of the committer.
- `committer_email`: The email of the committer.
- `committed_date`: The committed date of the commit.
- `authored_date`: The authored date of the commit.
- `file_path`: The path to this file in the repository.
- `previous_file_path`: The path to this file before it was touched.
- `file_hash`: The name of the related workflow file in the dataset.
- `previous_file_hash`: The name of the related workflow file in the dataset, before it was touched.
- `git_change_type`: A single letter (A, D, M or R) representing the type of change made to the workflow (Added, Deleted, Modified or Renamed). This letter is given by `gitpython` and provided as is.
- `valid_yaml`: A boolean indicating whether the file is a valid YAML file.
- `probably_workflow`: A boolean indicating whether the file contains the YAML keys `on` and `jobs`. (Note that it can still be an invalid YAML file.)
- `valid_workflow`: A boolean indicating whether the file respects the syntax of GitHub Actions workflows. A freely available JSON Schema (used by gigawork) was used for this purpose.
- `uid`: A unique identifier for a given file that survives modifications and renames. It is generated when the file is added and stays the same until the file is deleted. Renames do not change the identifier.
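The `uid` column makes it easy to follow a single workflow file across modifications and renames. A minimal sketch, using fabricated rows in the metadata format described above:

```python
import csv
import io
from collections import defaultdict

# Fabricated example rows: uid u1 is added, modified, then renamed;
# the rename (R) keeps the same uid even though file_path changes.
rows = """uid,repository,commit_hash,committed_date,file_path,git_change_type
u1,octocat/hello,c1,2021-01-01,.github/workflows/ci.yml,A
u1,octocat/hello,c2,2021-02-01,.github/workflows/ci.yml,M
u1,octocat/hello,c3,2021-03-01,.github/workflows/main.yml,R
u2,octocat/hello,c2,2021-02-01,.github/workflows/release.yml,A
"""

# Group the change events by uid to reconstruct per-file histories.
histories = defaultdict(list)
for row in csv.DictReader(io.StringIO(rows)):
    histories[row["uid"]].append(row)

# Sort each history chronologically (ISO dates sort lexicographically).
for events in histories.values():
    events.sort(key=lambda r: r["committed_date"])
```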
Both `workflows.csv.gz` and `workflows_auxiliaries.csv.gz` follow this format.
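Metadata rows can be paired with file contents by looking up the `file_hash` name inside the archive. The member path and hashing scheme below are assumptions made for illustration (the real layout inside `workflows.tar.gz` may differ); a tiny synthetic archive demonstrates the lookup.

```python
import hashlib
import io
import tarfile

content = b"on: push\njobs: {}\n"
# Assumption for illustration: the archive member is named after the
# file's hash, matching the file_hash metadata column.
file_hash = hashlib.sha1(content).hexdigest()

# Build a tiny synthetic workflows.tar.gz in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name=f"workflows/{file_hash}")
    info.size = len(content)
    tar.addfile(info, io.BytesIO(content))

# Look up the workflow content by its file_hash.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    extracted = tar.extractfile(f"workflows/{file_hash}").read()
```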