A dataset of GitHub Actions workflow histories
Description
This replication package accompanies the dataset and exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories", published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)
Important notice: Zenodo appears to compress gzipped files a second time without notice, so they end up "double compressed". When you download them, they may be named `x.gz.gz` instead of `x.gz`. Note that the provided MD5 refers to the original file.
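To cope with the double compression, one gunzip pass recovers the original `.gz`, whose MD5 should then match the checksum published with the dataset. The snippet below is a minimal sketch of that check; the tiny payload stands in for a real downloaded file such as `workflows.csv.gz.gz`.

```python
import gzip
import hashlib

def undo_double_gzip(data: bytes) -> bytes:
    """Strip one gzip layer from the downloaded bytes."""
    return gzip.decompress(data)

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

# Simulate the situation with a tiny payload standing in for a real file.
original_gz = gzip.compress(b"repository,commit_hash\noctocat/hello,abc123\n")
downloaded = gzip.compress(original_gz)  # what Zenodo serves: x.gz.gz

recovered = undo_double_gzip(downloaded)
# md5_hex(recovered) should equal the MD5 listed for the original x.gz
```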
2024-10-25 update: updated the repository list and observation period. The filters relying on dates were also updated.
2024-07-09 update: fixed an occasionally invalid `valid_yaml` flag.
The dataset was created as follows:
- First, we used GitHub SEART (on October 7th, 2024) to get a list of all non-fork repositories created before January 1st, 2024, having at least 300 commits and at least 100 stars, and in which at least one commit was made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.)
- We checked whether a `.github/workflows` folder existed. We filtered out repositories that did not contain this folder and pulled the others (between October 9th and 10th, 2024).
- We applied the tool `gigawork` (version 1.4.2) to extract every file from this folder. The exact command used is `python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries`. (The script `batch.py` can be found on GitHub.)
- We concatenated every file in `/ourDataFolder/output` into a CSV (using `cat headers.csv output/*.csv > workflows_auxiliaries.csv` in `/ourDataFolder`) and compressed it.
- We added the column `uid` via a script available on GitHub.
- Finally, we archived the folder `/ourDataFolder/workflows` with pigz (`tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows`).
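The concatenation step above (`cat headers.csv output/*.csv` followed by compression) can be sketched in Python on synthetic data; the header columns and per-repository rows below are illustrative stand-ins for the real `headers.csv` and `output/*.csv` files.

```python
import csv
import gzip
import io

header = ["repository", "commit_hash", "file_path"]  # stands in for headers.csv
chunks = [  # stand-ins for the header-less per-repository CSVs in output/
    [["octocat/hello", "abc123", ".github/workflows/ci.yml"]],
    [["torvalds/linux", "def456", ".github/workflows/build.yml"]],
]

# Write the header once, then append every chunk, mirroring the cat command.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
for rows in chunks:
    writer.writerows(rows)

# Compress the concatenated CSV, mirroring the final gzip step.
compressed = gzip.compress(buf.getvalue().encode())
```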
Using the extracted data, the following files were created:
- `workflows.tar.gz` contains the dataset of GitHub Actions workflow file histories.
- `workflows_auxiliaries.tar.gz` is a similar archive that also contains auxiliary files.
- `workflows.csv.gz` contains the metadata for the extracted workflow files.
- `workflows_auxiliaries.csv.gz` is a similar file that also contains metadata for auxiliary files.
- `repositories.csv.gz` contains metadata about the GitHub repositories containing the workflow files. This metadata was extracted using the SEART Search tool.
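The gzipped CSV files can be read directly with the standard library. The sketch below uses a tiny in-memory payload in place of a downloaded file; in practice you would pass the path to, e.g., `workflows.csv.gz` to `gzip.open`.

```python
import csv
import gzip
import io

# Illustrative payload standing in for the downloaded workflows.csv.gz.
payload = gzip.compress(
    b"repository,commit_hash,file_path\n"
    b"octocat/hello,abc123,.github/workflows/ci.yml\n"
)

# gzip.open accepts either a path or a file object; "rt" decodes to text
# so csv.DictReader can consume it row by row.
with gzip.open(io.BytesIO(payload), mode="rt", newline="") as fh:
    records = list(csv.DictReader(fh))
```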
The metadata is separated into different columns:
- `repository`: The repository (author and repository name) from which the workflow was extracted. The "/" separator distinguishes the author from the repository name.
- `commit_hash`: The commit hash returned by git.
- `author_name`: The name of the author who changed this file.
- `author_email`: The email of the author who changed this file.
- `committer_name`: The name of the committer.
- `committer_email`: The email of the committer.
- `committed_date`: The committed date of the commit.
- `authored_date`: The authored date of the commit.
- `file_path`: The path to this file in the repository.
- `previous_file_path`: The path to this file before it was touched.
- `file_hash`: The name of the related workflow file in the dataset.
- `previous_file_hash`: The name of the related workflow file in the dataset, before it was touched.
- `git_change_type`: A single letter (A, D, M or R) representing the type of change made to the workflow (Added, Deleted, Modified or Renamed). This letter is given by `gitpython` and provided as is.
- `valid_yaml`: A boolean indicating whether the file is a valid YAML file.
- `probably_workflow`: A boolean indicating whether the file contains the YAML keys `on` and `jobs`. (Note that it can still be an invalid YAML file.)
- `valid_workflow`: A boolean indicating whether the file respects the syntax of GitHub Actions workflows. A freely available JSON Schema (used by gigawork) was used for this purpose.
- `uid`: A unique identifier for a given file that survives modifications and renames. It is generated when the file is added and stays the same until the file is deleted. Renames do not change the identifier.
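The `uid` column makes it easy to follow a single workflow file across modifications and renames. A minimal sketch, using fabricated rows in the metadata format described above:

```python
import csv
import io
from collections import defaultdict

# Fabricated example rows: uid u1 is added, modified, then renamed;
# the rename (R) keeps the same uid even though file_path changes.
rows = """uid,repository,commit_hash,committed_date,file_path,git_change_type
u1,octocat/hello,c1,2021-01-01,.github/workflows/ci.yml,A
u1,octocat/hello,c2,2021-02-01,.github/workflows/ci.yml,M
u1,octocat/hello,c3,2021-03-01,.github/workflows/main.yml,R
u2,octocat/hello,c2,2021-02-01,.github/workflows/release.yml,A
"""

# Group the change events by uid to reconstruct per-file histories.
histories = defaultdict(list)
for row in csv.DictReader(io.StringIO(rows)):
    histories[row["uid"]].append(row)

# Sort each history chronologically (ISO dates sort lexicographically).
for events in histories.values():
    events.sort(key=lambda r: r["committed_date"])
```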
Both `workflows.csv.gz` and `workflows_auxiliaries.csv.gz` follow this format.
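Metadata rows can be paired with file contents by looking up the `file_hash` name inside the archive. The member path and hashing scheme below are assumptions made for illustration (the real layout inside `workflows.tar.gz` may differ); a tiny synthetic archive demonstrates the lookup.

```python
import hashlib
import io
import tarfile

content = b"on: push\njobs: {}\n"
# Assumption for illustration: the archive member is named after the
# file's hash, matching the file_hash metadata column.
file_hash = hashlib.sha1(content).hexdigest()

# Build a tiny synthetic workflows.tar.gz in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name=f"workflows/{file_hash}")
    info.size = len(content)
    tar.addfile(info, io.BytesIO(content))

# Look up the workflow content by its file_hash.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    extracted = tar.extractfile(f"workflows/{file_hash}").read()
```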