Published October 25, 2024 | Version 2024-10-25
Dataset Open

A dataset of GitHub Actions workflow histories

  • 1. University of Mons

Contributors

  • 1. University of Mons

Description

This replication package accompanies the dataset and exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories", published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)

Important notice: Zenodo appears to compress gzipped files a second time without notice, so the downloaded files are "double compressed". A downloaded file should therefore be named x.gz.gz instead of x.gz. Note that the provided MD5 refers to the original file.
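
The notice above can be handled with a few lines of Python. This is a minimal sketch, not part of the replication package: it assumes that "original file" means the single-compressed x.gz, so the published MD5 should be compared after removing one gzip layer. The function names are hypothetical.

```python
import gzip
import hashlib
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(path: Path) -> bool:
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def md5_of(path: Path) -> str:
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def strip_outer_gzip(downloaded: Path, restored: Path) -> Path:
    """Remove exactly one gzip layer (x.gz.gz -> x.gz)."""
    with gzip.open(downloaded, "rb") as src, open(restored, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return restored
```

A possible use is `strip_outer_gzip(Path("workflows.csv.gz.gz"), Path("workflows.csv.gz"))`, followed by comparing `md5_of(...)` of the restored file with the checksum listed on this page. If `is_gzip` returns False on the restored file, the download was probably not double compressed in the first place.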

2024-10-25 update : updated repositories list and observation period. The filters relying on date were also updated.

2024-07-09 update : fixed the valid_yaml flag, which was sometimes incorrect.

The dataset was created as follows:

  1. First, we used GitHub SEART (on October 7th, 2024) to get a list of all non-fork repositories created before January 1st, 2024, having at least 300 commits and at least 100 stars, and in which at least one commit was made after January 1st, 2024. (The goal of these filters is to exclude experimental and personal repositories.)
  2. We checked whether a .github/workflows folder existed. We filtered out the repositories that did not contain this folder and pulled the others (between the 9th and 10th of October 2024).
  3. We applied the tool gigawork (version 1.4.2) to extract every file from this folder. The exact command used is python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries. (The script batch.py can be found on GitHub.)
  4. We concatenated every file in /ourDataFolder/output into a single CSV (using cat headers.csv output/*.csv > workflows_auxiliaries.csv in /ourDataFolder) and compressed it.
  5. We added the column uid via a script available on GitHub.
  6. Finally, we archived the folder /ourDataFolder/workflows with pigz (tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows).
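
Steps 2 and 4 above can be sketched in Python as follows. This is an illustrative approximation, not the scripts actually used: the function names are hypothetical, and the sketch assumes that the per-repository CSV files carry no header row (as the use of a separate headers.csv suggests).

```python
import gzip
import shutil
from pathlib import Path


def has_workflows_folder(repository: Path) -> bool:
    """Step 2: keep only repositories that contain .github/workflows."""
    return (repository / ".github" / "workflows").is_dir()


def concatenate_and_compress(header: Path, parts_dir: Path, target: Path) -> None:
    """Step 4: write the header, then every *.csv in parts_dir, into one gzip file."""
    # sorted() approximates the alphabetical expansion of the shell glob output/*.csv
    with gzip.open(target, "wb") as out:
        for part in [header, *sorted(parts_dir.glob("*.csv"))]:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

For example, `concatenate_and_compress(Path("headers.csv"), Path("output"), Path("workflows_auxiliaries.csv.gz"))` would play the role of the cat command followed by compression.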

Using the extracted data, the following files were created :

  1. workflows.tar.gz contains the dataset of GitHub Actions workflow file histories.
  2. workflows_auxiliaries.tar.gz is a similar archive that also contains auxiliary files.
  3. workflows.csv.gz contains the metadata for the extracted workflow files.
  4. workflows_auxiliaries.csv.gz is a similar file that also contains metadata for auxiliary files.
  5. repositories.csv.gz contains metadata about the GitHub repositories containing the workflow files. These metadata were extracted using the SEART Search tool. 

The metadata is organized into the following columns:

  1. repository: The repository (author and repository name) from which the workflow was extracted. The separator "/" distinguishes the author from the repository name.
  2. commit_hash: The commit hash returned by git
  3. author_name: The name of the author that changed this file
  4. author_email: The email of the author that changed this file
  5. committer_name: The name of the committer
  6. committer_email: The email of the committer
  7. committed_date: The committed date of the commit
  8. authored_date:  The authored date of the commit
  9. file_path:  The path to this file in the repository
  10. previous_file_path: The path to this file before the change
  11. file_hash: The name of the related workflow file in the dataset
  12. previous_file_hash: The name of the related workflow file in the dataset, before the change
  13. git_change_type: A single letter (A,D, M or R) representing the type of change made to the workflow (Added, Deleted, Modified or Renamed). This letter is given by gitpython and provided as is. 
  14. valid_yaml: A boolean indicating if the file is a valid YAML file.
  15. probably_workflow: A boolean indicating whether the file contains the YAML keys on and jobs. (Note that it can still be an invalid YAML file.)
  16. valid_workflow: A boolean indicating whether the file respects the syntax of GitHub Actions workflows. A freely available JSON Schema (used by gigawork) was used for this purpose.
  17. uid: Unique identifier for a given file, surviving modifications and renames. It is generated when the file is added and stays the same until the file is deleted. Renaming does not change the identifier.
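
The uid semantics described in the last column can be illustrated with a short sketch. This is not the script used to build the dataset; it is one possible way to obtain identifiers with the stated properties. It assumes the changes of a single repository are given oldest first on a linear history, that file_path holds the deleted path for a deletion, and it uses integers as identifiers purely for illustration.

```python
import itertools
from typing import Dict, Iterable, List, Optional, Tuple

# (git_change_type, file_path, previous_file_path)
Change = Tuple[str, str, Optional[str]]


def assign_uids(changes: Iterable[Change]) -> List[int]:
    """Return one identifier per change, stable across modifications and renames."""
    counter = itertools.count(1)
    live: Dict[str, int] = {}  # path currently present -> identifier
    uids: List[int] = []
    for change_type, path, previous_path in changes:
        if change_type == "A":    # added: a new identifier is generated
            uid = next(counter)
            live[path] = uid
        elif change_type == "R":  # renamed: the identifier follows the file
            uid = live.pop(previous_path)
            live[path] = uid
        elif change_type == "D":  # deleted: the identifier is retired
            uid = live.pop(path)
        else:                     # "M", modified: identifier unchanged
            uid = live[path]
        uids.append(uid)
    return uids


history = [
    ("A", "a.yml", None),
    ("M", "a.yml", "a.yml"),
    ("R", "b.yml", "a.yml"),
    ("D", "b.yml", "b.yml"),
    ("A", "b.yml", None),  # re-created after deletion: new identifier
]
print(assign_uids(history))  # [1, 1, 1, 1, 2]
```

The design choice is to key the live files by their current path, so a rename only moves the mapping while a deletion drops it; a file re-created at the same path later therefore receives a fresh identifier.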

Both workflows.csv.gz and workflows_auxiliaries.csv.gz follow this format.
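
As an illustration of this format, the sketch below reads one of the metadata files with the Python standard library and groups the rows into per-file histories. It assumes the file has already been reduced to a single gzip layer, that the header row uses the column names listed above, and that booleans are serialized as "True"/"False" (to be checked against the actual file). The function names are hypothetical.

```python
import csv
import gzip
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, str]


def read_metadata(path: str) -> Iterator[Row]:
    """Yield one dict per row of a gzip-compressed metadata CSV."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


def is_flag_set(row: Row, column: str) -> bool:
    """Interpret a boolean column; assumes the textual form "True"/"False"."""
    return row[column].strip().lower() == "true"


def histories(rows: Iterable[Row]) -> Dict[Tuple[str, str], List[Row]]:
    """Group rows by (repository, uid), keeping the order found in the file."""
    grouped: Dict[Tuple[str, str], List[Row]] = defaultdict(list)
    for row in rows:
        grouped[(row["repository"], row["uid"])].append(row)
    return dict(grouped)
```

For instance, `histories(r for r in read_metadata("workflows.csv.gz") if is_flag_set(r, "valid_workflow"))` keeps only the versions that pass the schema check. Grouping on the pair (repository, uid) is a cautious choice that works whether or not uid values are unique across repositories.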

Files (3.5 GB)

md5:f3494a699893b66e0ca2e65cb6f74061 (2.0 MB)
md5:15b51003f6d2929b1c188d2c72ca72c3 (191.3 MB)
md5:5079c6fb5392d08d8b8d0cb728ddf7f8 (2.2 GB)
md5:db834bf6aae420f495f29365db968595 (196.5 MB)
md5:e2b9ce703918e32f07c8eec3a917fd11 (963.9 MB)