Published May 22, 2026 | Version 2026-05-22
Dataset Open

A dataset of GitHub Actions workflow histories

  • 1. University of Mons

Contributors

  • 1. University of Mons

Description

This replication package accompanies the dataset and exploratory empirical analysis reported in the paper "A dataset of GitHub Actions workflow histories", published at the IEEE MSR 2024 conference. (The Jupyter notebook can be found in a previous version of this dataset.)

Important notice: Zenodo appears to compress gzipped files a second time without notice, so they are "double compressed". When you download them, they should therefore be named x.gz.gz instead of x.gz. Note that the provided MD5 refers to the original file.
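The double compression described above can be undone before checking the MD5. The following is a minimal sketch, not part of the replication package: the function names peel_outer_gzip and md5_of are hypothetical, and the detection relies on the assumption that a doubly compressed file decompresses to data starting with the gzip magic bytes (0x1f 0x8b).

```python
import gzip
import hashlib
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def peel_outer_gzip(src: Path, dst: Path) -> bool:
    """Write src to dst, removing one gzip layer if src is double compressed.

    Returns True if a layer was removed, False if src was copied unchanged.
    Hypothetical helper: the heuristic only checks the gzip magic bytes.
    """
    with open(src, "rb") as raw:
        if raw.read(2) != GZIP_MAGIC:
            shutil.copyfile(src, dst)  # not gzip at all
            return False
    with gzip.open(src, "rb") as outer:
        head = outer.read(2)
        if head != GZIP_MAGIC:
            shutil.copyfile(src, dst)  # compressed only once
            return False
        with open(dst, "wb") as out:
            out.write(head)
            shutil.copyfileobj(outer, out)  # stream, so large files fit in memory
    return True


def md5_of(path) -> str:
    """Return the hexadecimal MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

After calling peel_outer_gzip on a downloaded file, the result of md5_of on the output can be compared with the MD5 listed for that file on this page.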

2026-05-22 update: update repositories list and observation period. We now have 4.1M+ workflows from 52.9K+ repositories. We consider repositories with at least one commit after March 26th, 2025, and they were pulled on March 26th, 2026.

2025-10-09 update: update repositories list and observation period. We now have 3M+ workflows from 49.2K+ repositories. We consider repositories with at least one commit after August 25th, 2024, and they were pulled on August 25th-26th, 2025.

2025-04-15 update: fix missing metadata and minor notation bug. (unchanged observation period)

2024-10-25 update: update repositories list and observation period. We now have 2.3M+ workflows from 43.3K+ repositories. We consider repositories with at least one commit after January 1st, 2024, and they were pulled on October 7th, 2024.

2024-07-09 update: fix sometimes invalid valid_yaml flag.

2024-04-30: initial version

The dataset was created as follows:

  1. First, we used GitHub SEART (on March 26th, 2026) to get a list of every non-fork repository created at least one year earlier, having at least 300 commits and at least 100 stars, and in which at least one commit was made in the last year. (The goal of these filters is to exclude experimental and personal repositories.)
  2. We checked whether a folder .github/workflows existed. We filtered out the repositories that did not contain this folder and pulled the others (on March 26th, 2026).
  3. We applied the tool gigawork (version 1.4.2) to extract every file from this folder. The exact command used is python batch.py -d /ourDataFolder/repositories -e /ourDataFolder/errors -o /ourDataFolder/output -r /ourDataFolder/repositories_everything.csv.gz -- -w /ourDataFolder/workflows_auxiliaries. (The script batch.py can be found on GitHub.)
  4. We concatenated every file in /ourDataFolder/output into a CSV (using cat headers.csv output/*.csv > workflows_auxiliaries.csv in /ourDataFolder) and compressed it.
  5. We added the column uid via a script available on GitHub.
  6. Finally, we archived the folder /ourDataFolder/workflows with tar, using pigz for compression (tar -c --use-compress-program=pigz -f workflows_auxiliaries.tar.gz /ourDataFolder/workflows).
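The selection criteria of step 1 can be written as a single predicate. This is only an illustrative sketch: the Repo class and its field names are hypothetical and do not reflect the SEART export format, and "at least one commit in the last year" is interpreted here as a last commit strictly after the date one year before the reference date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Repo:
    """Hypothetical repository record; field names are illustrative only."""
    name: str
    is_fork: bool
    commits: int
    stars: int
    created_at: date
    last_commit: date


def one_year_before(day: date) -> date:
    """Return the same calendar day one year earlier (Feb 29 maps to Feb 28)."""
    try:
        return day.replace(year=day.year - 1)
    except ValueError:
        return day.replace(year=day.year - 1, day=28)


def is_selected(repo: Repo, reference: date) -> bool:
    """Apply the step 1 filters relative to the reference (query) date."""
    cutoff = one_year_before(reference)
    return (
        not repo.is_fork
        and repo.commits >= 300
        and repo.stars >= 100
        and repo.created_at <= cutoff   # created at least one year earlier
        and repo.last_commit > cutoff   # active within the last year
    )
```

With a reference date of March 26th, 2026, the cutoff is March 26th, 2025, which matches the observation period stated in the latest update.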

Using the extracted data, the following files were created:

  1. workflows.tar.gz contains the dataset of GitHub Actions workflow file histories.
  2. workflows_auxiliaries.tar.gz is a similar archive that also contains auxiliary files.
  3. workflows.csv.gz contains the metadata for the extracted workflow files.
  4. workflows_auxiliaries.csv.gz is a similar file that also contains metadata for auxiliary files.
  5. repositories.csv.gz contains metadata about the GitHub repositories containing the workflow files. These metadata were extracted using the SEART Search tool. 

The metadata is organized into the following columns:

  1. repository: The repository (author and repository name) from which the workflow was extracted. The separator "/" distinguishes the author from the repository name.
  2. commit_hash: The commit hash returned by git
  3. author_name: The name of the author that changed this file
  4. author_email: The email of the author that changed this file
  5. committer_name: The name of the committer
  6. committer_email: The email of the committer
  7. committed_date: The committed date of the commit
  8. authored_date: The authored date of the commit
  9. file_path: The path to this file in the repository
  10. previous_file_path: The path to this file before it was touched
  11. file_hash: The name of the related workflow file in the dataset
  12. previous_file_hash: The name of the related workflow file in the dataset, before it was touched
  13. git_change_type: A single letter (A, D, M or R) representing the type of change made to the workflow (Added, Deleted, Modified or Renamed). This letter is given by gitpython and provided as is.
  14. valid_yaml: A boolean indicating if the file is a valid YAML file.
  15. probably_workflow: A boolean indicating if the file contains the YAML keys on and jobs. (Note that it can still be an invalid YAML file.)
  16. valid_workflow: A boolean indicating if the file respects the syntax of GitHub Actions workflows. A freely available JSON Schema (used by gigawork) was used for this purpose.
  17. uid: Unique identifier for a given file, surviving modifications and renames. It is generated when the file is added and stays the same until the file is deleted. Renaming does not change the identifier.

Both workflows.csv.gz and workflows_auxiliaries.csv.gz follow this format.
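As an illustration of how these columns fit together, the sketch below groups the metadata rows by uid to rebuild the history of each file. It is a minimal example under stated assumptions: the file is a regular, single-compressed gzip CSV with a header row, committed_date is stored in an ISO 8601 format accepted by datetime.fromisoformat, and the helper names read_rows, histories and paths_over_time are hypothetical.

```python
import csv
import gzip
from collections import defaultdict
from datetime import datetime


def read_rows(path):
    """Yield one dict per metadata row from a gzip-compressed CSV file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


def histories(rows):
    """Group rows by uid and order each group by committed_date.

    Assumption: committed_date parses with datetime.fromisoformat.
    """
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["uid"]].append(row)
    for changes in grouped.values():
        changes.sort(key=lambda r: datetime.fromisoformat(r["committed_date"]))
    return dict(grouped)


def paths_over_time(changes):
    """Return the successive distinct paths of one file, following renames."""
    paths = []
    for row in changes:
        path = row["file_path"]
        if path and (not paths or paths[-1] != path):
            paths.append(path)
    return paths
```

For example, histories(read_rows("workflows.csv.gz")) would return one chronologically ordered list of changes per uid, and paths_over_time applied to one of these lists shows how the file was renamed over its lifetime.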

Files

Files (3.9 GB)

md5:a345384e924083ae4da327e4fdf3af74 (2.5 MB)
md5:b557752520849d50a99d3750a8f62b68 (311.3 MB)
md5:6b1f38ba8327734f8d155f25e3011507 (1.4 GB)
md5:7c9785879c0964200cb7a635314c2c0a (316.5 MB)
md5:f9c89241deee0c43e5b5ecf7ce19e1e9 (1.8 GB)