TerraDS: A Dataset for Terraform HCL Programs
Description
TerraDS
The TerraDS dataset provides a comprehensive collection of Terraform programs written in the HashiCorp Configuration Language (HCL). As Infrastructure as Code (IaC) gains popularity for managing cloud infrastructure, Terraform has become one of the leading tools due to its declarative nature and widespread adoption. However, a lack of publicly available, large-scale datasets has hindered systematic research on Terraform practices. TerraDS addresses this gap by compiling metadata and source code from 62,406 open-source repositories with valid licenses. This dataset aims to foster research on best practices, vulnerabilities, and improvements in IaC methodologies.
Structure of the Database
The TerraDS dataset is organized into two main components: a SQLite database containing metadata and an archive of source code (~335 MB). The metadata, captured in a structured format, includes information about repositories, modules, and resources:
1. Repository Data:
- Contains 62,406 repositories with fields such as repository name, creation date, star count, and permissive license details.
- Provides cloneable URLs for access and analysis.
- Tracks additional metrics like repository size and the latest commit details.
2. Module Data:
- Consists of 279,344 modules identified within the repositories.
- Each module includes its relative path, referenced providers, and external module calls stored as JSON objects.
3. Resource Data:
- Encompasses 1,773,991 resources, split into managed (1,484,185) and data (289,806) resources.
- Each resource entry details its type, provider, and whether it is managed or read-only.
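The description above gives row counts but not table or column names, so a reasonable first step is to read the schema from the SQLite file itself. The following is a minimal sketch using Python's standard `sqlite3` module. The helper names `describe_schema` and `row_counts` are illustrative, and the database file name in the usage comment is a placeholder rather than the real file name.

```python
import sqlite3


def describe_schema(conn):
    """Map each user table in the database to its list of column names."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in names:
        if table.startswith("sqlite_"):
            continue  # skip SQLite's internal bookkeeping tables
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [column[1] for column in columns]
    return schema


def row_counts(conn):
    """Return the number of rows in every user table."""
    return {
        table: conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        for table in describe_schema(conn)
    }


# Resource counts quoted in the description: managed + data equals the total.
MANAGED_RESOURCES = 1_484_185
DATA_RESOURCES = 289_806
TOTAL_RESOURCES = 1_773_991
assert MANAGED_RESOURCES + DATA_RESOURCES == TOTAL_RESOURCES

# Usage (the file name "terrads.sqlite" is a placeholder, not the real name):
#   conn = sqlite3.connect("file:terrads.sqlite?mode=ro", uri=True)
#   print(describe_schema(conn))
#   print(row_counts(conn))
```

Opening the file read-only (`mode=ro`) avoids accidentally creating an empty database when the path is mistyped. The row counts returned can then be compared against the figures listed above.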
Structure of the Archive
The provided archive contains the source code of the 62,406 repositories, so analyses can work on the actual source rather than on the metadata alone. Researchers can thus access the permissively licensed repositories and conduct studies on the executable HCL code.
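As a starting point for source-level analysis, the sketch below collects Terraform files from an extracted copy of the archive. It assumes the archive has already been unpacked into a local directory and that Terraform sources use the usual `.tf` extension. The internal directory layout is not described here, so grouping by top-level directory is only a guess at a useful first view, and the function names are illustrative.

```python
from collections import Counter
from pathlib import Path


def find_terraform_files(root):
    """Return all *.tf files below root, sorted by path."""
    return sorted(p for p in Path(root).rglob("*.tf") if p.is_file())


def files_per_top_level_entry(root):
    """Count *.tf files grouped by the first path component below root.

    A file located directly in root is counted under its own file name.
    """
    root = Path(root)
    return Counter(
        path.relative_to(root).parts[0] for path in find_terraform_files(root)
    )


# Usage ("extracted_archive" is a placeholder for the unpacked directory):
#   for name, count in files_per_top_level_entry("extracted_archive").most_common(10):
#       print(name, count)
```

Terraform also accepts JSON-syntax configuration in `.tf.json` files. Those are not matched by this pattern and would need a second `rglob` call if they matter for a given study.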
Tools
The "HCL Dataset Tools" file contains a snapshot of the https://github.com/prg-grp/hcl-dataset-tools repository, included for long-term archival. The tools in this repository can be used to reproduce this dataset.
One of the tools, "RepositorySearcher", can also fetch metadata for other GitHub API queries, not only for Terraform code. The remaining tools are specific to Terraform repositories.
Files
Total size: 785.0 MB
Name | Size | Checksum |
---|---|---|
HCL Dataset Tools.zip | 66.2 kB | md5:73bb51903d76149c6fbeafc3cf32d703 |
(name not listed) | 433.7 MB | md5:92e2fffd8208b2a290204555ec500977 |
(name not listed) | 351.2 MB | md5:664716c1dd690b14add572cd5a41f192 |
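After downloading, each file can be checked against the MD5 checksum listed for it. The following is a minimal sketch using Python's standard `hashlib` module. The function names are illustrative, and the local path in the usage comment is an assumption about where the file was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or bare hex."""
    expected_hex = expected.strip().lower().split(":")[-1]
    return md5_of_file(path) == expected_hex


# Usage (the local path is an assumption about where the download was saved):
#   ok = matches_checksum("HCL Dataset Tools.zip",
#                         "md5:73bb51903d76149c6fbeafc3cf32d703")
#   print("checksum OK" if ok else "checksum mismatch")
```

Reading in chunks keeps memory use flat for the larger files. MD5 is adequate here for detecting a corrupted or truncated download, not as a guarantee against deliberate tampering.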