Published November 25, 2024 | Version v1
Dataset Open

TerraDS: A Dataset for Terraform HCL Programs

  • 1. University of St. Gallen
  • 2. armasuisse

Description

TerraDS

The TerraDS dataset provides a comprehensive collection of Terraform programs written in the HashiCorp Configuration Language (HCL). As Infrastructure as Code (IaC) gains popularity for managing cloud infrastructure, Terraform has become one of the leading tools due to its declarative nature and widespread adoption. However, a lack of publicly available, large-scale datasets has hindered systematic research on Terraform practices. TerraDS addresses this gap by compiling metadata and source code from 62,406 open-source repositories with valid licenses. This dataset aims to foster research on best practices, vulnerabilities, and improvements in IaC methodologies.

Structure of the Database

The TerraDS dataset is organized into two main components: a SQLite database containing metadata and an archive of source code (~335 MB). The metadata, captured in a structured format, includes information about repositories, modules, and resources:

1. Repository Data:

  • Contains 62,406 repositories with fields such as repository name, creation date, star count, and permissive license details.
  • Provides cloneable URLs for access and analysis.
  • Tracks additional metrics like repository size and the latest commit details.

2. Module Data:

  • Consists of 279,344 modules identified within the repositories.
  • Each module includes its relative path, referenced providers, and external module calls stored as JSON objects.

3. Resource Data:

  • Encompasses 1,773,991 resources, split into managed (1,484,185) and data (289,806) resources.
  • Each resource entry details its type, provider, and whether it is managed or read-only.
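The record does not list the table or column names of the SQLite database, so a reasonable first step is to inspect the schema directly. The following is a minimal sketch using only Python's standard `sqlite3` module. The function name `summarize_database` is hypothetical and not part of the dataset tools, and no table names are assumed: everything is read from `sqlite_master`.

```python
import sqlite3
from pathlib import Path


def summarize_database(db_path):
    """Return {table: {"columns": [(name, declared_type), ...], "rows": n}}.

    Hypothetical helper: it discovers tables at runtime and makes no
    assumption about the actual TerraDS schema.
    """
    # Open read-only through a file URI so the database is never modified
    # and a mistyped path raises an error instead of creating an empty file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' ORDER BY name"
            )
        ]
        summary = {}
        for table in tables:
            # Quote the identifier, doubling any embedded double quotes.
            quoted = '"' + table.replace('"', '""') + '"'
            # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
            columns = [
                (col[1], col[2])
                for col in conn.execute(f"PRAGMA table_info({quoted})")
            ]
            rows = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
            summary[table] = {"columns": columns, "rows": rows}
        return summary
    finally:
        conn.close()
```

Calling `summarize_database` on the downloaded database file should make it possible to compare the row counts against the figures stated above (62,406 repositories, 279,344 modules, 1,773,991 resources), assuming each entity is stored in its own table.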

Structure of the Archive

The provided archive contains the source code of the 62,406 repositories, allowing analysis of the actual source rather than the metadata alone. Researchers can thus access the permissively licensed repositories and conduct studies on the executable HCL code.
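As an illustration of source-level analysis, the sketch below counts Terraform files in an extracted copy of the archive. It assumes the archive has already been unpacked to a local directory and that the programs use the standard `.tf` file extension. The internal directory layout of the archive is not described here, so grouping by top-level directory is only an assumption, and `count_terraform_files` is a hypothetical name.

```python
from collections import Counter
from pathlib import Path


def count_terraform_files(root):
    """Count *.tf files under root, grouped by top-level directory.

    Files that sit directly in root are grouped under the key ".".
    The grouping is an assumption about how the archive is organized.
    """
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*.tf"):
        if not path.is_file():
            continue  # skip directories that happen to end in .tf
        parts = path.relative_to(root).parts
        key = parts[0] if len(parts) > 1 else "."
        counts[key] += 1
    return counts
```

The resulting `Counter` can be passed to `most_common()` to find the directories with the most Terraform files.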

Tools

The "HCL Dataset Tools" file contains a snapshot of the https://github.com/prg-grp/hcl-dataset-tools repository, included for long-term archival. The tools in this repository can be used to reproduce this dataset.

One of the tools, "RepositorySearcher", can also be used to fetch metadata for other GitHub API queries, not only those targeting Terraform code. The remaining tools are focused on Terraform repositories.

Files

HCL Dataset Tools.zip

Files (785.0 MB)

Name Size
md5:73bb51903d76149c6fbeafc3cf32d703 66.2 kB
md5:92e2fffd8208b2a290204555ec500977 433.7 MB
md5:664716c1dd690b14add572cd5a41f192 351.2 MB
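The MD5 checksums listed for the files can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module. The function names are hypothetical, and the local file path is whatever name the file was saved under, since this listing does not pair checksums with file names.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum(local_path, "md5:92e2fffd8208b2a290204555ec500977")` should return `True` for an intact copy of the 433.7 MB file, where `local_path` is a placeholder for the downloaded file's location.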