SQuaD: The Software Quality Dataset - Dataset

Robredo, Mikel; Esposito, Matteo; Taibi, Davide; Peñaloza, Rafael; Lenarduzzi, Valentina

doi:10.5281/zenodo.17566691

Published November 8, 2025 | Version v1

Dataset Open

1. University of Oulu
2. Tampere University
3. University of Milano-Bicocca

This Zenodo record is a landing page for "SQuaD: The Software Quality Dataset", submitted to the MSR 2026 Data and Tool Showcase Track. It provides the link to each of the supplementary materials (see below).

Version: 1.0
DOI: https://doi.org/10.5281/zenodo.17566690
Authors: Mikel Robredo, Matteo Esposito, Davide Taibi, Rafael Peñaloza, Valentina Lenarduzzi
Affiliations: University of Oulu, University of Southern Denmark, University of Milano-Bicocca

Access and Usage

The dataset and all supplementary materials are available through Zenodo and IDA* repositories:

CSV Raw Data (IDA): https://doi.org/10.23729/fd-c528d131-2c8c-3e61-91f1-a075931e73dc
MongoDB BSON (IDA): https://doi.org/10.23729/fd-f9dc7d2c-0465-3991-961f-56128ee518d0
Replication Package (Zenodo): https://doi.org/10.5281/zenodo.17541471

On IDA: IDA (ida.fairdata.fi) is a research data storage service organized by the Finnish Ministry of Education and Culture and produced by CSC – IT Center for Science. The service is intended for storing stable research data, both raw and processed, that is included in research datasets published in the FAIRdata (FAIR: Findable, Accessible, Interoperable, and Reusable) Etsin service. The service is offered free of charge to users affiliated with Finnish universities, polytechnics, and research institutes.

Each link corresponds to a specific data access format, along with replication scripts and diagrams for database structure.

Main abbreviations:
- Static Analysis Tool (SAT): A software static analysis tool is an automated program that examines a software's source code without executing it to find potential bugs, security vulnerabilities, and deviations from coding standards.
- Issue Tracking System (ITS): A software issue tracking system is a tool used to manage and track software bugs, feature requests, and other problems from initial report to final resolution. It acts as a centralized database, allowing teams to create, assign, and monitor issues, ensuring a structured and organized approach to problem-solving and collaboration.

Overview

The Software Quality Dataset (SQuaD) is a multi-dimensional, time-aware collection of software quality metrics extracted from 450 mature open-source projects across diverse ecosystems, including Apache, Mozilla, FFmpeg, and the Linux kernel.

SQuaD integrates nine state-of-the-art Static Analysis Tools (SATs) and combines both product and process metrics to support large-scale empirical research on software quality, maintainability, evolution, and technical debt.

This dataset was submitted in 2025 to the MSR 2026 Data and Tool Showcase Track and is the result of a seven-month, large-scale mining effort.

Dataset Summary

Attribute	Description
Projects analyzed	450 open-source projects
Releases analyzed	63,586 releases/tags
Static Analysis Tools	9 tools (SonarQube, CodeScene, PMD, Understand, CK, JaSoMe, RefactoringMiner, RefactoringMiner++, PyRef)
Unique metrics	725 metrics
Defect tickets	628,178
Commits analyzed	2,622,413
Detected vulnerabilities	1,479 CVEs and 175 CWEs
Average project age	9 years
Average LOC per project	125,500
Average GitHub stars	2,465
Average contributors	104

Data Contents

The dataset includes a variety of entities and metric tables, covering product, process, and vulnerability information.
Each entity corresponds to a CSV table or a MongoDB collection:

Table	Description
PROJECTS	GitHub repository metadata
COMMITS	Commit hash, message, date, author alias
ISSUES	Issue tickets from GitHub, Jira, and Bugzilla
RELEASES	Identifiers of project releases and related commit hashes
GITHUB_METRICS	Stars, contributors, watchers, and project statistics
PRJ_ITS_VLN_LINKAGE	Links between projects, issue trackers, and detected vulnerabilities
CVE / CWE	Official vulnerability and weakness data from NIST and MITRE
PROCESS_METRICS	14 process metrics computed for each release
TOOL tables	Output metrics from each SAT at method, class, file, and project levels

Available Formats

SQuaD is distributed in two complementary formats to facilitate different research and analysis needs:

1. CSV Format

Each entity is provided as a separate CSV file.
Ideal for direct exploration, statistical analysis, and integration into scripts or notebooks.
Mirrors the same relational structure as the MongoDB database.
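
The column names of each table are not listed on this page, so a reasonable first step is to inspect a file's header row. The sketch below shows one way to do that from the shell. `csv_columns` is a hypothetical helper, `COMMITS.csv` is an assumed file name, and the approach assumes a comma-separated header without quoted commas.

```shell
# Print the column names of a CSV file, one per line, prefixed by their position.
# Only the first line is read, so this stays fast on very large files.
csv_columns() {
    awk -F',' 'NR == 1 {
        sub(/\r$/, "")                      # tolerate Windows line endings
        for (i = 1; i <= NF; i++) print i ": " $i
        exit
    }' "$1"
}

# Example (assumed file name):
# csv_columns COMMITS.csv
```

Data rows may contain quoted commas or line breaks (commit messages, for example), so a CSV-aware parser is the safer choice for anything beyond the header.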

2. MongoDB Format

A NoSQL version of the dataset is provided as a compressed BSON dump (Zstandard-compressed).
Can be imported into MongoDB for scalable querying and time-aware analyses.
Recommended for researchers dealing with large-scale data analytics or custom pipelines.

NOTE: The full dataset is approximately 1.9 TB, so ensure sufficient disk space and RAM before extraction and import.
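
Before starting, it may help to confirm that the target filesystem has enough free space. The following is a minimal sketch. `free_gib` and `has_space` are hypothetical helpers, and the threshold in the usage example is an assumption to adjust for your setup.

```shell
# Print the free space, in whole GiB, of the filesystem that holds directory $1.
# df -P gives one POSIX-format line per filesystem; -k reports 1024-byte blocks.
free_gib() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed only if directory $1 has at least $2 GiB free.
has_space() {
    [ "$(free_gib "$1")" -ge "$2" ]
}

# Example (placeholder path and threshold):
# has_space /path/to/target 2000 || echo "Not enough free space" >&2
```

Note that the compressed archive, any intermediate files, and the imported database each need their own space, so the peak requirement can exceed the size of the data itself.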

Step 1 — Decompress the Archive (Zstandard)

The dataset is distributed as a .tar.zst file. To extract it, install Zstandard and decompress as follows:

```
# Install Zstandard (if not already installed)
sudo apt install zstd

# Decompress the archive (this may take several hours)
unzstd SQuaD_MongoDB_Dump.tar.zst

# Extract the BSON dump files
tar -xvf SQuaD_MongoDB_Dump.tar
```
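
The two commands above keep the compressed archive, the intermediate .tar file, and the extracted dump on disk at the same time, because zstd preserves its input by default. Where space is tight, the decompressed stream can be piped straight into tar instead. This is a sketch under that assumption; `extract_tar_zst` is a hypothetical helper and the destination path is a placeholder.

```shell
# Extract a .tar.zst archive ($1) into directory $2 without writing the .tar to disk.
extract_tar_zst() {
    mkdir -p "$2"
    # zstd: -d decompress, -c write to stdout; tar reads the stream from stdin (-f -)
    zstd -dc "$1" | tar -xf - -C "$2"
}

# Example (placeholder destination):
# extract_tar_zst SQuaD_MongoDB_Dump.tar.zst /path/to/extracted
```

Running `zstd -t` on the archive beforehand checks its integrity, at the cost of reading the whole file once more.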

Step 2 — Import into MongoDB

Once extracted, import the collections with mongorestore (part of the MongoDB Database Tools):
```
# Example: restore entire database
mongorestore --db squad_db /path/to/SQuaD_MongoDB_Dump
```
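
Given the size of the dump, it can be useful to restore only the collections a study needs. The sketch below lists the .bson files found in the extracted dump and restores a single one using mongorestore's `--db` and `--collection` options. The helper names and the assumption that each collection is stored as `<name>.bson` are illustrative; check the extracted directory for the actual layout.

```shell
# List collection names under dump directory $1 by stripping the path and the .bson suffix.
list_dump_collections() {
    find "$1" -type f -name '*.bson' | sed -e 's|.*/||' -e 's|\.bson$||' | sort
}

# Restore one .bson file ($2) into database $1, using the file name as the collection name.
restore_collection() {
    mongorestore --db "$1" --collection "$(basename "$2" .bson)" "$2"
}

# Example (placeholder paths; COMMITS.bson is an assumed file name):
# list_dump_collections /path/to/SQuaD_MongoDB_Dump
# restore_collection squad_db /path/to/SQuaD_MongoDB_Dump/COMMITS.bson
```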

Methodology Overview

The dataset construction follows four key stages (illustrated in the paper’s Figure 1):

1. Mining version control data
- Cloned 501 repositories (filtered to 450 active, mature projects).
- Retrieved commits, tags, issues, and metadata from issue tracking systems (ITS) such as GitHub, Jira, and Bugzilla.
2. Mining software quality metrics
- Applied nine SATs in parallel across all releases.
- Extracted metrics at multiple granularity levels (method, class, file, project).
3. Extracting vulnerabilities
- Parsed CVE and CWE references from issue tickets.
- Fetched official vulnerability descriptions via NIST and MITRE APIs.
4. Collecting process metrics
- Computed 14 release-level process metrics (e.g., churn, contributor count, commit density) using GitPython.
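
As an illustration of the vulnerability-extraction stage, the sketch below pulls CVE and CWE identifiers out of free text with a regular expression. It is not the authors' implementation: `extract_vuln_ids` is a hypothetical helper, and the pattern is an assumption based on the public identifier formats (CVE-YYYY-NNNN with four or more trailing digits, CWE-N).

```shell
# Read text on stdin and print each distinct CVE or CWE identifier once, sorted.
# grep -o prints only the matching parts; -E enables extended regular expressions.
extract_vuln_ids() {
    grep -oE '(CVE-[0-9]{4}-[0-9]{4,}|CWE-[0-9]+)' | sort -u
}

# Example:
# printf 'Fix for CVE-2021-44228 (CWE-502)\n' | extract_vuln_ids
```

The match is case-sensitive, so identifiers written in lowercase in a ticket would need `grep -i` plus a normalization step.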

Research Opportunities

SQuaD provides a comprehensive foundation for a variety of software engineering research domains:

- Software evolution and maintainability analysis
- Defect prediction and Just-In-Time learning
- Technical debt and code smell benchmarking
- Refactoring impact analysis
- Software vulnerability detection and risk assessment
- Transformer-based and AI-driven quality modeling

Its combination of product and process metrics supports both statistical and machine learning–based investigations.

Acknowledgments

This work was supported by:

- CSC – IT Center for Science, Finland (Mahti Supercomputer, Allas Cloud Storage, cPouta services)
- FAST Doctoral Research Network, funded by the Finnish Ministry of Education and Culture
- SciTools, for providing academic support and licenses for Understand

Files

Name	Size
zenodo-msr-dataset.zip (md5:529b025b2c09313f35614eaa1fe3d2b0)	2.3 MB
