Published March 20, 2023 | Version v1
Dataset Open

PENTACET data - 23 Million Contextual Code Comments and 500,000 SATD comments

  • 1. University of Oulu

Contributors

Project manager:

  • 1. University of Oulu

Description

PENTACET is a large corpus of Curated Contextual Code Comments per Contributor and the most extensive SATD (self-admitted technical debt) dataset. We mined 9,096 open source Java projects totalling 435 million lines of code (LOC). The outcome is a dataset with 23 million code comments, the preceding and succeeding source code context for each comment, and more than 500,000 comments labeled as SATD, including both ‘Easy to Find’ and ‘Hard to Find’ SATD.

Files

soccminer_mined_data_jsons.zip

Files (32.1 GB)

Checksum (MD5)                          Size
md5:533f8e4e70321d2ae12d95600045acce    12.2 GB
md5:d3722af4403ef99132825589dca198f3    12.2 GB
md5:61fe5d205df078ba580a18cb29471bbf    384.3 MB
md5:2552c2532034b7821e9713deb9565415    7.4 GB
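
Because the archives are large, it may be worth checking each download against the MD5 checksums listed above before unpacking it. The following is a minimal sketch using only the Python standard library; the helper names md5_of_file and matches_listed_md5 are illustrative and not part of the dataset, and the listing does not say which file name belongs to which checksum, so the path in the usage comment is a placeholder.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_md5(path, listed):
    """Compare a file's MD5 with a listed value such as 'md5:533f...'.

    The optional 'md5:' prefix and letter case are ignored.
    """
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Hypothetical usage (the local file name is a placeholder):
# ok = matches_listed_md5("downloaded_archive.zip",
#                         "md5:533f8e4e70321d2ae12d95600045acce")
# print("checksum OK" if ok else "checksum mismatch, re-download the file")
```

A mismatch usually points to an interrupted or corrupted download rather than a problem with the data itself.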

Additional details

Funding

Research Council of Finland
Detecting Technical Debt with Natural Language Processing (grant 328058)