Less is More: On the Importance of Data Quality for Unit Test Generation

Anonymous Author

doi:10.5281/zenodo.11519949

Published June 7, 2024 | Version v1

Model Restricted

Introduction

This is the replication package for the paper "Less is More: On the Importance of Data Quality for Unit Test Generation".

Task Definition

The task is to generate unit tests for a given focal method, and the study explores how noise in test generation datasets affects it. This replication repository includes the experimental datasets and the scripts used to compare test generation and bug detection performance across different datasets on four large language models.

File Information Description

dataset.zip: This compressed file contains the dataset, which includes a training set folder (i.e., training_dataset) and a validation set folder (i.e., validation_dataset). The training set folder contains both the full dataset (i.e., all_dataset, including all_train.csv and eval_all.csv) and the filtered dataset (i.e., filter_dataset, including filter_train.csv and filter_valid.csv). The validation set folder includes the test generation dataset (i.e., modified_14projects_tests_d4j.csv) and the defect detection dataset (i.e., trigger_bug_fm_all_projects_d4j_processed_final.csv) extracted from the Defects4J benchmark.
evaluation.zip: This is a compressed file of performance evaluation scripts, including CodeBLEU, syntactic correctness rate, compilation passing rate, line coverage, branch coverage, and the number of detected bugs.
models.zip: This is a compressed file of the weights of five LLMs (CodeBERT, CodeT5, CodeGPT, CodeLlama7B, and StarCoder).
parser.zip: This is a compressed file of the tree-sitter parser.
readme.md: This is a readme file with detailed information about the dataset and the script steps for conducting the experiment.
result.zip: This is a compressed file of all the experimental results.
script.zip: This is a compressed file containing the scripts for the proposed automated noise-cleaning framework (CleanTest) and the experiment scripts.
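
Because the column layout of the CSV files is not documented on this page, a quick first step after downloading is to look at the headers. The following is a minimal sketch, not part of the replication package: the helper name list_csv_headers is hypothetical, and it assumes dataset.zip sits in the working directory and that the CSV files are UTF-8 encoded. It makes no assumption about the folder paths inside the archive.

```python
import csv
import io
import os
import zipfile


def list_csv_headers(archive_path):
    """Map each CSV member of a zip archive to its header row (column names).

    Hypothetical helper for illustration; UTF-8 encoding is an assumption.
    """
    headers = {}
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip folders and non-CSV members
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                # Read only the first row; an empty file yields an empty list.
                headers[name] = next(csv.reader(text), [])
    return headers


if __name__ == "__main__":
    # Assumes dataset.zip has been downloaded into the working directory.
    if os.path.exists("dataset.zip"):
        for member, columns in list_csv_headers("dataset.zip").items():
            print(member, columns)
```

Only the first row of each file is read, so the check stays fast even for the full training set.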

Files

Restricted

The record is publicly accessible, but files are restricted to users with access.

	All versions	This version
Views	36	36
Downloads	91	91
Data volume	4.9 GB	4.9 GB
