Published June 7, 2024 | Version v1
Model
Restricted
Less is More: On the Importance of Data Quality for Unit Test Generation
Description
Introduction
This is the replication package for the paper "Less is More: On the Importance of Data Quality for Unit Test Generation".
Task Definition
The underlying task is to generate unit tests for a given focal method; this work explores how noise in test generation datasets affects that task. This replication repository includes the experimental datasets and scripts used to compare test generation and bug detection performance across different datasets on four large language models.
File Information Description
- dataset.zip: This compressed file contains the dataset, which includes a training set folder (i.e., training_dataset) and a validation set folder (i.e., validation_dataset). The training set folder contains both the full dataset (i.e., all_dataset, including all_train.csv and eval_all.csv) and the filtered dataset (i.e., filter_dataset, including filter_train.csv and filter_valid.csv). The validation set folder includes the test generation dataset (i.e., modified_14projects_tests_d4j.csv) and the defect detection dataset (i.e., trigger_bug_fm_all_projects_d4j_processed_final.csv) extracted from the Defects4J benchmark.
- evaluation.zip: This compressed file contains the performance evaluation scripts, which cover CodeBLEU, syntactic correctness rate, compilation passing rate, line coverage, branch coverage, and the number of detected bugs.
- models.zip: This compressed file contains the weights of five LLMs (CodeBERT, CodeT5, CodeGPT, CodeLlama7B, and StarCoder).
- parser.zip: This compressed file contains the tree-sitter parser.
- readme.md: This readme file gives detailed information about the dataset and the steps for running the scripts to conduct the experiments.
- result.zip: This compressed file contains all experimental results.
- script.zip: This compressed file contains the scripts for the proposed automated noise-cleaning framework (CleanTest) and the experiment scripts.
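
As a quick sanity check after extracting dataset.zip, the sketch below reports how many records each CSV file named in the description contains. It is only an illustration: the relative paths in `EXPECTED_CSVS` are inferred from the folder and file names listed above and may not match the actual archive layout, the helper names `count_records` and `summarize` are hypothetical, and the files are assumed to be UTF-8 CSVs with a header row. Consult readme.md for the authoritative structure.

```python
import csv
from pathlib import Path

# Relative paths inferred from the file description (an assumption;
# the real layout inside dataset.zip may differ, see readme.md).
EXPECTED_CSVS = [
    "training_dataset/all_dataset/all_train.csv",
    "training_dataset/all_dataset/eval_all.csv",
    "training_dataset/filter_dataset/filter_train.csv",
    "training_dataset/filter_dataset/filter_valid.csv",
    "validation_dataset/modified_14projects_tests_d4j.csv",
    "validation_dataset/trigger_bug_fm_all_projects_d4j_processed_final.csv",
]


def count_records(csv_path):
    """Count data records in a CSV file, excluding an assumed header row."""
    # Source-code fields can be long, so raise the per-field size limit.
    csv.field_size_limit(10**9)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumed to exist)
        # csv.reader yields one item per record, so quoted fields that
        # span several lines (e.g. method bodies) are counted once.
        return sum(1 for _ in reader)


def summarize(root):
    """Map each expected relative path to its record count, or None if missing."""
    root = Path(root)
    summary = {}
    for rel in EXPECTED_CSVS:
        path = root / rel
        summary[rel] = count_records(path) if path.is_file() else None
    return summary
```

Calling `summarize("dataset")` on the extracted folder returns a dictionary keyed by relative path; a value of `None` indicates that the file was not found at the assumed location.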