TOGA Artifact

This repository contains the replication artifact for TOGA: A Neural Method for Test Oracle Generation, to appear in ICSE 2022.

Testing is widely recognized as an important stage of the software development lifecycle. Effective software testing can provide benefits such as documentation, bug finding, and preventing regressions. In particular, unit tests document a unit's intended functionality. A test oracle, typically expressed as a condition, documents the intended behavior of the unit under a given test prefix. Synthesizing a functional test oracle is a challenging problem, as it has to capture the intended functionality and not the implemented functionality. In our paper, we propose TOGA (Test Oracle GenerAtion), a unified transformer-based neural approach that infers both exceptional and assertion test oracles based on the context of the focal method.

Our artifact reproduces the results for all RQs in the paper's evaluation. It includes the source code as well as download links for the datasets and models produced in the paper, fulfilling the requirements for the reproduced, reusable, and available badges. We assume basic Unix familiarity and the ability to run Python. The artifact is provided as a Docker image for Linux.

Note: For convenience, we provide a self-contained Docker image that reproduces all results without any setup. We recommend using it to reproduce the results in the paper; see the directions for using the Docker image under Setup.

Organization

The core oracle generation tool is implemented in toga.py in the base directory. Other modules and scripts are organized in the following subdirectories:

  • model/: modules and scripts for modeling. Used directly by toga.py for generating model inputs and running models to infer oracles.
    • model/exception_data.py: Utility methods for generating exception model inputs. The API call exception_data.get_labeled_tests(tests, methods) takes two lists of tests and their associated focal methods as input and returns four lists: test_prefix, methods, labels, idxs, which are respectively: test prefixes with oracles stripped, aligned focal methods, expected exception labels (0 or 1), and the indexes of the labeled examples in the original lists (some tests may be dropped due to parsing errors). Both this API and the assertion API below are illustrated in the usage sketch after this list.
    • model/assertion_data.py: Utility methods for generating assertion model inputs. The API call assertion_data.get_model_inputs(tests, methods, vocab) takes as input two aligned lists of tests and their associated focal methods, along with a vocab dictionary of common values per type that are used to generate candidate assertions, for example: {int: [1, 10,...], float: [0.0, 1.1,...],...}. It returns two lists: the first has the format [(focal_method, test_prefix, candidate_assert, label),...], where each entry pairs a focal method and test prefix with a candidate assertion oracle; the label value is unused during inference but is kept for consistency with the training data format. The second list, idxs, gives the indexes of the prepared inputs in the original lists, since some tests may be dropped due to parsing errors.
    • model/exceptions/: Scripts and code for running the exception model. Use scripts run_train.sh and run_eval.sh to train and run the model.
    • model/assertions/: Scripts and code for running the assertion model. Use scripts run_train.sh and run_eval.sh to train and run the model.
  • scripts/: scripts for generating datasets. See directions under Datasets section for usage to generate each dataset used in our evaluation.
  • eval/: scripts for reproducing the evaluation from the TOGA paper. Note that RQ3 uses the toga.py tool directly. See the directions under Evaluation for usage to replicate each result in the paper.
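
The data-preparation APIs described above can be called directly from Python. The sketch below uses hypothetical test and focal method strings, hypothetical vocab contents, and assumed import paths; the call signatures and return values follow the descriptions in the list above.

from model import exception_data, assertion_data  # import paths assumed; adjust to the repo layout

# Hypothetical (test, focal method) pair; real inputs come from the dataset CSVs.
tests = ["@Test public void test0() { Counter c = new Counter(); c.increment(); }"]
methods = ["public void increment() { this.count += 1; }"]

# Exception model inputs: oracle-stripped test prefixes, aligned focal methods,
# expected-exception labels (0 or 1), and indexes into the original lists.
test_prefixes, focal_methods, labels, idxs = exception_data.get_labeled_tests(tests, methods)

# Assertion model inputs: a vocab of common values per type is used to enumerate
# candidate assertions; the label field is unused at inference time.
vocab = {'int': [0, 1, 10], 'float': [0.0, 1.1]}
candidates, assert_idxs = assertion_data.get_model_inputs(tests, methods, vocab)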

toga.py usage: The toga.py tool takes (focal method, test) pairs as input and generates oracles using the two-step inference procedure described in the paper. toga.py directly uses the modules and scripts under model/.
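
A typical invocation passes an inputs CSV of (focal method, test) pairs and the corresponding metadata CSV (placeholder paths; concrete commands appear under Evaluation below):

python toga.py <inputs.csv> <metadata.csv>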

Datasets

Our approach is trained and evaluated on three datasets. The first, Atlas*, is an adaptation of the Atlas dataset. The preprocessed Atlas* dataset is included in data/atlas_star/.

The second dataset, Methods2Test*, is an adaptation of the Methods2Test dataset. The preprocessed Methods2Test* dataset is included in data/methods2test_star/.

The third dataset is our test dataset generated from EvoSuite tests, located in data/evosuite_tests/. Since this dataset is very large (>400k tests), we provide two smaller samples that can be used to reproduce the bug count and false positive rate results from Table 3 for Our Approach: data/evosuite_reaching_tests/ and data/evosuite_5project_tests/. The reaching_tests sample contains only bug-reaching tests, while the 5project_tests sample contains the tests generated for the same 5 Defects4J projects used in [1]'s evaluation.

Setup

First, pull the docker image:
docker pull edinella/toga-artifact

Connect to it: docker run -i -t edinella/toga-artifact

Then, setup some environment variables:

export PATH=$PATH:/home/defects4j/framework/bin    
export ATLAS_PATH=/home/icse2022_artifact/data/atlas---deep-learning-assert-statements/

Models

The pretrained models are available in:

icse2022_artifact/model/assertions/pretrained/    
icse2022_artifact/model/exceptions/pretrained/

Optionally, we provide scripts for training the models from scratch in model/exceptions/run_train.sh and model/assertions/run_train.sh. Training was performed on a VM with one P100 GPU.
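
For example, to retrain and then evaluate the exception model (the assertion model is analogous; we assume the scripts are run from their own directory, as in the RQ2 instructions below):

cd model/exceptions
bash run_train.sh
bash run_eval.sh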

Evaluation

In our paper, we evaluate three research questions (RQs).

  1. RQ1: Is our grammar representative of most developer-written assertions? To evaluate this research question:
    cd eval/rq1 && python rq1.py

  2. RQ2: Can we infer assertions and exceptional behavior with high accuracy?

Exception Inference:

To reproduce the exception results shown in Table 1 for TOGA Model, run:

cd eval/rq2/exception_inference
bash rq2.sh

This script uses the pretrained exception model to predict whether a test is expected to trigger an exception, and evaluates accuracy and F1 score on the methods2test_star dataset. Note that the model used in the artifact has been retrained, so the results differ slightly from the submission (accuracy is 85% instead of 86%, and the F1 score is 0.40 instead of 0.39).

Assertion Inference: To reproduce the assertion results shown in Table 2 for TOGA Model, run:

cd eval/rq2/assertion_inference
bash rq2.sh

This script uses the pretrained assertion model to predict an assertion given a test prefix and the signature of the method under test. We evaluate accuracy and F1 score on the atlas_star dataset.

  3. RQ3: Can we catch bugs with low false alarms?

To reproduce the results shown in Table 3 for Our Approach, run toga.py on either the bug-reaching inputs (for the bug results) or the 5-project sample (for the false positive rate). By default, the toga tool uses the test metadata labels to evaluate the oracles predicted by the models and prints the results. The tool also generates a predicted_oracles.csv file that can be used to generate executable test suites.

Note that we have improved our implementation since the submission and now find 4 additional bugs (58 total instead of 54) with a lower FP rate (22% instead of 25%).

To facilitate faster evaluation, toga.py automatically checks its predicted oracles against the labels included in its metadata input. This can save time, since generating and executing test suites is potentially very time consuming. Note that toga overestimates the FP rate when checking against labels, so the false positive rate on generated test suites will be lower.

To validate the results, use eval/rq3/rq3.sh to generate and run test suites from the TOGA-generated oracles.
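
For example (a hypothetical invocation; check rq3.sh for the exact arguments it expects, such as the path to the predicted_oracles.csv produced by toga.py):

cd eval/rq3
bash rq3.sh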

To reproduce the Table 3 bug result by running only the bug-reaching tests (this will not reproduce the FP rate, which requires running on all of the tests):

python toga.py data/evosuite_reaching_tests/inputs.csv data/evosuite_reaching_tests/meta.csv

To reproduce the Table 3 false positive rate result on the 5-project sample (2+ hour runtime):

python toga.py data/evosuite_5project_tests/inputs.csv data/evosuite_5project_tests/meta.csv

To reproduce both the bug and FP rate results on the entire dataset (potentially 12+ hour runtime), run on the full dataset:

python toga.py data/evosuite_tests/inputs.csv data/evosuite_tests/meta.csv

References

  1. Tufano, Michele, et al. "Unit Test Case Generation with Transformers." arXiv preprint arXiv:2009.05617 (2020).