Published June 10, 2022 | Version v1.0
Software (Open Access)

Artifact for Article (CI-DD-Perses): Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models

  • University of Houston

Description

This artifact contains the proposed prediction-preserving program reduction framework for CI models, along with the reduced data produced by the Perses and DD algorithms, which support the findings of our paper 'Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models' accepted at MAPS'22. DOI: https://doi.org/10.1145/3520312.3534869

 

Given a set of input programs, the proposed approach reduces each input program using DD/Perses while preserving the CI model's prediction. The main insight is that, by reducing input programs of a target label, we can identify the key input features the CI model relies on for that label. The approach removes irrelevant parts from an input program and keeps the minimal code snippet that the CI model needs to preserve its prediction. DD is syntax-unaware and does not follow the syntax of the programming language during reduction; therefore, an additional validity check on each reduced program is required. Perses, on the other hand, is syntax-guided and follows the syntax of the programming language during reduction. Also, DD reduces token-by-token (or char-by-char) over the deltas of the input program, while Perses reduces node-by-node over the parse tree of the input program. Because Perses uses knowledge of the program syntax to avoid generating syntactically invalid programs, it runs faster than DD and only produces valid programs. As a result, Perses and DD end up with different sets of features for explaining the model's prediction.
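The DD-based reduction loop can be sketched as a classic ddmin pass over the program's tokens, with an oracle that accepts a candidate only if it is still a valid program and the CI model's prediction is unchanged. This is a minimal illustration, not the artifact's actual implementation; `is_valid` and `predict` are hypothetical placeholders for a language parser and the CI model:

```python
def ddmin(tokens, oracle):
    """Minimize a token list while the oracle still accepts it (delta debugging)."""
    n = 2  # number of chunks to split the token list into
    while len(tokens) >= 2:
        chunk = max(1, len(tokens) // n)
        deltas = [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]
        reduced = False
        for i in range(len(deltas)):
            # Candidate = all tokens except the i-th chunk (reduce to complement).
            candidate = [t for j, d in enumerate(deltas) if j != i for t in d]
            if oracle(candidate):
                tokens, n = candidate, max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(tokens):
                break  # already at finest granularity; tokens are 1-minimal-ish
            n = min(len(tokens), 2 * n)  # refine the split and retry
    return tokens


def make_oracle(is_valid, predict, target_label):
    """Hypothetical prediction-preserving oracle: the candidate must still
    parse AND the model must still predict the original target label."""
    return lambda toks: is_valid(toks) and predict(toks) == target_label
```

For example, with an oracle that requires two specific tokens to survive, `ddmin` strips everything else; in the framework, the surviving tokens are the candidate key features for the target label.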

 

The proposed approach is model-agnostic and can be applied to various tasks and programming datasets. To evaluate the approach, we study two well-known code intelligence models (Code2Vec and Code2Seq), a popular code intelligence task (MethodName), and one commonly used programming language dataset (Java-Large) with different types of input programs (Frequent, Rare, Large, Small). We first provide a systematic comparison between syntax-guided program reduction (Perses) and syntax-unaware program reduction (DD) in terms of token reduction, reduction steps, and reduction time. Then, we summarize the extracted input features and their effects on multiple explanations and adversarial attacks.
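The token-reduction metric used in this comparison can be computed as the percentage of tokens removed from the original program. A minimal sketch (the counts below are illustrative, not results from the artifact):

```python
def token_reduction(original_tokens: int, reduced_tokens: int) -> float:
    """Percentage of tokens removed by a reduction pass."""
    if original_tokens <= 0:
        raise ValueError("original program must contain at least one token")
    return 100.0 * (original_tokens - reduced_tokens) / original_tokens

# e.g. a 120-token method reduced to a 30-token snippet:
# token_reduction(120, 30) -> 75.0
```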

Notes

The proposed approach can be extended to other models and datasets by following the steps mentioned in the "code/README.md" file.

Files (43.9 MB)

mdrafiqulrabin/CI-DD-Perses-v1.0.zip (43.9 MB)
md5:8ff06ad6d28b1cf2adb884528c706b14

Additional details

Related works

Is derived from
Conference paper: 10.1145/3520312.3534869 (DOI)
Preprint: https://arxiv.org/abs/2205.14374 (URL)
Is supplement to
Project deliverable: https://github.com/mdrafiqulrabin/CI-DD-Perses/tree/v1.0 (URL)