Published June 10, 2022 | Version v1.0
Software (Open Access)

Artifact for Article (CI-DD-Perses): Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models

  • University of Houston

Description

This artifact contains the proposed prediction-preserving program reduction framework for CI models, along with the reduced data produced by the Perses and DD algorithms, which support the findings of our paper 'Syntax-Guided Program Reduction for Understanding Neural Code Intelligence Models' accepted at MAPS'22. DOI: https://doi.org/10.1145/3520312.3534869

 

Given a set of input programs, the proposed approach reduces each input program using DD/Perses while preserving the CI model's prediction. The main insight is that, by reducing input programs of a target label, we can identify the key input features the CI model relies on for that label. The approach removes irrelevant parts from an input program and keeps the minimal code snippet that the CI model needs to preserve its prediction. DD is syntax-unaware and does not follow the syntax of the programming language during reduction; therefore, an additional validity check on each reduced program is required. Perses, on the other hand, is syntax-guided and follows the syntax of the programming language during reduction. Also, DD reduces token-by-token (or char-by-char) over the deltas of the input program, while Perses reduces node-by-node over the parse tree of the input program. Because Perses uses knowledge of the program syntax to avoid generating syntactically invalid programs, it runs faster than DD and only produces valid programs. As a result, Perses and DD end up with different sets of features for explaining the model's prediction.
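The DD-based reduction loop can be sketched as a classic ddmin pass over the program's tokens, with an oracle that accepts a candidate only if it is still a valid program and the CI model's prediction is unchanged. This is a minimal illustration, not the artifact's actual implementation; `is_valid` and `predict` are hypothetical placeholders for a language parser and the CI model:

```python
def ddmin(tokens, oracle):
    """Minimize a token list while the oracle still accepts it (delta debugging)."""
    n = 2  # number of chunks to split the token list into
    while len(tokens) >= 2:
        chunk = max(1, len(tokens) // n)
        deltas = [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]
        reduced = False
        for i in range(len(deltas)):
            # Candidate = all tokens except the i-th chunk (reduce to complement).
            candidate = [t for j, d in enumerate(deltas) if j != i for t in d]
            if oracle(candidate):
                tokens, n = candidate, max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(tokens):
                break  # already at finest granularity; tokens are 1-minimal-ish
            n = min(len(tokens), 2 * n)  # refine the split and retry
    return tokens


def make_oracle(is_valid, predict, target_label):
    """Hypothetical prediction-preserving oracle: the candidate must still
    parse AND the model must still predict the original target label."""
    return lambda toks: is_valid(toks) and predict(toks) == target_label
```

For example, with an oracle that requires two specific tokens to survive, `ddmin` strips everything else; in the framework, the surviving tokens are the candidate key features for the target label.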

 

The proposed approach is model-agnostic and can be applied to various tasks and programming datasets. To evaluate the approach, we study two well-known code intelligence models (Code2Vec and Code2Seq), a popular code intelligence task (MethodName), and one commonly used programming language dataset (Java-Large) with different types of input programs (Frequent, Rare, Large, Small). We first provide a systematic comparison between syntax-guided program reduction (Perses) and syntax-unaware program reduction (DD) in terms of token reduction, reduction steps, and reduction time. Then, we summarize the extracted input features and their effects on multiple explanations and adversarial attacks.
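The token-reduction metric used in this comparison can be computed as the percentage of tokens removed from the original program. A minimal sketch (the counts below are illustrative, not results from the artifact):

```python
def token_reduction(original_tokens: int, reduced_tokens: int) -> float:
    """Percentage of tokens removed by a reduction pass."""
    if original_tokens <= 0:
        raise ValueError("original program must contain at least one token")
    return 100.0 * (original_tokens - reduced_tokens) / original_tokens

# e.g. a 120-token method reduced to a 30-token snippet:
# token_reduction(120, 30) -> 75.0
```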

Notes

The proposed approach can be extended to other models and datasets by following the steps mentioned in the "code/README.md" file.

Files (43.9 MB)

mdrafiqulrabin/CI-DD-Perses-v1.0.zip (43.9 MB)
md5:8ff06ad6d28b1cf2adb884528c706b14

Additional details

Related works

Is derived from
Conference paper: 10.1145/3520312.3534869 (DOI)
Preprint: https://arxiv.org/abs/2205.14374 (URL)
Is supplement to
Project deliverable: https://github.com/mdrafiqulrabin/CI-DD-Perses/tree/v1.0 (URL)