Importance-driven deep learning system testing

Deep Learning (DL) systems are key enablers for engineering intelligent applications. Nevertheless, using DL systems in safety- and security-critical applications requires the provision of testing evidence for their dependable operation. We introduce DeepImportance, a systematic testing methodology accompanied by an Importance-Driven (IDC) test adequacy criterion for DL systems. Applying IDC enables the establishment of a layer-wise functional understanding of the importance of DL system components and the use of this information to assess the semantic diversity of a test set. Our empirical evaluation on several DL systems and across multiple DL datasets demonstrates the usefulness and effectiveness of DeepImportance.


INTRODUCTION
Recently, Deep Learning (DL) systems have achieved unprecedented progress, commensurate with the cognitive abilities of humans. Despite the manifold potential applications, using DL systems in safety- and security-critical applications requires the provision of assurance evidence for their trustworthy behaviour. From a safety assurance perspective, testing has been among the primary instruments for evaluating the quality properties of software systems. Inspired by traditional software engineering testing paradigms, recent research proposes novel testing techniques and coverage criteria [2][3][4][5][6].

* Work done while at Bogazici University.
ICSE '20 Companion, October 5-11, 2020

In this paper, we introduce DeepImportance, a systematic testing methodology accompanied by an Importance-Driven test adequacy criterion for DL systems based on relevance propagation. By analysing the activity of a DL system and its internal neuron behaviours, DeepImportance develops a layer-wise functional understanding that signifies the contribution of internal neurons to the output through the layers. This contribution enables us to determine the causal relationship between neurons and the DL system's behaviour: more influential neurons have a stronger causal relationship and can explain which high-level features most influence the decision-making. DeepImportance establishes this relationship by computing a decomposition of the decision made by the DL system and iteratively redistributing the relevance in a layer-wise manner, proportional to how prominent each neuron and its connections are. The Importance-Driven adequacy criterion instrumented by DeepImportance measures the adequacy of an input set as the ratio of combinations of important neuron clusters covered by the set.

DEEPIMPORTANCE
DeepImportance enables the systematic testing and evaluation of DL systems. Using a pre-trained DL system, DeepImportance analyses the training set T to establish a fundamental understanding of the overall contribution made by the internal neurons of the DL system. This enables the identification of the most important neurons that are core contributors to the decision-making process. Then, DeepImportance carries out a quantisation step which produces an automatically-determined finite set of clusters of neuron activation values that characterises, to a sufficient level, how the behaviour of the most important neurons changes with respect to inputs from the training set. Finally, DeepImportance uses the produced clusters of the most important neurons to assess the coverage adequacy of the test set. Informally, the Importance-Driven test adequacy criterion of DeepImportance is satisfied when all combinations of important neuron clusters are exercised.

Neuron Importance Analysis: The purpose of importance analysis is to identify neurons within a DL system that are key contributors to decision-making. The activity of some neurons has a greater influence on the capability of the system to make correct decisions. We compute a decomposition of the decision made by the system for an input from the training set T and use layer-wise relevance propagation [1] to traverse the network graph and redistribute the decision value in a layer-wise manner, proportional to the contribution made by each neuron within the layer. Unlike state-of-the-art test adequacy criteria for DL systems, DeepImportance captures the actual contribution made by each neuron to the decision.

Important Neurons Clustering: Having established the important neurons, we are now ready to determine regions within their value domain which are central to the DL system execution.
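As a rough illustration of the relevance-propagation step, the sketch below redistributes an output decision value backwards through a toy fully-connected network using the epsilon-rule, one common variant of layer-wise relevance propagation. All names (`lrp_epsilon`, the toy weights) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def lrp_epsilon(weights, activations, relevance_out, eps=1e-9):
    """Redistribute output relevance layer by layer (LRP epsilon-rule).

    weights[l]     : weight matrix of layer l, shape (n_in, n_out)
    activations[l] : activations entering layer l, shape (n_in,)
    relevance_out  : relevance at the output layer, e.g. the score of
                     the predicted class (zero elsewhere).
    Returns per-neuron relevance vectors, ordered input-to-output.
    """
    relevances = [relevance_out]
    R = relevance_out
    # walk backwards through the layers
    for W, a in zip(reversed(weights), reversed(activations)):
        z = a @ W                    # pre-activations, shape (n_out,)
        z = z + eps * np.sign(z)     # stabiliser avoids division by zero
        s = R / z                    # per-output relevance "rates"
        R = a * (W @ s)              # shape (n_in,): relevance of inputs
        relevances.append(R)
    return list(reversed(relevances))

# toy 2-layer network: 4 inputs -> 3 hidden (ReLU) -> 2 outputs
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
x = rng.normal(size=4)
h = np.maximum(x @ W1, 0)
out = h @ W2

# propagate the predicted class's score back to every neuron
R_out = np.zeros(2)
R_out[out.argmax()] = out[out.argmax()]
R_x, R_h, _ = lrp_epsilon([W1, W2], [x, h], R_out)

# neurons with the largest relevance are the candidate "important" ones
important_hidden = np.argsort(-np.abs(R_h))
```

A useful sanity check on such a sketch is conservation: in the absence of biases, the total relevance at each layer stays (approximately) equal to the decision value being decomposed.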
Since each neuron is responsible for perceiving specific features within the input domain, we argue that, for inputs with similar features, the activation values of those important neurons are concentrated into specific regions within their value domain. Informally, those regions form a pattern that captures the activity of the most influential neurons of the DL system. We reinforce cluster extraction with the Silhouette index, thus supporting the automatic identification of a neuron-specific optimal strategy for clustering the activation values of each important neuron.

Importance-Driven Coverage: Given an input set, we can measure the degree to which it covers the clusters of important neurons, termed Importance-Driven Coverage (IDC). Since important neurons are core contributors to decision-making, it is important to establish that inputs triggering combinations of activation value clusters of those neurons have been covered adequately. Doing this enables testing of the most influential neurons, thus increasing our confidence in the correct operation of the DL system and reducing the risk of wrong decisions. The set of important neuron cluster combinations (INCC) is given by the Cartesian product of the cluster centroids of the important neurons, where the function Centroid(Ψ) measures the "centre of mass" of a cluster Ψ of an important neuron. We define Importance-Driven Coverage to be the ratio of INCC elements covered by the inputs in the set over the size of the INCC set. An INCC element is covered if there exists an input for which, for every important neuron, the Euclidean distance between the neuron's activation value and that element's centroid is minimal compared to the centroids of the neuron's other clusters. Achieving a high IDC score entails a systematically diverse input set that exercises many combinations of important neuron clusters.
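A minimal sketch of the clustering and coverage computation, assuming scikit-learn's KMeans and Silhouette index for the per-neuron quantisation step; the helper names (`cluster_neuron`, `idc`) and the synthetic activation values are illustrative assumptions, not the DeepImportance implementation:

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_neuron(values, k_range=range(2, 6)):
    """Pick the cluster count with the best Silhouette index for one neuron."""
    values = values.reshape(-1, 1)
    best = None
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(values)
        score = silhouette_score(values, km.labels_)
        if best is None or score > best[0]:
            best = (score, km.cluster_centers_.ravel())
    return best[1]  # centroids of this neuron's activation clusters

def idc(train_acts, test_acts):
    """Importance-Driven Coverage (illustrative).

    train_acts, test_acts: arrays of shape (n_inputs, n_important_neurons)
    holding the activation values of the important neurons.
    """
    centroids = [cluster_neuron(train_acts[:, j])
                 for j in range(train_acts.shape[1])]
    # INCC: every combination of one cluster centroid per important neuron
    incc = set(itertools.product(*[range(len(c)) for c in centroids]))
    covered = set()
    for a in test_acts:
        # each neuron's activation selects its nearest centroid (1-D Euclidean)
        combo = tuple(int(np.argmin(np.abs(c - a[j])))
                      for j, c in enumerate(centroids))
        covered.add(combo)
    return len(covered) / len(incc)

rng = np.random.default_rng(1)
train = rng.normal(size=(200, 3))  # stand-in activations, 3 important neurons
test = rng.normal(size=(50, 3))
coverage = idc(train, test)        # fraction of INCC combinations exercised
```

The Cartesian product makes the INCC set grow exponentially with the number of important neurons, which is why restricting attention to a small set of important neurons keeps the criterion tractable.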

EVALUATION
We evaluate DeepImportance on three popular, publicly-available datasets: MNIST, CIFAR-10 and the Udacity self-driving car challenge dataset. For MNIST, we study three DL systems, namely LeNet-1, LeNet-4 and LeNet-5. For CIFAR-10, we employ a 20-layer convolutional neural network. For the Udacity self-driving car challenge, we use the pre-trained Dave-2 self-driving car DL system from Nvidia. We perform a thorough, unbiased evaluation of DeepImportance by comparing it against state-of-the-art coverage criteria for DL systems. For each criterion, we use the hyper-parameters from its original research. Our results show that:
• DeepImportance can detect the most important neurons of a DL system, and those neurons are more sensitive to changes in relevant pixels of a given input (Importance).
• For state-of-the-art coverage criteria, coverage results across all DL systems are lower on the semantically diverse set than on the numerically diverse set, whereas IDC is more sensitive to perturbations of relevant input features. DeepImportance, with its IDC coverage criterion, can support software engineers in creating a diverse test set that comprises semantically different test inputs (Diversity).
• The unmodified test set, sets enhanced with inputs perturbed by white noise, and sets enhanced with adversarial inputs carefully crafted using state-of-the-art adversarial generation techniques yield IDC results that demonstrate the criterion's effectiveness: IDC is sensitive to adversarial inputs and effective in detecting misbehaviours in test sets containing inputs semantically different from those encountered before (Effectiveness).
• IDC shows a similar behaviour to state-of-the-art coverage criteria for DL systems (e.g., neuron coverage [5]). Hence, there is a positive correlation between them (Correlation).
• The chosen target layer affects the result of IDC. Since the penultimate layer is responsible for capturing semantically important high-level features, we argue that it is a suitable choice for assessing the adequacy of a test set using IDC (Layer Sensitivity).

CONCLUSION
Ensuring the trustworthiness of DL systems requires their thorough and systematic testing. DeepImportance is a systematic testing methodology reinforced by an Importance-Driven (IDC) test adequacy criterion for DL systems. DeepImportance analyses the internal neuron behaviour to create a layer-wise functional understanding and automatically establish a finite set of clusters that represent the behaviour of the most important neurons at an adequate level of granularity. The Importance-Driven adequacy criterion measures the adequacy of a test set as the ratio of combinations of important neuron clusters covered by the set. Our experimental evaluation shows that IDC achieves higher coverage for test sets with semantically diverse inputs. IDC is also sensitive to adversarial inputs and, thus, effective in detecting misbehaviour in test sets.