Distribution Models for Falsification and Verification of DNNs

DNN validation and verification approaches that are input-distribution agnostic waste effort on irrelevant inputs and report false property violations. Drawing on the large body of work on model-based validation and verification of traditional systems, we introduce the first approach that leverages environmental models to focus DNN falsification and verification on the relevant input space. Our approach, DFV, automatically builds an input distribution model using unsupervised learning, prefixes that model to the DNN to force all inputs to come from the learned distribution, and reformulates the property to the input space of the distribution model. This transformed verification problem allows existing DNN falsification and verification tools to target the input distribution, avoiding consideration of infeasible inputs. Our study of DFV with 7 falsification and verification tools, two DNNs defined over different data sets, and 93 distinct distribution models provides clear evidence that the counter-examples found by the tools are much more representative of the data distribution, and it shows how the performance of DFV varies across domains, models, and tools.


I. INTRODUCTION
A deep neural network (DNN) is trained to accurately approximate a partial target function, f : R^n → R^m. The domain of definition of f, referred to as the data distribution D, is typically an infinitesimal portion of the full domain, |D|/|R^n| ≈ 0. However, much of the recent literature on validation and verification of DNNs ignores the partiality of a DNN's definition, with significant negative consequences. First, existing test generation techniques [1], [2], [3], [4], [5] have been shown to produce a majority of inputs that lie off of the data distribution [6], [7]. Second, white-box DNN coverage criteria [8], [2] do not take the distribution into account, which can drive coverage-directed test generators off the distribution [6] and give misleading reports of the coverage achieved [7]. Third, faults that are detected for off-distribution inputs constitute false reports [6], [7], which can lead to wasted effort in fault triage, localization, and fixing.
Whereas recent research has begun to explore how to leverage models of the data distribution for testing [9], [7], in this paper we present the first approach to use such models to support techniques for DNN verification and falsification, a form of property-driven validation [10]. Our distribution-based falsification and verification (DFV) approach for DNNs draws inspiration from the large body of research exploiting environmental models of the feasible input domain for software systems to focus Verification and Validation (V&V). These models are typically built from the system requirements and can be expressed in a variety of forms, e.g., simulations [11], state machines [12], or logical specifications [13]. Such environment models [14] have become an essential component of validation and verification approaches for software systems [15], [16], [17], [18], [19], and this has led them to be adopted in several domains [20].
To be amenable to V&V, environment models must satisfy three requirements. First, they must be accurate in defining the set of feasible inputs. For example, an underapproximating analysis, e.g., [18], requires an underapproximating model to guarantee feasible counter-examples; dually, an overapproximating analysis requires an overapproximating environment model. Second, they must be generative, providing the ability to be executed, interpreted, or solved, so they can be leveraged to generate feasible inputs, e.g., feasible counter-examples when verifiers or falsifiers detect property violations [21]. Third, for verification they must be amenable to a constraint-based encoding in a form that can be leveraged by the verification algorithm. For example, for an SMT-based verification method, e.g., [18], an environment model must be convertible to logical formulae in a supported theory; for abstract interpretation, e.g., [22], an environment model must be convertible to supported abstract domains.
In this paper, we adapt the concept of an environment model to support existing DNN verification and falsification techniques [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [10]. To do this, we must address the challenge that it is intractable, in general, to specify an accurate model of the feasible inputs for a complex DNN, like one that processes images captured by a forward-facing camera (see Fig. 3b [39]). A key insight of this work is that we can leverage the rich body of research that the machine learning (ML) community has developed for learning generative models of the data distribution, which we use as environment models.
DFV transforms a DNN and a correctness property into a falsification or verification problem focused on the data distribution in three steps. First, a generative model of the data distribution for a DNN is trained [40], [41]. Unlike manually developed environment models for traditional software V&V, these environment models are constructed automatically through an unsupervised training process. The design and training of the model of the data distribution can leverage ML best-practices to produce a suitably accurate model [42], [43]. Second, the approach modifies the original DNN to use the appropriate component of the trained generative model, e.g., the decoder of a variational autoencoder (VAE), as a set of prefix layers to the DNN under analysis. This forces all inputs to the DNN to come from the learned data distribution. Third, the approach reformulates the correctness property over the input space of the generative model. DFV supports the reporting of feasible on-distribution counter-examples, when property violations are detected, and reporting that specified subsets of the data distribution are free of violations, when verifiers are able to discharge such proofs.
We evaluate DFV on DNNs trained to recognize images of clothing [44] and to control a drone from image data [39], for a range of challenging correctness properties. We find clear evidence that DFV enables existing falsification and verification techniques to produce counter-examples that are much more representative of the data distribution than those computed otherwise, both visually and in terms of standard measures of similarity. While scaling verification techniques is challenging, we also find evidence that distribution models can enable them to prove properties over the data distribution. Building on these promising findings, we study how varying the architecture of the model and shifting between different families of generative models impact the effectiveness of the technique. Our results can be used to guide the development of models to support DFV.
The primary contributions of this work are the: (1) formulation of the first model-based verification and falsification method for DNNs; (2) demonstration that distribution models yield substantially better counter-examples from verification and falsification; and (3) exploration of different models of the data distribution and their trade-offs.

II. BACKGROUND
In this section we provide background on deep neural networks, DNN verification and falsification, and DNN testing approaches that exploit the data distribution.

A. Deep Neural Networks
A deep neural network, N, is a type of machine learning model that is trained to approximate a partial target function f : R^n → R^m. For example, f may classify some n-dimensional input (e.g., an image) as one of m possible classes (e.g., a digit in the range 0 to 9). f is partial in the sense that it is trained to generalize to a target data distribution, D ⊆ R^n. For inputs off of the distribution, x ̸∼ D, the behavior of N(x) should be considered undefined.
DNNs are composed of layers, l_0, ..., l_k, each of which performs some computation on its input (e.g., matrix multiplication or convolution). A typical linear architecture defines a DNN as the composition of layers, N = l_k ∘ ··· ∘ l_1 ∘ l_0. Layers are comprised of neurons. The input of a neuron is defined as the weighted sum of the outputs of a set of neurons in a preceding layer, where the connections between neurons have trainable parameters. The output of a neuron applies a non-linear activation function to the input.

Fig. 1: Generative latent distribution models that produce unseen samples from the data distribution. (a) VAE with encoder E trained to learn the parameters of the latent distribution and decoder D trained to learn the likelihood of an input given values in the latent space. (b) GAN with trainable input generator G and discriminator D that predicts the probability that an input is from the true data distribution.
Training involves initializing the parameters, applying N to samples, (x, y), from the training set, T, and repeatedly updating the parameters based on a loss over N(x) − y. While the goal of training is to learn the partial function f defined over D, D is generally unmanageably large (e.g., the set of road images visible to a forward-facing camera). Consequently, the training set is defined as a representative sample of the data distribution, T ∼ D. A well-trained network is said to generalize to the data distribution [45].
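To make the training step concrete, the following minimal sketch shows one form such a loop can take; the tiny network, random data, and hyper-parameters are illustrative placeholders, not the models studied in this paper.

```python
import torch
import torch.nn as nn

# Placeholder network N and training set T; any (x, y) pairs work.
N = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
T = [(torch.rand(784), torch.randint(10, (1,)).item()) for _ in range(256)]

opt = torch.optim.SGD(N.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    for x, y in T:
        opt.zero_grad()
        # Loss measures the discrepancy between N(x) and the label y.
        loss = loss_fn(N(x).unsqueeze(0), torch.tensor([y]))
        loss.backward()
        opt.step()  # update the trainable parameters to reduce the loss
```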

B. Models of the Data Distribution
The field of machine learning has long understood the importance of modeling the data distribution. Broadly speaking, the field has developed two types of approaches. Out-of-distribution detectors [46] are designed to determine whether a data point lies on the data distribution, but are generally unable to generate new data from the distribution. In contrast, generative models are designed to generate unseen samples from the data distribution. There are three broad classes of generative models: variational autoencoders (VAE) [40], generative adversarial networks (GAN) [41], and autoregressive models such as PixelCNN++ [47]. Among these, VAEs and GANs can be classified as latent variable models since they make explicit the mathematical structure of the learned latent space which models D. We leverage generative latent variable models of the data distribution in this work.

Fig. 1a depicts a VAE as comprised of a pair of trainable models: an encoder, E, and a decoder, D [40]. The encoder, or inference network, is trained to learn the parameters of the latent distribution, q(z | x), which, through a regularization term, seeks to match a given prior distribution, usually a multivariate Gaussian, N(0, 1)^d. The decoder, or generative network, is trained to learn the likelihood of an input given values in the latent space, p(x | z). These networks are trained together on inputs drawn from the data distribution, D, by minimizing the difference between the posterior and the latent prior and maximizing the likelihood estimation of the input. A VAE is generative in the sense that one can sample from the latent space, z ∼ N(0, 1)^d, and then run the decoder, D(z) = x̂, to produce a sample that lies on the data distribution. VAEs can be leveraged for out-of-distribution detection by exploiting the fact that for x ∼ D, E(x) produces a distribution that can be sampled to generate reconstructed inputs, x̂. Computing ‖x − x̂‖ for a number of samples yields the encoder-stochastic reconstruction error (ESRE) [48]. We adapt ESRE to use the structural similarity index measure (SSIM) [49] to assess the quality of generated image data in §IV.

Fig. 1b depicts a GAN as comprised of a pair of trainable networks: a generator, G : R^d → R^n, and a discriminator, D : R^n → R [41]. The generator produces an input, x̂, from a set of latent variables. The discriminator predicts the probability that an input is from the true data distribution, p(x ∼ D). The GAN is trained by presenting generated inputs, x̂, and training inputs, x, to the discriminator without disclosing their source. The generator loss is high when generated data is classified as generated by the discriminator, i.e., p(x̂ ∼ D) is low. The discriminator loss is high when it classifies data incorrectly, i.e., p(x̂ ∼ D) is high or p(x ∼ D) is low, and low when it is correct. The weights of the generator and discriminator are updated to decrease their respective loss values. Through this process, the generator learns to produce data close to the data distribution.
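As a concrete sketch of the structure just described, the toy fully-connected VAE below exposes the two halves we rely on: the encoder producing the parameters of q(z | x) and the decoder mapping latent draws to samples. It is illustrative only, not one of the VAEs evaluated in §IV.

```python
import torch
import torch.nn as nn

d = 8  # latent dimension (illustrative)

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder E: outputs mean and log-variance of q(z | x).
        self.enc = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * d))
        # Decoder D: maps latent values to inputs, modeling p(x | z).
        self.dec = nn.Sequential(nn.Linear(d, 256), nn.ReLU(),
                                 nn.Linear(256, 784), nn.Sigmoid())

    def encode(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        return mu, logvar

    def decode(self, z):
        return self.dec(z)

vae = VAE()  # assume trained on samples from D

# Generation: draw from the prior N(0, 1)^d and decode onto the distribution.
z = torch.randn(1, d)
x_hat = vae.decode(z)
```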
There is a rich literature on the design of VAEs and GANs exploring the impact of latent dimension, complexity of model architectures, and variation in loss functions on the accuracy of the learned model. Generally GANs are thought to possess better precision, i.e., produce sharper images, but suffer from poor recall, whereas VAEs are thought to be the opposite, i.e., good recall, but produce blurry images. Leveraging distribution models for V&V requires a measure of both, but the ML literature continues to improve in this regard. For example, VAEs can now achieve precision that outperforms well-tuned GANs while retaining good recall [50].

C. DNN Verification and Falsification
A correctness problem is a pair, ψ = ⟨N, φ⟩, of a DNN, N : R^n → R^m, and a property specification, φ, formed to determine whether N |= φ is valid or invalid. The property specification defines a set of constraints over the inputs, φ_X (the pre-condition), and a set of constraints over the outputs, φ_Y (the post-condition). Verification seeks to prove or falsify: ∀x ∈ R^n : φ_X(x) → φ_Y(N(x)). Falsification seeks only to falsify that formula.
Two common types of DNN properties are robustness and reachability. Robustness originated with the study of adversarial examples [51], [52], and specifies that inputs from a given input region are all classified the same. This type of property is common for evaluating verifiers [53], [26], [36], [31]. Reachability properties define the post-condition using constraints over output values, specifying that inputs from a given input region reach outputs within a given safe output region. This type of property has been used to evaluate several DNN verifiers [53], [35], [54].
Complementary to verification, falsification checks properties of DNNs by attempting to find examples that violate the specification for a given model. Two categories of techniques that have been developed for falsifying DNN correctness problems are adversarial attacks and fuzzing. Adversarial attacks are methods optimized for detecting violations of robustness properties [52]. Fuzzing methods randomly generate inputs within a given input region and check whether the outputs they produce violate the post-condition; fuzzing techniques include TensorFuzz [3] and DeepHunter [4]. More recently, the applicability of adversarial attacks and fuzzing has been extended to general correctness properties by DNNF, which reduces DNN correctness properties to robustness properties [10], [56].
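A minimal sketch of the fuzzing idea follows; the property callables and region bounds are assumed placeholders rather than any tool's API.

```python
import torch

def fuzz(N, phi_X, phi_Y, lo, hi, trials=10_000):
    """Randomly sample the input region [lo, hi]; return a counter-example
    x satisfying phi_X(x) but violating phi_Y(N(x)), or None."""
    for _ in range(trials):
        x = lo + (hi - lo) * torch.rand_like(lo)  # uniform draw in the region
        if phi_X(x) and not phi_Y(N(x)):
            return x  # feasible input whose output violates the post-condition
    return None
```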

D. Distribution-aware DNN Testing
Recent work in testing has begun to explore models of the input domain to support DNN testing. Riccio and Tonella construct explicit models of the input space using domain knowledge to generate test cases in order to characterize the space of DNN misbehaviors [57]. Dola et al. show that not considering the input distribution can bias the assessment of DNN testing techniques, and then use VAEs to model the input distribution and to augment the objective functions of test generation techniques to remedy that bias [7]. Byun and Rayadurgam also describe how to model the input distribution with a VAE and use it to generate test inputs [58]. Our approach differs in that it composes the distribution model, e.g., a VAE decoder, with the DNN under analysis and, because we focus on verification and falsification, develops constraint-based encodings that over-approximate designated portions of the input space of that model.

III. APPROACH
We present DFV, our approach for focusing DNN verification and falsification techniques on the data distribution. Our insight is that properties should be verified not over the entire input domain, R^n, of a DNN, but rather over the data distribution, D, used to train the DNN, i.e., its domain of definition. Since D is a small subset of the domain, there is the potential to enable property verification when violating inputs lie off of the data distribution.

A. DFV Overview
Fig. 2: M is the generative model and N is the network, with pre-condition φ_X and post-condition φ_Y.
With M, analyses can be formulated over the low-dimensional latent space, Z ≈ N(0, 1)^d, to reason about the behavior of systems on D. Two classes of such models that we explore, introduced earlier in §II, are variational autoencoders (VAE) and generative adversarial networks (GAN).
Our goal is to define the set of inputs that lie on the data distribution and that satisfy the property pre-condition, i.e., x ∼ D ∧ φ_X(x). Fig. 2a depicts the enforcement of these constraints using two mechanisms: the use of M as a prefix network, and the forwarding of generated inputs to enable the enforcement of the pre-condition prior to checking post-conditions, as described in [10]. We note that it is also possible to combine M and φ_X by defining a generative model capable only of producing inputs satisfying the pre-condition, in which case the forwarding of generated inputs is not needed. With these elements the verification problem can be reformulated as ∀z ∈ Z : φ_X(M(z)) → φ_Y(N(M(z))).
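Operationally, prefixing M onto N is function composition; a minimal sketch, with names of our choosing:

```python
import torch.nn as nn

class DFVNetwork(nn.Module):
    """N' = N ∘ M: every input analyzed on N is first generated by M."""
    def __init__(self, M, N):
        super().__init__()
        self.M, self.N = M, N

    def forward(self, z):
        x = self.M(z)     # map latent variables onto the learned distribution
        return self.N(x)  # the DNN only ever sees generated (feasible) inputs
```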
In this paper, we explore in detail an instantiation of DFV, depicted in Fig. 2b, that uses a VAE as the generative model and targets output properties of the DNN, i.e., where the precondition is true. As explained in §II, the VAE's encoder and decoder are trained together, but we only exploit the decoder, M = D, for DFV.
Since the dimension of the latent space of the VAE is generally much smaller than the input dimension, the generated set of inputs, {M(z) : z ∈ Z}, takes up an infinitesimal portion of the ambient input space. For falsification, this allows the problem to be reformulated from the input space, ∃x ∼ D : ¬φ_Y(N(x)), to the latent space, ∃z ∈ Z : ¬φ_Y(N(M(z))). More importantly, since existing falsification algorithms cannot test that x ∼ D, this approach is the first to yield counter-examples that lie on the data distribution, subject to the precision of M in modeling D. Similarly, verification can be reformulated to the latent space, ∀z ∈ Z : φ_Y(N(M(z))).
Existing verification and falsification tools require that their input space be bounded. When using M, that input space is the latent space of the distribution model, and its Gaussian structure allows us to formulate a meaningful bound. For a d-dimensional latent space, the l_2 d-ball of radius c, as depicted in Fig. 2c, contains all points within c standard deviations of the mean. The value of c can be specified to contain an arbitrarily large portion of the distribution, e.g., c = 5 specifies verification of 99.99994% of the distribution. However, existing DNN falsifiers and verifiers do not support the non-linear constraints necessary for defining the l_2 d-ball, so we formulate hypercube approximations.

As depicted in Fig. 2c, the l_∞ d-ball is the smallest hypercube that overapproximates the l_2 d-ball. We denote by Z_c the hypercube with radius c (side-length 2c). Formulating constraints that restrict each of the d dimensions to the interval [−c, c] yields a verification problem, ∀z ∈ Z_c : φ_Y(N(M(z))), that soundly verifies c standard deviations in Z.
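The transformed falsification problem can then be attacked directly in the latent space. Below is a sketch of a projected-gradient search over Z_c, assuming NM is the composed network from the sketch above and violation_loss is a differentiable surrogate that decreases as the output approaches a violation of φ_Y; it illustrates the setup, not any specific tool's algorithm.

```python
import torch

def falsify_latent(NM, violation_loss, c=3.0, d=8, steps=200, lr=0.1):
    """Search for z in the hypercube Z_c = [-c, c]^d whose generated input
    drives N toward violating the post-condition."""
    z = torch.empty(1, d).uniform_(-c, c).requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = violation_loss(NM(z))
        loss.backward()
        opt.step()
        with torch.no_grad():
            z.clamp_(-c, c)  # project back onto the l-infinity hypercube Z_c
    return z.detach()
```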

B. DFV Algorithm
Algorithm 1 defines DFV through a series of transformations followed by the invocation of the verifier or falsifier.
The DFV algorithm accepts a correctness problem, comprised of a DNN, N, and a correctness property, φ. In addition, it takes a model, M, of the data distribution, D, a verifier or falsifier, V, and a radius, c, which defines how much of the data distribution should be subjected to analysis. Its output indicates that either a property violation has been detected, the property is valid within radius c (for verifiers), or the result is unknown, due to limitations in the verifier or falsifier used. When a violation is reported, a counter-example, ce, is returned as well. We now describe the algorithm in more detail.
Decoupling the Training of the Distribution Model from DFV. Algorithm 1 consumes M, separating the training of M from DFV. This is important because there are many degrees of freedom when training M, often dependent on the type of model and training used. For example, for a VAE, such as those used in §IV, the effectiveness of the learned model depends on many factors, including the architecture of the model (e.g., number, type, and configuration of layers) and the training parameters (e.g., optimizer, batch size, and learning rate). Independent of the model and training process, the goal of this pre-stage to DFV is for M to approximate D. We note that the development of distribution models that have high precision and recall is an active area of ML research [42], [43], and that recent research has defined high-precision VAEs [50]. We note, however, that models with lower levels of precision can still be quite valuable. As we show in §IV, rather simple VAE models can yield much more meaningful counter-examples than those produced without using a distribution model. In this work we explore VAE variations, but we leave to future work a broader study of how model accuracy impacts the cost and benefit of DFV.
Problem Transformation. DFV transforms the correctness problem as described in lines 2-5 of Algorithm 1. Line 2 modifies the original DNN by prefixing it with a latent variable generative model, such as the decoder of a VAE, as shown in Fig. 2a. Because the generative model maps inputs from a known distribution to the learned data distribution, this step ensures that verification and falsification will only check inputs from the data distribution learned by the prefixed model; that is, the tools can focus on the inputs that are within the distribution. Line 3 replaces the input pre-condition of the original property with a new pre-condition specifying that inputs come from the latent space of the generative model (M.d denotes its dimension) and that the generated inputs satisfy the original pre-condition. Because the verifiers and falsifiers require inputs to be bounded, we assert bounds on the latent space. When the latent space distribution of the generative model is Gaussian, we require z to be within c standard deviations of the mean, which we approximate with a hypercube of radius c centered at the origin. Line 5 joins the outcomes of the transformations from lines 2 and 3 to redefine the correctness problem on the data distribution.
Verification and Falsification. After transforming the problem, on lines 6-9, falsifiers and verifiers can be run on the modified correctness problem, ψ' = ⟨N', φ'⟩. If a counter-example to ψ' is found, then it can be mapped to a valid counter-example of the original property by performing inference with the generative model, M. DFV then reports violation along with the counter-example; otherwise it reports the valid or unknown result returned by V.
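Putting the pieces together, the following paraphrase of Algorithm 1 covers the output-property instantiation (pre-condition true); the helper names and the verifier interface V(psi) -> (result, z) are our assumptions, not a tool API.

```python
def compose(N, M):
    """Line 2: prefix the DNN with the generative model, N' = N ∘ M."""
    return lambda z: N(M(z))

def hypercube(d, c):
    """Line 3: new pre-condition bounding each latent dimension to [-c, c]."""
    return [(-c, c)] * d

def dfv(N, phi_Y, M, V, c):
    """Transform the problem, invoke the verifier or falsifier V, and map
    any latent counter-example back through M."""
    N_prime = compose(N, M)
    phi_X_prime = hypercube(M.d, c)            # M.d: latent dimension of M
    psi_prime = (N_prime, phi_X_prime, phi_Y)  # line 5: transformed problem
    result, z = V(psi_prime)                   # violation, valid, or unknown
    if result == "violation":
        return result, M(z)                    # feasible counter-example M(z)
    return result, None
```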

IV. STUDY
In this section we assess the cost-effectiveness and scalability of DFV by applying it in conjunction with multiple falsifiers and verifiers. Our evaluation answers the following research questions: 1) What is the cost-effectiveness of applying falsification and verification with DFV? 2) How does the configuration of the model used by DFV affect the quality and quantity of counter-examples? 3) How well does DFV scale to more complex input domains that require more sophisticated models?

A. Design
We now describe the problem benchmarks, generative models, falsifiers, verifiers, and metrics that constitute the three experiments in our study. We describe the experimental procedures using these items under each research question.
1) Problem Benchmarks: Our criteria for selecting problem benchmarks required that they have models developed by others that offer a range of challenges in terms of architectural complexity, domains and tasks, and training data. They also had to have global reachability properties that are supported by DFV, and be amenable to the application of existing tool sets. In the end, we identified two benchmarks of correctness problems that fit our criteria.
The GHPR-FMNIST benchmark is a new DNN correctness problem benchmark based on the GHPR-MNIST benchmark from the evaluation of DNNF [10]. The benchmark consists of 20 global reachability properties applied to a small FashionMNIST [44] network. A sample of images from the FashionMNIST training set is shown in Fig. 3a. The network used is based on the architecture of the small MNIST network from the evaluation of the Neurify verifier [53]. There are 2 formulations of properties in this benchmark. The first 10 properties, which we refer to as type A, are of the form: for all inputs, if class a has the maximal value, then the output values for classes a and b are closer to one another than the output values for classes a and c. The remaining 10 properties follow the second formulation, which we refer to as type B.

The GHPR-DroNet benchmark applies 10 properties to DroNet [39], which predicts a steering angle and probability of collision for a quadrotor from 200 by 200 black and white images. DroNet is a large DNN model consisting of 3 residual blocks and over 475,000 neurons. A sample of images from the DroNet training set is shown in Fig. 3b. The properties are of the form: for all inputs, if the probability of collision is between p_min and p_max, then the steering angle is within d degrees of 0. As p_min increases, so does d, capturing the intuition that if the probability of collision is low, then the quadrotor should not make sharp turns.
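To illustrate the two property shapes, here are hedged encodings as violation checks; the class indices, probability bounds, and angle thresholds are placeholders supplied by the benchmark.

```python
import numpy as np

def violates_type_a(y, a, b, c):
    """GHPR-FMNIST type A: if class a is maximal, the outputs for a and b
    must be closer to one another than the outputs for a and c."""
    return int(np.argmax(y)) == a and abs(y[a] - y[b]) >= abs(y[a] - y[c])

def violates_dronet(steer_deg, p_coll, p_min, p_max, d_deg):
    """GHPR-DroNet: if the collision probability lies in [p_min, p_max],
    the steering angle must be within d_deg degrees of 0."""
    return p_min <= p_coll <= p_max and abs(steer_deg) > d_deg
```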
2) Generative Models: We consider two powerful types of latent variable generative models to learn the data distribution of the training set: VAEs and GANs. We selected these models because: 1) they meet the requirements of the approach, 2) they are among the most popular unsupervised learning approaches to encode a data distribution, and 3) they work in different ways and provide different tradeoffs. Given the number of variables involved in our experiments, we chose VAEs for RQ1 and RQ2, and incorporated GANs for RQ3. Throughout the study we explored a total of 93 models, 91 to characterize the distribution of GHPR-FMNIST and 2 to characterize the distribution of GHPR-DroNet. All models used in the study are shown in Table I. Details of the configuration of those models are provided in the experimental procedures for each of the research questions.
3) Falsifiers and Verifiers: For the falsifiers, we use four common adversarial techniques included in the DNNF tool [10]. DNNF reduces correctness problems to adversarial robustness problems to allow them to be falsified by off-the-shelf adversarial attacks. We chose FGSM [59], the Basic Iterative Method (BIM) [60], DeepFool [61], and Projected Gradient Descent (PGD) [62], as they were the top-performing falsifiers in the DNNF study. We use the same parameters for each adversarial attack method as used in that study.
We also use three top-performing DNN verifiers: Neurify [53], VeriNet [63], and nnenum [64], which are all supported by DNNV [38] and have performed well in recent benchmarks [65], [66]. In addition, each of these verifiers has the ability to return counter-examples.

4) Metrics:
For each run of the falsifiers and verifiers we report the number of counter-examples found and the time to find each counter-example. To judge the quality of each counter-example we compute the mean reconstruction similarity (MRS) which, as discussed in §II, adapts ESRE to use the SSIM metric. Given a reference VAE, V, MRS computes, for a given input, x, the expected similarity between x and a set of reconstructed inputs, x̂. In this work, we estimate the mean using a sample size, N, of 100 reconstructions.
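A sketch of how MRS can be computed follows; it assumes a reference VAE whose encode returns the posterior parameters (mu, logvar) and whose decode produces images shaped like x, with x a 2-D image tensor in [0, 1].

```python
import numpy as np
import torch
from skimage.metrics import structural_similarity as ssim

def mrs(vae, x, n=100):
    """Mean reconstruction similarity: average SSIM between x and n
    stochastic reconstructions drawn via the reference VAE's encoder."""
    mu, logvar = vae.encode(x)
    std = (0.5 * logvar).exp()
    sims = []
    for _ in range(n):
        z = mu + std * torch.randn_like(std)  # sample z ~ q(z | x), as in ESRE
        x_hat = vae.decode(z).detach()
        sims.append(ssim(x.numpy(), x_hat.numpy(), data_range=1.0))
    return float(np.mean(sims))
```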
For each problem domain we also require a VAE model to use as a ground truth for measuring the MRS. For FashionMNIST we trained a fully-connected VAE model, VAE_MRS, with a 100-dimensional latent space and a symmetric encoder and decoder, each with two hidden layers, one of 256 neurons and one of 512 neurons, and ReLU activations. The decoder uses a Sigmoid activation so that output values are in the range 0 to 1. We chose a model significantly larger than those used for DFV for evaluating MRS under the assumption that a larger model would better model the distribution and thus provide accurate MRS measures for all models tested. For DroNet we trained a convolutional VAE model, Conv-VAE_DroNet, with a symmetric encoder and decoder and a 512-dimensional latent space. The decoder consists of 8 blocks, each composed of a convolutional transpose operation followed by batch normalization and an ELU activation, except for the final block, which uses a Sigmoid activation so that output values are in the range 0 to 1. We chose this model as the baseline for MRS since we expected a convolutional model to perform well on the image data of the DroNet benchmark.

5) Computing Resources:
The experiments in this work were run on nodes with Intel Xeon Silver 4214 processors at 2.20 GHz and 512GB of memory. For RQ1 and RQ2 each job was allowed to use 1 processor core and unrestricted memory, and had a time limit of 1 hour, while falsification jobs in RQ2 -exploring the factors of the VAE -had a time limit of 5 minutes. For RQ3, each job was allowed to use 2 processor cores and had a time limit of 1 hour.

B. RQ-1: On DFV Efficacy
In this first experiment, we quantitatively and qualitatively assess the effectiveness of DFV and its costs when applied in conjunction with 4 falsifiers and 3 verifiers.
Experimental Procedure. To answer RQ1, we use the GHPR-FMNIST benchmark. We run both the verifiers and falsifiers on this benchmark, with and without DFV using VAE_RQ1. We designed VAE_RQ1 so that all existing tools could successfully run on it, which meant constraining its size and type of activation functions so that existing verifiers could process it. More specifically, we design VAE_RQ1 with a single hidden layer of 24 neurons in the decoder, and instead of a Sigmoid activation on the output, it uses an approximation of the Sigmoid function built from ReLU activations, since, of the verifiers explored in this work, only VeriNet supports non-ReLU activation functions. The approximation used is: Sigmoid(x) ≈ ReLU(−ReLU(−0.25x + 0.5) + 1). We run each tool 5 times on every problem to account for random noise and record the number of problems that return a sat result, indicating that a counter-example was found, as well as the MRS of each counter-example. Each falsification and verification job had a timeout of 1 hour and used a radius of 3 in the latent space.
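This approximation is a hard sigmoid built purely from ReLU and affine operations, clamping 0.25x + 0.5 to [0, 1]; the small check below compares it against the true Sigmoid (the error figure is our own calculation).

```python
import torch

def relu_sigmoid(x):
    # Sigmoid(x) ≈ ReLU(-ReLU(-0.25 * x + 0.5) + 1): expressible with only
    # ReLU and affine layers, so ReLU-only verifiers can process it.
    return torch.relu(-torch.relu(-0.25 * x + 0.5) + 1)

x = torch.linspace(-8, 8, 9)
print(relu_sigmoid(x))   # clamps 0.25x + 0.5 to [0, 1]
print(torch.sigmoid(x))  # reference; max deviation is about 0.12 near x = ±2
```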
Analysis and Findings. We start by examining the mean reconstruction similarity (MRS) measures for the counter-examples generated by DFV. The MRS values are computed based on their reconstruction with VAE_MRS. Fig. 4 shows box plots representing the distribution of the MRS of the counter-examples found by each of the 7 tools (x-axis) when applied to the original DNN (red) and the DNN with the VAE_RQ1 decoder (blue) generated by DFV. We find that, across all tools, the use of a model with DFV renders counter-examples that are reconstructed better by VAE_MRS than those found in the original DNN. Indeed, the median MRS for the counter-examples found in the original DNN is under 0.1, while the median MRS for the tools applied with DFV is above 0.6. This implies that they are closer to the distribution learned by VAE_MRS and thus may be closer to the true input distribution. A statistical analysis of variance with the Kruskal-Wallis method confirmed that the differences between using and not using DFV on any given tool are significant at p=0.05.

This portion of the study also revealed an interesting opportunity for verifiers. Based on the property design, we expected that the verifiers would not be able to prove a property of type B, and were unlikely to prove one of type A on the original DNN, which was indeed the case. However, when we used DFV, the nnenum verifier was able to prove 25 problems that held under the reduced input space encoded by VAE_RQ1. This observation points to an opportunity for enabling verification to prove properties that may not hold over the whole input space but may hold over the relevant input space as per the training distribution. For such an approach to be effective, further studies are needed to guarantee that the generative model encodes a faithful model of the input distribution. We discuss this further in future work.
We now qualitatively examine the counter-examples generated with and without DFV. The tabulated images in Fig. 5a are the counter-examples with the highest SSIM generated by each tool on the DNN, and the ones in Fig. 5b those generated with DFV. Without DFV, we see in Fig. 5a that the images generated by all the falsifiers look like random noise, while the images generated by the verifiers have a bit more structure, with larger blocks of similarly valued pixels, but still have little discernible pattern. On the other hand, most of the counter-examples generated with DFV in Fig. 5b bear some resemblance to the training images (e.g., boots, pants, sandals), and some of them are clearly identifiable. We also notice that the counter-examples found with DFV for some properties correspond to distinct classes. We argue that when no counter-examples are found for a property with a model, but are found for the original DNN, as for Properties A-1 and A-3, those counter-examples are likely to be invalid as they reside outside the data distribution. By the same token, when counter-examples are found with a model but not without one, as for Property A-4, we argue that the model reduction enables tools to explore the pruned space more extensively, enabling their generation.
Last, we briefly examine the time distribution for each tool to generate the counter-examples. Fig. 6 presents box plots for each of the tools, and we again plot the number of counter-examples on the y2-axis. As expected, falsifiers are faster than verifiers. Looking at the upper quartiles of the times spent by the different tools, we can see that all falsifiers took under 1.5 seconds, while the verifiers took up to 1444.5 seconds. PGD detected the most counter-examples, 85 on the original DNN and 71 with the approach, while its median execution time was just over a second. When comparing the boxes within a tool, we find that incorporating DFV did not have a major impact on the time taken by any of the tools.
Major Findings: Tools applied in conjunction with DFV generate fewer counter-examples, but with four times higher MRS, at negligible additional time cost, and the counter-examples visually appear to be much better aligned with the training distribution.

C. RQ-2: On VAE Structure Effects on DFV
We now explore the effects of the VAE's latent space size, number and size of layers, and radius from the center of the latent space distribution on the efficacy of DFV.
Experimental Procedure. To answer RQ2, we again use the GHPR-FMNIST benchmark while exploring several factors that may affect the efficacy of DFV. Given that there is a large number of configurations to explore, that the larger VAE configurations are not runnable by all verifiers, and that the performance of all falsifiers was similar in the first experiment, we chose to run all configurations of DFV with only the PGD falsifier. We first explore factors related to the VAE architecture, varying the size of the latent space, the number of hidden layers, and the size of each layer. We explored latent space sizes of 1, 2, 4, 8, 16, and 32; hidden layer counts of 1, 2, and 4; and layer sizes of 16, 32, 64, 128, and 256. For each combination of factors, we trained a VAE on the FashionMNIST data, transformed the correctness problems using DFV, and ran PGD on the resulting problems. We refer to each as VAE_{d,l,h}, where d is the latent space size, l is the number of hidden layers, and h is the size of each hidden layer; e.g., the model VAE_{8,2,256} has an 8-dimensional latent space with 2 hidden layers, each with 256 neurons. Second, we explore how the size of the latent space region of the model affects the quality of the found counter-examples. We specify the size of the input region by restricting the radius of the l_∞ d-ball in the latent space of the VAE, exploring radii from 0.25 to 4 in 0.25 increments. To reduce the number of experiments, we use only the VAE that performed best in the first part of the experiment, i.e., with a high number of counter-examples found and high MRS. For this question, each falsification job was run 5 times to reduce the effects of random noise, and each job was given a timeout of 5 minutes. For each combination of factors we report the number of counter-examples found, as well as the MRS of each counter-example.
Analysis and Findings. We explored a total of 90 VAE configurations that work in conjunction with DFV. To control for randomness, we run each configuration five times.
We observe that smaller latent spaces (LS={1,2}) appear to generate counter-examples with slightly higher MRS, mainly because the model renders a less diverse set of images but of very high quality. The differences in MRS are confirmed with an ANOVA test of significance and a multiple comparison of latent space means with a Bonferroni correction across the latent spaces. More specifically, the MRS for LS=1 is significantly different from the rest of the latent spaces, and LS=2 is significantly different from the rest, at p=0.05. We conjecture that larger spaces are able to encode richer data distributions, enabling the generation of more, and more diverse, counter-examples that are sometimes farther from the distribution (e.g., a sandal that appears as if printed on a shirt). Still, for this particular benchmark, the gains in the number of counter-examples found and the losses in MRS seem to saturate after latent spaces of size 8. Across all of the latent space sizes, PGD required a median of 0.5 seconds to find counter-examples. The timing data for these experiments is available in the appendix.
We then selected the VAE architectures with LS=8, which contained the architecture that tied for the highest MRS while finding the most counter-examples, to examine their variance. The x-axis of Fig. 7b lists the 15 VAE architectures we explored, specified by latent space size, number of layers, and number of neurons. We note that the architectures with more layers appear to produce counter-examples with higher MRS. For example, the median for the architectures with 1 layer was 0.58, with 2 layers 0.68, and with 4 layers 0.79. An ANOVA confirms that the differences across architectures are significant at p=0.05, and a pair-wise comparison with a Bonferroni correction reveals that all the architectures with 1 layer are significantly different from those with 4 layers. The figure also seems to indicate that, for the same number of layers, having more neurons renders slightly higher MRS. For example, the median for the architecture with 16 neurons was 0.59, while for the three architectures with 256 neurons it was 0.71. We notice, however, that the number of counter-examples found was higher when fewer layers were used. We conjecture that having more layers further restricts the size of the input space learned by the VAE, perhaps due to the extra expressive power of the additional layers. As with latent spaces, the time to find counter-examples did not vary significantly across architectures, with most of them requiring a median of less than 1 second to find a counter-example. The timing data for these experiments is available in the appendix.
The last piece of this experiment explores changing the radius of the constraints in the latent space. We examine the effect of such changes on VAE_{8,2,256}, the architecture with the most counter-examples and greatest MRS in Fig. 7b. Fig. 7c shows that the MRS slightly decreases after the first bound of 0.25 and then starts to increase with higher bounds, from a median of 0.80 for a radius of 0.25, to a median of 0.68 for a radius of 1.25, and back to a median of over 0.75 for a radius of 4. An ANOVA test across radii was significant at p=0.05, and a multiple comparison with a Bonferroni correction showed that radius 0.25 was significantly different from radii 0.5, 0.75, 1, and 1.25, but not significantly different from the higher radii (timing details are provided in the appendix). We also note that the number of counter-examples found increases as the space to explore around the captured data distribution grows with the radius.

Major Findings: VAE configurations with very limited capacity (in layers, neurons, or latent space size) can have a noticeable effect on DFV effectiveness, especially in the number of counter-examples found. If more counter-examples are desirable, then one should increase the dimensionality of the latent space, reduce the number of layers, and increase the radius. If higher-quality counter-examples are more desirable, then favoring a lower-dimensional latent space, a smaller radius, or more layers is indicated.

D. RQ-3: On DFV Scalability
In this experiment, we assess the scalability of DFV by applying it to a large DNN model for autonomous UAV control using 3 different input distribution models.
Experimental Procedure. To answer RQ3, we use the larger and more complex GHPR-DroNet benchmark. We apply the PGD falsifier to the benchmark, both as is and using DFV, with a VAE and, separately, a GAN as the generative model. We train a fully-connected VAE, FC-VAE_DroNet, and a GAN, GAN_DroNet; both models use a 512-dimensional latent space. As before, we run each falsifier 5 times to account for random noise and record the number of counter-examples found and the time to find each counter-example. Each job had a timeout of 1 hour.
Analysis and Findings. Fig. 8 shows box plots with solid outlines for the distributions of the reconstruction similarities of counter-examples found using PGD on the DroNet DNN without DFV, as well as using DFV with the decoder of FC-VAE_DroNet and the generator of GAN_DroNet. Fig. 8 also shows the number of counter-examples found using each model as bars, with the count labeled above each bar. We find that, for DFV with both models, while fewer counter-examples are found, they clearly have higher reconstruction similarities than those found using the DroNet model alone. Indeed, the MRS differences between DroNet, FC-VAE_DroNet, and GAN_DroNet are shown to be statistically significant overall by a Kruskal-Wallis test at p=0.05, as are their pairwise differences. Corroborating the previous findings, this implies that the counter-examples found using DFV are closer to the distribution learned by Conv-VAE_DroNet, the model used to compute the MRS values, and thus may be closer to the actual input distribution. Without DFV, violations were found for all 10 properties across all 5 seeds. Using FC-VAE_DroNet, 28 violations were found for 6 properties. Using GAN_DroNet, 9 violations were found across 2 properties. While the lowest MRS for a counter-example found using DFV was 0.42, the MRS without DFV never exceeded 0.11.
We now proceed to visually examine the counter-examples generated with and without DFV for 5 properties. Fig. 9 shows counter-examples generated by PGD. The images generated without DFV look like random noise, while the images generated with DFV, independent of the chosen model, have structure and contain features seen in the training images such as roads, trees, or horizon lines. The model used for DFV has an impact on the images produced. While the VAE model tended to produce blurrier counter-examples, the GAN model produced counter-examples with sharper lines, but fewer recognizable road features.
Finally, Fig. 8 shows box plots, with dashed outlines, of the time to generate each counter-example using each model. The median time to falsify DroNet alone was 321 seconds, while DFV with FC-VAE_DroNet took 146 seconds and DFV with GAN_DroNet took 259 seconds, but there is enough performance variance that those differences are not deemed statistically significant.
Major Findings: DFV can be applied with various models without significant time penalty, while also producing counterexamples with up to a nine-fold improvement in reconstruction similarity.

E. Threats to Validity
External Validity. Three threats to the generalization of our findings are our choices of tools, benchmarks, and generative models for evaluating DFV. We mitigate the concern about tool generality by selecting multiple falsifiers and verifiers for the first research question. For the subsequent questions we traded generality across tools for more insight into the performance of DFV under different models, which meant that we had to drop the DNN verifiers from the rest of the assessment because they did not scale to the networks and models we were targeting. Regarding benchmarks, we selected ones from different domains, one a classification task and the other a regression task, with very different architectures and training data. Still, more benchmarks are needed to more broadly explore the cost-effectiveness of DFV. To mitigate the threat of model selection, we explored an extensive set of models in a systematic way. Still, the examination of more generative models is part of future work.
Construct Validity. Our choice of MRS as a quality measure and our personal qualitative judgment of generated counter-examples pose a threat in that the relevance of a counter-example could be judged by many means. We mitigated this threat by basing MRS on a popular measure, ESRE, and specializing it to images with SSIM. We also provide results using ESRE in the appendix, and in the future we will explore additional measures, including those for outlier detection, and perform studies with users to help us judge the quality of counter-examples.
Internal Validity. Our training processes for the networks and the models constitute a threat to the internal validity of the study, as their correctness could have affected the findings. We have documented those processes and, where possible, scripted them to facilitate reproduction. We also mitigate this threat by making our data and the scripts for running our experiments and analyzing our results publicly available (see below). Another threat to validity is the randomness involved in training the networks and models, and in the tools' performance. We mitigated that threat by running the tools multiple times and reporting their variability.

V. CONCLUSION
This work introduces a novel approach, DFV, which enables existing DNN verification and falsification techniques to target the data distribution. DFV composes learned latent variable generative distribution models with the DNN under analysis, reformulating the problem so that generated counter-examples are on the data distribution. We explore different data distribution models and find that even simple models yield substantially better counter-examples across a range of verification and falsification techniques for two different benchmarks.
These findings, along with recent work on distribution-aware testing [9], [7], suggest that models of the data distribution can play an important role in V&V of DNNs. We plan to pursue further work along these lines; for example, performance metrics for latent variable generative models that assess their precision and recall [43] can guide the development of distribution models that are customized to best suit different V&V activities for DNNs.

ARTIFACT AVAILABILITY
We provide an artifact containing the tool, as well as the data and scripts required to replicate our study, at https://zenodo.org/record/5104745.