Evolutionary improvement of assertion oracles

Assertion oracles are executable boolean expressions placed inside the program that should pass (return true) for all correct executions and fail (return false) for all incorrect executions. Because designing perfect assertion oracles is difficult, assertions often fail to distinguish between correct and incorrect executions. In other words, they are prone to false positives and false negatives. In this paper, we propose GAssert (Genetic ASSERTion improvement), the first technique to automatically improve assertion oracles. Given an assertion oracle and evidence of false positives and false negatives, GAssert implements a novel co-evolutionary algorithm that explores the space of possible assertions to identify one with fewer false positives and false negatives. Our empirical evaluation on 34 Java methods from 7 different Java code bases shows that GAssert effectively improves assertion oracles. GAssert outperforms two baselines (random and invariant-based oracle improvement), and is comparable with, and in some cases even outperforms, human-improved assertions.


INTRODUCTION
Recently, we have witnessed great advances in test input generation [13,30]. However, the oracle problem [4] remains a major obstacle that limits the effectiveness of automatically generated test suites. Instead of generating test oracles for each automatically generated test case, one could rely on assertion oracles to expose software faults. Assertion oracles (also called program assertions) are executable boolean expressions that predicate on the values of variables at specific program points. A perfect assertion oracle passes (returns true) for all correct executions and fails (returns false) for all incorrect executions. Perfect oracles are difficult to design, and thus assertion oracles often fail to distinguish between correct and incorrect executions [25], that is, they are prone to both false positives and false negatives, which are jointly called oracle deficiencies [20]. A false positive is a correct program state in which the assertion fails (but should pass), and a false negative is an incorrect program state in which the assertion passes (but should fail).
Oracle deficiencies are a serious problem for both manually and automatically generated assertion oracles. In fact, invariant generators are known to generate invariants that are incomplete and imprecise when used as assertion oracles [6,29,40]. They are incomplete because most dynamic invariant generators, notably Daikon [10] and InvGen [17], cannot generate assertions that do not match pre-defined templates of Boolean expressions [6]. Existing invariant generators are also imprecise, because the generated invariants often do not generalize well to unseen test cases. In fact, Nguyen et al.'s and Staats et al.'s studies [29,40] report high false positive rates for Daikon invariants.
Improving the quality of program assertions by removing oracle deficiencies is of paramount importance. It would improve the fault detection capability and reduce the false alarms of both automatically generated and manually written test cases.
Recently, Jahangirova et al. proposed OASIs [20,21] to automatically identify oracle deficiencies. Given an assertion oracle, OASIs generates test cases and mutations that give evidence of false positives and false negatives, respectively. This evidence is meant to support the developers in assessing and improving the oracles.
A recent study by OASIs's authors shows that the manual improvement of assertion oracles is difficult [22]. Given the oracle deficiencies detected by OASIs, humans successfully removed all oracle deficiencies for only 67% of the given assertions.
The difficulty of manually improving assertion oracles motivated us to study how to automatically improve assertions. Given an assertion oracle α and some evidence of false positives and false negatives provided by an oracle assessor (such as OASIs), we aim to automatically generate an improved assertion α ′ with fewer oracle deficiencies than α. While there are many techniques to automatically generate program assertions, for example, program invariants [2, 9, 15–17, 27, 35, 37, 46], automatically improving assertion oracles is an unexplored problem.
In this paper, we propose GAssert, Genetic ASSERTion improvement, the first technique to automatically improve assertion oracles. Given an assertion oracle and its oracle deficiencies, GAssert explores the space of possible assertions to identify those with zero false positives and the lowest number of false negatives. GAssert favors assertions with zero false positives, as false alarms are known to trigger an expensive debugging process [29].
GAssert addresses the challenge of navigating a huge search space with an evolutionary approach that evolves populations of assertions by rewarding assertions with fewer deficiencies. GAssert formulates the oracle improvement problem as a multi-objective optimization problem (MOOP) [41] with three competing objectives: (i) minimizing the number of false positives, (ii) minimizing the number of false negatives, (iii) minimizing the size of the assertion.
The key challenge of defining a multi-objective fitness function is that these three objectives are competing with each other. Simply merging the objectives into the same fitness function is not an effective solution, as in MOOPs it is difficult to simultaneously reduce all competing objectives [31,34,41]. For an evolutionary algorithm, a possible strategy to improve a given program assertion might be to first remove all false negatives (accepting more program behaviors, i.e., generalizing the assertion), to first remove all false positives (accepting fewer program behaviors, i.e., specializing the assertion), or to interleave these two strategies.
GAssert addresses this challenge with a co-evolutionary approach that evolves two populations in parallel with different fitness functions for each population. The fitness functions of the first and second population reward solutions with fewer false positives and false negatives, respectively, considering the remaining objectives only in tie cases. The two populations exchange their best individuals (population migration) on a regular basis, to supply both populations with good genetic material, useful to improve both the primary and secondary objectives. Moreover, GAssert presents novel crossover and mutation operators specifically designed for the oracle improvement problem.
We empirically evaluated GAssert on 34 methods from 7 Java code bases. We evaluated the ability of GAssert to improve an initial set of Daikon [9] generated assertions. The improved assertions eliminate all false positives present in the initial Daikon assertions, and reduce the false negatives by 40% (on average) with respect to the initial Daikon assertions. When executed with unseen tests and mutants, the GAssert assertions increase the mutation score by 34% (on average) with respect to the mutation score obtained with the initial assertions.
In summary, this paper makes the following contributions:
• We formulate the problem of automatically improving assertion oracles given a set of false positives and false negatives;
• We propose GAssert, the first technique to automatically improve assertion oracles;
• We evaluate GAssert on 34 methods from seven Java code bases, and show that GAssert outperforms both unguided random and invariant-based approaches;
• We release our evaluation results (https://doi.org/10.5281/zenodo.3876638) and tool (https://doi.org/10.5281/zenodo.3877078) to facilitate future work in this area.

PROBLEM FORMULATION
This section provides the preliminaries for this work and formulates the problem of improving assertion oracles. In this paper, P is an object-oriented program composed of a set of classes, each defining a set of methods and fields. Given a program point ρ of a method m in P, S ρ denotes the set of all program states that can reach ρ when m is executed. A state s ∈ S ρ defines an assignment of values to memory locations that are accessible (visible) at the program point ρ (e.g., instance fields, method parameters and local variables). S ρ is partitioned into two disjoint sets: correct (S + ρ ) and incorrect (S − ρ ) program states. We say that a state is correct if it satisfies the intended program behavior, incorrect otherwise. We drop the subscript ρ and use S, S + and S − when ρ is clear from the context.
A program point ρ can be associated with an assertion oracle α, a quantifier-free first-order logic formula that predicates on variables and functions of Boolean or numerical types and returns a Boolean value (T or F). Let Σ denote the set of variables visible at the assertion point ρ. Let F denote the set of Boolean and numerical operators that GAssert uses to synthesize assertions. The content of Σ depends on ρ, while F is fixed for any ρ. Table 1 shows the 17 functions in F grouped by operand and output type.
Assertion oracles aim to distinguish correct and incorrect executions. We consider assertions inserted into program P, and not into its test cases. The difference is that assertions in P handle all possible test case executions, while assertions in the test cases check the correctness of a single test execution. More specifically, an assertion oracle α expresses a correctness property that is intended to be true at ρ in all correct executions (i.e., ∀s + ∈ S + , α[s + ] = T ) and false in all incorrect executions (i.e., ∀s − ∈ S − , α[s − ] = F ), where α[s] denotes the evaluation of the Boolean expression α on state s. We call perfect oracle an assertion that satisfies such a condition.
Perfect oracles are difficult to design, and assertion oracles often fail to distinguish correct from incorrect executions, i.e., they have false positives and false negatives, which we call oracle deficiencies.

Definition 1. A false positive of an assertion α at a program point ρ is a reachable program state where α is false, although such state is correct (according to the intended program behavior). More formally, it is a state s + ∈ S + ρ : α[s + ] = F .

Definition 2.
A false negative of an assertion α at a program point ρ is a reachable program state where α is true, although such state is incorrect (according to the intended program behavior). More formally, it is a state s − ∈ S − ρ : α[s − ] = T .
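As a minimal illustration of Definitions 1 and 2, the two kinds of deficiency can be phrased over simplified program states. The following sketch is ours, not GAssert code: states are reduced to maps from variable identifiers to integer values, and the deliberately weak oracle for a hypothetical min(x, y) method anticipates the running example of Section 3.

```java
import java.util.Map;
import java.util.function.Predicate;

// Sketch: program states as variable-to-value maps, assertion oracles as
// predicates on such states.
public class OracleDeficiencies {

    // A deliberately weak oracle for a hypothetical min(x, y) method: "min < x".
    public static final Predicate<Map<String, Integer>> WEAK_ORACLE =
            s -> s.get("min") < s.get("x");

    // Definition 1: a false positive is a correct state on which the oracle fails.
    public static boolean isFalsePositive(Predicate<Map<String, Integer>> oracle,
                                          Map<String, Integer> correctState) {
        return !oracle.test(correctState);
    }

    // Definition 2: a false negative is an incorrect state on which the oracle passes.
    public static boolean isFalseNegative(Predicate<Map<String, Integer>> oracle,
                                          Map<String, Integer> incorrectState) {
        return oracle.test(incorrectState);
    }

    public static void main(String[] args) {
        // Correct state for min(2, 5): min == 2, but 2 < 2 is false -> false positive.
        System.out.println(isFalsePositive(WEAK_ORACLE, Map.of("x", 2, "y", 5, "min", 2)));
        // Corrupted state where a fault produced min == 1: 1 < 2 passes -> false negative.
        System.out.println(isFalseNegative(WEAK_ORACLE, Map.of("x", 2, "y", 5, "min", 1)));
    }
}
```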
In this paper, we study the problem of automatically improving assertion oracles, that is, given an assertion α and a set of oracle deficiencies, generating a new assertion α ′ with fewer deficiencies. Identifying oracle deficiencies by enumerating all correct and incorrect states is infeasible, because it requires enumerating infinitely many executions [36]. Thus, we rely on a precise but incomplete oracle assessor OA that returns evidence of false positives and false negatives (if any) for a given assertion. We assume any OA to be precise (it reports only real oracle deficiencies), but possibly incomplete (it may miss oracle deficiencies) because it cannot enumerate all possible correct and incorrect executions.
An oracle assessor can be either a human or an automated technique. To enable full automation, we rely on the automated oracle assessor OASIs [20,22]. Given an assertion α, OASIs leverages search-based test generation and mutation testing to report oracle deficiencies, if any can be found within the given time budget.
OASIs finds false positives of an assertion α by generating test cases that make α return false in the reached state. OASIs considers such states as false positives of α because it targets the implemented program behavior, which might differ from the intended one. As such, GAssert needs a manual validation of the improved assertions to ensure that they capture the intended program behavior.
OASIs finds false negatives of an assertion α by seeding artificial faults (mutations) into program P using mutation testing [13].
OASIs generates a test case and a mutation that produce a corrupted program state s − ∈ S − at the assertion point ρρ, where α does not reveal the fault, i.e., α returns true.
We now define the oracle improvement problem given an oracle assessor OA. Let A denote the universe of possible Boolean expressions containing variables in Σ and functions in F . To make A a finite set, we bound the size of assertions (i.e., the number of variables and functions in the assertions) to a maximum value (50 in our experiments). Let FP(α, S + ) denote the number of false positives of α with respect to a finite subset S + of S + . That is, FP(α, S + ) is the number of states s + ∈ S + ⊆ S + : α[s + ] = F . Similarly, FN(α, S − ) denotes the number of false negatives of α with respect to a finite subset S − of S − . That is, FN(α, S − ) is the number of states s − ∈ S − ⊆ S − : α[s − ] = T .

Problem Definition 1. Given an assertion α at a program point ρ in P, given a set of false positives S + ⊆ S + and a set of false negatives S − ⊆ S − reported by an oracle assessor OA, and an overall time budget B, the oracle improvement of α is the process of finding within B a new assertion α ′ ∈ A such that FP(α ′ , S + ) = 0 and either FP(α ′ , S + ) < FP(α, S + ) or FN(α ′ , S − ) < FN(α, S − ).
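The deficiency counts and the acceptance condition of Problem Definition 1 translate directly into code. This is a sketch under the same simplified state representation as before; the method names are ours.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Sketch of FP(α, S+), FN(α, S-) and the improvement condition of
// Problem Definition 1, over finite sets of collected states.
public class OracleImprovementCheck {

    // FP(α, S+): states s+ in S+ on which the assertion fails.
    public static int countFP(Predicate<Map<String, Integer>> alpha,
                              List<Map<String, Integer>> correct) {
        return (int) correct.stream().filter(s -> !alpha.test(s)).count();
    }

    // FN(α, S-): states s- in S- on which the assertion passes.
    public static int countFN(Predicate<Map<String, Integer>> alpha,
                              List<Map<String, Integer>> incorrect) {
        return (int) incorrect.stream().filter(alpha::test).count();
    }

    // α' improves α iff it has no false positives and strictly reduces
    // at least one kind of deficiency.
    public static boolean improves(Predicate<Map<String, Integer>> alpha,
                                   Predicate<Map<String, Integer>> alphaPrime,
                                   List<Map<String, Integer>> correct,
                                   List<Map<String, Integer>> incorrect) {
        int fpNew = countFP(alphaPrime, correct), fnNew = countFN(alphaPrime, incorrect);
        return fpNew == 0
                && (fpNew < countFP(alpha, correct) || fnNew < countFN(alpha, incorrect));
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> correct = List.of(Map.of("x", 2, "y", 5, "min", 2));
        List<Map<String, Integer>> incorrect = List.of(Map.of("x", 2, "y", 5, "min", 5));
        Predicate<Map<String, Integer>> alpha = s -> s.get("min") < s.get("x");        // min < x
        Predicate<Map<String, Integer>> alphaPrime = s -> s.get("min") <= s.get("x");  // min <= x
        System.out.println(improves(alpha, alphaPrime, correct, incorrect));
    }
}
```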
In defining the oracle improvement, we give priority to false positive over false negative reduction, by requiring all false positives to disappear in the improved oracle α ′. The rationale for this choice is that false negative reduction can be easily achieved with assertions that raise many false alarms. However, such assertions are troublesome for developers, as they trigger an expensive debugging process, in which the root of the assertion failures may likely be the assertion itself. Therefore, we privilege assertions with no false alarms (no false positives). Ideally, the improved assertion oracle α ′ has zero oracle deficiencies with respect to S + and S − (i.e., FP(α ′ , S + ) = FN(α ′ , S − ) = 0). However, generating such assertions can be expensive and difficult, and may be infeasible within a reasonable time budget, as an oracle that detects all faults could be as complex as the method under test [20]. Therefore, we deem an oracle with zero false positives and the lowest number of false negatives sufficiently adequate in practice [22].

Algorithm 1: GAssert: Iterative Oracle Improvement Process
  input : initial assertion α at program point ρ in P, time budget B
  output : improved assertion α ′
1 function GAssert
2   P ′ ← instrument-method-at-program-point(ρ, P)
3   ⟨S + , S − ⟩ ← get-initial-correct-and-incorrect-states(P ′ )

GASSERT
Algorithm 1 overviews the GAssert approach. GAssert's inputs are (i) an assertion oracle α, (ii) the program point ρ in P where α is placed, and (iii) a time budget B. The output of GAssert is an improved assertion α ′. GAssert improves assertion oracles with an iterative process. Before the first iteration, GAssert instruments P to capture program states at runtime (line 2 of Algorithm 1). It then produces an initial set of correct and incorrect states S + and S − by executing an initial test suite on the instrumented version P ′ and on its faulty versions (mutants), respectively (line 3). The while loop at lines 4–13 implements the iterative process. GAssert gets the dictionary of variables Σ from the states S + and S − (line 5), and invokes the oracle improvement of α (line 6). The oracle-improvement algorithm, which we discuss in Section 3.3, returns an improved assertion α ′ (line 6). If OASIs cannot find any oracle deficiencies of α ′, Algorithm 1 returns α ′, and the iterative process terminates (lines 8 and 9). Otherwise, GAssert adds the newly identified false positives and false negatives (S + new and S − new ) to S + and S − (lines 10 and 11), respectively. The improved assertion α ′ replaces the initial assertion α (line 12) and a new iteration starts.
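The shape of Algorithm 1's outer loop can be sketched as follows. This is a structural simplification, not GAssert's code: the assessor and improver are abstracted behind hypothetical interfaces, and the toy instantiation in main hard-codes the improvement step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Structural sketch of the iterative improvement loop: an oracle assessor
// reports new deficiency states, which are accumulated as evidence before
// the next improvement round; the loop stops when no deficiencies remain
// or the iteration budget is exhausted.
public class IterativeImprovement {
    public interface Assessor {
        // Deficiency states found for the candidate; empty if none.
        List<Map<String, Integer>> assess(Predicate<Map<String, Integer>> candidate);
    }
    public interface Improver {
        Predicate<Map<String, Integer>> improve(Predicate<Map<String, Integer>> current,
                                                List<Map<String, Integer>> evidence);
    }

    public static Predicate<Map<String, Integer>> run(
            Predicate<Map<String, Integer>> alpha,
            Assessor oa, Improver gp, int maxIterations) {
        List<Map<String, Integer>> evidence = new ArrayList<>();
        for (int i = 0; i < maxIterations; i++) {
            List<Map<String, Integer>> found = oa.assess(alpha);
            if (found.isEmpty()) return alpha;   // no deficiencies found: done
            evidence.addAll(found);              // accumulate deficiency states
            alpha = gp.improve(alpha, evidence); // produce the next candidate
        }
        return alpha;                            // budget exhausted
    }

    public static void main(String[] args) {
        // Toy instantiation: the assessor reports the one correct state the
        // weak oracle rejects; the improver hard-codes the fixed assertion.
        Map<String, Integer> s = Map.of("x", 2, "y", 5, "min", 2);
        Assessor oa = c -> c.test(s) ? List.of() : List.of(s);
        Improver gp = (c, ev) -> st -> st.get("min") <= st.get("x");
        Predicate<Map<String, Integer>> improved =
                run(st -> st.get("min") < st.get("x"), oa, gp, 5);
        System.out.println(improved.test(s)); // the deficiency is removed
    }
}
```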

Running Example
We now describe the GAssert oracle improvement process with a running example. Figure 1 shows a Java method that accepts two integers x and y as parameters, and returns the minimum between them. The figure also shows (i) the assertion point ρ (line 9), (ii) two instrumented method calls to collect the program states (lines 1 and 8), and (iii) two mutants M 1 and M 2 used to produce false negative program states (lines 6 and 4). Table 2 illustrates how GAssert improves a trivially incomplete initial assertion (min < x) into a stronger assertion that intuitively captures the expected behavior of a "min" function: ((min == x) OR (min == y)) AND ((min ≤ x) AND (min ≤ y)). Column "input assertion α" shows the assertions that the oracle assessor (OA) receives as input at each iteration. The first assertion (min < x) is provided to GAssert as an input, while the following two assertions are automatically generated by its evolutionary algorithm. The initial assertion can be manually generated or inferred with a tool. Column "False Positives (FP)" shows the false positive states with the test cases that produce them. On such states α fails while it should pass. Similarly, Column "False Negatives (FN)" shows the false negative states with the test cases and the mutants that produce them. On such states α passes but it should fail.
In the example, OA identifies both FPs and FNs for α : min < x. Table 2 reports a test case t 1 that OA generates for α, and for which α incorrectly returns false. The execution of test case t 1 produces the state s + 1 that is a false positive for α (see Def. 1). The table also reports a sample test case t 2 and mutant M 1 that OA generates for α and for which α incorrectly returns true (α does not kill mutant M 1 ). The execution of test case t 2 with mutant M 1 produces the state s − 2 that is a false negative for α (see Def. 2). At the first iteration, GAssert takes as input α, the false positive s + 1 and the false negative s − 2 of α, and returns the improved assertion α ′ : (min ≤ x) AND (min ≤ y). GAssert produces α ′ with an evolutionary algorithm that evolves populations of assertions towards an assertion with zero false positives and the lowest number of false negatives. The evolutionary algorithm explores the search space by (i) selecting pairs of assertions (parents) by means of fitness functions that reward solutions with fewer oracle deficiencies, (ii) creating new (and possibly fitter) offspring by exchanging genetic materials (portions of assertions) of the parents with crossover operators, and (iii) mutating the offspring (with a certain probability) using mutation operators.
We now exemplify how the evolutionary algorithm obtains α ′ : (min ≤ x) AND (min ≤ y) during the first iteration. Let us assume that the algorithm selects two parents α p1 : min ≤ x and α p2 : min ≥ y. The assertion α p1 reduces the number of false positives with respect to the initial assertion (FP(α p1 , S + ) = 0, where S + = {s + 1 }), but it does not reduce the number of false negatives, because α p1 evaluates to true under s − 2 . Conversely, the assertion α p2 reduces the number of false negatives (FN(α p2 , S − ) = 0), but it has the same number of false positives as α (FP(α p2 , S + ) = FP(α, S + ) = 1). The crossover operator merge crossover applied to α p1 and α p2 produces the offspring α o1 : (min ≤ x) AND (min ≥ y) and α o2 : (min ≤ x) OR (min ≥ y). If the mutation operators mutate α o1 into (min ≤ x) AND (min ≤ y), GAssert obtains an improved assertion with zero oracle deficiencies with respect to S + and S − , and the first iteration terminates.
At the second iteration, OA takes in input α : (min ≤ x) AND (min ≤ y) to find its oracle deficiencies (if any). For this assertion, OA does not find false positives, but it reports a false negative state s − 3 : in fact, α returns true under s − 3 , while it should return false. Given the assertion α : (min ≤ x) AND (min ≤ y), the correct states S + = {s + 1 } and the incorrect states S − = {s − 2 , s − 3 }, the evolutionary algorithm returns the improved assertion α ′ : ((min == x) OR (min == y)) AND ((min ≤ x) AND (min ≤ y)). This assertion does not have oracle deficiencies with respect to S + and S − (i.e., FP(α ′ , S + ) = FN(α ′ , S − ) = 0). As OA does not find oracle deficiencies for α ′ , the GAssert improvement process terminates.
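The running example can be replayed end-to-end in plain Java. This is our reconstruction of the described behavior, not the code of Figure 1; mutantMin is a hypothetical seeded fault standing in for the paper's mutants.

```java
// The "min" method of the running example, a hypothetical mutant, and the
// final improved assertion written out as a Java boolean expression.
public class MinExample {
    public static int min(int x, int y) { return x < y ? x : y; }

    // A seeded fault (our example, not the paper's M1/M2): always return y.
    public static int mutantMin(int x, int y) { return y; }

    // ((min == x) OR (min == y)) AND ((min <= x) AND (min <= y))
    public static boolean improvedAssertion(int x, int y, int min) {
        return (min == x || min == y) && (min <= x && min <= y);
    }

    public static void main(String[] args) {
        // Passes on correct executions...
        System.out.println(improvedAssertion(2, 5, min(2, 5)));
        System.out.println(improvedAssertion(7, 3, min(7, 3)));
        // ...and fails on the mutant when x < y (the mutant returns y = 5).
        System.out.println(improvedAssertion(2, 5, mutantMin(2, 5)));
    }
}
```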
The following two subsections describe in detail how GAssert serializes program states and how it improves assertion oracles.

Program State Serialization
A program state s = {v 1 , · · · , v n } is a set of variables that are in memory at a certain execution point. Each variable v i has a type type(v i ), an identifier id(v i ) and a value value(v i ). Deciding which variables compose a program state is a key design choice. It defines both the expressiveness of the assertions that GAssert can produce (the dictionary Σ) and the size of the search space (A). GAssert should consider variables that capture useful properties of the method under test, and ignore irrelevant ones. Indeed, considering too many variables unnecessarily increases the search space, which makes it harder to find oracle improvements.
Given a method m with formal parameters p 0 , . . . , p n , GAssert constructs the program state s at ρ considering as variables all parameters p i and all the local variables created in m that are visible at ρ. Note that when m is a non-static method, the object receiver of m (this in Java) is m's first parameter p 0 . GAssert captures the values of the parameters both at the beginning of the method (adding the prefix "old" to the variable identifiers) and immediately before ρ. By considering "old" values, GAssert can generate assertions that predicate on method preconditions [9].
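The "old"-value idea can be sketched as follows. The names and structure are ours, not GAssert's instrumentation: the same variables are recorded once on method entry (with the "old" prefix) and once at the assertion point, so a synthesized assertion can relate pre- and post-states.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of pre/post state capture for a method that decrements a counter.
public class StateCapture {

    // Copy the given variables into a state fragment, prefixing identifiers.
    public static Map<String, Integer> capture(String prefix, Map<String, Integer> vars) {
        Map<String, Integer> out = new HashMap<>();
        vars.forEach((id, v) -> out.put(prefix + id, v));
        return out;
    }

    // "Instrumented" version of a trivial method under test.
    public static Map<String, Integer> decrement(int counter) {
        Map<String, Integer> state = capture("old.", Map.of("counter", counter)); // method entry
        counter = counter - 1;                                                    // method body
        state.putAll(capture("", Map.of("counter", counter)));                    // just before the assertion point
        return state;
    }

    public static void main(String[] args) {
        Map<String, Integer> s = decrement(3);
        // A generated assertion can now predicate on the precondition,
        // e.g. counter == old.counter - 1.
        System.out.println(s.get("counter") == s.get("old.counter") - 1);
    }
}
```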
When the considered variable has a primitive type, GAssert simply adds its runtime value to the program state (rounding floats with a fixed precision) using the variable name as identifier. However, variables of object-oriented programs can have both primitive and non-primitive (object) types, introducing the problem of obtaining primitive values from objects. Given a non-primitive variable v i , there are two well-established approaches to obtain primitive values: object serialization and observer abstraction. Object serialization [42] captures the values of all primitive-type object fields that are recursively reachable from v i . Observer abstraction [2] captures the return values of observer methods invoked with v i as the object receiver. Observer methods are side-effect free methods that are declared in v i 's class and return primitive values. Such values often characterize important properties of objects [1,2].
Both approaches have advantages and disadvantages. Object serialization can lead to many variables, which unnecessarily increase the search space. Indeed, many recursively obtained primitive variables often refer to implementation details that do not capture interesting properties of objects. Observer abstraction is inherently incomplete because the available observer methods might not capture all the relevant aspects of the analyzed objects [2].
Hybrid State Serialization. To address the issue, GAssert opts for a hybrid solution that combines both approaches. We rely on observer methods for all non-primitive variables considered by GAssert. In addition, we use the object serialization approach only for the object receiver (this) of the method under test m, capturing the values of all primitive fields of this. For non-primitive fields of this, we do not serialize their recursively reachable primitive fields, but again we use the observer abstraction approach. The rationale is that the primitive fields of m's object receiver are more likely to capture important aspects of the behavior of m than recursively reachable primitive fields or other method parameters. We now describe in detail our hybrid approach.
Let v i be a non-primitive variable considered for constructing the program state s. GAssert finds the observer methods { f 1 , f 2 , . . . , f n } of the class C of v i by using a static analyzer that scans the bytecode instructions of the public methods in C. The analyzer marks a method f j as observer if (i) f j returns a number or a Boolean, and (ii) f j cannot directly or indirectly execute putfield or putstatic bytecode instructions (f j is side-effect free), and (iii) f j does not have parameters (besides the object receiver). When collecting state s at runtime, for each observer method f j with return type τ j , GAssert adds to state s a variable with identifier "id(v i ).f j ", type τ j and value the result of the invocation of v i .f j .
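The signature-level part of these criteria can be approximated with reflection. This is only a sketch: GAssert checks side-effect freedom on the bytecode (no putfield/putstatic, directly or transitively), which reflection cannot see, so the sketch applies only criteria (i) and (iii) plus the public, non-static requirement, and may therefore over-approximate the real observer set.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Reflection-based approximation of the observer-method criteria:
// public, non-static, parameterless, numeric or Boolean return type.
public class ObserverFinder {

    public static boolean hasObserverSignature(Method m) {
        Class<?> r = m.getReturnType();
        boolean numericOrBoolean = r == boolean.class || r == int.class
                || r == long.class || r == double.class || r == float.class
                || r == short.class || r == byte.class;
        return Modifier.isPublic(m.getModifiers())
                && !Modifier.isStatic(m.getModifiers())
                && m.getParameterCount() == 0
                && numericOrBoolean;
    }

    public static List<String> observerCandidates(Class<?> c) {
        List<String> names = new ArrayList<>();
        for (Method m : c.getDeclaredMethods())
            if (hasObserverSignature(m)) names.add(m.getName());
        return names;
    }

    public static void main(String[] args) {
        // For java.lang.String this reports e.g. length and isEmpty
        // (and also hashCode, which the bytecode check would exclude,
        // since String caches its hash in a field).
        System.out.println(observerCandidates(String.class));
    }
}
```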
For non-primitive variables of type array, string or Java collection (objects that implement java.util.Collection), GAssert considers a smaller set of observer methods that capture the most important properties of such object types. GAssert adds v i .size (v i .length for arrays and String) and v i .isEmpty to the state s.
If v i is the object receiver (i.e., id(v i ) = this), GAssert serializes it by adding variable this.field j to the state s, for each primitive-type field field j of this. It then applies the observer methods approach described above to each non-primitive field of this.
Collecting States. Function instrument-method-at-program-point instruments the method m that contains ρ by adding two method calls (Algorithm 1, line 2): one at the beginning of the method (to get the "old" values) and the other immediately before ρ. When a test execution reaches the instrumented method calls, GAssert performs the state serialization described above. Every time GAssert executes a new test, it stores the observed states so that the fitness functions can compute the number of FP and FN without requiring expensive program re-executions.
Initial Program States. Function get-initial-correct-and-incorrect-states (line 3 of Algorithm 1) generates a set of initial correct (S + ) and incorrect (S − ) program states by executing an initial test suite on both the instrumented program P ′ and its faulty versions. The rationale of considering these initial states (as opposed to immediately relying on the oracle assessor OA) is to minimize the number of iterations of the while loop (line 4 of Algorithm 1). In this way, GAssert avoids invoking OA to detect obvious oracle deficiencies, and rather lets OA focus on hard-to-find ones.
Post-processing the States. Function get-initial-correct-and-incorrect-states post-processes the states with two scans. The first scan removes redundant states from S + , so that ∄s 1 , s 2 ∈ S + such that s 1 and s 2 are equivalent (s 1 ≡ s 2 ), i.e., all corresponding variables have identical values. The second scan checks that each state in S − is indeed incorrect, i.e., that the seeded fault (the mutant) has successfully corrupted the program state. For each incorrect state s − ∈ S − , GAssert retrieves the correct state s + ∈ S + obtained when executing the same test that produced s − on the original version of the program (without the seeded fault). If s − ≡ s + , GAssert found a likely equivalent state and removes s − from S − . We call such states likely equivalent because our collected states encode only a fragment of the actual program state.
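The two scans can be sketched as follows, with states again simplified to variable-to-value maps. Pairing incorrect and correct states by list index stands in for the test-case lookup; the names are ours.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the two post-processing scans over collected states.
public class StatePostProcessing {

    // Scan 1: remove redundant correct states (no two equivalent states survive).
    public static Set<Map<String, Integer>> dedup(List<Map<String, Integer>> correct) {
        return new LinkedHashSet<>(correct);
    }

    // Scan 2: drop an "incorrect" state whenever it equals the correct state
    // produced by the same test on the unmutated program (likely equivalent).
    public static List<Map<String, Integer>> dropLikelyEquivalent(
            List<Map<String, Integer>> incorrect,
            List<Map<String, Integer>> sameTestCorrect) {
        List<Map<String, Integer>> kept = new ArrayList<>();
        for (int i = 0; i < incorrect.size(); i++)
            if (!incorrect.get(i).equals(sameTestCorrect.get(i)))
                kept.add(incorrect.get(i)); // the mutant did corrupt this state
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(dedup(List.of(Map.of("x", 1), Map.of("x", 1))).size());
        System.out.println(dropLikelyEquivalent(
                List.of(Map.of("x", 1), Map.of("x", 9)),
                List.of(Map.of("x", 1), Map.of("x", 2))));
    }
}
```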
Dictionary of Variables. Function get-dictionary-of-variables (line 5 of Algorithm 1) builds the dictionary of variables Σ that function oracle-improvement uses to create new assertions. The function picks an arbitrary state s in either S + or S − (by construction all states have the same variables), and adds all the variables in s to Σ.

Oracle Improvement
A major challenge to automatically improve assertion oracles is the huge search space of candidate solutions (A in Section 2), which grows exponentially with the number of variables and functions.
GAssert addresses this challenge with Genetic Programming (GP) [3,45]. We formulate the oracle improvement problem as a multi-objective optimization problem (MOOP) [31,34,41] with three competing objectives: (i) minimize the number of false positives (FP); (ii) minimize the number of false negatives (FN); (iii) minimize the size of the assertion, that is, the number of variables and functions in it. The latter objective helps to improve the quality of assertions, as long assertions are often difficult to understand.
Classic multi-objective evolutionary approaches, for instance NSGA-II [8,43], rely on Pareto optimality [18,39,41] to produce solutions that offer the best trade-off between competing objectives [41]. However, in our case not all assertions with an optimal trade-off between FPs and FNs are acceptable solutions. As discussed in Section 2, we aim to obtain assertions with zero FPs and the lowest number of FNs. On the other hand, primarily focusing on reducing FPs may be inadequate, as there may not be enough evolutionary pressure [45] to reduce the FNs at the same time.
Hence, we propose a co-evolutionary approach that evolves in parallel two distinct populations of assertions (Popul FP and Popul FN ) with two competing objectives: reduce the false positives (fitness function ϕ FP ) and reduce the false negatives (fitness function ϕ FN ). These populations periodically exchange their best individuals (population migration) to add promising genetic material to both populations. Eventually, Popul FP will more likely produce assertions with zero FPs and fewer FNs. In fact, the migration of best individuals adds to Popul FP assertions with a decreasing number of FNs.
Fitness Functions. Both ϕ FP and ϕ FN are multi-objective fitness functions. The former gives priority to reducing false positives, while the latter to reducing false negatives. Both functions consider the remaining objectives only in tie cases. In multi-objective optimization, the fitness of a solution is often defined by the concept of dominance (≺) [8]. While the standard definition of dominance gives the same importance to all objectives, we need a definition unbalanced towards FPs and FNs, which we define as follows:

Definition 3. FP-fitness (ϕ FP ). Given two assertions α 1 and α 2 and two sets of correct S + and incorrect S − states, α 1 dominates FP α 2 (α 1 ≺ FP α 2 ) if FP(α 1 , S + ) < FP(α 2 , S + ), or FP(α 1 , S + ) = FP(α 2 , S + ) and FN(α 1 , S − ) < FN(α 2 , S − ).

Definition 4. FN-fitness (ϕ FN ). Given two assertions α 1 and α 2 and two sets of correct S + and incorrect S − states, α 1 dominates FN α 2 (α 1 ≺ FN α 2 ) if FN(α 1 , S − ) < FN(α 2 , S − ), or FN(α 1 , S − ) = FN(α 2 , S − ) and FP(α 1 , S + ) < FP(α 2 , S + ).

In tie cases, i.e., FP(α 1 , S + ) = FP(α 2 , S + ) and FN(α 1 , S − ) = FN(α 2 , S − ), both functions favor smaller assertions. If neither α 1 ≺ α 2 nor α 2 ≺ α 1 , the choice between α 1 and α 2 is random. We now describe the details of our co-evolutionary algorithm (Algorithm 2).
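The lexicographic comparison described above (primary deficiency count, then the secondary count, then size) can be collapsed into plain comparators. This is a sketch of that ordering only; the tie cases that GAssert resolves randomly here fall back to a deterministic order, and the record name is ours.

```java
import java.util.Comparator;

// The unbalanced dominance idea as two lexicographic comparators:
// smaller compares as "better" (dominating).
public class Dominance {
    public record Scored(int fp, int fn, int size) {}

    // FP-fitness: false positives first, then false negatives, then size.
    public static final Comparator<Scored> FP_FITNESS =
            Comparator.comparingInt((Scored a) -> a.fp())
                      .thenComparingInt(a -> a.fn())
                      .thenComparingInt(a -> a.size());

    // FN-fitness: false negatives first, then false positives, then size.
    public static final Comparator<Scored> FN_FITNESS =
            Comparator.comparingInt((Scored a) -> a.fn())
                      .thenComparingInt(a -> a.fp())
                      .thenComparingInt(a -> a.size());

    public static void main(String[] args) {
        Scored a = new Scored(0, 3, 10);  // no FPs, some FNs
        Scored b = new Scored(1, 0, 5);   // no FNs, one FP
        System.out.println(FP_FITNESS.compare(a, b) < 0); // a dominates under FP-fitness
        System.out.println(FN_FITNESS.compare(a, b) > 0); // b dominates under FN-fitness
    }
}
```

Driving each population with its own comparator is what lets the two populations pull in opposite directions while migration shares their best individuals.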
Building the Initial Populations. Both populations Popul FP and Popul FN contain N assertions each. We represent an assertion α ∈ Popul as a rooted binary tree [7], where leaf nodes are variables or constants (terminals) and inner nodes are functions. Each node has a type, either Boolean or numerical: the type of a leaf node is the type of the associated variable, the type of an inner node is the type of the function output. We define the size of an assertion α, size(α), as the number of nodes in its tree representation.
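A minimal version of this tree encoding, restricted to the two operators the running example needs, could look as follows (our sketch; GAssert supports the full operator set F of Table 1):

```java
import java.util.Map;

// Rooted-tree encoding of assertions: leaves are variables, inner nodes are
// functions; size(α) counts the nodes of the tree.
public class AssertionTree {
    public interface Node {
        Object eval(Map<String, Integer> state);
        int size();
    }

    public record Var(String id) implements Node {
        public Object eval(Map<String, Integer> s) { return s.get(id); }
        public int size() { return 1; }
    }
    public record Leq(Node l, Node r) implements Node {   // numerical inputs, Boolean output
        public Object eval(Map<String, Integer> s) {
            return (Integer) l.eval(s) <= (Integer) r.eval(s);
        }
        public int size() { return 1 + l.size() + r.size(); }
    }
    public record And(Node l, Node r) implements Node {   // Boolean inputs and output
        public Object eval(Map<String, Integer> s) {
            return (Boolean) l.eval(s) && (Boolean) r.eval(s);
        }
        public int size() { return 1 + l.size() + r.size(); }
    }

    public static void main(String[] args) {
        // (min <= x) AND (min <= y): 7 nodes.
        Node a = new And(new Leq(new Var("min"), new Var("x")),
                         new Leq(new Var("min"), new Var("y")));
        System.out.println(a.size());
        System.out.println(a.eval(Map.of("x", 2, "y", 5, "min", 2)));
    }
}
```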
Function get-initial-population at lines 2 and 3 of Algorithm 2 initializes the two populations, Popul FP and Popul FN , in the same way. Half of the initial population consists of randomly-generated assertions (to guarantee genetic diversity); the other half is obtained by randomly mutating the input assertion α (to have "good" genetic material for evolution). Intuitively, an improved assertion could include fragments similar to the input assertion, thus initializing the populations with variants of α increases the chances of introducing "good" genetic material.
Algorithm 2 (function select-and-reproduce, lines 20–25):
20   Popul new ← get-best-individuals(ϕ)
21   while Popul new is not full do
22     ⟨a p1 , a p2 ⟩ ← select-parents(Popul, ϕ)
23     ⟨a o1 , a o2 ⟩ ← crossover-and-mutation(a p1 , a p2 , Σ)
24     add ⟨a o1 , a o2 ⟩ to Popul new
25   return Popul new

GAssert produces the first half of individuals with a tree factory operator that takes a type τ (either number or Boolean) and a depth d, and returns a randomly-generated assertion with root of type τ and tree depth d. Because the root of an assertion must be of Boolean type, GAssert always sets τ to Boolean, and invokes tree factory N /2 times with random values of d.

Tree Mutations. To obtain the second half of individuals, GAssert relies on two classic tree-based mutation operators:

Node Mutation changes a single node in the tree [5]. It takes as input an assertion α and one of its nodes n, and returns an assertion α 1 obtained by replacing the node n in α with a new node of the same type as n (chosen randomly).
Subtree Mutation replaces a subtree in the tree [5]. It takes as input an assertion α and one of its nodes n, and returns a new assertion α 1 obtained by substituting the subtree rooted at n with another subtree. Such a subtree is generated by the tree factory operator with the type of n as τ and a random number as d.
Stopping Criterion. Algorithm 2 evolves the two populations in parallel until either Popul FP or Popul FN contains a perfect assertion α ′ with respect to the correct and incorrect states in input (line 6 of Algorithm 2). If GAssert finds a perfect assertion before reaching a minimum number of generations, it continues the evolution process to see if it can find perfect assertions of smaller size. Algorithm 2 terminates prematurely when the overall time budget B expires or when it reaches a maximum number of generations. In both cases, GAssert returns the best generated assertion, the one with zero false positives and the lowest number of false negatives, that is, α ′ s.t. ∄α ∈ {Popul FP ∪ Popul FN} : α ≺ FP α ′ (line 15 of Algorithm 2).
Lines 9 and 10 of Algorithm 2 evolve the two populations in parallel by invoking function select-and-reproduce (lines 17-25). The function implements the classic evolutionary approach [45], which works in three consecutive steps: selection, crossover and mutation. GAssert introduces novel selection and crossover operators that are specific to the automatic oracle improvement problem.
Fitness Computation. GAssert initializes the selection process by computing the number of false positives FP(α, S +) and false negatives FN(α, S −) for each α ∈ Popul (function compute-fitness, line 17 of Algorithm 2). Both fitness functions need this information to compute the dominance relation.
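The preference used to pick the final assertion (zero false positives first, then the fewest false negatives) can be sketched as a lexicographic comparison. This is our own illustrative reading of that selection step, not GAssert's implementation; the class and field names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of picking the best generated assertion: minimize false positives
// first, then false negatives among assertions with equal FP counts.
public class BestAssertion {
    public static final class Scored {
        public final String expr;
        public final int fp, fn;
        public Scored(String expr, int fp, int fn) {
            this.expr = expr; this.fp = fp; this.fn = fn;
        }
    }

    // Lexicographic order on (FP, FN): fewer FPs always wins.
    public static final Comparator<Scored> PREC_FP =
            Comparator.<Scored>comparingInt(s -> s.fp).thenComparingInt(s -> s.fn);

    public static Scored best(List<Scored> population) {
        return population.stream().min(PREC_FP).orElseThrow(IllegalStateException::new);
    }

    public static void main(String[] args) {
        List<Scored> pop = Arrays.asList(
                new Scored("x > 0", 0, 12),
                new Scored("x >= 0", 0, 5),
                new Scored("x == 1", 3, 0));  // FPs disqualify it despite FN = 0
        System.out.println(best(pop).expr);
    }
}
```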
GAssert optimizes the fitness computation by (i) loading the S + and S − states in main memory to avoid costly re-executions of the program, (ii) parallelizing the computation, and (iii) caching the results to avoid recomputing them when the same assertion is encountered multiple times.
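The caching optimization (iii) can be sketched with a memoizing map keyed by the assertion: a repeated assertion costs one lookup instead of a re-evaluation over the in-memory states. This is an illustrative sketch under our own naming, not GAssert's code; the two evaluator functions stand in for counting FPs and FNs over S + and S −.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToIntFunction;

// Sketch of a fitness cache: FP/FN counts are memoized per assertion,
// so re-encountering the same assertion in a later generation is free.
public class FitnessCache {
    private final Map<String, int[]> cache = new ConcurrentHashMap<>();
    private int evaluations = 0;  // how many real evaluations happened

    public int[] fitness(String assertion,
                         ToIntFunction<String> fpCounter,
                         ToIntFunction<String> fnCounter) {
        return cache.computeIfAbsent(assertion, a -> {
            evaluations++;  // only reached on a cache miss
            return new int[] { fpCounter.applyAsInt(a), fnCounter.applyAsInt(a) };
        });
    }

    public int evaluations() { return evaluations; }

    public static void main(String[] args) {
        FitnessCache c = new FitnessCache();
        c.fitness("x > 0", a -> 0, a -> 7);
        c.fitness("x > 0", a -> 0, a -> 7);  // cache hit: no second evaluation
        System.out.println(c.evaluations());
    }
}
```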
Function select-and-reproduce initializes the new population Popul new (line 18 of Algorithm 2), seeding it with the best individuals (elitism) if gen % frequency-elitism = 0, and starting from the empty set otherwise. It then proceeds with parent selection, parent crossover and offspring mutation, adding the resulting offspring to Popul new until Popul new reaches size N.
Parent Selection. Function select-parents selects two parents α p1 and α p2 from Popul (line 22 of Algorithm 2). GAssert implements two different selection criteria, tournament and best-match selection, and chooses between them with a given probability.
Tournament Selection [28] is a classic GP selection criterion [45]. It runs "tournaments" among K randomly-chosen individuals and selects the winner of each tournament (the one with the highest fitness) [28]. As GAssert needs two parents, it plays two tournaments to obtain α p1 and α p2 . We choose K = 2 (the most commonly used value [26]) as it mitigates the local optima problem [45].
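Tournament selection with K = 2 reduces to drawing two random individuals and keeping the fitter one; two tournaments yield the two parents. The sketch below is a generic illustration under our own naming (here "fitter" means ordered first by the comparator, e.g. fewer false negatives), not GAssert's implementation.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Sketch of tournament selection: draw K individuals uniformly at random
// and return the one the comparator ranks best.
public class Tournament {
    public static <T> T select(List<T> pop, Comparator<T> fitter, int k, Random rnd) {
        T winner = pop.get(rnd.nextInt(pop.size()));
        for (int i = 1; i < k; i++) {
            T challenger = pop.get(rnd.nextInt(pop.size()));
            if (fitter.compare(challenger, winner) < 0) winner = challenger;
        }
        return winner;
    }

    public static void main(String[] args) {
        // Individuals abstracted to their FN counts; lower is fitter.
        List<Integer> fnCounts = Arrays.asList(12, 5, 0, 33);
        Random rnd = new Random(42);
        // Two tournaments give the two parents needed for crossover.
        Integer p1 = select(fnCounts, Comparator.<Integer>naturalOrder(), 2, rnd);
        Integer p2 = select(fnCounts, Comparator.<Integer>naturalOrder(), 2, rnd);
        System.out.println(p1 + " " + p2);
    }
}
```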
Best-match Selection is a new criterion presented in this paper, which is specific to the oracle improvement problem. The criterion exploits semantic information about the correct and incorrect states that each assertion covers. Let cov + (α,S + ) denote the subset of S + on which α evaluates to true, i.e., cov + (α,S + ) = {s + ∈ S + : α[s + ] = T } ⊆ S + . Let cov − (α,S − ) denote the subset of S − on which α evaluates to false, i.e., cov − (α,S − ) = {s − ∈ S − : α[s − ] = F } ⊆ S − . The best-match criterion selects the first parent α p1 randomly from Popul. If Popul is Popul FP , the best-match selection criterion gets the set of all assertions α 1 ∈ Popul FP such that {cov + (α 1 ,S + ) \ cov + (α,S + )} ≠ ∅. For each assertion α 1 in the set, the best-match selection criterion considers the cardinality of {cov + (α 1 ,S + ) \ cov + (α,S + )} as the weight of α 1 . It then selects the second parent α p2 from the set using a weighted random selection, where assertions with a higher weight are more likely to be selected. Symmetrically, if Popul is Popul FN , the best-match criterion considers cov − instead of cov + . Intuitively, the criterion increases the chances of crossover between two complementary individuals that are likely to yield a fitter offspring.
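The two ingredients of best-match selection, the weight |cov + (α 1 ,S + ) \ cov + (α,S + )| and the weighted random draw, can be sketched as below. States are abstracted to integer identifiers; all names are our own illustrative choices, not GAssert's API.

```java
import java.util.*;

// Sketch of best-match selection: the second parent is drawn with
// probability proportional to how many states it covers that the
// first parent misses.
public class BestMatch {
    // |cov(candidate) \ cov(firstParent)|
    public static int weight(Set<Integer> covFirst, Set<Integer> covCandidate) {
        Set<Integer> extra = new HashSet<>(covCandidate);
        extra.removeAll(covFirst);
        return extra.size();
    }

    // Weighted random selection; assumes the weights sum to a positive value.
    public static <T> T weightedPick(List<T> items, int[] weights, Random rnd) {
        int total = Arrays.stream(weights).sum();
        int r = rnd.nextInt(total);
        for (int i = 0; i < items.size(); i++) {
            r -= weights[i];
            if (r < 0) return items.get(i);
        }
        throw new IllegalStateException("unreachable for positive total weight");
    }

    public static void main(String[] args) {
        Set<Integer> covP1 = new HashSet<>(Arrays.asList(1, 2, 3)); // states α covers
        Set<Integer> covA = new HashSet<>(Arrays.asList(1, 2));     // nothing new: weight 0
        Set<Integer> covB = new HashSet<>(Arrays.asList(3, 4, 5, 6)); // 3 new states
        System.out.println(weight(covP1, covA) + " " + weight(covP1, covB));
    }
}
```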
Crossover. Function crossover-and-mutation exchanges genetic material between two parents α p1 and α p2 , producing two offspring α o1 and α o2 , which GAssert mutates (with a given probability) with the mutation operators used to initialize the two populations. GAssert implements two crossover operators, subtree and merging crossover, and chooses between them with a given probability.

Subtree Crossover [24] is the canonical tree-based crossover. Given two parents, it selects a crossover point in each parent, and creates the offspring α o1 and α o2 by swapping the subtrees rooted at each point in the corresponding tree [24].
Merging Crossover is an operator that we specifically defined for the oracle improvement problem. Given two parents α p1 and α p2 , it selects two Boolean subtrees, α 1 from α p1 and α 2 from α p2 , and creates the offspring α o1 : (α 1 AND α 2) and α o2 : (α 1 OR α 2). This operator works well in synergy with our best-match criterion, since merging two subtrees with OR and AND functions combines their semantics without disrupting them.
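The semantics of merging crossover can be illustrated by modeling assertions as predicates over a program state: the two offspring are the conjunction and the disjunction of the selected subtrees. This is our own sketch of the idea, not GAssert's tree-level implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Sketch of merging crossover: from Boolean subtrees α1 and α2 the
// offspring are (α1 AND α2) and (α1 OR α2).
public class MergingCrossover {
    public static <S> List<Predicate<S>> merge(Predicate<S> a1, Predicate<S> a2) {
        return Arrays.asList(a1.and(a2), a1.or(a2));
    }

    public static void main(String[] args) {
        Predicate<Integer> positive = x -> x > 0;
        Predicate<Integer> small = x -> x < 10;
        List<Predicate<Integer>> kids = MergingCrossover.<Integer>merge(positive, small);
        // (α1 AND α2) is the stricter child, (α1 OR α2) the more permissive one.
        System.out.println(kids.get(0).test(5) + " " + kids.get(0).test(-1)
                + " " + kids.get(1).test(-1));
    }
}
```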

Node Selectors.
A key design choice is the criterion to select the nodes of the tree α for crossover and mutation. We implemented two different selection criteria: (i) Random, which randomly selects a node in α; (ii) Mutation-based, which selects a node in α such that the subtree rooted at this node contains at least one variable v i with the following property: ∃s − ∈ S − in which the value of v i differs from its value in the corresponding state s + ∈ S + obtained by executing on the original program the same test that yielded s − . As such, v i can recognize s − as a false negative of α, and a new assertion that predicates on v i could eliminate that false negative.
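The precondition behind the mutation-based selector, finding variables whose value differs between an incorrect state s − and its corresponding correct state s + , can be sketched by diffing two state snapshots. States are modeled as maps from variable name to value; the names are illustrative, not GAssert's serialization format.

```java
import java.util.*;

// Sketch of the state diff behind the mutation-based node selector:
// a variable is a useful mutation target only if its value differs
// between the correct state s+ and the incorrect state s-.
public class StateDiff {
    public static Set<String> diffVars(Map<String, Object> sPlus, Map<String, Object> sMinus) {
        Set<String> out = new TreeSet<>();
        for (String v : sPlus.keySet()) {
            if (!Objects.equals(sPlus.get(v), sMinus.get(v))) out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> sPlus = new HashMap<>();
        sPlus.put("topOfStack", 3);
        sPlus.put("size", 4);
        Map<String, Object> sMinus = new HashMap<>();
        sMinus.put("topOfStack", 3);
        sMinus.put("size", 5);  // the mutant corrupted "size"
        // Only subtrees mentioning "size" can distinguish s- from s+.
        System.out.println(diffVars(sPlus, sMinus));
    }
}
```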
Migration. GAssert periodically exchanges the M best individuals between the two populations, where M is a hyperparameter of the algorithm (see lines 11-14 of Algorithm 2). When selecting the best individuals, GAssert considers both fitness functions, so that both populations can benefit from assertions with either the lowest number of false positives or the lowest number of false negatives.
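The migration step can be sketched as copying the M best individuals of each population into the other. In this simplified sketch individuals are abstracted to a single integer fitness (lower is better), whereas GAssert ranks them using both fitness functions; all names are illustrative.

```java
import java.util.*;

// Sketch of migration between the two populations: every
// frequency-migration generations, the M best individuals of each
// population (here: smallest fitness values) join the other population.
public class Migration {
    public static void migrate(List<Integer> popFP, List<Integer> popFN, int m) {
        List<Integer> bestFP = new ArrayList<>(popFP);
        List<Integer> bestFN = new ArrayList<>(popFN);
        Collections.sort(bestFP);  // lower fitness value = better individual
        Collections.sort(bestFN);
        popFN.addAll(bestFP.subList(0, m));  // FP-best individuals join Popul FN
        popFP.addAll(bestFN.subList(0, m));  // FN-best individuals join Popul FP
    }

    public static void main(String[] args) {
        List<Integer> popFP = new ArrayList<>(Arrays.asList(9, 1, 7));
        List<Integer> popFN = new ArrayList<>(Arrays.asList(8, 2, 6));
        migrate(popFP, popFN, 1);
        System.out.println(popFP + " " + popFN);
    }
}
```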

EVALUATION
To experimentally evaluate our approach, we developed a prototype implementation of GAssert for Java classes. We conducted a series of experiments to answer three research questions: RQ1 Is GAssert effective at improving assertion oracles? RQ2 How does GAssert compare with random (unguided) and invariant-based oracle improvement? RQ3 How does GAssert compare with human oracle improvement? RQ1 evaluates the effectiveness of our evolutionary algorithm. RQ2 checks whether our fitness functions provide useful guidance to improve assertions. Towards this goal, we compare GAssert with a version of GAssert (Random) where the guidance provided by our fitness functions is replaced by a random choice. As a further baseline, RQ2 compares GAssert with an oracle improvement process (Inv-based) that relies on the invariant generator Daikon [9] to improve assertions. RQ3 compares our automated approach with oracle improvement performed by humans.

Subjects
We conducted our experiments on 34 Java methods. We took four methods from the SimpleExamples (SE) class, used in the recent oracle improvement study by Jahangirova et al. [22], and ten methods from the Daikon subjects StackAr (SA) and QueueAr (QA), which are often used to evaluate Java invariant generators. We selected the remaining 20 methods from four popular Java libraries: Apache Commons Math (CM), Apache Commons Lang (CL), Google Guava (GG) and JTS Topology Suite (TS). From each library we randomly selected five methods with the following characteristics: (i) they contain at least five lines of code, (ii) they produce a return value, (iii) they are not recursive, (iv) they do not write to files and do not use reflection (as the outcome of such operations cannot be captured by our assertion oracles). For each of the 34 methods, we selected as the program point pp of the assertion the last exit point of the method.

Evaluation Setup
To run GAssert we need an initial test suite, and we perform mutation analysis to obtain an initial set of incorrect states, in addition to the correct states obtained by running the initial test cases on the original program. We also need some initial assertions to improve. We evaluate the improved assertions by the number of false positives and false negatives on the initial test cases and mutations. To avoid circularity in the evaluation, we collected a new set of validation test cases and mutations. The number of false positives and the mutation score obtained on the latter provide an external assessment of effectiveness.

Initial Test Cases and Mutations.
We obtained the initial correct states (S + 0 ) by executing an initial test suite (T 0 ) on the instrumented version of P. We generated T 0 by running EvoSuite [12,13] (v. 1.0.6) with the branch coverage criterion and a time budget of one minute [12]. We performed ten runs with different random seeds to collect a diverse and large set of initial test cases.
We obtained the initial incorrect states (S − 0 ) by executing the initial test suite (T 0 ) on a set of initial mutations (M 0 ) of the instrumented version of P. We obtained such mutations by running Major [23] (v. 1.3.4) enabling all types of supported mutants.
Columns "|S + 0 |" and "|S − 0 |" of Table 4 show the cardinality of the initial states. Note that for some subjects |S − 0 | < |S + 0 |, which is counterintuitive. This is because GAssert removes redundant states from both S + 0 and S − 0 , and likely-equivalent states from S − 0 .

Initial Assertion Oracles. We obtained an initial assertion for our subjects by running the dynamic invariant generator Daikon [9] (v. 5.7.2) with the initial test suite as input. We chose Daikon because it is a fully-fledged tool and the de-facto invariant generator for Java methods [9]. Because Daikon also accepts a set of observer methods as input, we ran Daikon with the same observer methods that GAssert used to serialize program states.
Daikon generates invariants considering all possible exit points of a method (e.g., returns and exception throw statements). However, our oracle improvement process focuses on a single program point pp. To ensure that Daikon generates invariants that consider only the exit point at pp, we automatically remove all the initial test cases that do not reach pp (i.e., do not produce program states).
GAssert initializes half of the populations by adding the complete assertion, each of its conjuncts taken individually, and random mutations of each of these assertions. GAssert initializes the other half of the populations with randomly generated assertions.

Validation Test Cases and Mutations.
To evaluate if the improved assertions generalize well to unseen correct and incorrect states, we generated new test cases (T v ) and mutations (M v ). We obtained such validation sets using the tools Randoop (v. 4.2.0) and PIT (v. 1.4.0), respectively. These tools are different from the ones that provide test cases and mutations to the oracle improvement process (EvoSuite, Major and OASIs). Different tools are expected to obtain different test cases and mutations.

[Table: GAssert hyperparameter settings
  bound on the size of the assertions: 50        prob. of crossover: 90%
  size of each of the populations (N): 1,000     prob. of mutation: 20%
  minimum number of generations: 100             prob. of tournament parent selection: 50%
  maximum number of generations: 10,000          prob. of best-match parent selection: 50%
  frequency of elitism (every X gen): 1          prob. of merging crossover: 50%
  frequency of migration (every X gen): 100      prob. of random crossover: 50%
  number of assertions for elitism: 10           prob. of mutate-state-diff node selector: 30%
  number of assertions to migrate (M): 160       prob. of random node selector: 70%]
For each subject, we ran Randoop ten times with different random seeds, using at least 100 test cases or three minutes as the stopping criterion. We ran PIT enabling all types of supported mutants. Columns "|T v |" and "|M v |" of Table 4 show the cardinality of the validation test cases and mutations, respectively.

Quality Metrics for Assertions. We evaluate an improved assertion α ′ by comparing the number of FPs and FNs of α ′ with the number of FPs and FNs of the initial assertion α wrt the initial and validation sets of test cases (T) and mutations (M).
Before evaluating the assertions, we removed from all the test cases the test oracle assertions that EvoSuite and Randoop generated. We then inserted the assertion under evaluation (either α or α ′) into the method under analysis at the specified pp.
To evaluate an assertion with the initial sets, we executed T 0 and counted the number of failing tests, which represents the number of false positives FP(α, S + 0 ), FP 0 in short. If FP 0 is zero, we compute FN(α, S − 0 ), FN 0 in short, by running mutation testing with mutations M 0 and test cases T 0 . If FP 0 is greater than zero, we cannot run mutation testing, because mutation testing requires a green test suite. In such a case, if the evaluated assertion has the form assert(α 1 AND α 2 AND α 3), we consider each of the smaller assertions assert(α 1), assert(α 2) and assert(α 3) and remove those that have false positives. We concatenate the remaining smaller conditions with ANDs and perform mutation testing with M 0 and T 0 for this reduced assertion at pp. If all smaller assertions have false positives, we report FN 0 for the assertion oracle assert(true).
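The reduction procedure above can be sketched as a filter over the conjuncts. The predicate below stands in for re-running T 0 against a single conjunct; all names are our own illustrative choices.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch of the reduction step: split a conjunction into its conjuncts,
// drop those that raise false positives, re-join the rest with AND, and
// fall back to assert(true) if every conjunct has false positives.
public class ConjunctReduction {
    public static String reduce(List<String> conjuncts, Predicate<String> hasFalsePositives) {
        List<String> kept = conjuncts.stream()
                .filter(hasFalsePositives.negate())
                .collect(Collectors.toList());
        return kept.isEmpty() ? "true" : String.join(" AND ", kept);
    }

    public static void main(String[] args) {
        List<String> conjuncts = Arrays.asList("x >= 0", "x < size", "ret == x");
        // Suppose only "ret == x" fails on some correct execution.
        System.out.println(reduce(conjuncts, c -> c.equals("ret == x")));
    }
}
```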

RQ1: Effectiveness
Columns "initial assertion α" of Table 4 indicate the quality of the initial assertions generated with Daikon. For eleven subjects (SE4, CM1, CM2, CM3, CM5, CL1, GG1, GG4, TS2, TS3 and TS4), Daikon does not generate any invariant, and in these cases we consider assert(true) as the initial assertion. The false positives of α on the initial tests (FP 0 ) are always zero (except for subject CL5). This is an expected result, because Daikon uses the execution traces of the initial tests to generate α. The size of the initial assertions ranges from 1 to 147 (28 on average). For six subjects the size is over 50, confirming that Daikon can generate many (often redundant) preconditions and postconditions [9]. The number of false negatives on the initial tests (FN 0 ) ranges from 0 to 1,750 (219 on average). The number of false positives on the validation tests (FP v ) is always zero, except for two subjects (SA5 and QA1), indicating that the initial assertions perform well with unseen tests. This result may depend on the high branch coverage of the EvoSuite-generated initial tests. The mutation score on the validation set ranges from 0% to 100% (38% on average).
Columns "GAssert improved α ′ (median)" indicate the quality of the GAssert-improved assertions. We report the median values of the ten executions. The number of iterations ranges from 1 to 9 (3 on average). The median FP on the initial tests (FP 0 ) is zero for all subjects, as GAssert produces by construction assertions with zero false positives wrt S + 0 . The median FN on the initial tests and mutations (FN 0 ) ranges from 0 to 1,282. The average is 125, which is 42.92% less than the average FN 0 of the initial assertion (= 219). For 24 subjects, the median FN 0 of the GAssert-improved assertion α ′ is less than the FN 0 of the initial assertion α. Although for nine subjects it is the same (including subjects SE2, QA1 and QA4 with FN 0 = 0), GAssert often drastically reduced the size of the initial assertions. These results demonstrate that our evolutionary algorithm is effective in improving assertion oracles. Only for subject SA2 does α ′ have more false negatives than α. This is because, when generating invariants, Daikon relies on its own set of helper functions that are not supported by GAssert, and they thus had to be excluded from the initial assertion when passing it to GAssert for improvement.
The median FP on the validation set (FP v ) is zero for all subjects. The median mutation score on the validation set (M v %) ranges from 0% to 100% (58% on average). The M v % of α ′ is higher than that of α for 16 subjects, with an increase of 34% on average.

RQ2: Comparison with Random and Invariant-Based Oracle Improvement
In this research question we compare GAssert with two baselines: GAssert with no guidance by the fitness functions (Random) and the invariant inference of Daikon (Inv-based). We set up the process so that these two baselines are used as part of the same iterative oracle improvement process of GAssert, described in Algorithm 1. The only difference among GAssert, Random and Inv-based is how each of them performs the oracle improvement step (line 6 of Algorithm 1). When running and evaluating Random and Inv-based we used the same evaluation setup as RQ1.
Random is a variant of GAssert, in which there is no evolutionary pressure in the population because any guidance by the fitness functions is disabled. We obtained Random by modifying GAssert as follows: (i) we replaced the tournament and best-match selection with random selection; (ii) we disabled the Merging crossover and Mutation-based node selector; (iii) we disabled elitism and migration. Random terminates the random evolution of the two populations when either population finds a perfect assertion or the time budget expires. In the latter case, Random outputs the best assertion (wrt ϕ FP ) among all those generated so far.
Although true random search would have been the ideal baseline, enumerating and (uniformly) sampling the search space is infeasible because of its huge size. There are 1.978 × 10^27 possible binary trees (assertion oracles) for each program point (the Catalan number [44] for the maximum tree size of 50). This is only a lower bound of the search space, because for each of these trees we need to consider all possible valid assignments of nodes to variables and functions. As such, we opted for a variant of GAssert that uses crossover and mutation operations to explore the search space, but without any guidance by the fitness functions.
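The quoted bound can be checked directly: the number of binary trees with n nodes is the n-th Catalan number C_n = C(2n, n)/(n + 1), and C_50 is indeed about 1.978 × 10^27. The sketch below computes it exactly with BigInteger.

```java
import java.math.BigInteger;

// Sanity check of the search-space bound: the n-th Catalan number
// C_n = C(2n, n) / (n + 1) counts the binary trees with n nodes.
public class CatalanBound {
    public static BigInteger catalan(int n) {
        BigInteger binom = BigInteger.ONE;
        // Build C(2n, n) incrementally as prod_{k=1..n} (n + k) / k;
        // each intermediate value is the integer C(n + k, k), so the
        // division is always exact.
        for (int k = 1; k <= n; k++) {
            binom = binom.multiply(BigInteger.valueOf(n + k))
                         .divide(BigInteger.valueOf(k));
        }
        return binom.divide(BigInteger.valueOf(n + 1));
    }

    public static void main(String[] args) {
        System.out.println(catalan(50));  // 28-digit number starting 1978...
    }
}
```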
Columns "Random improved α ′ (median)" indicate the quality of the assertions returned by Random. The results show that the improved assertion of GAssert dominates that of Random for 20 (59%) and 17 (50%) subjects considering the initial and validation sets, respectively. In such cases, GAssert assertions are substantially better than those of Random. For 6 (18%) and 10 (25%) of the subjects Random assertions outperform GAssert ones, but in these cases the difference is minimal. For the remaining cases the two tools show similar results.
Inv-based is an oracle improvement approach that relies on dynamic invariant generation to improve oracle assertions. We chose Daikon (v. 5.7.2) to build Inv-based because it is the only publicly available tool that meets our requirements: (i) it works with Java programs, (ii) it generates executable Java-like assertions, (iii) it takes as input a list of observer methods (for a fair comparison, it should use the same observer methods used by GAssert).
Daikon does not aim to improve a given assertion α, nor does it rely on incorrect executions (FN). However, Daikon can rely on the test cases that OASIs outputs, which represent evidence of false positives of α. More specifically, the Inv-based oracle improvement process repeats the following two steps until the time budget expires, it is not able to generate any invariant, or OASIs does not find any false positives for α: (i) execute the current test suite and compute the invariant α; (ii) invoke OASIs to get the test cases that reveal FPs for α and add them to the test suite.
Column "Daikon improved α ′ (median)" indicates the quality of the assertions returned by Daikon. For ten subjects we did not run Inv-based because Daikon did not generate an initial assertion, and thus we compare GAssert and Inv-based on the remaining subjects. Considering the fitness function ϕ FP , the improved assertion of GAssert dominates that of Inv-based for 19 (59%) and 15 (63%) subjects considering the initial and evaluation sets, respectively. In such cases, GAssert assertions are substantially better than those of Inv-based. For 2 (8%) and 7 (29%) subjects Inv-based assertions dominate GAssert ones considering the initial and evaluation sets, respectively. In the remaining cases neither the GAssert nor the Inv-based assertions dominate each other.
Figure 2 plots the median FN 0 (left) and M v % (right) for each pair of initial and improved assertions wrt GAssert, Random and Inv-based. If a point is on the diagonal, the corresponding approach did not improve the false negatives or the mutation score wrt the initial assertion. For the initial sets (left plot), most of the GAssert points are below the diagonal, which means that GAssert produced improved assertions with fewer FNs. For the validation sets (right plot), most of the GAssert points are above the diagonal, which means that GAssert produced improved assertions with a higher mutation score. The plots also show that in most cases GAssert outperforms both Random and Inv-based oracle improvement.

RQ3: Comparison with Human Oracle Improvement
Jahangirova et al. [22] conducted a human study to assess the ability of humans to improve assertion oracles. They performed this study in two settings: (i) the assertion is improved manually by humans without any tool support (M); (ii) the assertion is improved in an iterative setting with the use of OASIs (M + O). Overall, they recruited 29 humans to participate in the study. The subject methods they considered were SA3 and SA4 from the StackAr class. Moreover, the authors also share the data collected from Amazon Mechanical Turk, which consists of manually improved assertions for four simple methods, produced by 74 different crowd-workers. As the results are publicly available [22], we compare these assertions with the ones produced by GAssert.
We ran GAssert with the input assertions that were provided to the study participants. Then, as in RQ1 and RQ2, we measured the oracle deficiencies of the initial and GAssert-improved assertions (wrt the validation set). We then compared these values to the oracle deficiencies of the assertions improved by humans. Column "type" of Table 5 indicates whether the improvement was purely manual (M) or included OASIs (M + O). As for four methods in our set the oracle improvement was performed by crowd-workers, and no action was taken to ensure that they had a proper background or experience for such a task, we applied an additional filtering step to the list of assertions for these methods. We excluded the assertions that do not improve the initial assertion, i.e., that do not have fewer false positives or a higher mutation score. Column "Ov." shows the overall number of assertions available and Column "Exc." shows the number of assertions that were excluded.
The results show that GAssert is always able to improve the initial assertion and achieve a higher mutation score. Moreover, the median values across the 10 runs of GAssert and across the number of human participants for manual improvement are always the same. Column "GPB" (GAssert Performs Better) reports the number of manual improvements that achieve a lower mutation score than GAssert does (10% of cases).

RELATED WORK
GAssert is the first fully-automated technique to improve oracle assertions. The closest related work is on invariant generation, oracle quality, and oracle improvement.
Invariant Generation. Dynamic invariant generators produce Boolean expressions (called program invariants) that evaluate to true for all the executions of an input test suite [6,9,10,17,32,33,37]. GAssert improves assertion oracles by reducing their false positives and false negatives, and as such can improve the assertions produced by invariant generators, which are known to be incomplete and imprecise when used as assertion oracles [6,29,40]. Ratcliff et al. proposed an evolutionary approach [35] for invariant generation that leverages negative counterexamples to rank the invariants. Differently from GAssert, their approach uses negative counterexamples in a post-processing phase and not as a part of the fitness function. Moreover, GAssert uses OASIs to actively generate positive and negative counterexamples. In addition, GAssert considers both externally (parameters, return values) and internally observable variables (local variables, private fields).
Oracle Quality Metrics. Research on measuring oracle quality mostly focuses on assertions in the test cases (test oracles) [19,25,38]. For instance, EvoSuite [11,12] and a parameterized test case generator proposed by Fraser and Zeller [14] select from an initial set of possible assertions those that kill the highest number of mutations. These studies propose metrics to select test oracles with no guidance on how to improve them. GAssert focuses on assertions in the program, and not in the tests, evaluates the quality of oracles in terms of both false positives and false negatives, and actively improves program oracles by generating new assertions.
Oracle Improvement. Zhang et al.'s iDiscovery approach [46] improves the accuracy and completeness of invariants by iterating a feedback loop between Daikon and symbolic execution. The invariants generated by iDiscovery are still limited to the set of Daikon templates; therefore, they are not as expressive as those generated by GAssert. OASIs [20,21] relies on humans to improve a given oracle assertion so that it does not suffer from the reported oracle deficiencies. Given oracle deficiencies identified by OASIs, GAssert automates the difficult task of improving assertions with a novel evolutionary algorithm.

CONCLUSION
Improving assertion oracles is important to increase the fault-detection capabilities of both manually written and automatically generated test cases [4]. In this paper, we presented GAssert, the first automated approach to improve assertion oracles.
Our experiments indicate that GAssert-improved assertions have zero false positives, and the number of false negatives is largely reduced with respect to the initial Daikon assertions. The few sample cases with independently obtained human improvements indicate that GAssert is competitive with, and sometimes even better than, human improvements.
In the future, we plan to increase the expressiveness of GAssert assertions by also considering universal and existential quantifiers. We also plan to investigate how difficult it is for a developer to understand the assertions produced by GAssert.