Search-based Automated Play Testing of Computer Games: a model-based approach

. Computer game technology is increasingly more complex and applied in a wide variety of domains, beyond entertainment, such as training and educational scenarios. Testing games is a diﬃcult task requiring a lot of manual eﬀort since the interaction space in the game is very ﬁne grained and requires a certain level of intelligence that cannot be easily automated. This makes testing a costly activity in the overall development of games. This paper presents a model-based formulation of game play testing in such a way that search-based testing can be applied for test generation. An abstraction of the desired game behaviour is captured in an extended ﬁnite state machine (EFSM) and search-based algorithms are used to derive abstract tests from the model, which are then concretised into action sequences that are executed on the game under test. The approach is implemented in a prototype tool EvoMBT . We carried out experiments on a 3D game to assess the suitability of the approach in general, and search-based test generation in particular. We applied 5 search algorithms for test generation on three diﬀerent models of the game. Results show that search algorithms are able to achieve reasonable coverage on models: between 75% and 100% for the small and medium sized models, and between 29% and 56% for the bigger model. Mutation analysis shows that on the actual game application tests kill up to 99% of mutants. Tests have also revealed previously unknown faults.


Introduction
A common approach to test a computer game is by play testing it, where human users are deployed to play the game in order to find flaws, usability issues, This work is a result of iv4XR project, funded by the EU Horizon 2020 research and innovation programme under grant agreement No 856716. and to give feedback on the game user experience. This process is expensive, so introducing automation could greatly reduce the cost. Unfortunately, so far there is not much automated testing technology available for computer games. A handful that exist are tailored for specific games (and not publicly available).
Computer games also come in a great variety of genres such as action, adventure, puzzle, strategy, building, etc [1]. The difference between genres (or even within the same genre) is large, e.g. an action game is usually a fast moving event driven system but the story is linear, while an adventure game is much less event driven, but the story is often complex. While such variety is good to keep users entertained, it certainly does not help in developing an automated testing approach that would work for all, or at least most, game genres. This paper presents model-based approach for automated play testing of computer games, relying on search-based testing for generating tests. Outside the Game domain, model-based testing (MBT) [12] has long been known as a versatile testing approach. Similarly, search-based testing (SBT) [9] has proven effective for generating tests, in particular when the search space is large and exact methods are not applicable. This paper aims to formulate game play testing in such a way that SBT can be applied for automated test generation.
We present an approach for modelling game behaviour using extended finite state machines (EFSMs) in such a way that the tester can model the desired aspect of the game behaviour. Once the model is defined, SBT is applied for test generation from the model, following a typical MBT cycle.
The approach is implemented in a prototype tool EvoMBT which allows the generation of abstract tests from EFSM models by applying search-based algorithms. Empirical evaluation is carried out by applying EvoMBT on a 3D game called Lab Recruits. The concretisation and execution of abstract tests on Lab Recruits is implemented by means of an agent-based API of Lab Recruits. Results show that the proposed application of SBT and MBT are effective in achieving reasonable levels of coverage on the model and exposing faults.
The main contributions of this work are: 1. an approach combining SBT and MBT for automated game play testing 2. a tool EvoMBT for generation of tests from an EFSM model, allowing experimentation with existing search algorithms 3. publicly available artifacts (tool, models, data) that enable reproducibility of results and facilitate further research.
The rest of the paper is organised as follows: Section 2 presents the running example used throughout the paper. Section 3 discusses issues in modelling games and presents definitions used in the rest of the paper. Section 4 introduces the testing problem and Section 5 presents our proposed search-based test generation approach. Experimental results are presented in Section 6 and related work is discussed in Section 7. Section 8 concludes, and outlines future work.

Running Example
This section introduces Lab Recruits 3 , a 3D game developed for experimenting with intelligent agents. The application allows the definition of mazes, a set of rooms connected by doors. Each door is opened by one or more buttons, and each button activates one or more doors. The goal is to find the path to reach a certain room by opening doors in the right order. The game can be played by both humans and artificial agents [10]. Lab Recruits levels are defined as csv (comma-separated value) human-readable files allowing researchers to specify their tests of variable complexity.
As a running example, Figure 1 shows a level of the Lab Recruits game named buttonDoors1. The level features three doors, door1, door2, and door3, and four buttons, b0, b1, b2, and b3. Door door1 is activated by buttons b1, b2, and b3, while door2 and door3 are connected only to b2. Note that b0 is not connected, therefore pressing it has not effect in the game. Agent agent1 aims to reach the room marked with a star, and therefore to open door3. A possible path requires agent1 to press b1 to open door1 and then b2 to open door3. Since b2 also acts on door1, at this point agent1 cannot reach door3, but need to traverse door2 and press b3 to open door1. Now, agent1 walks through door1 and door3, finally reaching the star room. Even if the layout of the level is simple, the path to reach the final room is not trivial and it is not trivial for automated play testing.

Modelling Games
Computer games are stateful systems, and hence using state-based models to model them is natural. They are also very complex systems, so abstraction will have to be applied, but not to the degree that we lose control and observability of the system. Plain finite state machines (FSM) or labelled transition systems are in most cases either cumbersome or insufficient, and we will need to use, for instance, EFSM that allow variables and assignments to be superimposed over a finite state model. The running example presented in Section 2 with buttons and doors whose states change dynamically can be modelled with a plain FSM, but its size would be quite large; whereas an EFSM model would be much more succinct and easy to understand.
Unlike other types of systems, modelling a computer game has an additional challenge due to the presence of the 'world' where the game is played on. E.g. the Lab Recruits game in Section 2 is played in a virtual lab building as its 'world'. A world imposes certain constraints. Triggering a state might require a certain interactable to be interacted with, but a test agent can only do that if the interactable is physically reachable from its current position. If there is a wall between them, this is obviously problematic. So in terms of modelling, such physical constraints need to be taken into account as well. That is, a transition in the model should be translatable to a concrete sequence of actions by the agent, that are also physically possible. The same goes with observation. When the model requires that a certain condition should be checked, e.g. as an invariant to check, or as the guarding condition of a transition, it implies that the agent should be able to observe the condition. In the game setup this is not always given. A wall might be blocking the agent's sight, and hence the agent might first need to move itself to a spot where it can observe the said condition. In terms of modelling, this means that introducing guards and invariants in the model implies that there should exist a feasible way for the agent to actually observe them.
In the next subsection, we present the EFSM notation we adopt in the rest of the paper, and introduce the modelling of Lab Recruits (see Section 2) which takes into consideration the issues mentioned above regarding the modelling of games for testing purposes.

EFSM Notation
An FSM models the behaviour of a system as a finite set of states connected by transitions, where a transition could be fired by an input and returns an output. EFSMs [4] introduce data information into FSM behavioral representation. An EFSM has an internal memory, a set of variables, to store data and extends FSM transitions with guards and update transformations. Guards specify whether a transition can be performed according to the values of the variables stored in the memory. Updates allow changing variable values as a result of a transition. In this paper we adopt the following definition of EFSM.   Figure 2 has three types of transitions: solid edges for free travel, when the agent can move from one entity to the other without traversing a door; this type of transition has empty enabling and update functions. Dotted transitions model guarded movements that happen when the agent walks through a door; the enabling function check the status of the corresponding variable, while update function is empty. Dashed self loop transitions are for toggle actions, i.e., the agent presses the button; the update function changes the status of the doors connected to the pressed button. Note that the concept of 'transitions' here also incorporates 'world travel'. That is, a transition in the model is guaranteed to be physically possible in the Lab Recruits world, and furthermore the guards guarding the transitions can be physically checked as well (through some concretisation that guides the agent to observe them, achieved via automated navigation in the underlying testing framework [11]). Input set I = {travel, toggle} defines the actions an agent can perform, i.e., move (travel) or press a button (toggle). Output set O is empty.

EFSM model for Lab Recruits
The game starts with the agent near b0 and with all the doors closed, thereforex = (0, 0, 0). A feasible path to reach star room in Figure 1 has to include transition 16 (t16, for short), and therefore opening door3. The agent1 starts  Figure 1 going to b1 (t30) and pressing it (t23). Update transformation changesx to (1, 0, 0), i.e., door1 is open. Then, agent1 goes to d1m (t24) and walks through door1 reaching d1p (t20). Enabling function of t20 is 1, as x 1 = 1. After that, agent1 goes to b2 (t13) and presses it (t9), changingx to (0, 1, 1), so door1 is closed, while door2 and door3 are open. At this point, the agent goes to door2 (t8), crosses it (t5), reaches b3 (t3), and toggles it (t1). Button b3 opens door1 so thatx = (1, 1, 1), i.e., all the doors are open. Now, agent1 can reach star room following, for instance, t1, t4, t6, t14, t21, and t16. This gives an example of a feasible path from the initial position of the agent to the star room mimicking the steps an agent has to perform. We also implemented a random level generator for Lab Recruits for experimental purposes. The generator builds on the observation that the EFSM of a Lab Recruits level has a specific structure. First, a room is represented by the set of buttons it contains. Given the total number of buttons n buttons in a level and the mean mean buttons number of buttons in a room, the algorithm extracts random integers from a Poisson distribution with mean mean buttons, until all buttons are used. For instance, given n buttons = 10 and mean buttons = 2, the generated sequence 2, 3, 2, 1, 2 corresponds to the number of buttons in a level with 5 rooms. Then, given the number of doors n doors, we randomly connect two rooms until all doors are used. The algorithm guarantees that there are not unconnected rooms. Given a room, the corresponding EFSM model has a state for each button and door side, and all the states are connected by a free travel Each button has a self loop with update function that represents button-door connections. Finally, the models corresponding to linked rooms are connected by guarded travel transition, where the enabling function checks the status of the door. The generated EFSM can be transformed into the corresponding csv file level and opened on Lab Recruits.

Problem Definition
The play testing problem involves finding a sequence of actions that achieve a desired goal in the game. Given the model of the system under test (SUT), the play testing problem could be represented as a coverage problem on the model. Testing a specific play action in the game (corresponding to a transition T in the model) is equivalent to finding a prefix play that would reach the source state of T , and subsequently a suffix play to verify the effect of T . Hence, generating a test suite from the model that would cover all transitions corresponds to exercising the corresponding game play actions in the SUT. Stronger coverage criteria, such as k-transition coverage and path coverage, represent more rigorous interactions of game play actions.

Test Generation
Our proposed approach follows the generic model-based test generation approach where the SUT is abstracted into a model which is then used to generate abstract tests. The abstract tests are then concretised into concrete tests that can be executed on the SUT.
The goal of the work presented in this paper is to investigate the feasibility of applying the model-based approach by incorporating search-based test generation for the test generation phase.
The abstraction phase, where the model of the SUT is built, typically involves human involvement as it requires a good understanding of the behaviour of the SUT. In our experiments, we have used models which were crafted manually as well as randomly generated ones. However, the test generation approach presented here is independent of how the model is generated, as long as it is as described in Section 3.1. The concretisation and execution phases are specific to the SUT and could be implemented in different ways, depending on the nature of the SUT. For our experiments we have built automated transformers from abstract tests to concrete tests, and adopted an agent-based API provided by the SUT for executing the tests. For a different SUT, different concretisation (and execution) mechanisms are needed, however the generation of abstract tests remains the same, as long as the model of the SUT is provided.
In the remainder of this section, we present the search-based test generation approach for deriving abstract tests from the model of the SUT.

Search-based test generation
Test generation from models could be driven by different goals. In this paper we outline a search-based approach that can be applied to find test suits that satisfy a desired model coverage criterion, e.g., transition coverage. We present the various ingredients needed for applying a search algorithm for test generation, including individual representation, search operators, and fitness function.
Individual representation Given an EFSM, we represent an individual as a path (sequence of transitions), starting from the initial state of the model (see Section 3.1). Individuals can be of different length, up to a pre-defined maximum. For our running example (see Figure 2), I 1 = t 27 , t 28 , t 18 , t 21 represents an example of an individual. Note that paths in the model may or may not be feasible, hence an individual, as generated initially, is not guaranteed to be feasible.
Search operators With individuals represented as paths in the model, different operators could be implemented. Here we describe crossover and mutation operators that we used in our experiments. Clearly, other operators could be implemented and experimented with.
Crossover: one possible way of implementing crossover is to adopt a straightforward application of single point relative crossover. Given two individuals, a common state is chosen at random and the tails of the two individuals are swapped. For our running example, if I 1 = t 30 , t 26 , t 16 , t 11 and I 2 = t 29 , t 21 , t 17 , t 25 , crossover at state d3m results in offspring O 1 = t 30 , t 26 , t 17 , t 25 and O 2 = t 29 , t 21 , t 16 , t 11 .
Mutation: we propose three mutation operators, applied with equal probability: 1) insert self transition: insert a self transition on a randomly chosen state of the model, if such a transition is allowed 2) delete self transition: remove a self transition at random 3) delete a transition: remove a transition at random.
Fitness function Given an EFSM model and a given coverage criterion, the fitness function should guide the search towards covering all coverage targets. However, since the individual may not be feasible, the fitness function should also guide the search towards turning the individual into a feasible one. As a result, the fitness function has two components: 1) related to path feasibility, and 2) related to the search target. A high level algorithm of the fitness function we adopted is shown in Algorithm 1. To calculate the fitness of an individual with respect to a coverage target, first the individual is executed on the model and the execution trace as well as the outcome of the execution are returned (line 6 in Algorithm 1). If the individual is a feasible path, then the algorithm checks to see if the current target is present in the individual. If present (line 9) then target is covered, otherwise, the fitness value should estimate the distance from satisfying the target. In this case, we opt for a simple heuristic, i.e., penalising the individual by a predefined constant value (PENALTY1 ). Other heuristics could be applied here as well. If however the individual happens to be infeasible (line 14 in Algorithm 1, this means that a transition guard in the path represented by the individual has failed. In this case, we compute the approach level and branch distance for the path (lines 15 and 16). Approach level is computed as the number of transitions in the path yet to be traversed. Branch distance is computed based on the guard expression of the failing transition, as typically done in code-based testing [9]. We then check whether the individual contains the current target (line 17). If yes, no penalty is applied, otherwise, a penalty is applied (PENALTY2 PENALTY1 ). Finally, the algorithm computes the Search algorithms Once the individual encoding, search operators, and fitness function are defined, existing search algorithms could be applied to generate tests. In practice though, the corresponding machinery for implementing the test generation is needed. We have implemented a prototype tool EvoMBT that uses EvoSuite [5] as a library. EvoMBT implements all the model related parts, including the operators discussed above. It implements EvoSuite's interfaces in such a way that search algorithms already implemented in EvoSuite can be used out-of-the-box. Details are discussed in Section 6.

Evaluation
In this section we present the experiment we carried out in order to get insight into the feasibility of the proposed test generation approach combining search based algorithms with model based testing.

Prototype: EvoMBT
EvoMBT provides an implementation of the EFSM used in this paper and the necessary machinery for generating/executing tests from/on the model, compute fitness values, and collect coverage. EvoMBT uses search algorithms implemented in EvoSuite [5] (i.e., EvoSuite is used as a library). EvoMBT currently implements state and transition coverage criteria, relative point crossover, and a number of mutation operators. At the moment, several search algorithm found in EvoSuite can be used with EvoMBT without any modification. For the purpose of experimentation, EvoMBT also implements mutation operations on the Lab Recruits application, and enables concretisation and execution of generated tests on Lab Recruits (both mutated and original). It generates different reports that enable analysis of results as well as debugging of eventual faults. EvoMBT is publicly available in Github: https://github.com/iv4xr-project/iv4xr-mbt. We also provide an executable jar with all the necessary resources, and additional plots that could not fit in the paper, here: https://doi.org/10.5281/zenodo.4768470.

Models of the system under test
We use three different models of Lab Recruits: i) buttonDoors1: the running example (see Section 2) with 10 states and 30 transitions, ii) randomMedium: randomly generated model as described in Section 3.1 with n doors = 8 and n buttons = 10 having 26 states and 116 transitions, and iii) randomLarge: randomly generated with n doors = 15 and n buttons = 20 having 50 states and 194 transitions.

Experiment setup
Experiments are aimed at assessing: feasibility of search-based algorithms for generating abstract test sequences from the model, practicality of abstract tests for execution on the actual system under test, and fault finding potential of the generated tests. Consequently we formulate the following research questions to guide our experimental evaluation: RQ1 -Suitability of SBT how suitable are search based algorithms for test generation from the models? RQ1.1 -Model coverage how much of the models are covered by the test generation algorithms? RQ2 -Test execution are the model-based tests feasible in terms of execution on the actual application? RQ3 -Fault finding what is the fault finding potential of the tests generated from the models?
For RQ1, we report on the search algorithms we used for test generation and the coverage they achieved on the models (RQ1.1). For this purpose we use transition coverage criterion, computed as the ratio of covered transitions to the total number of transitions in the model. For RQ2, we measure the test execution time on the Lab Recruits application. For RQ3, we measure the mutation score of the tests on mutants injected into the Lab Recruits application. Mutation score is computed as the ratio of killed mutants to the total number of mutants generated. Experimental settings For the search algorithms, we kept the default values in EvoSuite. For search budget, we use 300 seconds, collecting statistical data every 10 seconds. Experiments were run on computers with Intel Core i7 processors with 8 cores @2.80GHz and 8GB memory, running Ubuntu Linux.
Experiment procedure For RQ1.1 we run each algorithm on a given model 20 times, to account for the random nature of the algorithms. Hence, we performed 5 algorithms × 3 (models) × 20 (repetitions) = 300 runs for a total of 300 × 300 (seconds) = 90000 seconds (25 hours). For RQ2, we report test execution times on the Lab Recruits application for all test suites generated. For RQ3, we report the mutation scores for all algorithms generated on buttonDoors1, randomMedium, and randomLarge.

Results
Suitability of SBT (RQ1) The first set of results are related to the suitability of search based testing for the generation of tests from models. We have applied 5 different search algorithms: MONOTONIC GA, MOSA, NSGAII, SPEA2, STEADY STATE GA for the generation of tests from the models. Figure 3 shows the coverage achieved by each algorithm on the three models of Lab Recruits.
As can be seen from Figure 3, the algorithms achieve different levels of coverage on the three models. On the running example (buttonDoors1), which is the least complex of the three with 30 transitions in total, all algorithms achieve high levels of coverage with MOSA achieving 100%. On the randomMedium model which has 116 transitions, MOSA still achieves full coverage, while the other algorithms achieve less than for the smaller model. On the largest of the three, which has 194 transitions, the coverage achieved by the algorithms decreases with MOSA achieving 59% median coverage while the others achieve below 37%. Given the size of the model, the results could potentially improve if search budget is increased. As can be seen in Figure 4, the trend shows that the coverage is likely to increase with increased search budget.
We answer RQ1 positively: different search algorithms could be applied for test generation from models, achieving reasonable levels of transition coverage.  Fault finding (RQ3) With RQ3, we assess the fault finding potential of the generated tests. We created mutants in the Lab Recruits application in which the association between buttons and doors is changed, i.e., a link between a button and a door is removed. We created a number of mutants and executed the generated test suites on each mutant. Given that the tests may not all run successfully on the original application (see Table 1, 'Failed' column), we executed only the passing tests on the mutants, and calculated the mutation score.
The results, presented in

Related Work
Although SBT has been successful for various types of software, its application in computer games has not been much studied. Directly using a search algorithm to search for test sequences without any model has not been attempted, as far we as know. The search space is far too large for such an approach to work. Instead, existing works tend to employ search algorithms for optimizing learning-based automated agents. For example Holmgård et al [6] uses Monte Carlo Search Tree (MCTS) based agents to replace human play testers. A genetic algorithm is used to evolve MCTS' selection policy towards a desired play style. The case study is small; a game played in a 10×20 grid. It is unclear if the approach would scale to bigger games.
There were very few studies, as far as we could find, on the use of MBT for testing computer games, e.g. [13,2,7]. Ariyurek et al [2] use a scenario graph for generating abstract test sequences. Such a graph is essentially an FSM whose states are decorated with a set of predicates that abstractly describe the state of a game under test. MCTS is used to search for the concrete sequence of actions that implement the steps in an abstract test sequence. The case studies are however small scale games, played in a grid not larger than 10×11; it is not clear if MCTS would scale to a larger search space. To deal with larger game worlds Prasetya et al. investigate the use of navigation mesh as a model [11] of the game world under test, subjected to their agent-based automated testing framework iv4XR/aplib [10]. The idea is taken from pathplanning, where the walkable areas of a physical or virtual world is divided into a finite set of connected shapes (e.g. triangles). This reduces the initially infinite search space into a finite graph which is then used as a model to guide agents' navigation, e.g. using A*.
Iftikhar et al [7] use UML state machine, which is an EFSM, to model an open source variant of the Super Mario game. The study does not however fully explore the implication of using extended FSMs. For generating tests, an N+ strategy is used [3], that is aimed at covering all transitions and round trips. This strategy is not strong enough to handle an EFSM with complex constellations of conditions. It essentially unrolls the FSM into a tree where along any full path in the tree no node is repeating, except if it is the last node.
Outside the game domain, search-based algorithms are used for generating tests from EFSMs. Many of the works are focused on finding valid paths from the models and eventually covering predefined goals (e.g., [8]). Our work is aimed at exploring the feasibility of SBT for automated play testing via modelling. In this regard, existing works on SBT from EFSMs are complementary with ours, and could eventually be experimented with in EvoMBT so as to increase the effectiveness and efficiency of test generation. For instance, the fitness function of Kalaji et al [8] could be implemented in EvoMBT in order to assess its feasibility for the play testing use case.

Conclusion and Future Work
We have presented an approach for automated game play testing by employing the combined application of search-based and model-based testing. The main objective of the work presented is exploratory in nature where we tried to assess the suitability of search-based testing for automated game play testing. Game play behavior is abstracted into an EFSM model and search-based algorithms are used to generate abstract tests, which are then converted into concrete tests that can be executed on the game. Experimental results are promising where a number of search algorithms were experimented with and achieved reasonable model coverage, mutation score, and exposed real bugs in the game under test.
The current work makes a number of choices with respect to heuristics used for test generation, such as fitness function and search operators. There are different alternatives that could be implemented and experimented with as part of future work.