On Transforming Model-based Tests into Code: A Systematic Literature Review

Model-based test design is increasingly being applied in practice and studied in research. Model-Based Testing (MBT) exploits abstract models of the software behavior to generate abstract tests, which are then transformed into concrete tests ready to run on the code. Given that abstract tests are designed to cover models but are run on code (after transformation), the effectiveness of MBT depends on whether model coverage also ensures coverage of key functional code. In this article, we investigate how MBT approaches generate tests from model specifications and how the coverage of tests designed strictly based on the model translates to code coverage. We used snowballing to conduct a systematic literature review. We started with three primary studies, which we refer to as the initial seeds. At the end of our search iterations, we analyzed 30 studies that helped answer our research questions. More specifically, this article characterizes how test sets generated at the model level are mapped and applied to the source code level, discusses how tests are generated from the model specifications, analyzes how the test coverage of models relates to the test coverage of the code when the same test set is executed, and identifies the technologies and software development tasks that are the focus of the selected studies. Finally, we identify common characteristics and limitations that impact the research and practice of MBT: (i) some studies did not fully describe how tools transform abstract tests into concrete tests; (ii) some studies overlooked the computational cost of model-based approaches; and (iii) some studies found evidence of a robust correlation between decision coverage at the model level and branch coverage at the code level. We also noted that most primary studies omitted essential details about the experiments. Copyright © 2023 John Wiley & Sons, Ltd.

MDE employs more elaborate models to support the evaluation of quality attributes such as reliability and security during development and model-driven evolution. A hallmark of MDD and MDE is that models are kept at a level where it is still relatively easy to make large-scale changes without getting bogged down in implementation details [6]. This ability to separate abstraction layers has many benefits, including speeding up the overall development, reducing the effort of making changes, and producing more reliable software. Partly because of the expense of changing software after deployment, MDD tends to be applied more commonly in embedded and real-time systems such as transportation systems and electronic appliances.
On the one hand, if a model is executable, such as models written in executable UML [37] or Simulink [17], then by running the model it is possible to test some of its aspects. Software engineers may also exploit sophisticated model-to-code transformation tools to automatically generate code from the model. On the other hand, some modeling languages, such as UML statecharts, are not executable and do not have sufficiently defined semantics to support automatic model-to-code transformation. Thus, these models are often transformed into code by hand.
A common application of models is to design test cases. An early example derived test cases from (non-executable) UML statecharts [41]. The test cases covered specific elements of the statechart at the model level, then were run on the code-level implementation. Subsequent papers referred to test cases defined at the model level as abstract tests, while their corresponding code-level implementations are called concrete tests. This concept, called model-based testing (MBT) [54], is now used widely throughout the software industry and has led to hundreds of papers exploring various aspects of MBT. A key question is related to coverage. If test cases are designed to cover specific aspects of the model (nodes, edges, logic predicates, etc.), what is the relationship between model coverage and code coverage? Does the code include decisions that were not in the model? Are covered elements of the model dispersed into different places in the code? As the code changes over time, how can the tests be kept up to date? One of the most significant challenges arises due to standards. For example, the US Federal Aviation Administration, the European Union Aviation Safety Agency, and Transport Canada require that safety-critical software on commercial airplanes and air-traffic control systems be tested to a stringent standard [46]. The same standard is often looked to as a goal in other transportation industries, such as trains and automobiles. The coverage requirements in the standard are defined at the code level, not the model level. Thus, compliance cannot be based on test cases derived from models. Software companies must show that test cases that run on the code will fill in any "gaps" in coverage introduced by the model-to-code transformation.
These gaps between model coverage and code coverage also make traceability crucial. To measure and ensure code coverage for model-based tests, engineers must be able to trace from model-level elements to code-level elements, and from model-level coverage to code-level coverage. Research into these crucial questions has been going on since 1990, and this article is the first attempt to catalog and categorize relevant papers. A particular challenge is that these papers have been published in many different conferences and journals, and employed diverging sets of terms. This makes it hard for researchers and practitioners to get a clear idea of the current state of the art.
We have carried out a systematic literature review (SLR), which is a study to identify, select, and critically appraise research to answer clearly formulated questions [26]. Our SLR focuses on scenarios in which abstract models are defined prior to testing and investigates how source code coverage can be gauged from test sets generated based on model-based testing approaches. More specifically, our SLR makes the following contributions:
• it characterizes how test sets generated at the model level are mapped and applied to the source-code level;
• it discusses how tests are generated from the model specifications when MBT is applied;
• it analyzes the relationship between the test coverage of models and the test coverage of the code when the same test set is executed; and
• it provides an overview of all selected studies, including varied classifications applied to them, which is made available online as complementary material.
Our SLR applies a snowballing process to search for papers of interest. Snowballing recursively analyzes references cited in related papers, and citations to those papers [58]. In the SLR domain, such papers are termed primary studies [26] and typically describe research results from well-founded experimental procedures or from early research approaches. We started our snowballing process with three core primary studies. During three recursions, we analyzed 498 non-duplicate primary studies.
After the study selection phases, 33 peer-reviewed primary studies (including the 3 seeds) passed our study criteria, from which 30 were analyzed in this SLR given that they present either original or updated contributions. We categorized the selected studies into several groups. Given our focus on test coverage at the model and code levels, our key categorizations concern whether and how each study addresses the transformation of abstract tests to concrete tests, as well as the level of traceability of software elements, and the coverage of such elements, across the abstraction levels. Finally, we also categorized the selected studies based on the adopted technologies and the level of automation for test generation.
The remainder of this article is organized as follows. Section 2 summarizes concepts related to software testing, MDE, and model-based testing, and brings a brief discussion regarding test coverage at the model and code levels. Section 3 provides details of our SLR protocol, and the criteria and procedures we adopted to select and analyze the selected studies. Section 4 summarizes the results from our search. Section 5 addresses our research questions. Section 6 discusses threats to the validity of our work, and Section 7 presents prior papers that summarized related literature reviews. Finally, Section 8 presents our conclusions as well as implications and recommendations for future research.

BACKGROUND
This section introduces concepts related to model-based testing. We first discuss software testing in general, independent of whether the testing is applied to models or code. Then we discuss concepts related to utilizing models to develop software, and then focus on model-based testing. Finally, we introduce the key issues for testing when transforming abstract, model-based tests to code.

General Software Testing
We start with general concepts and terms related to software testing [4]. Generally, we view testing as an act of executing some software artifact on inputs designed to assess whether the behavior is as intended. Note that the term artifact is intended to include anything that can be executed, including but not limited to code, models, and requirements. The term system under test (SUT) refers to the artifact being tested. Researchers also specialize this term to particular artifacts such as module under test, method under test, predicate under test, and clause under test.
Test inputs are the key input values used to satisfy the requirements for testing. Test inputs are sometimes called test vectors. To be able to run the tests, the inputs are usually embedded in automated scripts or methods (such as JUnit methods). Automated tests include additional elements beyond test inputs, including test oracles that decide whether the software behaves as intended. Test oracles can be implemented as assertions in JUnit.
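As a minimal illustration of these terms (shown in Python rather than JUnit purely for brevity; the method under test and all names are hypothetical), an automated test bundles a test input with a test oracle:

```python
# Hypothetical method under test, standing in for the SUT.
def absolute_value(x):
    return x if x >= 0 else -x

# An automated test embeds the test input and a test oracle.
def test_absolute_value():
    test_input = -5                      # the test input (one entry of a test vector)
    result = absolute_value(test_input)  # execute the artifact under test
    assert result == 5                   # the test oracle: checks the intended behavior

test_absolute_value()  # raises AssertionError only if the oracle is violated
```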

Testing Coverage
A common technique is to design test cases that ensure some sort of coverage, on the theory that if some element of the software artifact is not covered, then we do not know whether its behavior is acceptable [4]. The simplest coverage criterion is node coverage, which requires that each node in a graph is covered. This equates to statement coverage if the graph represents code, and to state coverage if the graph represents a state machine. Thus, node coverage, state coverage, and statement coverage amount to the same thing. A slightly more strenuous criterion is edge coverage, which requires that each edge in a graph is covered. This equates to decision coverage or branch coverage, depending on the type of artifact. Node and edge coverage treat predicates as simple black boxes. Thus, a predicate with three clauses (A && B || C) is only evaluated to true and false, without considering the different clauses. Modified Condition and Decision Coverage (MCDC) [11] requires that each individual clause evaluates to both true and false, with the other clauses being such that the clause under test determines the final value of the predicate. Thus, the example predicate p = (A && B || C) can be MCDC-covered with the test set {T T F, F T F, T F F, F F T, F F F }. A final structural coverage criterion is data-flow coverage, which requires that definitions of variables (defs) reach specific uses of those values on at least one path.
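To make the MCDC example above concrete, the following sketch (Python, purely illustrative) checks that the given test set satisfies MCDC for p = (A && B || C): for each clause, it looks for a pair of tests that differ only in that clause and flip the predicate's outcome.

```python
from itertools import product

def p(a, b, c):
    """The example predicate p = (A && B || C)."""
    return (a and b) or c

# The test set {TTF, FTF, TFF, FFT, FFF} from the text, as (A, B, C) triples.
tests = [(True, True, False), (False, True, False), (True, False, False),
         (False, False, True), (False, False, False)]

def mcdc_satisfied(pred, tests, n_clauses=3):
    """For each clause, require a pair of tests differing only in that clause
    whose predicate outcomes also differ (the clause determines the result)."""
    covered = set()
    for i in range(n_clauses):
        for t1, t2 in product(tests, repeat=2):
            differs_only_i = all((t1[j] != t2[j]) == (j == i)
                                 for j in range(n_clauses))
            if differs_only_i and pred(*t1) != pred(*t2):
                covered.add(i)
    return covered == set(range(n_clauses))

print(mcdc_satisfied(p, tests))  # → True
```

For instance, the pair TTF/FTF demonstrates that clause A alone can flip p, and FFT/FFF does the same for clause C.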
Test cases are sometimes derived from requirements, where for each requirement, at least one test case has to ensure that the requirement is satisfied (or, covered).When requirements are used, testers usually refer to behavioral or functional requirements that describe how the software should behave.
But they can also refer to non-functional requirements such as performance, timeliness, liveness, stability, smoothness, and responsiveness, among others.

Model-Driven Engineering
Model-driven engineering (MDE) [57] is an approach to software development that starts with an abstract design model that ignores concerns regarding the implementation language, operating system, and target hardware. An executable model is written in a language with enough semantics that it can be simulated directly. Models without such semantics are called non-executable models. Executable models are sometimes called formal models. Models are transformed to code either automatically by special-purpose compilers or by hand. When transformed by hand, the process is often called model-based design (MBD).
The studies we summarize in this article do not always apply the terminology consistently, so we introduce several terms here so we can emphasize their overlap and differences. A platform-independent model (PIM) [40] describes the behavior of a system in an abstract modeling language; this is also called the model level. The complementary platform-specific model (PSM) [40] is the code level, that is, the system implemented in a programming language such as C or Java. Some studies also adopt the term computation independent model (CIM) [40] for models that do not depend on a computation model. We also find the term model-in-the-loop to describe software development processes that include abstract models, and processor-in-the-loop to describe implementations at the code level.

Model-Based Testing
Models are defined at an abstract, high level, making them convenient artifacts for designing test cases [41]. Model-based testing (MBT) designs test cases from an abstract model (model-level or abstract tests), and transforms them into test cases that can be run on the code (code-level or concrete tests). The term computation-independent tests (CIT) is sometimes used for model-level tests and computation-dependent tests for code-level tests. When test cases are designed from informal models or code-based models, such as control-flow graphs, we sometimes use the term model-driven testing (MDT).
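The abstract-to-concrete mapping can be pictured with a small sketch (Python; the turnstile state machine, its implementation, and all names are invented for illustration, not drawn from any study): an abstract test is a path of model-level actions, and an adapter table concretizes each action into a call on the implementation.

```python
class TurnstileSUT:
    """Hypothetical code-level implementation of a turnstile state machine."""
    def __init__(self):
        self.locked = True
    def insert_coin(self):
        self.locked = False
    def push(self):
        if self.locked:
            raise RuntimeError("turnstile is locked")
        self.locked = True

def concretize(abstract_test, sut):
    """Replay an abstract test (model-level action names) on the SUT via an
    adapter mapping; each abstract step becomes a concrete method call."""
    mapping = {"coin": sut.insert_coin, "push": sut.push}
    for action in abstract_test:
        mapping[action]()
    return sut

# Abstract test: a path through the state-machine model.
sut = concretize(["coin", "push"], TurnstileSUT())
assert sut.locked  # oracle: after coin + push the turnstile is locked again
```

The adapter table is where most real MBT tools differ; the point here is only that the abstract test itself never mentions code-level identifiers.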

Issues of Transforming Models to Code
When models are transformed to code, whether automatically or by hand, the structure of the code might differ from the structure in the design model [7,39]. This brings up a serious issue: the code-level test cases may not achieve the same level of coverage as the model-level test cases. This is serious for aeronautics software in particular, and transportation software in general. For example, the US Federal Aviation Administration (FAA) requires full code coverage to certify safety-critical avionics software [46], and requires that each test must be derived from the requirements. This makes it imperative that when model-based test cases are transformed to the concrete level, testers are able to ensure traceability from abstract tests to concrete tests [2]. When code is automatically derived from models, potential problems with the transformation motivate the application of so-called witness functions (in the form of traceability [3]) that allow differences to be discovered.

STUDY SETUP
This section provides key information about the protocol we defined for our SLR.We follow the guidelines for conducting secondary studies proposed by Kitchenham et al. [26] and Wohlin [58].
Our full protocol is available online.

Goals and Research Questions
The general goal of this SLR is to analyze the state of the art in model-based testing with respect to how source code coverage can be measured from test sets generated using model-based testing approaches. This goal is achieved by answering the following research questions (RQs):
• RQ1: How are test suites that are developed at the model level mapped to the code level, where the code may or may not be created by automatic transformation?
Our RQs emphasize transformation details because we believe that, by having a more complete understanding of "under the hood" transformation details, testers can have a better idea of how to improve test cases at both model and code levels. As a result, testing and language design principles can be brought to bear on the model-to-code transformation problem. Specifically, by being more knowledgeable about details of the modeling language, testers can help the language evolve by making certain constructs/elements more explicit (i.e., targeted by the transformation).
It is also worth mentioning that, as stated by Stürmer et al. [50], rendering high-level models into code poses a set of challenges that differ in some respects from the challenges inherent to traditional compiler design. Most notably, the semantics of the modeling language often is not explicit, and may depend on layout information (e.g., the position of the states). Consequently, code generation entails more than simply performing stepwise transformations from the model representation into code: in effect, a series of computations must be derived from the analysis of data dependencies. Therefore, understanding the model-to-code transformer backend and how it turns models into code can help testers arrive at a better operational understanding of the transformation and allow them to focus on corner cases (boundary values and code elements that are seldom covered during MBT).

Inclusion and Exclusion Criteria
The inclusion criteria (Section 3.2.1) and exclusion criteria (Section 3.2.2) for study selection are presented in this section. A study was selected if it passed (((I1 ∧ I2) ∨ I3) ∧ I4). Studies that fulfilled at least one of the exclusion criteria were not selected.
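The selection predicate can be read as a simple boolean function (a sketch only; I1–I4 are the inclusion criteria defined below):

```python
def selected(i1, i2, i3, i4):
    """Study selection predicate (((I1 and I2) or I3) and I4)."""
    return ((i1 and i2) or i3) and i4

# A peer-reviewed study satisfying only I3 (mapping of model-level test
# suites to code) is selected even without automatic transformation (I2):
print(selected(False, False, True, True))   # → True
# Without peer review (I4), no combination of I1-I3 suffices:
print(selected(True, True, True, False))    # → False
```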
The protocol available online provides more details about how these criteria were applied. Regarding E2, although secondary studies were not included in our final set of selected studies, some may be of interest as a source for new studies; therefore, they were analyzed in an additional study selection round (details in Section 4).

Inclusion (I) Criteria
I1: The study proposes/applies model-based testing for/to models.
I2: The study addresses automatic model-to-code (or model-to-text) transformation.
I3: The study addresses the mapping from test suites developed at the model level to the source code level.
I4: The study must have undergone peer review.
We highlight that this literature review particularly focuses on research that addresses model-to-model or model-to-code transformations, with an emphasis on the automatic transformation of models to code. Note that our RQs and inclusion criteria (particularly, I2 and I3) reflect this intent. This focus is narrower than that of MBT in general, which is what I1 would capture if applied individually.
While the combination of I1 with I2 allows for the selection of studies that explore MBT and forward engineering of the models to lower levels of abstraction, I3 alone leads to the selection of studies that establish relationships between tests that evolve from the model level to the code level. The three rightmost columns of Tables II and III show the criteria each selected study fulfilled.
Regarding I4, we only selected studies published in scholarly venues (which are well-established types within the research community), namely, conference proceedings, symposium proceedings, workshop proceedings, and scientific journals.

Exclusion (E) Criteria
E1: The study emphasizes hardware testing.
E2: The study is a secondary study.
E3: The study is a peer-reviewed study that has not been published in journals, conferences, symposia, or workshop proceedings (e.g., Ph.D. theses and technical reports).
E4: The study is not written in English.

Search Strategy
Our work centers on a literature study. As we found it very hard to define search strings that would yield a manageable number of primary studies of interest to this study, we employed a snowballing process based on three initial studies. Snowballing, also referred to as citation analysis, is a literature search method that can take one of two forms: backward snowballing or forward snowballing [26,58].
Backward snowballing starts the search from a set of studies that are known to be relevant (either a start set, or the current set of selected studies).It involves searching the references sections of the studies.Forward snowballing entails finding all studies that cite a study from either the start set or the current set of selected studies.Both search methods update the set of selected studies in an iterative fashion; only the studies included in the previous step are considered in each search iteration, and both backward and forward snowballing end when no new primary studies are found in the search iterations.
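The iterative process just described can be sketched as a fixpoint loop (an illustrative sketch only; `references` and `citations` are hypothetical lookups, e.g. backed by a bibliographic database, and `passes_criteria` stands for the study criteria of Section 3.2):

```python
def snowball(seeds, references, citations, passes_criteria):
    """Iterate backward and forward snowballing until no new studies appear."""
    selected = set(seeds)
    frontier = set(seeds)
    while frontier:                       # stop when an iteration adds nothing
        candidates = set()
        for study in frontier:
            candidates |= set(references(study))   # backward snowballing
            candidates |= set(citations(study))    # forward snowballing
        new = {s for s in candidates - selected if passes_criteria(s)}
        selected |= new
        frontier = new                    # only newly selected studies drive the next round
    return selected

# Toy citation graph: A cites B; C cites A; D cites C.
refs = {"A": ["B"]}
cits = {"A": ["C"], "C": ["D"]}
found = snowball({"A"}, lambda s: refs.get(s, []),
                 lambda s: cits.get(s, []), lambda s: True)
print(sorted(found))  # → ['A', 'B', 'C', 'D']
```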
Three reviewers were in charge of running the search process. During backward snowballing, they extracted references from the background, related work, and experimental setup sections of the study under analysis. For studies that did not include these sections, they considered other sections such as the introduction. Details of our analysis procedures can also be found in the spreadsheet that is available online. Our initial set of primary studies, the seeds, includes the three studies listed in the sequence. In order to achieve a comprehensive understanding of the research landscape in this field, we selected studies based on three criteria: age, prominence, and relevance. Our goal was to identify a visionary paper that identified the problem and a mature paper that represented the current state of the art.
Along with these two papers, we included a paper from the research group that inspired this work.
While we acknowledge the limitations of age and prominence as selection criteria, we believe that taking these criteria into consideration was necessary to identify key contributions to the field. Older studies tend to have more citations since these studies have had more time to accumulate citations, while recent studies may not have had the opportunity to accumulate as many citations. However, prominent studies may have gained attention more quickly, hence these studies may have been cited more frequently in a shorter amount of time. In hindsight, our selection criteria led to the identification of seeds that have proven to be valuable for the snowballing process, leading to the identification of additional key contributions in the field. Thus, we believe that our approach was a useful starting point for our study. We analyze this stopping criterion in Section 6 (Threats to Validity).

Procedures for Data Extraction and Analysis
To answer the RQs described in Section 3.1, we extracted from primary studies the information outlined in a data extraction form. Before starting the review, the data extraction form was revised by all involved reviewers. Beyond data extraction fields intended to gather general information about the primary studies (e.g., title, authors, year, and publication venue), the form includes the following fields: (1) The general goal of the study; (2) A description of the study from the perspective of each research question; (3) The main results of the study; (4) The conclusion of the study, cf. the original authors; (5) The conclusion of the study, cf. the reviewers; (6) The target specification language (at the model level); (7) The target programming language (at the code level); (8) The tool used for model-to-text transformation (for the main software artifacts); (9) The tool used for automatic test case generation (at the model level); (10) The tool used for test set transformation (from model to source code); (11) …
After extracting data from all selected studies, the three reviewers in charge of the primary study selection checked all extracted data to make sure the data is accurate and ready for further analysis.
We intentionally structured the data extraction form in terms of the RQs to facilitate the identification of pieces of information that would help us develop the discussion as well as outline the conclusions with respect to each RQ. In particular: fields (1) to (5) supported the discussions regarding RQ1, RQ1.1, RQ2, and RQ3; fields (6) to (10) supported the discussions regarding RQ4; field (11) supported the discussions regarding RQ3; and fields (12) to (15) added details to enrich the descriptions and discussions presented in this article. All studies that provided relevant information with respect to a given RQ are listed in the beginning of the sections that discuss the RQs (namely, Sections 5.1 to 5.5).

SEARCH ITERATIONS AND RESULTS
Table I summarizes the search rounds. It shows the number of backward references and forward citations analyzed in each study selection round, and shows which studies we have selected. For the sake of completeness, the table includes the initial seeds in Round 0. The analysis of forward citations was updated in February 2020. Columns 4, 5, and 8 show two values for each entry regarding forward snowballing: the left-hand values refer to the first analysis of forward citations, and the right-hand values refer to the most recent analysis. As an example, for study P0003 (column 2), we analyzed 12 studies retrieved on March 3, 2018, and an additional 12 studies retrieved on February 18, 2020. From these, none were included in our dataset (column 8). The table provides the following details:
• The study IDs and references (column 2). The study IDs are composed of a prefix P followed by a sequential number assigned to each study we retrieved through either backward or forward snowballing.
• The number of analyzed backward references and forward citations with respect to each seed (columns 3 and 4, respectively). The numbers of backward references and forward citations listed in the table refer only to non-duplicate entries (i.e., entries that did not appear in a previously analyzed study).
• The date on which forward citations were retrieved with the Google Scholar search engine (column 5).
• The number of selected studies through backward snowballing and which studies these are (columns 6 and 7).
• The number of selected studies through forward snowballing and which studies these are (columns 8 and 9).In column 9, studies with a * prefix were selected in the forward snowballing update.

Note that two additional rounds (Additional Round in Table I) were performed. The first concerns the analysis of a secondary study (ID P0237) found in Round 2, from which we retrieved and analyzed the references. The round named Additional Round (from experts) refers to the analysis of studies suggested by experts, which was done in February 2022.
In total, we analyzed 180 backward references and 318 forward citations. From the backward references, 17 studies were selected, whereas 16 studies were selected from forward citations. From the 33 selected studies, P0064 [52] was subsumed by P0253 [51]; furthermore, P0498 [15] and P0499 [35] were subsumed by P0487 [14]. Therefore, we ended up with a set of 30 studies that we analyze in the next sections of this article.
To illustrate the process of analyzing a particular study from the start set, let us consider study P0003, by Briand et al. [9], titled Testing the Untestable: Model Testing of Complex Software-intensive Systems. We have observations from both backward and forward snowballing.
• Backward snowballing: We analyzed the "Background & State of the Art" section of the study, since the study is not a conventional paper (it was published in the Visions of 2025 and Beyond
• Forward snowballing: We analyzed 12 forward citations to this study in March 2018, and another 12 in February 2020. None were selected.
Tables II and III list the 30 studies we analyze in this article. They show the study ID (column ID), the snowballing iteration round (column R) that reflects the first detection of a study, the snowballing technique (column B/F, for 'B'ackward or 'F'orward), the reference entry (column Ref.), the list of authors (column Author(s)), the study title (column Title), the venue in which the study was published or presented (column Venue), and the results of the application of the inclusion criteria (columns I1, I2, and I3). In the column that indicates the round, "4" represents the forward snowballing update, "a1" represents the Additional Round (from SLR), and "e1" represents the Additional Round (from experts).
Figure 1 depicts the distribution of selected studies per publisher. IEEE Xplore includes the most studies in our SLR (twelve studies), followed by the ACM Digital Library (five studies) and Springer SpringerLink (four studies).
Figure 2 shows the citation map between the selected studies. Continuous edges indicate studies retrieved via backward snowballing; in these cases, a study in a destination node was cited by the study in the origin node (e.g., P0003 cited P0010 and P0011). Dashed edges indicate studies retrieved via forward snowballing; in these cases, a study in an origin node cited the study in the destination node (e.g., P0010 is cited by P0071, P0086, P0443, P0448, and P0451). Studies with no incoming and outgoing edges were included based on experts' suggestions (namely, P0496, P0497, P0498, and P0499). In the citation map, the set of 30 studies we analyze in this article is composed of the 3 studies shown in white background (original seeds) and the 27 studies shown in light gray background (selected studies).
The top of Figure 2 has a timeline for study publication. Starting from the left-hand side, the graph shows that the most recent selected studies were published in 2019. Figure 2 also provides a transitive trace between studies selected in our SLR. The start set (initial seeds) is composed of P0003 [9], P0004 [10], and P0005 [18]. Taking P0005 as an example, we see that it was influenced, among others, by P0045; in turn, P0045 influenced P0234, which influenced P0374, P0375, and P0463. In Round 0, studies P0003, P0004, and P0005 are listed in the Selected Backward column just for convenience; these were the original seeds upon which we started the snowballing process.
In Additional Round (from experts), studies are listed in columns related to backward snowballing for convenience.

ANALYSIS BASED ON THE RESEARCH QUESTIONS
This section provides answers to the RQs that were defined in Section 3. Table IV classifies the studies based on the research questions RQ1 to RQ3 (separate tables are shown in Section 5.5 to support the discussion regarding RQ4). We discuss each RQ in turn.
In the first paragraphs of Sections 5.1 to 5.4 we present the characteristics that we considered to group the studies that helped us draw answers to the RQs. Beyond this, we discuss the studies in ascending chronological order, with a few exceptional cases which involve studies that are closely related (e.g., pieces of research that were evolved by the same research group) or studies that contributed to the RQ answers only to a limited extent. As a proof of concept, the authors described how their approach can be implemented in a modeling-environment-agnostic fashion through a configurable tool chain that can render functional requirements modeled using activity diagrams, state charts, and sequence diagrams into executable test cases for various outputs. Therefore, although Drave et al. emphasized the description of the proposed approach, they also provided some insights into how their approach can be realized.

Discussion
Regarding studies in which the authors stated that test cases developed at the model level are also applied to test the code without providing details of the test transformation, the approach of Shokry and Hinchey [49] [16,27] relies on the Scilab/Xcos tool chain to generate test cases for models and re-execute them to test the code. In their approach, the code must be automatically generated from the models by the Scilab/Xcos tool. Amalfitano et al. [3] utilized ordinary spreadsheets to specify test cases that can be executed at both levels. The spreadsheet is automatically processed by a legacy, homemade testing environment. To enable this, the code must be automatically generated from MATLAB/Simulink models, but the authors did not provide further details about how tests are handled in the legacy testing environment.
In Aniculaesei et al.'s approach [5], system requirements must be formalized in the Linear Temporal Logic (LTL) language, which then underlies the generation of test cases. As long as the same set of requirements is used as a basis for modeling the system with the Scade language, both models are assumed to be consistent, and automatic system and test code generation allows for the execution of the test cases at the code level.
Similarly to the approach proposed by Veanes et al. [56], the approach of Vanhecke et al. [55] transforms test cases from model to code by initially requiring the definition of abstract test cases in XML, utilizing mappings for the interface, actions, and assertions of the system under test. The abstract tests are later transformed into concrete tests that encompass verification steps and sequences of operations that interact with the system under test.
Drave et al. [14] proposed an approach that is modeling environment agnostic in the sense that the approach does not prescribe a modeling environment. To provide the software tooling that supports such an approach, the authors used the MontiCore language workbench to develop a domain-specific language tool, termed activity diagram (AD) for SMArDT (AD4S). Additionally, the authors developed a parser that can transform ADs in extensible markup language (XML) into AD4S. In this context, the output of a given modeling tool has to be transformed to XML before being parsed into AD4S. AD4S turns the XML representation of models into another textual representation (i.e. the AD4S representation), which in turn can be used to derive test cases that can be stored in a format that is executable by functional test execution tools.
In summary, several sources of information are required to map the test cases across the abstraction levels. We grouped the studies that helped us answer RQ2 into four groups, as follows: studies that explored stepwise model transformation but still require human intervention in the last transformation steps [30,31,32,45,55]; studies that automated test case generation all the way to code generation [2,5,7,12,13,14,18,20,21,25,28,29,38,45,49,53]; studies that relied on executable models and code [10,18,28,51]; and, finally, studies that dealt with test case generation from models in ways that differ from the others discussed in this section [16,23,27,36,56].
Regarding studies that explored stepwise model transformation but still require human intervention in the last transformation steps, these approaches are semiautomatic given that human intervention is required in the final stage of transforming a lower-level model representation into code. For instance, Li and Offutt [32] generated test cases that cover all transitions (edge coverage) and all 2-transition sequences (edge-pair coverage [4]) on UML state machine diagrams, with tool support provided by the STALE framework.
Regarding studies that automated test case generation all the way to code generation, some approaches, such as the one by Conrad [12], ensure that models are transformed into functionally equivalent code. Conrad, for example, exploited a testing-based approach to gauge the functional equivalence of the model and the resulting code. In previous work, Conrad et al. [13] emphasized test case generation in the context of back-to-back testing of electronic control unit (ECU) software.
This type of test emphasizes the equivalence between the test object (i.e. model) and its reference (i.e. generated code).
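The edge and edge-pair criteria that Li and Offutt [32] apply to state machine diagrams can be sketched as enumerating test requirements over a transition list; the toy state machine below (states, events, and transitions) is invented for illustration.

```python
# A toy UML state machine as a list of transitions (source, event, target).
# The states and events are made up for illustration.
TRANSITIONS = [
    ("Idle", "start", "Running"),
    ("Running", "pause", "Paused"),
    ("Paused", "resume", "Running"),
    ("Running", "stop", "Idle"),
]

def edge_requirements(transitions):
    """Edge coverage: every transition must be taken at least once."""
    return [(s, t) for s, _, t in transitions]

def edge_pair_requirements(transitions):
    """Edge-pair coverage: every pair of adjacent transitions."""
    pairs = []
    for s1, _, t1 in transitions:
        for s2, _, t2 in transitions:
            if t1 == s2:          # second transition starts where first ends
                pairs.append(((s1, t1), (s2, t2)))
    return pairs

print(len(edge_requirements(TRANSITIONS)))       # 4 edges
print(len(edge_pair_requirements(TRANSITIONS)))  # 6 adjacent pairs
```

Note how edge-pair coverage yields more test requirements than edge coverage on the same model, which is why 2-transition sequences exercise behavior that single transitions miss.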
Fraternali and Tisi [21] developed a multi-level test generation approach and a transformation framework to align two streams of transformation: from computation-independent models to code, and from computation-independent test specifications to executable test scripts. The test scripts are updated by mappings that can be applied when model changes take place.
The approach proposed by Lamancha et al. [28] exploits a model-to-model transformation of UML models into UML Testing Profile models, from which model-level test cases are generated. Other approaches rely on model checking: by exhaustively exploring the model's state space, it is possible to achieve complete coverage and to determine unreachability of model elements.
AutoMOTGen, by Mohalik et al. [38], is an example of a tool that has been developed to automatically generate tests from Simulink/Stateflow models using model checking. Aniculaesei et al. [5] sought to explore model checking for the automatic generation of test cases based on requirements for cruise control systems in the automotive industry. Essentially, the authors devised an approach in which system requirements are formalized in the Linear Temporal Logic (LTL) language, which is then used to generate test cases. Additionally, if the same set of requirements underlies the modeling of the system with the Scade language, both models are assumed to be consistent; hence, automatic system and test code generation allows for the execution of the test cases at the code level.
Some studies rely on executable models and code; these studies assume that test cases can be generated and run on the models, and that the resulting test cases can be transformed into code or directly executed, depending on the syntax. The approaches investigated in such studies include test code generators whose inputs and outputs are executable models. Stürmer et al. [51], for instance, map model elements to code so that test cases can be executed on both artifacts, allowing for optimizations during code generation as long as the optimizations are clearly specified. This allows the model elements to be traced to code, including changes made by the optimizer. The model and the code generated from it can be considered functionally equivalent if they both lead to compatible output data when executed with the same input data [51].
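A minimal sketch of this input-output equivalence check (the basis of back-to-back testing) follows; `model_step` and `code_step` are hypothetical stand-ins for executing the model and the auto-generated code, here intentionally implemented identically as saturating counters.

```python
# Back-to-back check sketch: both functions are made-up stand-ins,
# not any study's actual artifacts.
def model_step(x):            # "model-level" behavior
    return min(x + 1, 10)

def code_step(x):             # "generated-code" behavior
    return min(x + 1, 10)

def back_to_back(inputs, tolerance=0):
    """Feed the same inputs to model and code; report mismatches."""
    mismatches = []
    for x in inputs:
        m, c = model_step(x), code_step(x)
        if abs(m - c) > tolerance:   # compatible output data?
            mismatches.append((x, m, c))
    return mismatches

print(back_to_back(range(0, 15)))  # an empty list signals equivalence
```

A nonzero `tolerance` would model the "compatible" (rather than identical) outputs that arise, for example, from floating-point differences between simulation and target code.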
Lamancha et al. [28], as another example (and previously described in this section), devised a framework that, after transforming UML models into UML Testing Profile models, can derive the source code of the test cases from the testing profile models.
With respect to other studies that addressed research on test case generation from models, it is worth noting that some model representations used internally may not be readily executable. Nonetheless, integrated tool environments for model exploration and validation, test case generation, and test execution against an auto-generated implementation of the system under test can be developed. An example of such an integrated tool environment is Spec Explorer, by Veanes et al. [56], which is a tool for testing reactive, object-oriented software systems. In the context of Spec Explorer, the system's behavior is described by models written in the language Spec# (an extension of C#) or AsmL. Fundamentally, a model in Spec# defines the state variables and update rules of an abstract state machine. Spec Explorer employs algorithms similar to those of explicit state model checkers to explore the machine's states and transitions, which results in a finite graph containing a subset of model states and transitions. This graph-based representation is then used for test case generation.
Spec Explorer allows for two test case execution modes: offline (i.e. when test generation and execution are seen as two independent phases) and online (i.e. when test generation and test execution are integrated into a single phase). Online execution incorporates a sort of feedback loop in which immediate results from test execution are used to further guide the test generation process. Thus, as pointed out by the authors, executable models are not crucial to developing tools that can further refine test case generation.
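The state-exploration idea behind Spec Explorer can be illustrated with a breadth-first traversal of an abstract state machine's update rules; the bounded-counter model below is invented for illustration and does not reflect Spec# semantics.

```python
from collections import deque

# Hypothetical abstract state machine: a counter with two actions.
# Each action is an update rule mapping the current state to the next.
ACTIONS = {
    "inc":   lambda s: min(s + 1, 3),   # saturate at 3
    "reset": lambda s: 0,
}

def explore(initial_state):
    """BFS over states/transitions, yielding a finite graph
    analogous to the one Spec Explorer builds for test generation."""
    states, edges = {initial_state}, []
    queue = deque([initial_state])
    while queue:
        s = queue.popleft()
        for name, rule in ACTIONS.items():
            t = rule(s)
            edges.append((s, name, t))   # record the transition
            if t not in states:          # enqueue unseen states
                states.add(t)
                queue.append(t)
    return states, edges

states, edges = explore(0)
print(sorted(states))   # reachable states: [0, 1, 2, 3]
print(len(edges))       # explored transitions: 8
```

Paths through the resulting graph correspond to abstract test cases; offline mode would enumerate such paths up front, while online mode would choose the next action during execution.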
Search-based testing has also been explored in the context of MBT [23,36]. Matinnejad et al. [36] investigated how a search-based technique based on random search, adaptive random search, hill climbing, and simulated annealing algorithms can be used to identify worst-case test scenarios, which are utilized to generate test cases for requirements that characterize the behavior of continuous controllers. Similarly to Matinnejad et al. [36], Kalaee and Rafe [23] examined how search algorithms can be applied to generate test sets from graphs. The proposed approach is tailored to systems that are specified as graph transformations.
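As an illustration of the search-based idea, the sketch below uses hill climbing to search for a worst-case test input; the fitness function is a made-up stand-in for the deviation measures used in search-based MBT, not the actual objective of Matinnejad et al. [36].

```python
import random

# Hill-climbing sketch for worst-case search. The objective below is a
# fabricated stand-in for "deviation between desired and actual
# controller output"; the search maximizes it over the input domain.
def deviation(x):
    # Peak at x = 7 within the input domain [0, 10].
    return -(x - 7.0) ** 2 + 49.0

def hill_climb(steps=500, step_size=0.1, seed=42):
    rng = random.Random(seed)
    x = rng.uniform(0.0, 10.0)          # random starting test input
    best = deviation(x)
    for _ in range(steps):
        candidate = min(10.0, max(0.0, x + rng.uniform(-step_size, step_size)))
        fitness = deviation(candidate)
        if fitness > best:              # keep only improving moves
            x, best = candidate, fitness
    return x, best

x, best = hill_climb()
print(round(x, 1), round(best, 1))     # worst-case input near the peak
```

Adaptive random search and simulated annealing differ mainly in how candidate moves are proposed and whether occasional worsening moves are accepted to escape local optima.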
Koch et al. [27] and Durak et al. [16] designed tools to support the X-in-the-loop testing pipeline, and both tools generate test cases from Scilab/Xcos models. At the model level, test cases are automatically generated for individual and integrated components. The authors refer to test generation for integrated components as model-in-the-loop (MIL). These test cases hinge on what the authors termed "a number of plausible scenarios", which are derived from decision trees that formally represent the integrated models. The results of the tests performed at the model level are subsequently used as a "reference" for software-in-the-loop (SIL) testing of auto-generated code.
To summarize, we found that most of the selected studies deal with test case generation from models. However, there are important differences in the way high-level models are turned into lower-level test cases and how the resulting test cases are used:
• Some test case generation approaches emphasize model-to-model transformations; thus the last step to transform to code has to be semiautomatic. Testers have to bridge the gap between the lowest model level and code by specifying how certain model elements should be transformed into code, for example, by mapping a graph to a sequence of method calls.
• Some approaches automate test case generation all the way to code generation by performing stepwise model refinements until they reach a low-level model representation that is suitable for code generation.
• As long as both model and code are executable, another common approach entails deriving test cases from models and then applying these test cases to the auto-generated code. Some approaches utilize the model-level test cases to create low-level test cases that can be executed to test the auto-generated code.

Discussion Regarding RQ3: How does the coverage of the model produced by abstract tests relate to the coverage of the code for the corresponding concrete tests?
When models are transformed to code, whether automatically or by hand, it is imperative that the behavior defined in the model be preserved in the code. Likewise, even though computing the coverage of artifact-specific constructs with particular tools may result in diverging coverage results, it is important to maintain a high degree of coverage when transforming test cases from the model to the code level.
Models and code use structural elements to represent the underlying logic, albeit at different levels of abstraction. Models, for example, use high-level structures such as activity diagrams to represent the steps and branching logic (i.e. decisions) involved in a specific behavior. Code represents the procedural logic that manipulates data and implements specific behavior using lower-level constructs.
As a result, the similarity between models and code is found in their use of structures to represent the logic of the system.We believe that to answer RQ3, it is necessary to consider the following points.
Firstly, can model-level decision coverage results be extrapolated to branch coverage at the code level? Secondly, are there any high-level constructs that represent behavior in an implicit fashion?
Implicit behavior at the model level can interfere with model-to-code transformations; as a result, such behavior may not be included in the resulting code representation. This can impact coverage when models are transformed into code. Finally, while there is some overlap in how models and code represent decisions, is this overlap sufficient to result in the same number of test requirements?
In order to answer RQ3, we examined these subquestions in the context of the empirical research presented in the selected studies. We framed the aforementioned subquestions as follows:
1. Given the structural similarities between code and models, can we expect a correlation between model and code coverage?
2. Models tend to have some implicit behaviors, for example, conditional behavior that does not appear as predicates.What are the implications with respect to coverage when we transform the models into code?
3. What happens to the number of test requirements when we transform models to code? Does the code have more, fewer, or the same number of test requirements?
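Subquestions 2 and 3 can be illustrated with a toy count of branch-level test requirements; the decisions below are invented, showing how one implicit model-level predicate made explicit in code increases the number of requirements.

```python
# Sketch for subquestions 2 and 3: an implicit model-level behavior
# (navigating a possibly-empty UML association) becomes an explicit
# branch in code. Names and counts are illustrative only.

# Model level: one explicit decision.
MODEL_DECISIONS = ["order.total > limit"]

# Generated code: the association navigation adds a loop predicate
# ("any items left?"), i.e. an implicit predicate made explicit.
CODE_DECISIONS = ["order.total > limit", "items.has_next()"]

def branch_requirements(decisions):
    """Each decision yields two test requirements: true and false."""
    return [(d, outcome) for d in decisions for outcome in (True, False)]

model_reqs = branch_requirements(MODEL_DECISIONS)
code_reqs = branch_requirements(CODE_DECISIONS)
print(len(model_reqs), len(code_reqs))  # 2 at model level, 4 in code
increase = (len(code_reqs) - len(model_reqs)) / len(model_reqs)
print(f"{increase:.0%} more test requirements in the code")
```

The doubling here is fabricated for illustration; the selected studies report application-specific increases (e.g. the 51% figure discussed below) rather than any fixed ratio.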
In what follows, we describe and discuss studies that elaborated on different aspects related to the implications of applying model-based tests to code automatically generated from models [2,5,7,18,20,32,38,45], and studies that only briefly mentioned coverage at both abstraction levels (namely, model and code) [14,23,49,53,56]. Then we draw answers to the three aforementioned subquestions.
Regarding studies that elaborated on different aspects related to the implications of applying model-based tests to code automatically generated from models, Baresel et al. [7] empirically investigated the relationship between structural model coverage and code coverage. By building on previous work [19], Eriksson and Lindström [18] devised an approach that addresses test generation for executable UML (xtUML) models. It includes two new logic-based testing coverage criteria for models (namely, all navigation and all iteration). The new criteria aim at covering the implicit predicates that logic-based criteria miss. For example, by using only a predicate-based criterion in one of the six applications addressed in their previous study [19], the number of test requirements increased 51% when xtUML models were transformed to code. In Eriksson and Lindström's approach, coverage measurement at the model level is enabled by introducing model-level predicates that capture predicates that would appear during model-to-code transformations.
The results from a single application demonstrated that coverage measured at the model level can accurately predict coverage at the code level. This is particularly important for logic-based testing, since coverage at the code level is often required.
Mohalik et al. [38] shed some light on how AutoMOTGen compares to Reactis (a commercial tool that implements a combination of random input-based and guided simulation-based techniques for test case generation) in terms of test coverage. According to the results of industrial case studies, the test case generation techniques employed by both tools can be seen as complementary. Specifically, AutoMOTGen performs better (i.e. achieves higher coverage) for about one third of the cases, while Reactis shows higher coverage for about another third of the cases. As for the rest of the cases, the coverage obtained by both tools seems to be roughly equal. A closer inspection of the results indicates that when models have more logic (i.e. switches and delay types of blocks), AutoMOTGen performs better than Reactis. As for models with more blocks of mathematical operations, Reactis seems to perform better in comparison to AutoMOTGen. This indicates that the technique implemented by AutoMOTGen is more suitable for covering paths with logical constraints.
Additionally, when approximations have to be applied in order to handle complex mathematical operators, the coverage achieved by the test cases generated by AutoMOTGen suffers. Therefore, the authors postulate that AutoMOTGen and Reactis should be used together to achieve better coverage and unreachability guarantees.
In the search-based approach of Kalaee and Rafe [23], an initial test suite is first generated and then optimized through search-based techniques. The authors conducted an experiment to evaluate the effectiveness of their test case generation approach (using different meta-heuristic algorithms). According to the results of the experiment, the generated test sets can cover a significant portion of the graph transformation system (GTS) while keeping test generation cost low: on average, the best algorithm achieved 98.25% coverage, and the second best achieved 96.50% coverage.
By revisiting RQ3, we draw the following answers to its three subquestions, respectively:
• Empirical studies have shown a strong correlation between decision coverage at the model level and branch coverage at the code level. This answer relies on the few studies that have shed light on the implications that arise from the similarities between models and code. As models are transformed into code, the numbers of test requirements for logic-based coverage criteria also increase accordingly [19]. From a criteria comparison standpoint, a study suggests that there is a need to combine high-level testing criteria (which are based on test requirements) with logic-based criteria [18]; in that study, the overlap between these criteria is not straightforward since the test requirements come from different sources.
Another study that empirically investigated the relationship between structural model and code coverage [7] showed that there is a strong correlation between decision coverage at the model level and branch coverage at the code level.
• Few studies reported on the increase in test requirements when implicit behavior in modeling structures is made explicit during model-to-code transformation. In particular, only three studies [18,20,45] provided details about the number of test requirements when models are transformed to code. The three studies addressed turning implicit behavior present in modeling structures (e.g. predicates) into explicit behavior at the code level, and how this leads to an increase in the number of test requirements in the code when compared to the corresponding model. Overall, these studies introduced approaches for making behavior explicit either through transformation rules to be applied to the model before it is transformed to code [20], coverage criteria for models [18], or by forgoing modeling structures that omit logic at the model level [45].
The discussion and conclusion for RQ4 are based on study classifications that rely on the MBT technology (e.g., modeling language and tools) and software development tasks (e.g., modeling and test coverage calculation). Note that even though the elements of the taxonomy (e.g., the input and output languages) were defined in advance (as detailed in Section 3.4), the list of elements inside each category was constructed during the analysis of the selected studies. In other words, the list of elements grew over the course of our systematic review of the literature. At the end of the study analysis and data extraction, we revised the resulting categories to avoid ambiguity and remove duplicates.
The results discussed in this section encompass (i) the modeling language, herein referred to as input language (Figure 3 and Table V); (ii) the source code or test specification language, i.e. the output language, used to encode artifacts that are generated with either automatic or manual model-to-code or model-to-model transformations (Figure 4 and Table VI); and (iii) the goal of the tools and frameworks used in the studies (Figure 5 and Table VII). Note that there is some overlap in the results shown in the charts and tables, given that some studies combined technologies and used tools for varying purposes. For instance, Baresel et al. [7] and Mohalik et al. [38] adopted Simulink and Stateflow as languages for creating models. As another example, Aniculaesei et al. [5] employed tools for modeling, test generation at the model level, and computing test coverage at the code level.
Figure 3 shows the number of studies in which a given language was used for modeling purposes.
Table V lists the respective studies. Simulink was the most used (nine studies), followed by Finite State Machines (FSMs) and Stateflow (four studies each). These three languages are more mature and have more automated support, so we were not surprised that they are widely addressed.
Figure 4 summarizes the classification of studies with respect to the output language. The respective studies are listed in Table VI. The results for this study classification reflect the numbers presented in Figure 3. For instance, the toolkits that support Simulink- and Stateflow-based modeling usually support automatic generation of C code, which was the case in seven studies. FSMs are commonly employed to represent states of objects in object-oriented (OO) systems that are further implemented in the C++ (three studies), Java (two studies), and Python (two studies) languages.
Figure 5 displays the number of studies that utilized tools and frameworks for specific tasks in the MBT process. In summary, for studies that address, to varying degrees, the mapping of abstract tests to concrete tests:
• Simulink and Stateflow, either individually or in combination, are by far the most commonly used input languages for system modeling.
• C and C++ are the most explored output languages for model-to-code transformation, thus corroborating the findings regarding the input language.
• Tools are mostly used for the modeling activity, generation of abstract tests, and test coverage computation (either at code or model level).
• Test cases are transformed from a model representation into code using specialized tools. Moreover, software specifications undergo transformation into code. The test cases to cover the specifications are then applied to the code without the need for any manual intervention.
• Transforming model-level test cases into code-level test cases involves stepwise model-to-model and model-to-code conversions, which can be performed either automatically or manually.
#1.1
• Model-to-model transformations are employed to generate executable test models. The resulting test models are closely aligned with system models and serve as the foundation for generating test cases that can be run on models and code.
• It is imperative that the transformation rules realized by model-to-code generators be explicit, allowing for the identification of the relationship between model and code elements.
#2
• Most of the studies focus on test case generation from models. There are differences in how models are transformed into lower-level test cases and their subsequent utilization:
- Some approaches prioritize model-to-model transformations, requiring a semiautomatic step to convert the model into code.
-Some approaches automate test case generation through stepwise model refinements, gradually achieving a representation that aligns with the code's abstraction level.
- When both the model and code can be easily executed, the prevalent approach involves deriving test cases from models and subsequently executing these test cases on the auto-generated code.
#3
• Studies have provided evidence of a strong correlation between decision coverage at the model level and branch coverage at the code level.
• Models contain implicit test requirements represented by implicit predicates.These predicates are not covered by logic-based criteria applied at the model level.
• Few studies reported on the increase in test requirements when implicit behavior in modeling structures is made explicit during model-to-code transformation.
#4
• Simulink and Stateflow are by far the most commonly used input languages for system modeling.
• C and C++ stand out as the two most extensively explored output languages for model-to-code transformation.
Threats to the validity of this study are commonly classified in four major categories: construct validity, conclusion validity, internal validity, and external validity.

We organize this section according to these four categories.
Some studies [12,16,27] described a contrived usage example, failing to provide empirical evidence supporting the effectiveness of the proposed approach. Therefore, we argue that technology transfer has been negatively affected by a lack of data to inform the evolution of MBT tools. It is imperative that future studies improve the transparency of reporting their experimental designs to support better comprehension of the methodology and reproducibility of results. We also hope that reviewers and editors will be more diligent about noting missing information in studies, and insist that authors correct the oversights in revision.
In our study, we found increasing adoption of MBT in industry, increasing application of model-to-code transformations, and a complementary increasing need to understand how test cases designed for models achieve coverage on the code. Although the selected studies document significant progress on this topic, the issues we identified point to significant gaps in our knowledge of the topic. We hope that practitioners can benefit from our study to better test their software and to better understand how well their software has been tested. We also hope that researchers can use this study as a reference to learn about the current state of knowledge and to identify future research directions, both theoretical and empirical.
Finally, as with any SLR and despite our best efforts over a few years of work, it is unlikely that we found all primary studies.Although we applied several search strategies, the limitations of research repositories (and our own abilities) mean that no search can be exhaustive.Thus, we hope this SLR will be further updated in the future.

• RQ1.1: What is required of the model-to-code transformation to support the transition from model-level tests to code-level tests?
• RQ2: How are tests generated from the model specifications (e.g. UML or Simulink)?
• RQ3: How does the coverage of the model produced by abstract tests relate to the coverage of the code for the corresponding concrete tests?
• RQ4: Which are the applied technologies and which are the software development tasks focused on by studies that address mapping of tests across model and code levels?

1. Testing the Untestable: Model Testing of Complex Software-intensive Systems, by Briand et al. [9];
2. Data Flow Model Coverage Analysis: Principles and Practice, by Camus et al. [10]; and
3. UML Associations: Reducing the Gap in Test Coverage Between Model and Code, by Eriksson and Lindström [18].
Search Stopping Criterion: To keep the review feasible, for the study selection phase we executed three snowballing iterations, called rounds, after which we started the data extraction and synthesis.

Table I. Summary of search rounds (in Round 0, studies P0003, P0004 and P0005 are listed in the Selected Backward column just for convenience; these were the original seeds upon which we started the snowballing process). * selected in the forward snowballing update. *** not selected, but used as source of references in the additional round.

Figure 2. Citation map for studies that passed the inclusion criteria with decreasing timeline (from left to right).

5.5. Discussion Regarding RQ4: Which are the applied technologies and which are the software development tasks focused on by studies that address mapping of tests across model and code levels?
The code coverage obtained with the model-based test set;
Figure 1. Number of studies per publisher (30 studies in total).

Table IV .
[45]20,23,36,49]31,51,545,55]h respect to our research questions.bydescribing the tool that supports the transformation[2,3,5,10,16,27,38,51,56]; studies that described procedures for transforming test cases from models to code[14,21,25,28,30,31, 32,45,55]; and, studies that just reported that test cases developed at the model level are then applied to test the code[18,20,23,36,49].Studies from the three groups are discussed in the sequence.With respect to studies that described tools, Stürmer et al.[51]developed tools to automatically transform test cases based on executable models.The study reported on test vectors generated for Simulink and Stateflow 14 models that can be automatically executed on auto-generated C code with support of a tool called Mtest.Their approach allows the model elements to be traced to code, including changes performed by a model-to-code transformation optimizer.However, the authors did et al.[56]presented details of the Spec Explorer15tool that uses a state model (specified in a language named Spec#) to derive abstract test cases.Spec Explorer employs algorithms similar to those of explicit state model checkers to explore the machine's states and transitions; the automatically generated abstract tests are further converted into concrete tests.The authors defined a set of rules to map what they call action methods (at the model level) to concrete methods present in the actual system under test.Mohalik et al.[38], also in the context of Simulink and Stateflow models, developed the AutoMOTGen test generation tool.AutoMOTGen transforms Stateflow models into code written in the SAL language, which underlies the generation of test cases.Test coverage requirements are encoded as goals in SAL to establish traceability, and a model checking engine is utilized to generate test cases from counter-example traces.The tool generates test cases to satisfy block coverage, condition coverage, decision coverage, and MCDC.The generated test cases are 
directly used to test the code produced from the models.Camus et al.[10]employed the SCADE tool suite16to automatically transform model-based test cases to be directly applied to source code.The Model Test Coverage 17 (MTC) tool was employed to run tests and collect model coverage data.They also applied structural code coverage analysis on the code.When applying the resulting test cases to code, the code coverage can be used as a measure of conformance to standards such as DO-178C/DO-331.Amalfitano et al.[2]studied test cases that were automatically generated to provide "full coverage" (such as the coverage of all states, all transitions, and all paths) of UML state machines and then run on automatically generated Java code.They employed the Conformiq Designer tool18to generate test cases at the model level, and then automatically transform model-level test cases into code.In Both tools turn models and test cases generated at the model level into C code, and the study assessed the effectiveness of the test sets based on their mutation scores.Regarding studies that described procedures for transforming test sets from models to code, Pretschner et al.[45]provided clear information about model-to-code transformation of test sets.The study describes a compiler that transforms abstract, model-level test cases to concrete, code-level test cases.
[5]]ls not give technical details on how Simulink and Stateflow models are turned into code, or how test vectors are transformed into code.FERRARI ET AL.Veanes another research initiative, Amalfitano et al.[3]reported on a relatively simple experiment to probe into the difference between model and code coverage for four different state machine models and eight test sets.They specified test cases in an ordinary spreadsheet that is automatically processed by a legacy, homemade, unnamed testing environment; the same test cases are executed at both model and code levels.Koch et al.[27]presented the Scilab/Xcos XTG19tool.It supports the Durak et al.'s [16] Xin-the-loop testing pipeline for model-based development of parallel real-time software that runs on multicore processor architectures tailored to the avionics industry.Scilab/Xcos XTG enables back-to-back testing by injecting automatically generated code into the model elements, thus allowing enhanced simulations to be carried out at the model level.It also generates input test data and expected output that can be used to exercise the model and the code at various phases of the model-based testing workflow.In both studies, a single example was outlined, without any further empirical assessment.Aniculaesei et al.[5]compared the fault revealing capability of test sets automatically generated with a commercial tool (the SCADE tool suite16and an academic, open-source tool (NuSMV) that applies the model checking approach.
[10]luded that test cases randomly generated at the model level provide low code coverage, but did not provide details.Matinnejad et al.[36]applied nine test cases generated at the model level in practice at the HIL stage, but did not provide details either.Eriksson and Lindström[18]proposed new model-based coverage criteria that are computed from executable xtUML28models.They continued the work of Eriksson et al.[19]by generating logic-based test cases at the platform-independent level using xtUML.Eriksson and Lindström's approach comprises measuring coverage at the model level by first creating model-level predicates that capture the predicates that would appear during model transformation to code.This allows test coverage to be measured at the model level.In that study, for a single subject application from the avionics domain, the authors reported on the coverage achieved by a test set generated at the model level and re-executed at the code level.However, neither that study nor a prior study on the same project[20]provided details of how the test set is mapped across the abstraction levels.Finally, Kalaee and Rafe[23]mentioned that test cases generated at the model level, based on graphs and transformation rules, can be transformed into sequences of method invocations, but the authors did not elaborate on it.TRANSFORMING MODEL-BASED TESTS INTO CODE: A SYSTEMATIC LITERATURE REVIEW 17 in a constraint logic programming language from which the test cases are generated.Then, a compiler transforms abstract, model-level, test cases into concrete, code-level, test cases, which are executed on the software under test (written in C).The approach by Amalfitano et al.[2]requires executable system models, which allow testing models to be automatically generated.The testing models then underlie the generation of test cases at both the model and code levels.The authors work in a Model Driven Architecture 29 development context and employ the Conformiq Designer tool30to 
automatically generate and transform model-level test cases. Camus et al. [10] stated that "it has been verified in practice for complex models that tests covering the model also cover the code generated from that model, except few [sic] systematic cases which are predictable and justifiable." These systematic cases include refinements of the model coverage criteria, such as addressing numeric aspects (which would allow performing analysis of singular points) and handling delays (which would allow assessment of sequential logic). With these refinements implemented, the authors state that one could provide formal evidence of model-to-code coverage preservation. Across the selected studies, we identified the following two perspectives regarding the transformation, or reuse, of test cases generated at the model level to test, or evaluate, the code derived from models: fully automated transformation and execution of test cases, and partially automated or manual transformation of test cases with subsequent automated execution. Both perspectives are summarized next.

• Test cases are fully automatically transformed from model into code by using specific industrial or tailor-made tools. Software specifications are automatically transformed into code, and test cases generated to cover the specification are automatically applied to code without manual intervention [2,3,5,10,16,27,38,51,56].

• Abstract, model-level test cases are transformed into concrete, code-level test cases after stepwise model-to-model and model-to-code transformations that are performed either automatically or by hand. The concrete tests are then run directly on the code [14,21,25,28,30,31,32,45,55].

5.2. Discussion

Regarding RQ1.1: What is required of the model-to-code transformation to support the transition from model-level tests to code-level tests? The selected studies employ models
that form the basis for test generation and transformation down to the code level. For example, Pretschner et al.'s approach [45] transforms extended finite state machines into specifications written in a constraint logic programming language. Similarly to Fraternali and Tisi [21], Lamancha et al. [28] exploited a model-to-model transformation of UML models to UML Testing Profile models, from which test cases for the model level are generated. Subsequently, model-to-text transformations automatically produce test cases in a variety of languages; examples are test scripts that follow the JUnit style. Regarding studies that used various approaches to transformations, Stürmer et al. [51], for instance, addressed the issue of reusing test sets across abstraction levels, suggesting that the specifications of model-to-code optimizations should be available. This allows model elements to be traced to auto-generated code elements, including elements omitted from or inserted into the code through optimizations. The approach proposed by Veanes et al. [56] needs human intervention for transforming (i.e., binding) model elements (i.e., action methods in the model) into code elements (methods with matching signatures in the SUT). In a theoretical study, similarly to what was proposed by Stürmer et al. [51], Kirner [25] also considered optimizations, suggesting that the code generator must conform to a set of rules derived from a coverage profile. For that, the author initially defined formal rules based on some coverage criteria (statement coverage, decision coverage, and MCDC), a set of coverage preservation rules, and a set of code optimizations. Based on the formal rules, a coverage profile is created and integrated into a code transformer. Li and Offutt [31] assumed non-executable behavioral models such as UML state machines, which do not contain details such as objects, parameters, actions, and constraints. They employed the STALE framework to manually write code in the STAL language to define mappings between abstract
(model-level) and concrete (code-level) elements, so that abstract and concrete execution paths can be automatically generated by STALE. Example mappings are a UML action mapped to a Java method call, and an initialization of a UML object mapped to a Java object creation.
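The kind of element-wise mapping described here can be illustrated with a toy concretizer; the mapping table, names, and output statements below are hypothetical and do not reproduce STALE's actual STAL notation.

```python
# Illustrative (not STALE's notation): a dictionary maps abstract
# model-level elements to concrete code-level statements, so that an
# abstract execution path can be rendered as a concrete test script.

MAPPINGS = {
    "createAccount": "acc = Account()",          # UML object init -> object creation
    "deposit(100)": "acc.deposit(100)",          # UML action -> method call
    "checkBalance": "assert acc.balance == 100", # state invariant -> assertion
}

def concretize(abstract_path):
    """Translate an abstract path into concrete test statements."""
    return [MAPPINGS[step] for step in abstract_path]

script = concretize(["createAccount", "deposit(100)", "checkBalance"])
```

The mapping must be written once per model element, after which any abstract path over those elements can be concretized mechanically.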

•
Formal model-to-model transformations are needed to produce executable test models, typically in the context of MDE development approaches. Such test models are aligned with the system models and underlie the generation of test cases that can be either executed on models as well as code, or exclusively on the code. When tests are executed on the code, the model-level test cases are abstract.

• The transformation rules performed by model-to-code generators must be explicit to clarify the correspondence between model and code elements. Such transformations include optimizations performed by compilers while transforming model elements into code elements, and the generation of code from model elements that have implicit predicates. The rules can be created either automatically or by hand.
To support the transition from model-level tests to code-level tests, it is key to understand approaches to developing model-to-code transformers. We surmise that understanding how model-to-code transformers turn models into code can help testers focus on edge cases that are often neglected during model-based testing. With those concerns in mind, we hereafter discuss representative studies. In another research initiative, AbsCon (Abstract test cases Concretizer), by Vanhecke et al. [55], was designed to turn abstract tests into concrete ones. The tool's test case concretization process maps assertions and actions in abstract tests to verifications and sequences of operations (i.e., concrete tests), respectively, that exercise the SUT through the test API. However, the process of turning abstract tests into concrete test scripts is not fully automated; it requires tester intervention. Specifically, before turning assertions and actions, which are defined in XML, into concrete tests in Python, testers must provide the following additional information: the test API model for executing the SUT, the path to the Python files that implement the SUT model and the mapper for the chosen API, and a CSV file with input values (i.e., test case values).
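The concretization step can be sketched in the spirit of AbsCon as follows; abstract steps are simplified to dictionaries (the real tool reads XML), and the mapper and input values are invented stand-ins for the tester-provided API mapper and CSV test-case values.

```python
# Sketch of abstract-test concretization: abstract actions and assertions
# are mapped to concrete Python API calls using a tester-supplied mapper
# and per-test-case input values. All names here are hypothetical.

abstract_test = [
    {"kind": "action", "name": "login"},
    {"kind": "assert", "name": "is_logged_in"},
]

# Tester-supplied mapper from abstract names to concrete call templates.
api_mapper = {
    "login": "sut.login({user!r}, {password!r})",
    "is_logged_in": "assert sut.is_logged_in()",
}

test_values = {"user": "alice", "password": "s3cret"}  # would come from a CSV file

def concretize(steps, mapper, values):
    """Produce concrete Python statements from abstract steps."""
    return [mapper[s["name"]].format(**values) for s in steps]

concrete = concretize(abstract_test, api_mapper, test_values)
```

Note how the tester's contribution (the mapper and the values) carries all the code-level knowledge; the abstract test itself stays platform-independent.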
Li and Offutt's STALE framework [31] first turns UML state machines into general graphs (model to model). Abstract tests are generated to cover the graphs. The abstract tests include transitions and constraints (based on state invariants). Testers provide mapping rules, which are sequences of method calls that represent transitions in the statechart and are assembled to transform abstract tests into concrete tests. Li et al. [30] improved on STALE by further automating the test case generation step. The resulting framework, named skyfire, is built on STALE, but generates concrete tests directly from the graphs in the form of Cucumber test scenarios. Skyfire generates test cases that satisfy graph coverage criteria and transforms test cases into Cucumber scenarios. Nevertheless, similarly to AbsCon [55], this approach is semi-automatic given that testers have to write the Cucumber mappings for the generated scenarios.
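The scenario-rendering step of such a pipeline can be sketched as follows; the Gherkin wording and state names are invented, and real usage would still require tester-written step definitions binding each line to code.

```python
# Toy renderer: an abstract test path over a state-machine graph is
# turned into a Cucumber (Gherkin) scenario skeleton. The step phrasing
# is hypothetical; testers must still implement matching step definitions.

def path_to_scenario(name, path):
    """Render a node path as a Gherkin scenario skeleton."""
    lines = [f"Scenario: {name}",
             f"  Given the system is in state {path[0]}"]
    lines += [f"  When the system moves to state {node}" for node in path[1:]]
    return "\n".join(lines)

scenario = path_to_scenario("cover idle-run-stop", ["Idle", "Running", "Stopped"])
```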
Lamancha et al.'s approach [28] is stepwise in the sense that it applies model-to-model transformations and then model-to-text transformations. The approach turns high-level UML 2.0 representations (i.e., sequence diagrams) into test case scenarios that conform to the UML Testing Profile 2 (UTP2), and then model-to-text transformations are applied to the UTP2 models to render them into text (i.e., code). According to the authors, the model-to-text transformation step allows for the generation of test cases in a variety of programming languages because it is implemented with MOFScript, which is an OMG standard.
Tekcan et al. [53] also devised a twofold approach to turning a high-level representation into executable test code. Specifically, in the proposed user-driven test case generation approach, test cases are first represented as states and state transitions in XML files, and then these XML files are transformed into Python scripts. Drave et al. [14] developed a method to manage requirements, design, and testing in the automotive industry. The specification method SMArDT leverages model-based software engineering techniques with the aim of mitigating the deficiencies of the established V-Model. The method is based on the premise that consistency checking between layers and test case generation (and regeneration) helps developers and testers cope with the bureaucracy imposed by the classical V-Model. The authors posited that consistency among specification artifacts between layers enables automatic transformation of test cases to lower levels. To realize the method in a modeling-environment-agnostic fashion, the authors put together a configurable tool chain that can turn functional requirements modeled using activity diagrams, state charts, sequence diagrams, and internal block diagrams from various formats into executable test cases for various output formats. Some researchers have also turned their attention to the formal verification technique of model checking to derive test cases automatically. Essentially, model checking hinges on the capability of model checkers to exhaustively probe into the state space of the SUT and generate test cases that are based on traces or counter-examples of properties specified by the SUT's model. Therefore, model-checking-based test generation is built on the assumption that the state space can be thoroughly explored. One of the selected studies investigated the relation between requirements and structural coverage at both the model and the code levels. The authors report on empirical coverage results for model (Simulink/Stateflow) and code (C) of three functional modules of an automotive
system. They found a strong correlation between model and code coverage in terms of achieved percentages of coverage. Other studies we describe in the sequence, however, found a substantial difference in the number of test requirements when performing model-to-code transformations, particularly when models have implicit behaviors that are transformed into decisions and loops in code that have explicit predicates. The introduced predicates create new code-level test requirements. Pretschner et al. [45] argued that it is key to make behavior explicit at the model level (i.e., akin to the introduction of model-level predicates to make implicit semantic assumptions explicit). The results of their experiment would seem to suggest that, when behavior is made explicit at the model level, automatically generated test sets are able to uncover as many faults as handcrafted model-based test sets with the same number of test cases. Eriksson et al. [20] addressed the issue of implicit conditional behaviors at the model level. They devised model-to-model transformation rules that turn implicit predicates into explicit predicates at the model level, thereby ensuring that structural coverage achieved at the model level is preserved at the code level. These transformation rules resulted in near 100% code-level MCDC coverage. Thus, although model-level test cases generated from the original model may not be enough to guarantee code-level coverage, they can be augmented in clearly defined ways to achieve coverage. Regarding the number of test requirements, for the original artefacts (i.e., without applying the devised rules), they found 67% additional logic-based test requirements from the code compared to the design model, whereas this percentage dropped to near 0% when the rules were applied.
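The effect of an implicit model predicate becoming explicit in generated code can be illustrated with a minimal sketch; the saturation behavior, limit, and inputs below are invented.

```python
# A saturation block in a model often has no visible decision, so
# logic-based criteria at the model level yield no test requirements for
# it. The generated code, however, contains an explicit branch, which
# adds code-level test requirements.

def saturate(x, limit=100):
    # Explicit predicate introduced by code generation.
    if x > limit:       # branch coverage now requires the true outcome ...
        return limit
    return x            # ... and the false outcome: two new requirements.

def branch_outcomes_covered(inputs, limit=100):
    """Which outcomes of the introduced predicate does a test set reach?"""
    return {x > limit for x in inputs}

nominal_only = [50]        # model-level set covering only the nominal path
both_outcomes = [50, 150]  # augmented set covering both branch outcomes
```

This mirrors the augmentation idea above: the original model-level set misses one branch outcome, and one extra single-clause test input restores full branch coverage.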
Amalfitano et al. [2] compared the model coverage achieved by the test cases at the model level with the coverage obtained by the test cases when run against the generated code. They found differences between model coverage on state machines and code coverage. They ran two test sets on four state machine models and their code. The test sets reached 100% coverage on states and transitions, but statement coverage varied from 48% to 75% and branch coverage from 25% to 52% on the code. Amalfitano et al. gave three main reasons for these differences: (i) the code generator added extra code for exception handling and debugging; (ii) model coverage was not enough to guarantee code coverage; and (iii) the design of the models plays a major role in the quality of the generated test cases. Although the authors did not explicitly report the absolute number of test requirements at both abstraction levels, their results indicate that the major source of differences was code implementing behavior that was not included in the model. The approach proposed by Li and Offutt [32] renders state machine diagrams into general graphs, which are then used to generate abstract tests. The abstract tests are generated so as to satisfy graph coverage criteria: edge and edge-pair coverage. According to the experiment results, edge-pair coverage tests were not significantly stronger than the edge coverage tests. The authors believe that this is the case because edge-pair coverage did not entail many more mappings (i.e., test inputs) than edge coverage. Aniculaesei et al. [5] evaluated the coverage in terms of a mutation score. In their experiment, which used a single subject, the test set generated by the SCADE toolchain was able to kill roughly 21% of the mutants (a very low mutation score of 0.21). After analyzing the causes that might have contributed to
this low mutation score, the authors concluded that the system targeted in the experiment was only partially represented through LTL specifications. Regarding studies that only briefly mentioned coverage at both the model and the code levels, Spec Explorer, by Veanes et al. [56], derives test cases from graph-based models (dubbed model automata). The resulting test cases are generated in hopes of either providing some sort of coverage of the state space, reaching a state (i.e., node) satisfying some property, or traversing the state space randomly, and likewise covering the corresponding implementation under test, which may be a distributed system consisting of subsystems, a (multithreaded) API, a graphical user interface, and so on. In the approach of Kalaee and Rafe [23], a model of the graph transformation system (GTS) under consideration is initially created using graph transformation rules; the model is then simulated to generate the first test cases.
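The edge versus edge-pair comparison made by Li and Offutt [32] can be sketched by computing the test requirements of each criterion over a small invented graph.

```python
# Edge coverage requires each edge to be traversed once; edge-pair
# coverage requires each path of two consecutive edges. The toy graph
# below is invented for illustration.

edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "B")]

def edge_requirements(es):
    """Each distinct edge is one test requirement."""
    return set(es)

def edge_pair_requirements(es):
    """Each pair of consecutive edges (a->b, b->d) is one requirement."""
    return {(a, b, d) for (a, b) in es for (c, d) in es if b == c}

num_edges = len(edge_requirements(edges))       # 4 requirements
num_pairs = len(edge_pair_requirements(edges))  # 5 requirements
```

On graphs like this one, edge-pair coverage adds only a handful of requirements over edge coverage, which is consistent with the authors' observation that it did not entail many more test inputs.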
The authors of [49] simply reported that randomly generated model-level tests provide low code coverage (around 32%). These findings were based on their own experience with the X-in-the-loop testing process. With a similar level of detail, in a study in which test cases were first represented as states and state transitions in XML files, and then transformed into Python scripts, Tekcan et al. [53] mentioned coverage-related results without defining what they mean by "coverage". Drave et al. [14] reported on the results of an experiment in terms of the fault-finding effectiveness of the proposed MBT approach, as opposed to structural coverage-related results. More specifically, the authors carried out a case study to compare model-based test cases derived in the context of the tool chain environment that realizes SMArDT and manually created test cases. According to their results, the test cases generated by the MBT approach had a higher fault coverage (i.e., detected more faults) than the traditional hand-crafted test cases. The MBT approach was especially effective at generating test cases that uncover faults caused by inconsistent requirements. Nevertheless, neither the traditional nor the model-based test cases uncovered all faults. The authors do not elaborate on the structural coverage achieved by either test set. Kalaee and Rafe [23] proposed a test case generation approach for graph transformation systems (GTS) that utilizes model simulation and search-based techniques. In this context, coverage is analyzed in terms of the all def-use criterion: specifically, data-flow coverage is determined by data dependencies between nodes in the graph. From a model-to-code transformation viewpoint, when looking at the effect of model-to-code transformations on the test artifacts and the number of test requirements, it would seem that as the number of test artifacts increases, so does the number of test requirements. Often, these implications are discussed either in light of the problem of preserving structural code coverage when
transforming a model into code, or in terms of the correlation between a high-level (i.e., model-based) testing criterion and a lower-level criterion (i.e., one based on notions of structural code coverage).

•
Studies [19,20] found that models can have implicit test requirements represented by implicit predicates at the model level, and these predicates are not affected by logic-based criteria applied at the model level. Nevertheless, studies suggest that deterministic rules applied during model-to-code transformation can make these predicates explicit, resulting in better structural coverage at the model level with relatively low additional testing effort. More specifically, some studies posit that models tend to have implicit test requirements [19,20]. Specifically, implicit predicates at the model level represent hidden controls and loops, which account for most of the implicit behavior in models. Therefore, the main implication with respect to model-to-code transformations and test coverage is that these implicit predicates are not affected by logic-based criteria, so they do not contribute any test requirements when such criteria are applied at the model level. However, studies show that the hidden behavior in models can be made explicit by deterministic rules that can be applied by a model-to-code compiler during transformation. The results of these studies suggest that by making implicit behavior explicit it is possible to achieve structural coverage at the model level that is closer to the coverage obtained at the code level. Additionally, most implicit behavior, when made explicit, tends to result in single-clause predicates; thus, few extra test cases are needed and the ensuing test design activity is cheap.
Table VII lists the respective studies. Examples are modeling (with 13 occurrences in our selected studies), test generation at the model level (12 occurrences), and test coverage calculation at the code level (5 occurrences).

Table VIII. Overview of the results and findings for each RQ.