A Systematic Review on Cloud Testing

A systematic literature review is presented that surveyed the topic of cloud testing over the period 2012-2017. Cloud testing can refer either to testing cloud-based systems (testing of the cloud) or to leveraging the cloud for testing purposes (testing in the cloud): both approaches (and their combination into testing of the cloud in the cloud) have drawn research interest. An extensive paper search was conducted by both automated query of popular digital libraries and snowballing, which resulted in the final selection of 147 primary studies. Throughout the survey, a framework has been incrementally derived that classifies cloud testing research into six main areas and their topics. The article includes a detailed analysis of the selected primary studies to identify trends and gaps, as well as an extensive report of the state-of-the-art as it emerges by answering the identified Research Questions. We find that cloud testing is an active research field, although not all topics have received enough attention, and we conclude by presenting the most relevant open research challenges for each area of the classification framework.


Cloud computing has established the vision of delivering computing resources as a utility, just as water or electricity. Research in this new technology has since gained momentum, and the IT industry is moving towards the cloud with huge investments and great expectations [26,150].
The acclaimed new computing paradigm does not come without challenges, though. To clear the initial confusion around its actual capabilities, Armbrust and coauthors [14] provided a view of the top 10 obstacles to cloud computing. Among them, the difficulty of "removing errors in these very large-scale distributed systems" [14] points to the broad and challenging research topic of cloud testing, which refers to "testing and measurement activities on a cloud-based environment and infrastructure by leveraging cloud technologies and solutions" [47]. In recent years, researchers have actively investigated the scientific and technical problems posed by cloud testing and have developed many techniques and tools for testing cloud-based systems.
Beyond addressing the challenge of testing systems that reside in the cloud, researchers have also realized the potential offered by the cloud to mitigate the age-old problem of high testing costs [22]. In fact, the cloud offers the opportunity to avoid developing and maintaining costly test infrastructures, and to leverage on-demand scalable resources for configuration testing (by using cloud virtualization) and performance testing (by means of cloud elasticity). Thus, the very term "cloud testing" is used in the literature with two different meanings: testing of the cloud or testing in the cloud.
Despite an active and continuing interest by the community in cloud testing research, there does not exist a recent and comprehensive classification of research results that can guide researchers entering this field. Several authors provided an overview of the issues and opportunities in testing of the cloud or testing in the cloud, e.g., References [16,78,126,152], but not based on a systematic study of the literature. A few systematic surveys or mapping studies have also been conducted, e.g., References [4,71,76,115,122,164]. However, such studies either focus on specific aspects of cloud testing or are now several years old (see the next section). In particular, the latest comprehensive surveys reviewed the literature only until 2012.
Motivated by the above, this survey fills a gap by conducting a systematic literature review (SLR) [83] over the 2012-2017 period, with the objective of identifying and categorizing relevant research on cloud testing. The study covers any aspect of testing of the cloud (ToC), testing in the cloud (TiC), and their intersection, i.e., testing of the cloud in the cloud (ToiC). Our extensive "hunt" for literature included an automated search over six popular digital libraries (Scopus, ACM, IEEE, ScienceDirect, Wiley, and Springer) and several snowballing iterations, both backward and forward (over Google Scholar). As a result, a total of 810 primary studies have been scrutinized, of which 147 eventually passed the selection and are surveyed here.
A classification framework is proposed that divides the selected primary studies into the three (non-overlapping) categories of TiC, ToC, and ToiC. The categories are structured into six main testing research areas, namely: test perspective, design, execution, objective, evaluation, and domain. Each research area includes several topics that emerged from the reading of the studies.
This article is structured as follows: in the next section, we overview related work, i.e., recent surveys in cloud testing. Then, in Section 3, we describe our Research Questions (RQs) and the research methodology. In Section 4, we describe the derived classification framework, which is another contribution of this work. In Sections 5 and 6, we present the results from the survey: the former provides some interesting numerical analyses, while the latter includes an extensive discussion of insights gained from the full-text reading of the selected primary studies. Although the focus of this review is the scientific literature, the section also outlines industry trends in this field. Then, in Section 7, we summarize the open research challenges that emerge from the literature. Conclusions and directions for future research are given in Section 8.

RELATED WORK
Several existing surveys partially overlap with this work; however, none of them provides a survey of the whole cloud testing research field. A peculiar case is the work by Jia et al. [76]: this paper proposes to use the well-known 5W + 1H pattern to guide the structuring of research questions for systematic mapping studies. Then, to demonstrate the approach, the paper conducts as a case study a systematic mapping study of cloud testing research that categorizes 51 primary studies. Although the paper was published in 2016, the set of included papers was selected in 2012; therefore, that work also predates the period considered here.
Finally, the closest work to this article is the survey by Ahmad et al. [4], which focuses on the empirical studies in cloud testing papers. That work covers a literature search over the period 2010-2015 and provides a systematic mapping study over 69 primary studies (from 75 referred papers). In comparison to Reference [4], this article surveys a different period (2012-2017) and selects about twice as many papers. Moreover, for the years that are surveyed in both works (i.e., 2012-2015), we can observe different selections of primary studies. This can be due to the usage by Ahmad et al. of a different (non-standard) search protocol (manual search on publication venues previously selected by an automated search) and to their more relaxed interpretation of the term "testing" to also include other verification and validation approaches, whereas this survey only focuses on testing approaches.

RESEARCH METHODOLOGY
This survey follows the guidelines for systematic reviews in software engineering research by Kitchenham and coauthors [24,83]. Following these guidelines, our research methodology included three main phases: planning the review (described in Section 3.1), conducting the review (described in Section 3.2), and reporting the results (see Sections 5 and 6).
The literature search included an automated query over three popular digital libraries and several iterations of the snowballing approach [155], plus an additional assessment, over three more digital libraries, of the completeness of the results from the above search methodology (as described in Section 3.2.5).

Planning the Review
The main goal of this study is to understand the current state-of-the-art in cloud testing and to review the existing approaches. In particular, we identified the following research questions (RQs):
RQ1: What are the main objectives for cloud testing?
RQ2: How are cloud resources exploited for software testing?
RQ3: What are the test methods, techniques, and tools mainly used in cloud testing?
RQ4: How are testing results evaluated in cloud testing?
RQ5: What are the research issues and future research directions of cloud testing?
RQ6: Which are the main application domains for software testing in the cloud?
The last question aims at understanding for what types of applications testing has been migrated to the cloud; hence, it only refers to TiC studies.

Conducting the Review
Conducting the review started with the identification of the relevant primary studies. Overall, our search spanned seven digital libraries, which are among the most commonly used in similar studies, namely: Scopus, ACM Digital Library, IEEE eXplore, Google Scholar, ScienceDirect, Wiley Online Library, and Springer Link. The process was articulated in five steps, as described below.

Automated Search in Digital Libraries.
In a first step, we conducted an automated search in the following electronic sources, which are of great relevance for software engineering research:
• Scopus (http://www.scopus.com)
• ACM Digital Library (http://dl.acm.org)
• IEEE eXplore (http://ieeexplore.ieee.org)
Specifically, we searched by title, abstract, and keywords, selecting English papers from 2012 to 2017. To be as comprehensive as possible, we defined a very general search string, as shown in Listing 1. We also included in the search the acronym "TaaS" (Testing as a Service), because we noticed that it is commonly used in several cloud testing works. However, we decided not to search for other terms, such as "analysis" or "evaluation": although these may be used as synonyms for test or testing, it is unlikely that a paper truly centered on testing would never use the words "test" or "testing" in its title, abstract, or keywords.
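For illustration only, the general shape of such a query can be sketched as follows; the term list and field syntax below are our assumptions for a Scopus-style search, while the exact string used in the study is the one reported in Listing 1.

# Illustrative sketch only: the exact search string is given in Listing 1;
# the terms and field syntax here are assumptions, not the article's.
terms = ['"cloud testing"',
         '"testing in the cloud"',
         '"testing of the cloud"',
         'TaaS',
         '(cloud AND (test OR testing))']
query = ("TITLE-ABS-KEY(" + " OR ".join(terms) + ")"
         " AND PUBYEAR > 2011 AND PUBYEAR < 2018"
         " AND LANGUAGE(english)")
print(query)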

Selection Based on Inclusion/Exclusion Criteria.
We performed a first selection by reading title, abstract, and keywords of the papers and selecting them according to the Inclusion and Exclusion Criteria defined in Table 2(a). These are quite standard criteria mainly based on relevance of scope. We also excluded works that are not primary studies and works that are not peer-reviewed or too short (e.g., theses or abstracts). We also excluded monographs and books, as these tend to present mature work illustrating and merging results that have previously appeared in journals or conferences.

Selection Based on Quality Assessment.
We then performed a second selection of the included papers based on the reading of the whole paper. To this purpose, we defined a quality assessment checklist composed of the five criteria in Table 2(b), and a QualityScore given by the sum of the individual scores I_k, as shown in Equation (1):

QualityScore = \sum_{k=1}^{5} I_k    (1)
The quality assessment procedure followed a conservative approach aimed at excluding only those papers having very low quality. It included two phases. In the first phase, each paper was read by a (randomly selected) author who assigned to each criterion of Table 2(b) a score I_k between 0 and 1 (precisely, 0 if the paper did not satisfy that criterion, 0.5 if the criterion was partially satisfied, and 1 if it was clearly satisfied), so the maximum possible QualityScore was 5. Each paper was then handled as follows: if its QualityScore was less than 2, it was excluded; if its QualityScore was greater than or equal to 3, it was included; finally, if its QualityScore was at least 2 but less than 3, the paper was labeled as acceptable.
In the second phase, another quality assessment was performed for the papers labeled as acceptable. For these, a second author, different from the first one, read and assessed the paper following the same process as in the first phase, producing a second QualityScore. Then, all papers for which the sum of the two QualityScores (from the two phases) was greater than or equal to 4.5 were included in the survey.
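The whole two-phase selection logic can be summarized by the following minimal sketch; the thresholds and scores are those described above, while the criterion labels remain those of Table 2(b).

SCORES = (0.0, 0.5, 1.0)  # not satisfied / partially / clearly satisfied

def quality_score(criterion_scores):
    # Equation (1): the QualityScore is the sum of the five I_k scores.
    assert len(criterion_scores) == 5
    assert all(s in SCORES for s in criterion_scores)
    return sum(criterion_scores)

def assess(first_reader, second_reader=None):
    # Returns 'included', 'excluded', or 'acceptable' (second phase needed).
    score = quality_score(first_reader)
    if score < 2:
        return "excluded"
    if score >= 3:
        return "included"
    # 2 <= score < 3: a second, independent assessment decides.
    if second_reader is None:
        return "acceptable"
    total = score + quality_score(second_reader)
    return "included" if total >= 4.5 else "excluded"

print(assess([1, 1, 0.5, 1, 1]))             # included (score 4.5)
print(assess([0.5, 0.5, 0.5, 0.5, 0.5],      # acceptable (2.5), then:
             [1, 0.5, 0.5, 0, 0.5]))         # 2.5 + 2.5 = 5.0 -> included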

Searching Based on Snowballing.
Snowballing [155] is a search approach commonly used to complement automated queries. We adopted both backward and forward snowballing to identify additional papers that the automated search might have missed. Precisely, for each selected primary study, we examined its list of references (backward snowballing) and its citations in Google Scholar (http://scholar.google.com/) (forward snowballing). In both cases, we first selected all papers with publication year in the range 2012-2017 that were not already included in the survey; then, we applied to the newly found primary studies the same quality assessment process previously applied to the automatically retrieved papers.
As detailed in Section 5, the snowballing procedure was iterated until no relevant new paper was found. The first iteration (indicated as Snowball Iteration A in Figure 2) was applied to the start set of papers selected from the automatically found ones. Each subsequent iteration (Snowball Iteration B, C, and so on in Figure 2) was applied to the papers derived in the previous iteration.
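A compact sketch of this iteration logic follows; the reference and citation lookup functions stand in for the manual backward and forward steps.

def snowball(start_set, references_of, citations_of, in_period, passes_quality):
    # Iterate backward/forward snowballing until no new paper is found.
    # references_of/citations_of are stand-ins for the manual inspection of
    # reference lists and of Google Scholar citation pages, respectively.
    selected = set(start_set)
    frontier = set(start_set)          # Iteration A starts from the start set
    while frontier:
        candidates = set()
        for paper in frontier:
            candidates |= set(references_of(paper))   # backward snowballing
            candidates |= set(citations_of(paper))    # forward snowballing
        # Keep 2012-2017 papers not already selected, then apply the same
        # quality assessment used for the automatically retrieved papers.
        frontier = {p for p in candidates
                    if p not in selected and in_period(p) and passes_quality(p)}
        selected |= frontier           # Iterations B, C, ... until empty
    return selected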

Assessing the Research Methodology.
We finally conducted a further literature search with the aim of verifying that all the relevant primary studies had been included. We launched a second automated search over different electronic sources, namely:
• ScienceDirect (https://www.sciencedirect.com)
• Wiley Online Library (https://onlinelibrary.wiley.com)
• Springer Link (https://link.springer.com)
As detailed in Section 5, this search did not find any relevant primary study to be added, thus confirming the reliability and completeness of the snowballing procedure.

A CLASSIFICATION FRAMEWORK FOR CLOUD TESTING RESEARCH
In Figure 1, we present the framework we developed to characterize cloud testing research and classify the papers. Aiming at completeness, we derived this framework incrementally. First, a draft scheme was obtained from our reading of titles, keywords, and abstracts during paper selection; this scheme included six areas and several topics for each area. We then used this draft scheme to classify the papers while reading the full texts, but we also continued to add new subtopics within each area as needed. Finally, during data analysis, we standardized and unified the new topics. The resulting framework is thus in itself a useful contribution, offering a snapshot of trends in cloud testing research.
Overall, as anticipated, we classify papers into three different categories that span the above areas, as shown by the colored frames in Figure 1:
• Testing in the cloud (solid blue frame in Figure 1) refers to software testing performed by leveraging scalable cloud technologies, solutions, and computing resources to validate non-cloud software/applications. This category includes testing solutions for different application domains, such as mobile or web environments, which are validated by exploiting large-scale simulations and the elastic resources offered by the cloud. The main benefits of these solutions are: (i) reduced costs, thanks to the seemingly unlimited computing resources available in the cloud; (ii) no need to develop and maintain a testing infrastructure (scaffolding); and (iii) on-demand test services provided by a third party to conduct online validation of large-scale software systems.
• Testing of the cloud (dotted red frame in Figure 1) refers to validating the quality (functional and non-functional properties) of applications and infrastructures that are deployed in the cloud. The focus is on the specific testing problems posed by systems residing in the cloud; papers belonging to this category thus aim at checking the provided automatic cloud-based functional services, as well as at validating their performance, scalability, elasticity, and security. Moreover, software applications can be deployed on different clouds (e.g., private, public, or hybrid), hence testing can also focus on compatibility and interoperability among heterogeneous cloud resources.
• Testing of the cloud in the cloud (dashed green frame in Figure 1) refers to applications and infrastructures deployed in the cloud and tested by leveraging cloud platforms. Papers belonging to this category fill the intersection between Testing of the cloud and Testing in the cloud.

RESULTS
This section reports the number of primary studies selected in each step and presents several quantitative analyses of the review outcomes.

Numerical Outcomes
From the automated search described in Section 3.2.1, an initial collection of 655 primary studies was obtained. The detailed results for the digital libraries considered in this step are reported in Table 3(a). This initial collection was filtered according to the inclusion and exclusion criteria stated in Table 2(a), obtaining a reduced set of 166 papers. Of these, 87 passed the two-step quality assessment (the detailed results of the quality assessment procedure are reported in Figure 2). Applying the forward and backward snowballing (see Section 3.2.4) to the selected 87 papers (i.e., Iteration A), 87 new peer-reviewed papers were identified, of which 47 passed the quality assessment selection.
The snowballing and quality assessment were iterated three more times, collecting a total of 147 (i.e., 87 + 47 + 12 + 1) primary studies. Figure 2 reports, as Iteration B, Iteration C, and Iteration D, the number of new papers obtained and selected at each further snowballing iteration.
After the snowballing terminated, a second automated query was launched, following the procedure presented in Section 3.2.5. As reported in Table 3(b), a total of 31 primary studies was collected from the three databases.
By comparing this list with the whole set of papers already analyzed, we found that 10 papers were already included in the previous selection. We evaluated the remaining 21 papers by reading title, abstract, and keywords and by applying the same Inclusion and Exclusion Criteria defined in Table 2(a). As a result, no additional paper was added to the list of already selected references. Precisely, 20 papers fell into one or more of the following categories: studies not explicitly focusing on testing, editorial contributions, books, and surveys. The last remaining reference was a reprint of a paper already included. This result confirmed that the snowballing process was exhaustive, so we proceeded to the analysis and reporting phase.
The complete list of 147 selected papers is provided at the end of the paper. Figure 3(a) depicts the overall distribution of selected primary studies over the years, whereas Figure 3(b) details the trend for each of the six areas. The overall distribution of primary studies over the six areas is shown in Figure 4(a). In particular, 60 papers were tagged as Test Perspective, 84 as Test Design, 120 as Test Execution, 93 as Test Objective, 51 as Test Evaluation, and 67 as Test Domain. Note that, depending on its content, each paper could be classified in multiple areas, so the histogram in Figure 4(a) (reporting the distributions over the six areas) is not a partition and the sum of papers can be greater than their total number (147). Figure 4(b) depicts the distribution of the primary studies among the three categories. Evidently, most of the primary studies (i.e., 62.59%) are in the TiC category, while less than 28% of the considered papers targeted the problem of testing systems residing in the cloud, and only a minor part (i.e., 9.52%) explicitly refers to testing cloud-based solutions using testing resources on a cloud platform.
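As a small illustration of why the per-area counts sum to more than 147, consider multi-label tagging over a toy set of papers (IDs and tags below are invented):

from collections import Counter

papers = {
    "P1": ["Test Design", "Test Execution"],
    "P2": ["Test Execution", "Test Objective", "Test Domain"],
    "P3": ["Test Perspective"],
}
area_counts = Counter(tag for tags in papers.values() for tag in tags)
print(len(papers))                 # 3 papers ...
print(sum(area_counts.values()))   # ... but 6 area tags in total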

Quantitative Analyses
The following figures, from Figure 5 to Figure 10, depict the breakdown of each area into its specific topics (see Figure 1), distinguishing the three categories of TiC, ToC, and ToiC. We remark that, as in the case of classification by area, multiple tagging by topics was also admitted; thus, the following breakdowns should not be interpreted as partitions.
More in detail, Figure 5 reveals that most of the interest in these primary studies was in describing the challenges raised by cloud testing approaches (i.e., 26). Several primary studies presented concepts (i.e., 12) and potential issues (i.e., 13) of the paradigm. A smaller number of primary studies directly addressed future research directions for cloud testing (i.e., 8), while only a few focused on aspects such as terminology (i.e., 2), technologies (i.e., 1), or motivations (i.e., 1).
Figure 6 highlights that, among the 90 assigned topics in the area of Test Design, the most addressed are Test Generation (i.e., 31) and Test Process (i.e., 16). The topics Test Model (i.e., 8), Test Metrics (i.e., 8), Requirements (i.e., 9), and Test Selection (i.e., 7) were sufficiently covered, while only marginal attention was directed to Test Reduction (i.e., 3), Test Suite Assessment (i.e., 1), Oracle Generation (i.e., 1), Test Specification (i.e., 4), and Test Prioritization (i.e., 2). As a further consideration, a very limited number of primary studies in this area (i.e., 3) specifically target Testing of the Cloud in the Cloud.
The breakdown of the primary studies in Figure 7 confirms the strong interest in Test Execution. Out of the 118 topics expressed in this area, several works present cloud testing tools or services (i.e., 34), testing infrastructures (i.e., 32) or platforms (i.e., 21), and the configuration of cloud instances (i.e., 17). Some works discuss specific testbed setups in/for cloud environments (i.e., 11), while only a few address virtualization (i.e., 2) or scheduling (i.e., 1) aspects.
The analysis of the target goals for cloud testing is given in Figure 8. A good number of primary studies covers functional testing (i.e., 22) but, as expected, most of the effort has been spent on approaches that validate performance attributes (i.e., 50 of the 108 cumulative classifications in the area). Finally, it is interesting to notice that other non-functional objectives are also covered by the analyzed primary studies: security (i.e., 11), elasticity (i.e., 6), reliability (i.e., 6), robustness (i.e., 5), compatibility (i.e., 2), availability (i.e., 1), usability (i.e., 1), and software quality in general (i.e., 5). Even though the resulting set of non-functional attributes is broad, the papers per topic are not many. This result highlights the versatility of cloud testing but conversely evidences that non-functional testing other than performance appears much less mature.
In Figure 9, the primary studies are classified according to how they support the evaluation of activities related to cloud testing. In this area, most of the expressed tags (i.e., 25 of 55) concern means for comparing quality attributes of the software-under-test (SUT) under different conditions (e.g., configuration, deployment, load, etc.). Other topics in Test Evaluation, such as monitoring (i.e., 4), coverage (i.e., 2), and analysis (i.e., 2), received marginal consideration, while both reporting (i.e., 12) and cost assessment (i.e., 10) were investigated much more. In our interpretation, these results reinforce the idea that cloud testing is perceived as the modern promise for quantitative analysis of the SUT. At the same time, considerable attention is given to methods able to feed the outcomes of technological experimentation into proper methodological frameworks.
Finally, Figure 10 reports the classification of the target domains of those primary studies testing a software/application in the cloud. Within this category, the collected data confirm that researchers leverage the cloud mostly for validating web applications (i.e., 19) or software specifically targeted to mobile devices (i.e., 20); nevertheless, a relevant number of works also referred to the cloud as a means for testing SOA solutions (i.e., 14). In addition, the review found a fair number of papers (i.e., 9) addressing specific application domains such as antivirus, earth observation, enterprise applications, gaming, GUI, high-workload data analytics, IoT, or network emulation.

ANSWERING THE RESEARCH QUESTIONS
We now summarize the main results presented in the primary studies, aiming at answering the research questions introduced in Section 3.1. The discussion is structured into three parts related to the three cloud testing categories. The section also includes a brief summary of recent surveys in industry [26] and an analysis of validity threats.

Testing in the Cloud
6.1.1 RQ1.
As already said, the cloud overcomes the limits of traditional test approaches: the ready availability of huge amounts of resources, the possibility to manage big amounts of data, and the availability of flexible, elastic environments make it possible to address testing objectives previously considered infeasible. Among them are the possibility of performing massive combinatorial testing, measuring performance in several (usage) scenarios, and evaluating attributes such as scalability, elasticity, and reliability by scaling resources up and down in the most convenient way.
For these reasons, several functional testing approaches have been moved to the cloud, including model-based testing, also based on formal specifications [13,60]; coverage testing [81,82]; and combinatorial approaches for concurrently testing different configurations on different servers and in any order [139,140,156]. In particular, the Testing-as-a-Service (TaaS) paradigm opens new perspectives for functional testing, specifically in the mobile context (such as References [93,114,166]) or for GUI testing (such as Reference [31]).
The availability of many, relatively cheap resources in the cloud makes performance testing easier; thus, many frameworks have been developed addressing the objective of performance testing [35,58,75,102,103,112,116,153]. However, performance evaluation requires rigorous planning and the setting of specific configurations to maximize test effectiveness. In this direction, we found several research proposals supporting load testing [52,159,160]; analyzing usage scenarios for the generation of performance test cases [131]; focusing on the guarantee of specific service level agreements [148]; targeting the management of large test jobs [92]; and tracking and analyzing huge numbers of events [77].
Elasticity is among the main reasons that make cloud computing an emerging trend. Elasticity testing, in turn, may have different objectives, such as controlling different behaviors, identifying the resources to be (un)allocated, coordinating events in parallel [8,9], and of course targeting scalability [17,30,40,91].
Reliability testing focuses on observing the system under test under its operational usage profile. Proposals focus on measuring the reliability level [59] or on achieving a specific reliability value, also through API testing [39,99,154].
The growth in complexity of pervasive software-intensive systems goes along with increased concern about the security of such systems, especially for mobile applications. Several recent proposals for testing in the cloud include approaches to virtualize, simulate, and discover network attacks, protocol vulnerabilities, and other security concerns [66,98,135,147,161].

RQ2.
Cloud infrastructures are used to provide Testing as a Service (TaaS) following a pay-per-use business policy [48,51,119].
Cloud resources are exploited to achieve test cost savings, scalability, and efficient utilization of test resources, while guaranteeing a quality of service (QoS) level according to a negotiated service level agreement (SLA) [103]. Efficient resource allocation approaches and test scheduling solutions are proposed to maximize the utilization of test resources and balance the load among them [11,44,60]. Moreover, the work in Reference [101] investigates the possibility of using hierarchical virtual machine fork for optimizing the cloud resources in system testing and saving system configuration effort as well as memory requirements by enabling disk sharing between concurrently executing test cases.
Different strategies are proposed to (i) partition the testing tasks; (ii) allocate them to different cloud processors for concurrent execution; and (iii) collect the results. Some proposals focus on task decomposition methods and task scheduling algorithms to decrease the testing time [90,92,93]. The goal is to balance the number of test cases or test suites in each decomposed task, or the execution time required to perform each task [90].
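As an illustration of this balancing idea (not the specific algorithms of the cited works), a greedy longest-processing-time partition of test suites over cloud workers can be sketched as follows; suite names and durations are invented.

import heapq

def partition(test_suites, n_workers):
    # test_suites: list of (name, estimated_seconds); one bin per worker.
    bins = [(0.0, i, []) for i in range(n_workers)]   # (load, worker id, tasks)
    heapq.heapify(bins)
    for name, secs in sorted(test_suites, key=lambda t: -t[1]):
        load, i, tasks = heapq.heappop(bins)          # least-loaded worker
        tasks.append(name)
        heapq.heappush(bins, (load + secs, i, tasks))
    return [(load, tasks) for load, _, tasks in bins]

suites = [("ui", 120), ("api", 90), ("db", 60), ("perf", 300), ("unit", 30)]
for load, tasks in partition(suites, 2):
    print(load, tasks)    # two tasks with roughly balanced total duration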
By leveraging huge computing resources, cloud testing allows for large-scale combinatorial testing that was not possible in traditional test systems. Large numbers of processors are used to perform parallel test executions and identify faulty interactions through concurrent test algebra execution and analysis [139,156]. For instance, in the largest combinatorial testing experiment presented in References [141,145], all 2-wise to 6-wise configurations with 2^50 components are analyzed.
Cloud-based testing infrastructures are also proposed to efficiently perform interoperability and compatibility testing. For example, the work in Reference [36] validates the interoperability among SOA-enabled systems by checking the compliance of their communication protocols and the types of exchanged messages, whereas the authors of Reference [93] propose an approach to partition compatibility test suites into concurrent testing tasks that are executed on a set of Android devices.
Dynamic resource adaptation strategies are defined to manage cloud resources by dynamically adding or removing virtual machines based on the workload of the cloud testing platform and the number of available devices [91]. The aim is to balance the workload among several similar devices to improve their usage and decrease the testing time.
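A minimal sketch of such an adaptation rule follows; the per-VM capacity and pool limits are invented thresholds, not values from Reference [91].

def adapt_pool(pending_tests, active_vms, per_vm_capacity=20,
               min_vms=1, max_vms=50):
    # Decide how many VMs the testing platform should run for the
    # current backlog of pending tests (ceiling division for the need).
    needed = max(min_vms, -(-pending_tests // per_vm_capacity))
    target = min(max_vms, needed)
    if target > active_vms:
        return "scale up: start %d VM(s)" % (target - active_vms)
    if target < active_vms:
        return "scale down: stop %d VM(s)" % (active_vms - target)
    return "steady: keep the current pool"

print(adapt_pool(pending_tests=130, active_vms=4))  # scale up: start 3 VM(s)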
The computation power of the cloud is leveraged to scale up fuzz testing of Android applications [98] along the dimensions of code size and test case number. Platforms such as Cloud Crawler [35] allow cloud testers to better control the costs of cloud configuration and resource allocation (e.g., by shutting down the VM after each individual test).
Finally, cloud resources are used in computationally heavy testing techniques such as search-based software testing using genetic algorithms. MapReduce is the most used model to process distributed data on multiple computers [40,69,129]. The goal is to exploit easy-to-use parallelization mechanisms for enlarging the solution spaces with respect to sequential search-based techniques and achieving higher efficiency and scalability, thus improving the cost-effectiveness of these approaches [40].
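The map/reduce pattern underlying these approaches can be illustrated with Python's own multiprocessing pool; the fitness function below is a toy placeholder, not the genetic algorithms of the cited works.

from multiprocessing import Pool

def fitness(candidate):
    # Placeholder: in search-based testing this would run the SUT on the
    # candidate input and measure, e.g., the achieved branch coverage.
    return sum(candidate) % 97, candidate

if __name__ == "__main__":
    population = [[i, i * 3, i * 7] for i in range(1000)]
    with Pool() as pool:
        scored = pool.map(fitness, population)   # "map": parallel evaluation
    best = max(scored)                           # "reduce": keep the fittest
    print(best)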
6.1.3 RQ3.
As also depicted in Figure 6, many techniques and tools address test generation. In this case, parallelization is used to mitigate the data values explosion problem [84]. Specifically, the Apache Hadoop MapReduce paradigm is used to support the parallelization [29,40,69] of test data generation techniques such as genetic algorithms [40]. Parallel concolic execution [31] and symbolic execution algorithms [10] have been defined for generating test cases, trying to overcome the path explosion problem by distributing the computation tasks over different workers on private as well as public clouds. Cloud9 is "an automated testing platform that employs parallelization to scale symbolic execution by harnessing the resources of commodity clusters" [27].
Model-based testing makes it possible to generate a high number of test cases to be executed on the cloud, starting from an abstraction of the SUT. Different model-based testing frameworks have been developed [12,81,82,98], such as AUTOMATIC, which derives many different QoS configurations to be tested in the cloud [12]; ATCloud, which generates test cases based on API models [154]; and the proposal of Reference [82], which specifically addresses functional testing for composed services. MIDAS [58] is a model-based scalable testing platform leveraging a Domain Specific Language (DSL) based on the Unified Modeling Language (UML) and the UML Testing Profile (UTP). EvoDroid [99] is another model-based framework that analyzes the source code of an app and automatically extracts both a behavioral model and the APIs of externally referred apps. EvoDroid then exploits these models to automatically generate the tests, which are executed concurrently in the cloud on several Android emulators.
Different testing frameworks leverage combinatorial testing techniques and use test algebra and adaptive test configuration generation algorithms that identify faulty interactions [140,142,144-146]. In particular, the test results produced by different processors are combined thanks to test algebra rules that identify those interactions that do not need to be tested. Different solutions address the problem of identifying the configurations to be deployed and tested on a cloud platform, with the aim of reducing their number and saving testing effort [12,38,141,145,146,154].
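To give a flavor of how combinatorial generation reduces the number of configurations to be tested, the following naive greedy sketch builds a 2-wise (pairwise) covering suite; the cited works rely on test algebra and adaptive algorithms instead, and the parameters below are invented.

from itertools import combinations, product

def pairwise_suite(parameters):
    # Greedy construction of a pairwise covering test suite: every value
    # pair of every parameter pair is covered at least once.
    names = list(parameters)
    uncovered = {((p1, v1), (p2, v2))
                 for p1, p2 in combinations(names, 2)
                 for v1 in parameters[p1]
                 for v2 in parameters[p2]}
    suite = []
    while uncovered:
        best, best_cov = None, -1
        for values in product(*(parameters[n] for n in names)):
            config = dict(zip(names, values))
            cov = sum(1 for (a, b) in uncovered
                      if config[a[0]] == a[1] and config[b[0]] == b[1])
            if cov > best_cov:
                best, best_cov = config, cov
        suite.append(best)
        uncovered = {(a, b) for (a, b) in uncovered
                     if not (best[a[0]] == a[1] and best[b[0]] == b[1])}
    return suite

configs = pairwise_suite({"os": ["linux", "windows"],
                          "db": ["mysql", "postgres", "mongo"],
                          "region": ["eu", "us"]})
print(len(configs))   # typically 6-7 configurations instead of 12 exhaustive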
Many approaches deal with the architectural design of cloud-based testing infrastructures [11,84,129,133,134]. Some of them are tailored to specific application domains, including mobile and web applications. In the context of mobile testing, papers [134] and [133] present the design and implementation of cloud-based infrastructures as a service (known as MTaaS), trying to address the most important issues of testing mobile applications, whereas the work in Reference [129] presents the architecture of a scalable platform for cloud testing of mobile systems that allows adding new testing functionalities such as non-functional testing or test planning.
Concerning web application testing, several works [11,107,112,151] describe the architecture of testing services for the analysis of web applications or web service compositions. An open and extensible cloud-based testing platform is MIDAS [58,60], supporting functional, security, and usage-based testing of service orchestrations. The authors of Reference [36] propose a cloud-based multi-layer architecture for interoperability tests among distributed automation systems, enabling configurable compliance testing from the protocol to the system level. The authors of Reference [17] present the architecture of Vee@Cloud, which "serves as a scalable virtual test lab built upon cloud infrastructure services." Its resource manager allocates Virtual Machine instances and deploys test tasks from a pool of available resources across different clouds. Another platform is Cloud Crawler [35], which provides a declarative language supporting the description of many different performance evaluation scenarios to be executed in the cloud.
Many solutions present frameworks and tools implementing the TaaS model [44,69,85,103,116,121,147,153]. The common features of these tools are automated tests executed in parallel, computation scaling, and configuration test setup. Many of them target mobile applications mainly deployed on Android devices [56,63,64,99,114,121,153,157,166], with the goal of performing performance and compatibility testing. They share features such as test script generation, configuration of real or emulated devices on the cloud, automatic execution of the distributed tests, and test report generation including error location and error snapshots. Fewer solutions propose frameworks and tools addressing other domains, such as GUI testing [31], security testing [135,147], stress testing [69], web browser testing [57], or performance testing [103].
Different proposals focus on the specification of cloud-based testing processes. Some of them define common process steps, e.g., selecting the types of testing to be executed, executing the test scripts, and reporting the test results [153], whereas others define specific steps for cloud-based parallel test execution [92], such as specifying the test jobs and test deadlines, determining the number of virtual machines, computing test time/cost, partitioning the tasks based on the specified strategy, and merging the test results. A specific test process is defined for mobile applications in References [51] and [49]: it specifies as main steps unit and integration testing, tenant-based functional and QoS testing, as well as continuous testing and testing of specific features of mobile systems. Another test process [58,59,112] specifically focuses on SOA applications, identifying as main steps SOA monitoring, usage profile inference, test data repository creation, test model definition, and test generation and execution. Finally, the work in Reference [81] shows that testing is an important part of the service life-cycle and proposes a model to "describe systematically the relevant processes governing services in the context of cloud brokerage."
Some papers address testbed setup [1,30]. Specifically, the work in Reference [30] proposes a testbed implementing sCloud, which adaptively allocates the available resources to heterogeneous workloads in distributed data-centers, taking into account QoS requirements as well as real green power and workload traces. The work in Reference [100] provides a test environment where mobile applications can be tested on different smart devices and mobile platforms, providing more realistic results than emulators. New objectives of benchmarking in the cloud are addressed in Reference [42], whereas the work in Reference [1] proposes new benchmarking solutions where controlled experiments are run on several clouds sharing a common orchestrator interface, and several multi-tiered applications are deployed fully automatically. Finally, an open-source and extensible testbed is PHINet [123], which supports the development, testing, and analysis of Health-IoT in the cloud: it enables running experiments under live traffic with various health sensors and allows users to control data acquisition and delivery.
6.1.4 RQ4.
Several cloud testing solutions considered in this survey include a persistence layer responsible for storing historical data about past test executions or applied configurations (e.g., References [12,40,60,82,93,107,129,140,154]).
In some cases, such a layer is wrapped by dedicated software components/modules offering a finer perspective about these data, rather than considering them as a mere collection of informative test reports. For example, we found that such components can enable the comparison of the quality attributes for a considered SUT under different conditions [1,20,77,116,131].
Among others, CloudPerf [103] integrates a native and extensible reporting infrastructure offering both rule-based querying and statistics providers. The paper also stresses the importance of properly controlling the format of the reports, for example by means of custom XML templates.
The framework in Reference [153] supports developers and testers of mobile applications in assessing the quality of their solutions on several mobile devices. To this end, it includes a web-based reporting front-end enabling the comparison of the results obtained during the test executions.
The evaluation of test results in References [159] and [160] is addressed by means of a reporting infrastructure that periodically retrieves, organizes, and stores the data produced by the allocated testing tasks. The main objective of such an infrastructure is to analyze and compare the impact of load testing on a target Web Service.
In Reference [52], test results can be evaluated by means of a set of dedicated components that are responsible for gathering quality metrics (e.g., throughput, response time, etc.), for displaying online comparison charts or for exporting test reports. Those components are exposed both as RPC interfaces and by means of a presentation layer where a set of tenants can manage test result information.
The Cloud Crawler platform [35] collects and stores the performance results from each executed test scenario and aggregates them in a spreadsheet document that can be used to compare the performance of alternative configurations or the impact of different workloads.
In Reference [85], the authors enrich test reports with structural coverage information about the source code achieved during the test sessions, which can be continually used during code development, but also as a means for structuring accounting of billing policies.
Live monitoring of test executions is also considered as a key feature for the evaluation of test results in References [103], [107], and [160]. The reason is to anticipate troubleshooting of issues, rather than waiting for the completion of a testing session.
Test results in cloud testing are also evaluated in terms of test costs, and the works in References [170], [29], and [85] stress the importance of a predictive costing model. Different business models for cloud testing have been proposed: for example, there are approaches addressing cost reduction by spreading common costs across parallel test case executions [38] or multiple tenants/renters [68], or by implementing the pay-as-you-test model [51]. The latter framework explicitly includes components for accounting and billing [51].
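As a toy illustration of such cost sharing (rates and amounts invented, not taken from the cited works):

def tenant_bill(vm_hours, n_tenants, vm_rate=0.10, shared_setup=12.0):
    # Fixed setup costs are split across tenants; VM time is billed per use.
    return shared_setup / n_tenants + vm_hours * vm_rate

# One tenant running 8 VM-hours in a 4-tenant test campaign:
print(round(tenant_bill(vm_hours=8, n_tenants=4), 2))   # 3.8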

RQ5.
Many primary studies transfer approaches, methods, and infrastructures conceived for traditional testing to the cloud [27]. For instance, the work in Reference [85] migrates concolic test generation, the work in Reference [40] focuses on the use of genetic algorithms for test data generation, the work in Reference [131] proposes usage scenarios for the formulation of performance test cases, the work in Reference [102] targets the use of design patterns for performance testing, and the work in Reference [60] proposes black-box and gray-box techniques for test input and oracle generation. In addition, several papers provide tutorials and informal surveys in the context of cloud testing, such as References [49], [68], and [119].
The rapid advance of cloud computing, with its deep impact on mobile and web application development, raises interesting open problems that span the whole development process, from requirements definition up to the final release [108,170]. In this context, different works provide informative discussions and possible solutions about the issues of testing-as-a-service (TaaS) [48,79,81,82,129,132].
In turn, test case specification and generation [29], and the definition of the most appropriate configurations to be tested [39,140,156], are still crucial questions in the cloud context. Some authors target issues related to frameworks for parallel test execution [116], problems of effective and efficient allocation of the available resources [30,42,52,92], and specific testing aspects such as security [66,161] or Android devices and applications [99].
The works in References [49] and [108] discuss perspectives related to migrating software testing to the cloud, including: (i) the availability of an elastic environment, i.e., the ability to dynamically scale testing resources up and down as needed [30]; such an environment may address different functional and non-functional aspects [103] and test monitoring facilities [8,77]; (ii) the speed-up of the testing activity, i.e., the possibility offered by the cloud to execute different groups of tests in parallel or to share and reuse testing resources [28,60,129]; (iii) self-service testing, i.e., the possibility to freely select or customize tools and platforms, configure the test settings, or remotely launch testing [29,68]; and finally, (iv) the emulation of real-world scenarios [51].
Important future directions in cloud testing include the possibility to develop and validate mobile and SaaS applications on the mobile web [51,112] and the availability of frameworks for measuring and certifying performance, quality, applicability, and usability in real-world scenarios [13].
As emerged from this investigation, research on testing in the cloud is very active. Approaches, methods, and infrastructures are continuously evolving to tackle both well-known and new issues, perspectives, and future directions, especially in mobile or web application development. Moreover, the possibility to validate the SUT under various configurations and different conditions allows testers to better assess and certify performance and quality attributes [36].
6.1.6 RQ6.
From a detailed analysis of the papers labeled under the topic "web application," several types of cloud-based testing services/architectures can be distinguished. Some approaches [11,69,116,151,160] propose a TaaS architecture addressing non-functional requirements (e.g., performance [11,151], stress [69] and load [160] testing, scalability [151], security [66], etc.). Among the others, the work in Reference [116] proposes an abstraction framework enabling the parallel execution of tests for a web application on a local development workstation, as both the tests and the application are deployed in the cloud. It provides faster test feedback to developers, who can seamlessly apply the same operations both locally and on the cloud without having to care about deployment.
Some works focus on security aspects: the work in Reference [147] validates remote web applications by means of cloud scanners, and the one in Reference [13] formalizes an adaptive assurance technique based on online certification, foreseeing a certification authority regulating the on-line certification processes, and their related trust model for the cloud. The overall perspective is to enable chains of trust supported by the verifiable (non-)functional properties of cloud-based services.
An interesting perspective about the lock-in problem for users of TaaS platforms and services is discussed in Reference [39]. Finally, the authors of Reference [57] propose an approach aiming to detect potential cross-browser incompatibilities within web applications, impersonating users accessing a web application from different browsers.
Compatibility testing [121] is one of the most investigated areas within the mobile application domain [132]. A contribution common to the papers on this topic is checking the same application running on several kinds of mobile devices emulated in the cloud. Similarly, in Reference [133] and its related papers (e.g., References [51,134]), a comprehensive TaaS system is conceived, aiming to validate complex mobile scenarios (i.e., MTaaS). According to Reference [133], MTaaS includes both an IaaS and a PaaS layer. The reference implementation of the MTaaS-IaaS supports the resource provisioning, monitoring, and billing services.
Several works, less ambitious than MTaaS, aim to validate the functional behavior of one or more applications when running on different target devices, possibly under different configurations. Specifically, the work in Reference [166] proposes a TaaS platform enabling the automatic generation of functional tests for mobile devices, which are then launched over several kinds of mobile devices; the one in Reference [153] proposes an approach that conforms to a set of international testing criteria; and the works in References [93], [63], and [56] propose architectures that improve the efficiency of compatibility testing on mobile devices.
Concerning security, the work in Reference [135] presents an automated approach for security testing of software in mobile phones: the authors adopted the full virtualization technology (i.e., KVM) to easily emulate terminals in the cloud. Each device hosts actual applications that are the target of vulnerability scanning frameworks enabled in the platform.
Then, there are works for mobile testing that are actually agnostic to any specific paradigm [98,99]; however, their authors explicitly validated their approaches by leveraging the cloud paradigm, motivated by the several orders of magnitude improvement in execution time achievable by running tests in parallel on device emulators deployed on-demand.
Another considerable set of primary studies proposes a TaaS architecture for SOA (e.g., References [112,113]). Among them, the papers in References [58], [60], and [59] belong to a series of works that perform SOA testing leveraging the already mentioned MIDAS platform [18,37].
Methodological support is given in Reference [81], which presents an approach for functional testing based on a Service Lifecycle Model. The contribution aims to "support providers during the service engineering phase, and consumers during the operation phase." Also, the papers in References [139,140,142,143,156], and [144] form a series of methodological works leveraging an algebraic approach for testing SaaS.
An example of SOA testing of non-functional attributes in the cloud is given in Reference [148]. Here, the authors validate proposed SLAs (e.g., levels of availability, performance, reliability, and other attributes) against the implementation of software services. The framework runs, in the cloud, sets of test cases designed according to a prescribed quality model.
Papers tagged as Cloud Infrastructural Applications include works aiming to test specific applications that could be used as building blocks for some cloud solutions. Under this topic, the work in Reference [103] presents a performance testing framework purposely designed for multi-tenant dynamic environments; the one in Reference [52] migrates an existing load testing tool (Bench4Q) to the cloud; the one in Reference [9] proposes an approach to reproduce elasticity testing in a deterministic manner; and the one in Reference [75] presents a code generation framework for automated configuration and performance testing in several alternative scenarios.
Concerning Enterprise Application Software, the authors of Reference [119] investigate the adoption of cloud testing in practice by SMEs and propose a structured approach for adopting cloud testing. Nevertheless, from a practical perspective, each enterprise has different applications that can be tested in multiple different ways. The frameworks in References [102] and [28] abstract most of the concepts of Enterprise Application testing and discuss various solutions for leveraging cloud testing of non-functional properties (e.g., performance [102], elasticity, and reliability [28]).

Testing of the Cloud
RQ1.
Beyond performance, different other objectives drive software testing of the cloud, although in light of the survey results these objectives appear secondary in comparison with performance. They are listed as follows, from highest to lowest order of appearance: (i) Functional aspects, reported in five studies [25,45,80,128,137]; (ii) Security, reported in three studies [21,105,163]; (iii) Elasticity, reported in two studies [5,62]; and (iv) Reliability, reported in one study [104].

RQ2.
Cloud resources are used in different ways in the selected studies. The work in Reference [124] presents an approach to load generation for online testing on the cloud. The work in Reference [33] presents a method for robustness testing of IaaS cloud platforms: test cases are generated by leveraging all the combinations of input and state levels, applying various constraints. The authors of Reference [6] create test sequences for detecting configurations that degrade the performance of cloud-based software systems.
Another significant number of studies use cloud resources in different manners with the aim of assisting in the testing process. For instance, the work in Reference [105] presents an evaluation of different encryption algorithms (RC4, RC6, MARS, AES, DES, 3DES, Two-Fish, and Blowfish) on both a desktop computer and Amazon EC2. The works in References [80] and [67] present BonFIRE, a multi-site testbed exposing cloud resources across different sites via a web portal. Among other features, BonFIRE makes it possible to create custom network configurations at scale on top of cloud infrastructures. The work in Reference [128] proposes a testing methodology, together with a tool (Elvior TestCast T3, TTCN-3), for automating use-case testing.
Testing metrics are also widely used, see, e.g., the study in Reference [167], which presents a cloud framework for anomaly detection called eCAD. This tool internally uses an evolutionary data-clustering algorithm called DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to detect cloud anomalies, monitoring different performance indicators, such as CPU, memory, or input/output. Another example of the use of metrics is in Reference [70], in which the authors present a generic cloud performance model for assessing IaaS, PaaS, SaaS, and mashup or hybrid clouds. This work uses different performance metrics, such as speedup, efficiency, latency, or bandwidth. Then, the work in Reference [86] proposes a resource monitoring and management service for OpenStack-based cloud testing platforms for Android devices, using metrics such as average time or number of unit tests in their experiments. Finally, Reference [62] proposes a framework to evaluate IaaS elasticity on mixed workloads. To that aim, the authors designed a workload using different patterns for the infrastructure cloud and then derived aggregated performance metrics for elasticity, including resource, service, and cost aspects.

RQ3.
The test methods, techniques, and tools used for testing of the cloud range over a rich variety of approaches. To start with, different types of cloud configuration are available: for instance, the work in Reference [2] analyzes the scalability of the NoSQL database Cassandra using the Yahoo Cloud Serving Benchmark. The authors conclude that scaling the number of nodes does not guarantee an improvement on the performance, even for large data-sets. The work in Reference [138] proposes an adaptive combinatorial testing of SaaS multi-tenant applications based on an Adaptive Reasoning (AR) algorithm using a small subset of the cloud configurations. The work in Reference [7] proposes an approach that controls the resource variations when providing elasticity to web applications in the cloud and is validated using Amazon EC2. The work in Reference [97] introduces a cloud simulation tool, called CloudAnalyzer, aimed at scaling applications deployed in the cloud under different configurations. The tool provides multi-threading and database support, as well as an algorithm editor window.
Cloud testing tools and services are presented in several papers. The authors of Reference [89] present a tool to derive test cases from a formal specification expressed as a DSXM model. The authors of Reference [45] showcase AUToCLES, a TaaS tool for cloud-based elastic systems: from a JOpera (i.e., a visual composition language) specification of the SUT and a set of test cases defined using JMeter, AUToCLES instantiates the SUT, configures the testing environment, generates test input data, and executes the test cases. The authors of Reference [163] propose a model-based and change-driven solution for security testing: the approach identifies possible intrusion points by applying malicious users' techniques to the system interface. The work in Reference [127] presents CloudBench, an open-source IaaS cloud benchmark that is compatible with multiple cloud providers, such as Amazon EC2, OpenStack, or Google Compute Engine. The one in Reference [34] presents Cloud Crawler, a declarative environment for the description and execution of automated application performance tests. To that aim, a DSL (called Crawl) is presented, supporting the description of many different IaaS performance evaluation scenarios.
Testbed setup is addressed in some studies. For instance, the authors of Reference [149] present the C-MART benchmark, which is able to emulate modern cloud web applications. The tool can generate workloads emulating access to the website, and QoS performance measurement based on response time is proposed. The work in Reference [3] presents aDock, a set of tools for creating sandboxes of cloud environments (based on OpenStack and Docker) that expose varying performance and configurable properties.

RQ4.
In terms of test results evaluation, performance is again the most influential quality attribute considered by practitioners. This is in line with the results for RQ1, in which performance was also a key motivation to carry out testing of the cloud. In the analyzed studies, a common factor is the evaluation of some performance indicators (such as response or execution time, among others), e.g., comparing the results in terms of number of cloud nodes [2], cloud provider [70], type of cloud nodes (e.g., small, medium, large, etc.) [55,70], type of SUT [74], desktop vs. cloud [105], or number of virtual users [158].
Another aspect of test results for applications deployed in the cloud is the testing cost. For instance, Reference [104] presents the comparison of costs and footprint among different cloud configurations. In the same way, Reference [95] studies the impact on testing costs of sequential vs. parallel execution on cloud infrastructures.
The data generated by tests are often gathered as test reports. These results are typically generated by testing frameworks or tools [95,104,137]. Similarly, test coverage is another way of evaluating test results, as shown in Reference [94].

RQ5.
Different studies report research directions for testing of the cloud. For instance, the work in Reference [25] presents a formal model for validating firewall configurations, including packet filtering and NAT features. This approach has been implemented as a test framework that derives test cases generated on the basis of the formal model. Reference [137] explains "why using state-of-the-art model-driven engineering (MDE) and model-based testing (MBT) tools is not adequate for testing uncertainties of cyber-physical systems in IoT cloud infrastructures." The authors of Reference [104] explore the use of cloud technologies for debugging. The authors of Reference [46] introduce new test approaches for elastic systems, mapping metaphors from elastic materials (such as deformation, plasticity, necking, or fatigue) to analogous behaviors of elastic computing systems. The authors conclude that further research (from requirements engineering to programming languages to maintenance) is needed to test elastic systems properly.
A number of papers report different types of issues. Some issues raised in Reference [2] have already been introduced in Section 6.2.3 while discussing RQ3. The work in Reference [55] presents an approach to support application capacity planning in IaaS clouds. It assumes that the application performance under defined configurations and workloads can be inferred from the resource configuration provided by the IaaS, using the total response time as the performance metric. The authors of Reference [87] propose key software architectural drivers for cloud testing. These drivers are divided into two groups: (i) traditional test management environment (non-cloud testing techniques, non-cloud testing methodologies, and non-cloud standards such as ISO/IEC 25000 and IEEE Std 829-2008); and (ii) cloud test management environment (migration strategies, testing techniques and relevant factors, cloud infrastructure and architectural principles, standards, and collaboration and best practices). The work in Reference [74] evaluates some open-source performance testing tools (Apache JMeter, Load Focus, Nouvula) on the Flipkart, Snapdeal, and Amazon online shopping websites. The work in Reference [65] proposes a lightweight test algorithm that can check whether a cloud provider meets the agreed SLA in terms of CPU speed.
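Reference [65] is not reproduced here; purely as an illustration of such a lightweight SLA check, one could benchmark the VM's floating-point throughput and compare it with a hypothetical agreed bound (the threshold and workload are invented, and a pure-Python loop gives only a crude estimate):

```python
import time

def measured_mflops(iterations: int = 10_000_000) -> float:
    """Crude floating-point throughput estimate for the current VM (MFLOPS)."""
    start = time.perf_counter()
    acc = 0.0
    for i in range(iterations):
        acc += i * 1.000001          # one multiply and one add per iteration
    elapsed = time.perf_counter() - start
    return 2 * iterations / elapsed / 1e6

SLA_MFLOPS = 50.0                    # hypothetical agreed lower bound
observed = measured_mflops()
print(f"observed {observed:.1f} MFLOPS:",
      "meets SLA" if observed >= SLA_MFLOPS else "SLA violated")
```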

Testing of the Cloud in the Cloud
The main objectives for moving software testing of the cloud in the cloud are performance [41,130] and functional [19,136] motivations. Security is the target of References [88] and [109], while Reference [50] covers multiple aspects: performance, elasticity, robustness, and reliability.

RQ2.
In Reference [165], stub cloud models are used for test generation aimed at achieving high structural coverage for cloud-based applications. These models allow simulating real environment conditions through fake stubs that provide user-defined return values.
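As a minimal sketch of this stubbing idea (names and values are hypothetical; Reference [165] defines its own models), user-defined return values can drive environment-dependent branches:

```python
from unittest.mock import Mock

# Replace the real cloud service with a fake whose return values are chosen
# by the tester, so that environment-dependent branches become reachable.
storage = Mock()
storage.read.return_value = b"cached blob"                  # happy path
storage.write.side_effect = TimeoutError("simulated cloud outage")

def app_logic(store):
    data = store.read("key")
    try:
        store.write("key", data)
        return "ok"
    except TimeoutError:
        return "degraded"        # branch only reachable via the stubbed outage

assert app_logic(storage) == "degraded"
```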
A test case reduction approach is presented in Reference [19]. In this work, a validation method called Context-Assisted Test Case Reduction (CATCR) for cloud-based systems is proposed: the approach assesses cloud-based applications taking into account geographical context information, i.e., the relative importance of the test cases with respect to their geographical cloud location. The work in Reference [136] presents a high-level DSL to define the deployment process and resource requirements of a software system. This DSL is later transformed into a set of deployment and instantiation scripts for different cloud providers. Another approach in a similar context can be found in Reference [101], which provides an overview of the configurable Chameleon Cloud testbed for researchers.
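Purely for illustration, a context-weighted reduction step in the spirit of CATCR could be sketched as follows (regions, weights, and the threshold are all invented and not taken from Reference [19]):

```python
# Keep only the test cases whose score, derived from the geographical context
# of the cloud location they exercise, exceeds a budgeted threshold.
REGION_WEIGHT = {"eu-west": 1.0, "us-east": 0.8, "ap-south": 0.4}

test_cases = [
    {"name": "t1", "region": "eu-west", "priority": 0.9},
    {"name": "t2", "region": "ap-south", "priority": 0.5},
    {"name": "t3", "region": "us-east", "priority": 0.7},
]

def context_score(tc: dict) -> float:
    return tc["priority"] * REGION_WEIGHT[tc["region"]]

reduced = [tc["name"]
           for tc in sorted(test_cases, key=context_score, reverse=True)
           if context_score(tc) >= 0.5]
print(reduced)  # ['t1', 't3'] under these made-up weights
```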

RQ3.
Cloud configuration is addressed in Reference [130]: this paper studies the performance of Eucalyptus and OpenStack in terms of number of VMs and launch time.
Cloud testing tools and services are addressed in different studies. For instance, the authors of Reference [50] propose CTaaS, a cloud-based TaaS environment aimed at supporting SaaS performance and scalability testing. The authors of Reference [41] present PEESOS-Cloud, an architecture for conducting experiments on services driven by the characteristics of the workloads. Reference [43] presents an API called IoTCloud that allows developers to create scalable high-performance IoT applications. Finally, Reference [37] discusses the cloud-based software architecture of the MIDAS project on Amazon AWS.
Testing infrastructure aspects are addressed in different ways. The work in Reference [19] introduces a validation method called Context-Assisted Test Case Reduction (CATCR) that supports test reduction based on the context. Reference [72] proposes a cloud framework called PCTF that enables the integration of different independent test components.

RQ5.
Research objectives in ToiC are addressed in several papers, such as References [37,88,109,165]. A group of papers describes basic research concepts [50,136]. The work in Reference [125] provides a comprehensive cloud testing overview covering the relevant concepts, issues, benefits, and goals. Finally, Reference [19] discusses future research directions.

Industry Surveys
While this survey focuses on the scientific literature on cloud testing, it may be interesting to also consider industrial trends and perspectives. A review of cloud testing approaches, tools, and objectives in industry would, however, entail a quite different research methodology, such as a survey of gray literature or, better, directly interviewing practitioners and managers. Performing such a type of study goes beyond the scope and extent of this work, but for the sake of completeness, we provide below a short summary of the results from existing studies.
Riungu-Kalliosaari and coauthors [120] have recently conducted an extensive and well-structured survey among practitioners investigating how and why industrial software testing is moved to the cloud. Their study was based on semi-structured interviews and involved 35 respondents from 20 organizations. The results revealed that industry has a high demand for testing resources, so the main motivation for cloud testing adoption is improving the cost-effectiveness of the testing process. In line with our findings, the study highlighted that the cloud is mainly used for performance and scalability testing. Practitioners find that a cloud-based environment enables CPU-intensive tasks as well as multi-platform and crowd-sourced testing, and brings the benefit of reduced maintenance effort, while security is perceived as a risk.
We also consulted the latest report by Capgemini [26], published with Sogeti and Micro Focus, which is considered one of the most important sources of information about current practices and future trends in software quality in the industry. The research study involves 1,660 IT executives from different companies around the world.
The report highlights three main objectives for the coming years: intelligent test automation and smart analytics, smart test platforms, and agile organization of the Quality Assurance (QA) and test function. All three objectives involve cloud infrastructures in one way or another. Intelligent test automation and test analytics incorporate machine learning into test execution and reporting. Smart test platforms leverage cloud resources to provide test environments and tools, including self-remediation approaches. Agile organization requires managing several testing environments at different points in the software life-cycle, something that the huge amount of resources available in the cloud can provide. Indeed, according to the report, 73% of organizations are already using environments deployed in the cloud, and 15% are using containers.
In relation to testing cloud-based applications, the report points out several approaches. Around 63% of organizations mainly do performance testing, whereas security testing is mentioned by 62% of respondents. Assessing peak load requirements is another common testing scenario, approached by 57% of the organizations. Finally, 33% of respondents do not use any specific approach to test cloud-based applications.
One of the domain areas mentioned in many of the selected primary studies in our survey is IoT. When we look at the status of IoT testing in the industry, the numbers lag well behind those of web or mobile testing. Only 32% of respondents with IoT products have a mature IoT testing environment, and around 51% of companies working on IoT products do not yet have a testing environment, although some of them are planning to invest in one. This is one of the areas where researchers and industry are aligned in searching for better and more mature solutions.
In conclusion, we see several points of agreement between the trends in research and practice, although it is desirable for a tighter collaboration to be established between the two worlds [53].

Threats to Validity
This section discusses threats to the internal, construct, and external validity of our empirical study. Internal validity is concerned with the confidence in the reported results, and in this study the following aspects can be considered:
Authors' expertise. Our own expertise may have influenced paper selection and classification. In particular, the first step of the screening against quality criteria has been performed by only one author (randomly selected) and might have produced wrong exclusions (false negatives). To reduce this risk, the selection was articulated along several phases adopting a conservative approach, as described in Section 3. Only those papers receiving a very low score have been excluded, whereas for all the others considered of acceptable quality, an additional screening was performed by a second author. Concerning paper classification, along the process, we held several meetings of all the authors in which we compared and aligned the respective assignments.
Framework definition and adoption. We contributed to both the definition of the classification framework and its usage for paper classification. This is an unavoidable threat of these studies. However, we make available the classification data to allow other researchers to evaluate the validity of results.
Framework inclusiveness. The adopted classification framework might not be inclusive of all areas and topics characterizing cloud testing research. To overcome this risk, the framework has been derived incrementally. As described, starting from a first draft framework obtained by reading only titles, abstracts, and keywords, new subtopics have been added within each area after reading the whole paper.
Construct validity includes those threats concerning the correspondence between the measures utilized and the properties they are meant to capture. In our study, the following threats can be identified:
Identification of primary studies. To identify the primary studies of this survey, we defined a search string. A different search string might have produced different results. While this is an intrinsic threat of all systematic surveys, we tried to mitigate it by defining a very general search string so as to be as comprehensive as possible. Another threat related to the identification of primary studies is due to the considered digital libraries. We initially used three very popular libraries, but it is likely that searching other sources might have produced different results. This risk has been mitigated by using several iterations of backward and forward snowballing as a search procedure complementing the search in digital libraries. Finally, we also performed a verification by launching a second automated search on three more libraries, and the results confirmed the validity of the search results. Thus, we see it as very unlikely that we have missed relevant studies.
Selection of primary studies. Exclusion or inclusion of papers was made by first reading title, abstract, and keywords according to defined inclusion and exclusion criteria and then reading the whole paper according to a two-step quality assessment procedure. There is the possibility that papers have been missed due to the defined inclusion/exclusion criteria or to the defined quality assessment checklist of the above selection procedure. However, for defining this selection procedure, we followed the guidelines for systematic reviews in software engineering [24,83] as well as the selection procedure followed in similar studies [73].
Finally, external validity threats descend from potential issues preventing the generalization of results. In our study, only a subset of papers concerning cloud testing research has been targeted, i.e., only papers published in the period from 2012 to 2017. Therefore, the results might not fully represent the overall research in the field. We believe the risk is low, because cloud testing is a young research topic and thus recent years likely include the most relevant advances in the field. A related threat is that the automated search was performed on April 26, 2017: as the field is growing fast, a later search would clearly find more recent papers that were not yet included in the digital libraries. This threat has been partially mitigated by forward snowballing, which allowed us to include many additional papers citing the primary studies with publication year in the range 2012-2017, i.e., spanning the whole of 2017.

Table 4. Summary of the main findings of the survey, classified along the six RQs (TiC = testing in the cloud, ToC = testing of the cloud, ToiC = testing of the cloud in the cloud).

RQ1: main objectives
  TiC:  Using huge cloud resources to overcome practical limits of testing; assessing performance, elasticity, reliability, and security.
  ToC:  Mostly assessing performance attributes such as execution time or latency, but also other aspects such as elasticity or security.
  ToiC: Mostly assessing performance attributes and security, but also functional aspects of traditional applications.

RQ2: resources exploited
  TiC:  Allocating tasks and computing resources dynamically for efficiency (load balancing and resource utilization).
  ToC:  Customizing testing workloads and network configurations, also leveraging cloud performance models for evaluating target metrics.
  ToiC: Simulating the real environment or exploiting geographical context information.

RQ3: methods, techniques, and tools
  TiC:  Simulating the real environment or exploiting geographical context information.
  ToC:  Mainly tools and techniques for cloud configuration setup, test execution, test generation, and performance monitoring.
  ToiC: Mainly tools and techniques for cloud configuration setup.

RQ4: result evaluation
  TiC:  Comparing mainly quality attributes but also business costs, through either a persistence layer or a live monitoring infrastructure.
  ToC:  Comparing mainly performance attributes but also business costs, often gathered as test reports.
  ToiC: Comparing performance attributes.

RQ5: research issues and future directions
  TiC:  Many novel approaches, methods, and infrastructures, especially for mobile or web applications, allowing various test configurations and conditions.
  ToC:  Many new issues and goals for moving existing model-based testing and debugging techniques to the cloud, and for scalability and capacity planning.
  ToiC: Quite heterogeneous issues and research directions depending on the specific nature of the study.

RQ6: application domains
  All:  Mostly mobile or web applications and SOA solutions, but also specific applications such as multi-tenant or data distribution infrastructures, network emulation, and gaming.

SUMMARY OF FINDINGS AND CHALLENGES
With the aim of providing cloud testing researchers with a compendium of the state-of-the-art emerging from the 147 primary studies, in this section we offer two summaries drawn from two different perspectives. In the next subsection, a one-page summary recaps in tabular form the answers to the six research questions that we presented in extended form in Sections 6.1, 6.2, and 6.3: Table 4 concisely recaps the main findings of the survey, classified along the six RQs. In Section 7.2, we instead summarize the main research challenges that we identified along the six areas of the classification framework proposed in Figure 1.

Research Challenges Along the Six Cloud Testing Areas
We now refer to the proposed framework (see Figure 1) for the classification of research in cloud testing to summarize the main challenges ahead as they emerge from the primary studies. We start by noticing that some of the challenges concern general aspects of the cloud testing problem that have a wide impact and thus recur across several areas. Such common challenges include:
(i) evolution, related to the continuous and rapid evolution of cloud technology. This transversally impacts all areas, in that testing activities must continuously face novel challenges (test perspective and test objective) and adapt to new environment constraints and conditions (test design, execution, evaluation), while the test domain itself evolves as well;
(ii) cost, descending from the large dimensions and high complexity of cloud systems and their many possible configurations. Such high cost heavily impacts the areas of test design, execution, and evaluation, and also affects test perspective;
(iii) lack of standards, clearly related to the newness of the cloud computing discipline. We find this challenge in studies concerning test execution and evaluation, as well as in those concerning portability across test domains;
(iv) elasticity, relative to testing the capability of provisioning and de-provisioning cloud resources. This challenge impacts test execution above all, and the area of test perspective as well, as it requires novel specific testing approaches;
(v) security, which is a crucial concern in cloud systems and entails even more difficulties than in traditional testing. This challenge clearly spans all areas, mainly impacting test objective and test evaluation.
In the following, we instantiate such cross-challenges within the relevant areas and also present more challenges that are specific to each area.

Challenges in Test Perspective.
Cloud testing is a novel field bringing several new specific concepts, issues, and technologies related to many different testing aspects, spanning resource management, performance evaluation, quality and risk assessment, computational infrastructure management, and maintenance. As said, cloud technology is very dynamic and evolves incessantly. Hence, a big challenge is that any proposed TiC solution needs to be continuously revised and adapted to this evolution. Additionally, the complexity of applications and infrastructures that are deployed and tested in the cloud keeps increasing. In our vision, this continuous modification, growth, and revision of the cloud ecosystem goes hand-in-hand with the continuous discovery of new challenges, perspectives, and issues in cloud testing.
One main ToC challenge due to this growing complexity is the capability of assessing the cloud application as a whole by means of end-to-end tests. As the applications under test grow, they tend to become more difficult to set up and configure, and the automation and maintenance of the proper setup to support end-to-end testing is a challenge for practitioners. Moreover, end-to-end testing of cloud applications is usually a time-consuming activity that yields high costs.
Elasticity is commonly identified as a core property provided by cloud-based systems and is one of the common challenges across areas. A potential ToC challenge in this respect is the use of proper workloads to evaluate the elasticity of an application deployed on a given cloud solution. This challenge is usually divided into different parts, namely workload generation (typically using a given pattern or algorithm), scheduling and execution of load tests, measurement/monitoring (assessing how elasticity actually behaves in the cloud application), and, finally, follow-up activities (quantification and improvement).
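As an illustration of the workload-generation part only (the pattern and parameters are invented, not drawn from any primary study), a load driver could sample a synthetic profile such as:

```python
import math

# A pattern generator that a load driver could sample once per minute to
# exercise scale-out and scale-in; shape and parameters are illustrative.
def load_pattern(minute: int, base: int = 50, amplitude: int = 40,
                 period: int = 30, spike_at: int = 45, spike: int = 200) -> int:
    """Requests per second: a sinusoidal rhythm plus one sudden spike."""
    level = base + amplitude * math.sin(2 * math.pi * minute / period)
    if minute == spike_at:
        level += spike            # abrupt burst to probe scale-out latency
    return max(0, round(level))

profile = [load_pattern(m) for m in range(60)]
print(profile[:10], "... spike:", profile[45])
```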

Challenges in Test Design.
Concerning the challenges of test design in the cloud, existing techniques do not properly take into account specific cloud environment features such as heterogeneity, scalability, load balancing, communication, frequent failures, and synchronization between distributed components. Parallel algorithms for test data generation, such as parallel concolic execution, graph search heuristics, or combinatorial solutions, can leverage efficient distributed computing architectures such as Apache Hadoop MapReduce to distribute the test generation tasks over a public cloud, mitigating both event-sequence explosion and data-value explosion. An important challenge in this respect is identifying an abstract representation of the evolving cloud environment. The adoption of model-driven engineering and generative programming techniques helps application developers to identify an abstract representation of the test scenario and to define the right combinations of configuration options for deployment and testing of their applications. However, well-defined test models and coverage criteria addressing the constraints of the different cloud technologies or providers are lacking.
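As a small-scale stand-in for such distribution (a local process pool instead of a MapReduce cluster; the configuration options and the SUT stub are invented), combinatorial test tasks can be fanned out as follows:

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

# The SUT invocation is a placeholder; in a real setting each task would
# drive the application deployed on a cloud node.
def run_case(config):
    browser, region, tier = config
    return (config, f"executed on {browser}/{region}/{tier}")

def generate_cases():
    # Cartesian product of configuration options: the source of the explosion.
    return product(["chrome", "firefox"], ["eu", "us"], ["small", "large"])

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:
        for config, outcome in pool.map(run_case, generate_cases()):
            print(outcome)
```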
Maximizing the effectiveness of different kinds of tests for cloud applications is usually identified as a challenge. Different techniques are employed to achieve this optimization, for example, test selection, prioritization, or reduction approaches. Another common research direction in this area is the use of metaheuristic search algorithms.
In cloud infrastructures, engineers usually integrate different SaaS and applications based on their provided APIs and connectivity protocols. This integration is challenging from a testing point of view due to the extra costs and difficulties that directly impact the design and implementation of the underlying tests.
In view of the dynamic and heterogeneous cloud environments, providing a rigorous test plan that can take into account the costs of using a cloud environment, from utilization periods through disassembly, remains a challenge. Public cloud providers have their own operating models and pricing mechanisms but offer very little interoperability when testers need to change vendors. Moreover, a good test plan should also consider associated hidden costs, such as the cost of encrypting data before moving testing to a cloud environment, as well as the cost of monitoring the utilization of cloud resources to prevent over-usage and over-payment. Another important aspect to be addressed in the test plan is the management of test data: appropriate security policies ruling the supply of confidential or production data to third parties should be adopted, and strategies for filtering or scrutinizing data before testing in the cloud should be foreseen.

Challenges in Test Execution.
According to test engineers' feedback, the construction of a test environment in the cloud is tedious, time-consuming, and still involves high costs and complexity. More attention should be given to making test execution in the cloud cost-efficient, also trying to reduce the costs of setting up the test environment on all the machines in the cloud. Indeed, improper sizes of the allocated virtual machines or unbalanced loads can result in low resource utilization or increased response time. Efficient strategies are also needed to execute complex test scenarios by leveraging the dynamic scalability and elasticity of the underlying computational resources. Important aspects to be further investigated are test decomposition policies, test allocation, and test scheduling methods, which are needed to decompose test jobs into multiple test tasks that can be executed concurrently to improve resource utilization and computation time.
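For illustration only (task names and durations are invented), a simple longest-processing-time-first heuristic shows how decomposed test tasks could be balanced over a fixed pool of VMs:

```python
import heapq

# Estimated seconds per decomposed test task; values are made up.
tasks = {"suite_a": 300, "suite_b": 120, "suite_c": 280,
         "suite_d": 90, "suite_e": 200}

def schedule(tasks: dict, vms: int):
    """Greedy LPT scheduling: always hand the next-longest task to the
    least-loaded VM."""
    heap = [(0, f"vm{i}", []) for i in range(vms)]   # (busy_time, name, plan)
    heapq.heapify(heap)
    for name, secs in sorted(tasks.items(), key=lambda t: -t[1]):
        busy, vm, plan = heapq.heappop(heap)          # least-loaded VM first
        heapq.heappush(heap, (busy + secs, vm, plan + [name]))
    return sorted(heap)

for busy, vm, plan in schedule(tasks, vms=2):
    print(vm, f"{busy}s", plan)     # two roughly balanced execution plans
```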
Concerning TiC, the cloud-based test environment configuration is still hard to realize: testers need to deal with combinations of various SaaS and applications according to the offered APIs and connectivity protocols. This task appears even more complex when legacy test software is migrated to the cloud, since traditional test configuration practices do not consider the heterogeneity and complexity of the cloud. The challenge in this direction is to investigate the development of a holistic testing framework as an integrated solution with a core TaaS infrastructure, enabling the ease of adding and scaling additional capabilities such as non-functional testing or test planning approaches. This testing framework could support the construction and deployment of on-demand virtual test labs in a TaaS infrastructure, enabling efficient test execution as well as resource and tool-license sharing. It would allow overcoming the limitations of those cloud providers that offer only a reduced set of configurations, technology, storage, networking, and bandwidth. The main obstacles to the realization of this framework remain the lack of standard interfaces and connections to test tools and third-party solutions, as well as the complexity of connectivity with other clouds.
Another important issue is the lack of automated facilities for dealing with test failures or detected bugs during large-scale test execution. Effective test execution solutions, as well as self-resettable and auto-recoverable test scripts, are needed to support and process test failures during automated test execution.
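A minimal sketch of such a self-resettable test step, assuming a hypothetical reset hook and invented retry parameters, is the following:

```python
import functools
import time

# Retry transient cloud failures with backoff and a reset hook, instead of
# aborting the whole large-scale run; parameters are illustrative only.
def recoverable(retries: int = 3, delay: float = 1.0, reset=lambda: None):
    def wrap(test_step):
        @functools.wraps(test_step)
        def runner(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return test_step(*args, **kwargs)
                except (TimeoutError, ConnectionError) as exc:
                    print(f"attempt {attempt} failed: {exc}; resetting")
                    reset()                      # e.g., reprovision the test VM
                    time.sleep(delay * attempt)  # linear backoff
            raise RuntimeError(f"{test_step.__name__} failed after {retries} tries")
        return runner
    return wrap

@recoverable(retries=2)
def flaky_step():
    print("running step")

flaky_step()
```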
Concerning ToC, multi-tenancy is a common aspect of cloud-based applications, and complex scenarios involving SaaS multi-tenancy remain an open challenge. Load balancing technologies aimed at decoupling client traffic from application services are typically used to prevent data loss and network outages.
Finally, a relevant aspect of cloud technologies is resource usage (e.g., CPU, memory, disk, or network) and its corresponding cost. A potential challenge in this domain, especially for ToiC, is the contention for the resources required to execute a given test suite on the different cloud providers.

Challenges in Test Objective.
The need to provide viable solutions meeting the testing needs of organizations and industries will push research toward more effective means to support practitioners during development and testing activities, and the interest in TiC solutions will increase. In this direction, new objectives will involve the decision-making processes and the management of cloud-based testing.
Other important aspects descend from the need to increase the confidence in the cloud system and its components, from its infrastructure to the hosted applications. In our vision, certification, consistency, assurance, and assessment of the cloud environment can become the future keywords for the test objective in the TiC context.
Security has become a hot research topic in the testing community. Testing security aspects is especially challenging when the system under test is deployed in the cloud. Several open questions can lead to further investigation in this domain: for instance, how can user privacy, or the privacy of business data hosted in cloud infrastructures, be assured and assessed?
Achieving higher scalability and performance of ToiC approaches is a common challenge in the current state-of-the-art of cloud testing. These two quality attributes are closely related to different challenges already presented, such as elasticity, multi-tenancy, or load balancing.

Challenges in Test Evaluation.
In non-functional testing (e.g., load, performance, or stress testing), factors such as network bandwidth or workload conditions that can affect the validation must be considered. Thus, to achieve meaningful test evaluation, it is important to be able to properly control and trace such influencing conditions. Even Service Level Agreements (SLAs), i.e., formal contracts that guarantee a negotiated QoS, are not always sufficient: although SLAs are not supposed to be violated, violations may occur and impact the outcome of a TiC session. In conclusion, more emphasis should be put on linking cloud test reports with information and metrics that could help testers.
Most primary studies provide support for evaluating canonical IaaS indicators such as CPU or memory usage, and for reporting summary information. An interesting direction for reporting capabilities in TiC is providing native support for customizable aggregation of the monitored indexes.
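As a toy illustration of such customizable aggregation (metric names, samples, and the percentile rule are invented), testers could plug their own summary functions into the report generation:

```python
import statistics

# Monitored indicator samples; data are invented.
samples = {"cpu_pct": [55, 61, 93, 58], "resp_ms": [120, 180, 950, 140]}

AGGREGATIONS = {   # user-supplied, per-report configuration
    "mean": statistics.mean,
    # crude nearest-rank 95th percentile, just for illustration
    "p95": lambda xs: sorted(xs)[max(0, round(0.95 * len(xs)) - 1)],
    "max": max,
}

report = {metric: {name: agg(values) for name, agg in AGGREGATIONS.items()}
          for metric, values in samples.items()}
print(report)
```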
The state of the practice in TiC reveals a large adoption of ad hoc solutions for measuring/certifying different quality attributes. In the long term, such a practice could result in some form of technical debt for customers relying on TiC (e.g., vendor lock-in). As an additional impact, the lack of standardized reporting approaches could also limit the possibility to move toward the creation of a concrete cloud brokerage ecosystem for TiC. More effort should be spent in promoting the application of well-known design patterns when structuring/scripting solutions specifically tailored for managing the results produced by a specific TaaS framework.
Regarding security, a well-known obstacle to the adoption of TiC concerns the upload of a SUT to third-party premises. From our perspective, it is important to remark that confidentiality and protection in both public and private clouds should extend to the whole set of testing artifacts. In this sense, technological or legal means (e.g., cryptography, obfuscation, or features enabling the "right to be forgotten") should cover not only those artifacts loaded in the cloud for execution, but also the whole set of historical reports resulting from their execution.
The data gathered during testing and monitoring can be used to learn correlations between the expected test behavior and the observed one. Such correlations can be used to find bottlenecks and defects in the system under test. Facilitating decision-making by means of different machine learning approaches based on test reports and metrics can lead to promising research for testing cloud-based applications.
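As a minimal sketch of this direction (historical data and the z-score threshold are invented), a baseline learned from past test reports can flag anomalous runs:

```python
import statistics

# Response times from past passing runs; values are invented.
history_ms = [132, 128, 141, 130, 127, 135, 138, 129]

mean = statistics.mean(history_ms)
stdev = statistics.stdev(history_ms)

def is_anomalous(observed_ms: float, threshold: float = 3.0) -> bool:
    """Flag a run that deviates from the learned baseline: a candidate
    bottleneck or regression worth investigation."""
    return abs(observed_ms - mean) / stdev > threshold

print(is_anomalous(133))   # False: within the learned baseline
print(is_anomalous(420))   # True: candidate defect or bottleneck
```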
Similar challenges related to test data analysis are likely to arise also in the ToiC arena. In this domain, an additional problem is the inherent complexity of the cloud testing testbed, and, as a result, the data volume can be much higher. All in all, existing big data technologies can be useful for data discovery, integration, and advanced analysis in the evaluation of test results.

Challenges in Test Domain.
When adopting cloud-testing solutions, the impact of costs is usually underestimated. The pay-as-you-go business model is often referred to, but without sufficient consideration. For example, to launch testing sessions on fresh environments allocated on the cloud, a staging process is required for uploading the code that will be remotely executed together with all its required dependencies; moreover, further computational resources are needed to properly configure the environment. There are often hidden testing costs (e.g., due to packages set up over the network) that increase with the usage of the monitoring and logging resources of the specific TiC solution. The consequence is that the level of detail requested in the analysis of the tests should correspond to a clear understanding of its related costs and of how such amounts of information can be properly consulted and extracted.
The rapid evolution of cloud technologies and the lack of standards also hamper the portability of specific application domains across cloud providers. This problem usually forces practitioners to create custom testing solutions, for instance following TaaS or BaaS (Benchmark as a Service) on-demand approaches.
Testing complex cloud scenarios in specific domains (e.g., mobile apps) usually involves huge efforts for provisioning the proper infrastructure and configuration required for tests. The challenges in this arena are two-fold. On the one hand, the lack of automation can lead to high maintenance and operation costs of the proper testing testbeds. On the other hand, the lack of open-source solutions can be a potential problem, especially for small- or medium-size projects aiming to develop and test specific applications (e.g., web, mobile).

CONCLUSIONS AND FUTURE RESEARCH DIRECTIONS
As emerged from our systematic review of the literature, in recent years much research has been devoted to testing in the cloud, testing of the cloud, and testing of the cloud in the cloud. Research in all three categories is continuously evolving, and new approaches, methods, and infrastructures keep being proposed, especially in the mobile systems, web application, and SOA contexts.
We have developed a classification framework that, by construction, reflects what have been the main areas of research in recent years; namely, test perspective, test design, test execution, test objective, test evaluation, and test domain. Within each of these areas, we have also identified those topics that drew the greatest interest and in which several solutions have been proposed.
In particular, test execution is the most actively investigated area: indeed, the cloud makes it affordable to develop and maintain otherwise costly test infrastructures and to leverage on-demand scalable resources for configuration (by using cloud virtualization) and performance (by means of cloud elasticity) testing.
The second-most investigated area is test objective, with performance, functional, security, reliability, and elasticity being, in this order, the most frequently covered attributes in the surveyed primary studies. Indeed, the flexibility, the efficiency, and the computational power of the cloud open new, interesting testing possibilities considered infeasible before: massive combinatorial testing, huge amounts of parallel executions, simulation and emulation, dynamic scheduling, allocation and adaptation of resources, as well as a more effective and efficient evaluation of quality attributes such as scalability, elasticity, reliability, security, and so on.
Considering the migration of testing to the cloud, not surprisingly, the domains in which this happens most often are mobile and web applications. Other domains that could certainly benefit from the cloud potential but have not yet done so in large measure are IoT and networking.
The field still lacks proper conceptualization: a minority of papers covered test perspectives, within which, paradigmatically, the most covered topics are by far open problems and issues. Topics such as terminology and technology are almost not considered: we believe the field badly needs a theoretical treatment, and we hope that this survey can provide good input for such types of studies.
Our survey also revealed that very important components of any testing activity, such as test monitoring, coverage measurement, and analytical techniques, useful for test evaluation, have received scant attention. Innovative testing infrastructures are needed that can support the assessment of cloud testing outcomes, possibly along different validation metrics.
While the great potential and the apparently unlimited resources disclosed by the cloud open the way to innovative and more effective solutions for all aspects of the testing activity, controlling and managing those resources during testing also gives rise to many new challenges. As a contribution to guiding future research in cloud testing, we have provided a taxonomy of the most relevant challenges that researchers could consider for future work.
From the results collected in this article, it seems clear that the future of software testing research will be more strictly intertwined with the progress of research and developments in cloud computing: the former providing approaches and methodologies for developing, validating, measuring and certifying applications, frameworks, tools, and infrastructures, the latter providing the resources and facilities to assess, simulate, or emulate real-world scenarios.
As an example of a promising research effort in this direction, we can refer to the H2020 European Project ElasTest [23], which has developed a comprehensive platform aimed at improving the efficiency and effectiveness of the testing process of large complex systems. The platform supports end-to-end testing in the cloud (TiC), addressing several of the challenges we summarize in the previous section. For instance, it supports elastic end-to-end testing (test perspective) and provides different testing services to adapt to different test scenarios. Among others, it provides a cost engine to estimate the costs of using a test environment (test design), an instrumentation manager that can induce controlled failures into the infrastructure to simulate real-world conditions (test execution), a security service to find vulnerabilities in the application (test objective), a big-data service to analyze test results (test evaluation), and emulation of Internet of Things devices, thus increasing portability and automation when testing IoT applications (test domain). On top of all services, ElasTest also provides specific visualization tools aimed at helping testers and developers in root-cause localization for those bugs found during the testing process.
ElasTest is just one example of how specific tools leveraging cloud technologies can be developed to ease ToC, TiC, and ToiC. We expect much more interesting research to appear in the coming years, disclosing the whole potential of the cloud to defeat testing barriers.
Table 5 reports the classification of each Primary Study by Area. In the header of the