On the Understandability of Temporal Properties Formalized in Linear Temporal Logic, Property Specification Patterns and Event Processing Language

Temporal properties are important in a wide variety of domains for different purposes. For example, they can be used to avoid architectural drift in software engineering or to support the regulatory compliance of business processes. In this work, we study the understandability of three major temporal property representations: (1) Linear Temporal Logic (LTL) is a formal and well-established logic that offers temporal operators to describe temporal properties; (2) Property Specification Patterns (PSP) are a collection of recurring temporal properties that abstract underlying formal and technical representations; (3) Event Processing Language (EPL) can be used for runtime monitoring of event streams using Complex Event Processing. We conducted two controlled experiments with 216 participants in total to study the understandability of those approaches using a completely randomized design with one alternative per experimental unit. We hypothesized that PSP, as a highly abstracting pattern language, is easier to understand than LTL and EPL, and that EPL, due to separation of concerns (as one or more queries can be used to explicitly define the truth value change that an observed event pattern causes), is easier to understand than LTL. We found evidence supporting our hypotheses which was statistically significant and reproducible.

Linear Temporal Logic (LTL) is a widely used and established language for the specification of temporal properties. It is a logic-based approach that supports not only logical but also temporal operators. Many existing model checkers leverage LTL as a specification language (cf. Cimatti et al. [16] for NuSMV, Blom et al. [17] for LTSmin, Holzmann [18] for SPIN). Originally developed for reasoning on infinite traces, LTL can also be applied for reasoning on finite traces (cf. De Giacomo & Vardi [19]). The LTL2NFA algorithm (cf. De Giacomo et al. [20]) describes the transformation of an arbitrary LTL formula to a non-deterministic finite automaton (NFA), which can be executed for runtime checking of LTL-based temporal properties.
The Property Specification Patterns (PSP) are a collection of recurring temporal patterns. The relevance of the patterns discovered by Dwyer et al. [14] was confirmed even 13 years after the original study took place by a survey by Bianculli et al. [3] based on 104 scientific case studies. Each pattern represents a specific intent with a mapping to underlying formal representations, most notably LTL and CTL (Computation Tree Logic; cf. Clarke et al. [21]). Many existing approaches reuse PSP or extend the original pattern catalog with more specific context-dependent patterns. Among them are the DecSerFlow language for declarative service descriptions (cf. van der Aalst & Pesic [22]), the declarative workflow approach Declare (cf. Pesic et al. [23]), the Compliance Request Language (abbrev. CRL; cf. Elgammal et al. [24]), and the PROPOLS approach for the verification of BPEL service composition schemes (cf. Yu et al. [25]).
Event Processing Language (EPL) can be used to encode specific event patterns in queries that cause the firing of event listeners once the pattern is observed in the event stream of a Complex Event Processing (CEP) environment (cf. Wu et al. [26]). EPL is part of the open source CEP engine Esper (see footnote 4). Numerous studies make use of EPL (cf. Awad et al. [27], Holmes et al. [28], Boubeta-Puig et al. [29], Kunz et al. [30], Adam et al. [31], Aniello et al. [32], to name but a few). EPL is well-suited as a representative for CEP query languages as it supports common CEP query language concepts, such as leads-to (sequence, followed-by) and every (each) operators, that are present in many CEP query languages and engines (e.g., Siddhi, see footnote 5, and TESLA [33]).

Problem Statement
Despite the long existence of many major temporal property specification approaches (e.g., Linear Temporal Logic was first proposed in 1977, and the Property Specification Patterns have existed since 1999), most researchers have focused on the formal and technical perspective of those approaches, whereas the usage point of view has received little empirical attention. Indeed, we are not aware of any existing work that provides an empirical study on the understandability of different representative temporal property specification approaches. Gaining more insights into the understandability of temporal property representations is crucial for evaluating their suitability for practical use and for finding potential ways to improve their understandability. LTL, PSP, and EPL are all powerful approaches for automated temporal property verification and validation, but very little is known about their understandability. Intuitively, we might hypothesize that the temporal pattern-based approach PSP is more understandable than the temporal logic-based approach LTL because the former abstracts the latter, but scientific evidence is required to back up such claims. In this article, we investigate this and similar hypotheses by applying suitable statistical methods to the gathered empirical data.
Addressing the existing empirical research gaps in this field is not only interesting from a purely scientific point of view but also important for industrial applications. For example, from the cooperation with our industry partners (see e.g., [34]), their customers, and other company representatives at conferences and workshops, we realized that industry has a huge demand for, and shows a strong interest in, temporal property specification approaches that are applicable in practice by supporting a comprehensible, fast, and accurate adoption of compliance requirements as well as their automated enactment and verification. All representative temporal property specification approaches that we study in this article are well-suited for automated computer-aided checking, but business process management (BPM) vendors are still often reluctant to expose their customers to such approaches. Our discussions with industry partners (see e.g., [35], [36]) indicate that uncertainty regarding how understandable temporal property representations are is among the reasons for this reluctance.
The application of temporal property specifications for supporting software architecture compliance in the software engineering and software architecture (SWE & SWA) domain faces a similar issue: Architecture descriptions and design decisions (cf. Medvidovic et al. [37], Zdun et al. [38]) must be documented in a comprehensible manner for different stakeholders in the software development process. Nowadays this is still often done in natural language, which cannot be directly used (i.e., without semiautomatic natural language processing; cf. Czepa et al. [39]) for automated software architecture compliance checking. By using a temporal property language for capturing architectural descriptions and decisions, we can directly leverage those architectural descriptions for automated architecture compliance checking.
Empirical research on temporal property understandability has the potential to influence practitioners in making the decision for adopting a specific existing temporal property language and in designing future industrial temporal property specification approaches. Consequently, one of the goals of this empirical study is to pave the way for industrial or practical exploitation of temporal property specification approaches.

Research Objectives
This empirical study has the objective to investigate the understandability of representative temporal property representations. The understandability construct focuses on how well (in terms of correct understanding) and fast (in terms of the response time) a participant understands a given temporal property representation.
We state the experimental goal using the Goal Question Metric (GQM) goal template (cf. Basili et al. [40]) as follows: Analyze the LTL, PSP, and EPL temporal property approaches for the purpose of their evaluation with respect to their understandability from the viewpoint of the novice and moderately advanced software architect, designer or developer in the context (environment) of the Distributed System Engineering Lab and the Advanced Software Engineering Lab courses at the Faculty of Computer Science of the University of Vienna.

Context
The study consists of two controlled experiments with 216 participants in total: The first run was carried out with 70 computer science students who enrolled in the course "Advanced Software Engineering Lab (ASE)" (mandatory part of the master in computer science curricula) at the University of Vienna in the winter term 2015/2016. The second run was carried out with 92 computer science students who enrolled in the course "Distributed System Engineering Lab (DSE)" (optional part of the bachelor and master in computer science curricula) at the University of Vienna and 54 computer science students who enrolled in the course "Advanced Software Engineering Lab" (mandatory part of the master in computer science curricula) at the University of Vienna in the winter term 2016/2017. Consequently, we can differentiate between DSE and ASE participants. While the former are used as proxies for novice to moderately advanced software architects, designers or developers, the latter are used as proxies for moderately advanced software architects, designers or developers. According to Kitchenham et al. [41], using students "is not a major issue as long as you are interested in evaluating the use of a technique by novice or nonexpert software engineers. Students are the next generation of software professionals and, so, are relatively close to the population of interest". Besides, a number of our students work while studying, and some even have some years of industry experience (cf. Electronic Appendix A.2.1, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TSE.2018.2859926). Several existing studies take it even a step further by suggesting that students can be representatives for professionals under certain circumstances (cf. Höst et al. [42], Runeson [43], Svahnberg et al. [44], and Salman et al. [45]).
4. http://www.espertech.com/esper
5. https://github.com/wso2/siddhi

Guidelines
This work follows and respects existing guidelines for conducting and reporting empirical research in software engineering: Jedlitschka et al. [46] propose guidelines and a structured approach for reporting experiments in software engineering, which had a strong influence on the general structure and contents of this article. Those guidelines integrate (among others) the "Preliminary guidelines for empirical research in software engineering" by Kitchenham et al. [41] and standard books on empirical software engineering (cf. Wohlin et al. [47], Juristo & Moreno [48]). Moreover, we considered and applied the "Robust Statistical Methods for Empirical Software Engineering" by Kitchenham et al. [49] for the statistical evaluation of the acquired data.

BACKGROUND ON TEMPORAL PROPERTY REPRESENTATIONS
In this section, we discuss the general properties of the temporal property representations that are the focus of this study. Readers already familiar with one (or more) of the discussed temporal property representations may consider skipping (parts of) this section.

Linear Temporal Logic (LTL)
Propositional logic is not expressive enough to describe the behavior of systems (i.e., the ordering of events in time), so the notion of temporal logic was introduced in 1977 (cf. Pnueli [13]). In particular, a logic called Linear Temporal Logic for reasoning over linear traces with the temporal operators $G$ (or $\Box$) for "globally" and $F$ (or $\Diamond$) for "finally" was proposed. Additional temporal operators are $U$ for "until", $W$ for "weak until", $R$ for "release", and $X$ (or $\bigcirc$) for "next". $G\,\psi$ (or $\Box\psi$) states that $\psi$ must be true at every point in time. $F\,\psi$ (or $\Diamond\psi$) states that $\psi$ must be true at some future point in time. $\psi\,U\,\varphi$ states that $\psi$ remains true at least until the point in time when $\varphi$ becomes true. $\psi\,R\,\varphi$ states that $\varphi$ remains true at least until and including the point in time when $\psi$ becomes true. $X\,\psi$ (or $\bigcirc\psi$) states that $\psi$ must be true at the next point in time. LTL formulas are composed of the aforementioned temporal operators, atomic propositions (the set $AP$), and the Boolean operators $\wedge$ (for "and"), $\vee$ (for "or"), $\neg$ (for "not"), and $\rightarrow$ (for "implies") (cf. Baier & Katoen [50]). The weak-until operator $\psi\,W\,\varphi$ is defined as $(G\,\psi) \vee (\psi\,U\,\varphi)$.
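An LTL formula is inductively defined as follows: For every $a \in AP$, $a$ is an LTL formula. If $\psi$ and $\varphi$ are LTL formulas, then so are $G\,\psi$ (or $\Box\psi$), $F\,\psi$ (or $\Diamond\psi$), $\psi\,U\,\varphi$, $\psi\,R\,\varphi$, $X\,\psi$ (or $\bigcirc\psi$), $\psi \wedge \varphi$, $\psi \vee \varphi$, and $\neg\psi$.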
The semantics of LTL over infinite traces is defined as follows: LTL formulas are interpreted over infinite words over the alphabet $2^{AP}$ (i.e., the alphabet consists of all possible propositional interpretations of the propositional symbols in $AP$). $\pi(i)$ denotes the state of the trace $\pi$ at time instant $i$. We write $\pi, i \models \psi$ to denote that a trace $\pi$ at time instant $i$ satisfies the LTL formula $\psi$.
In model checking, LTL formulas commonly have two possible truth value states, namely true (satisfied) and false (violated). When monitoring an LTL specification in a running system, it may be of interest not only whether a specification is satisfied or violated but also whether further state changes are possible that could resolve or cause a violation of the specification. That is, the state of a specification is either temporary (i.e., the state may change) or permanent (i.e., the state can no longer change). Consequently, to enable a more fine-grained analysis of the participants' understanding of LTL in the experiment, we employ the semantics of Runtime Verification Linear Temporal Logic (RV-LTL; cf. Bauer et al. [51]) that supports four truth value states. In particular, an LTL temporal property specification at runtime is either temporarily satisfied, temporarily violated, permanently satisfied, or permanently violated.
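For reference, the satisfaction relation $\pi, i \models \psi$ over infinite traces is commonly defined along the following lines (cf. Baier & Katoen [50]). This is a sketch of the standard textbook semantics, stated under the assumption that the release operator follows the usual convention in which the second operand must hold up to and including the point at which the first operand holds; the exact formulation used in a given tool or study may differ in presentation.

\begin{align*}
\pi, i \models a &\iff a \in \pi(i) \quad \text{for } a \in AP \\
\pi, i \models \neg\psi &\iff \pi, i \not\models \psi \\
\pi, i \models \psi \wedge \varphi &\iff \pi, i \models \psi \text{ and } \pi, i \models \varphi \\
\pi, i \models X\,\psi &\iff \pi, i+1 \models \psi \\
\pi, i \models F\,\psi &\iff \exists j \geq i : \pi, j \models \psi \\
\pi, i \models G\,\psi &\iff \forall j \geq i : \pi, j \models \psi \\
\pi, i \models \psi\,U\,\varphi &\iff \exists j \geq i : \bigl(\pi, j \models \varphi \text{ and } \forall k,\ i \leq k < j : \pi, k \models \psi\bigr) \\
\pi, i \models \psi\,R\,\varphi &\iff \forall j \geq i : \bigl(\pi, j \models \varphi \text{ or } \exists k,\ i \leq k < j : \pi, k \models \psi\bigr)
\end{align*}

Disjunction and implication follow from negation and conjunction in the usual way.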

Property Specification Patterns (PSP)
Having been inspired by software design patterns, Dwyer et al. proposed the Property Specification Patterns [14], a collection of recurring temporal properties in software engineering. For each pattern, there exist transformation rules to underlying formal representations (including LTL and CTL; see footnote 6). The patterns are categorized into Occurrence Patterns and Order Patterns as follows:
Occurrence Patterns:
- Absence: a never occurs
- Universality: a always occurs
- Existence: a occurs
- Bounded Existence: a occurs at most n times
Order Patterns:
- Precedence: a precedes b
- Response: a leads to b
- 2 Cause-1 Effect Precedence Chain: (a, b) precedes c
- 1 Cause-2 Effect Precedence Chain: a precedes (b, c)
- 2 Stimulus-1 Response Chain: (a, b) leads to c
- 1 Stimulus-2 Response Chain: a leads to (b, c)
Moreover, each pattern has a scope. Fig. 1 shows the available scopes and their area of effect:
- The global scope defines that a pattern must hold during the entire execution of a system. This scope is implicitly assumed when no other scope is defined.
- The before scope before s [ p ] defines that a pattern p must hold before the first occurrence of s.
- The after scope after s [ p ] defines that a pattern p must hold after the first occurrence of s.
- The between scope between s1 and s2 [ p ] defines that a pattern p must hold between every s1 (i.e., starting the scope) that is followed by s2 (i.e., closing the scope).
- The after-until scope after s1 until s2 [ p ] defines that a pattern p must hold after every s1 (i.e., starting the scope) until, at the latest, s2 (i.e., closing the scope).
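To make the patterns more tangible, the following minimal Python sketch evaluates four global-scope patterns on a finite trace and returns one of the four runtime truth values. It is an illustration only: the function names and the `Verdict` enumeration are hypothetical, and the verdicts assume the commonly cited global-scope LTL mappings from the pattern catalog (Absence as G ¬a, Existence as F a, Precedence as ¬b W a, Response as G(a → F b)) read on a finite prefix of a running system.

```python
from enum import Enum


class Verdict(Enum):
    """The four runtime truth values (hypothetical helper type)."""
    TEMP_SAT = "temporarily satisfied"
    TEMP_VIOL = "temporarily violated"
    PERM_SAT = "permanently satisfied"
    PERM_VIOL = "permanently violated"


def absence(trace, a):
    """Absence (global): a never occurs. One occurrence violates it for good."""
    return Verdict.PERM_VIOL if a in trace else Verdict.TEMP_SAT


def existence(trace, a):
    """Existence (global): a occurs. One occurrence satisfies it for good."""
    return Verdict.PERM_SAT if a in trace else Verdict.TEMP_VIOL


def precedence(trace, a, b):
    """Precedence (global): a precedes b. Decided by whichever occurs first."""
    for event in trace:
        if event == a:
            return Verdict.PERM_SAT
        if event == b:
            return Verdict.PERM_VIOL
    return Verdict.TEMP_SAT


def response(trace, a, b):
    """Response (global): a leads to b. Stays temporary on any finite trace."""
    pending = False  # True while some a still awaits a later b
    for event in trace:
        if event == a:
            pending = True
        elif event == b:
            pending = False
    return Verdict.TEMP_VIOL if pending else Verdict.TEMP_SAT
```

Traces are modeled as strings of capital letters, so `response("ACB", "A", "B")` evaluates the trace A, C, B. Scoped variants (before, after, between, after-until) would additionally need to track when a scope opens and closes, which this sketch leaves out.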

Event Processing Language (EPL)
In this section, we discuss the Event Processing Language (EPL; cf. EsperTech Inc. [15]) and how it can be applied for runtime monitoring of temporal properties. An EPL-based temporal property specification consists of an initial truth value (either temporarily satisfied or temporarily violated) and one or more query-listener pairs. A query-listener pair causes a truth value change of the temporal property as soon as a matching event pattern is observed in the event stream. Consequently, an EPL-based temporal property specification always consists of EPL queries, which are composed of EPL operators, and listeners, which set the state of the temporal property specification to a new truth value (temporarily satisfied, temporarily violated, permanently satisfied, or permanently violated) whenever the expression of the associated query is matched in the event stream. The semantics of those EPL operators is given as follows (cf. [15]):
- The and operator e1 and e2 is a logical conjunction that is matched once both e1 and e2 (in any order) have occurred.
- The or operator e1 or e2 is a logical disjunction that is matched once either e1 or e2 has occurred.
- The not operator not e is a logical negation that is matched if the expression e is not matched.
- The every operator every e observes not just the first occurrence of the expression e in the event stream but also each subsequent one.
- The leads-to operator e1 -> e2 specifies that first e1 must be observed and only then is e2 matched. Intuitively, the whole expression is matched once e1 is followed by e2, at the occurrence of e2.
- The until operator e1 until e2 matches the expression e1 until e2 occurs. In practice, this operator is commonly used in the expression not e1 until e2, which demands the absence of e1 before the occurrence of e2.
Further truth value changes are not possible once a permanent state (i.e., permanently violated or permanently satisfied) has been reached.
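The interplay of an initial truth value, query-listener pairs, and permanent states can be sketched in a few lines of Python. This is a simplified simulation and not the Esper API: the classes `EveryEvent`, `EveryFollowedBy`, and `Monitor` are hypothetical stand-ins that mimic only the patterns `every X` and `every (X -> Y)`, and the two example specifications are illustrative assumptions rather than the specifications used in the experiment.

```python
PERMANENT = {"permanently satisfied", "permanently violated"}


class EveryEvent:
    """Simplified stand-in for the pattern `every X`: matches at each X."""

    def __init__(self, name):
        self.name = name

    def observe(self, event):
        return event == self.name


class EveryFollowedBy:
    """Simplified stand-in for `every (X -> Y)`: matches at a Y that follows an X."""

    def __init__(self, first, second):
        self.first = first
        self.second = second
        self.armed = False  # True once `first` was seen and `second` is awaited

    def observe(self, event):
        if self.armed and event == self.second:
            self.armed = False  # start looking for the next `first`
            return True
        if event == self.first:
            self.armed = True
        return False


class Monitor:
    """An initial truth value plus (query, new truth value) pairs."""

    def __init__(self, initial, pairs):
        self.state = initial
        self.pairs = pairs

    def push(self, event):
        for query, new_state in self.pairs:
            matched = query.observe(event)  # always update the query's own state
            if matched and self.state not in PERMANENT:
                self.state = new_state  # the listener fires
        return self.state


def run(initial, pairs, trace):
    """Feed a trace (string of event letters) to a fresh monitor."""
    monitor = Monitor(initial, pairs)
    for event in trace:
        monitor.push(event)
    return monitor.state


def response_spec():
    """Illustrative spec for 'A leads to B' (assumed, not from the experiment)."""
    return "temporarily satisfied", [
        (EveryEvent("A"), "temporarily violated"),
        (EveryFollowedBy("A", "B"), "temporarily satisfied"),
    ]


def absence_spec():
    """Illustrative spec for 'A never occurs' (assumed, not from the experiment)."""
    return "temporarily satisfied", [(EveryEvent("A"), "permanently violated")]
```

For instance, `run(*response_spec(), "ABA")` ends in the state temporarily violated because the last A is still waiting for a B, whereas `run(*absence_spec(), "BAB")` stays permanently violated after the first A regardless of later events.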

Goals
This experiment has the goal of measuring the construct understandability of temporal property specifications expressed in different representations, namely Linear Temporal Logic, Property Specification Patterns, and Event Processing Language. The focus is on the correctness and response time of the answers given by the participants.

Experimental Units
All participants of the experiment are students of the Faculty of Computer Science at the University of Vienna, Austria, who enrolled in the courses "Distributed System Engineering Lab" and "Advanced Software Engineering Lab". We differentiate between two kinds of participants: Participants of DSE are used as proxies for novice to moderately advanced software architects, designers or developers. Participants of ASE are used as proxies for moderately advanced software architects, designers or developers. The first experiment run aims to evaluate the languages with moderately advanced software architects, designers or developers, whereas the second experiment run considers both novice to moderately advanced and moderately advanced software architects, designers or developers. Further differences between the two experiment runs concern the incentive for participation, the sampling strategy, and the setting. In the first experiment run, the experiment was carried out as a normal course assignment. Consequently, attendance was mandatory, and the submitted solutions were graded as an integral part of the course with up to 10 points (10 percent of the total course points). In the second experiment run, we changed to optional attendance that was rewarded by up to 10 bonus points. In both cases, the participants' performance in the experiment determined the achieved points, and the participants were randomly allocated to the treatments (i.e., the three temporal property representations).
6. http://patterns.projects.cs.ksu.edu/documentation/patterns.shtml

Experimental Material & Tasks
The temporal property specifications used in the tasks of this empirical study are based on recurring temporal property specification patterns (cf. Dwyer et al. [14] and Bianculli et al. [3]). Each task of the experiment consists of a temporal property definition and six combinations of an execution trace and a truth value. To optimize the execution of the experiment and to be independent from a specific application domain, the traces only consist of capital letters that represent surrogates of events (e.g., capital letter A could represent a task event "Apply for Loan started" in the BPM domain or a function/method invocation event in the SWA & SWE domain). For each combination the participant must evaluate whether it is correct or incorrect (i.e., whether the truth value is correct for the given trace). For example, Fig. 2a shows a task of the PSP group that is concerned with the Precedence pattern in the Between scope. In this task, only the choices b) and f) are correct. The same task is shown for the LTL group in Fig. 2b and for the EPL group in Fig. 2c. Obviously, the expression of the temporal property in each case is changed to the appropriate formalism. Furthermore, a different set of letters is used as a preventive measure against cheating (in addition to the seating arrangements).
The experiment document consisted of 10 tasks in the first experiment run. We reduced the number of tasks in the second experiment run to 9 because a relatively large number of participants could not complete the first experiment run in time. Another difference between the two experiment runs is the order of tasks and answer choices. In the first run, the order was randomized between the groups, whereas in the second run the order was the same for all groups. Randomization has the advantage that cheating is hampered, but it might introduce an unwanted variable to the experiment. For example, one group might have an easy first task, while another group has a hard one that hinders further progression and/or frustrates the participants. To avoid such unwanted effects, we kept the order unchanged in the second experiment run.
For the creation of the tasks of the experiment, we used an algorithm that generates traces and computes the correct truth value of a temporal property specification that corresponds to each trace automatically. This algorithm leverages both the LTL and EPL specifications used in this experiment. For checking a trace against an LTL specification, the LTL formula is transformed to a non-deterministic finite automaton (cf. De Giacomo & Vardi [19]). By executing the automaton and analyzing its accepting states, the truth value of the LTL formula can be determined. Moreover, EPL temporal property specifications are enacted in a CEP engine to evaluate their truth value. Using either LTL or EPL would suffice to create the tasks for the experiment. Nevertheless, we used both to double check the correctness of the temporal property representations. Please note that it is not possible to use PSP specifications directly for execution (i.e., they are an abstraction of formal languages such as LTL and EPL), so they cannot be used for automated task generation. After the automated generation, we manually checked each task for correctness.
A slightly adapted version of the algorithm was used for the second experiment run. For the first run, the truth value of an answer choice was randomly altered to another truth value to create both wrong and correct answer choices. That kind of alteration might affect the results of the EPL group because the EPL approach explicitly contains truth values in its specifications. That is, some answer choices can be ruled out by matching the truth value of an answer choice against the set of possible truth values in the EPL specification. As we will discuss later (in the evaluation of the experiments in Section 5.1), these answer choices apparently did not introduce a bias in favor of the EPL group; instead, they had a negative impact on the response times in the EPL group in the first experiment run. We eliminated that threat to validity in the second experiment run by limiting random alterations of truth values in the answer choices of all groups to the set of possible truth values of a specification.
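The difference between the two runs can be illustrated with a short Python sketch. The function and parameter names are hypothetical and the sketch is not the released generator; it only shows how restricting the pool of alternative truth values changes which wrong answer choices can be produced.

```python
import random

ALL_TRUTH_VALUES = [
    "temporarily satisfied",
    "temporarily violated",
    "permanently satisfied",
    "permanently violated",
]


def make_answer_choice(correct_value, pool, rng, alter):
    """Return (shown_value, is_correct) for one answer choice.

    `pool` is the set of truth values an altered choice may be drawn from:
    all four values (as in the first run) or only the values that the
    specification can actually reach (as in the second run).
    """
    if not alter:
        return correct_value, True
    candidates = [value for value in pool if value != correct_value]
    if not candidates:
        # Nothing to alter to; keep the choice correct (assumed fallback).
        return correct_value, True
    return rng.choice(candidates), False
```

With `pool=ALL_TRUTH_VALUES`, a specification that can only ever be temporarily satisfied or temporarily violated may still be paired with a permanent truth value, which is exactly the kind of choice an EPL participant could rule out at a glance. Passing only the reachable truth values as `pool` avoids such choices.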
The tasks of both controlled experiment runs are available online (cf. Czepa & Zdun [58]) to support a replication of the study. In addition, code was released as open source that supports the automated generation of experiment tasks.

Hypotheses, Parameters, and Variables
We hypothesized that PSP, as a highly abstract pattern language, is easier to understand than LTL and EPL, and that EPL, due to separation of concerns (as one or more queries can be used to explicitly define the truth value change that an observed event pattern causes), is easier to understand than LTL. Consequently, we formulated the following hypotheses for the two controlled experiment runs:
- $H_{0,1}$: There is no difference in terms of understandability between PSP and LTL.
- $H_{1,1}$: PSP has a higher level of understandability than LTL.
- $H_{0,2}$: There is no difference in terms of understandability between PSP and EPL.
- $H_{1,2}$: PSP has a higher level of understandability than EPL.
- $H_{0,3}$: There is no difference in terms of understandability between EPL and LTL.
- $H_{1,3}$: EPL has a higher level of understandability than LTL.
In both runs of this controlled experiment, there are two dependent variables, namely
- the correctness achieved in trying to mark the correct answers, and
- the response time, which is the time it took to complete the 10 tasks in the first experiment run or the 9 tasks in the second experiment run.
These two dependent variables are commonly used to measure the construct understandability (cf. Feigenspan et al. [59] and Hoisl et al. [60]). The independent variable (also called factor) has three treatments, namely the three temporal property representations (LTL, EPL, and PSP).

Experiment Design & Execution
We used a completely randomized design with one alternative per experimental unit, which is appropriate for the stated goal. Through this, we tried to avoid learning effects of the participants. Moreover, chances of selection bias are limited by using a computer-aided randomization for the assignment of participants to groups. The experiment is designed as a multiple-choice test for automated processing by the e-learning platform Moodle to avoid experimenter bias in the analysis of the answers submitted. For that reason, the participants marked their answers on an answer sheet that was scanned and evaluated automatically. In some cases, it was necessary to correct issues (e.g., imprecise markings) manually. To further limit the chances of experimenter bias, we used the four-eyes principle while performing any such manual actions.
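A computer-aided assignment of participants to the three treatments could look like the following Python sketch. The source only states that the randomization was computer-aided; the shuffle-then-round-robin scheme, the roughly balanced group sizes, and the `seed` parameter are assumptions made for illustration.

```python
import random


def assign_groups(participants, treatments=("LTL", "PSP", "EPL"), seed=None):
    """Randomly allocate each participant to exactly one treatment.

    Participants are shuffled and then dealt out in round-robin fashion,
    so group sizes differ by at most one (an assumption of this sketch).
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {
        participant: treatments[index % len(treatments)]
        for index, participant in enumerate(shuffled)
    }
```

Because every participant receives exactly one treatment, this corresponds to a design with one alternative per experimental unit. Fixing `seed` makes an allocation reproducible for documentation purposes.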
Two weeks before each experiment run, we handed out preparation material to the participants. This material consists of two documents: a document that provided a general introduction to the temporal property language and slides that represent a kind of quick reference guide with the important aspects of the temporal property representations and further examples. The participants were allowed to use the preparation material also during the experiment session.
The preparation material is based on (informal) natural language descriptions of the approaches and practical examples of application. There are two main reasons for this design of the preparation material: First, we needed to ensure that all three languages are presented by the same educational methods at a comparable level of detail to not introduce unnecessary bias into our experiment. Second, we tried to present the approaches in an approachable manner to the participants as suggested by numerous existing research on teaching undergraduate students in theoretical computer science, formal methods, and logic (cf. Habiballa  Please note that the tasks used in the experiment were randomly generated and not taken from the learning material. However, there were similarities between the temporal properties used in some of the experiment tasks and those used in the examples discussed in the learning material, but we could not find any indication of bias introduced by these similarities in the gathered data. In particular, the number of possibly affected experiment tasks was almost balanced between the groups, and the measured correctness of possibly affected tasks was overall similar to those of the remaining tasks (cf. Electronic Appendix D.1, available in the online supplemental material).
Since the first experiment run also involved two qualitative questions regarding all temporal property representations, we made the decision to provide the preparation materials of all three temporal property representations to every participant. That is, the participants studied all temporal property languages and were unaware of the group to which they had been assigned until the start of the experiment session. However, having knowledge of all the representations could have introduced bias. For example, learning one representation could lead to a better understanding of another one, or the languages could have been mixed up unintentionally. As a result, we handed out preparation material for each group individually in the second experiment run.

Procedure
The first experiment run had a duration of 90 minutes for working on the 10 tasks plus an additional 10 minutes for answering the two qualitative questions. The second experiment run had a total duration of 90 minutes for working on the 9 given tasks. No qualitative questions were asked in the second run. Seating arrangements were made to limit opportunity for misbehavior (i.e., cheating). At the beginning of each experiment run, the experiment material was handed out in form of printed documents. Furthermore, we provided copies of the preparation materials for those participants who did not bring their own. Next, the participants were informed about the procedure of the experiment. This involves time tracking and how to mark answers correctly in the answer sheet for automatic processing. Following this, the participant had to fill out a general question sheet by which we gathered information about the previous knowledge and experience of the participants. Next, the main part of the experiment started, in which the participants tried to solve the tasks of the experiment. The experiment runs were carried out following this plan without known deviations. Table 1 contains the number of observations, central tendency measures and dispersion measures of the dependent variables (correctness and response time) per temporal property representation and experiment run. The second experiment run consists of measurements in two courses, namely DSE and ASE (cf. Section 3.2). That is, we tested our hypotheses three times, namely in the first experiment run in ASE, and in the second experiment run in DSE and ASE. In all three cases, the PSP group reached the highest mean and median correctness (about 70-75 percent), followed by the EPL group (about 50-55 percent correctness) and the LTL group (about 30-35 percent correctness). The maximum measured response time in the first run is the 90 minutes limit in all groups. 
In response to this, we reduced the number of tasks in the second run by one (from 10 to 9). In the second run, the maximum response time is 88 minutes. Interestingly, students in the second run in ASE managed to finish on the average about 20-40 percent faster than their colleagues in the first run which cannot be caused by the removal of a single task alone as the expected response time reduction would be only about 10 percent. We suspect that this difference is caused by the change from total experiment time recordings in the first experiment run to per task time recordings in the second experiment run, and the late assignment of participants to groups at the beginning of the experiment session in the first run. Obviously, the time recordings of the participants in the first experiment run included times such as pauses, task switching times, and times spent on consulting the accompanying documents that are not directly related to solving a specific task. In the first experiment run the participants had to be prepared for all three representations, and the experiment group was assigned at the beginning at the experiment session. Up to this point in time, the participants did not know to which experiment group they were assigned to. That is, once it became clear which of the three approaches must be applied, the participants revisited the learning material related to the assigned representation intensely. In the second experiment run, group assignment was clear beforehand, so this initial consulting of the info material did not take place in a comparable intensity. Furthermore, the mean (72.12 minutes) and median response times (78.5 minutes) of the EPL group are longer than those of the LTL group (69.85 minutes mean and 73 minutes median) in the first run. 
With regard to the hypotheses of this experiment, the response time measurements in the first experiment run are an unexpected result since we expected that the response times in the EPL group would be shorter than in the LTL group. In contrast, the EPL group has a shorter response time than the LTL group in the second run. We suspect that this effect could have been caused by the task design, which contained truth value states in the answer choices that are not part of the EPL temporal property definition. Originally (i.e., at the time the first run was completed, and before the second run was carried out), we thought that there might have been a bias present in the first experiment run in favor of the EPL group, because wrong answer choices could potentially have been easier to identify for the EPL participants. However, these answer choices seemingly confused the participants rather than helped them. During the first experiment run, EPL participants repeatedly asked whether there was an error in the exercise or whether it could really be that easy to solve. Due to their confusion, EPL participants spent considerably more time on solving the tasks in the first experiment run. For more detailed descriptive statistics of the dependent variables, we refer the interested reader to Electronic Appendix A.2, available in the online supplemental material. After a thorough evaluation of model assumptions (cf. Electronic Appendix B, available in the online supplemental material), we decided to use Cliff's delta (cf. Cliff [66] and Rogmann [67]), a robust non-parametric test that is unaffected by changes in distribution, non-normal data, and possibly unstable variance. The results of the test are shown in Table 2 for the first experiment run and Table 3 for the second experiment run. We consider False Discovery Rate (FDR) adjusted p-values (cf. Benjamini & Hochberg [68]) due to multiple testing. 
According to these FDR adjusted p-values, there is evidence for the rejection of the null hypotheses of this study (cf. Section 3.4).
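For readers unfamiliar with the two statistical building blocks named above, the following minimal Python sketch shows how Cliff's delta and Benjamini-Hochberg adjusted p-values can be computed. It is an illustration under stated assumptions and not the analysis script of this study: the function names are hypothetical, the magnitude thresholds (0.147, 0.33, 0.474) are commonly cited values that the study may or may not have used, and the computation of the raw p-values themselves is omitted.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y), estimated over all pairs (x, y)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def magnitude(delta):
    """Map |delta| to a label. Thresholds are an assumption (commonly cited values)."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"


def benjamini_hochberg(p_values):
    """Return FDR-adjusted p-values (Benjamini-Hochberg) in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the result.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical correctness scores (percent), NOT data from the study.
    group_a = [70, 75, 80, 65, 72]
    group_b = [30, 35, 40, 33, 50]
    delta = cliffs_delta(group_a, group_b)
    print(delta, magnitude(delta))
    print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

Because Cliff's delta only compares the ordering of observations across two groups, it does not depend on normality or equal variances, which matches the motivation given above for choosing it.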

ANALYSIS
In the first experiment run (cf. Table 2), almost all test results are significant, which suggests a rejection of $H_{0,1}$ and $H_{0,2}$. $H_{0,3}$ can only be rejected on the basis of the correctness variable since the test result does not indicate any significant difference in the response times of the EPL and LTL groups. Moreover, the results suggest that the difference in terms of correctness between the PSP and LTL groups is highly significant with a large effect size magnitude. In the second experiment run (cf. Table 3), the majority of the test results are significant. Only one test, namely the PSP/EPL response time with ASE participants, has no significant result, which means that $H_{0,2}$ (in ASE) can only be rejected on the basis of the correctness result. All other test results range from significant ($\alpha = 0.05$) to highly significant ($\alpha = 0.001$), which suggests a rejection of the null hypotheses. Moreover, all significant results show a large or medium effect size magnitude. It is striking that all PSP/LTL test results are highly significant with a large-sized effect.

Evaluation of Results and Implications
Most results of this study are in accordance with our initial expectations, but there are some deviations that must be discussed further. In the first experiment run, $H_{0,3}$ cannot be rejected for the response time variable. We suspect that this effect could be related to the experimental tasks of the first experiment run that offered answer choices with truth value states that are not part of the EPL temporal property specification. Apparently, these answer choices caused confusion that resulted in longer response times. To avoid potential bias, the answer choices in the second experiment run included only truth value states that are mentioned in the EPL temporal property specification. In the second experiment run, $H_{0,2}$ cannot be rejected for the response time variable. In this case, we could not find any plausible interpretation other than the sample size of ASE students in the second experiment run. With 50 participants, the sample size is borderline, and we cannot rule out distorting effects. Nevertheless, aside from that, the statistical inference shows significant results with medium to large effect size magnitudes. Consequently, the controlled experiment runs of this study clearly indicate that PSP specifications provide a higher level of understandability than LTL specifications, PSP specifications provide a higher level of understandability than EPL specifications, and EPL specifications provide a higher level of understandability than LTL specifications. When it comes to the personal preference of the participants (cf. Electronic Appendix C, available in the online supplemental material), PSP seems to be the most preferred temporal property representation. This result is in accordance with the outcome of the controlled experiment runs as well. 
In contrast, the personal preference ranking of the EPL and LTL representations does not seem to match the results of the controlled experiment runs, since the EPL representation seems to be less popular among the participants than LTL. However, the survey on which the ranking is based must be interpreted with caution, because the sample size might not be large enough to draw valid conclusions on the basis of the data. Please note that we intentionally did not replicate the survey in the second experiment run, in order to improve the validity of the controlled experiment in the second run (cf. Section 5.2). Moreover, the constructs "personal preference" and "understandability" might be inherently different and incomparable. In either case, this peculiarity is important to report, and it might be a starting point for further investigations in future empirical studies.
Both in terms of understandability and the personal preference of the participants, the PSP representation outperformed the other two approaches examined. The pattern-based, high-level nature of the approach seems to make it highly appealing as a temporal property representation. However, a major limitation of the approach is its inflexibility in the case where the set of available patterns does not fit the purpose. In such a case, the pattern set must be extended, i.e., the creation of underlying low-level temporal property representations is required. Both EPL and LTL are lower-level temporal property representations that can be used either as underlying temporal property representations for PSP, or to directly create temporal property specifications for automated verification. EPL supports runtime monitoring, whereas LTL can be used for both runtime monitoring by non-deterministic finite automata (cf. De Giacomo et al. [20]), and design time verification by model checking (cf. Cimatti et al. [16], Blom et al. [17], Holzmann [18]). If a temporal property representation is solely used for runtime monitoring, the study would, based on the measured understandability, imply a preference for EPL over LTL. Another scenario is conceivable as well: During the creation of new PSP patterns, easier to understand EPL temporal property specifications can be used as plausibility specifications for harder to understand LTL formulas to countercheck whether a created LTL formula contains errors (cf. Czepa et al. [69]). However, an obstacle could be the possibly low user acceptance of EPL (cf. Electronic Appendix C, available in the online supplemental material), which must be further investigated.
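To make concrete what it means for a pattern to abstract an underlying low-level representation, the following block lists the commonly stated LTL mappings of a few patterns with global scope from the catalog by Dwyer et al. [14]. This is an illustration only; which patterns and scopes were used in the experimental tasks is not stated in this section, and $P$ and $S$ are placeholder propositions.

```latex
\begin{align*}
\text{Absence of } P            &:\quad \mathbf{G}\,\neg P \\
\text{Existence of } P          &:\quad \mathbf{F}\,P \\
\text{Universality of } P       &:\quad \mathbf{G}\,P \\
S \text{ responds to } P        &:\quad \mathbf{G}\,(P \rightarrow \mathbf{F}\,S) \\
S \text{ precedes } P           &:\quad \neg P \;\mathbf{W}\; S
\end{align*}
```

Here $\mathbf{G}$ (globally), $\mathbf{F}$ (finally), and $\mathbf{W}$ (weak until) are the usual temporal operators. Extending the pattern set means that someone has to author and validate a formula of this kind for each new pattern.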

Threats to Validity
All known threats that might have an impact on the validity of the results are discussed in Electronic Appendix D, available in the online supplemental material.

RELATED WORK
To the best of our knowledge, there are no existing empirical studies that investigate the differences in understandability of representative temporal property languages in a similar way and depth as the presented study does. However, there exist related empirical studies that evaluate representations of properties/models in software engineering. This section focuses on those studies.
The first study we would like to present in the field of software architecture and engineering is indirectly related to temporal property specifications as it focuses on architecture descriptions in general. Heijstek et al. [70] try to find out whether there are differences in understanding of textual and graphical software architecture descriptions in a controlled experiment with 47 participants. Interestingly, participants who used textual architecture descriptions performed significantly better, which suggests that textual architectural descriptions could be superior to their graphical counterparts.
An eye-tracking experiment carried out by Sharafi et al. [71] with 28 participants investigates the understandability of graphical and textual software requirement models. They observed no statistically significant difference in terms of correctness of the two approaches, but the response times of participants working with the graphical representations were slower.
Czepa et al. [39] compared the understandability of three languages for behavioral software architecture compliance checking, namely the Natural Language Constraint language (NLC), the Cause-Effect Constraint language (CEC), and the Temporal Logic Pattern-based Constraint language (TLC), in a controlled experiment with 190 participants. The NLC language is simply using the English language for software architecture descriptions. CEC is a high-level structured architectural description language that abstracts EPL and enables nesting of cause parts, that observe an event stream for a specific event pattern, and effect parts, that can contain further cause-effect structures and truth value change commands. TLC is a high-level structured architectural description language that abstracts temporal patterns (such as the Property Specification Patterns by Dwyer et al. [14]). Interestingly, the statistical inference of this study suggests that there is no difference in understandability of the tested languages. This could indicate that the high-level abstractions employed bring those structured languages closer to the understandability of unstructured natural language architecture descriptions. Moreover, it might also suggest that natural language leaves more room for ambiguity, which is detrimental for its understanding. Overall, the understandability of all three approaches is at a high level. However, the results must be interpreted with caution. Potential limitations of that study are that its tasks are based on common architectural patterns/styles (i.e., a participant possibly recognizes the meaning of a constraint more easily by having knowledge of the related architectural pattern) and the rather small set of involved behavioral constraint patterns (i.e., only very few behavioral constraint patterns were necessary to represent the architecture descriptions). 
In contrast, the controlled experiment runs presented in this article do not focus on software architecture compliance. Instead, we try to be independent from specific areas of application to evaluate the temporal property representations in a more general context. While the software architecture compliance constraints in that study wrap only very few patterns in high-level structured languages, the empirical study presented in this article is based on a larger, representative set of temporal property patterns, and focuses on the formalisms' core features instead of high-level, domain-specific abstractions of them.
Hoisl et al. [60] conducted a controlled experiment on three notations for defining scenario-based model tests with 20 participants. In particular, they tested a semi-structured natural language scenario notation, a diagrammatic scenario notation, and a fully-structured textual scenario notation. The authors conclude that the semi-structured natural language scenario notation is recommended for scenario-based model tests, because the participants of this group were able to solve the given tasks faster and more correctly. However, the validity of the experiment is strongly limited by the small sample size and the lack of statistical hypothesis testing.

Summary
This article reports two controlled experiments on the understandability of temporal property representations with 216 participants in total (70 in the first run and 146 in the second run). The results of the statistical evaluation suggest that PSP-based temporal property specifications are significantly easier to understand than both EPL temporal property specifications, which are based on Complex Event Processing, and Linear Temporal Logic temporal property specifications. Moreover, the results imply that EPL temporal property specifications are significantly easier to understand than LTL temporal property specifications. Despite the threats to validity listed in Electronic Appendix D, available in the online supplemental material, we consider the validity of our results high because of the repetition and replication by a second experiment run with two different populations, the overall large sample size, the automated generation of the tasks, the automated evaluation of the given answers, and the thorough statistical evaluation.

Impact
This study seems to support the original assumption that the pattern-based PSP approach is the most user-friendly temporal property representation for novice and moderately advanced users. Therefore, if possible (i.e., if the approach is applicable to the domain), the results suggest that the pattern-based temporal property approach should be preferred. Since many existing approaches (e.g., the Compliance Request Language CRL by Elgammal et al. [24] and the PROPOLS approach for the verification of BPEL service composition schemes by Yu et al. [25]) reuse PSP or extend the original pattern catalog (cf. Dwyer et al. [14]) with more specific context-dependent patterns, there is strong evidence that the results of the study hold for these approaches as well. However, in contrast to the two other temporal property approaches tested in this study, the pattern-based approach is the most limited one in terms of its expressiveness. That is, if the set of supported patterns is incompatible with a specific requirement (e.g., a company internal policy that must be covered by the IT system), it is necessary to extend the pattern catalog. Since the pattern-based approach merely abstracts other temporal property representations (most often LTL formulas), creating new patterns always requires the creation of the underlying temporal property specifications as well. Creating those underlying temporal property specifications is considered to be difficult and error-prone. Plausibility checking (cf. Czepa et al. [69]) tries to reduce the risk of creating incorrect LTL specifications by leveraging EPL specifications to countercheck whether the LTL formula contains errors. Since EPL temporal property specifications are more understandable than LTL formulas, the results of the presented study can be seen as an empirical evaluation of the plausibility checking approach as well.
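The idea of counterchecking an LTL formula against an easier-to-read, event-driven specification can be sketched in a few lines of Python. The sketch below is an assumption-laden illustration, not the plausibility checking approach of Czepa et al. [69] and not actual EPL: the event names, the function names, and the bounded exhaustive comparison are all hypothetical, and the EPL side is merely emulated by a loop in which each observed event explicitly sets the truth value. The property used is the response formula G(p -> F q), interpreted on finite traces.

```python
from itertools import product

# Hypothetical event alphabet; not taken from the tasks of the study.
EVENTS = ("p", "q", "other")


def ltl_response_holds(trace):
    """Finite-trace reading of G(p -> F q): every p is eventually followed by a q."""
    return all(
        any(later == "q" for later in trace[i:])
        for i, event in enumerate(trace)
        if event == "p"
    )


def monitor_response_holds(trace):
    """Event-driven emulation: each observed event explicitly changes the truth value."""
    satisfied = True  # nothing is pending before any event is observed
    for event in trace:
        if event == "p":
            satisfied = False  # a p opens an obligation
        elif event == "q":
            satisfied = True  # a q discharges all pending obligations
    return satisfied


def find_counterexample(formula, monitor, max_len=6):
    """Return the first trace (shortest first) on which both specifications disagree."""
    for length in range(max_len + 1):
        for trace in product(EVENTS, repeat=length):
            if formula(trace) != monitor(trace):
                return trace
    return None


if __name__ == "__main__":
    # The intended formula agrees with the monitor on all traces up to the bound.
    print(find_counterexample(ltl_response_holds, monitor_response_holds))
    # A faulty formalization ("F q" instead of "G(p -> F q)") is exposed by a trace.
    print(find_counterexample(lambda trace: "q" in trace, monitor_response_holds))
```

A disagreement on any trace signals that at least one of the two specifications is wrong. Agreement up to a bounded trace length increases confidence but is not a proof of equivalence.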

Future Work
The presented study focuses on the understandability of already given temporal properties. That is, the authoring of temporal property specifications is not yet sufficiently covered. It is possible to further investigate the understandability of temporal property languages by running different kinds of experiments. In particular, we plan to study the understandability of temporal property representations during the authoring process as well. We suspect that creating correct temporal property specifications from scratch is more difficult than interpreting already given temporal property specifications correctly. Moreover, we are curious whether the measured significant differences in understandability of the three temporal property representations are also present during the creation process of temporal property specifications. Another interesting opportunity for future work is studying the understandability of temporal property specifications with professionals working in the industry (e.g., senior system administrators and senior software architects). Studying whether there exist differences in understandability between textual and graphical temporal property representation is another interesting opportunity for future work. In particular, it would be interesting to find out whether the results of the studies by Heijstek et al. [70] and Sharafi et al. [71], that investigated the differences in understandability of textual and graphical models in the software architecture and engineering domain with results in favor of the textual approaches, are transferable to temporal property specifications. In this context, it might be interesting as well to compare textual LTL representations against the graphical NFA representations since NFAs are often the transformation product of LTL formulas.