How Understandable Are Pattern-based Behavioral Constraints for Novice Software Designers?

This article reports a controlled experiment with 116 participants on the understandability of representative graphical and textual pattern-based behavioral constraint representations from the viewpoint of novice software designers. In particular, the graphical and textual behavioral constraint patterns of the declarative business process language Declare and textual behavioral constraints based on the Property Specification Patterns are the subjects of this study. In addition to measuring the understandability construct, this study assesses subjective aspects such as perceived difficulties regarding learning and applying the tested approaches. An interesting finding of this study is the overall low correctness achieved in the experimental tasks, which seems to indicate that pattern-based behavioral constraint representations are hard to understand for novice software designers in the absence of additional supportive measures. The descriptive statistics regarding achieved correctness are slightly in favor of the textual representations, but the inferential statistics do not indicate any significant differences in terms of understandability between graphical and textual behavioral constraint representations.


INTRODUCTION
Since the early days of computer science, supporting the correctness of computer programs has been a recurring research interest. In 1977, Pnueli introduced an approach for the verification of sequential and parallel programs that is based on temporal reasoning [65]. His approach became widely popular under the term Linear Temporal Logic (LTL). A plethora of different temporal logics have been proposed since then. For example, in 1988, Clarke and Emerson [10] applied the Computation Tree Logic (CTL), a branching-time logic, for model checking of computer programs. Based on such temporal logics, Dwyer et al. [23,24] proposed the Property Specification Patterns (PSP), a collection of recurring behavioral constraint patterns. Pattern-based behavioral constraints can be used to shield the user from the complexity of the formal temporal logics used in the context of formal verification methods such as model checking (cf. Rozier [72] and Baier and Katoen [2]) and runtime monitoring by nondeterministic finite automata (cf. De Giacomo et al. [19,20]). Many textual and graphical pattern-based behavioral constraint approaches that originated from PSP exist (e.g., References [1,25,80,86]).

However, current studies predominantly focus on technical contributions in specific application areas. Only a few studies focus on empirical evaluations of behavioral constraint representations [14,21,34,64,82,87], and even fewer of them are concerned with comparing graphical and textual behavioral constraint representations specifically [33,52]. Interestingly, the body of existing studies (cf. Section 8, which discusses those related works in depth) yields contradictory results, which indicates that the understandability of pattern-based graphical and textual behavioral constraints is not yet well understood. Two prior empirical studies (both reported in Reference [16]) indicated that the pattern-based PSP representation provides a high level of understandability (about 70% on average in the specific setup of those studies), but these studies did not consider graphical pattern-based behavioral constraints.

Our experience from multiple industry projects is that industry experts in areas such as business process management tend to prefer graphical over textual constraint representations when given the choice, so it is important to test whether this preference can be empirically confirmed. Also, non-expert users seem to prefer graphical models over structured text and textual descriptions when the goal is to understand a process (cf. Figl and Recker [30]). It is yet unknown whether there are differences in understandability between graphical and textual pattern-based behavioral constraint approaches. In addition, it is unknown whether there exist problematic language elements that pose an obstacle to the correct understanding of textual and graphical pattern-based behavioral constraint representations. The discovery of such problematic elements could provide a starting point for improving the comprehensibility of the representations and making them more applicable in practice.
Studying the understandability of graphical and textual behavioral constraints is not only interesting from a purely scientific point of view; it is also important for industrial applications. For example, from the cooperation with our industry partners (see, e.g., Reference [79]), their customers, and other company representatives at conferences and workshops, we realized that the industry has a huge demand for, and shows a strong interest in, behavioral constraint approaches that are applicable in practice by supporting the comprehensible, fast, and accurate adoption of compliance requirements, as well as their automated enactment and verification. The pattern-based behavioral constraint representations that we study in this article are well suited for automated computer-aided verification at runtime and design time, but vendors are still often reluctant to expose their customers to such approaches. Our discussions with industry partners (see, e.g., References [77,78]) indicate uncertainty regarding how understandable the constraints are, which could be among the reasons for this reluctance.
In addition to triggering further empirical evaluations and thus new insights, empirical research on behavioral constraints has the potential to influence practitioners' decision-making when adopting a specific behavioral constraint language and when designing future industrial solutions. Consequently, a far-reaching goal of our research on behavioral constraint representations is to pave the way for their future industrial or practical exploitation.

Research Objectives
The objective of this empirical study is to investigate the understandability of representative graphical and textual behavioral constraint representations. The understandability construct focuses on how well (in terms of correct understanding) and how fast (in terms of response time) a participant understands a given behavioral constraint representation. In particular, this study considers the Property Specification Patterns, which are the origin of numerous existing behavioral constraint approaches (cf. Section 1), and the Declare approach, which seems to be the most popular graphical behavioral constraint approach in the field of business process management (cf. Goedertier et al. [32], Schonenberg et al. [73], and van der Aalst et al. [81]). We are not aware of any other graphical behavioral constraint language of similar relevance. Originally, Declare was proposed in the domain of business process management (cf. Pešić and van der Aalst [62]) and also applied in service-oriented computing (cf. van der Aalst and Pešić [80]), but there seem to be no limiting factors for the application of Declare in other domains. Its graphical pattern-based representation is versatile and transformable to underlying formal representations (e.g., LTL [54] and event calculus [55]) for verification at design time (i.e., model checking) and runtime verification in general. Declare is considered in three variants, namely as a purely graphical, a purely textual, and a hybrid (mixed graphical/textual) behavioral constraint approach.
We state the experimental goal using the GQM (Goal Question Metric) goal template (cf. Basili et al. [3]) as follows: Analyze the textual PSP-based representation approach, the purely graphical Declare representation approach (DG), the purely textual Declare representation approach (DT), and the hybrid (i.e., showing a textual label in addition to the graphical relation) Declare representation approach (DGT) for the purpose of their evaluation with respect to their understandability from the viewpoint of the novice software designer in the context (i.e., environment) of the Distributed System Engineering and the Software Engineering 2 courses at the University of Vienna, Austria.

Guidelines
Jedlitschka et al. [42] propose guidelines for reporting experiments, which had a strong influence on the general structure of this article. Those guidelines integrate (among others) the "Preliminary guidelines for empirical research in software engineering" by Kitchenham et al. [46] and standard books on empirical software engineering (cf. Wohlin et al. [84], Juristo and Moreno [44]). Moreover, the "Robust Statistical Methods for Empirical Software Engineering" by Kitchenham et al. [45] had a strong impact on the statistical methods used for the evaluation of the gathered data.

BACKGROUND ON PATTERN-BASED BEHAVIORAL CONSTRAINT REPRESENTATIONS

Property Specification Patterns
Dwyer et al. [23,24] proposed the PSP, a collection of recurring behavioral constraint patterns. Since the patterns cannot be directly used for formal verification, there exist transformations to underlying formal representations (among them LTL [65] and CTL [10] formulas) that can be found online.4 The discovered patterns were categorized into Occurrence Patterns and Order Patterns:

• Occurrence Patterns:
  - Absence: a never occurs
  - Universality: a always occurs
  - Existence: a occurs
  - Bounded Existence: a occurs at most n times
• Order Patterns:
  - Precedence: s precedes p (i.e., p occurs only if s has occurred before)
  - Response: s responds to p (i.e., every occurrence of p is eventually followed by an occurrence of s)
  - Chain Precedence and Chain Response: generalizations of Precedence and Response to sequences of states

Each pattern has a scope. Figure 1 shows the available scopes and their area of effect:

• The global scope defines that a pattern must hold during the entire execution of a system. If no scope is defined, then this scope is implicitly assumed.
• The before scope (before s [p]) defines that a pattern p must hold before the first occurrence of s.
• The after scope (after s [p]) defines that a pattern p must hold after the first occurrence of s.
• The between scope (between s1 and s2 [p]) defines that a pattern p must hold between every occurrence of s1 (i.e., starting the scope) that is followed by an occurrence of s2 (i.e., closing the scope).
• The after-until scope (after s1 until s2 [p]) defines that a pattern p must hold after every occurrence of s1 (i.e., starting the scope) until the next occurrence of s2 (i.e., closing the scope), or to the end of the execution if s2 never occurs.
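To give a flavor of the underlying formalization, the following LTL formulations of selected patterns under the global scope are sketched (reproduced from memory of the well-known pattern catalog; the authoritative mappings are in the online repository mentioned above). Here \Box denotes "globally," \Diamond "eventually," and \mathcal{W} the weak-until operator:

    \begin{align*}
    \text{Absence of } a &:\quad \Box\,\lnot a\\
    \text{Existence of } a &:\quad \Diamond\, a\\
    \text{Universality of } a &:\quad \Box\, a\\
    \text{Response (} s \text{ responds to } p \text{)} &:\quad \Box\,(p \rightarrow \Diamond\, s)\\
    \text{Precedence (} s \text{ precedes } p \text{)} &:\quad \lnot p \;\mathcal{W}\; s
    \end{align*}

For example, under the before scope (before r [p occurs]), the Existence mapping becomes \lnot r \,\mathcal{W}\, (p \land \lnot r), i.e., p must occur before the first occurrence of r.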

Declare
Declare (cf. Pešić and van der Aalst [61]), also known by the names DecSerFlow (cf. van der Aalst and Pešić [80]) and ConDec (cf. Pešić and van der Aalst [62]), is a graphical declarative business process modeling language and approach. There exist transformations of its high-level graphical representations to LTL (cf. Pnueli [65] and Montali [54]) and Event Calculus (EC) (cf. Kowalski and Sergot [47] and Montali et al. [55]). As of Declare Version 2.1.0, the available constraint templates are organized as follows:5

• Existence Patterns (cf. Figure 2 for graphical representations):
  - "at least":
    * existence_n(A): State A must occur at least n times.
  - "at most":
    * absence_n(A): State A must occur at most n − 1 times.
  - "exactly":
    * exactly_n(A): State A must occur exactly n times (i.e., not more, not less).
  - "position":
    * strong_init(A): A must start and complete first.
    * init(A): A must start first, and it must complete first or remain active indefinitely.
    * last(A): A must be the last occurring element; no element other than A may occur after A.
    * error(A): This appears to be an auxiliary pattern to detect a completion of A that should not occur if A has never been started.
• Relation Patterns (cf. Figure 3 for graphical representations):
  - "no order":
    * responded_existence(A, B): If state A happens (at least once), then state B must have happened (at least once) before state A or must happen after state A.
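For comparison, commonly given LTL semantics for some of these templates (cf. Montali [54]) are sketched below, assuming atomic activities (i.e., ignoring the start/complete distinction made above); this is an illustration rather than a normative definition:

    \begin{align*}
    \text{existence}(A) &:\quad \Diamond\, A\\
    \text{init}(A) &:\quad A \text{ (} A \text{ holds at the first position)}\\
    \text{responded\_existence}(A, B) &:\quad \Diamond\, A \rightarrow \Diamond\, B\\
    \text{response}(A, B) &:\quad \Box\,(A \rightarrow \Diamond\, B)\\
    \text{precedence}(A, B) &:\quad \lnot B \;\mathcal{W}\; A
    \end{align*}

Note that responded_existence is order-agnostic (B may occur before or after A), whereas response requires B to follow every A, a distinction of the kind probed by the experimental tasks discussed later.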

Goals
The primary goal of the experiment is measuring the construct understandability of graphical and textual pattern-based behavioral constraint representations by the correctness and response time of the answers given by the participants. Additionally, the experiment aims at studying the perceived learning difficulty, the perceived difficulty regarding applying the learned behavioral constraint representation approach (i.e., the perceived application difficulty), the personal interest in using the representation, the perceived practical applicability, and the perceived potential for further improvement of the behavioral constraint representations.

Experimental Units
All 116 participants were students at the University of Vienna, Austria, who enrolled in the courses "Distributed System Engineering Lab (DSE)" and "Software Engineering 2 (SE2)" in the winter term 2017. This study aims at evaluating the understandability of pattern-based behavioral constraints from the perspective of novice software designers, which makes undergraduate students suitable test subjects. The attendance was optional and rewarded by extra credits (i.e., bonus points) for the course based on the performance in the experiment (i.e., the achieved correctness and completeness of time records). Alternatively, the students were given the chance to gain extra credits in other lab activities by going beyond the normal course requirements (e.g., by implementing more functionalities than required, or paying attention to excellent code quality). As required for a valid controlled experiment setup, all participants were randomly allocated to the four experiment groups (i.e., one for each of the four notations being studied).

Experimental Material and Tasks
In total, three documents were used per representation:

• An info sheet about the assigned behavioral constraint representation was made available to the participants one week before the experiment execution for preparation purposes. The descriptions used in these documents are based on the pattern descriptions provided by Declare and the Property Specification Patterns.6 To keep the number of language elements to remember approachable, the experiment design considers limitations in human capacity for processing information (cf. Miller [53]). That is, the info sheets of all groups were limited to introducing at most nine language elements. The experiment itself was similar to a closed-book exam, so no additional means of help were allowed. This step was taken to ensure unbiased testing of the participants' understanding of the textual terms and graphical shapes of a notation, excluding potential effects of looking up graphical shapes or textual terms.
• A question sheet consisting of general questions on the background of the participant (age, gender, level of education, years of work experience, etc.), the experimental tasks, and a Likert scale-based questionnaire to gain insights into how the different representations are subjectively perceived (e.g., perceived learning difficulty) was handed out at the beginning of the experiment session.
• An answer sheet accompanied the question sheet for marking the answers to the questions of the experimental tasks. This document makes an automated evaluation by the e-learning platform Moodle possible.7

For the creation of the experimental tasks, we used an algorithm that randomly generates traces and automatically computes the correct truth value of a constraint (i.e., fitting the trace). The implementation makes use of the Event Processing Language (EPL) [27] to encode the behavioral constraint patterns in the Complex Event Processing (CEP) engine Esper.8 Truth values were randomly altered to another truth value to create both wrong and correct answer choices; a sketch of this generation step is shown below. After the automated generation of the tasks, we manually checked each answer choice to make sure that correct and incorrect answer choices would be treated in the right way (i.e., wrong answer choices are treated as incorrect and correct answer choices are indeed treated as correct) during the automated processing by Moodle.
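The generator's implementation is not reproduced in this article; the following minimal Python sketch illustrates the idea under stated assumptions (the helper evaluate_response and all names are hypothetical stand-ins, and the four runtime truth values are explained later in this section; the actual implementation encoded the patterns in EPL and evaluated them with Esper):

    import random

    STATES = ["space", "time"]  # abstract, order-neutral event names used in the tasks
    TRUTH_VALUES = ["temporarily satisfied", "temporarily violated",
                    "permanently satisfied", "permanently violated"]

    def random_trace(max_len=6):
        # Randomly generate an execution trace over the abstract states.
        return [random.choice(STATES) for _ in range(random.randint(1, max_len))]

    def evaluate_response(trace, a="space", b="time"):
        # Hypothetical evaluator for response(a, b): every a must eventually be
        # followed by b. On a finite trace prefix this constraint is never
        # permanent: a pending a may still be resolved by a future b, and a
        # satisfied prefix may still be violated by a future a.
        pending = False
        for event in trace:
            if event == a:
                pending = True
            elif event == b:
                pending = False
        return "temporarily violated" if pending else "temporarily satisfied"

    def make_answer_choice(flip_probability=0.5):
        # One answer option: a trace plus a truth value that is randomly left
        # correct or altered to a wrong value, mirroring the generation step
        # described above.
        trace = random_trace()
        truth = evaluate_response(trace)
        correct = random.random() >= flip_probability
        if not correct:
            truth = random.choice([v for v in TRUTH_VALUES if v != truth])
        return trace, truth, correct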
In total, there were 18 experimental tasks, each consisting of a behavioral constraint and the instruction to select the correct answers on the answer sheet and to keep time records. Per task, six multiple-choice answer options were available, each consisting of an execution trace and a (correct or incorrect) truth value. For each option, the participant had to decide whether it is correct or incorrect (i.e., whether the truth value is correct for the given trace). Figure 6 shows the first task for each of the four groups, which is based on the Succession pattern. Please note that the instruction text and the table for time tracking are only shown in Figure 6(a) and omitted in (b), (c), and (d). In case a participant works on a task several times, the time tracking table offers not just a single column but four columns with four separate start and end times. Instead of letters that may suggest a chronological order of events by their alphabetical order (after "A" comes "B"), we use the abstract concepts "space" and "time" (cf. behavioral constraints in Figure 6), which do not indicate any kind of chronological order. In Figure 6(a), the answer choices (c) and (d) are correct.
When monitoring a behavioral constraint in a system at runtime, it may be of interest not only whether a specification is satisfied or violated but also whether further state changes are possible that could resolve or cause a violation of the specification. That is, the state of a specification can be either temporary (i.e., the state may still change) or permanent (i.e., the state can no longer change). Consequently, to enable a more fine-grained analysis of the participants' understanding of behavioral constraints in the experiment, we employ the concept of runtime states (cf. Bauer et al. [4,5]), which supports four truth values. In particular, a behavioral constraint at runtime is either temporarily satisfied, temporarily violated, permanently satisfied, or permanently violated. Several existing studies make use of the concept of four LTL truth value states (cf. Pešić et al. [60], De Giacomo et al. [18], Maggi et al. [51], Falcone et al. [28], Joshi et al. [43], and Morse et al. [56], to name but a few).
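To make the four runtime states concrete, consider a hedged sketch for the absence pattern (hypothetical helper name; the rules follow the runtime-state definitions above): once the forbidden state has occurred, no continuation of the trace can undo it, so the violation is permanent; as long as it has not occurred, the constraint is only temporarily satisfied, because a future occurrence would still violate it.

    def evaluate_absence(trace, a="space"):
        # absence(a): a must never occur.
        return "permanently violated" if a in trace else "temporarily satisfied"

    assert evaluate_absence(["time"]) == "temporarily satisfied"
    assert evaluate_absence(["time", "space"]) == "permanently violated"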
To reduce chances of misbehavior, the order of the answer choices was randomized between the experimental groups (cf. Figure 6(a)-(d)). That is, the answer choices remained the same in each group; only their order of presentation was different. Moreover, the design of the experiment considered orientation variations in the pattern presentation (i.e., the connector shapes were also presented rotated by 180°; cf. Figure 7 and Figure 6(b)), since the orientation possibly has an impact on understandability. However, with regard to orientation variations, the results did not reveal any conclusive impact on understandability.
Since the Succession pattern is not explicitly covered in PSP, it was realized by a combination of the Response and Precedence patterns (cf. Figure 6(a)). Table 1 summarizes other Declare patterns that are represented in PSP by combining available PSP patterns.
To support a replication of the study, we made the experimental material available online (cf. Czepa and Zdun [15]).

Hypotheses, Parameters, and Variables
Primarily, this controlled experiment focuses on the following hypotheses:

• H0,1: There is no difference in terms of understandability between the representations.
• HA,1: The approaches differ in terms of their understandability.
The understandability construct consists of two dependent variables, namely:

• the correctness achieved in trying to mark the correct answers, and
• the response time, which is the time it took to complete the 18 tasks.
These dependent variables are commonly used to measure the construct understandability (cf. Feigenspan et al. [29] and Hoisl et al. [38]). The independent variable (also called factor) focuses on the four behavioral constraint representations.
Secondarily, there are hypotheses that are concerned with the participants' opinion on the tested behavioral constraint representations:

• H0,2: There is no difference in terms of perceived learning difficulty between the representations.
• HA,2: The representations differ in terms of perceived learning difficulty.
• H0,3: There is no difference in terms of perceived application difficulty between the representations.
• HA,3: The representations differ in terms of perceived application difficulty.
• H0,4: There is no difference in terms of personal interest in using the approach between the representations.
• HA,4: The representations differ in terms of personal interest in using the approach.
• H0,5: There is no difference in terms of perceived practical application potential between the representations.
• HA,5: The representations differ in terms of perceived practical application potential.
• H0,6: There is no difference in terms of perceived improvement potential between the representations.
• HA,6: The representations differ in terms of perceived improvement potential.

Experiment Design
Wohlin et al. [84] and Kitchenham et al. [46] recommend using a simple experiment design that is appropriate for the goal of a study. In consequence, we applied a completely randomized design with one alternative per participant, which is both a simple design and appropriate for the stated goals (cf. Section 3.1). The participants are assigned to representations in an unbiased manner by using a computerized random allocation to groups.

Procedure
In total, the experiment had a duration of 90 minutes. The experimental material, namely the question and answer sheets, was provided in the form of printed documents, and the participants were informed about the procedure of the experiment. That involved instructions on how to track time, how to mark answers correctly on the answer sheet, and a pointer to the questionnaire on the last page of the question sheet. During the whole experiment session, a clock was displayed by a projector, and the participants were instructed to write down the displayed time when starting and completing work on a task. Seating arrangements were made to limit chances for misbehavior. To limit chances for experimenter bias, the experiment was designed as a multiple-choice test that supports automated processing of the given answers by the e-learning platform Moodle. Where manual interventions were necessary (e.g., imprecise markings that we had to clarify), we always applied the four-eyes principle. Moreover, the time recordings and questionnaire answers were processed manually and double-checked subsequently.

Dataset Preparation
The data set was prepared as follows. We had to remove the data of two participants. One participant used an answer sheet of a different group, which could have been wrongly assigned by the experimenters; to be on the safe side, we decided not to consider this answer sheet, as it might have led to confusion (e.g., the results might have accidentally been assigned to the wrong group). The other participant used unauthorized means of aid during the experiment, which led to the exclusion of this participant from the study. Apart from these irregularities, the experiment procedure was rigorously implemented in accordance with the planning described in Section 3. Missing values (5.6% of the dependent variables, excluding correctness) were substituted by the arithmetic mean (in case of interval scale data) or the median (in case of ordinal scale data) of the data attribute per group.
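As an illustration of this imputation step, a minimal pandas sketch (column names are hypothetical; the actual data layout may differ):

    import pandas as pd

    def impute_per_group(df: pd.DataFrame) -> pd.DataFrame:
        # Substitute missing values per experiment group: arithmetic mean for
        # interval-scale attributes, median for ordinal-scale (Likert) attributes.
        df = df.copy()
        interval_cols = ["response_time"]                 # hypothetical name
        ordinal_cols = ["perceived_learning_difficulty"]  # hypothetical name
        for col in interval_cols:
            df[col] = df.groupby("group")[col].transform(lambda s: s.fillna(s.mean()))
        for col in ordinal_cols:
            df[col] = df.groupby("group")[col].transform(lambda s: s.fillna(s.median()))
        return df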

Analysis of Previous Knowledge, Experience, and Other Features of Participants
In Figure 8, a kernel density plot and a box plot of the participants' age per group are shown. The peak density of the participants' age is 23 years, and a high density can be found in the range between 21 and 25 years (cf. Figure 8(a)). Only very few participants are older than 28 years (cf. Figure 8(a)); some of them are shown as outliers in the box plot (cf. Figure 8(b)). The graphical inspection of the data indicates no major differences in the age distribution between the groups. Neither do statistical significance tests indicate any significant differences between the experiment groups (all p > 0.05; test applied: two-sided Cliff's delta [11,70]). Figure 9(a) shows a bar chart of the participants' gender distribution. In total, there were 36 female (31.6%) and 78 male participants (68.4%). Within the groups, the gender distribution was similar; in the DG group, for example, there were 9 female (36%) and 16 male participants (64%). No significant differences were found in the gender distribution (all p > 0.05; test applied: two-sided Cliff's delta [11,70]).
The participants' level of education in computer science is shown in Figure 9(b). Since the courses we recruited the participants from primarily target bachelor students, only 21.1% of the participants hold a Bachelor of Science (BSc) degree in computer science. The distribution between the groups was as follows:

• DG: 4 participants with a BSc degree (16%) and 21 participants without any computer science degree (84%)
• DGT: 7 participants with a BSc degree (24.1%) and 22 participants without any computer science degree (75.9%)
• DT: 5 participants with a BSc degree (17.2%) and 24 participants without any computer science degree (82.8%)
• PSP: 8 participants with a BSc degree (25.8%) and 23 participants without any computer science degree (74.2%)

Both the DGT group and the PSP group have a slightly larger share of participants with a BSc degree in computer science than the DG and DT groups, but no significant differences were found in the level of education between the groups (all p > 0.05; test applied: two-sided Cliff's delta [11,70]).
With regard to programming experience (cf. Figure 10), all groups have a high density in the range of 1 to 4 years of experience. Only very few participants have less than 1 year or more than 4 years of programming experience. Overall, the groups are similar with regard to programming experience. The steeper distribution shape of the DT group appears to be no major difference, since we could not find any significant difference in programming experience between the groups (all p > 0.05; test applied: two-sided Cliff's delta [11,70]).
In the majority of cases, the participants' modeling experience is between 1 and 3 years (cf. Figure 11). There are no significant differences in modeling experience between the groups (all p > 0.05 when p values are adjusted to take multiple testing into account [6]; test applied: two-sided Cliff's delta [11,70]).
Since the computer science curricula at the University of Vienna are designed for full-time studying, the majority of the participants have little to no work experience in the software industry. Some of the students work alongside their studies or had been working in the software industry for years prior to becoming computer science students. These circumstances are very well reflected in Figure 12. The industry experience of the different groups appears to be similarly low. There are no significant differences in industry experience between the groups (all p > 0.05; test applied: two-sided Cliff's delta [11,70]).
Overall, with the exception of minor differences, which are to be expected in a completely randomized group allocation, the groups are well balanced. We could not find any significant differences. That is, there are no indications of disturbing effects on the dependent variables that might have resulted from unbalanced groups.
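The group comparisons above rely on Cliff's delta, which the study computed with the orddom R package. Purely as an illustration of the statistic itself (not the authors' code), a minimal Python sketch:

    def cliffs_delta(xs, ys):
        # Cliff's delta: the probability that a value drawn from xs exceeds one
        # drawn from ys, minus the reverse; it ranges from -1 to 1, where 0
        # indicates fully overlapping distributions.
        greater = sum(1 for x in xs for y in ys if x > y)
        smaller = sum(1 for x in xs for y in ys if x < y)
        return (greater - smaller) / (len(xs) * len(ys))

    print(cliffs_delta([1, 2, 3], [1, 2, 3]))  # 0.0 -> no group difference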

Descriptive Statistics of Dependent Variables
This section presents the descriptive statistics of the dependent variables. All gathered data have been made publicly available (cf. Czepa and Zdun [15]). Table 2 contains the number of observations and the central tendency and dispersion of the dependent variables correctness and response time per group. As a consequence of the completely random allocation to groups, there were 29 participants in the DT group, 26 in DG, 29 in DGT, and 32 in PSP. Due to irregularities (cf. Section 4.1), we had to exclude the data of one DG participant and one PSP participant. With 43.64% and 41.46%, the correctness arithmetic means of the DT and PSP groups, which are both purely textual, are between about 4 and 10 percentage points higher than those of the DG (37.79%) and DGT (34.13%) groups. Also, the median correctness values of the PSP (34.44%) and DT (33.61%) groups are between about 7 and 13 percentage points higher than those of the DG (21.67%) and DGT (26.76%) groups. According to the mean and median values, the response times appear to be slightly (about 2 to 3 minutes) faster in the PSP and DGT groups than in the DT and DG groups. Interestingly, many participants achieved a rather low level of correctness although the response times are overall far below the 90-minute limit in all experiment groups. That is, time was not a limiting factor and cannot be the cause of the low correctness scores.
The results show large differences in range between the minimum and maximum correctness. We commonly observed such large ranges in course exercises over the past years; the results of this study are thus aligned with those past observations. Almost all skew values are positive, which indicates right-skewed (i.e., right-tailed) distributions. With a small negative skew value of −0.01, the PSP response time distribution is rather symmetric. Kurtosis is another measure of the shape of a distribution that focuses on its general tailedness: positive (excess) kurtosis values indicate a more peaked distribution with heavier tails, whereas negative values indicate a flatter distribution with lighter tails. The majority of the kurtosis values of the correctness variable are negative. The sole exception is the DGT kurtosis value of 0.53, which indicates a more peaked distribution than in the other groups.
With kurtosis values close to zero, the DT (−0.03) and DGT (−0.08) response time distributions are approximately normal-tailed. In contrast, the DG response time distribution has a positive value (0.49), indicating a more peaked distribution with heavier tails, and the PSP response time distribution has a negative value (−0.62), indicating a flatter distribution with lighter tails.
In Figure 14, kernel density plots and box plots of the dependent variables correctness and response time are shown. The correctness distribution of the DGT group is steeper than those of the other groups (cf. Figure 14(a)). There are three outliers in the DGT group, indicating that only a few participants were able to achieve high levels of correctness in this group (cf. Figure 14(b)). The outlier at 80.37% correctness had prior knowledge of graphical and textual behavioral constraint approaches, while the other two outliers, with correctness values of 97.2% and 94.4%, did not have any such prior knowledge. The scatter plots of the dependent variables correctness and response time do not show any clear signs of correlation (cf. Figure 15). Moreover, the results of all Kendall's rank correlation tau tests are non-significant (cf. Table 3). Consequently, there appears to be no correlation between those dependent variables.
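The correlation analysis can be reproduced along the following lines; a minimal SciPy sketch with made-up values (the actual per-participant data are available online, cf. Czepa and Zdun [15]):

    from scipy.stats import kendalltau

    correctness = [43.6, 21.7, 55.0, 33.6, 80.4]    # illustrative values only
    response_time = [48.0, 52.5, 40.1, 61.3, 45.9]  # minutes, illustrative

    tau, p_value = kendalltau(correctness, response_time)
    # A non-significant p value is consistent with the absence of correlation
    # reported in Table 3.
    print(f"tau = {tau:.3f}, p = {p_value:.3f}")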
For a more detailed analysis of the correctness variable, we make use of a color scale plot (cf. Figure 16) to identify potentially problematic language elements. Interestingly, relations in which the order of the involved states is of importance (e.g., succession) result in lower mean correctness values than patterns in which the order is not of importance (e.g., choice). Especially the response and succession patterns show a low level of correctness in all groups. Figure 17 shows diverging stacked bar charts (cf. Heiberger and Robbins [36]) of all Likert responses, which we discuss in the following:

• Statement 1: Studying the behavioral constraint representation approach has been difficult. 48% of the PSP participants strongly agree or agree that the approach is difficult to learn, followed by DT with 41% and DGT with 38%. The share of strongly agree answers is about two times higher in the PSP group than in the DGT and DG groups. The purely graphical DG approach has the lowest percentage of (strongly) agree answers (36%). Interestingly, none of the DT participants answered with strongly agree. According to the bar chart, the purely graphical DG approach appears to be perceived as slightly less difficult to learn than the other approaches.
• Statement 2: Applying the knowledge about the behavioral constraint representation approach has been difficult. According to the gathered data, applying the approaches appears to be more difficult than learning them. With 81%, the share of (strongly) agree answers is higher in the PSP group than in the other groups, followed by the DGT group with 76%. The DG (with 16% disagreeing and 64% (strongly) agreeing) and DT approaches (with 14% (strongly) disagreeing and 69% (strongly) agreeing) are overall perceived as a little less difficult to apply than PSP (with 6% disagreeing) and DGT (with 3% disagreeing).
• Statement 3: I am personally interested in the approach and would like to use it in the (near) future. With 59%, the majority of DT participants do not show any interest in using the approach in the future. The share of neutral answers is largest in the PSP and DGT groups (52% and 48%), which indicates that the participants of these groups are rather undecided. There is, however, a tendency toward (strongly) disagreeing, with a share of about one third of the given answers in those groups. The DG group also shows a rather negative attitude toward adopting the approach, with a share of 40% (strongly) disagree answers.
• Statement 4: The behavioral constraint representation approach can be applied in practice. With 55% (strongly) agreeing, the PSP group has the largest share of positive answers. Interestingly, its share of strongly agree answers is smaller than in the other three groups, and the PSP group has no strongly disagree answers at all. DG is second with 48% positive and only 8% negative answers, followed by DT with 45% (strongly) agreeing and 8% (strongly) disagreeing. 55% of the DGT participants are undecided, but there is a clear tendency toward (strongly) agreeing (38%).
• Statement 5: The behavioral constraint representation approach can be further improved. The share of (strongly) agree answers is large (≥60%) in all groups. At the same time, there are no strongly disagree answers, and the share of disagree answers is very low (6% in PSP and 3% in DT). The largest share of (strongly) agree answers is present in the DGT group (72%).
In addition to the visualization by diverging stacked bar charts in Figure 17, we are interested in the shape of the distributions as well, as it is important for testing the model assumptions of statistical tests. Figure 18 shows kernel density plots of the Likert responses with especially striking differences in distribution shape. In Figure 18(a), the PSP and DG distribution shapes are less steep than those of the other two approaches, and the PSP group has its peak at agree whereas the DG group has its peak at neutral. In Figure 18(b), the DG distribution shape is rather flat in comparison to the remaining distribution shapes. The DT group has its peak at disagree, while the DGT and PSP approaches show a similar distribution shape and location with a peak at neutral.

STATISTICAL INFERENCE
For proper hypothesis testing, it is important to select the most suitable method. In particular, it is preferable to choose the method with the greatest statistical power whose model assumptions are met by the properties of the data. A crucial model assumption of parametric testing is normality. The graphical analysis by normal Q-Q plots (cf. Figure 19) and Shapiro-Wilk tests of normality (cf. Table 4) suggest that the normality assumption does not hold in multiple cases. Specifically, the normality assumption does not hold for the correctness variable in any group. Since there are indications of non-normality in the metric dependent variables correctness and response time (cf. Section 4.3), the model assumptions for parametric testing are violated; therefore, parametric testing is ruled out. The non-parametric Kruskal-Wallis test assumes that the distribution shapes do not differ apart from their central locations. The relevant descriptive statistics of the data (cf. Section 4.3), namely the skew/kurtosis values and the kernel density plots, suggest differences in the shape of the distributions between the groups. Due to these properties of the data, we make use of Cliff's delta (cf. Cliff [11] and Rogmann [70]), a robust non-parametric method that is unaffected by differences in distribution shape and by non-normal data. None of the test results is significant (cf. Table 5 and Table 6). Consequently, the null hypotheses H0,1 to H0,6 (cf. Section 3.4) cannot be rejected. The statistics software R was used for all statistical analyses.9 In particular, we used the following libraries in the course of our statistical evaluations: biotools [17], car [31], ggplot2 [83], mvnormtest [75], mvoutlier [63], orddom [70], psych [68], usdm [57], and likert [41].
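The study performed its analyses in R; purely as an illustration of the normality check that motivates the choice of a robust method, a minimal Python/SciPy sketch with made-up data:

    from scipy.stats import shapiro

    # Illustrative correctness scores (percent) for one group; not the study data.
    correctness_scores = [21.7, 35.2, 18.4, 44.9, 30.6, 27.8, 52.3, 19.5]

    statistic, p_value = shapiro(correctness_scores)
    if p_value < 0.05:
        print("Normality rejected -> prefer a robust non-parametric method.")
    else:
        print("No evidence against normality.")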

ANALYSIS OF FREE TEXT ANSWERS
In addition to the 18 experiment tasks and the Likert scale-based questionnaire, we asked three free text questions to capture the thoughts of the participants regarding the studied and applied behavioral constraint representation. These questions focused on the personal opinion of the participants regarding positive ("likes") and negative ("dislikes") aspects of the assigned behavioral constraint approach as well as suggestions for improvement. In particular, the following three questions were asked: • What do you like about the behavioral constraint representation approach?
• What do you dislike about the behavioral constraint representation approach?
• How would you improve the behavioral constraint representation approach?
Our analysis of the participants' textual answers was inspired by the summative content analysis approach [40]. Since the majority of the answers given by the participants are very short and in note form, running a full-blown summative content analysis, which usually focuses on journal manuscripts or specific content in textbooks, is impossible. Nevertheless, it is possible to use the core idea of the technique, namely counting occurrences of specific content and interpreting the context associated with its use. In the following, we present the results of this analysis:

• Of all participants, 17.5% (13.8% in DT, 20% in DG, 17.2% in DGT, and 19.4% in PSP) showed interest in practical examples and case studies to deepen their knowledge and to grasp the full potential of their assigned approach, especially when applied in real-world scenarios.
• Of all participants, 16.7% (20.7% in DT, 8% in DG, 27.6% in DGT, and 9.7% in PSP) stated that they would prefer a more formal definition of the available patterns (and scopes) of the assigned behavioral constraint representation to alleviate the ambiguities that are inherently present in natural language.
• Three participants (9.7%) of the PSP group and one participant (3.4%) of the DT group reported issues regarding understanding truth values, especially the difference between temporary and permanent states, whereas one participant (3.4%) of the DGT group mentioned truth values positively and another one negatively. Neither positive nor negative aspects regarding truth values were mentioned by any participant of the DG group.
• Two participants (6.9%) of the DGT group and one participant (3.4%) of the DT group stated that they would have wanted access to the learning material during the experiment, because they had problems memorizing the meaning of the behavioral constraint patterns.
• Of the DG group participants, 40% reported problems regarding the graphical notation, while 8% mentioned positive aspects. Two DG participants mentioned that the exclusive choice and choice symbols are hard to differentiate. Another participant stated that the not_succession pattern was difficult to grasp and remember. One participant reported that it was difficult to remember the order associated with relation patterns. Making use of more symbols rather than combinations of symbols was proposed by one participant. The feedback of the remaining participants was more general in nature (e.g., "syntax is confusing" or "meaning of signs is hard to understand"). Merely a single participant stated that the shapes used in the graphical representation are clear and easy to read. Another participant mentioned his personal preference for graphical approaches in general.
• The share of DGT participants reporting problems with their assigned representation is 24.1%. Similar to the feedback of one DG participant, one participant would prefer using more shapes for a better differentiation of the patterns. Another participant found the naming of the patterns unclear; in this regard, yet another participant suggested using terms present in Boolean algebra (e.g., "or" instead of "choice"). The rest of the comments are more general in nature (e.g., "not intuitive"). Two participants (6.9%) mentioned the graphical operators positively ("I liked the graphic representation, as the graphics contained some semantic information about the constraint" and "easy to understand representation in form of simple symbols").
• Of the DT participants, 20.7% mentioned negative aspects about their assigned textual representation, while 10.3% mentioned positive aspects. Like one of the DGT participants, one DT participant desired a clearer naming of the patterns. Another participant mentioned that the naming of the patterns is appropriate. Interestingly, one participant reported problems understanding the meaning of the direction of statements (e.g., whether succession(A, B) means A succeeds B or B succeeds A). Originally, we had assumed that the direction would be understood implicitly from the reading direction. However, just a single participant reported this issue, so it is highly questionable whether this is a general issue. Two participants reported difficulties in using the approach due to "a lot of similarities between the constraints," which makes them hard to distinguish. Two participants suggested adding a negation operator to the constraint language to support negations of each of the available patterns. Another participant liked that there exists a specific pattern for "every case" but at the same time criticized that the number of constraints grows rapidly if the implementation of new scenarios becomes necessary. One participant liked the function-like style of the approach, which is familiar to programmers.
• Of the PSP participants, 19.4% criticized some aspects of their assigned representation, while 9.8% mentioned positive aspects. One participant would prefer a "more sophisticated visualization" instead of the textual representation. Another participant was fond of the natural language approach but criticized the lack of syntax highlighting in the experiment. We wanted to present all four tested representations by similar means to avoid bias toward a specific representation, so syntax highlighting was intentionally omitted in the PSP task descriptions. In actual implementations of the PSP approach, syntax highlighting or similar techniques are usually supported (e.g., Czepa et al. [13]). One participant liked the use of Boolean connectors but would have wanted to see the actual symbols instead of words (e.g., ∧ for and). A similar comment was made by another participant, who suggested "writing in a more mathematical way." Another participant would have wanted to see the xor operator in use. The difference between the between and after-until scopes was mentioned as "difficult to understand" by one participant. Another participant mentioned that the scopes and patterns are clearly understandable.

Evaluation of Results and Implications
While the descriptive statistics and the results of the analysis of free text answers appear to be slightly in favor of the textual approaches, the results of the inferential statistics do not indicate any significant difference between the tested representations. That is, the following null hypotheses cannot be rejected:

• H0,1: There is no difference in terms of understandability between the representations.
• H0,2: There is no difference in terms of perceived learning difficulty between the representations.
• H0,3: There is no difference in terms of perceived application difficulty between the representations.
• H0,4: There is no difference in terms of personal interest in using the approach between the representations.
• H0,5: There is no difference in terms of perceived practical application potential between the representations.
• H0,6: There is no difference in terms of perceived improvement potential between the representations.
However, it is striking that the achieved correctness is rather low on average. From a prior experiment on the understandability of textual behavioral constraint approaches (cf. Czepa and Zdun [16]), it is evident that higher correctness values (about 70% on average for PSP) are achievable if access to learning material and other material (e.g., handwritten notes) is granted during the experiment session. That is, it appears to be difficult to deduce the meaning of pattern-based behavioral constraints from their textual and/or graphical representations without additional support. As no group received additional support, we do not think that this aspect could have influenced the relative outcomes of the experiment; rather, both the graphical and the textual approaches might benefit from a greater degree of additional support in a similar fashion. The analysis of the given free text answers regarding positive/negative aspects and suggestions for improvement (cf. Section 6) provides additional evidence in this regard. Consequently, there are two angles for improvement, namely the representation itself (i.e., finding better graphical and/or textual representations) and the support provided (i.e., supportive means provided by a behavioral constraint modeling tool).

Threats to Validity
All known threats that might have an impact on the validity of the results are discussed in this subsection.

Threats to Internal Validity.
Threats to internal validity can be described as unobserved variables that might have an unwanted influence on an experiment's result by disturbing the causal relationship of independent and dependent variables. Several threats to the internal validity of this experiment must be discussed:

• History effects refer to events that occur in the environment and change the conditions of a study. The short duration of the study limits the possibility of changes in environmental conditions, and we did not observe any; still, we cannot entirely rule out effects of events that occurred prior to the study taking place. However, in such a case, it would be extremely unlikely that the scores of one group are more affected than those of another, because of the random allocation of participants to groups.
• Maturation effects refer to the impact that time has on an individual. Since the duration of the experiment was short, maturation effects are considered to be of minor importance, and we did not observe any such effects.
• Testing effects comprise learning effects and experimental fatigue. Learning effects were avoided by testing each person only once. Experimental fatigue is concerned with occurrences during the experiment that exhaust the participant either physically or mentally. The participants did not report, and we did not observe, any signs of fatigue.
• Instrumental bias occurs if the measuring instrument (i.e., a physical measuring device or the actions/assessment of the researcher) changes over time during the experiment. We tried to avoid instrumental bias by using an experimental design that enables an automated and standardized processing of the test results.
• Selection bias is present if the experimental groups are unequal before the start of the experiment (e.g., severe differences in relevant experience, age, or gender). Usually, selection bias is more threatening in quasi-experimental research. By using an experimental research design with the fundamental requirement to randomly assign participants to the different groups, we can avoid selection bias to a large extent. Moreover, our evaluation of the composition of the groups (regarding age, gender, and experience/education in different dimensions) did not indicate any major differences.
• Experimental mortality is only likely to occur if the experiment lasts for a long time, because the chances for dropouts increase (e.g., participants moving to another geographical location). Due to the short time frame of this study, experimental mortality was not an issue at all.
• Diffusion of groups occurs if a group of the experiment is contaminated in some way by another experiment group. We tried to mitigate this threat by asking the participants not to disclose or discuss anything related to the experiment before the experiment session. Since the participants share the same social group and interact outside the research process as well, we cannot entirely rule out cross-contamination between the groups.
• Compensatory rivalry is present if participants of a group put in extra effort when they have the impression that the representation of another group might lead to better results than their own. This threat is mitigated by insisting on nondisclosure.
• Demoralization could occur if a participant is assigned to a specific group that she/he does not want to be part of. We did not observe any signs of demoralization, such as increased dropout rates or complaints regarding group allocation.
• Experimenter bias refers to undesired effects on the dependent variables that are unintentionally introduced by the researcher. The experiment was designed in a way that limits the chances for this kind of bias. In particular, all participants received similar training and worked on the same set of tasks (i.e., only the constraint representation differed). Moreover, the results of the controlled experiment were processed automatically in a standardized procedure.
• Self-selection bias: The possibility of self-selection bias appears to be negligible, as merely three students participated in the alternative activities.
• Impact of preparation: We designed the preparation material in such a way as to keep the effort involved in learning the patterns at a manageable level for each participant. In accordance with Miller's law [53], at most nine language elements were presented in each experiment group. Consequently, the learning effort was minimal, which strongly mitigates the risk of insufficient preparation. Moreover, instead of directly asking about the degree of effort spent on preparation, which might lead to insincere answers (i.e., a participant might expect to be punished for not preparing well), we tried to check indirectly by asking how difficult studying the approach was. We assumed that a participant who did not prepare would not have a strong opinion on that topic and would tick the neutral item or abstain. Subsequently, we removed the data of these participants from the data set and performed hypothesis testing again. As with the full data set, no test result was significant. That is, even if participants who possibly did not prepare are not considered, the results still apply.

Threats to External Validity.
The external validity of a study focuses on its generalizability. In the following, we discuss potential threats that hinder generalization. Different types of generalization must be considered:

• Generalizations across populations: By statistical inference, we try to make generalizations from the sample to the immediate population. We do not intend to claim generalizability across populations without further empirical evidence. This study has a strong focus on the understandability of the tested representations from the viewpoint of novice software designers. We acknowledge that expert users who are familiar with Declare and/or the Property Specification Patterns would potentially perform better.
• Generalizations across groups: Since the experiment groups focus on specific behavioral constraint representations, options for variation are limited. Nevertheless, in future studies, it might be interesting to introduce new or amended representations (e.g., a graphical representation that is based on DGT but uses just a single relation shape).
• Generalizations across settings/contexts: The participants of this study are students who enrolled in computer science courses at the University of Vienna, Austria. A majority of the students are Austrian citizens, but there is a large presence of foreign students as well. Surely, it would be interesting to repeat the experiment in different settings/contexts to evaluate the generalizability in that regard. For example, repeating the experiment with English native speakers might lead to different (presumably better) results, since English terms are used in the textual/hybrid constraint representations.
• Generalizations across time: In general, it is hard to predict whether the results of this study hold over time. For example, if the teaching of graphical or textual behavioral constraint approaches is intensified in the computer science curricula at the University of Vienna, then the students would bring in more expertise, which would likely have an impact on the results.

Threats to Construct Validity.
There are potential threats to the validity of the construct that must be discussed:

• Inexact definition and construct confounding: This study has a primary focus on the construct understandability, which is measured by the dependent variables correctness and response time. This construct is exact and adequate; several existing studies use this construct and its variables (cf. Feigenspan et al. [29] and Hoisl et al. [38]).
• Mono-method bias: To measure the correctness of answers, evaluation by an automated method appears to be the most accurate measure, as it does not suffer from experimenter bias or instrumental bias. Keeping time records was the personal responsibility of each participant for organizational reasons. The participants were instructed extensively on how to keep time records, and they were informed that accurate time record keeping would have a positive impact on the final grading. We also made clear that the overall response time has no influence on the grading, to avoid time stress. We did not detect any irregularities (e.g., overlapping time frames or long pauses) in those records. Nonetheless, this measuring method leaves room for measuring errors, and an additional or alternative measuring method (e.g., performing the experiment with an online tool that handles record keeping) would reduce this threat. The additional task of keeping accurate time records might have had a negative impact on performing the actual experiment tasks, but no participant reported any such effect.
• Reducing levels of measurement: Both correctness and response time are continuous variables. That is, the levels of measurement are not reduced. The Likert scales (also called agree-disagree rating scales) used in this study offer 5 answer categories rather than 7 or 11, because the latter produce data of lower quality according to Revilla et al. [69].
• Group-sensitive factorial structure: In some empirical studies, a specific assigned experiment group might sensitize participants to develop a different view on a construct. Since we did not ask questions regarding the subjective level of understandability but tried to measure the actual level of understandability objectively, this threat appears not to be present. The questionnaire at the end of the question sheet is neither meant nor used to measure the understandability construct, but to measure other aspects. Here, we tried to mitigate this threat by focusing on one-dimensional constructs (i.e., the multi-dimensional construct perceived difficulty is split up into perceived learning difficulty and perceived application difficulty).

Threats to Content Validity.
Content validity is concerned with the relevance and representativeness of the elements of a study for the construct that is measured:

• Relevance: All tasks of this study are based on recurring behavioral constraint patterns that are present in existing graphical and textual behavioral constraint approaches (cf. References [24,25,61,80]).
• Representativeness: A representative subset of existing behavioral constraint patterns was used for designing the tasks of the experiment. In this study, we focused on a set of commonly used binary relations (cf. References [24,25,61,80]). Unary constraints are common as well, but we decided to omit them due to their simplicity. It is also worth noting that some behavioral constraints in Declare are not covered in the Property Specification Patterns, and vice versa. In particular, the scopes of PSP are not part of Declare, the chain patterns have a different meaning in Declare and PSP, and Declare supports additional behavioral constraints (i.e., the alternate, negative relation, and choice patterns). Nevertheless, some of the Declare patterns that are not explicitly covered by PSP can be realized by combinations of Property Specification Patterns; others, like the alternate patterns of Declare, are not covered by the originally proposed PSP collection. Naturally, it would have been possible for us to extend both Declare and PSP with new patterns, including new graphical and textual elements, but proposing new pattern representations was not the goal of this empirical study, so the established as-is state of these approaches was covered.

Threats to Conclusion Validity.
Retaining outliers might be a threat to conclusion validity. However, all outliers appear to be valid measurements, so deleting them would pose a threat to conclusion validity as well. We performed a thorough investigation of model assumptions before applying the most suitable statistical test with the greatest statistical power, given the properties of the acquired data (a minimal sketch of this selection procedure follows). This course of action strengthens the conclusion validity of this study.
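The following Python sketch illustrates such an assumption-driven test selection with SciPy. It assumes a simplified comparison of the correctness scores of two groups; the actual analysis of this study may involve different groups, assumptions, and tests, so the sketch should be read as an illustration of the procedure, not as the exact analysis performed.

    from scipy import stats

    def compare_groups(group_a, group_b, alpha=0.05):
        """Check model assumptions, then choose the test with greater power.

        The parametric t-test is used only if both samples pass the
        Shapiro-Wilk normality test and Levene's equal-variance test;
        otherwise the non-parametric Mann-Whitney U test is applied.
        """
        _, p_a = stats.shapiro(group_a)
        _, p_b = stats.shapiro(group_b)
        _, p_var = stats.levene(group_a, group_b)
        if p_a > alpha and p_b > alpha and p_var > alpha:
            return "t-test", stats.ttest_ind(group_a, group_b)
        return "Mann-Whitney U", stats.mannwhitneyu(group_a, group_b,
                                                    alternative="two-sided")

    # Hypothetical correctness fractions of two experiment groups:
    graphical = [0.40, 0.55, 0.30, 0.60, 0.45, 0.50, 0.35, 0.65]
    textual = [0.50, 0.60, 0.40, 0.65, 0.55, 0.45, 0.70, 0.50]
    print(compare_groups(graphical, textual))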

RELATED WORK
We are not aware of any existing empirical studies that investigate the differences in understandability of representative graphical and textual behavioral constraint languages with a scope and depth similar to those of the presented study.
There also exists a large body of studies on the understandability of models that are only remotely related to our work. For example, a study by Reijers and Mendling [67] investigates the understandability of classical flow-driven business process models. Interestingly, professionals could not be distinguished from the students of the two participating universities, and the students of one university even performed better than the professionals. However, since that study considers flow-driven business processes only, the results are hardly transferable to behavioral constraints, which are declarative in nature.
In the following, we focus on studies that are highly related to the presented work, namely studies that are concerned with the understandability of behavioral constraints.

Empirical Studies on the Understandability of Behavioral Constraint Representations in Software Architecture and Software Engineering
Only very few studies analyze and compare different behavioral constraint languages in the fields of software architecture and software engineering.
An eye-tracking experiment with 28 participants by Sharafi et al. [74] on the understandability of graphical and textual software requirement models did not reveal any statistically significant difference in terms of correctness of the approaches, which is in line with the results of our study.
However, software requirement models and behavioral constraints are only distantly related. The study also reports that participants working with the graphical representations responded more slowly. Interestingly, though, the participants preferred the graphical notation.
Czepa et al. [14] compared the understandability of three languages for behavioral software architecture compliance checking, namely the Natural Language Constraint language (NLC), the Cause-Effect Constraint language (CEC), and the Temporal Logic Pattern-based Constraint language (TLC), in a controlled experiment with 190 participants. The NLC language simply uses the English language for software architecture descriptions. CEC is a high-level structured architectural description language that abstracts the Event Processing Language [27] and enables nesting of cause parts, which observe an event stream for a specific event pattern, and effect parts, which can contain further cause-effect structures and truth value change commands. TLC is a high-level structured architectural description language that abstracts behavioral patterns. Interestingly, the statistical inference of this study suggests that there is no difference in the understandability of the tested languages. This could indicate that the employed high-level abstractions bring those structured languages closer to the understandability of unstructured natural language architecture descriptions. It might also suggest that natural language leaves more room for ambiguity, which is detrimental to understanding. Potential limitations of that study are that its tasks are based on common architectural patterns/styles (i.e., a participant possibly recognizes the meaning of a constraint more easily by having knowledge of the related architectural pattern) and the rather small set of involved behavioral constraint patterns (i.e., only very few behavioral constraint patterns were necessary to represent the architecture descriptions).
Hoisl et al. [38] conducted a controlled experiment on three notations for scenario-based model tests with 20 participants. In particular, they evaluated the understandability of a semi-structured natural language scenario notation, a diagrammatic scenario notation, and a fully structured textual scenario notation. According to the authors, the purely textual, semi-structured natural language scenario notation is recommended for scenario-based model tests, because the participants of this group were able to solve the given tasks faster and more correctly. That is, the study might indicate that a textual approach outperforms a graphical one for scenario-based model tests, an effect that our study did not discover for behavioral constraints. However, the validity of the experiment is limited by its small sample size and the lack of statistical hypothesis testing.
A controlled experiment carried out by Heijstek et al. [37] with 47 participants focused on finding differences in the understanding of textual and graphical software architecture descriptions. Interestingly, participants who predominantly used textual architecture descriptions performed significantly better, which suggests that textual architecture descriptions could be superior to their graphical counterparts. In our study, which focuses specifically on textual and graphical behavioral constraints rather than software architecture descriptions, no such effect was measurable.

Empirical Studies on the Understandability of Behavioral Constraint Representations in Business Process Management
In the field of business process management, there exist studies that evaluate the understandability of declarative business processes that are composed of a set of behavioral constraint patterns. These studies are highly related to our work, since they investigate the understandability of pattern-based behavioral constraints in the context of declarative business processes.
Weber et al. [82] carried out a controlled experiment (with 25 and 16 participants) on the impact of varying numbers of pattern-based behavioral constraints in planning and executing a journey. In particular, one group was exposed to only 2 behavioral constraints, while the other had to take 12 behavioral constraints into account. Interestingly, their statistical analysis does not show any significant difference in understanding. That might indicate that potential users handle varying numbers of constraints well, but the difference between 2 and 12 constraints might also be too small to be measurable. It would be interesting to evaluate how users cope with larger numbers of constraints (e.g., 25, 50, 100) as well. Moreover, the small sample sizes are a threat to the validity of this study.
Zugal et al. [87] investigated the understandability of hierarchies in declarative business processes in an experiment with nine participants. The results of their research indicate that hierarchies must be handled with care. While information hiding and improved pattern recognition are considered positive aspects of hierarchies, since they lower the mental effort required to understand a process model, the fragmentation of processes by hierarchies might lower the overall understandability of the process model. Another important finding of their study is that users appear to approach declarative process models in a sequential manner even if they are definitely not biased by previous experiences with sequential business process models (e.g., BPMN [59]). The authors conclude that the abstract nature of declarative process models does not seem to fit the human way of thinking. Moreover, they observed that the participants of their study tried to reduce the number of constraints under consideration by putting away sheets that describe irrelevant sub-processes or by using their hands to hide irrelevant parts of the process model. The validity of this study is strongly limited by the extremely small sample size.
Haisjackl et al. [34] investigated users' understanding of declarative business process models composed of a set of 10 behavioral constraint patterns, again with nine participants. The evaluation seems to be based on the same experimental data as the work by Zugal et al. [87]. Like that work, they point out that users tend to read such models sequentially despite the declarative nature of the approach. The larger a model, the more often hidden dependencies were overlooked, which indicates that increasing numbers of constraints lower understanding. Moreover, they report that single constraints are overall well understood, but there seem to be problems with understanding the precedence constraint. As the authors point out, this kind of confusion could be related to the graphical arrow-based representation of the constraints, in which subtle differences determine the actual meaning. That is, the arrow could be confused with a sequence flow as present in flow-driven, sequential business processes. As previously stated for the work by Zugal et al. [87], the validity of this study is possibly strongly affected by the small sample size.
Haisjackl and Zugal [33] investigated differences between textual and graphical declarative workflows using the Declare notation in an empirical study with nine participants. The evaluation seems to be based on the same experimental data as the works by Zugal et al. [87] and Haisjackl et al. [34]. This study is highly related to the work presented in this article. The authors state that the results of their study indicate that the graphical representation is advantageous in terms of perceived understandability, error rate, duration, and mental effort, but this conclusion seems to be based merely on descriptive statistics (i.e., arithmetic means and counts of occurrences). The lack of hypothesis testing and the small number of participants are severe threats to the validity of this study.
An approach by De Smedt et al. [21] tries to improve the understandability of declarative business process models by revealing hidden dependencies. They conducted an experiment with 95 students. The results suggest that explicitly showing hidden dependencies enables a better understanding of declarative business process models.
A study by Pichler et al. [64] compares the understandability of imperative and declarative business process modeling notations. This study indicates that imperative process models are significantly more understandable than declarative ones, but the authors also state that the participants had more previous experience with imperative than with declarative process modeling. Moreover, the sample size (28 participants) is rather small, which is a threat to the validity of this study.
Mendes Cunha et al. [52] try to improve declarative business process modeling by taking the comments of four persons into consideration. The resulting language is based on the same behavioral constraint patterns but proposes different graphical notations. The small number of participants and the lack of an evaluation of the proposed alternative graphical elements are serious threats to validity.

Summary
The results of this controlled experiment study with 116 participants did not reveal any significant difference in understandability, or in any of the other tested aspects (i.e., perceived learning difficulty, perceived application difficulty, personal interest in using the representation, perceived practical applicability, and perceived potential for further improvement of the behavioral constraint representations), between graphical, textual, and hybrid behavioral constraint representations. Only the descriptive statistics and the results of the analysis of free-text answers are slightly in favor of the tested textual behavioral constraint approaches. The achieved correctness is rather low on average in all experiment groups. A prior experiment on the understandability of textual behavioral constraint approaches (cf. Czepa and Zdun [16]) yielded higher correctness values (about 70% on average for PSP) when access to learning material and other material (e.g., handwritten notes) was granted during the experiment session. That is, it appears to be difficult to deduce the meaning of pattern-based behavioral constraints from their textual and/or graphical representations without additional support. The analysis of the given free-text answers regarding positive/negative aspects and suggestions for improvement (cf. Section 6) provides additional evidence in that regard.

Impact
Since there appears to be no significant difference in the understandability of textual and graphical behavioral constraint approaches, the results of this empirical study might indicate that the tested representations can be used interchangeably. However, a major obstacle in this regard could be the overall low level of achieved correctness, which must be investigated further. In response to the low level of achieved correctness, this study indicates two angles for further research on and improvement of textual and graphical behavioral constraint representations: the representation itself (i.e., finding better graphical and/or textual representations) and the provided technology support (i.e., the support provided by a behavioral constraint modeling tool or by analysis, refactoring, and debugging tools).
Our carefully designed and conducted empirical study can serve as a solid foundation for further empirical evaluations of pattern-based behavioral constraint representations and their future development.

Future Work
The experiment could be repeated with different user groups (e.g., industrial practitioners) to gain further insights into the understandability of the representations from different perspectives. Other experiments could be run to investigate the results further. Experiments with different symbols in graphical behavioral constraint representations and with variations of the terms used in textual approaches are also opportunities for future research. For example, a new representation could be introduced to the current experimental setup that streamlines the hybrid Declare approach (DGT) by reducing the number of available connector shapes to a single shape, just like relations in an ontology.

For evaluating the understandability of interrelated behavioral constraint collections, or of the process of creating such collections, qualitative studies based on eye-tracking and think-aloud protocols [26] would further evolve the body of knowledge. Moreover, the presented study focuses on the understandability of already given textual and graphical behavioral constraints, so conducting an experiment on the understandability related to authoring textual and graphical constraints would be another interesting possibility for future research. In this regard, adequate tool support is assumed to be a major topic.

The results of this experiment indicate that behavioral constraints in which the order of the involved states is of importance are particularly difficult to understand, so these elements should receive special attention. To improve the understandability of these constraints, a behavioral constraint editor could, for example, provide a tooltip with an animation that illustrates the temporal order of the involved elements (e.g., showing sample traces with the corresponding truth value state). Also, the textual terms and/or graphical elements of the representations should be revisited. For example, the response pattern (both in its textual and graphical form) can easily be misunderstood as a strict sequence (e.g., as known from procedural/imperative modeling languages); the formulas after this paragraph make this ambiguity explicit. Such ambiguities must be avoided to achieve higher levels of understanding. An alternative textual representation of the response pattern, which might leave less room for misunderstanding by emphasizing the temporal order, could be "A at time ta requires B at time tb > ta". The corresponding amended Declare notation of the response pattern is shown in Figure 20. Such amendments can be the starting point for further empirical evaluations with the goal of improving the understandability of behavioral constraint representations.
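The standard LTL formalization of the response pattern, contrasted with the strict-sequence misreading, is shown below; both formulas are a conventional rendering added here for illustration and were not part of the tested material:

    % Intended reading: every occurrence of A is eventually followed by a B;
    % B need not occur immediately, and other states may occur in between.
    \Box\,(A \rightarrow \Diamond B)

    % Strict-sequence misreading: B must occur immediately after A,
    % which actually corresponds to a chain-style constraint.
    \Box\,(A \rightarrow \mathrm{X}\,B)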