No search allowed: what risk modeling notation to choose?

[Background] Industry relies on tabular notations to document risk assessment results, while academia encourages the use of graphical notations. Previous studies revealed that tabular notations and graphical notations with textual labels provide better support for extracting correct information about security risks than an iconic graphical notation. [Aim] In this study we examine how well tabular and graphical risk modeling notations support extraction and memorization of information about risks when models cannot be searched. [Method] We present the results of two experiments with 60 MSc and 31 BSc students in which we compared their performance in extraction and memorization of security risk models in tabular, UML-style, and iconic graphical modeling notations. [Result] Once search is restricted, the tabular notation demonstrates results similar to the iconic graphical notation in information extraction. In the memorization task, tabular and graphical notations showed equivalent results, but the equivalence is statistically significant only between the two graphical notations. [Conclusion] The three notations provide similar support to decision-makers when they need to extract and remember correct information about security risks.

clearly communicated with stakeholders to benefit from the findings and lead to the implementation of proposed recommendations and necessary decisions, e.g., selecting a proper cyber insurance product. Professionals highlighted the importance of communication as one of the critical features for SRA methods [26, Table 2]. Specifically, in large corporations the decision-making process involves stakeholders with different backgrounds and visions. Therefore, it is critical to communicate security risk information in a straightforward and objective way. For this purpose, industrial practitioners mostly rely on tabular notations, e.g., the ISO 27005, NIST 800-30, or BSI IT-Grundschutz standards. Academia bets on graphical notations like i* [14] and the CORAS language [29], or the approach recently proposed by Li et al. [28] for visualizing information security threats. There are some exceptions: for example, academia proposed the SREP method [33] based on tables, while industry applies the Microsoft STRIDE [18] approach that uses Data Flow Diagrams.
Previous studies with students and professionals showed that the tabular notation better supports extraction of correct information about security risks than the iconic graphical notation [23,25]. However, those findings might not give a full picture and have some limitations of construct validity: the comprehension task could potentially be biased in favor of tabular notations and did not reveal the comprehensibility potential of graphical representations. Therefore, the goal of this study is to 1) compare tabular and graphical risk models in more equal settings and 2) advance the evaluation of different comprehensibility facets, namely information extraction and memorization. In the extraction task we address the validity concern by providing both tabular and graphical risk models in the form of images, which does not allow participants to search (or filter tables) within the models' artifacts. The memorization facet aims at mitigating the possible look-up nature of the comprehension task, as participants have to fulfill the task without the model. It also tests how well the different types of models support memorization of information about security risks from the decision-maker's viewpoint.
The results reported in this paper address the following question: "Which security risk model is more effective in extracting and memorizing correct information about security risks?" To answer this question we conducted two controlled experiments with 60 MSc and 31 BSc students who were asked to complete similar comprehensibility tasks with and without having security risk models. In the information extraction task we observed that participants with the tabular notation obtained precision and recall similar to participants who used the iconic graphical notation, even though it was not possible to search or sort the tabular model. In the memorization task, participants with the tabular notation showed slightly lower comprehensibility in comparison to participants who used the iconic graphical or UML notations, but the difference is insignificant.

RELATED WORK
There are three main research streams in the literature comparing textual and visual notations. The first stream includes studies proposing cognitive theories to explain the difference between notations and their relative strengths [34,48]. The second stream consists of studies comparing various notations from a conceptual viewpoint [21,38]. The last one contains empirical studies comparing graphical and textual representations for, e.g., business processes [36], software architectures [17], and safety and system requirements [9,42,44-46]. Recently, a few empirical studies were published examining representations for security risks [15,19,25,50] or comparing graphical and tabular methods for security risk assessment in full-scale application [22,24,27,30].
Empirical Research of Software Modeling Notations. Abrahao et al. [1] presented a family of controlled experiments with 112 participants with different levels of experience to investigate the effectiveness of dynamic modeling in requirements comprehension. The findings suggest that requirements specifications complemented with dynamic models (sequence diagrams) improve the understanding of software requirements in contrast to using only the specification document. Scanniello et al. [39] reported a meta-analysis of a family of 12 controlled experiments with students and professionals to study the effect of UML analysis models on source-code comprehensibility. As treatments, they provided participants with source code with and without UML analysis models. The findings suggest that using UML models harms source code understanding and increases the time necessary to complete the comprehension task. Sharafi et al. [42] compared three requirements modeling notations (Tropos diagrams, structured textual representation, and a mix of the two) regarding their effect on requirements comprehension. They did not observe any significant differences between models in participants' response precision, but they found that participants who used the mixed representation spent significantly less time to complete the task in comparison to the participants with only textual or graphical models. The authors explained that the latter finding could be due to a learning effect.
Empirical Research of Safety Modeling Notations. The research group of Stålhane et al. made a significant contribution to comprehensibility research in the safety domain. They conducted a family of controlled experiments [44-46] to compare how useful textual and graphical notations are for the identification of safety hazards in security requirements analysis. The authors provided participants with textual use cases with system sequence diagrams [45,46] and misuse case diagrams with textual misuse cases [44]. The results showed that textual representation assists users in focusing on the relevant areas. Also, the textual alternative demonstrated better results in the identification of threats related to functionality and user behavior, while diagrams helped in understanding the system's internal workings and identifying related threats. Recently, de la Vara et al. [8] investigated the comprehension of safety compliance needs with textual and UML representations. The results revealed a small positive effect of using UML activity diagrams on the average effectiveness and efficiency of participants in understanding compliance needs, but the difference was not statistically significant.
Empirical Research of Security Risk Modeling Notations. In the past decade, empirical studies of security risk model comprehension have received contributions from different research groups. Matulevičius [31] reported an experiment with 28 graduate students in Computer Science to compare the BPMN, Secure Tropos, and misuse cases risk-oriented modeling notations w.r.t. their comprehensibility. The outcomes showed that BPMN-based models were the most comprehensible of the three, while Secure Tropos and misuse case models were almost equal. A possible limitation of the study is that comprehension was measured by simple 'look up' questions (e.g., "what is the security criterion?"). Managers who get SRA models must understand not only individual threat actors or vulnerabilities but also the relationships between them. We tried to address this aspect in the design of our comprehension questions (see Sec. 3 on p. 3).
Hogganvik and Stølen [19] compared the comprehensibility of UML and CORAS models in two controlled experiments with students. The results showed that the participants who used CORAS gave slightly more correct responses and spent less time to answer questions. A possible constraint of the study is that participants had ~5 min to answer 4-5 questions. We addressed this issue by allocating 40 minutes to answer 12 comprehension questions in total. The weakness of this work is the focus on diagram-based notations. In our work, we filled this gap by comparing UML-based and iconic CORAS representations with a tabular notation which is widely used in industrial security standards (e.g., NIST 800-30, ISO 27001, SESAR SecRAM, UK HMG IS1).
Yildiz and Böhme [50] recently conducted a controlled experiment with 85 participants to investigate the effects of risk visualization on managerial decision making in information security. This study showed that supplementing a textual description of a security decision problem with a graphical representation improves risk perception and participants' confidence in decisions, but does not contribute to the comprehension of the problem or the security investment decision. In our prior study [25] we also found that participants achieved better or equal comprehension of the described risk scenarios with tabular and UML-based notations.
There are a few significant differences and some similarities between this study and our previous works [23,25], which we summarize in Table 1. The main contribution of this work is studying how well tabular and graphical risk modeling notations support memorization of information about security risks. This development was suggested by a reviewer of our journal paper [23] with the goal of mitigating a possible bias in the comprehension task in favor of the tabular notation.

EXPERIMENTAL DESIGN
We define the goal of our study according to the Goal Question Metric (GQM) template by Basili [4]: we analyze risk model comprehensibility for the purpose of assessing tabular and graphical modeling notations with respect to the extraction and memorization of correct information about security risks from the viewpoint of the decision-maker in the context of MSc and BSc students from the Delft University of Technology. We define the following research questions for our study:

RQ1 Which representation (tabular vs. graphical) improves participants' effectiveness in extracting correct information about security risks?

RQ2 Which representation (tabular vs. graphical) improves participants' effectiveness in memorizing correct information about security risks?

Table 1 relates this study to our previous works. Compared to Labunets et al. [23], we used an application scenario and comprehension questionnaire similar to the second study reported there (both studies involved MSc students); we introduced the memorization task and added a UML-like risk modeling notation that combines textual labels with graphical representation. Compared to Labunets et al. [25] (ESEM '17), we used the same application scenario, risk modeling notations, and comprehension questionnaire; in addition to information extraction, we introduced the memorization task, and the presented experiments involved MSc students, while the earlier study was conducted with IT professionals.
Experimental Task. In the extraction and memorization parts we asked our participants to answer a set of questions about the information described in a risk model. Each set included six questions of different complexity levels. An example question: "Which threat events can be initiated by Cyber criminal to impact the asset "Confidentiality of customer data"? Please select all unique threat events that meet the conditions (one or more elements may be correct)." The sets were comparable with each other as each included one question per combination of complexity factors along Wood's theory of task complexity [49] (i.e., information cues, relationships, and judgment acts) as adopted in practice by Labunets et al. [23]. The complexity factor was used to allow comparability of tasks between experimental parts and to provide diversity of questions regarding the notation concepts to be understood. We did not aim to investigate the effect of the task complexity factor, as our experimental design provides too small a sample size for this purpose. Table 2 presents the two sets of comprehension questions that we provided to participants with graphical risk models. Questions for the tabular risk model are identical (except for the instantiation of the names of the elements to the textual risk modeling notation).
To test memorization performance and control participants' access to the artifacts, we had to provide participants with a picture of the assigned model and disable the possibility to save images via the context menu of the browser. Also, we provided participants with multiple-choice options for each question, consisting of a list of all unique elements present in the model. The list contained only the elements' names (sorted alphabetically) but not their types (e.g., threat or vulnerability), as this could introduce additional bias by reducing the role of the model in task execution. The reason behind this step is to reduce possible mistakes due to manual typing of responses in the memorization part and make it comparable with the extraction part. Our participants were provided with images of risk models and could not copy-paste information as was possible in Labunets et al. [23].
Research Hypotheses and Data Collection. From our GQM goal, we derived a set of null and alternative hypotheses (see Table 3). We did not formulate one-sided hypotheses as in Labunets et al. [23], since this study differs significantly from the previous works.
The independent variable of our study is the risk modeling notation (tabular, UML, and CORAS). The dependent variable is the comprehension level of participants, which we evaluated based on participants' responses to a set of comprehension questions. As participants had to answer questions with one or more options, to quantify the comprehension level we used information retrieval metrics, namely precision, recall, and their harmonic combination, the F-measure. Since our comprehension task included more than one question and we needed a single measure of participants' comprehension level, we aggregated all responses to calculate precision, recall, and F-measure at the level of the individual participant:

precision_{m,s} = (Σ_q |answer_{m,s,q} ∩ correct_q|) / (Σ_q |answer_{m,s,q}|),
recall_{m,s} = (Σ_q |answer_{m,s,q} ∩ correct_q|) / (Σ_q |correct_q|),
F_{m,s} = 2 · precision_{m,s} · recall_{m,s} / (precision_{m,s} + recall_{m,s}),

where answer_{m,s,q} is the set of answers given by participant s to question q when looking at model m, and correct_q is the set of correct responses to question q. Application Scenario. We kept the same scenario as in our prior works [23,25] to retain some comparability with our previous findings and mitigate possible threats to external validity. This scenario describes online banking services provided through a home banking portal, a mobile application, and prepaid cards. It was developed by our industrial partner, a large Italian corporation offering integrated services in finance and logistics. See Giacalone et al. [13] for more details on the company's internal SRA process.
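As a concrete sketch, the per-participant aggregation of precision, recall, and F-measure can be computed as follows (the question/answer data here are hypothetical illustrations, not taken from the study):

```python
def participant_scores(answers, correct):
    """Micro-averaged precision, recall and F-measure for one participant.

    answers: dict mapping question id -> set of options the participant selected
    correct: dict mapping question id -> set of correct options
    """
    tp = sum(len(answers[q] & correct[q]) for q in correct)   # true positives
    selected = sum(len(answers[q]) for q in correct)          # all selected options
    relevant = sum(len(correct[q]) for q in correct)          # all correct options
    precision = tp / selected if selected else 0.0
    recall = tp / relevant if relevant else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure

# Hypothetical responses of one participant to two questions
correct = {"q1": {"a", "b"}, "q2": {"c"}}
answers = {"q1": {"a"}, "q2": {"c", "d"}}
p, r, f = participant_scores(answers, correct)  # p = r = f = 2/3
```

Aggregating the counts before dividing (micro-averaging) keeps one score per participant rather than averaging per-question ratios, which matches the formulas above.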
Risk Modeling Notations. Our selection criteria are: 1) comparability, 2) representativeness of the studied notations, and 3) coverage of the core concepts used by the most common international security standards (e.g., ISO/IEC 27000 or NIST 800-30). Thus, we selected CORAS [29] as the most comprehensive graphical notation. This notation provides adequate coverage of central SRA concepts like asset, threat, vulnerability, risk, and security control [11,32]. Other possible candidates were ISSRM [32], Secure Tropos [35], and si* [14]. A special feature of CORAS is the treatment overview diagram summarizing the SRA results. It is equivalent to the summary tables in NIST's or ISO's standards.
As a tabular notation, we chose the table template for adversarial and non-adversarial risk from the NIST 800-30 standard [43]. To show all the relevant information, we also consolidated the impact, asset, and security control concepts (usually present in separate NIST tables) into a single table.
For the mixed representation we used a UML-like modeling notation which replaces the iconic elements in a CORAS diagram with textual labels stating the element types. Related work [25] suggests that the availability of textual labels can help participants understand risk models better, and such labels were preferred [15] over the purely graphical representation.
Figures 3a-3c in the appendix provide examples of fragments from CORAS and UML treatment diagrams, and NIST tables related to the risk of an HCN scenario that we used in the previous study.
Design. This experiment has a between-subject design where each participant completed a comprehension task using one of the three notations. Table 2 presents the two sets of comprehension questions provided to participants of the study with graphical risk models (i.e., CORAS and UML). Questions for the tabular model were identical up to renaming of the elements. Note: C = complexity level, IC = # of information cues, R = # of relationships, A = # of judgment acts.
Set 1:
Q1. What are the consequences that can be caused for the asset "Availability of service"? Please select the consequences that meet the conditions (one or more elements may be correct).
Q2 (C=3=2+1+0). Which assets can be impacted by Hacker or System failure? Please select all unique assets that meet the conditions (one or more elements may be correct).
Q3 (C=4=2+2+0). Which treatments can be used to mitigate attack paths which exploit any of the vulnerabilities "Poor security awareness" or "Lack of mechanisms for authentication of app"? Please select all unique treatments for all attack paths caused by any of the specified vulnerabilities (one or more elements may be correct).
Q4. What is the lowest consequence that can be caused for the asset "User authenticity"? Please select the consequence that meets the conditions (one or more elements may be correct).
Q5 (C=3=1+1+1). Which unwanted incidents can be caused by Hacker with "likely" or higher likelihood? Please select all unwanted incidents that meet the conditions (one or more elements may be correct).
Q6 (C=4=2+1+1). What is the lowest consequence of the unwanted incidents that can be caused by Hacker and mitigated by treatment "Regularly inform customers of security best practices"? Please specify the lowest consequence that meets the conditions (one or more elements may be correct).
Set 2:
Q1 (C=2=1+1+0). Which vulnerabilities can lead to the unwanted incident "Unauthorized transaction via Poste App"? Please select all vulnerabilities that meet the conditions (one or more elements may be correct).
Q2 (C=3=2+1+0). Which unwanted incidents can be caused by Cyber criminal with "severe" consequence? Please select all unwanted incidents that meet the conditions (one or more elements may be correct).
Q3 (C=4=2+2+0). Which threat scenarios can be initiated by Cyber criminal to impact the asset "Confidentiality of customer data"? Please select all unique threat scenarios that meet the conditions (one or more elements may be correct).
Q4 (C=3=1+1+1). Which threats can cause an unwanted incident with "severe" or higher consequence? Please select all threats that meet the conditions (one or more elements may be correct).
Q5. What is the lowest likelihood of the unwanted incidents that can be caused by any of the vulnerabilities "Use of web application" or "Poor security awareness"? Please select the lowest likelihood of the unwanted incidents that can be initiated using any of the specified vulnerabilities (one or more elements may be correct).
Q6 (C=5=2+2+1). Which vulnerabilities can be exploited by Hacker to cause unwanted incidents with "likely" or higher likelihood? Please select all vulnerabilities that meet the conditions (one or more elements may be correct).

H1a: There is a difference between notations in the level of comprehension when answering comprehension questions with an available risk model (extraction task).

H2: No difference between notations in the level of comprehension when answering comprehension questions without having a risk model (memorization task).
H2a: There is a difference between notations in the level of comprehension when answering comprehension questions without having a risk model (memorization task).

Table 4 summarizes the experimental design of our study. Participants were randomly distributed between the three types of treatments and the two sets of questions and worked individually. We chose this design for two reasons: 1) to eliminate a possible learning effect between treatments and 2) to control for a possible effect of different sets of questions. We also limited the time a single participant could spend on each experimental part to 20 minutes, as we used level of comprehension as the performance metric [5]. Participation was anonymous and voluntary, without any reward. Participants could withdraw from the experiment at any moment before completion.
Experimental Protocol. We used a three-phase protocol [30]:
• Training: Participants answered a brief individual demographics and background questionnaire and watched a 5-minute video tutorial on the appointed modeling notation and the Online Banking application scenario.
• Application: The participants had to complete two parts:
- Part 1 was an extraction task where the participants had to review the appointed risk model and answer six comprehension questions. The order of questions was the same for all participants due to limitations of the survey platform. Participants had 20 minutes to complete the task, after which they were automatically advanced to the next page. An image of the corresponding risk model was built in at the top of the task page and protected from downloading or opening in another browser tab. The tutorial on the notation and scenario was provided at the beginning of the task and can be downloaded 1 . After finishing the task, participants filled in a post-task questionnaire.
- Part 2 was a memorization task where participants first needed to memorize the same model as in part 1.
After 5 minutes they were automatically forwarded to the comprehension task, but they no longer had access to the risk model and had to answer another six questions similar to part 1. The rest of the task was the same. Data Analysis Procedure. To validate our null hypotheses we could use the ANOVA test, as we compare three treatments. However, the ANOVA test makes assumptions regarding the normality of the distribution (checked by the Shapiro-Wilk test) and the homogeneity of variance (checked by Levene's test) of our samples. Since in our case the samples do not meet these requirements, we use the Kruskal-Wallis (KW) test and post-hoc Mann-Whitney (MW) tests (corrected for multiple tests with the Bonferroni method). We adopt 5% as the threshold for α (i.e., the probability of committing a Type-I error).
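The KW-plus-post-hoc pipeline described above can be sketched as follows (the per-participant F-measures are hypothetical placeholders, not the study's data):

```python
from itertools import combinations

from scipy import stats

def compare_notations(groups):
    """Kruskal-Wallis across all groups, then post-hoc pairwise
    Mann-Whitney tests with Bonferroni correction."""
    _, kw_p = stats.kruskal(*groups.values())
    pairs = list(combinations(groups, 2))
    posthoc = {}
    for a, b in pairs:
        _, p = stats.mannwhitneyu(groups[a], groups[b], alternative="two-sided")
        posthoc[(a, b)] = min(1.0, p * len(pairs))  # Bonferroni-adjusted p-value
    return kw_p, posthoc

# Hypothetical per-participant F-measures for the three notations
groups = {
    "Tabular": [0.70, 0.80, 0.60, 0.90, 0.75],
    "CORAS":   [0.72, 0.85, 0.65, 0.90, 0.70],
    "UML":     [0.50, 0.60, 0.55, 0.65, 0.60],
}
kw_p, posthoc = compare_notations(groups)
```

Multiplying each pairwise p-value by the number of pairs (capped at 1) is the Bonferroni adjustment; with three notations there are three pairwise comparisons.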
In case we fail to observe a statistically significant difference between treatments, we can test their equivalence with TOST, which was initially proposed by Schuirmann for testing the equivalence of generic and branded drugs [41]. The equivalence test can be formulated as two one-sided null hypotheses:

H01: µA − µB ≤ −δ and H02: µA − µB ≥ +δ,

where µA and µB are the means of methods A and B, and δ corresponds to the range within which we consider the two methods to be equivalent. Equivalence is concluded if both null hypotheses are rejected; the p-value is then the maximum among the p-values of the two tests. The underlying test for each of the two hypotheses can be any difference test (e.g., t-test, Wilcoxon, etc.) as appropriate.
The FDA [12] recommends using δ = [80%; 125%]. On our bounded scale a percentage range could warrant statistical equivalence too easily when the mean value is close to the upper bound. Thus, we conservatively adopted δ = σ/2 = ±0.12, which was empirically derived by Labunets et al. [25] from related studies.
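A minimal sketch of TOST with an underlying Mann-Whitney test uses the shift-by-δ formulation (the two samples below are hypothetical F-measure groups, not the study's data):

```python
from scipy import stats

def tost_mannwhitney(a, b, delta):
    """Two one-sided tests (TOST) for equivalence of samples a and b
    within +/- delta, using one-sided Mann-Whitney tests.

    H01: mu_a - mu_b <= -delta  (tested by shifting a up by delta)
    H02: mu_a - mu_b >= +delta  (tested by shifting a down by delta)
    The equivalence p-value is the maximum of the two one-sided p-values.
    """
    _, p1 = stats.mannwhitneyu([x + delta for x in a], b, alternative="greater")
    _, p2 = stats.mannwhitneyu([x - delta for x in a], b, alternative="less")
    return max(p1, p2)

# Hypothetical F-measures of two notation groups, delta = 0.12
p_equiv = tost_mannwhitney(
    [0.70, 0.75, 0.80, 0.72, 0.78],
    [0.71, 0.74, 0.79, 0.73, 0.77],
    delta=0.12,
)  # a small p-value means the groups are equivalent within +/- 0.12
```

A small TOST p-value rejects both one-sided nulls, i.e., the true difference lies inside (−δ, +δ).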
To control for the effect of co-factors (e.g., working experience or level of English) on the actual comprehension in the form of the F-measure, we use a permutation test for two-way ANOVA, which is a suitable approach in case of violation of ANOVA's assumptions [20] (e.g., data of an ordinal type). The post-task questionnaire is used to control for the effect of the experimental settings and the documentation materials.
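One simple way to implement a permutation test for the two-way ANOVA interaction is to permute the responses and recompute the interaction sum of squares (this unrestricted scheme is a sketch; permuting residuals is a common alternative, and the data below are hypothetical):

```python
import numpy as np

def perm_interaction_test(y, fa, fb, n_perm=999, seed=0):
    """Permutation p-value for the A x B interaction effect."""
    rng = np.random.default_rng(seed)
    y, fa, fb = np.asarray(y, float), np.asarray(fa), np.asarray(fb)

    def interaction_ss(vals):
        grand = vals.mean()
        ss = 0.0
        for a in np.unique(fa):
            for b in np.unique(fb):
                cell = vals[(fa == a) & (fb == b)]
                if cell.size:
                    # deviation of the cell mean from the additive (no-interaction) model
                    expected = vals[fa == a].mean() + vals[fb == b].mean() - grand
                    ss += cell.size * (cell.mean() - expected) ** 2
        return ss

    observed = interaction_ss(y)
    exceed = sum(interaction_ss(rng.permutation(y)) >= observed
                 for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)

# Hypothetical data with a strong interaction between factors A and B
fa = [0] * 20 + [1] * 20
fb = ([0] * 10 + [1] * 10) * 2
y = [5.0 if a == 1 and b == 1 else 0.0 for a, b in zip(fa, fb)]
p_int = perm_interaction_test(y, fa, fb)  # small p: interaction detected
```

The p-value counts how often randomly relabeled responses produce an interaction at least as large as the observed one, so no normality or homogeneity assumptions are needed.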

STUDY EXECUTION
The initial implementation of the experimental setup was tested in a pilot with several PhD students and faculty members at the Delft University of Technology (TU Delft). The initial experiment took place on September 14, 2017, at one of the lectures of the Cyber Risk Management course taught by a colleague of the author to MSc students at TU Delft. The replication of the experiment with BSc students occurred on September 18, 2017, as part of a lecture in the Security & Organisation course. We collected 60 complete responses in the first experiment (20 with the CORAS model, 20 with UML, and 20 with Tabular) and 31 in the second experiment (11 with CORAS, 9 with UML, and 11 with Tabular). Table 5 presents the demographic and background information about participants in both experiments. Overall, our participants reported basic knowledge of requirements engineering, graphical modeling languages, and security, and limited knowledge of risk assessment. Regarding the application scenario, they had basic knowledge of the online banking domain.

RESULTS
We begin with the analysis of the different experimental factors, such as the differences between experiments and between sets of questions. We report separately the factors with a statistically significant difference, while factors without statistically significant differences were aggregated and analyzed together.
Experiment. The permutation test for two-way ANOVA did not reveal any statistically significant interaction of experiment with modeling notation (p=1) nor any effect on the F-measure of participants' responses (p=0.80). Therefore, we analyze the data collected in the two experiments together. Experimental Task. We did not observe any statistically significant effect of the two sets of questions (Set 1 and 2) on the F-measure (permutation test for two-way ANOVA p=0.10 in the extraction and p=0.60 in the memorization part) nor any interaction with modeling notation (p=0.31 in the extraction and p=0.79 in the memorization part). Thus, we analyze the results from the two sets of questions together. In this way, we eliminate a possible effect of task order on the results. Figure 1 compares the distribution of precision and recall of participants' responses in the extraction (left) and memorization parts (right). If we take median precision and recall as a quality threshold for the level of comprehension, then we can see that 14/31 and 15/31 participants who used the tabular and CORAS risk models, respectively, managed to reach the top right corner of the plot. In the case of UML, more participants appeared in the bottom left corner.
RQ1: Information Extraction. Table 6 presents descriptive statistics for precision and recall of responses to the extraction task. We can see a difference in favor of the CORAS model over the UML model (4% for precision and 16% for recall) and in favor of the Tabular model over the UML model (3% for precision and 14% for recall). The difference between the CORAS and Tabular models is less than 1.5% for both precision and recall.
The results of the KW test did not reveal any statistically significant difference in precision and recall between the three modeling notations (KW p > 0.17). We further investigated whether there is a statistically significant equivalence between pairs of modeling notations using TOST with the MW test with δ = ±0.12.
RQ2: Information Memorization. Table 8 presents descriptive statistics for precision and recall of responses to the memorization task. We observed that both graphical modeling notations demonstrated 10-12% better response precision in the memorization task than the tabular notation. The difference in recall is smaller (7%) compared to precision, but still in favor of the graphical notations.
The results of the KW test did not reveal any statistically significant difference in precision and recall between the three modeling notations (KW p > 0.57). Therefore, we investigated whether there is a statistically significant equivalence between pairs of modeling notations using TOST with the MW test. Levene's test confirmed that the samples have equal variances (Levene's p > 0.21), thus the MW assumption holds for our samples. Table 9 summarizes the findings of the statistical tests for the memorization part. Regarding precision, we found that the CORAS and UML models are equivalent with statistical significance, but for the tabular and graphical models we obtained inconclusive results. With respect to recall, we obtained similar results. There is an equivalence between the CORAS and UML models which was found statistically significant, while the tabular and graphical models are equivalent at the 10% significance level only, because the TOST test returned p-values of 0.029 and 0.041. Thus, we can reject the alternative hypothesis H2a only for the pair of graphical models, but not for the pairs of the tabular model and either of the two graphical models.
Post-task Questionnaire. We asked our participants to evaluate different aspects of the study execution via the post-task questionnaire after each experimental part. Several questions differed between the extraction and memorization parts. Table 10 presents descriptive statistics of participants' feedback. Responses are on a five-point Likert scale from 1 (strongly disagree) to 5 (strongly agree).
Overall, the participants found the time to complete the task to be reasonable (question Q2) in both parts. The objectives of our study (Q3), the task (Q4), and the comprehension questions (Q5) were clear enough. Also, the participants did not struggle with understanding the risk models (Q8) or with using the electronic version of the tabular and graphical models (Q9). They had a positive experience using the survey platform (Q10). Only participants who used the tabular model reported 0.5 points lower responses regarding their experience with the Qualtrics platform. The difference is likely caused by the fact that the tabular model was available in the form of a picture rather than a searchable document, which is not a problem for graphical models. In the extraction part, participants reported no significant difficulties in answering the comprehension questions (Q6 in Table 10a), but in the memorization part, the same questions were more challenging for the participants (Q6 in Table 10b). Also, participants were not sure whether they had enough time to memorize the risk model (Q1) and reported problems with model memorization (Q7). The higher cognitive load of the memorization task compared to the extraction part can explain this.
Co-factor Analysis. We used the permutation test for two-way ANOVA to investigate the possible interaction between the independent and dependent variables and several co-factors: participants' level of English, working experience, and the level of participants' knowledge of security engineering, risk assessment, requirements engineering, graphical modeling languages, and online banking. There was no statistically significant interaction between risk modeling notation, the dependent variables, and any co-factor except one case. The test revealed a statistically significant interaction of participants' knowledge of the online banking domain and modeling notation on the F-measure in the extraction part (p=0.0046). We checked this finding using the interaction plot presented in Figure 10a. As we had a small number of participants who reported their knowledge of online banking as "expert" (1 participant) or "proficient user" (4 participants), we merged these categories with the category "competent user". We can see that the participants with lower levels of knowledge demonstrated a better overall level of understanding across all models. At the same time, participants with a higher level of knowledge showed a worse comprehension level with the tabular and UML models, while the CORAS group demonstrated consistent results. A possible explanation could be the presence of the Dunning-Kruger effect [10]: less competent people tend to overrate themselves due to a lack of competence and illusions about the level of their skills, while more competent people are likely to undervalue their skill level as they think that others are more knowledgeable than themselves.

THREATS TO VALIDITY
Construct threats: Construct validity concerns whether the right metrics were used to investigate the comprehensibility of risk models. To mitigate this threat, we measured participants' level of comprehension using a questionnaire and evaluated the answers using information retrieval metrics (precision, recall, and F-measure) to avoid possible subjectivity in the assessment. These metrics are widely adopted in the empirical software engineering literature [2,16,40]. The comprehension questionnaire was designed following a systematic approach inspired by related works [16,37] and has been validated in our previous studies [23,25]. Another relevant threat is the influence of the experimenter, which we reduced by minimizing our involvement in the experimental process to a 10-minute presentation about the high-level goals of the experiment and its procedure. The rest of the experiment was implemented using the Qualtrics survey platform. Also, the decisions regarding the experimental design were discussed with colleagues and tested in a pilot study to limit possible experimenter bias.
Internal Threats: Learning effects and the order of task execution could threaten internal validity. We mitigated this by adopting a within-subject design with a random assignment of subjects to the groups. Participants were instructed to complete the task individually, without interacting with other participants. To mitigate learning between the extraction and memorization parts, we kept the same order of the parts for all participants. It was a feature of our experimental design to give participants enough time to get hands-on experience with the model and learn it not only during the 5 minutes before the memorization task but also during the completion of the extraction part.
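For concreteness, the per-answer scoring behind these information retrieval metrics can be sketched as below. Treating a comprehension response as a set of expected versus selected model elements is our illustrative assumption, not the exact scoring script used in the study.

```python
def prf(expected, given):
    """Precision, recall and F-measure for a single comprehension answer.

    expected: set of model elements in the gold-standard answer;
    given: set of elements the participant selected.
    """
    tp = len(expected & given)                        # correctly selected elements
    precision = tp / len(given) if given else 0.0     # share of selections that are correct
    recall = tp / len(expected) if expected else 0.0  # share of gold elements recovered
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)              # harmonic mean of the two
    return precision, recall, f
```

Averaging these per-answer scores over a questionnaire yields the per-participant scores analyzed in the experiment.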
External Threats: Using students as experimental subjects could potentially harm external validity, as they may not be representative enough of the practitioner population. However, Svahnberg et al. [47] suggested that students can perform well in empirical studies. Moreover, we tried to recruit participants who had basic knowledge of security and modeling languages. To make the experimental setting as realistic as possible, our scenario was developed by an industrial financial company.
Conclusion Threats: A possible conclusion validity threat is related to the data analysis. We used non-parametric tests as they do not require a normal distribution of the sample. To mitigate low statistical power, we adopted α = 0.05 for the difference test and δ = ±0.12 for the equivalence test, a margin empirically determined by Labunets et al. [25] from related works.
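To illustrate the equivalence-test logic, the two one-sided tests (TOST) procedure with the margin δ = ±0.12 can be sketched as follows. This large-sample normal approximation is our simplification for exposition, not the exact non-parametric procedure applied in the analysis.

```python
import math
from statistics import NormalDist, mean, stdev

def tost_equivalence(x, y, delta=0.12, alpha=0.05):
    """Two one-sided tests (TOST) for equivalence of two group means.

    Declares equivalence at level alpha if the mean difference is
    significantly above -delta AND significantly below +delta
    (large-sample normal approximation).
    """
    d = mean(x) - mean(y)
    se = math.sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    p_lower = 1 - NormalDist().cdf((d + delta) / se)  # H0: d <= -delta
    p_upper = NormalDist().cdf((d - delta) / se)      # H0: d >= +delta
    p = max(p_lower, p_upper)  # both one-sided tests must reject
    return p, p < alpha
```

Unlike a difference test, failing to reject here means the data do not support equivalence, which is why some notation pairs in our results are "not significantly different" yet also not significantly equivalent.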

DISCUSSION AND CONCLUSIONS
We summarize our findings as follows. RQ1: Which representation (tabular vs. graphical) improves participants' effectiveness in extracting correct information about security risks? The results revealed that the tabular model is equivalent to both graphical models with statistical significance for precision, but the equivalence between CORAS and UML is significant at the 10% level only. For recall, only the tabular and CORAS models have statistically significant equivalence, while the other pairs showed some difference, but not a statistically significant one. The UML notation showed lower recall compared to the tabular (16% difference) and CORAS models (14%). In contrast to our previous studies [23,25], the tabular representation did not show the best comprehension but performed at the level of the other representations. The changes in the experimental settings could explain this. First of all, in our study the participants were not able to search the provided models or filter the tabular model, features that were available in our prior works. Previously, these features were extensively used by the participants with the tabular (71% of participants) and CORAS models (70% of participants) (see responses to Q9 in the post-task questionnaire in [25, Table IX]).
We can also notice that the level of precision in the extraction task (0.79 for the CORAS group) is similar to the precision of participants who used graphical models (overall precision 0.80-0.82) in our two previous studies with students (see precision results in [23,25]). In this study, our participants achieved a slightly better recall with CORAS (0.79) in comparison to the studies mentioned above, where the participants who used the CORAS model had an overall recall equal to 0.73 and 0.68 in studies 1 and 2, respectively. We suspect that multiple-choice questions could have contributed to the better recall, as they may provide a handy way to check that all relevant elements are selected in response to a specific question. This phenomenon requires additional research to be confirmed.
RQ2: Which representation (tabular vs. graphical) improves participants' effectiveness in memorizing correct information about security risks? There is a small difference in the comprehensibility of the three modeling notations in favor of the graphical notations, but the tests did not confirm a statistically significant equivalence between the tabular and graphical models in precision, and only at the 10% significance level in recall. Only the two graphical models were found to be equivalent with statistical significance in both precision and recall. This suggests that the UML and iconic risk models are equally good at supporting memorization of correct information about security risks, while the tabular notation provides a less clear representation of relations and contains duplicated information, which affects participants' precision in responses.
Implications for research: This work contributes to the body of knowledge on model comprehensibility, specifically for security risk management. We studied the effectiveness of tabular and graphical risk models in the extraction and memorization of correct information about security risks. The results suggest that different types of comprehensibility tasks (extraction vs. retention) can expose different evaluation results [7].
Implications for practice: The main implication of our results for practitioners is the illustration of how well the studied notations perform in different settings. If you need to present results in a fixed format (e.g., a picture or a slide), then both tables and diagrams can provide a similar level of information extraction and retention. However, if a decision-maker can work with risk model documents (e.g., search the document) and does not need to remember all the information, then tables are the best and most straightforward choice [25].
Future research: These days more and more information is delivered in electronic format, so different scenarios of model communication and usage should be taken into account. For example, an experiment comparing the comprehensibility and usability of a model snapshot vs. a model file with search and sort/filter functions available would fill this gap. Also, factors like the level of a participant's confidence in the given responses and the perceived difficulty of the task might shed some light on the comprehensibility of risk modeling notations, as suggested by Aranda et al. [3].
ACKNOWLEDGMENTS
We would like to thank B. Solhaug and K. Stølen from SINTEF for support in the definition of the CORAS and UML models, and F. Paci from the University of Southampton and P. van Gelder, M. van Eeten, and W. Pieters from the Delft University of Technology for their critique of the experimental design and implementation.