Learning Analytics Perspective: Evidencing Learning from Digital Datasets in Cybersecurity Exercises

Cybersecurity exercises are gaining popularity in university curricula and professional training paths and are seen as an effective teaching method. Such exercises provide digital datasets that facilitate a learning analytics approach, for example using the traces that learners leave behind to improve the learning process and environment. While the existing literature contains various efforts to measure learning from digital datasets, a holistic learning analytics approach incorporated into cybersecurity exercises is still lacking. We propose a practical reference model for incorporating a learning analytics approach into the cybersecurity exercise life-cycle. To facilitate its application, we have performed an extensive review of existing academic research on applying learning analytics in the context of cybersecurity exercises. We specifically focus on the learning indicators used to measure empirical impact and training effectiveness that could indicate achievement of defined learning outcomes. This reference model and overview of existing learning analytics use cases and learning metrics in various types of exercises can help educators, organisers and cyber range developers. The result is more adaptive exercise design and measurement using evidence-based data, connecting digital learning traces to skills and competencies.


Introduction
Effective learning, teaching, and skills improvement for cybersecurity students and professionals is a critical research area. As there is a high demand for skilled professionals and a shortage of such individuals [1], the development of scalable and effective teaching methods is critical. We focus on the application of learning analytics in cybersecurity training, specifically in cybersecurity exercises, as a way to provide a more evidence-based and systematic approach to the evaluation of learning impact and to enable the design of more effective learning. This is a critical aspect for educators, organisers and cyber range developers to consider.
Learning analytics (LA) is defined as "the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs" [2]. As a field of research, LA aims to predict and advise on identifying students' learning needs and to improve pedagogical strategies based on analytical approaches [2]. However, establishing plausible relationships between models derived from quantifiable digital data and the complex socio-cognitive world of "learning" is challenging [3]. LA is closely intertwined with educational data mining (EDM), which develops, researches, and applies computerized methods to detect patterns in large educational data sets [4].
Cybersecurity training teaches both technical and soft skills, as the field involves technology, people, information, and processes. A wide range of training methods has been developed by universities [5] and organisations to provide cybersecurity education. As part of such cybersecurity training, hands-on exercises (both online and classroom-based) are gaining popularity in university curricula and professional training paths. Cybersecurity exercises (CSXs) are viewed as an effective and engaging way of teaching both technical and soft skills. In addition to CSXs for learning purposes (e.g., as part of university courses, competitions across universities, etc.), most national and international CSXs (47%) also focus on training and providing participants an opportunity to gain knowledge, understanding and skills [6]. CSXs can vary significantly in scale and content, ranging from short online or classroom exercises and Capture the Flags (CTFs) to large-scale, multi-stakeholder exercises. However, most share common aims and challenges with respect to learning. We view a CSX as a learning or training event in which individuals or teams implement, manage and defend/attack a network of computers at a tactical or strategic level.
As these exercises always leave an extensive digital footprint of the learning process, they are an ideal base for developing methods within the learning analytics field itself. Using such evidence-based learning traces in learning design can improve the experience for both students and specialists. It also helps to investigate the validity of common yet unsubstantiated claims, such as "everyone feels they had learned important lessons" [7] or "exercises are a very effective way of learning the practical aspects of information security" [8].
We propose a reference model for LA in CSXs in Section 2, offering exercise organisers a practical guide to enhance the conceptualisation and integration of learning analytics into the exercise life-cycle. We support the proposals presented in the model with an extensive overview of existing uses of learning analytics. When implementing LA measurements in exercises, we need to understand (1) what metrics evidence learning and (2) whether they help learners to learn or teachers to teach. The metrics (i.e., indicators of learning success) collected and analysed provide technical data (e.g., time, command-line activity, tools used, etc.), and there appears to be an overemphasis on what we can measure, instead of measuring what we value, a longstanding concern in educational assessment [9]. However, applying a learning analytics approach and analysing metrics from digital datasets can provide more detailed and evidence-based input to comprehensive learning evaluations, such as the Kirkpatrick model or other chosen evaluation models [10].

Reference Model for LA in CSXs
Learning analytics should be incorporated into the CSX identifying, planning, conducting and evaluating phases (as described by [11]) and be seen as an integral part of exercise design, in line with the overall pedagogical approach selected [12]. When starting to implement an LA approach in an exercise, it is useful to think about the LA process from the aspects of What (Data, Environments, Context), Why (Objectives), Who (Stakeholders) and How (Methods) [13].
We propose a practical tool: the CSX LA reference model (Fig. 1), which builds upon [13] and [11]. Our contribution is to combine and outline the key learning analytics considerations to incorporate into the CSX life-cycle, and to support the model with an extensive overview of existing use cases for practical implementation. Cyber range developers need to consider LA aspects in their initial designs, as they build the technological foundation for instrumenting exercises.
Asking these learning analytics related questions and finding the answers during an exercise life-cycle ensures that learning measurements are not simply an afterthought but are incorporated in the "identifying phase" (Fig. 1). Considering questions such as "What data can we collect that will help learners to learn?" and collecting only the relevant dataset would help with the challenges of storing huge datasets from an exercise and later trying to determine what data could be used to provide feedback. For example, if the learning objective of an exercise is to improve the incident response process, then timestamps that indicate team communication are critical data to collect [14]. However, when proficiency in using various forensics tools and command lines is exercised, then capturing bash history or keystrokes is relevant (e.g., [15], [16]). Consideration should also be given to how to support instructors in giving feedback. Automated feedback that takes into account users' behaviour and predicts their actions and questions is becoming available and can make the learning experience more individualised and effective (e.g., [17], [18]). Depending on the purpose, scale and type of a CSX, it may be advisable to include a learning specialist in the organising team to coordinate the LA implementation throughout the CSX life-cycle.
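To make this design step concrete, the mapping from learning objectives to the telemetry worth collecting can be written down explicitly during the identifying phase. The sketch below is a minimal illustration; the objective names and data sources are hypothetical, not taken from any specific cyber range:

```python
# Hypothetical mapping of learning objectives to the telemetry needed to
# evidence them; collecting only these sources avoids storing huge,
# unusable datasets (objective and source names are illustrative).
OBJECTIVE_TELEMETRY = {
    "incident_response": {"team_chat_timestamps", "incident_report_events"},
    "forensics_tooling": {"bash_history", "keystrokes"},
}

def plan_collection(objectives):
    """Return the union of data sources required by the chosen objectives."""
    sources = set()
    for objective in objectives:
        sources |= OBJECTIVE_TELEMETRY.get(objective, set())
    return sources
```

An exercise focused purely on forensics tooling would then collect only bash history and keystrokes, rather than every log the platform can produce.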

Supporting the Reference Model with Existing Research Results and Practical Considerations for Implementing an LA Approach in CSXs
To gain insight into what empirical evidence and metrics have been collected and analysed in the scientific literature, we conducted an extensive review of related work. We searched Google Scholar, a widely used and available scientific database, and limited the search to empirical studies published in peer-reviewed journals and conferences in English. We used the keyword "learning" in combination with "cyber(-)security" for different exercise types. For the exercise type classification, we followed the European Union Agency for Network and Information Security (ENISA) taxonomy: Capture the flag, Discussion based game, Drill, Red team / Blue team, Seminar, Simulation, Table-top and Workshop [6]. In some cases, exercises with gamified elements are referred to as "serious games", or with competitive elements as "competitions"; we thus included these terms in our search strings for completeness.
We reviewed the abstracts of 200 articles for each exercise type, as this was considered sufficient to encompass all relevant material. The academic papers identified as covering CSXs for learning purposes were manually reviewed to identify LA topics and all empirical/analytical learning data collected or analysed. It should be noted that even though a large number of articles describe CSXs, these do not necessarily include empirical data from digital datasets to evidence learning. We used feature mapping [66], in which the content is analysed and recorded in a standardized format documenting the key features of predetermined aspects (i.e., the LA model [2]) to produce a summary of the topic. Related work was mapped to an overview matrix in Table 1 by the ENISA exercise types [6] and the LA model [2], consisting of collection, storage, data cleaning, integration, analysis, representation and visualization, and action.

Uses of Digital Datasets to Evidence Learning Effectiveness in CSXs
An overview matrix that shows the application of learning analytics by exercise types and LA process steps is presented in Table 1.

3.1.1. Capture the Flag. A CTF is typically a challenge designed to help sharpen cybersecurity skills and provide hands-on learning, taking various styles such as jeopardy, attack-defence and a mix of the two. However, participating in CTFs does not necessarily ensure future success, and participants rarely receive a detailed critique of their performance, which is essential in learning [67].
There are few studies that provide empirical evidence from digital game-play traces for learning and skill acquisition in CTFs. Chothia and Novakovic [19] show that Jeopardy-style challenges with automatic marking of flag submissions, complemented by manual marking of detailed written answers, provided students with instant feedback during an exercise and improved student satisfaction with the academic course. The ability of students to acquire the flag is highly correlated with their overall marks (written assignment), and flag-based marking effectively assesses a student's basic skills and understanding of cybersecurity topics [19]. However, acquiring flags is much less correlated with a student's deeper understanding of the underlying issues [19]. Cheung et al. [20] describe combining results from log servers with a key-logger to track participant sessions. The authors gathered statistics about login times (e.g., the environment was most often used on Saturdays between 3:30pm and 8pm) and command usage to see which commands students were still having trouble with after the lectures, which helped determine what to spend more time on [20]. Chapman et al. [21] evaluate PicoCTF based on survey responses and user interaction logs to explore the effectiveness of design choices (e.g., younger students prefer a game interface compared to older students, and there is a general dislike for challenges requiring learning new tools). The system kept track of answers submitted by every team, both correct and incorrect, recording the time, content and relevant problem identifier, as well as the IP address of the submission [21]. In order to evaluate student engagement, the authors determined the periods of time during which teams were most active (i.e., the time intervals between submissions) [21]. Tseng et al. [22] also focused on data collection from heterogeneous environments and proposed an ontology to represent concepts within exercises (consisting of team profiles, skills (tools), test items and environment) and their relationships (including linking heterogeneous logs to participants' intentions). The authors focused on problem-solving behaviours, applied a modified Apriori algorithm, analysed frequent item-sets and identified learning behaviours, such as novices repeatedly guessing keys while better-performing students focused on particular items [22]. The above papers describe the metrics applied and the patterns and relations the authors selected from the digital learning datasets. [23] describe possible metrics, such as teamwork, challenge difficulty, challenge strategies, tools, and general problem-solving techniques, whereas [27] and [26] propose, apply, and experimentally evaluate data analysis and machine learning techniques to extract interactions from the in-game data and provide learners who progress differently with individualized feedback. Vykopal et al. [25] suggest decomposing the exercise training activity into individual levels to achieve specific learning objectives, and collecting timestamps (or events) such as the start and end of the game, the start and end of each level, submissions of incorrect flags and their content, hints used, skipping a level, displaying a level's solution, and the game ID [24].
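The frequent item-set mining used by Tseng et al. [22] can be illustrated with a simplified sketch: count which command combinations co-occur in many sessions. A real Apriori implementation prunes candidates level by level; the code below brute-forces small itemsets, and the session data in the usage example is invented:

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(sessions, min_support=2, max_size=2):
    """Count command combinations that appear in at least min_support
    sessions -- the candidate-counting core of Apriori, simplified for
    illustration (no level-wise pruning)."""
    counts = Counter()
    for session in sessions:
        commands = sorted(set(session))  # de-duplicate within a session
        for size in range(1, max_size + 1):
            for combo in combinations(commands, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}
```

Applied to invented sessions such as `[["nmap", "ssh"], ["nmap", "ssh", "cat"], ["nmap"]]`, the pair `("nmap", "ssh")` would surface as a frequent behaviour while one-off commands drop out.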
Dark and Mirkovic [61] point out that to measure learning we may need to rely on proxy indicators (e.g., identifying reasonable and observable indicators of adversarial thinking). Overall, analytical approaches to evaluating learning from digital datasets are emerging, and these are good starting points when incorporating the learning analytics process into exercises.
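As one concrete example of such a starting point, the engagement analysis of Chapman et al. [21], determining the periods during which teams were most active from submission timestamps, can be approximated as follows. The 30-minute gap threshold is an assumption for illustration, not a value from the paper:

```python
from datetime import datetime, timedelta

def active_periods(submission_times, max_gap=timedelta(minutes=30)):
    """Group a team's flag-submission timestamps into activity intervals:
    a new interval starts whenever the gap between consecutive submissions
    exceeds max_gap (the threshold is an illustrative assumption)."""
    periods = []
    for t in sorted(submission_times):
        if periods and t - periods[-1][1] <= max_gap:
            periods[-1][1] = t          # extend the current interval
        else:
            periods.append([t, t])      # start a new interval
    return [(start, end) for start, end in periods]
```

Summing the lengths of the resulting intervals gives a per-team estimate of active time, a more defensible engagement proxy than total wall-clock time between first and last submission.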

3.1.2. Red Team and Blue Team. Red-Blue exercises are often team-based and therefore add another level of complexity to LA application: measuring team learning versus individual learning. Several authors discuss learning impact, but without evidence-based analysis of digital footprints (e.g., [68] describes organizing a team-based exercise where teams were directly monitored and the evaluation of skills improvement is observational). Typical evaluation methods are scoreboards, verbal feedback and after-action reports highlighting conclusions from manual analysis of exercise data [34]; these do not apply an analytical approach using digital datasets. Some papers do describe data collection and perform initial LA on digital datasets, e.g., [14] breaks time into intervals that can be meaningful for different learning objectives (e.g., incident response, team communication), with a walkthrough. Ošlejšek et al. [33] show that visual analytics tools could provide automated statistical analysis and in-depth insight into the learner's behaviour using observation software (Fowler's analysis) [32]. Vykopal et al. [34] describe an interactive timeline visualisation allowing learners to explore a scoring timeline and details about individual events.

3.1.3. Simulation. Simulations are a very common type of exercise and take various forms (e.g., online vs. live, gamified, etc.). There are some emerging examples of LA being performed already. Thompson and Irvine [35] present basic LA data, such as time played, and discuss abstraction layers for data analysis. No formal effectiveness assessment was performed, and conclusions are based on ad-hoc review of student interaction and logs [35]. Legato and Mazza [36] assume a set of regeneration points that correspond to skill achievement through learning. However, the model was not validated and numerical results are reported for illustrative purposes only.
The system of Nicholson et al. [37] uses dynamic tailoring, which maintains a model of student proficiency and adapts training difficulty while providing detailed feedback. Santos et al. [38] perform post-simulation analysis using a variety of graphics and reports to verify the network traffic, which teams were attacked, which services are still vulnerable, teams' activity rates and strategies used, etc. This data enables statistical evaluation of what happened during the simulation, including how teams performed compared to previous exercises [38]. Furfaro et al. [39] use a cloud-based learning system including a dashboard (for managing scenarios, agents and VMs, displaying system usage and statistics, etc.); a report tool (providing statistical data from the logging engine, and queries for business intelligence analysis displayed on charts); and a set of development tools.
Many simulations have a gamification element. Tioh et al. [69] performed a literature review (18 papers that included some kind of empirical effect measurement) and concluded that "the question as to the effectiveness of serious games dealing with the training of cyber security is a difficult one to answer conclusively at this point". We identified additional papers describing games for security specialists or students (e.g., [69], [40], [41]). However, none used empirical learning analysis from digital datasets.
Several authors focus on cognitive levels of the learning process in articles covering simulation-type exercises, as simulations allow experimentation. Some examples include: a simulation-based approach to understanding cybersecurity threats in which, when attempting multiple actions, the user is provided with an "awareness" measure [70]; a socio-technical systems approach to support the emerging role of systems thinking, using an agent-based simulation tool to change the students' thinking [71], [72]; and a computational model based on Instance-Based Learning Theory that proposes a way to analyse a cyber analyst's awareness at both the threat level and the attack scenario level [43]. Such analysis of human dynamic decision making can help to determine player models at both individual and aggregated levels [73].

3.1.4. Table-top. A table-top exercise is typically a meeting to discuss a fictional cyber emergency situation, increasing participants' engagement and strengthening their awareness and competences in strategic decision-making [74]. Although digital traces are typically limited in table-tops, some research on system architectures for tracking the learning process is emerging. Brilingaite et al. [45] present a model of a web-based environment that enables playing table-top exercises both in person and remotely. The environment includes a visual representation of decision-making during the game and provides a comparison to the correct solution. However, in many cases the data analysis is not presented, offering an experience without learning-process or empirical impact data. Cozine et al. [75] examine the pedagogical approach of incorporating game play, specifically probability-based table-top exercises, into course curricula and collected survey data from students enrolled in the courses. Ottis [46] presents a lightweight table-top exercise format that has been successfully used in cybersecurity education to demonstrate these and many other concepts to master's level students.
3.1.5. Drill, Seminar, Discussion Based Game and Workshop. The research builds upon experience but lacks evidence-based measures, or uses mainly surveys/self-assessments as a tool to analyse learning. Some research results are starting to emerge, such as [47], which describes a workshop using clickers to run a series of questions that allow easy data collection and analysis.
3.1.6. Exercises with no clear classification. In several papers, CSXs are described in a generic manner as an "exercise" (often in a classroom environment and part of a course) or include multiple exercise types (e.g., a platform allowing both Red-Blue team and individual simulation games). However, several empirical learning data analyses have been completed. Weiss et al. [15] argue that simply recording the number of correct answers is inferior to in-depth assessment and explore the use of command-line history and visualization. The authors follow the "path" taken by a student on the command line when completing different tasks and levels (for skill-level measurement, some commands were identified as significant) [15]. The "path" is visualized in a graph that can be decomposed into chains and cycles [15]. Similarly, Labushange et al. [49] assess technical skill level based on indexed similarity (i.e., participants were ranked based on the commands used to achieve objectives) and classify actions that can be automatically deduced using the clustering of commands (e.g., the combination of "ifconfig", "sudo apt-get install nmap" and "sudo nmap -sT targetIP" together was classified as reconnaissance). However, the paper does not go into the details of how such clustering can be achieved [49]. Caliskan et al. [50] use educational data mining and machine learning and identify metrics for learning effectiveness (predicting the final grade) in a university classroom course, with efforts to validate their predictive model. Caliskan et al. also compare participant evaluation metrics and scoring systems in [76]. Moore et al. [55] focus on the development of specific individual skill levels and state that competence in progressively harder levels of capabilities was observed over time in relation to the training components.
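The clustering of commands into higher-level actions described (but not detailed) in [49] could be approximated with simple rules; the rule table below is an illustrative assumption, not the paper's method:

```python
def classify_actions(commands):
    """Rule-based sketch of mapping observed shell commands to high-level
    actions, in the spirit of [49]; the keyword rules are illustrative
    assumptions, not taken from the paper."""
    RULES = {
        "reconnaissance": ("ifconfig", "nmap"),
        "privilege_escalation": ("sudo su", "chmod u+s"),
    }
    joined = " ".join(commands)
    return sorted(action for action, keywords in RULES.items()
                  if all(k in joined for k in keywords))
```

Feeding in the example session from [49] (`ifconfig`, `sudo apt-get install nmap`, `sudo nmap -sT targetIP`) would label it as reconnaissance; a real system would need clustering or sequence models rather than hand-written keyword rules.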
[52] apply an automated mechanism for parsing log entries into blocks of time during which participants are focused on specific high-level objectives, with instrumentation capturing students' computer-based transactions [53]. Several authors propose evaluation metrics or data that should be collected; however, the actual learning data analysis is so far lacking. For example, [54] suggest metrics such as time, the number of participants who succeed, and feedback, and [8] suggests the number of detected attacks out of total attacks for the learning task of monitoring systems' security, etc.

Analysis of LA Process Described in CSXs
The explicit use of learning analytics and the relevant vocabulary in cybersecurity education (incl. CSXs) is in its early stages. However, there has recently been more focus in the academic community on applying LA methods to improve cybersecurity education [77]. The discussion below is organized by the LA process to analyse the main steps in a CSX's life-cycle. The process steps span the life-cycle and are iterative; however, the whole process needs to be designed in the planning phase and instrumented into the cyber ranges.

Collection and Acquisition.
Existing research focuses on data collection, i.e., how to build an exercise platform, cyber range, etc. However, the literature gives little consideration to the purpose of collection and which data is actually collected, and discussion of learner consent and ethical aspects is scarce. As the tendency is to collect simple technical measures rather than more complex cognitive learning measures (Section 4), there is often no clear indication of whether data was collected to evidence learning, or which data is relevant for evidencing learning.

Storage.
How and what data is stored (and for how long) is mostly not covered (only a few examples exist, such as [32]). The security of the stored information and privacy concerns (anonymization processes) appear not to be a high priority. Only one paper [23] was identified that describes security measures for captured data.

Cleaning.
The data cleaning process is typically not described; however, network traffic (pcap) and other datasets can be expected to include non-relevant data. A few papers have started discussing the processes used to clean the data, e.g., [50]. The principles of privacy and anonymization also need to be considered here.

Integration.
As exercises generate multiple datasets, combining datasets in different formats, e.g., technical and timing data with self-reported learner data, is an important process step. Some examples were found of aligning timestamps across datasets, such as [52], [53], [14], [22]; however, no detailed methods are provided on how the data is processed and time-synced.
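Since no surveyed paper documents its integration step in detail, the following is only a plausible sketch of such a step: correcting known per-source clock offsets before merging timestamped events into a single timeline.

```python
import heapq
from datetime import timedelta

def merge_event_streams(streams, clock_offsets):
    """Merge per-source event lists [(timestamp, payload), ...] into one
    chronological timeline, subtracting each source's known clock offset.
    Assumes each source's events are already time-ordered. This is an
    illustrative sketch, not a method documented in the surveyed papers."""
    corrected = (
        [(ts - clock_offsets.get(name, timedelta(0)), name, payload)
         for ts, payload in events]
        for name, events in streams.items()
    )
    return list(heapq.merge(*corrected, key=lambda event: event[0]))
```

With, say, an IDS whose clock runs five seconds fast, its alerts would be shifted back before being interleaved with bash-history events, so cause-and-effect ordering survives the merge.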

Analysis.
Due to LA being a novel research area, data analysis to date has been limited, with a predominance of studies undertaking low-level analysis of readily accessible data, such as relating the number/frequency of logins, number of messages posted, or time online to academic performance as measured by grades [78]. Similarly, in CSXs we did not identify commonly used tools or methods for data analysis (statistical, machine learning models, etc.). As in the overall LA field, the analysis has generally been conducted using easily obtainable metrics (such as time), see Table 2, and is typically linked to high-level learning objective(s).

Representation and Visualization.
The challenge is determining the relationship between visualizations and learning. Feedback about low-level user actions, such as the number of logins, videos watched, or documents submitted, does not illustrate progress in learning for students or educators [79]. The usefulness and effectiveness of visualizations and dashboards is not widely covered in the exercise literature. Several papers, such as [32], analyse the use of visualizations in CSXs in addition to describing the system architecture.
3.2.7. Action. Actions, such as intervention, optimization and systematic improvements (including design), are not necessarily evidence-based. Rather, learning design choices are based on the authors' experience or the learners' self-reporting/survey evaluations. Some relevant research is emerging on how to improve the feedback loop using digital traces, e.g., [17].

Inferring Learning from Digital Datasets: What to Measure?
The various frameworks in Section 2.1 have been developed for learning in CSXs and cover different aspects, but none directly incorporate or utilize learning analytics processes. When inferring learning from granular digital datasets, the challenge is linking learning objectives and competencies to the granular raw data, as the design of a CSX should follow a top-down pattern [18]. The cyber range should be designed to allow such a learning design and measurement process. Table 2 summarises the learning indicators from digital datasets that have been used in academic research, which could serve as a starting point for brainstorming when selecting the metrics to measure whether the training objectives have been achieved. It should be noted that many papers on learning in CSXs are based on the experience and interpretation of the authors, or on traditional learner evaluation (e.g., feedback surveys, evaluation forms).

What to Consider in Choosing Metrics?
We should focus on measuring what we value. The metrics used in CSXs often focus on easily measurable data (e.g., time spent, number of attacks mitigated, etc.) and individual actions. However, students are "too easily satisfied that a system is secure after identifying only one possible source of security for a system rather than seeking to explore the adversarial space more thoroughly" [83]. Thus it is important to understand not only whether the students found the correct answer, but how they found it [15]. Some research is starting to look into "how" the learner completes tasks (i.e., the use of tools, attempts, submission of wrong answers), such as [52], [53], [49]. However, validation is limited (e.g., 4 participants in [49]). With regard to teamwork and communication, some research, such as [81], [36], has started to explore the use of analytics as evidence for achieving learning in teams.
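The "indexed similarity" ranking in [49] is not fully specified; a Jaccard-based stand-in conveys the idea of ranking participants by how closely their command usage matches a reference solution (the index choice and all data here are invented for illustration):

```python
def rank_by_similarity(participants, reference_commands):
    """Rank participants by Jaccard similarity between their command set
    and a reference solution's commands -- a simple stand-in for the
    'indexed similarity' idea in [49], whose exact index is unspecified."""
    ref = set(reference_commands)
    def jaccard(commands):
        s = set(commands)
        return len(s & ref) / len(s | ref) if s | ref else 0.0
    return sorted(participants.items(),
                  key=lambda item: jaccard(item[1]), reverse=True)
```

Such a ranking captures a little of the "how" (which tools were used), but, as noted above, it still says nothing about exploration of the adversarial space or deeper understanding.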
As learning is a complex cognitive process, further research should focus on cognitive metrics, such as those of Knox et al. [82]. From LA research, a measure similar to "cognitive presence" can be applied in cybersecurity training (e.g., the "Active Learning Squared (AL2)" paradigm, which emphasises metacognition and uses both active student learning and machine learning [84], [85]).
Metrics are valuable; however, "being able to report upon a metric does not mean that you should use it, either in the tool, or in reporting its worth" [3]. The metrics will depend on the exercise goals, which in turn are guided by different pedagogical principles (e.g., behaviorist, cognitivist or constructivist) [12] and the wider evaluation model chosen [10]. Therefore, we need to be mindful of the learner and the learning process, and measurement should move towards mapping digital traces describing student activity onto interpretable constructs of interest (e.g., Knowledge Components, Q-matrix), which facilitate actionable analytics [86].

Challenges in Implementing LA Approaches in CSXs
Scientifically valid evidence that learning outcomes were achieved in CSXs is difficult to obtain, especially as exercise designs, objectives, technologies and learner characteristics vary. These factors make inter-institutional and between-exercise comparisons difficult. However, sharing measurement results would improve the ability to verify that learning was achieved and new skills were obtained.
Data analysis to date has been limited, with a predominance of studies undertaking low-level analysis using easily obtainable metrics, such as login times, time to complete tasks, and the number of attacks mitigated (see Table 2). The related work did not reveal commonly used data analysis tools or methods (statistical, machine learning, etc.) in CSXs, but developing and sharing the methods used would enhance the validity of results.
Security challenges, such as intrusion detection, insider threats, malware detection, and phishing detection, lack exact algorithmic solutions, and the boundary between normal and anomalous behaviour is not clear-cut, as attackers continuously improve their techniques and strategies [87]. This also impacts LA, as it needs to keep up with changing algorithms and learning patterns. In addition, there is the challenge of data volume: one large exercise can create terabytes of data comprising multiple large datasets. The large amount of data generated by automatic logs and sensors necessitates efficient and automated data processing and LA techniques. There may not be enough traces to identify learning patterns (e.g., short detection times, gaps in the timeline) and the data may be very diverse (e.g., different operating systems and applications). Therefore, identification of the relevant learning traces requires techniques that can deal with such imbalance and diversity. For combining multiple datasets and formats, e.g., technical and timing data with self-reported learner data, no detailed descriptions or methods are provided on how data is processed and time-synced, although some examples were found of aligning timestamps across datasets, such as [53], [14], [22]. Also, CSXs and related studies mostly cover a short period, but it is known that short-term interventions are not particularly effective at producing behavioral change [88]. Thus, longitudinal studies are needed to evidence learning and behavior change as a result of the exercises, and also to separate this from other learning.

Conclusion
The opportunity to improve learning in CSXs as part of the educational effort is missed without considering learners' experience, different learning styles and paces, and the impact of the learning environment. The application of learning analytics and the analysis of digital datasets can provide a deeper understanding of learning behaviour and lead to evidence-based improvement. The consideration of LA aspects is also vital for cyber range developers, as they design the technological foundation for instrumenting exercises that enables the application of effective LA methods.
We proposed a LA reference model to assist in implementing LA into the CSX life-cycle to achieve more adaptive design and measurement using evidence-based data from the learning environment. As a practical starting point, we shared an extensive overview of existing research describing aspects of the learning analytics process and the analysis of empirical evidence from digital datasets, to assist in implementing the model across all exercise types. We described the learning indicators (metrics) used for evidencing learning in CSXs, with a focus on analytical evidence from digital datasets. Such metrics are mainly simple technical measures (time, number of attacks mitigated, availability of services, etc.) that are not necessarily validated and may not evidence effective learning (i.e., metacognition achieved). With LA and evidence-based measurement, we also need to keep in mind, and validate, that what we measure (i.e., the metrics used) actually helps learners to learn. In turn, validated metrics have the potential to provide more detailed and evidence-based input forming an integral part of comprehensive training evaluations.
Further work should seek to identify and validate which learning metrics evidence the learning process and learning improvement in CSXs. Understanding the current use of learning analytics in CSXs is expected to help set a baseline for further research and practical implementation by combining two evolving disciplines. By doing this, the cybersecurity community can establish a more evidence-based and systematic approach to the evaluation of learning impact that will enable the design of more effective learning experiences.