Measuring trust in automated driving using a multi-level approach to human factors*

As driving shifts towards automation, maximising the related benefits would profit from improved user acceptance of the new technology. Studies suggest a strong connection between acceptance and trust in technical solutions. We investigate the improvement of user trust in driving automation through demonstrations carried out in a sophisticated driving simulator. The study correlates subjective data with objective psychophysiological measurements. A multi-factorial and multivariate analysis of variance investigates the influence of learning effects and pre-experience with advanced driver assistance systems on trust. Results show an improvement in trust through user interaction with the human-machine interface of the demonstrated AD system, illustrating the relevance of human-centered development processes. The conclusion is supported by the observation of driver cardiac signals.


I. INTRODUCTION
Autonomous Driving (AD) is edging towards an inevitable reality. Technological progress combined with market expectations creates high research needs in terms of human-machine interaction [1]. The key contributor to future road safety is the promise of a reduced number of road accidents and their consequences. The majority of accidents are caused by humans [2], [3], whose reaction times, distraction levels, drowsiness, or the influence of (legal or illegal) performance-altering substances affect their driving performance. These deficiencies could be prevented by automated vehicle controls. Three major drivers of the shift towards automated vehicle controls are legislation [4], safety, and comfort [5]. Complementary AD features contribute to improved traffic flow and increased road capacity, and include applications such as intelligent route planning or platooning [6]. However, the positive benefits are matched by a possible lack of acceptance of the new technology [7], which is related to cyber-security [8] and privacy [9]. The reduced acceptance level is also correlated with a lack of trust in technical solutions [10] and safety [11]. Demonstrations with sophisticated simulators could improve trust levels by revealing the real benefits of AD to the human user. Such virtual-environment studies yield insights into user behaviour in safety-critical scenarios without compromising user safety. Objective data on soft factors (trust, comfort, and drowsiness) are gathered in a series of scenarios. Besides, standardisation demands a human-centric approach to manage handover and takeover between the vehicle and the human for SAE Level 2 and 3 automation [12]. The same complies with the General Safety Regulation (GSR) [13].

*The work presented in this document has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 723324 (TrustVehicle) and No 871385 (TEACHING).
A level of driver interaction, including exchange about the state of the vehicle and monitoring of the driver state, e.g., the Driver Drowsiness and Attention Warning (DDAW), is essential when the driver must take over control of the vehicle, for the environment-monitoring task (SAE L2), or as the fallback level (SAE L3). Vehicles must also fulfil safety requirements, which are vital for end-users. Additional demands come from Euro NCAP in terms of the Safety Assist test procedure [14], which centres around occupant monitoring. These procedures focus on the minimisation of human error through occupant monitoring and warning, as some studies suggest that 90% of accidents are caused by human mistakes (purposeful or erroneous) [14]. Comfort, which is difficult to assess objectively, is another crucial quality for users [10]. Comfort distinguishes different service providers and can also contribute to perceived safety, which is either a selling factor or a blocker. Driving simulator studies support the full development cycle from the concept through the implementation to the validation and verification phase. Their unique value is the safe testing of safety-critical situations (e.g., drivers falling asleep). Their major benefit is repeatable testing with exactly the same variables and covariates for a given scenario at the concept, design, validation, and verification phase, which eases monitoring and development. The high reliability enables assessment of forthcoming events in the development and the adoption of goals in agile structures. In on-road testing, it is impossible to reach the same repeatability due to the dynamic environment and road conditions; this leads to a high internal validity of the simulator-based testing method. Simulators benefit from cost efficiency, internal and external validity, precise control of covariates, independent repeatability, high situational specificity, exposure to critical situations, a safe environment, etc.
This work exploits the offered benefits to examine trust and end-user acceptance of AD. Defining trust as a belief in the dependability of the AD system, its strong link to the acceptance and use of automated technology [15] is crucial for the impact of AD. This paper investigates the improvement of trust through pre-existing and ongoing human-machine interaction with an AD system.

II. METHOD
This section describes the setting and assets employed in the study. The process executes a tailored set of simulator scenarios and investigates the prospect of growing trust in AD.

A. Sample
The experiment was completed by 55 of 60 registered participants. After removing four incomplete data sets and two sets with outliers, including inconsistent answers to control questions, 49 complete data sets remained valid for the analysis. The participants are grouped to evaluate demographic differences and group interactions with other factors. The groups are: gender (28 male, 21 female, 0 neutral), age (21 aged 18-29 years, 20 aged 30-45 years, and 8 aged 45-65 years), yearly driving range (9 below 5000 km/y, 12 between 5001 and 10000 km/y, and 29 over 10001 km/y), driving-assistant experience (23 without, 26 with), and educational level (37 higher than Bachelor's or equivalent degree and 12 below, according to ISCED [16]).

B. Scenarios
The scenario selection follows the method of assessment concepts for TrustVehicles [17]. Ten scenarios with one or more sequences are chosen, with changes in the driving mode between them. Similar sequences are clustered for the evaluation. Two controller modes are applied: a comfortable mode with moderate deceleration and larger gaps, and a sporty mode with higher deceleration and smaller gaps. The scenarios contain takeover and function degradation due to dirt (scenarios 1 & 2), following at constant speed and function degradation due to weather change (scenarios 3 & 4), emergency braking (scenarios 5, 6, & 7), and lane-change situations and weather change (scenarios 8, 9, & 10). All scenarios run in fully AD mode, so each subject experiences the same situation. This is a requirement for high internal validity and precise control of covariates. A representative scenario (emergency brake) is described below.
The emergency brake scenario takes place in a narrow one-way street with vehicles parked on both sides of the moving ego-vehicle. Upon approaching this location, the ego-vehicle slows down from 30 km/h to a velocity appropriate for the road type (15 km/h). The deceleration is mandated by the limited perception boundaries due to the reduced field of view of the vehicle sensors in such an environment. The ego-vehicle continues at a constant velocity until it detects a child attempting to cross the road at a location with no marked road crossing, emerging between the parked vehicles from the right side (see figure 1). The sudden appearance prompts a swift reaction of the vehicle controls: the brakes are activated. The key parameters in this scenario are the timing and the intensity of the braking. The scenario cluster is based on the following order:
S5 Early braking (comfortable controller configuration)
S6 Late braking (sporty controller configuration)
S7 Early braking again (identical to the first scenario)
Scenario five is used to monitor driver reactions to a comfortable braking response to the child crossing the road. The resulting measurements are treated as the baseline measurement. Scenario six measures the driver's reaction to a sporty controller response, thus creating a more critical situation. The intention is to convey to the subject that even in the event of late braking, the situation is safe. Scenario seven is compared to the previous two scenarios to observe the impact of the learning effect on trust in the technology.

C. Study materials
The study engages a multilevel approach to subjective and objective data, as well as psychophysiological parameters, to examine the influence of the AD system on trust. The purpose-built "Sequence-Specific Questionnaire on Trust" (SSQT) results from experience in driving behaviour testing and development [18]. Its three key questions focus on comfort (Q1), trust (Q2) and overall perception (Q3). In addition to these three, sequence-specific questions are raised. The rating scale ranges from 1 (very bad) to 10 (very good), or in some sequence-specific questions from 1 (bad, e.g., too early) through 10 (good, e.g., just right) to 20 (bad, e.g., too late). The three standard questions for all sequences are:
Q1 How comfortable did you feel?
Q2 How much confidence did you have in the vehicle?
Q3 How was the overall perception?
The three standard questions are complemented by the sequence-related questions, which assess the behaviour of the AD controller and its perceived performance. For scenarios 5 to 7, these questions are "How do you evaluate the reaction time of the vehicle until the first reaction in front of the pedestrian?", "How do you evaluate the distance from the stopped vehicle to the crossing pedestrian?", and "How do you evaluate the time until the vehicle starts again after the pedestrian has left the roadway?". The SSQT is accompanied by a short video of the scene to aid participants' memory. The answers are entered on a slider scale (colour, text and smiley coded) shown in figure 2. Trust relates to question two and is supported by the other questions, especially question one (comfort) and question three (overall perception). These questions indicate participants' experience of the situation. The heart rate is measured by a POLAR H10 chest belt [19] and is communicated via ANT+ and Bluetooth to the co-simulation framework. The demographic data and subjective experience are quantified using questionnaires prior to, during, and on completion of the test drive.
Additional assessment materials are used but are not evaluated in this paper. NASA TLX [20] measures the participants' subjective task load during the testing. The EXPRESSION OF TRUST questionnaire is a variation of [21], which is built on the original TRUST IN AUTOMATION scale [22]. It enquires about trust in AD functions using a seven-point Likert scale from "do not agree at all" to "agree completely". The questions are aimed at answering whether trust in AD can be raised through a safe experience. The usability of the system under test is assessed during the post-test phase using the System Usability Scale [23], a 10-question questionnaire slightly adapted to the needs of the study following current advice [24]. Eye tracking is implemented using a Pupil Labs Core device [25] to monitor gaze direction, pupil size and blink rate. The connection to the co-simulation framework uses a USB interface to handle the video data volume.

D. Environment and procedure
The controlled laboratory environment study engages 60 volunteers of diverse ages and a nearly equal gender division, recruited via official websites and inter-company information. All are licensed drivers with no driving simulation experience. They follow an identical set of procedures and scenarios using the same set of hardware and software tools. Test sessions start with a familiarisation phase of 10 minutes, which is followed by a set of AD scenarios including driving in rural, city, country, narrow parking, construction site and other environments.

E. Equipment and Techniques
The technical setting consists of the simulator itself, the background hardware (e.g., computers, sensors) and the software combining the hardware into the test environment. The simulator is a moving hexapod platform with a real vehicle cut-out containing the driver seat and the complete driver environment (the cockpit). The virtual environment is visualised by projection onto a 180° canvas by 3 video projectors with a high frame rate and resolution (100 Hz, 4K) for realistic simulation and reduction of motion sickness. Additional rear-view mirror screens increase the field of view and improve realism. The motion system, characterised by six degrees of freedom, offers many manoeuvring possibilities, as required for the simulation of the following motion types: translation along, and pivoting around, each of the three axes. The simulator acceleration is limited to 6 m/s² with a manoeuvring possibility of approximately 1 m in all three dimensions (x/y/z) and a rotational capability. The reduced motion sickness experienced by the participants (section II-A) is traced back to optimised motion cueing and platform motion adapted to real vehicle tests (external validity). The platform's reproduction of real road behaviour is limited by its technical boundaries. Biometric data is recorded with a range of sensors during tests within the cockpit: a chest belt for the heart rate (HR) and the heart rate variability (HRV), Time of Flight (ToF) cameras, an eye tracker and eye gaze monitor, and head and body movement monitors, partly visualised in figure 3. The sensor combination differs subject to the study, topic, and availability of the evolving sensors. The driver monitoring sensors enable timely and appropriate driver alerting about an impending driver takeover of vehicle controls, as specified for L3 automated vehicles. Additional subjective data is gathered through pre-simulation, interim and post-simulation questionnaires.
A co-simulation framework with integrated AD functions integrates software and hardware to facilitate the usability of all components. The framework also enables the simulation of scenarios that are unlikely to be tested in real driving situations due to safety concerns. Two main technical challenges associated with the setup of the driving simulator are posed by data time synchronisation and the synchronisation of simulator movements. The data synchronisation is tackled by component interconnection using AVL Model.CONNECT™ [26], a co-simulation platform that enables data acquisition and monitoring from a single communication centre. A centralised data collection point mitigates unforeseen simulation interruptions and data loss through a complete run-time overview of the system. Data with divergent sampling rates are partially handled by first- and second-order interpolation and extrapolation techniques in combination with the Nearly Energy Preserving Coupling Element (NEPCE) method [27], which minimises coupling errors and increases numerical stability.
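To illustrate how divergent sampling rates can be bridged between coupled components, a first-order extrapolation of a coupling signal can be sketched as follows. This is a minimal sketch only: it does not reproduce the NEPCE correction itself, and the function name, rates, and values are illustrative assumptions rather than the framework's actual interface.

```python
def first_order_extrapolate(t_prev, y_prev, t_curr, y_curr, t_query):
    """Estimate a coupling signal beyond its last received sample using
    the slope of the two most recent samples (first-order hold). A fast
    component can query a slowly-sampled partner signal this way."""
    slope = (y_curr - y_prev) / (t_curr - t_prev)
    return y_curr + slope * (t_query - t_curr)

# A 100 Hz subscriber asks for the value of a 10 Hz signal between
# two of the slower component's samples:
estimate = first_order_extrapolate(0.0, 1.0, 0.1, 1.2, 0.13)
```

Higher-order variants fit a quadratic through three samples; the NEPCE method additionally compensates the energy error that such extrapolation introduces at the coupling interface.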
The motion platform is controlled by AVL VSM™ [28], which simulates the vehicle behaviour and translates expected accelerations into precise simulator motions. This is essential for a realistic feeling and the reduction of motion sickness. VIRES VTD™ [29] supports road visualisation by offering environment setup including road networks, traffic infrastructure, and other road users. It also aids the distribution of data from simulated perception sensors, as needed for influencing vehicle control outputs. The structured data distribution and synchronisation of the sensor outputs are crucial for the optimisation of the simulator and the control strategy. The simulator also offers freedom in terms of hardware integration: the environment allows the use of bulky hardware that cannot be installed in vehicle testing due to limitations of physical space, technical integration, or safety.
While other experiments [30]-[32] are based on the same methodology, the unique character of this study lies in the specific combination of physiological measurement data with the subjective, purpose-designed questionnaire within a study that targets user trust in AD. Figure 4 depicts the process a single subject is exposed to. Through upfront information, a subject arriving at the test site is already familiar with the testing procedure and the data to be collected. They then go through six major test phases.

F. The Experimental Process
A subject arriving at the test site is familiarised with the information and the agreement sheets at their own pace at the start of phase one. They are offered a verbal question-and-answer session, which also includes a mandatory introduction of the relevant General Data Protection Regulation (GDPR) [33] aspects. It is emphasised at this stage that the subject can stop the testing at any time and withdraw from the study. A pre-testing questionnaire, the EXPRESSION OF TRUST, is provided to the subject. This questionnaire is answered twice, once in phase one and once in phase six.
In phase two, the subject is equipped with the relevant sensors. The subject is introduced to the cockpit environment, i.e., steering, braking, human-machine interfaces (e.g., indicators, dashboards), and the automated subjective feedback system in the centre console. Example questionnaires are presented to familiarise the subject with the process and reduce the influence of an unknown system on the later answers.
The third phase (pre-testing) lets the subject experience the behaviour of the moving platform for about 10 minutes. They are also confronted with several test questions. All sensor streams and question responses are checked to confirm proper behaviour. The first sub-phase does not include platform activation, to allow gradual familiarisation of the participants. The second sub-phase activates the motion platform.
In phase four, the actual test phase, the subject is driven automatically through several scenarios. On completion of each scenario, the environment freezes and the associated SSQTs are raised. Once completed, the subsequent scenario follows. The process is repeated for all scheduled scenarios and takes around one hour. During phases 3 and 4, all sensor data is recorded.
In phase five, the subject leaves the cockpit and gets all the sensors and equipment removed. They are engaged in conversation, in which their mental and physical health is checked.
The participant answers further questions in the last phase. The procedure keeps the participants occupied and under observation for at least 30 minutes after the test. This prevents post-test motion sickness outside the controlled test environment. The questionnaires at this phase are the NASA TLX test [20], the EXPRESSION OF TRUST (again), the SYSTEM USABILITY SCALE [23], and a demographic questionnaire to relate their results to demographic groups. Finally, optional free form feedback enables reporting likes or dislikes of the complete participation.

III. RESULTS
The repeated measurement data is subjected to multifactorial and multivariate analysis of variance. The calculations are aimed at investigating differences between the scenarios, the driving mode, the driving experience and the AD experience. A fraction of the results is presented in this section, aiming to establish the link and likely trust increase through user interaction with an AD system.

A. Questionnaires
The results focus on example scenarios 5, 6 and 7 (described in section II-B). For each SSQT question, we calculate a single analysis of variance with repeated measurements on three points of time. Significant differences emerge over time (p_t) within questions 1, 2 and 3 (see table I). The outcome is depicted in figure 5, which shows the changes over time in the mean values of comfort (Q1), trust (Q2) and overall perception (Q3) in the example scenarios (5, 6 and 7). While comfort decreases from scenario 5 (comfort driving mode) to scenario 6 (sports driving mode), an increase can be shown from scenario 6 to scenario 7 (comfort driving mode), where values in scenario 7 outrun the other scenarios. Results for trust and overall perception show similar development and characteristics between the different scenarios. Results regarding the experience of AD functionality depending on the three SSQT questions are shown in table II and in figures 6, 7 and 8. For each scenario-specific question, we calculated a single analysis of variance with repeated measurements on three points of time including the factor pre-experience with ADAS.
As shown in table II, results reveal a trend towards a significant interaction (F=2.171, p_int=0.119, η²=0.044) of group and time on question 3. Significant differences can be seen in the main effect of time in question 1 (F=4.521, p_t=0.013, η²=0.088) and question 2 (F=3.102, p_t=0.049, η²=0.062), and moreover a significant difference regarding the main effect (table II). Figure 6 (Q1), figure 7 (Q2) and figure 8 (Q3) provide a similar picture regarding the progress of comfort, trust and overall perception over the three scenarios. Participants who already had ADAS experience show lower ratings of the functionality at the first two emergency brake scenarios (S5 and S6) and higher ratings in the third scenario (S7) for all three questions. They also show a lower rating on comfort (Q1) and overall perception (Q3) in the late braking scenario (S6) than the participants without ADAS pre-experience.
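The per-question statistic is a one-way repeated-measures ANOVA over the three points of time. The following minimal sketch (pure Python; the synthetic ratings are illustrative and are not the study data) shows how the F value and partial eta squared reported in the tables are obtained in principle:

```python
def rm_anova(data):
    """One-way repeated-measures ANOVA over k time points.
    data: one list per subject, each with k ratings (e.g., one SSQT
    question rated after scenarios 5, 6 and 7).
    Returns (F, partial eta squared) for the main effect of time."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    subj_means = [sum(row) / k for row in data]
    time_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_time = n * sum((m - grand) ** 2 for m in time_means)
    ss_err = ss_total - ss_subj - ss_time   # time-by-subject residual
    df_time, df_err = k - 1, (n - 1) * (k - 1)
    f_val = (ss_time / df_time) / (ss_err / df_err)
    eta_p2 = ss_time / (ss_time + ss_err)   # partial eta squared
    return f_val, eta_p2

# Synthetic ratings: values drop in the sporty scenario (S6) and
# recover above baseline in S7, mimicking the observed pattern.
ratings = [[7, 5, 8], [6, 4, 7], [8, 6, 9], [7, 4, 8], [6, 5, 7]]
f_val, eta_p2 = rm_anova(ratings)
```

The p-value follows from comparing F against the F distribution with (df_time, df_err) degrees of freedom, e.g., via scipy.stats.f.sf.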

B. Heart rate
Physiological parameters (i.e., the heart rate) were baseline corrected and normalized using an overall baseline to account for participants' interindividual physical conditions. The correction was done for each subject per scenario to avoid cross-interaction and effects of the participants' diverse individual physiological conditions:

bpm_corr_i = (bpm_act_i − bpm_mean_i) / bpm_sd_i    (1)

In formula 1, bpm_act_i refers to the heart rate of the single subject at the measurement point in time, bpm_mean_i is the mean bpm of the single subject for the actual scenario, and bpm_sd_i its standard deviation. The result is the corrected and standardized heart rate of the single subject at the actual point in time, bpm_corr_i.
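The per-subject, per-scenario standardization described for formula 1 is an ordinary z-score. A minimal sketch (Python standard library only; the raw samples are illustrative):

```python
import statistics

def correct_bpm(bpm_samples):
    """Standardize one subject's heart-rate samples within one scenario:
    bpm_corr_i = (bpm_act_i - bpm_mean_i) / bpm_sd_i, i.e. a z-score
    removing the subject's individual baseline level and variability."""
    mean = statistics.mean(bpm_samples)
    sd = statistics.stdev(bpm_samples)
    return [(b - mean) / sd for b in bpm_samples]

# Illustrative raw samples for one subject in one scenario
corrected = correct_bpm([72, 74, 80, 90, 85, 76, 73])
```

After this step, corrected values from different subjects are on a common scale, so scenario-level means can be compared across the sample.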
The intervals of interest were defined for the scenarios around the excitement actions with a predefined length. In scenario 5, this is the moment when the child becomes visible to the vehicle sensors between the parked vehicles. A build-up interval and a relaxation interval of the same extent were also defined. Table III shows means, standard deviations and p-values for the analysis of variance regarding the repeated measurements within scenarios 5, 6, and 7. A highly significant difference is revealed within the points of measurement in scenario 5 (F=17.441, p_t=0.000, η²=0.401) and significant differences in scenario 6 (F=3.213, p_t=0.048, η²=0.110) and 7 (F=3.334, p_t=0.044, η²=0.118).
Fig. 9. Baseline corrected and normalized heart rate values depending on the scenarios 5, 6, and 7 and the build-up, excitement, and relaxation intervals.
Figure 9 shows heart rate reactivity in the build-up interval being similar to the overall baseline, while the strongest reactions occur in the excitement phase. In the relaxation interval, reactions in the heart rate are higher than in the excitement phase but have not yet reached the level of the build-up interval, pictured as a black line for each scenario. Moreover, fewer heart rate reactions can be seen in scenario 6 compared to scenario 5, and in scenario 7 in relation to scenario 6. The scenario 7 heart rate in the relaxation interval has almost reached the level of the build-up phase.
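The interval definition can be sketched as a simple windowing of the normalized heart-rate trace around the event timestamp. The window width, sampling rate, and values below are illustrative assumptions, not the study's parameters:

```python
def event_windows(samples, t_event, width):
    """Split (timestamp, value) heart-rate samples into build-up,
    excitement and relaxation intervals of equal width around the
    excitement event (e.g., the moment the child becomes visible)."""
    build = [v for t, v in samples if t_event - width <= t < t_event]
    excite = [v for t, v in samples if t_event <= t < t_event + width]
    relax = [v for t, v in samples
             if t_event + width <= t < t_event + 2 * width]
    return build, excite, relax

# Illustrative 1 Hz normalized trace; the event occurs at t = 3 s and
# the window width is 3 s. Negative values during the excitement
# interval mimic the cardiac deceleration discussed in section IV.
trace = [(0, 0.1), (1, 0.0), (2, 0.2), (3, -1.1), (4, -0.9),
         (5, -0.6), (6, -0.2), (7, 0.0), (8, 0.1)]
build, excite, relax = event_windows(trace, t_event=3, width=3)
```

The per-interval means of such windows are the quantities compared by the repeated-measures analysis in table III.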

IV. DISCUSSION
The evaluated results demonstrate that trust evolves with participants' experience with the system in the presented scenarios. Differences are seen at various levels of human reactions and behaviour. Even when starting from a relatively high trust level, increasing trust in AD is evident in all sequential experiments. This uncovers ongoing learning effects for the participants while they interact with an AD system. Results on this progressive gain of trust for the given scenarios are shown in figures 5 and 9.
Moreover, trust tends to differ regarding participants' experience with AD (p int =0.119). While participants who had no experience with AD tend to show constant overall perception, participants with pre-existing experience with ADAS show a decrease from scenario 5 to 6 and an increase from scenario 6 to 7, exceeding the rating of participants who had no experience with ADAS. Regarding those two groups, we notice the possible impact of pre-existing learning effects as well as learning effects that appear during the current human-machine interaction with AD systems.
Learning effects during the current human-machine interaction, as well as pre-existing effects, may play an important role in the development of trust in AD systems. In contrast to other studies [15], which have not linked previous ADAS experience to trust, we report that ADAS pre-experience might initially reduce trust in the system, but trust then grows faster than without ADAS pre-experience. The reasons for the initial drop in trust remain unclear. The cause could lie in the participants' negative prior experiences. Another hypothesis is that humans need sufficient experience and knowledge about the system functionality to gain trust. Both perspectives encourage early learning effects of future drivers while at the same time eliminating negative experiences through human-centered development of AD.
Cardiovascular reactions differ during the intervals (build-up, excitement, relaxation) and between scenarios. While we identify strong heart-rate reactions within the excitement interval, the heart rate in the relaxation interval approximates the baseline. The cardiac deceleration within the excitement interval, in comparison to the build-up, can be explained by participants' orienting response within the situation, as the organism reacts to foster information input. This is consistent with the literature [34], [35], which sees cardiac deceleration as part of the freeze reaction in the orienting response, which helps humans prepare an adequate reaction through enhanced information acquisition, stimulated by the amygdala. The cardiac deceleration might also relate to the behavioural inhibition system (BIS), which regards an inhibition of behaviour, with the hippocampus as the central structure, as a means to avoid immediate actions when the system cannot predict the result of the planned action [36]. It then inhibits the behaviour and evaluates the situation through risk-assessing information processes. The observed cardiac acceleration in the relaxation interval reflects a regeneration of the cardiovascular system, as the situation was, from the driver's perspective, successfully handled by the AD system.
Differences between scenarios show less reactivity in the cardiovascular system within the excitement and relaxation intervals from scenario 5 to 7 in comparison to the build-up. This may be explained by current learning effects regarding the functionality and reliability of the system within the situation and therefore result in a gain in trust, rather than with a habituation process, as the presentation of three stimuli seems insufficient to trigger habituation processes. Results on cardiovascular reactivity, therefore, support the findings from the analysis on subjective ratings and supplement the evaluation with psychophysiological concepts.
Results gained through simulator studies lack the real driving experience. The impossibility of experiencing unsafe or harmful events, and the missing risk of injury, might influence the intensity of participants' reactions in the subjective ratings as well as in physical states, while the direction of the results still corresponds. Therefore, simulator studies have an advantage in early-stage development, where the direction of the results is prioritised over their intensity. The differences in intensity between simulator and field results depend on how well the simulator replicates the movements of a real vehicle [37]. Pre-testing evaluations of simulators can contribute to their further development and increase external validity.
It seems to be crucial to allow the participants to slip into the virtual reality and to avoid interactions from outside this virtual world. To increase the validity of simulator studies, participants must get used to the simulator and its behaviour in the pre-testing phase. Surprisingly, many participants reported in the free-text feedback that they felt more comfortable in the pre-testing phase with activated platform movement than in the static phase. Data from this pre-testing phase is excluded from the analysis. Another way to mentally keep participants in the simulation is to enable answering subjective questions within the cockpit (central dashboard) shortly after the situations and to slip in and out of new scenarios as smoothly as possible. That calls for future studies to continue the simulation while participants are answering the subjective questions. Once they are ready to continue with the testing, the simulation software should make a smooth transition into the next scenario. Psychophysiological methods contribute to this issue, as data acquisition happens right in the situation without interrupting the simulation or diverting participants' attention from the situation.
Considering figure 5, trust in the system remains stable, while comfort and overall perception decrease slightly, as expected when confronting the participants with a more critical situation (later and harsher braking in front of the child). In the following scenario (7), with a repeated early braking situation, trust, comfort and overall perception increase again and even surpass the first experience. Trust rises significantly (p_t=0.049), and comfort (p_t=0.012) and overall perception (p_t=0.048) follow this trend. A fourth scenario in the "sporty" configuration in this series would reveal whether the third was rated better because of an actual improvement or a learning effect. In future studies the EXPRESSION OF TRUST questionnaire might be extended to a multi-dimensional questionnaire considering more aspects of trust.
In summary, our finding is that users who gain information and experience with the system, its capabilities, and its events (e.g., later braking) show reactions of lower intensity, as they rely on what they have learned about the system. However, when experiencing a critical event (i.e., later braking) for the first time, with no knowledge about the system, an intense driver response is to be expected. Once driver trust evolves through learning, we expect a less intense response to AD behaviour. This is confirmed, as we observe a higher acceptance of critical situations due to the increase in trust. These findings bring us a step closer to answering our research question, which aims to establish a relationship between the improvement of trust and ongoing user interaction with an AD system.

V. CONCLUSION
We explore the improvement of trust through ongoing user interaction with an AD system. The study merges responses to sequence-specific questions with objective physiological measurements to avoid bias towards one data source.
The investigated driving scenarios reveal a significant influence of driving modes (e.g., sporty) on the trust in AD. It may also be concluded that trust evolves with participants' experience with the system. That happens even with a relatively high starting trust, hence indicating ongoing learning effects for those who interact with an AD system. The insights support the need for early engagement with AD systems and human-centered development of AD.
In contrast to other studies, we find that pre-existing experience of driving automation might initially reduce trust in the system. However, trust then rises faster than for drivers without pre-existing experience of the technology.
Despite the effect of the disparity between the simulated and real-world on the drivers, simulator studies seem to have an advantage in early-stage development, where the direction of the results is more important than their intensity.
The data quality suggests that a small study group engaged in a customised study with high internal validity can achieve high-quality results, which is promising for future GSR testing of DDAW and the next steps of the GSR. Future work is likely to subject biometric data to machine learning methods to perform an objective assessment of user experience alongside subjective ratings.