Eating, Smelling, and Seeing: Investigating Multisensory Integration and (In)congruent Stimuli while Eating in VR

Integrating taste in AR/VR applications has various promising use cases — from social eating to the treatment of disorders. Despite many successful AR/VR applications that alter the taste of beverages and food, the relationship between olfaction, gustation, and vision during the process of multisensory integration (MSI) has not been fully explored yet. Thus, we present the results of a study in which participants were confronted with congruent and incongruent visual and olfactory stimuli while eating a tasteless food product in VR. We were interested in (1) whether participants integrate bi-modal congruent stimuli and (2) whether vision guides MSI during congruent/incongruent conditions. Our results contain three main findings: First, and surprisingly, participants were not always able to detect congruent visual-olfactory stimuli when eating a portion of tasteless food. Second, when confronted with tri-modal incongruent cues, a majority of participants did not rely on any of the presented cues when forced to identify what they were eating; this includes vision, which has previously been shown to dominate MSI. Third, although research has shown that basic taste qualities like sweetness, saltiness, or sourness can be influenced by congruent cues, doing so with more complex flavors (e.g., zucchini or carrot) proved to be harder to achieve. We discuss our results in the context of multimodal integration and within the domain of multisensory AR/VR. Our results are a necessary building block for future human-food interaction in XR that relies on smell, taste, and vision, and are foundational for applied scenarios such as affective AR/VR.


INTRODUCTION
Multisensory integration (MSI) is the process that combines the information delivered by the sensory systems into a single percept. This influences our behavior and experiences [53]. In general, MSI is more straightforward when the sensory systems deliver stimuli that match with respect to their identity or meaning. This is called semantic congruency [50].
Relying on MSI, it has been shown that augmented reality (AR) and virtual reality (VR) can be used to manipulate the perceived taste of food and beverages by displaying congruent olfactory and visual stimuli (c.f. Sect. 2). Including such olfactory but also additional gustatory stimuli in AR/VR and non-immersive applications has shown potential in, for example, treatment of obesity and eating disorders [37], psychiatric conditions [44], in consumer behavior research [62], for the sense of presence in VR [21,64], in learning environments [23], when sharing emotions via smell and taste [41], or when enhancing affective qualities of applications [40].
Despite these benefits and the eagerness of prior research to investigate if perception can be manipulated altogether, it is not sufficiently explored how olfaction, vision, and gustation interact and influence MSI. For example, it has been shown that the perception of sweetness (e.g., Narumi et al. [33]) can be altered by additional congruent cues. However, it is unclear how vision, olfaction, and gustation interplay and influence MSI when trying to change perception beyond the basic tastes of salty, sweet, bitter, sour, and umami. Further, while it has been shown that vision dominates when participants are confronted with competing visual and olfactory cues [29,57], it is unclear how a third stimulus - in our case, a tasteless food product - impacts MSI. Thus, our objective is to further expand the understanding of MSI in multisensory AR/VR applications by investigating the following research questions:
RQ1: Do participants integrate congruent visual and olfactory stimuli into a single percept while eating a tasteless food?
RQ2: Are participants guided by their vision when forced to identify what they consume during visual-olfactory-gustatory incongruency?
To do this, we report on two pre-studies that we performed to find a tasteless and odorless grocery product and suitable odor samples. Based on these results, we report on our main study and its three experiments, in which participants experienced and rated pictures, odors, and a multisensory VR environment. Our core contributions can be summarized as follows:
• We present food and smell samples that can easily be reproduced and do not rely on expensive equipment.
• We present the design of our prototype "Smell-O-Spoon", a device that can be used to alter smell in VR when eating a mash or soup.
• We report on the interplay of vision, olfaction, and gustation in VR and how (in)congruency influences perception.
By that, we add to the fundamental understanding of MSI - specifically, how humans react when two or more senses do not agree. We also enlarge the body of literature on whether complex flavor objects can be produced or influenced by artificial and virtual stimuli, and we try to reproduce and verify prior results. The remainder of this paper is structured as follows: In Sect. 2, we present related work on human-food interaction in AR/VR and MSI. Sect. 3 presents the pre-studies that derive a close-to-tasteless food product and the appropriate smell samples. Next, we briefly introduce the Smell-O-Spoon in Sect. 4. The design of the main study is outlined in Sect. 5. Sect. 6 presents the results and Sect. 7 discusses them. We finish in Sect. 8 with a conclusion and outlook.

BACKGROUND & RELATED WORK

Multisensory Integration (MSI)
The fusion of information delivered by various senses in a spatial and temporal relationship is called MSI and is often researched by delivering cross-modal stimuli [52]. MSI is highly important as multisensory perception has been shown to be stronger than uni-sensory perception [10] due to cross-modal summation. In addition to that, congruent stimuli also improve speed and accuracy during perception [63]. In general, MSI is influenced by the strength, spatial location, and timing of the stimuli [7].
Considering gustation, olfaction, and vision, not only is the quality of the stimuli important but also aspects like background color, background noise, the color of the plateware/glassware, or scene lighting [51]. It has been shown that the flavor of an object can be changed inside and outside of VR by presenting cross-modal correspondent stimuli. For example, Sinding et al. [46] successfully changed the perceived saltiness of a food item by adding a salty-congruent odor. Similarly, Frank and Byram [13] showed that sweet-congruent odors can increase perceived sweetness. Narumi et al. [34] added chocolate and tea flavors (both sweet) to a cookie via various combinations of always congruent visual and olfactory stimuli and changed the perceived flavor in up to 80% of the cases. While these and other applications have investigated the interplay of congruent stimuli, there is still a need to fully explore the delicate mechanisms of MSI and to reproduce prior results, especially in more complex cases beyond the modification of isolated flavors as in the mentioned examples, which change the basic tastes of saltiness or sweetness.
Some authors argue that there exists only one perceptual system and that integrating information from different perceptual systems is therefore not necessary [55,56]. In this understanding, the perceived information is already structured, but humans may not be able to interpret it. We discuss our results in the context of this theoretical framework in Sect. 7.4.

Food, gustation, olfaction, and vision (in VR)
The domain of human-food-interaction has a long history and covers devices and approaches that facilitate and use smell, taste, electrostimulation on the tongue, haptics, and more when working with food and user interfaces. We briefly present relevant work in the domain of olfaction, gustation, and multimodal human-food interaction and put it in relation to MSI and our work.

Olfaction
So far, research has shown that smell can be used for increasing presence [38], attention management [11], navigation [4], and also to control the feeling of satiation [24]. Thus, various methods for smell displays have been presented (next to bulky and expensive commercial olfactometers): Brooks et al. [4] presented a device for digital experiences that is clipped onto the nose and can simulate smell via electro-stimulation. Dozio et al. [11] used a diffuser placed on top of a monitor to disperse the smell (essential oils) towards the user and successfully guided attention. Smell-O-Vision [30] (presented without an evaluation) uses a similar approach and couples it with machine learning to emit essential oils that match the video that participants are watching. While these devices are either desktop- or body-mounted, Niedenthal et al. [35] present a hand-held device that can disperse various smells according to the VR scene that participants experience, thereby enhancing presence. Devices that simulate smell in VR are also commercially available (e.g., Feelreal [19] or OVR Ion [58]) and have been used in research. However, they are often expensive, bulky, and limited to the provided odor samples.

Gustation
Similar to olfactometers, gustometers are used in research, development, and clinical applications. However, these devices are again bulky, expensive, and hard to integrate with various VR and AR applications. Thus, researchers have investigated alternatives for creating taste samples and integrating them into user interfaces (c.f. Vi et al. [59]).
Creating taste samples traditionally follows either a chemical or a digital approach. The chemical approach uses chemical substitutes with particular smells or tastes that participants lick or consume. For example, ideal substitutes for sweet (glucose), sour (citric acid), bitter (caffeine or quinine), salty (sodium chloride), and umami (monosodium glutamate) have been identified [59]. This approach was applied by, for example, Maynes-Aminzade [27], who injected jellybeans with the above-mentioned ingredients, served them to participants, and successfully improved memory retention. The digital approach [59] creates different sensations of taste through electrical and thermal stimulation. For example, Karunanayaka et al. [22] showed that temperature increases sensitivity to certain sensations such as sweetness. Similarly, Ranasinghe et al. [40] present a well-perceived device for electro-stimulation. A third approach - using real food items in user studies - has been used in human-computer interaction due to its relatively easy use (e.g., Ranasinghe et al. [40] and Narumi et al. [34]). Here, people use modified utensils, plateware, or liquid containers to consume the food. However, no structured approach for generating such samples has been derived yet. We also opt for this method because of its ease of use and present such an approach.

Vision
An often-used approach is to modify the visual appearance of food via augmented reality (AR). For example, Narumi et al. [31] changed the size of a food item or its color [33] using AR; by that, they could successfully influence satiety and control nutritional intake. Nakano et al. [28] performed a similar experiment and changed the perceived flavor of noodles by overlaying machine-learning-generated images in AR. Similarly, but in VR, Ammann et al. [2] investigated how a simple change in the color of a cake (yellow = lemon/sour; brown = chocolate/sweet) influences the perception of participants: Results revealed that people had more difficulty identifying the real flavor when the color was modified, indicating a conflict in the process of MSI. We expand on this study by examining the influence of smell in a similar experiment.

Multimodal
Multimodal approaches stimulate several senses via a plethora of actuators. Ranasinghe et al. enhanced flavor using electro-stimulation on the tongue and color (Taste+ [42]); electric stimulation, smell, and color (Vocktail [43]); and electric stimulation, smell, color, and thermal stimuli [40] by using glasses and spoons equipped with electrodes, LEDs, fans, and Peltier elements. They present congruent stimuli to successfully influence the process of MSI and to change the perceived flavor of a beverage. They also highlight the need for such systems in applications like communicating smell and taste [41]. Narumi et al. present MetaCookie [33] and MetaCookie+ [34], multisensory devices that can change the taste of a cookie by overlaying an image and displaying a different smell via an AR-HMD equipped with tubes and fans. Similar to our study, they investigate if modulated vision and smell affect perception by relying on the fact that artificial congruent stimuli (vision and olfaction) can change the perception of the cookie by being the dominant stimuli during MSI. However, the influence and interplay of various combinations of congruent and especially incongruent cues have not been explored yet.
Lin et al. [25] developed a tool called "TransFork". As it was an inspiration for our setup, we present it in more detail. The device consists of a regular fork, a container with the olfactory stimulus, a fan, a battery, and an AR marker. An AR-HMD tracks the marker and by that, knows the position of the food item. Thus, the color of food can be changed. The fan disperses the smell from the container toward the user's nose. We build upon this device with our Smell-O-Spoon (c.f. Sect. 4).
The related work shows that there is ongoing research interest in understanding and formalizing how VR and AR can be used to influence MSI. So far, it has been shown that inherent properties like saltiness or sweetness can be modified by presenting quality-congruent odors. It has also been shown that these properties can be influenced while consuming a food or a beverage by applying one or a set of congruent stimuli (e.g., MetaCookie+ or Vocktail). Besides trying to reproduce these results in other settings and with other food items - a contribution in itself - we set out to investigate the relationship during MSI when visual, olfactory, and gustatory stimuli are congruent but also incongruent and when related to more complex flavor objects (e.g., zucchini to cucumber). By that, we hope to deepen the understanding of MSI and pave the way for future multisensory user interfaces. Thus, we present an experimental laboratory study where we investigate how participants react when they eat a tasteless food product and are, at the same time, presented with virtual visual and artificial olfactory stimuli.

PRE-STUDIES
In the following sections, we present the results of two pre-studies. In the first one, we derived a food product with a neutral taste and without odor. In the second one, we selected appropriate odor samples. The pre-studies were performed in accordance with the ethical guidelines of the host institution and the guidelines proposed in the declaration of Helsinki.

Identifying a neutral food product
As we wanted to investigate vision and olfaction, we needed a product that - at best - has no taste or odor at all. At the same time, we wanted to be able to process the food into a mash. We selected a mash as our gustatory stimulus as it is easy to process but also a believable product, as various food mashes already exist (e.g., for children). To get a neutral product, we tested a variety of food products regarding taste and odor, but also regarding their consistency when mashed. A pre-selection of fruits and vegetables resulted in the inner part of raw zucchini, raw potatoes, raw cucumber, and cooked tofu (inspired by Narumi et al. [34]).
For our pre-study, we invited participants to taste the products, which were served as a cold mash. Participants wore sleeping masks, so they could not see the color of each mash, and were told to focus on gustation and olfaction. The experimenter placed a portion of mash on the tip of a spoon and told the participants to start eating. When a participant finished the trial, they were told to take off the mask and answer the questionnaire. The mash of each product was served one after the other, following the same order for each participant. After each trial, participants were advised to drink water to clean the oral cavity [25].

Measures
The questionnaire asked for the name of the product, for neutrality (1 = "not neutral at all" to 5 = "very neutral"), and whether participants were confident in identifying the product (1 = "not able to identify at all" to 5 = "clear identification"). The next questions asked for familiarity (1 = "not familiar at all" to 5 = "very familiar") and intensity (1 = "very weak" to 5 = "very strong"). The questionnaire can be found in the supplemental material [61].

Sample
We invited ten participants (6 female, 4 male; mean age 26.5 years, range 21-36 years, SD = 4.4 years). None had food allergies or intolerances. All gave informed consent, were informed about their rights, and were told that they could stop the experiment at any time. We did not undertake any specific measures for sample diversity. There was no compensation.

Results

Regarding zucchini, 7 out of 10 participants could not identify the product at all. The other 3 had a vague idea but could not confidently identify zucchini. Based on these results, we chose a mash made out of the inner part of zucchinis (cooked but served cold) as our close-to-neutral and close-to-odorless product.

Deriving odor samples
Having a food product with little-to-no identifiable taste, the other ingredient for our study is a set of odor samples of food items.

Initial odor selection and sample creation process
We tested artificial odors (essential oils) from two companies, herrlan-shop.de [49] and aromakonzentrate.com [20]. We ordered the following samples: apple, cherry, banana, and tomato [49]; pear, orange, radish, cucumber, carrot, and cabbage [20]. This selection was based on availability.
For each chemical odor, we created a matching natural odor [15,16]. Here, we took 45 grams of the most intense part - the peel [17] - and mixed it with 15 ml of glycerin. The mixture was then conserved in a hermetic preserving jar for two weeks. Every day, the jars were shaken for 20 seconds. After two weeks of preserving, the liquid was separated from the peel, and the glycerin had acquired the smell [16] (c.f. A and B in Fig. 1).
Having a set of odors, an initial pre-selection by the authors was necessary: the samples of apple, pear, and cabbage were excluded because of a weak natural smell. The samples of radish and cherry were excluded because of an aggressive and unpleasant smell.

Procedure
Participants arrived on site and were informed about their rights and the purpose of the study. Next, participants were instructed to smell each odor and to answer the questionnaire. The odors were banana, cucumber, tomato, carrot, and orange; for each one, we sampled a chemical as well as a natural version. For each odor, we applied 5 drops to a separate piece of tissue. Participants experienced the odors in random order. We ensured that the testing environment was well-aired to avoid lingering odors.

Measures
We again asked for the name of the product, the perceived neutrality of the smell, confidence in the identification of the smell, familiarity, and intensity. Inclusion criteria were high identification values and medium-to-high intensity ratings. A product that smells very strong is not ideal because the level of pleasantness decreases [6]. Moreover, the odors should reach high values in familiarity [6]. The questionnaire can be found in the supplemental material [61].

Sample
The preliminary smell experiment was conducted with ten participants (5 female, 5 male; M = 26.4 years, range 21-36, SD = 4.03 years). All reported no condition that impaired their sense of smell or taste and no food allergies or intolerances. All gave informed consent, were informed about their rights, and were told that they could stop the experiment at any time. We did not undertake any specific measures for sample diversity. There was no compensation.

Results

Chemical banana as well as natural cucumber scored high in pleasantness and familiarity; both also showed an above-average intensity. Chemical tomato was also rather familiar and intense but tended to be slightly unpleasant. Natural carrot was easily identified, familiar, and intense, but less pleasant; here, the high identification value led us to select it for the main study. Thus, the final odors are chemical banana, natural carrot, natural cucumber, and chemical tomato (labeled with "*" in Fig. 3).

SMELL-O-SPOON
We needed a device to display the smell while providing participants with the possibility to eat the mash while wearing the VR headset. Fig. 4 shows our Smell-O-Spoon, inspired by "TransFork" [25]. It consists of a fan (A; 15 mm, 5 V, 9300 rpm), a metal spring holding the smell samples (B), a common household spoon (C), a USB power supply (D), and a marker for a motion tracking system (E) attached to a 3D-printed extension. We selected a wired solution (0.25 mm wire) to avoid the heavy weight of a battery and a non-uniform airflow due to decreasing battery charge. The metal spring as well as the marker are attached via Velcro tape. We covered the metal spring with black, non-reflective tape to avoid interference with the optical tracking system; similarly, we sanded down the spoon to minimize reflections. The fan has a USB cable connection including a switch and a potentiometer. The potentiometer was used to regulate the speed of the fan and was set to 500 Ω. To minimize the vibration caused by the fan, a piece of foam was added directly below the smell container (Fig. 4, bottom/right). An OptiTrack motion tracking system [36] tracked the real spoon's position.
Following the procedure of Vi et al. [59], a little smell pad was prepared for each of the four odors. The pad consists of tissue infused with five drops of the odor sample and is wrapped in tape. We used tweezers to place the smell pad into the spring. The pads are easily interchangeable and due to the tape, no liquid contaminates the Smell-O-Spoon. Fig. 5 shows such a smell sample.
Together, the Smell-O-Spoon and the pad weigh 83 grams. For comparison, the spoon alone weighs 23 grams.

MAIN STUDY
With the odor samples and the neutral food product ready, we describe the main study in the following sections. Visual and olfactory stimuli are banana, carrot, tomato, and cucumber. In addition to that, two irritation products were included (visual only: mushroom and radish). We included these products as we assumed that participants would quickly match the smells and images; the irritation products were supposed to disrupt any pattern. Fig. 1 (D) shows images of the products as seen in VR.
Our main experiment had three phases: (1) an online evaluation of screenshots of the products, (2) an onsite evaluation of the odor samples, and (3) the multisensory VR experiment in which participants ate the neutral mash. In each phase, participants answered a series of questions. In general, our questionnaires are inspired by prior research from Chen et al. [5], Chrea et al. [8], and Chifala and Polzella [6]. We adopted their questions about familiarity, pleasantness, and intensity. Chifala and Polzella investigated the contribution of odor to flavor perception; we adopted their grid-based rating system for sweet/sour and intense/mild as well. We had to modify their questions to evaluate the specific aspects of our tri-modal setting, which also includes gustation. None of these questionnaires have been validated. We acknowledge this limitation but still believe that the questions are appropriate to answer our research questions.

Procedure
After they registered for the experiment, participants received a link to the online questionnaire to evaluate the screenshots. It contained the informed consent form and the purpose of the study, and asked for demographic information (age, gender) and VR familiarity. The screenshots were presented in random order.

Measures
For each image, we asked for an identification of the product, how much participants liked it (1 = "not at all" to 5 = "like it a lot"), and for pleasantness (1 = "not pleasant at all" to 5 = "very pleasant"). We also asked for familiarity (1 = "not at all familiar" to 5 = "very familiar") and intensity (1 = "very weak" to 5 = "very strong").

Procedure
The second phase took place onsite in the laboratory and included an evaluation of the olfactory stimuli. The experimenter handed the smell samples to the participant in randomized order.

Measures
For each sample, participants answered a questionnaire. First, they had to identify the odor. Next, they answered the same questions as in phase 1 about how much they liked the product, its pleasantness, familiarity, and intensity. We use the values from phases 1 and 2 to determine whether the perception of the products changes when congruent or incongruent stimuli are presented in phase 3.

Procedure
When participants had finished phase 2, they were guided to the VR lab. Here, they put on the HTC Vive Pro Wireless while standing in the door frame. The VR scene showed a replication of the actual laboratory, and static objects provided passive haptic feedback (e.g., desks, computers; c.f. Fig. 6). Next, participants received training in handling the spoon (without food) so that they could confidently put it in their mouth and back on the table. As soon as they felt comfortable, the experiment started.

For each trial, the food was visualized as a whole, sliced, and as mash to present salient visual cues. At the beginning of each trial, the experimenter inserted a smell pad into the Smell-O-Spoon, scooped a portion of the neutral product onto the spoon, and placed it in the bowl. The participant was then asked to pick it up and eat the product while smelling the odor from the pad and seeing the product in VR.

Each participant performed 20 trials in a counterbalanced order. 16 trials covered all combinations of the olfactory and visual stimuli banana, carrot, tomato, and cucumber (4 × 4 = 16). In addition to that, 2 trials with mushroom and 2 trials with radish were added as irritation products (without any artificial olfactory stimulus). Note, participants always ate the zucchini mash but saw and smelled the selected products. We split the 16 + 2 + 2 = 20 combinations into 4 sub-groups. Each group contained three incongruent pairs of olfactory and visual stimuli, one irritation product, and one trial where olfaction and vision were congruent. Overall, we ensured that every participant experienced every combination of the olfactory and visual stimuli under observation (16 in total). Table 1 provides an overview of the total trials in the VR condition: 30 × 4 = 120 trials were bi-modal congruent trials in which the visual and olfactory stimuli displayed the same product (30 participants, 4 trials per participant, one per product) while the gustatory stimulus was incongruent. In 30 × 4 × 3 = 360 trials, all three stimuli were incongruent (tri-modal incongruency; 30 participants, 4 visual products, 3 incongruent odors each). In addition, 30 × 2 × 2 = 120 trials presented the irritation products (2 × 30 = 60 for radish and 2 × 30 = 60 for mushroom), which we added to disrupt any patterns. Thus, each participant performed 20 trials: 4 irritation trials, 4 congruent trials (each product once), and 12 incongruent trials.
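For illustration, one participant's trial structure can be generated programmatically. The following is a minimal Python sketch, not the actual study software; the function and variable names are our own, and the plain shuffle merely stands in for the counterbalancing scheme described above:

```python
import itertools
import random

PRODUCTS = ["banana", "carrot", "tomato", "cucumber"]
IRRITATION = ["mushroom", "mushroom", "radish", "radish"]  # visual only, no artificial odor

def build_trials(seed=None):
    """Build one participant's 20-trial list (hypothetical helper)."""
    rng = random.Random(seed)
    # 4 x 4 = 16 combinations of olfactory and visual stimuli,
    # 4 of which are bi-modal congruent (smell == visual).
    trials = [{"smell": s, "visual": v}
              for s, v in itertools.product(PRODUCTS, PRODUCTS)]
    # 2 x 2 = 4 irritation trials without an artificial olfactory stimulus.
    trials += [{"smell": None, "visual": v} for v in IRRITATION]
    # Stand-in for the counterbalancing into 4 sub-groups of
    # 3 incongruent + 1 congruent + 1 irritation trial each.
    rng.shuffle(trials)
    return trials

trials = build_trials(seed=1)
assert len(trials) == 20
assert sum(t["smell"] == t["visual"] for t in trials) == 4  # congruent trials
```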
Between each trial, participants drank pure, non-sparkling water to clean the oral cavity [25]. The laboratory was ventilated with a constant stream of fresh air.

Measures
Following each trial, participants answered a questionnaire. We again asked for the name of the product they thought they ate, pleasantness (1 to 5), and intensity (1 to 5). In addition to that, we asked participants if what they saw and what they smelled differed (yes/no).

Materials
The VR environment was built using Unity 2020.3. The Smell-O-Spoon was tracked using Motive 2.1. A view from a third-person perspective is illustrated in Fig. 1 (E). More information on the VR application is available upon request.

Sample
A total of 30 participants (15 female, 15 male; mean age 28.83 years, range 21-61 years, SD = 8.79 years) were recruited via a mailing list of the local university and via Facebook. The selection criteria were as follows: healthy individuals with normal vision and olfaction, no interfering history of neurological or psychiatric disorders, no synaesthesia, and no food intolerances or allergies considering fruits and vegetables. Participants were informed about the risks of bodily reactions and that the food products could trigger adverse effects. Besides that, a contact tracing form due to the COVID-19 situation had to be signed by the participants. All gave informed consent, were informed about their rights, and were told that they could stop the experiment at any time. The study was executed following the guidelines of the local university, the national research organization, and the Declaration of Helsinki. We did not undertake any specific measures for sample diversity. The experiment was scheduled for about 45 minutes per participant. There was no compensation.

Table 2: Percentages of people who thought they experienced congruent or incongruent visual-olfactory stimuli, grouped by trials. The number in brackets is the chance at random and the deviation of our results from random.

RESULTS
Participants experienced congruent or incongruent visual, gustatory, and olfactory stimuli in VR. The visual stimuli were banana, cucumber, tomato, carrot, radish, and mushroom (the latter two being irritation products). Each olfactory and visual stimulus was sampled four times per participant; for the irritation products, no smell at all was presented. The olfactory stimuli were banana, cucumber, tomato, and carrot. The gustatory stimulus was always a cold-served mash of zucchini. The olfactory and neutral gustatory stimuli were derived in pre-studies (c.f. Sect. 3).

RQ1: Ability to integrate congruent visual and olfactory stimuli while eating tasteless food
We split the analysis into two parts: First, we present data on whether participants were able to form a unified percept out of the bi-modal congruent visual-olfactory stimuli ("Is what you smell and see the same or different?"). Second, we report on whether the bi-modal congruent visual-olfactory stimuli together with gustation led to a percept that aligns with the visual-olfactory stimuli ("What do you eat?").

Identification of visual-olfactory congruency
In the VR environment, participants had to state with yes or no whether what they smelled and what they saw represented the same stimulus. Table 2 shows the percentages of people who said the stimuli were congruent or incongruent; it also contains the values of chance and the difference between chance and our results. Note, out of 480 total trials, the number of bi-modal congruent trials (120) was lower than the number of tri-modal incongruent trials (360). A Chi-squared test revealed a difference in response distributions between congruent and incongruent trials (χ²(1, N = 480) = 24.6, p < 0.0001, φ = 0.23). That means that participants answered differently in the tri-modal incongruent trials compared to the bi-modal congruent trials. Responses on a perceived visual-olfactory congruency in bi-modal visual-olfactory congruent trials were not significantly different from chance (p > 0.3338), according to a binomial test. Thus, participants were - unexpectedly - close to chance (12.5% ± 1.5%; 53 and 67 responses out of the 480 total trials, i.e., within the 120 bi-modal congruent trials). That indicates that participants were not able to integrate the visual and olfactory stimuli into a single percept while eating a tasteless food product: they did not detect matching olfactory and visual stimuli.
Results of a binomial test suggest that responses on perceived visual-olfactory congruency in tri-modal incongruent trials are significantly different from chance (p < 0.001). Thus, participants were - as expected - better than chance at detecting the incongruent stimuli.

Fig. 7: Overview of the products participants thought they consumed, grouped by tri-modal incongruency (blue, orange, gray) and visual-olfactory congruency (yellow, green, red). Cucumber, carrot, banana, and tomato (*) were visual and olfactory stimuli. Participants actually ate zucchini (+).
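For transparency, the congruency-detection comparisons above can be sketched with scipy. This is a minimal sketch, not our analysis script: the 53 vs. 67 split stems from the bi-modal congruent trials reported above, while the 74 vs. 286 split for the tri-modal incongruent trials is inferred from the 74 trials mentioned in the discussion of RQ1; this assumed table reproduces the reported χ² of 24.6, but whether the exact p-values match depends on test details (e.g., sidedness) not stated in the text:

```python
from scipy.stats import binomtest, chi2_contingency

# "congruent" vs. "incongruent" answers per actual trial type.
answers = [[53, 67],    # bi-modal congruent trials (reported)
           [74, 286]]   # tri-modal incongruent trials (inferred)
chi2, p, dof, expected = chi2_contingency(answers)  # Yates-corrected by default

# Detection vs. chance (50% yes/no per trial):
p_congruent = binomtest(53, n=120, p=0.5).pvalue      # n.s.
p_incongruent = binomtest(286, n=360, p=0.5).pvalue   # p < 0.001
```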

Identification of consumed food (gustation)
Overall and considering all trials, participants mentioned a large variety of products; Fig. 7 illustrates them. Zucchini was mentioned only 7 out of 480 times, indicating that the zucchini mash is a good neutral product whose lingering taste can be hidden by olfactory and visual stimuli. Participants mentioned radish 21 times and mushroom 23 times. However, radish was mentioned only once while the visual cue also showed radish. The remaining times, radish and mushroom were named either during visual-olfactory congruency (where neither radish nor mushroom was displayed) or during tri-modal incongruency (where, again, neither mushroom nor radish was displayed). Note, both products acted as distractors to prevent participants from focusing on the four core products banana, tomato, carrot, and cucumber. In retrospect, given the large variety of foods that were named, these distractors would not have been necessary.
While participants mentioned cucumber, carrot, banana, and tomato most often - regardless of congruent or incongruent condition - they also mentioned other products like apple, melon, or cornflakes. Some participants were not able to name any product and gave no answer at all - even after being prompted repeatedly - or simply said vegetable or fruit. This means that participants had trouble forming a unified percept out of the multisensory stimuli. Note, this happened almost equally often in tri-modal incongruency (n = 36; 10% of 360 trials) and in trials with visual-olfactory congruency (n = 10; 9.2% of 120 trials). During tri-modal incongruency, many named items were completely unrelated to the stimuli we displayed (201 out of 360; blue in Fig. 7). This means that participants mentioned a food that was displayed by neither vision, olfaction, nor gustation.
For 120 trials, vision and olfaction were congruent (bi-modal visual-olfactory congruency; yellow, green, and red in Fig. 7). Table 3 shows the number of correct identifications for the matched trials, grouped by stimulus under investigation. Cucumber was identified most often (20 out of 30). Besides this, more than half of the matched carrot trials were identified as such (17 out of 30). Banana was named in 14 of its 30 matched trials (46.7%), and tomato was named 9 times. Out of a total of 120 matched trials, participants identified 60 products as what they saw and smelled, following the congruent visual and olfactory stimuli. Assuming that participants do not integrate the multisensory cues and considering the 480 trials, a 25.0% chance to get a congruent trial, a 50% chance to mention either vision or olfaction (and not zucchini or a random product), and the four products to guess from, the chance to guess the product at random is 3.13%.

Table 4: Number of times participants followed vision or olfaction during incongruent trials. The chance to randomly follow vision or olfaction (and not gustation or neither) is 18.75% each.
Results of a binomial test suggest that the number of times participants followed the bi-modal congruent visual-olfactory cues was not significantly different from chance (p > 0.1467). Thus, our results are very close to chance, with cucumber (4.17%) and carrot (3.54%) being slightly above chance whereas banana (2.92%) and tomato (1.88%) are below chance. This means that, while some participants named the product they saw and smelled, overall, the bi-modal visual-olfactory congruent stimuli did not influence multisensory perception while eating our neutral food product; rather, results are close to chance.

Table 4 illustrates the proportions of which stimulus aligned with the product identification of participants when presented with tri-modal incongruent stimuli. Here, in 97 trials (26.9% of 360), participants named the visual stimulus when asked to identify the product; in 54 trials (15.0% of 360), they identified the smell. 203 times, participants named a completely different product. This means that in 41.9% of the 360 trials, participants named a product that was represented by either the visual or the olfactory stimulus; in the remaining 58.1%, they mentioned a completely different product (or no product at all; c.f. Fig. 7). Assuming that participants do not integrate the multisensory cues and considering the 480 trials, a 75% chance to get an incongruent trial, and a 25% chance to mention either vision or olfaction (and not zucchini or a random product), the chance for following vision and for following olfaction is 18.75% each, whereas the chance for another product is 35%. According to binomial tests, participants did not mention the product they saw more or less often than chance (p = 0.413). However, they mentioned what they smelled less often than chance (p < 0.001) and mentioned other products more often than chance (p < 0.001). Thus, fewer people than chance predicts followed olfaction (-7.5%), and more people than chance predicts mentioned a completely different product (+6.04%).
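The chance levels used in this analysis follow from a simple no-integration model, sketched below. This reflects our reading of the derivation above; the variable names are ours:

```python
# Chance model assuming no integration of the multisensory cues.
p_congruent_trial = 120 / 480     # 25% of all trials were bi-modal congruent
p_incongruent_trial = 360 / 480   # 75% were tri-modal incongruent

# Congruent trials: follow the single vision/olfaction product with p = 0.5,
# then hit one specific product out of the four with p = 1/4.
p_guess_congruent_product = p_congruent_trial * 0.5 * (1 / 4)   # 0.03125 = 3.13%

# Incongruent trials: vision and olfaction show different products; each is
# mentioned with p = 0.25 (and not zucchini or a random product).
p_follow_vision = p_incongruent_trial * 0.25     # 0.1875 = 18.75%
p_follow_olfaction = p_incongruent_trial * 0.25  # 0.1875 = 18.75%
```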

Intensity and Pleasantness

We measured the intensity and pleasantness in all three phases (picture, odor, VR) to control for the general attitude of participants towards the products. Fig. 8 shows the data on pleasantness and intensity grouped by the three phases: In phase 1, participants were asked how intense and pleasant an image of the product was; in phase 2, how intense and pleasant the smell was; and in phase 3, they answered the same questions about the product they thought they consumed.
Overall, pleasantness ratings were average (M = 3.3) on a scale from 1-5. However, there are significant differences between the picture phase (phase 1) and phases 2 and 3. A possible explanation for this could be that, when humans see a product, they imagine the best taste [60]. It could also indicate that the selected odors were not perceived as pleasant and impaired the ratings. The latter assumption is backed by Fig. 3 where the pleasantness ratings of the products range from M = 2.1 (SD = 1.37) for tomato to only M = 3.6 (SD = 0.97) for cucumber. In addition to that, in the pre-studies, the scents were compared with ones that did not end up in the main study, but which were rated as less pleasant, thereby inflating the ratings of the ones that were, comparatively, better.
Regarding intensity, the VR condition was rated less intense compared to odor and picture evaluation. We believe that the cross-modal mismatch led to the low-intensity ratings because participants could not form a common percept and attribute intensity to it. They rather perceived the incongruent cues and found none of them particularly intense (c.f. Dalton et al. [10]). Thus, our results suggest that when intensity is relevant in VR-based human-food-interaction, cross-modal congruency is a factor to consider.

RQ1: Do participants integrate congruent visual and olfactory stimuli into a single percept while eating a tasteless food?
RQ1 asks for the ability of people to integrate congruent visual-olfactory stimuli while experiencing gustatory incongruency. We expected that a majority of participants would detect the tri-modal incongruent as well as the bi-modal congruent stimuli. To answer this question, we asked participants whether what they saw and what they smelled was the same and to name the product they ate.
As expected, significantly more participants than chance predicts detected the tri-modal incongruency (see Sect. 6.1.1). Still, in 74 incongruent trials, participants thought they sensed matching stimuli - thus seemingly integrating the three incongruent cues. A closer look at the data did not reveal a specific product combination driving those 74 trials. This rather large number motivates further studies on what led to these answers. We suggest adding open questions or a semi-structured interview to the experimental procedure to further investigate participants' reasoning and decision processes. Grounding the research within an alternative theoretical framework might also explain these results (c.f. Sect. 7.4).
Contrary to our expectations, the number of identified visual-olfactory congruent stimuli was not significantly different from what chance predicts (see Sect. 6.1.1). These results suggest that people were not able to integrate the bi-modal congruent cues (vision and olfaction) into a single percept while eating a neutral food product. One reason for this could be that the texture of the real food product as well as retronasal smelling led to a sensory conflict and hampered MSI. The former means that, while we presented a mash, the mouthfeel of the zucchini mash differed from that of a mash made of the products depicted by our bi-modal congruent cues (banana, tomato, carrot, cucumber). The latter would mean that a light lingering smell of the zucchini mash - although not identifiable (c.f. Sect. 3) - creates a sensation that is perceived by the olfactory system through the oral cavity. In addition to that, and despite the encouraging results of the pre-study in which we selected the smell samples (c.f. Fig. 3), we cannot rule out that the visual and olfactory stimuli were simply perceived as different by participants. In summary, most participants did not integrate tri-modal incongruent stimuli (as expected) but had problems integrating the congruent visual-olfactory stimuli into a unified percept (unexpected).
We further expected that MSI in the bi-modal congruent stimuli would lead to participants identifying the consumed food more often as what they saw and smelled (as has been shown for AR with a cookie [33] and for beverages [43]).
In 60 out of 120 trials, participants named the product that they saw and smelled (c.f. Table 3). Interestingly, this includes 7 identifications where participants did not think that what they saw and what they smelled was the same - but they still answered correctly. The other 60 trials did not lead to identifications that followed the congruent cues.
Overall, the results were not significantly different from chance and thus in opposition to other approaches such as MetaCookie+ [32], Ammann et al. [2], and Vocktail [43]. It could be that changing an inherent quality of a food (neutral to sweet or chocolate [32], or neutral to sour or citrus [2]) is more promising than changing the food category (e.g., zucchini to tomato), as the difference between the perceived cues and the actual tasted food is too large. This is backed by the assimilation-contrast theory [48], which warns of too large discrepancies during MSI. It could be that our discrepancies were too large. These specific aspects need to be explored by further psychophysiological and perceptual user studies - for example, by investigating if the combined effects from Narumi et al. [32] and Ammann et al. [2] can be reproduced with mashes (e.g., changing sweetness and sourness).

RQ2: Are participants guided by their vision when forced to identify what they consume during visual-olfactory-gustatory incongruency?
To answer RQ2, we analyzed the 360 incongruent trials and tested whether more people than chance predicts named the product they saw.
Considering only the 360 tri-modal incongruent trials, the identification of products revealed that in 97 of 360 cases, participants named the product they saw, regardless of what they ate and what it smelled like. Similarly, in 54 out of 360 cases, they mentioned the smell and ignored what they tasted and saw. We expected the visual stimuli to be more dominant than olfaction when identifying incongruent stimuli, as vision is the dominant sense of humans [14,18].
However, our results do not support this dominance during tri-modal incongruency. While participants mentioned the smell significantly less often than chance, they mentioned what they saw neither significantly more nor less often than chance. This is indeed different from previous results observed in bi-modal experiments (e.g., Nambu et al. [29] and Tanikawa et al. [57]). We assume that the tri-modal incongruency (but also the combination of mouthfeel and texture and latent/faint retro-nasal smelling due to the residual smell of the zucchini) hampers multisensory integration compared to previous bi-modal conditions.
Participants often mentioned other products that were not related to any cue: in more than half of the incongruent trials (209 of 360), they selected a completely different product (c.f. Fig. 7). Here, it seems that participants were aware of the incongruent stimuli and were not able to integrate them into a unified percept. Thus, they tried to find a food product that matches all three sensations as closely as possible - without relying on any of the presented stimuli. This, again, could be related to the assimilation-contrast theory, which states that too large a discrepancy reduces participants' ability to form a percept [45,47].

Our work in the context of ecological psychology and the existence of only a single perceptual system
Stoffregen & Bardy [54,56] argue that there are no different parallel perceptual systems that perceive input from corresponding single ambient energy arrays (e.g., the optical energy array). In this case, there would be no need to impose a structure on those inputs (a process related to multisensory integration). In consequence, they argue, intersensory conflicts do not happen or exist at all, as information does not need to be integrated. They postulate that perception in the organism-environment system happens via higher-order information available in the global energy array. In this understanding, errors in perception and performance do not imply a lack of specificity but rather the need for further perceptual differentiation (learning) to properly exploit the information in the global energy array. In other words, there is no conflict between visual, gustatory, and olfactory stimulation; rather, the pattern described by these three single energy arrays does not correspond to the human experience and, thus, cannot be interpreted by our single perceptual system (without further perceptual differentiation).

This view offers alternative explanations for our observations. For example, it would explain why so many participants in our experiment mentioned seemingly unrelated items or had trouble settling on a single item: the mapping between the higher-order information of the global array and the perceptual system was simply not specified. With our setup, we present an experimental design that allows for the manipulation of individual parameters in the global array. In this interpretation, we do not manipulate the individual senses with our setup but rather expose the single perceptual system to a global energy array that is composed of our manipulated single energy arrays. Thus, we follow a method reciprocal to the one proposed by Fouque et al. [12], where we keep parts of the global array fixed and vary the individual forms of energy (e.g., within the optical array by changing the visual stimulus).
With that, a procedure similar to that of Mantel et al. [26], which objectively quantifies the contribution of our individual stimuli to the global energy array, might provide a means to better interpret and understand perception in tri-modal situations. Continuing this line of thought and inspired by ecological psychology, this might result in equations that describe the human's process of perceiving the global array and the parameter specifying "taste" within the organism-environment system. To do this, additional measures can be applied: for example, a 0-100 scale rating how closely the mash matches what participants see/smell, and/or a 0-100 scale judging the strength of the flavor of the corresponding food, could help to explore this theoretical framework. By that, it is possible to test the influence of the stimuli on the perceived strength of the food's flavor (whereby flavor can be seen as a higher-order variable).
To our knowledge, the perception of a global array by a single perceptual system has not yet been discussed within the context of AR/VR, taste/flavor, and the human-food-interaction community. However, our technical setup and our results can act as a basis for a complete set of pairwise comparisons following Fouque et al. [12]. Here, further conditions are necessary to allow for a full description of the comparison space (i.e., adding tomato, carrot, banana, and cucumber mashes) and to describe the individual influence of single energy arrays on the structure in the global array and thus on the higher-order information perceived by humans. Our work can then help to describe the organism-environment interaction within the context of taste perception, similar to Mantel et al. [26]. By that, our work can help to explore this theory and is also part of the research that tries to understand the fundamental processes of human perception and the sensitivity to structures in the global array [54,56].

Limitations
There are numerous individual factors that can affect the perception of taste, and due to the scope of this research, we could only integrate a few. For example, trigeminal sensations, mouthfeel, and also the individual contributions of retro-nasal and ortho-nasal olfaction [50] are promising factors for future research. Further, we had an international sample with varying experiences with food and products - a factor that should be investigated further [39]. We also did not instruct participants to refrain from wearing scented deodorants, aftershaves, or perfume, so our environment might have been contaminated despite the ventilation system. In addition to that, our VR system had some limitations. The spoon always displayed the smell as soon as the sample had been inserted, and always with the same strength (the fan was always on). However, a more advanced but still low-fidelity version that uses the positions of the spoon and the headset to drive the fan speed via a microcontroller could better support the bottom-up processes of MSI, such as strength and synchronicity.
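Such a closed-loop extension could, for instance, scale the fan's duty cycle with the tracked spoon-to-nose distance. The following is a hypothetical sketch; the serial port, baud rate, one-byte protocol, and the receiving microcontroller firmware are all assumptions and not part of our prototype:

```python
import serial  # pyserial; assumes a microcontroller that accepts duty-cycle bytes

def fan_duty(distance_m: float, max_dist_m: float = 0.4) -> int:
    """Map the tracked spoon-to-nose distance to a PWM duty cycle (0-255).
    The closer the spoon, the stronger the airflow."""
    d = min(max(distance_m, 0.0), max_dist_m)
    return round((1.0 - d / max_dist_m) * 255)

# Port name, baud rate, and the one-byte protocol are placeholders.
with serial.Serial("/dev/ttyUSB0", 115200, timeout=1) as link:
    distance = 0.12  # e.g., derived from the tracked poses of spoon and HMD
    link.write(bytes([fan_duty(distance)]))
```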
A further limitation of our study is the fact that we only investigated perception in tri-modal incongruency and visual-olfactory congruency. In return, this means that we did not investigate conditions with tri-modal congruency, gustatory-visual congruency, and gustatory-olfactory congruency. However, our results still provide a significant offset to prior work. For example, we show that an effect previously observed during bi-modal incongruency, the dominance of vision [29,57], does not necessarily occur during tri-modal incongruency. We also show that adding a seemingly tasteless food item disturbs MSI when trying to change food categories with congruent olfactory-visual cues (contrary to strengthening or weakening an inherent quality of a food item, which is possible [1,43]). Also, investigating all 5 conditions would have meant keeping participants in the lab for a long time and risking habituation or adaptation due to prolonged and repeated exposure to the stimuli [3,9]. Still, our interpretations of the results on detecting incongruence (Sect. 6.1) and on the guiding sensation (Sect. 6.2) are limited. For example, we cannot say if participants would have wrongly detected an incongruency in a tri-modal congruent condition, a gustatory-visual congruency, or a gustatory-olfactory congruency (i.e., Sect. 6.1). We aimed to avoid this problem by having a rigorous selection procedure for our stimuli, which ensured that participants were able to identify the single stimuli independently.

CONCLUSION AND FUTURE WORK
This research investigated multisensory integration (MSI) by modifying visual and olfactory sensations while eating real food in a virtual environment. We let participants eat a zucchini mash - a neutral food product - and presented them with congruent or incongruent visual and olfactory stimuli in VR. We were interested in whether participants can integrate the bi-modal visual-olfactory congruent cues into a single percept that differs from zucchini mash and how their sensory system reacts when vision, olfaction, and gustation are incongruent.
We first elicited a neutral food product (a cold mash of cooked zucchini) and smell samples (natural cucumber, chemical banana, natural carrot, and chemical tomato). We also presented the "Smell-O-Spoon", an easy-to-build device that allows eating in VR and can also display smell. Results of our main study indicate that participants are not guided by vision during tri-modal incongruency (as they are in bi-modal incongruency). We further show that trying to change complex flavor objects (e.g., zucchini to banana) seems to be harder than changing the inherent properties of a food item (e.g., increasing sweetness).
AR/VR systems that cater to multiple senses - beyond vision and audio, and especially including gustation and olfaction - promise to enhance aspects such as presence, but also social applications in AR/VR, social eating in AR/VR, and applications for the treatment of disorders such as autism spectrum disorder or schizophrenia. To realize these objectives and to build human-food user interfaces for AR/VR, a deep and thorough understanding of MSI and perception is necessary. Considering that, our tools (Smell-O-Spoon, smell samples, neutral food) and findings (complex properties are harder to change; vision is not a guiding sense during tri-modal incongruency) motivate further research in the domain of multisensory virtual environments to allow for, enable, and support these beneficial applications.
Possible future research directions are manifold (next to those mentioned in the limitations): Our work motivates future research on MSI, immersion, and perceived intensity to further explain our results. Integrating real food mashes of carrot, banana, cucumber, and tomato would provide knowledge on how three congruent stimuli (gustation, vision, olfaction) but also other congruent combinations (e.g., only gustation and olfaction) influence MSI. In addition to that, and to overcome the issue of participants being aware of incongruent stimuli, it might help to embed the experiment into a more playful or game-like setting to redirect the focus; for example, the whole experiment could take place in a restaurant setting.

ACKNOWLEDGMENTS
This work has been partially funded by the CYTEMEX project of the Free State of Thuringia, Germany (FKZ: 2018-FGI-0019).