Multimodal Emotion Recognition of Hand-Object Interaction

In this paper, we investigate whether information about the touches and rotations applied to an object can be used to classify the emotion of the agent manipulating it. We focus on sequences of basic actions (e.g., grasping, rotating), which are the constituents of daily interactions. We use the iCube, a 5 cm cube covered with tactile sensors and equipped with an accelerometer, to collect a new dataset of 11 persons performing action sequences associated with 4 emotions: anger, sadness, excitement and gratitude. Next, we propose 17 high-level hand-crafted features based on the tactile and kinematics data derived from the iCube. Twelve of these features vary significantly as a function of the emotional context in which the action sequence was performed. In particular, a larger surface of the object is engaged in physical contact for anger and excitement than for sadness. Furthermore, the average duration of interactions labeled as sad is longer than for the remaining 3 emotions. More rotations are performed for anger and excitement than for sadness and gratitude. The accuracy of a classification experiment with four emotions reaches 0.75. This result shows that emotion recognition during hand-object interactions is possible, and it may foster the development of new intelligent user interfaces.


INTRODUCTION
Studies of touch-based communication in human-human (HHI) [14,17] and human-robot (HRI) [2,6] interaction usually focus on social touch and exploit tactile data from people who are explicitly asked to perform conventional touch gestures such as a handshake or a hug [6]. The way one enters into contact with the surface of an object (as opposed to another person's body part) and manipulates it might also convey information revealing his/her affective state. Along these lines, Wang and colleagues discussed "non-symbolic touch", such that "no predefined meaning or code is needed for affect conveyance using touch" [30]. Our long-term aim is to develop emotion recognition models from multimodal data collected during natural hand-object interaction (HOI). Natural interaction, in our scope, refers to performing basic actions such as grasping, rotating, holding and passing, and their combinations, which are elements of everyday interactions with a variety of objects [27]. Previous works have often focused on: 1) interaction based on symbolic touch gestures, e.g., [8,15] (an important exception is the research on multi-touch screens and keyboards, e.g., [7,9]), or 2) non-symbolic interaction with soft objects, exploiting data on the pressure applied during contact, e.g., [11,18,26], or measuring directional photoreflectivity [28]. In this work, we focus on HOI, which can, but does not have to, be performed in a social context, i.e., in the presence of another person. We show that emotions can be differentiated using hand-crafted features extracted from tactile and kinematics data (without data about applied pressure), and we provide the results of a 4-emotion classification. At the same time, we acknowledge that the way one interacts with an object depends on certain properties of the object (e.g., its dimensions, shape, hardness) as well as on the activity performed.
In this study, we control these two factors by choosing one specific semantically-neutral object and by restricting the number of activities. Keeping these two factors fixed, we investigate whether the contact with the object's surface is influenced by the portrayed emotion. Once affect classification from such data becomes possible, this technology may be applied in several user interfaces designed for entertainment, well-being and motor rehabilitation. It could serve as a communication device for people with reduced verbal communication abilities [26], as a general-purpose affect sensor, e.g., for self-monitoring [11,22] and remote communication [31], or as a game interface. It could also be embedded in "smart" versions of everyday objects to sense the user's emotions.

ICUBE DEVICE
The iCube is a 5 cm hard cube weighing about 150 grams (see Figure 1), developed at Istituto Italiano di Tecnologia. It generates an asynchronous combination of tactile data (i.e., 2D tactile maps) and kinematics data (i.e., rotation angles expressed as quaternions). Touch sensing is based on a set of Capacitive Button Controllers, which enable the detection of simultaneous touches on the 4×4 cells of each of the six pads of the cube. The device is wireless and equipped with a battery. It does not collect touch pressure data. This choice reduces both the cost of the device and the amount of data to be transmitted to the PC. In past research, the iCube was used to analyze multimodal patterns in an exploratory task [24,25]. The main advantage of using the iCube to collect affect-related data is its semantically-neutral and simple shape. Given its similarity to several common objects, such as small boxes and containers, it allows us to carry out natural interactions. This choice is in line with Guribye and colleagues, who state that, when designing tangible interfaces, such familiarity is useful as it exploits "users' pre-existing understanding and interaction with similar objects from their everyday world" [11].

DATA COLLECTION
We collected data from people performing a sequence of actions (grasping, rotating, passing from one hand to the other, and handing over the iCube) associated with 4 emotions. We asked our participants to imagine that they felt specific emotions. We are aware that this procedure might result in more stereotyped (or exaggerated) actions, but we believe it is an acceptable simplification given the aim of this study. We focus on anger, sadness, excitement and gratitude. We selected them because they are placed in 4 different quadrants of the two-dimensional valence-arousal model. More specifically, the first 3 labels are mentioned in Russell's paper [21], with anger and excitement characterized by high arousal, whilst anger and sadness are characterized by negative valence. Gratitude does not appear in [21], but it was evaluated later. For example, in [13] gratitude is positive, but it is the emotion with the 6th lowest arousal, receiving a score of "3" on a 9-point scale among the 62 studied labels. Thus, it is reasonable to assume that this state is characterized by lower arousal than excitement. The suitability of this two-dimensional model for describing touch-based interactions was postulated in [1], and it is in line with previous works on video games. For instance, in [7] the authors, after performing a pilot study, focus on 4 emotions corresponding to the 4 quadrants of the valence-arousal model (i.e., frustrated, bored, excited and relaxed).
The participants were asked to perform one of two assignments (assigned randomly). In the first assignment (A1), 4 scenarios were used to provide an "emotional context". The task (i.e., the sequence of actions performed with the hands) does not vary, while the emotion and the imaginary object mentioned in the scenario change: 1) grabbing and passing a small box of chocolates as a gift for a favor received (gratitude), 2) grabbing and passing a beloved wooden figure that is now broken (sadness), 3) grabbing and passing an empty packet to a confederate while accusing him of stealing its content (anger), 4) grabbing and passing a closed "surprise" parcel to a confederate while asking him to open it for you (excitement). Thus, all scenarios require: grasping the iCube, rotating it to find a marker placed on one of its faces, approaching the confederate, and handing the iCube over to the confederate. When defining the scenarios, we took care to choose "imaginary" objects (e.g., a small box of chocolates) that would match the iCube's dimensions. The scenarios were written on 4 different pieces of paper. The order was randomized and the procedure was as follows. First, participants drew one scenario and read it. They were given some time to think about the story and to try to imagine themselves as the protagonist. They performed 5-6 trials, with a 5-10 s pause between trials. Next, they drew another scenario. For the second assignment (A2), the participants were instructed to perform the same task with the 4 above-mentioned emotions. Unlike in A1, in A2 the participants were given no instructions regarding what the imaginary object and the scenario could be. Thus, in A2, the paper sheets contained only the 4 emotion labels. Each participant performed only one assignment. Before starting, participants were given written emotion definitions taken from [20,23].
For both assignments, participants were asked to behave naturally. No precise instructions were given on how to perform the task. This choice was made because we believe that the way a person manipulates the object contains affective information, e.g., a person may rotate the object more or faster when she is angry than when she is sad. Hence, we intentionally did not ask the participants to perform exactly the same number of rotations, nor to always grab the cube in exactly the same manner. By combining the data of the 2 different assignments in one dataset, we expect that the classifiers' robustness might be improved. Indeed, classifiers need to recognize portrayed emotions and not specific actions related to the A1 scenarios, such as touching according to the characteristics of an imaginary object, e.g., a fragile vase. The initial positions of the participants, the confederate and the tables were always kept the same. When a participant faces the confederate, the iCube is placed on a table to the left of the participant, and the text scenario is placed on a table to the right. The confederate is 3-4 meters away from the participant's initial position. The confederate's position is fixed. The iCube has a marker (i.e., a sticker) which symbolizes the front of the imaginary object (e.g., the opening of a box). The marker is positioned so that participants cannot see it at the beginning of a trial.

FEATURES
According to the literature, humans tend to perform more expansive and quicker gestures when they feel high-arousal emotions such as anger, whilst they may tend to slow down the same gestures when feeling sadness [5,29]. Regarding social touch, a positive correlation between rated arousal and touch action motion energy was found during interpersonal socio-affective touch events [17]. Emotions associated with attachment (e.g., gratitude, sadness) were characterized by longer tactile contact than rejective ones (e.g., anger, disgust) in an HRI study [2]. Hauser et al. [12] show that the total contact area, touch duration and hand velocity can be used to differentiate portrayed emotions during hand-forearm contact. We took inspiration from these works, carried out in the context of HHI and HRI, to design a set of features for HOI. Our features estimate: the task duration, the amount of movement, the number of touch actions, the number of touch changes, and the area of contact with the cube surface.

Tactile data
Let a_{ijkm} = 1 if the cell at the intersection of the i-th row and j-th column of the k-th face of the iCube is touched at time (i.e., data frame) m of a segment, and a_{ijkm} = 0 if the same cell is not touched. A data segment in this study corresponds to the data captured from the moment a participant first makes physical contact with the iCube until she hands the iCube over.
Touch density. The touch density estimates how large a portion of the cube's surface is engaged in contact. For this purpose, we compute the average number of touched cells per data frame; we refer to this feature as AVG_TD (average touch density).
Touch variability. The touch variability estimates the quantity of contact changes during the task. We compute the number of changes in touched cells between two consecutive frames, and then average this value over the data segment, where n is the number of frames in a segment; we refer to this feature as AVG_TV (average touch variability). We also compute the standard deviation (SD_TV), standard error (SE_TV), and maximal value (MAX_TV) of the touch variability over the data segment.
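The tactile features described above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the array layout (frames × 6 faces × 4 × 4 cells), the function name `touch_features`, the averaging of the variability over the n - 1 consecutive frame pairs, and the use of the sample standard deviation are all assumptions made for the example.

```python
import numpy as np


def touch_features(frames):
    """Compute touch density and variability features for one data segment.

    frames: array of shape (n_frames, 6, 4, 4) holding 1 where a cell is
    touched and 0 otherwise (assumed layout: frame, face, row, column).
    """
    frames = np.asarray(frames, dtype=int)
    if frames.ndim != 4 or frames.shape[1:] != (6, 4, 4):
        raise ValueError("expected shape (n_frames, 6, 4, 4)")
    if frames.shape[0] < 3:
        raise ValueError("need at least 3 frames to estimate spread")

    # Touch density: number of touched cells in each frame.
    density = frames.sum(axis=(1, 2, 3))
    # Touch variability: number of cells whose state differs between
    # two consecutive frames (one value per frame pair).
    variability = np.abs(np.diff(frames, axis=0)).sum(axis=(1, 2, 3))

    def summary(values, suffix):
        sd = values.std(ddof=1)  # sample standard deviation (assumption)
        return {
            f"AVG_{suffix}": float(values.mean()),
            f"SD_{suffix}": float(sd),
            f"SE_{suffix}": float(sd / np.sqrt(len(values))),
            f"MAX_{suffix}": float(values.max()),
        }

    return {**summary(density, "TD"), **summary(variability, "TV")}


# Toy segment: no contact, then one cell, then three cells on face 0.
segment = np.zeros((3, 6, 4, 4), dtype=int)
segment[1, 0, 0, 0] = 1
segment[2, 0, 0, :3] = 1
features = touch_features(segment)
```

In the toy segment the per-frame densities are 0, 1 and 3 cells, and the two frame-to-frame transitions change 1 and 2 cells respectively. The SD_TD entry is produced only for symmetry with the variability statistics.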

Kinematics data
Rotation. To estimate the quantity of movement, we compute the total number of rotations. More specifically, we first compute the instantaneous angular variation by measuring the angle traversed over time for each of the three unitary axes orthogonal to the faces of the cube, using the method described in [25]. To quantify the total amount of rotation T_TR, we compute the maximum value among the three cumulative sums of the rotations. Next, we compute the average (AVG_TR), standard deviation (SD_TR), standard error (SE_TR), and maximum value (MAX_TR) on the data of the "mostly rotated" axis over the whole segment.
Dominant Rotation. Using the method of [25] applied to the data frame m and the approach presented in Section 4.1, we compute TDAC (total number of dominant axes changes) and ADAC (average number of dominant axes changes).
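One possible reading of the rotation features is sketched below in Python. The method of [25] is not reproduced in this paper, so the following is an assumption throughout: each quaternion is taken in (w, x, y, z) order, the "angle traversed" by a cube axis is taken to be the angle between its directions in two consecutive frames, the dominant axis of a frame pair is the axis that sweeps the largest angle, and ADAC is normalized by the number of frame pairs. The function names are hypothetical.

```python
import numpy as np


def rotation_matrices(quats):
    """Convert (n, 4) quaternions in (w, x, y, z) order to (n, 3, 3) matrices."""
    q = np.asarray(quats, dtype=float)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    w, x, y, z = q.T
    rows = [
        np.stack([1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)], axis=-1),
        np.stack([2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)], axis=-1),
        np.stack([2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)], axis=-1),
    ]
    return np.stack(rows, axis=1)


def rotation_features(quats):
    """Rotation features for one segment (angles in radians)."""
    mats = rotation_matrices(quats)
    if mats.shape[0] < 3:
        raise ValueError("need at least 3 frames")
    # Column k of each matrix is cube axis k expressed in the world frame;
    # the dot product of consecutive columns gives the cosine of the swept angle.
    cosines = np.einsum("mrk,mrk->mk", mats[:-1], mats[1:])
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))  # shape (n - 1, 3)

    totals = angles.sum(axis=0)        # cumulative rotation per axis
    main = int(np.argmax(totals))      # "mostly rotated" axis
    main_angles = angles[:, main]
    sd = main_angles.std(ddof=1)       # sample standard deviation (assumption)

    dominant = np.argmax(angles, axis=1)  # dominant axis per frame pair
    tdac = int(np.count_nonzero(np.diff(dominant)))
    return {
        "T_TR": float(totals[main]),
        "AVG_TR": float(main_angles.mean()),
        "SD_TR": float(sd),
        "SE_TR": float(sd / np.sqrt(len(main_angles))),
        "MAX_TR": float(main_angles.max()),
        "TDAC": tdac,
        "ADAC": tdac / len(dominant),  # normalization is an assumption
    }


# Toy segment: a 90-degree tilt, the reference pose, then a different tilt.
s = np.sqrt(0.5)
toy_quats = [[s, 0.6 * s, 0.8 * s, 0.0], [1.0, 0.0, 0.0, 0.0], [s, 0.0, 0.6 * s, 0.8 * s]]
feats = rotation_features(toy_quats)
```

Note that, under this reading, a rotation about one cube axis leaves that axis still and sweeps the other two, so the "mostly rotated" axis is the one whose direction changed most, not the axis of rotation.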

DATA ANALYSIS AND EXPERIMENTS

5.1 Statistical Analysis
Eleven persons (8 female, 1 left-handed) participated in the data collection. This resulted in 237 trials in total (60 sadness trials, and 59 trials each for anger, excitement and gratitude). For each trial, we extracted one data segment. The average segment length is 3.9 s (SD = 1.48 s). We ran a series of Kruskal-Wallis tests with Emotion as the independent variable, considering each data segment separately. The tests show significant differences for 12 out of 17 features. A significant main effect of Emotion (F(3, 228) = 3.665, p < .001) on the segment duration (the variable TIME) was observed. Post hoc comparisons using the Dunn-Bonferroni test showed that sadness segments (mean = 4.83 s) were significantly longer than those for anger (p < .001), excitement (p < .001), and gratitude (p < .005), whilst gratitude segments (mean = 4.2 s) were longer than those for anger (mean = 3.13 s, p < .001) and excitement (mean = 3.56 s, p < .01).
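For readers unfamiliar with the test, the Kruskal-Wallis statistic can be sketched in a few lines of Python. This is an illustrative implementation of the standard tie-corrected H statistic only; it does not compute the p-value (which would normally be taken from a chi-square distribution with one degree of freedom fewer than the number of groups) and it does not include the Dunn-Bonferroni post hoc step. The function name `kruskal_wallis_h` and the toy numbers are made up for the example and are not data from this study.

```python
import numpy as np


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of 1-D samples."""
    samples = [np.asarray(g, dtype=float) for g in groups]
    if len(samples) < 2:
        raise ValueError("need at least two groups")
    pooled = np.concatenate(samples)
    n_total = pooled.size

    # Average ranks: tied observations share the mean of the ranks they occupy.
    _, inverse, counts = np.unique(pooled, return_inverse=True, return_counts=True)
    mean_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = mean_rank[inverse]

    # Tie correction factor: 1 - sum(t^3 - t) / (N^3 - N).
    correction = 1.0 - (counts ** 3 - counts).sum() / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical")

    h, start = 0.0, 0
    for sample in samples:
        rank_sum = ranks[start:start + sample.size].sum()
        h += rank_sum ** 2 / sample.size
        start += sample.size
    h = 12.0 / (n_total * (n_total + 1)) * h - 3.0 * (n_total + 1)
    return h / correction


# Toy example with three fully separated groups (made-up numbers).
h_separated = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Toy example containing a tie between groups.
h_tied = kruskal_wallis_h([[1, 2], [2, 3]])
```

In the study's setting, each group would hold the values of one feature (e.g., TIME) for the segments of one emotion.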
These results are in line with previous studies on full-body expressive behaviors [5,29], which showed that emotions influence gesture velocity and acceleration. Significant results were also observed for the average (AVG_TD), standard error (SE_TD) and maximum (MAX_TD) of touch density; the average (AVG_TV) and standard error (SE_TV) of touch variability; the total number of dominant pad changes (TDPC) and dominant axis changes (TDAC); the average rotation (AVG_TR), its standard error (SE_TR) and standard deviation (SD_TR), as well as the maximum rotation (MAX_TR). The post hoc comparisons for all features are shown in Figure 1. On average, a larger surface of the iCube was contacted for anger and excitement than for sadness. At the same time, less touch variability can be seen for sadness than for anger and excitement. More rotations were performed for the two high-arousal emotions (i.e., anger and excitement) than for sadness and gratitude. This finding is consistent with the results mentioned just above (compare the graphs for AVG_TR and AVG_TV in Figure 1). From Figure 1, it can also be seen that the features TDPC and TDAC show a tendency similar to that of the feature TIME, and their values are particularly high for sadness. Indeed, these two features may depend on the task duration.
These results show that it may be possible to differentiate some of the targeted emotions. The effect of emotion was observed for most features. In particular, several significant results were obtained for anger and sadness, whilst it might be more difficult to differentiate the pairs excitement and gratitude, anger and excitement, and sadness and gratitude. At the same time, it seems that both kinematics and tactile data can be useful for emotion classification.

Classification
In this section, we check whether 1) automatic emotion recognition is possible above chance level, and 2) the features from both modalities contribute to the classification. To this end, we applied only 2 classifiers: a) an SVM with RBF kernel and b) Localized Multiple Kernel Learning (LMKL) [10]. To train them, we used the 12 features for which the effect of Emotion was observed in Section 5.1. Leave-one-out cross-validation was performed. SVM-RBF was chosen because it has been widely used in the past to classify emotions from tactile [15,18] and kinematics data [16,19]. Among the parameter values tested for SVM-RBF, the best performance was obtained for C = 5 and γ = −5 when all 12 features were used; C = 3 and γ = −2 when touch features were used; and C = 1 and γ = −2 when kinematics features were used. Before training SVM-RBF, z-normalization was applied. We compared SVM-RBF with LMKL. The latter was considered because it has shown impressive performance in many applications involving the analysis of human nonverbal behaviors (e.g., [3,4]). LMKL uses a nonlinear combination of kernel weights. In this work, for a fair comparison, it was combined with SVM. As the gating model, which selects the optimum kernel function locally, a softmax function was used, with the number of linear kernels varying from two to seven. The best performance was obtained for C = −1 and two kernels when the 12 features were used.
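The SVM-RBF part of this protocol can be sketched with scikit-learn as shown below. This is a hedged illustration rather than the authors' code: the data are synthetic, the hyperparameters are placeholders (the values of C and γ reported above are not reused, since their scale is not specified here), and refitting the z-normalization inside each leave-one-out fold is a design choice of this sketch. LMKL is not shown. The helper name `loocv_accuracy` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def loocv_accuracy(features, labels, C=1.0, gamma="scale"):
    """Leave-one-out accuracy of an RBF SVM with per-fold z-normalization."""
    # The pipeline refits the scaler on the training part of every fold,
    # so the held-out sample never influences the normalization.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=C, gamma=gamma))
    predicted = cross_val_predict(model, features, labels, cv=LeaveOneOut())
    return float(np.mean(predicted == labels))


# Synthetic stand-in for a (segments x 12 features) matrix with 4 classes.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 10)
features = rng.normal(size=(40, 12)) + 5.0 * labels[:, None]
accuracy = loocv_accuracy(features, labels)
```

To study a single modality, the same helper could be called on the subset of columns holding only the tactile or only the kinematics features.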
The results obtained with SVM and LMKL are reported in Table 1. In all cases, the accuracy is well above chance level (i.e., 0.25). The best results were obtained for sadness, which is nevertheless often confused with the second low-arousal emotion, i.e., gratitude. Gratitude is often confused with the other positive emotion, i.e., excitement. The worst recognition on average was observed for excitement. To study the contribution of each modality, we ran 2 experiments with SVM-RBF using: 1) 5 kinematics features (AVG_TR, SD_TR, SE_TR, MAX_TR, TDAC), and 2) 6 tactile features (AVG_TD, SE_TD, TDPC, SE_TV, MAX_TD, AVG_TV). According to the results, both sets contribute to the classification: the accuracy for kinematics features only was higher than the accuracy for tactile features only (0.45 vs. 0.41). In both cases, the accuracy is lower than that of the SVM-RBF baseline experiment with all 12 features, but higher than chance level. When SVM-RBF and LMKL are compared, LMKL performed much better, which is consistent with previous works that use both methods [3,4].

CONCLUSIONS
In this paper, we presented a pioneering work exploring whether it is possible to recognize emotions from tactile and kinematics data during everyday hand-object interactions. To the best of our knowledge, this is the first work that proposes a computational approach to deal with affect-related multimodal data of this type. In more detail, the main contributions of this work are: 1) we presented a set of high-level, easily interpretable features for tactile data; 2) we demonstrated that our tactile and kinematics features can be used to differentiate affective states; the results on tactile data are a novel contribution to the field of affective interaction; 3) we showed that emotion classification using multimodal data is possible, reaching an accuracy of 0.75 for 4 emotions. We have also shown that the iCube can be used to learn how the interaction with hard hand-held objects is influenced by humans' emotions, even without gathering information about the pressure applied during contact. Several future works are planned. To improve the performance, the effects of interpersonal differences and of the assignment need to be studied and addressed within an extended data collection. To evaluate the versatility of this approach, we will collect data using objects with different physical properties (e.g., significantly smaller and lighter) and other semantically-neutral 3D shapes. More research is also needed to see whether one can distinguish emotions that are similar in terms of arousal and valence, such as anger, frustration and anxiety. We will also collect data in an ecological setting, e.g., a game scenario, to gather spontaneous affective interactions.

ACKNOWLEDGMENTS
We are very grateful to Prof. Giulio Sandini, Antonio Maviglia, Marcello Goccia, Diego Torrazza, Elio Massa and the others who ideated and developed the iCube. A.S. is supported by a Starting Grant from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme. G.A. No 804388, wHiSPER.