The Cold Start Problem and Per-Group Personalization in Real-Life Emotion Recognition With Wearables

Recognizing emotions in real life from physiological signals recorded by wrist-worn devices remains a great challenge, mainly because annotated emotional events are difficult to gather. To this end, we suggest building pre-trained machine learning models capable of detecting intense emotional states. This work explores the cold start problem, in which no data from the target subjects (users) are available at the beginning of the experiment to train the reasoning model. To address this issue, we investigate the potential of per-group personalization and the amount of data needed to perform it. Our results on real-life data indicate that even a week’s worth of personalized data improves model performance.


I. INTRODUCTION AND RELATED WORK
For the last century, psychologists have been using physiological responses to affective stimuli to broaden the understanding of human emotions. Drawing on psychologists' cumulative work, scientists from the affective computing domain started using psychophysiological signals to develop algorithms that detect, process, and adapt to others' emotions. To allow machines to learn about specific emotions, researchers must acquire extensive and comprehensive datasets that offer abundant emotions and diverse physiological signals collected in an ecologically valid context, i.e., real life. However, the field of emotion recognition from psychophysiological signals has been dominated by laboratory studies in which emotions are elicited with standardized affect induction procedures. This limitation has recently been overcome by researchers collecting everyday life emotions with wearables [1], [2] and Experience Sampling Methods (ESM) [3] (also referred to as the daily diary method or Ecological Momentary Assessment (EMA) [4]-[6]).
Using embedded sensors from popular wearables like smartwatches or wristbands makes it possible to measure the behavioral and physiological components of emotions [1], [7]. Only a few studies have tried to recognize real-life emotions, i.e., to conduct research in the field, especially [2], [8], [9] lasting a dozen days, on pupils in the classroom [10], on workers in the factory [11], or [12]-[14] focusing primarily on mood. Except for [2], [15], these studies did not try to recognize emotions at particular points in time but rather averaged them over a longer period.
This work was partially supported by National Science Centre, Poland, project no. 2020/37/B/ST6/03806; by the statutory funds of the Department of Artificial Intelligence, Wroclaw University of Science and Technology; by the Polish Ministry of Education and Science, National Information Processing Institute - the CLARIN-PL Project.
Some other researchers tried to distinguish emotions in only one specific, shorter life context, e.g., while walking along a specific route in the city for a few dozen minutes [16]-[18], or babies playing in a limited area [19]. Schmidt et al. provided some hints for such studies in the wild [20].
The crucial, still open question is how to find the real-life moments in which individuals experience short, noteworthy emotions. The ESM provides high ecological validity through repeated in-the-moment measurement of experience, in which participants receive measurement notifications in a semi-random design. However, ESM can be further improved with recent developments in affective computing, in which the measurement moments can be detected by physiologically or behaviorally driven pre-trained machine learning (ML) models [15], [21].
Overall, ML models consist of the architecture/classifier and the data. This raises an additional issue: we need some initial data to train the pre-trained models. This issue is similar to the cold start problem commonly encountered and considered for recommender systems [22], [23]. The essence of the cold start problem is to prepare the system (model) to work for unknown users, for whom we have not collected any prior data.
Nevertheless, if we possess data from earlier field studies, we can create an initial model. Unfortunately, there are no publicly available datasets gathered in the field, and researchers have to rely on data they acquire on their own. Alternatively, we can use data collected in the lab, which in recent years have become more accessible [24]-[26]. However, a model trained on data captured in a controlled environment may perform poorly in real life [27].
Once we have the initial model, an interesting question arises: what should we do after running the study for a couple of weeks, when a sufficient number of samples has been collected? Should we just add new data before retraining the model, replace some old cases, or create a new model trained only on the new data? The decision is relatively easy when the previous and new studies are similar in setup, i.e., have the same participants, assessments, and apparatus. However, it is common to run a new study/iteration with new participants and/or slightly change (improve) the setup based on the feedback from the previous studies. In such a case, the model trained on the data from the previous study might not perform well because of a different set of participants (emotion recognition models are known to have poor generalization ability [28], [29]), a different setup, or, in general, concept drift [30], [31].
In this work, we investigate four scenarios of retraining and replacing the model once a sufficient number of samples is collected: Scenario S1, utilizing the old model; S2, replacing some old data with new data; S3, training the model on new samples only; and S4, retraining the model using all available data (old and new). All our experiments on real-life data show that adding new knowledge improves the model's performance, but the best results were achieved by the model built using only new data.
To some extent, this is similar to model personalization, except that we personalize the model per group of participants rather than per single participant. There have been several attempts at per-participant model personalization in field studies; however, they were unsuccessful due to the low number of per-person samples [28]. Per-group personalization can mitigate this problem.

II. STUDY SETUP AND DATA
A. The Emognition Framework
The Emognition system [15] includes a mobile Android application with an embedded pre-trained ML model, a smartwatch application recording physiological signals, and a back-end server storing all data. The smartwatch used in the framework is the Samsung Galaxy Watch 3, and the smartphones are Android-based devices owned by the participants. The connection between smartwatch and smartphone is handled by the Bluetooth Low Energy module. The 45mm version of the smartwatch, equipped with a 330mAh battery, can record up to 14 hours of physiological data before running out of power, while the smaller 41mm version, with a 240mAh battery, can work for up to nine hours. Physiological data are recorded continuously and noninvasively. The smartwatch provides raw blood volume pulse (BVP) sampled at 25 Hz, heart rate (HR) sampled at 12.5 Hz, RR-interval (RRI) sampled at 12.5 Hz, and 3-axis accelerometer data (ACC) sampled at 50 Hz. The device also provides other data: 3-axis gyroscope, 4-axis rotation, pressure, and ambient light. One hour of recording produces about 8.6 MB of compressed data. The data are transferred to the smartphone in real time and from there are uploaded to the back-end server every hour. The upload can also be triggered by the user.
For more details regarding the Emognition system, please refer to [15].

B. Data
In recent months, we have performed two daily life studies. The studies were alike but had a different set of participants and slightly modified self-assessment. We will refer to them as Study A and Study B.
The primary goal of Study A was to collect physiological signals during emotionally intense moments in participants' everyday lives. The collected emotionally annotated signals were then used for creating an ML model recognizing intense emotions in real-time [21]. The model was further used for more efficient data gathering in Study A and Study B. Study A involved 11 participants (four females) and lasted about seven months.
The main idea behind Study B, which is still in progress, is the validation of several predictive models and further data collection. Study B involves 13 participants (six females) and is designed to last two months. In the analysis, we consider only the first four weeks of Study B and only the five participants (two females) with the highest number of reported self-assessments. The changes introduced in the Emognition system in Study B include a shorter self-assessment and three types of assessment triggers.
Participants' emotions were collected with brief questionnaires delivered in three ways: ESM prompts at quasi-random times, prompts triggered by the machine learning model, and self-initiated reports. First, participants were asked whether they felt intense emotions (yes/no/not sure). Based on this question, we categorized emotions as intense emotions (yes) or neutral states (no). Next, participants reported valence on a slider scale from 1 (extremely negative) to 100 (extremely positive), and arousal on a slider scale from 1 (extremely sluggish) to 100 (extremely aroused). Finally, participants had the opportunity to provide comments as free text.
In total, 1075 self-reports (440 intense emotions and 635 neutral states) were collected throughout both studies (Tab. I). The total pool of data used in the analyses came from 16 participants (6 female) between the ages of 18 and 54 years (M = 26.86, SD = 8.29). All participants (volunteers) provided written informed consent and received no compensation for their participation. The research was approved by and performed in accordance with guidelines and regulations of the Bioethical Committee at Wroclaw Medical University, Poland; approval no. 149/2020.

III. EXPERIMENTAL SCENARIOS
We have designed four possible scenarios to choose from once the study obtains the required number of samples to create a decision model. The scenarios are visualized in Fig. 1. One of the scenarios utilizes all available data (from both studies) to train the model, whereas the other three analyze whether it is profitable to replace the previous samples in the training set with new samples. To ensure we analyze the quality of the samples, not the quantity, Scenarios S1 to S3 consider an equal number of samples.
Scenario S1 assumes training a model on data from the previous studies only (Study A). This is a classic example of validating the model's generalization ability since data in the test set come from different participants than data in the training set, which is the only possible scenario at the very beginning of a new study. S1 is based on 237 samples that were drawn from the entire Study A in a way that the number of samples in each class is equal to the number of samples in weeks 1 and 2 of Study B, i.e., 126 samples of intense emotions and 111 samples of neutral state were randomly selected. The sampling was repeated five times and the results presented in the latter part of the article are the average of the five runs.
Fig. 1: Four scenarios S1-S4 of using samples to train the classification model for the field study being currently conducted.
Scenario S2 utilizes part of data from Study A and adds data from the current Study B to create a model. This scenario is possible once we obtain some new data, but the amount is still too low to create an entirely new model. S2 includes 105 samples randomly selected from Study A and 132 samples from the first week of Study B. Like in the case of Scenario S1, the sampling was repeated five times and the results were averaged.
Scenario S3, per-group personalization, considers the model trained on the new samples only. It is trained with 237 samples collected during the first and second weeks of Study B. Its advantage is the same set of participants in the training and test sets.
Scenario S4, on the other hand, makes use of all available samples, i.e., data collected in Study A (undersampled to achieve balanced data) and all the data from the first two weeks of Study B. In total, 701 samples are used to build the predictive model.
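The four training sets described above can be sketched as index selections over the two studies. This is only an illustration under stated assumptions: the function and argument names (`draw_per_class`, `build_scenarios`, `y_a`, `y_b`, `week_b`) are hypothetical, the Study A samples for S2 are drawn without class stratification (the text does not say how they were drawn), and "balanced" in S4 is read as undersampling Study A to its minority-class size.

```python
import numpy as np

def draw_per_class(labels, counts, rng):
    """Return indices holding counts[c] randomly chosen samples of each class c."""
    picked = [rng.choice(np.flatnonzero(labels == c), size=n, replace=False)
              for c, n in counts.items()]
    return np.concatenate(picked)

def build_scenarios(y_a, y_b, week_b, seed=0):
    """Map each scenario to (Study A indices, Study B indices) used for training.

    y_a, y_b: class labels (1 = intense emotion, 0 = neutral state).
    week_b:   week number (1, 2, ...) of every Study B sample.
    """
    rng = np.random.default_rng(seed)
    none = np.array([], dtype=int)
    b_w1 = np.flatnonzero(week_b == 1)
    b_w12 = np.flatnonzero(week_b <= 2)

    # S1: Study A only, matching the class counts of Study B weeks 1-2.
    counts = {c: int(np.sum(y_b[b_w12] == c)) for c in (0, 1)}
    s1_a = draw_per_class(y_a, counts, rng)

    # S2: week 1 of Study B, topped up from Study A to the size of S3.
    # Assumption: the Study A part is a plain random draw.
    s2_a = rng.choice(len(y_a), size=len(b_w12) - len(b_w1), replace=False)

    # S4: Study A undersampled to equal class sizes, plus Study B weeks 1-2.
    n_min = int(min(np.sum(y_a == 0), np.sum(y_a == 1)))
    s4_a = draw_per_class(y_a, {0: n_min, 1: n_min}, rng)

    return {"S1": (s1_a, none), "S2": (s2_a, b_w1),
            "S3": (none, b_w12), "S4": (s4_a, b_w12)}
```

Calling `build_scenarios` with five different seeds would correspond to the five repeated drawings used for S1 and S2; S3 contains no random component.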
From the collected signals, we extracted windows of length 140 seconds, with the emotional event in the middle. We discarded windows with more than 10% of samples missing (compared to the expected amount, based on sampling frequency). Then, all signals were resampled using the resample function from SciPy [32]. Next, we extracted a window of 60s around the emotional event (30s per side, event in the middle) for each signal. The window was further divided into three parts, each of length 20s. This partition was done to allow ML models to analyze the physiology before, during, and after the event, and potentially learn shorter dependencies and relations present when we experience intense emotions.
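The preprocessing steps above can be sketched for a single signal as follows. This is a minimal illustration, not the pipeline used in the study: the function name `prepare_window` and its parameters are hypothetical, the target rate `fs_out` is an assumption (the text does not state it), and the surviving samples are treated as evenly spaced when passed to `scipy.signal.resample`.

```python
import numpy as np
from scipy.signal import resample

def prepare_window(t, x, event_t, fs_in, fs_out, max_missing=0.10):
    """Return three 20 s parts around event_t, or None if too many samples are missing.

    t, x:    timestamps (seconds) and values of one signal.
    fs_in:   nominal sampling frequency of the device (Hz).
    fs_out:  target frequency after resampling (Hz); an assumed parameter.
    """
    # 140 s window with the emotional event in the middle.
    mask = (t >= event_t - 70.0) & (t < event_t + 70.0)
    expected = int(round(140.0 * fs_in))
    if mask.sum() < (1.0 - max_missing) * expected:
        return None  # more than 10% of the expected samples are missing

    # Fourier-based resampling to a fixed number of points.
    x_rs = resample(x[mask], int(round(140.0 * fs_out)))

    # Central 60 s (30 s on each side of the event).
    start = int(round(40.0 * fs_out))
    stop = int(round(100.0 * fs_out))
    core = x_rs[start:stop]

    # Before, during, and after the event: three parts of 20 s each.
    return np.array_split(core, 3)
```

For a BVP trace sampled at 25 Hz, `prepare_window(t, bvp, event_t, 25.0, 25.0)` would return three arrays of 500 samples each.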

B. Features
For some experiments, it was necessary to extract features from signals. Computed features (see Tab. II) include standard statistical features, e.g., the min, max, and mean values of the signal, or its standard deviation. Moreover, we computed differences between consecutive parts of a window for max, min, mean, std, and variance (e.g., the difference between minimum values in the first and second part of a window). Furthermore, we computed features in the frequency domain, for example, minimum, maximum, or average values in the power spectrum. Additionally, for the BVP signal, we computed the mean value in low- and high-frequency power spectra. When creating a vector of features, features for all three parts of a window were concatenated, and two more date-related features were added: day of the week (0 (Monday) to 6 (Sunday)) and hour (0-23). In total, 746 (with ACC) or 418 (without ACC) features were supplied to classifiers/architectures.
The signal input comprised 12 channels in total (3 window parts × 4 signals). The deep learning architectures were programmed in PyTorch [33] according to an article by Dzieżyc et al. [34]. For classical machine learning algorithms, we used implementations from scikit-learn [35].
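A small sketch may help make the feature construction concrete. It covers only a subset of the features named above (per-part statistics, differences between consecutive parts, basic power-spectrum values, and the two date-related features). All function and feature names are hypothetical, and the sign convention of the differences (later part minus earlier part) is an assumption.

```python
from datetime import datetime

import numpy as np

STATS = {"min": np.min, "max": np.max, "mean": np.mean,
         "std": np.std, "var": np.var}

def part_features(parts):
    """Statistics of every window part plus differences between consecutive parts."""
    per_part = [{name: float(fn(p)) for name, fn in STATS.items()} for p in parts]
    feats = {}
    for i, stats in enumerate(per_part):
        for name, value in stats.items():
            feats[f"part{i}_{name}"] = value
    for i in range(len(per_part) - 1):
        for name in STATS:
            # Assumed convention: later part minus earlier part.
            feats[f"diff{i}{i + 1}_{name}"] = per_part[i + 1][name] - per_part[i][name]
    return feats

def spectrum_features(part):
    """Minimum, maximum, and mean of the power spectrum of one window part."""
    power = np.abs(np.fft.rfft(part)) ** 2
    return {"ps_min": float(power.min()), "ps_max": float(power.max()),
            "ps_mean": float(power.mean())}

def date_features(timestamp: datetime):
    """Day of the week (0 = Monday ... 6 = Sunday) and hour (0-23)."""
    return {"day_of_week": timestamp.weekday(), "hour": timestamp.hour}
```

Concatenating the outputs of these functions over all signals would give one feature vector per emotional event.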

D. Model Training and Optimization
To prepare datasets for Scenarios S1, S2, and S4, data were balanced using a random sampling technique. We treated these samples as a basis for splits of data used to tune hyperparameters (5 drawings resulted in 5 splits). Each of such splits was further randomly divided into training and validation parts. For S3, which did not require balancing, data was split into five parts as well to account for differences in training and validation splits.
The best hyperparameters were chosen based on hyperparameter optimization, which was done separately for each scenario and model. Models from scikit-learn were optimized using grid search. For deep learning models, we utilized random search, as it is more efficient [36] and thus better suited to the long training process. For each classifier, its hyperparameter space was tested using five-fold validation. In all cases, the best hyperparameters were chosen based on the mean F1 macro score. The best models were retrained on the whole splits and tested on the data from Study B weeks 3+4, see Fig. 1.
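For the scikit-learn models, the selection procedure can be approximated with `GridSearchCV`. The sketch below is an assumption-laden stand-in rather than the exact protocol: standard 5-fold cross-validation replaces the five random train/validation splits described above, the SVM is just one example classifier, and the helper name `tune_and_fit` and any parameter grid passed to it are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_and_fit(X_train, y_train, param_grid):
    """Grid search scored by mean macro F1, then refit on all training data."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    search = GridSearchCV(
        SVC(),
        param_grid,
        scoring="f1_macro",  # mean F1 macro over the validation folds
        cv=5,                # stand-in for the five splits in the text
        refit=True,          # retrain the best configuration on the whole set
    )
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_
```

The returned estimator would then be evaluated once on the held-out test data (Study B, weeks 3+4), which plays no part in the search.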

V. RESULTS
The results of each scenario and model are presented in Tab. III. The highest scores for each classifier/architecture and performance measure are bolded. S3 does not have mean values as there were no random subsets of the training sets in this scenario. We consider three metrics: (1) F1 on class 1, as we aim to recognize intense emotions properly and catching all possible emotional events is more important than capturing neutral states; (2) F1 macro, to monitor the overall performance of the model in emotional and neutral states; and (3) accuracy as another overall measure.
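The three measures can be written out explicitly for the binary case. The sketch below assumes labels 1 (intense emotion) and 0 (neutral state); the function names are illustrative.

```python
import numpy as np

def f1_for_class(y_true, y_pred, cls):
    """F1 score of one class: 2*TP / (2*TP + FP + FN)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

def report(y_true, y_pred):
    """F1 on class 1, macro-averaged F1 over both classes, and accuracy."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1_intense = f1_for_class(y_true, y_pred, 1)
    f1_neutral = f1_for_class(y_true, y_pred, 0)
    return {
        "f1_class1": f1_intense,
        "f1_macro": (f1_intense + f1_neutral) / 2,
        "accuracy": float(np.mean(y_true == y_pred)),
    }
```

Because F1 macro weights both classes equally, it can stay flat while F1 on class 1 moves, which is why the two are reported separately.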
In general, regardless of the model and feature set used, Scenario S3 performed better than other scenarios. This result shows the importance of model personalization in the emotion recognition task. The effectiveness of the predictive model gradually increases when we replace training samples from Study A (previous study) with the samples from the current Study B, see Fig. 2. This tendency is noticeable for every kind of presented approach. The best performing models were AdaBoost and SVM for the feature-based classical approach, MLP with ACC for feature-based deep learning, and FCN-LSTM with ACC for e2e deep learning. The mean differences between Scenario S1 and S3, in favor of S3, are 0.09 in F1 on class 1, 0.05 in F1 macro, and 0.05 in accuracy. The gain in F1 on class 1 is particularly notable and desirable. A possible conclusion is that physiological traces of intense emotions are more personalized/user-dependent than physiological changes during neutral states. In several cases, models based on S4 performed better than models based on S3. This may indicate that some classifiers/architectures benefit from additional training samples, even though the samples are not representative (out of the application domain). Nevertheless, in the majority of cases where the S4 model achieved higher results, the model from S3 performed within the range of the standard deviation of the S4 model.
The Friedman statistical test [37] confirmed that the model created in Scenario S3 is top-ranked, S4 is second-best, S2 is third, and S1 is ranked lowest (p = 3E−6). The Shaffer post-hoc multiple comparisons [38] indicated that the differences between the results of S1 and all other models are statistically significant. The differences between the results of the other models, i.e., S2 vs. S3, S2 vs. S4, and S3 vs. S4, are insignificant. There is no clear indication of whether including accelerometer data improves the model, although it definitely increases the complexity and computational requirements.
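The ranking step of such a comparison can be sketched with SciPy. The layout is an assumption: each row holds the scores of one classifier/metric combination and each column one scenario, with higher values being better. The helper name `friedman_ranking` is hypothetical, and the Shaffer post-hoc procedure is not covered because SciPy does not provide it.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_ranking(scores, names):
    """Mean rank per scenario (1 = best) and the Friedman test statistic and p-value.

    scores: 2-D array, rows = evaluated configurations, columns = scenarios.
    names:  scenario labels in column order, e.g. ["S1", "S2", "S3", "S4"].
    """
    scores = np.asarray(scores, dtype=float)
    # Rank within each row; negate so the highest score receives rank 1.
    ranks = np.vstack([rankdata(-row) for row in scores])
    mean_ranks = dict(zip(names, ranks.mean(axis=0).tolist()))
    # One argument per scenario, each holding the scores of all configurations.
    statistic, p_value = friedmanchisquare(*scores.T)
    return mean_ranks, float(statistic), float(p_value)
```

A low p-value only indicates that at least one scenario is ranked differently from the others; pairwise conclusions require a post-hoc procedure such as the one cited above.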

VI. CONCLUSIONS
Since emotional events happen sporadically in our everyday life, we should make every effort to increase the likelihood of capturing such cases with wrist-worn smartwatches. This includes personalized ML models that recognize the proper time to trigger self-assessments. However, creating personalized models requires a large number of per-person training samples, which are not available at the start of a study; this is the cold start problem. Until the necessary quantity of cases is reached, we propose using an alternative, temporary solution, namely per-group personalization. The analysis performed on real-life data demonstrates that adjusting the model to the group of participants (Scenario S3) improves the classification quality over the general model (Scenario S1) or partially adjusted model (Scenario S2). A large number of general samples enriched with the personal samples (Scenario S4) can improve the classification over the general or partially adjusted model (Scenario S1 and Scenario S2); however, because of the large portion of general samples, it is not able to outperform the adjusted model (Scenario S3). This leads us to the conclusion that it is not only the quantity of the training set but mostly its quality that improves the models' predictive ability. Models perform better when they are trained on data from the application domain. We can also infer that human physiology cannot be easily generalized to unknown participants. Hence, the cold start problem is a major concern at the beginning of a new study. The solution is to collect new subjects' data and perform model personalization as soon as possible to provide better-suited predictions.
An obvious approach would be to adjust models for each participant separately. We have attempted such a scenario but did not obtain satisfactory results. The most probable reason for unsuccessful per-person model personalization is the low number of per-subject samples. The number of self-assessments (annotated samples) collected per person during the first two weeks of Study B varied from 13 to 33 (avg 23.7).
Study B described in this work is still ongoing. We plan to validate the model from Scenario S3 in real life by propagating the model to the participants. Furthermore, our next step will be to enrich the prediction of the intense emotion with the valence (positive vs. negative emotion).