An Investigation of Cross-Cultural Semi-Supervised Learning for Continuous Affect Recognition

One of the keys to the success of supervised learning techniques is access to vast amounts of labelled training data. The process of data collection, however, is expensive, time-consuming, and application dependent. In the current digital era, data can be collected continuously. This continuity turns data annotation into an endless task which, in problems such as emotion recognition, potentially requires annotators with different cultural backgrounds. Herein, we study the impact of utilising data from different cultures in a semi-supervised learning approach to label training material for the automatic recognition of arousal and valence. Specifically, we compare the performance of culture-specific affect recognition models trained with manual or cross-cultural automatic annotations. The experiments performed in this work use the dataset released for the Cross-cultural Emotion Sub-challenge of the Audio/Visual Emotion Challenge (AVEC) 2019. The results obtained indicate that the cultures used for training affect the system performance. Furthermore, in most of the scenarios assessed, affect recognition models trained with hybrid solutions, combining manual and automatic annotations, surpass the baseline model, which was exclusively trained with manual annotations.


Introduction
High-quality labelled data is of vital importance in supervised learning approaches. The increasing number of sensors and devices permanently connected to the Internet allows the continuous collection of information. For this data to help improve the performance of machine learning algorithms, it needs to be annotated. Data collection can, therefore, be expensive and time-consuming. This process is even costlier when it comes to affective datasets, as the gold standard mapped to a specific sample is determined by analysing the individual labels provided by multiple annotators on the same sample. Furthermore, these annotators need specific training, and are culturally specific. The annotators should share the same culture as the users in the dataset to guarantee annotation reliability, as different cultures show emotions differently [1,2]. To ease the data annotation process, researchers have investigated the use of Semi-Supervised Learning (SSL) approaches [3].
Methods using SSL have been investigated with different modalities [4,5,6,7,8,9]. In the particular case of the audio modality, SSL techniques have been employed in a wide range of problems, such as automatic speech recognition [10], sound classification [11], or depression detection [12], to name but a few. In the field of affective computing, researchers investigated the benefits of SSL in the problem of emotion recognition from audio [13] and video [14,15]. Previous works proposed methods to enhance the annotations inferred via SSL to mitigate the propagation of the error caused by the inference, reducing its impact on the overall system performance [16]. Further studies explored cooperative [17] and collaborative [18] learning approaches, which combine expert (manual) and machine (automatic) annotations. Others investigated the benefits of using SSL in crowdsourcing paradigms to generate emotional labels [19].
The primary goal when using SSL is to annotate affective datasets automatically, or to reduce the number of annotators needed for labelling without deteriorating the quality of the annotations themselves. Despite the usefulness of SSL techniques in affect-related problems, to the best of the authors' knowledge, the limitations of SSL in cross-cultural settings have not been investigated yet. SSL-powered systems that automatically gather data from online social media platforms [20] might benefit from these investigations in order to determine whether cultural aspects need to be taken into account for improving the quality of their annotations. In this work, we aim to analyse how the cultural dependencies on conveying emotions impact the performance of affect recognition models when using SSL annotations as training material. Specifically, we focus this study on the continuous recognition of arousal from the voice, and valence from the face, assessing our models on German, Hungarian, and Chinese cultures.
The rest of the paper is laid out as follows. Section 2 presents the dataset employed, while Section 3 describes the methodology followed. Section 4 details the experiments performed and analyses the results obtained. Finally, Section 5 concludes the paper and suggests some future work directions.

Cross-cultural Emotion Dataset
The present work investigates the Cross-cultural Emotion Sub-challenge (CES) dataset, an audio-visual dataset with continuous emotional annotations in the valence-arousal space [21]. The dataset was released for the CES task of the 9th Audio/Visual Emotion Challenge (AVEC) and Workshop [22], and consists of a subset of the interactions gathered in the SEWA database [23]. CES captures spontaneous in-the-wild interactions between pairs of friends or relatives from German, Hungarian, and Chinese cultures, while remotely discussing a commercial they had just seen. The German and Hungarian cultures were available in the train, development, and test partitions. Interactions from the Chinese culture were only available in the test partition.
The interactions were recorded using a computer-based platform. Audio data was recorded at 48 kHz, video data at 50 Frames Per Second (FPS), and affect-related annotations at 10 Hz. The video modality always contains information from one of the two interactants, while the audio modality contains information from both. To ensure a fair use of the involved modalities, we exclusively analyse the segments of the interactions corresponding to the timestamps in which the information from both acoustic and visual modalities match; i.e., the interactant speaking is the same as the one being video recorded. Table 1 summarises the data available and used in this work.

Methodology
This section introduces the system implemented (cf. Section 3.1), illustrated in Figure 1, and describes the SSL approach followed in this work (cf. Section 3.2).

Implemented System
Three main components form our system (cf. Figure 1), which we proceed to describe in the following paragraphs.
Data Preparation. Based on the nature of the CES dataset (cf. Section 2), we first cropped the original videos, selecting the timestamps in which the interactant speaking is the same as the one being video recorded. Furthermore, we compensated for the delay annotators might have experienced between perceiving and reporting the emotional state of the interactant [24]. Using annotation delay compensation, we shifted the affect-related annotations back in time by 2.4 seconds [25]. The next step is the extraction of audiovisual features from the cropped videos. The 23 Low Level Descriptors (LLDs) of the eGeMAPS feature set [26] are extracted from the audio signals using openSMILE [27]. For the visual modality, we opted for extracting the intensities of 17 Facial Action Units (FAUs) using OpenFace [28]. The acoustic and visual LLDs are extracted at different sampling rates. To overcome this issue, we computed their functionals, as a technique for summarising the information extracted. Specifically, we used sliding windows of 4 seconds length with a hop size of 0.1 seconds to compute the mean and standard deviation of the LLDs extracted in the corresponding time span. The window length selected ensures capturing useful affect-related information [22]. The hop size used contributes to homogenising the sampling rates between the audiovisual functionals and the annotations. The functionals are finally standardised to boost the convergence when training the models.
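The windowing and delay compensation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `functionals` and `shift_annotations`, the list-based data layout, the use of the population standard deviation, and the padding of the shifted annotations with their last value are all choices made here for the example.

```python
from statistics import fmean, pstdev


def functionals(llds, rate_hz, win_s=4.0, hop_s=0.1):
    """Mean and standard deviation of each LLD over sliding windows.

    llds: list of frames, each a list of descriptor values sampled at rate_hz.
    Returns one row per window: [mean_1..mean_D, std_1..std_D].
    """
    win = round(win_s * rate_hz)  # window length in frames
    rows = []
    k = 0
    while True:
        start = round(k * hop_s * rate_hz)  # window start in frames
        if start + win > len(llds):
            break
        columns = list(zip(*llds[start:start + win]))  # one tuple per LLD
        # Population standard deviation is an assumption of this sketch.
        rows.append([fmean(c) for c in columns] + [pstdev(c) for c in columns])
        k += 1
    return rows


def shift_annotations(labels, rate_hz=10, delay_s=2.4):
    """Move the label reported at t + delay_s back to time t."""
    if not labels:
        return []
    d = round(delay_s * rate_hz)  # 24 samples at 10 Hz
    # Padding the tail with the last value is an assumption of this sketch.
    return list(labels[d:]) + [labels[-1]] * min(d, len(labels))
```

With a hop of 0.1 seconds, one functional vector is produced per 10 Hz annotation step, which is what allows the two streams to be paired sample by sample.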
Affective states are context-related and, as a consequence, it is beneficial to include contextual information, as past information in the time domain, when modelling affect [29,30]. This temporal modelling can be achieved using Recurrent Neural Networks (RNNs). In this work, we emulated the time annotators need to perceive affect and modelled the current annotation, y[n], with current and previous input features. Nevertheless, the current annotations do not only correlate with the features themselves, but also depend on the previous annotations. Hence, we modelled affective annotations as a function of the current and previous input features and of the previous annotations (Equation (1)), where the number of past samples considered, N, corresponds to the number of samples needed to capture 2.4 seconds of data, in concordance with our chosen annotation delay compensation factor. This many-to-one approach can be interpreted as a technique for data augmentation.
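The two relations described in this paragraph can be written out as below. The notation is an assumption made for illustration: x[n] denotes the feature vector at step n, g and f denote the learnt mappings, and the exact index ranges (for instance, whether the context spans N or N + 1 samples) cannot be recovered from the text.

```latex
% Features only: the first relation described in the paragraph
y[n] = g\big(\mathbf{x}[n], \mathbf{x}[n-1], \dots, \mathbf{x}[n-N]\big)

% Features plus previous annotations: the relation referred to as Equation (1)
y[n] = f\big(\mathbf{x}[n], \dots, \mathbf{x}[n-N],\; y[n-1], \dots, y[n-N]\big) \tag{1}
```

Since the annotations are sampled at 10 Hz, a context of 2.4 seconds would correspond to N = 24 samples.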
Neural Network Architecture. Affective annotations are modelled with a Gated Recurrent Unit Recurrent Neural Network (GRU-RNN) followed by three stacked Fully Connected (FC) layers. The GRU-RNN, with 32 hidden units, aims to capture the time dependencies of the input data sequence, and learns a hidden representation. The purpose of the FC layers, with 32, 16, and 1 neurons, respectively, is to progressively compress the information embedded in the hidden representation learnt with the GRU-RNN. The last FC layer uses a HardTanh activation function, so the inferred annotations belong to the range [−1, 1]. The network is trained using the Concordance Correlation Coefficient (CCC) as the loss, with Adam as the optimiser. The learning rate of the optimiser was set to 1 · 10⁻⁴. Data from all available interactions was read at once, and we selected one in every five consecutive windows of features as training material. This way, we reduced the oversampling of the training data, and contributed to a better network generalisation. The weights of the network were updated using mini-batches of 1 000 samples. The network was trained for a maximum of 300 epochs, with an early stopping method that stops training when the loss on the validation partition does not improve for 20 consecutive epochs.
As the previous gold standard annotations defined in Equation (1) are not available at inference time, the annotations inferred in previous time steps are used for the prediction of the current annotation. The buffer with previously inferred annotations is initialised with zeros at every new interaction coming to the system, and continuously updated.
Post-processing. The inferred annotations are post-processed using a median filter before the actual assessment of the models. The median filter uses a kernel size of 3 samples to post-process the annotations associated with the audio modality, and a kernel size of 33 samples to post-process the annotations associated with the video modality. These parameters were optimised for the assessment of the baseline model on the development partition.
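For reference, the CCC used as the training objective can be computed as in the sketch below. The standard definition is used; turning it into a quantity to minimise as 1 − CCC is a common convention and an assumption here, since the text only states that the CCC serves as the loss. The function names and the value returned for the undefined case are likewise choices of this sketch.

```python
from statistics import fmean


def ccc(pred, gold):
    """Concordance Correlation Coefficient between two equal-length sequences."""
    if len(pred) != len(gold) or not pred:
        raise ValueError("sequences must be non-empty and of equal length")
    mean_p, mean_g = fmean(pred), fmean(gold)
    var_p = fmean([(p - mean_p) ** 2 for p in pred])
    var_g = fmean([(g - mean_g) ** 2 for g in gold])
    cov = fmean([(p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)])
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    # The CCC is undefined for two identical constant sequences;
    # returning 0.0 in that case is an assumption of this sketch.
    return 2 * cov / denom if denom else 0.0


def ccc_loss(pred, gold):
    """Loss to minimise; the 1 - CCC form is assumed, not taken from the text."""
    return 1.0 - ccc(pred, gold)
```

Unlike the Pearson correlation, the CCC penalises differences in mean and scale, so a prediction that follows the trend of the gold standard but is offset from it scores below 1.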
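The inference procedure with a zero-initialised buffer can be sketched as follows. The `model` callable and its two-argument interface are hypothetical, introduced only for illustration, and the shorter feature window used at the start of an interaction is an assumption of this sketch.

```python
from collections import deque


def infer_interaction(model, features, n_context):
    """Predict one annotation per step, feeding back earlier predictions.

    model: hypothetical callable (feature_window, previous_annotations) -> float.
    features: list of feature vectors for a single interaction.
    n_context: number of past steps kept (N in the text).
    """
    # The buffer is reset to zeros for every new interaction.
    previous = deque([0.0] * n_context, maxlen=n_context)
    outputs = []
    for n in range(len(features)):
        window = features[max(0, n - n_context):n + 1]
        y = model(window, list(previous))
        outputs.append(y)
        previous.append(y)  # the oldest prediction drops out automatically
    return outputs
```

Because every prediction is fed back into the buffer, an early error can influence later outputs, which is one reason the predictions are smoothed before evaluation.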
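A median filter of this kind can be sketched in a few lines. The edge handling (repeating the first and last values) is an assumption of this sketch, as the text does not specify how the borders are treated, and the function name is chosen here for illustration.

```python
from statistics import median


def median_filter(signal, kernel_size):
    """Apply a sliding median with an odd kernel size, keeping the length."""
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    if not signal:
        return []
    half = kernel_size // 2
    # Border handling by edge replication is an assumption of this sketch.
    padded = [signal[0]] * half + list(signal) + [signal[-1]] * half
    return [median(padded[i:i + kernel_size]) for i in range(len(signal))]
```

Under this sketch, `median_filter(arousal_predictions, 3)` and `median_filter(valence_predictions, 33)` would apply the two kernel sizes reported above; at the 10 Hz annotation rate, these span 0.3 and 3.3 seconds, respectively.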

Semi-Supervised Learning Approach
Our purpose is to assess the cultural influence on training affect recognition models with SSL annotations. Hence, interactions with SSL annotations need to be included as training material. For a fair comparison between the models, we split the interactions in the train partition into two disjoint subsets, named SM and SA. The subset SM contains half of the original interactions with their corresponding manual annotations, and is used to train a Manual (M) model. The M model is then used to automatically annotate the interactions belonging to the subset SA, which contains the interactions excluded from SM. Next, we used the interactions belonging to SA and their corresponding SSL annotations to train an Automatic (A) model. Finally, we combined the SM and SA subsets with their corresponding manual and SSL annotations, respectively, to train a Manual + Automatic (M+A) model.
In order to investigate the cultural impact on the performance of SSL, we set two different scenarios. In the first scenario, only German interactions were included in SM. In the second one, only Hungarian interactions were included in SM. We extended this analysis with a third scenario, in which SM contained half of the interactions from both German and Hungarian cultures. This splitting was performed by seeding the pseudo-random number generator and is publicly available¹. Interactions belonging exclusively to the train partition were used to train the models assessed on the development partition. The models assessed on the test partition used interactions from both train and development partitions as training material. Thus, at this stage, the interactions belonging to the development partition were also split and included in the two disjoint subsets SM and SA, and processed following the procedure described above.
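The splitting and the three resulting training sets can be sketched as below. This is an illustration under assumptions rather than the published split: the function names, the seed value, the equal per-culture quota, and the `m_model_predict` callable standing in for the trained M model are all hypothetical.

```python
import random


def split_manual_automatic(interactions, manual_cultures, seed=0):
    """Split interactions into disjoint subsets SM (manual) and SA (automatic).

    interactions: list of (interaction_id, culture) pairs.
    manual_cultures: cultures allowed in SM, e.g. ["DE"], ["HU"] or ["DE", "HU"].
    """
    rng = random.Random(seed)  # seeded so that the split is reproducible
    # SM holds half of all interactions, shared equally among the
    # selected cultures (an assumption of this sketch).
    quota = len(interactions) // 2 // len(manual_cultures)
    s_m = []
    for culture in manual_cultures:
        pool = sorted(i for i, c in interactions if c == culture)
        rng.shuffle(pool)
        s_m.extend(pool[:quota])
    chosen = set(s_m)
    s_a = [i for i, _ in interactions if i not in chosen]
    return s_m, s_a


def training_sets(s_m, s_a, manual_labels, m_model_predict):
    """Build the label sets for the M, A and M+A models."""
    manual = {i: manual_labels[i] for i in s_m}
    automatic = {i: m_model_predict(i) for i in s_a}  # SSL annotations
    hybrid = {**manual, **automatic}
    return manual, automatic, hybrid
```

In this sketch, the same seed always yields the same subsets, which keeps the M, A and M+A models comparable across runs.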

Experimental Results
The interactions belonging to the CES dataset had been manually labelled in terms of valence and arousal. Thus, we used these manual annotations to train the baseline models for our experiments. As arousal information is considered to be stronger in the voice, and valence information in the face [31], we focused our analysis on the automatic recognition of arousal from acoustic features (cf. Section 4.1), and valence from visual features (cf. Section 4.2). The performance of the trained models in the different scenarios outlined in Section 3.2 is assessed by computing the CCC between the inferred and ground truth annotations from all interactions belonging to each specific cultural subset in the development or test partitions.

Arousal Recognition from Acoustic Information
The results obtained on the automatic recognition of arousal are summarised in Tables 2 and 3. Table 2 compares the performance of the models when using manual or automatic annotations exclusively as training material. Table 3 compares the performance of the baseline model, which uses manual annotations from all the interactions as training material, with the hybrid models, which are trained using both manual and automatic annotations.
The performance analysis of the models trained with manual or automatic annotations (cf. Table 2) indicates the suitability of the manual annotations. When only German interactions were used in SM, the trained M model achieved a better performance than the A model on both development and test partitions. In the second scenario, in which only Hungarian interactions populated SM, the M and A models obtained the best performances on the German and Hungarian interactions belonging to the development partition, respectively. On the test partition, the M model scored the highest CCC on the German and Hungarian interactions, while for the Chinese interactions, the best CCC was obtained with the A model. In the last scenario, which combined German and Hungarian interactions in SM, the M and A models scored the highest CCC on the German and Hungarian interactions belonging to the development partition, respectively. On the test partition, the M model obtained the best results in all the cultures assessed. From a cultural perspective, the German model obtained the best performance on the German interactions belonging to the development partition, while the multicultural model scored the highest CCC on the Hungarian ones. On the test partition, the multicultural model achieved the best performance on the German interactions, while the Hungarian model scored the highest CCC on both the Hungarian and Chinese ones.
From the evaluation of the hybrid models (cf. Table 3), we observe that for the three cultures belonging to the test partition, the performance of the best hybrid models surpassed the baseline model. Specifically, hybrid models trained with German and Hungarian interactions in SM achieved the highest CCC scores on both the German and Hungarian interactions on the test set. On the Chinese culture, the best model was obtained when using Hungarian interactions only in SM.

Valence Recognition from Visual Information
The results obtained on the automatic recognition of valence are summarised in Tables 4 and 5. Table 4 compares the performance of the models when using manual or automatic annotations exclusively as training material. Table 5 compares the performance of the baseline model, which uses manual annotations from all the interactions as training material, with the hybrid models, which are trained using both manual and automatic annotations.
The performance analysis of the models trained with manual or automatic annotations (cf. Table 4) shows mixed results. When only German interactions were used in SM, the A model obtained the highest CCC scores in all cultures from both development and test partitions. On the other hand, when only Hungarian interactions were used in SM, the M model achieved the highest CCC scores in all cultures from both development and test partitions. Finally, when both German and Hungarian interactions populated SM, the M and A models scored the highest CCC on the German and Hungarian interactions belonging to the development partition, respectively. On the test partition, the M model obtained a better performance than the A model in all the cultures assessed. From a cultural perspective, the German model obtained the best performance on the German interactions belonging to the development partition, while the Hungarian model scored the highest CCC on the Hungarian ones. On the test partition, the German model scored the highest CCC for all cultures.
From the evaluation of the hybrid models (cf. Table 5), we observe that for the three cultures belonging to the test partition, the highest CCC scores were obtained with the hybrid model that used German interactions to populate SM. Specifically, for the German and Chinese interactions, the performance of this model surpassed the baseline model.

Conclusions
This work assessed the impact of culture when using SSL on the continuous recognition of affect. Specifically, we focused on the automatic recognition of arousal from the voice, and valence from the face. The results obtained indicated that the culture of the interactions used for training the models impacted the overall system performance. In most of the cases analysed when comparing M and A models, the best performances were obtained when affective models were trained using manual annotations. Nonetheless, the use of SSL annotations alone showed highly competitive results. When analysing the M+A models, we observed that hybrid solutions, combining manual and automatic annotations, surpassed the baseline, which only used manual annotations, in most of the cases investigated. These results encourage the use of automatic annotations or hybrid solutions to ease the data annotation process in affect-related problems.
Future directions to carry on this work include the cross-modal study of SSL for continuous affect recognition, and the investigation of multi-task networks in this problem in order to exploit the supplementary information embedded in the valence and arousal dimensions simultaneously. Further work can be performed towards a deep understanding of the benefits of using teacher forcing strategies in multimodal paradigms aiming at the continuous recognition of affect.

Acknowledgements
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 826506 (sustAGE).

Table 2:
Summary of the Concordance Correlation Coefficients (CCC) obtained by comparing the ground truth and the predicted arousal annotations from acoustic features per culture on both development and test partitions. Specifically, we compared the performance of the models when trained using Manual or Automatic annotations as training material. For each scenario, the selection of the interactions used to train the M model was performed culture-wise. The highest CCC scores per culture in each scenario are highlighted.

Table 3:
Summary of the Concordance Correlation Coefficients (CCC) obtained by comparing the ground truth and the predicted arousal annotations from acoustic features per culture on both development and test partitions. The baseline model was trained using the original manual annotations. The remaining models were trained combining manual and automatic annotations (M+A model). The interactions used to infer the automatic annotations were selected culture-wise. The highest CCC scores per culture among the hybrid models assessed are highlighted.

Table 4:
Summary of the Concordance Correlation Coefficients (CCC) obtained by comparing the ground truth and the predicted valence annotations from visual features per culture on both development and test partitions. Specifically, we compared the performance of the models when trained using Manual or Automatic annotations as training material. For each scenario, the selection of the interactions used to train the M model was performed culture-wise. The highest CCC scores per culture in each scenario are highlighted.

Table 5:
Summary of the Concordance Correlation Coefficients (CCC) obtained by comparing the ground truth and the predicted valence annotations from visual features per culture on both development and test partitions. The baseline model was trained using the original manual annotations. The remaining models were trained combining manual and automatic annotations (M+A model). The interactions used to infer the automatic annotations were selected culture-wise. The highest CCC scores per culture among the hybrid models assessed are highlighted.