MuSe 2020 Challenge and Workshop: Multimodal Sentiment Analysis, Emotion-target Engagement and Trustworthiness Detection in Real-life Media: Emotional Car Reviews in-the-wild

Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 is a Challenge-based Workshop focusing on the tasks of sentiment recognition, as well as emotion-target engagement and trustworthiness detection by means of more comprehensively integrating the audio-visual and language modalities. The purpose of MuSe 2020 is to bring together communities from different disciplines; mainly, the audio-visual emotion recognition community (signal-based), and the sentiment analysis community (symbol-based). We present three distinct sub-challenges: MuSe-Wild, which focuses on continuous emotion (arousal and valence) prediction; MuSe-Topic, in which participants recognise 10 domain-specific topics as the target of 3-class (low, medium, high) emotions; and MuSe-Trust, in which the novel aspect of trustworthiness is to be predicted. In this paper, we provide detailed information on MuSe-CaR, the first-of-its-kind in-the-wild database, which is utilised for the challenge, as well as the state-of-the-art features and modelling approaches applied. For each sub-challenge, a competitive baseline for participants is set; namely, on test we report for MuSe-Wild a combined (valence and arousal) CCC of .2568, for MuSe-Topic a score (computed as 0.34 * UAR + 0.66 * F1) of 76.78 % on the 10-class topic and 40.64 % on the 3-class emotion prediction, and for MuSe-Trust a CCC of .4359.


INTRODUCTION
Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 is a novel Challenge-based Workshop in which sentiment recognition, as well as emotion-target engagement and trustworthiness detection are the main focus. MuSe aims to provide a testing bed for more extensively exploring the fusion of the audio-visual and language modalities. The core purpose of MuSe is to bring together communities from differing computational disciplines; mainly, the sentiment analysis community (symbol-based), and the audio-visual emotion recognition community (signal-based).
The first group, rooted in the field of Sentiment (and Opinion) Mining and specialising in Natural Language Processing (NLP) methods for symbolic information analysis, leverages the text modality and focuses only on the prediction of discrete sentiment label categories [50]. In numerous competitions in recent years, researchers from the second group, mostly rooted in the field of Affective (and Behavioural) Computing and specialised in intelligent signal processing, have focused on one or both of the audio and vision modalities in order to predict the continuous-valued valence and arousal dimensions of emotion (circumplex model of affect), while often disregarding the potential contribution of textual information [15,30,36,47]. However, approaches by both communities now show signs of convergence, highly influenced by related, explicitly multimodal learning techniques [3,13,28]. Of note, the 2020 INTERSPEECH Computational Paralinguistics (ComParE) Challenge included for the first time baselines utilising both audio signal and text transcripts [35].
With this in mind, MuSe 2020 aims to attract both communities equally and encourages a fusion of modalities to demonstrate the advantages within the field of emotion specifically. Ideally, participation should strive towards the development of unified approaches applicable to each task. Tasks have arisen from different academic traditions: on the one hand, complex, dimensional emotion annotations relating to the expression of behaviour, and on the other hand, linking sentiment and emotions to topics (context), entities or aspects, as is common in sentiment analysis [38].
A second contribution of MuSe 2020 is the facilitation of a broad comparison of the merits of the three core modalities (language, audio, and visual cues), as well as various approaches to multimodal fusion, under well-defined and strictly comparable conditions. In this way, we aim to establish the extent to which the fusion of approaches is possible and beneficial, and to advance sentiment and emotion recognition systems so that they can deal with fully naturalistic (in-the-wild) behaviour in large volumes of user-generated data. User-generated data refers to data sourced from the target users themselves and represents the new generation of data utilised for real-world multimedia affect and sentiment analysis [48] and other research fields [9].
A single dataset is used for all three sub-challenges to facilitate comparison between them.
For this year's MuSe 2020, we introduce the Multimodal Sentiment Analysis in Car Reviews dataset MuSe-CaR which covers the range of aforementioned topics discussed. MuSe-CaR is a large, multimodal dataset which has been gathered in-the-wild with the intention of further understanding real world Multimodal Sentiment Analysis, in particular the emotional engagement that takes place during product reviews (i. e., automobile reviews) where a sentiment is linked to a topic or entity.

CHALLENGE OUTLINE AND PROTOCOL
The major novelties discussed herein will be introduced in MuSe 2020 through three core sub-challenges: (i) the Multimodal Sentiment in-the-Wild Sub-challenge (MuSe-Wild), (ii) the Multimodal Emotion-Target Engagement Sub-challenge (MuSe-Topic), and (iii) the Multimodal Trustworthiness Sub-challenge (MuSe-Trust). In the following, we describe and highlight the novelties of each sub-challenge and include guidelines for participation.
Individuals wishing to participate in the MuSe 2020 challenge must hold an academic affiliation. Further to this, they should download and fill out the End User License Agreement (EULA) and submit it via the homepage 1 . All entries to the challenge should be accompanied by a document which describes the methods and results in detail and includes a citation of this paper. To appear on the temporary, public leader board on the MuSe homepage, participants must provide predictions, a GitHub repository where their source code is uploaded, and a link to an arXiv preliminary technical report. The organisers do not participate in the Challenge themselves, but re-evaluate the findings of the best performing system of each Sub-challenge. There will be a double-blind peer-review process by the technical program committee, and only papers which meet the standards set by peer-review will be eligible for the main competition. Papers accepted for the workshop will be allocated 6-8 pages (plus references) in the proceedings of ACM MM 2020.

MuSe-Wild Sub-Challenge
In the MuSe-Wild Sub-Challenge, participants predict the level of the affective dimensions (arousal and valence) in a time-continuous manner from audio-visual recordings. Valence is strongly linked to the emotional component of the umbrella term sentiment analysis, and the two are often used interchangeably [22,27,41]. Timestamps to enable modality alignment and fusion on word-, sentence-, and utterance-level, as well as several acoustic, visual, and textual-based features, are pre-computed and provided with the challenge package. The evaluation metric for this sub-challenge is the concordance correlation coefficient (CCC), which is often used in similar challenges [30,47]. CCC is a measure of reproducibility and performance, which condenses information on both precision and accuracy, is robust to changes in scale and location [18], and its theoretical relationship to other regression measures, e. g., (root) mean squared error, is well understood [23]. For the MuSe-Wild baseline, the mean of the arousal and valence CCC is taken as the combined score.
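To make the metric concrete, the following is a minimal sketch of the CCC and of the combined (mean of arousal and valence) score, written with the Python standard library only. The function names `ccc` and `combined_ccc`, and the convention of returning 0 when the denominator vanishes, are illustrative assumptions and not part of the challenge package.

```python
from statistics import fmean


def ccc(pred, gold):
    """Concordance correlation coefficient of two equal-length sequences."""
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p, mean_g = fmean(pred), fmean(gold)
    # Population (1/N) variances and covariance.
    var_p = fmean((p - mean_p) ** 2 for p in pred)
    var_g = fmean((g - mean_g) ** 2 for g in gold)
    cov = fmean((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    # Assumption: two identical constant sequences give 0 instead of dividing by zero.
    return 2.0 * cov / denom if denom else 0.0


def combined_ccc(ccc_arousal, ccc_valence):
    """Combined score: unweighted mean of the arousal and valence CCC."""
    return 0.5 * ccc_arousal + 0.5 * ccc_valence
```

A prediction that is perfectly correlated with the gold standard but shifted by a constant no longer reaches a CCC of 1, which is the property that distinguishes CCC from Pearson correlation.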

MuSe-Topic Sub-challenge
In the MuSe-Topic Sub-challenge, participants predict 10 classes of domain-specific (automotive, as given by the chosen database) topics 2 as the target of emotions. In addition, three classes (low, medium, and high) of valence and arousal should be predicted, i. e., one valence and one arousal value for each topic segment. These classes are derived from the mean value of the temporally aggregated continuous labels of MuSe-Wild, which were divided into three equally sized classes (33 %) for each label. For this sub-challenge, first, a weighted score combining Unweighted Average Recall (UAR, weight 0.34) and micro-averaged F1 (weight 0.66) is calculated independently for each prediction target (Valence, Arousal, and Topic). We include both measures to keep our evaluation consistent with previous challenges, as the former was partially used to evaluate a classification task in [15], and the latter in [35]. Second, the mean of the weighted scores for Valence and Arousal (combined) is calculated. Third, to combine this mean with the topic score, the mean rank over all participants ((rank of combined emotions result + rank of topic result)/2) is calculated for the final performance assessment. Should two participants have the same mean rank, the one with the higher topic rank will be the final winner. We believe that this composite measure is most discriminative to meaningfully showcase performance improvements in emotion and topic prediction, as it places importance on precision and recall, in both a dataset-wide and class-specific manner.
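The three scoring steps can be sketched as follows. This is an illustrative reading of the description above using only the Python standard library; the function names are hypothetical, and UAR is averaged over the classes that occur in the gold labels, which is an assumption.

```python
from collections import Counter


def uar(gold, pred):
    """Unweighted average recall: mean of the per-class recalls."""
    support = Counter(gold)
    hits = Counter(g for g, p in zip(gold, pred) if g == p)
    # Assumption: only classes present in the gold labels are averaged.
    return sum(hits[c] / support[c] for c in support) / len(support)


def f1_micro(gold, pred):
    """Micro-averaged F1 from pooled true/false positives and false negatives."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p)
    # In single-label classification each error is one FP and one FN.
    fp = fn = len(gold) - tp
    return 2 * tp / (2 * tp + fp + fn) if gold else 0.0


def weighted_score(gold, pred):
    """Step 1: 0.34 * UAR + 0.66 * F1 (micro) for one prediction target."""
    return 0.34 * uar(gold, pred) + 0.66 * f1_micro(gold, pred)


def combined_emotion_score(valence_score, arousal_score):
    """Step 2: mean of the weighted scores for valence and arousal."""
    return (valence_score + arousal_score) / 2


def mean_rank(emotion_rank, topic_rank):
    """Step 3: mean of the combined-emotion rank and the topic rank."""
    return (emotion_rank + topic_rank) / 2
```

Note that for single-label predictions the micro-averaged F1 coincides with plain accuracy, so the UAR term is what rewards balanced performance across classes.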

MuSe-Trust Sub-challenge
In the MuSe-Trust Sub-challenge, participants predict a continuous trustworthiness signal from user-generated audio-visual content in a sequential manner. They are also provided with aligned valence and arousal annotations, which they are encouraged to explore as a means of understanding the relationship between emotional labels in depth and at large scale. The evaluation metric for this sub-challenge is the concordance correlation coefficient (CCC).

CHALLENGE DATASET
For all of the three Sub-Challenges of MuSe 2020, the MuSe-CaR data set is utilised. MuSe-CaR is a large, extensively annotated multimodal ((spoken) language, audio, video) dataset which has been gathered in-the-wild with the intention of developing appropriate methods and further understanding Multimodal Sentiment Analysis in-the-wild. MuSe-CaR has been designed with an abundance of computational tasks in mind, including emotion and entity recognition, and dominantly with the intention of improving machine understanding of how sentiment (i. e., emotion) is linked to an entity and aspects of such reviews. The estimated age range of the professional, semi-professional ('influencers'), and casual reviewers is from the mid-20s until the late-50s. Most are native English speakers from the UK or the US, while a small minority are non-native, yet fluent English speakers. MuSe-CaR includes high voice and video quality, as everyday recording devices have improved in recent years. This enables robust learning of a high degree of novel, in-the-wild characteristics.
For the MuSe 2020 Challenge, we selected a high-quality sub-set of the MuSe-CaR dataset consisting of 36 h : 52 m : 08 s of video data from 291 videos and 70 host speakers (plus an additional of roughly 20 narrators) sourced from YouTube.
When creating the data set, it was of particular importance to find a balance between the stable and uncontrollable, 'in-the-wild' properties such as different recording devices, camera perspectives, ambient noises (car noises, music), or changing backgrounds to allow for meaningful learning with current deep learning methods. Such 'in-the-wild' characteristics of MuSe-CaR include: i) video: shot size, face-angle, camera motion, reviewer visibility, reviewer face occlusion (glasses), and highly varying backgrounds; ii) audio: ambient noises, narrator and host diarisation, diverse microphone types, and speaker locations; iii) text: colloquialisms, and domain-specific terms.
The topic of videos within MuSe-CaR is limited to vehicle reviews, with the number of vehicle manufacturers being restricted to premium brands (BMW, Audi, Mercedes-Benz) that equip their vehicles with the latest technology, thus ensuring that discussed entities and aspects (e. g., semi-autonomous vehicle functions) occur across a broad range of videos (and different manufacturers). Most of the reviewers are semi-professional or professional reviewers (e. g., YouTube channel 'influencers'). All YouTube channels used within MuSe-CaR have given full consent for their data to be used within the context of academic research 3 .
To avoid extremely objective reviews, during the selection process videos were rated on a scale between 0 (emotionless) and 5 (very emotional). We filtered out all videos with a score less than 3 before annotation began. Within MuSe-CaR , there are 15 annotation tiers (3 continuous dimensional, 3 partially continuous binary label, 5 categorical, and 4 automatically annotated tiers). For MuSe 2020, we utilise 3 continuous ratings, and the topic categorical ratings. Each recording has been annotated in three continuous dimensions; emotional valence (hence reflecting sentiment) and arousal according to Russell's theory [32], and additionally the novel aspect of trustworthiness, each by at least 5 independent annotators. In the case of the Trustworthiness dimension, there has been minimal research into the link between this and other emotions [1], and to the best of the authors' knowledge, it has not been utilised nor predicted using machine learning.
A gold-standard was computed on the individual annotators using an Evaluator Weighted Estimator (EWE) approach, in which inter-rater agreement is considered. EWE is described, e. g., further in [34] and has been applied to similar continuous emotion-based tasks [30], and corpora [31]. In addition to the dimensional annotations we included the categorical labelling of emotional engagement with topics, such as comfort, safety, interior, and performance.
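As a rough illustration of how an evaluator-weighted fusion of annotators can work, the sketch below weights each annotator by the Pearson correlation between their trace and the mean trace of the remaining annotators. This weighting scheme, the clipping of negative weights to zero, the fallback to a plain mean, and the function names are all assumptions made for illustration; the exact procedure used for MuSe-CaR is the one described in [34].

```python
from statistics import fmean


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either sequence is constant."""
    mean_a, mean_b = fmean(a), fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def evaluator_weighted_estimate(annotations):
    """Fuse equal-length annotator traces into one agreement-weighted trace."""
    weights = []
    for k, own in enumerate(annotations):
        others = [a for j, a in enumerate(annotations) if j != k]
        others_mean = [fmean(column) for column in zip(*others)]
        # Assumption: annotators who disagree with the rest get weight 0.
        weights.append(max(0.0, pearson(own, others_mean)))
    total = sum(weights)
    if total == 0:
        # Assumption: fall back to the unweighted mean if no one agrees.
        return [fmean(column) for column in zip(*annotations)]
    return [
        sum(w * trace[t] for w, trace in zip(weights, annotations)) / total
        for t in range(len(annotations[0]))
    ]
```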
For the MuSe 2020 Challenge, data has been partitioned in a Train, Development, and Test convention, where aspects including emotional ratings, speaker independence, and duration have been considered (cf. Table 1 for an overview). The total duration of data for each sub-challenge varies, as further pre-processing to include the most informative data only was applied. For MuSe-Wild and MuSe-Trust, all parts with an active voice or a visible face are included. We excluded non-product related video segments (e. g., advertisements) for MuSe-Wild and MuSe-Topic to minimise the distortion these could cause on the task objectives. More specifically, for MuSe-Topic, we only included sections which have an active voice based on the sentence transcriptions. To avoid fragmenting the data into purely sentence-level segments, we fused adjacent segments if they cover the same topic and are less than two seconds apart. Regarding MuSe-Trust, non-product related information (for instance, advertisements) might have a notable impact on the trustworthiness perception of the video. Therefore, segments containing advertisements for products and YouTube channels are included.
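The segment fusion rule for MuSe-Topic (same topic, less than two seconds apart) can be sketched as follows. The tuple layout `(start, end, topic)` in seconds and the function name `fuse_segments` are hypothetical choices for this example and do not describe the format of the released files.

```python
def fuse_segments(segments, max_gap=2.0):
    """Merge adjacent segments that share a topic and lie < max_gap seconds apart.

    segments: iterable of (start_seconds, end_seconds, topic) tuples.
    """
    fused = []
    for start, end, topic in sorted(segments):
        if fused:
            prev_start, prev_end, prev_topic = fused[-1]
            if topic == prev_topic and start - prev_end < max_gap:
                # Extend the previous segment instead of opening a new one.
                fused[-1] = (prev_start, max(prev_end, end), prev_topic)
                continue
        fused.append((start, end, topic))
    return fused
```

For example, two "comfort" sentences separated by 1.5 s become one segment, whereas two "safety" sentences separated by 3 s stay separate.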

BASELINE FEATURES
For each Sub-challenge, we provide a selection of features to participants which have been extracted from language, audio (including speech-to-text), and video signals. Extracting rich features from a huge amount of video data takes days, sometimes weeks, to complete, which would cost participants valuable time. For this reason, we provide 14 model-ready audio, visual, and linguistic feature sets 4 , an amount which far exceeds the number of feature sets provided by other comparable audio-visual challenges [16,30,47,50].
In the following, the feature sets based on each modality (acoustic, vision, and language) are described. For all feature sets, a hop size of 0.25 s was applied (unless otherwise stated) to be in line with the annotation sampling rate.

Acoustic
For extracting acoustic features, we utilise well-known feature extraction tools, namely openSMILE and DeepSpectrum, which have both shown success in a variety of audio processing tasks, including prominent work in speech emotion recognition (SER) [8,33]. Audio is extracted directly from the YouTube videos, normalised to -3 dB and converted from stereo to mono, 16 kHz, 16 bit. For all acoustic features, we apply a window size of 5 seconds.

openSMILE .
The freely available openSMILE toolkit [12] is utilised to extract the well-known extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [11]. eGeMAPS is a handcrafted speech-based feature set containing 88 features designed specifically for Speech Emotion Recognition (SER) tasks [40]. In addition, 130-dimensional low-level descriptors (LLDs) are provided, which have been computed with openSMILE and include the first- and second-order derivatives (deltas and double-deltas) of the features. LLD extraction remained at the default openSMILE configuration, and therefore a window size of 10 ms is applied for this feature set only.

DeepSpectrum .
We also include DeepSpectrum features as a state-of-the-art deep learning based approach [2]. DeepSpectrum extracts spectral images from speech instances and feeds them into pre-trained image recognition Convolutional Neural Networks (CNNs); the resulting activations are extracted as feature vectors. For MuSe 2020, we extract features utilising the VGG-19 extraction network [37], with all other parameters remaining as default. This results in a 4 096-dimensional feature set.

Vision
Most visual feature extractors are either designed to localise and extract specific image characteristics and sections (e. g., face), or to learn general discriminatory features for classifying (multi-class, multi-label) a large number of images into many classes (ImageNet). We provide participants with raw data (extracted faces), features focusing on human behavior (face, poses) as well as feature sets which capture the environment as a whole (Xception ) or the interaction object car (GoCaR ).

MTCNN .
To extract and localise the faces in the videos, MTCNN [51] was used. Internally, it has a cascaded structure of three stages to predict face and landmark position operating in real time. The model is trained on the data sets WIDER FACE [49] and CelebA [20]. It also provides a confidence measure which allows the false positive to false negative rate to be tuned. Because the frameworks that extract more detailed face features do not provide features for false positives, we chose not to tune the confidence threshold. For the quantitative performance analysis, we labelled a small selection of videos from each channel by hand, and calculated the intersection over union. Depending on the size of the overlap and intersection, we classified the detected bounding boxes into true and false positives. The detector achieved an accuracy of 90 %, and an F1 score of 86 % on the selection of MuSe-CaR . In addition, we visually inspected the bounding boxes to control the qualitative performance. Both performances underline the very good quality of MTCNN face extractions. These extractions were used as inputs for VGGface and OpenFace .
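The intersection-over-union check used for the quantitative analysis can be illustrated as below. Boxes are assumed to be given as `(x, y, width, height)`, and the threshold of 0.5 for counting a detection as a true positive is an assumption for this sketch; the text above does not state the threshold that was actually used.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (0 if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def is_true_positive(detected_box, labelled_box, threshold=0.5):
    """Assumption: a detection counts as correct if IoU >= threshold."""
    return intersection_over_union(detected_box, labelled_box) >= threshold
```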

VGGface .
VGGface [24] is used to extract facial features from the cropped faces that were detected by MTCNN . Originally intended for face recognition tasks, it outputs a feature vector of size 512 when the top layer is removed. Its main advantage is the comparable performance to other face recognition models while using less data for training. The data set used to train the deep CNN, called VGG16 [37], is VGGface , collected by the visual geometry group of Oxford. It contains more than 2 500 identities and 2.6 million faces. While consisting of fewer identities and pose/age variations in comparison to its successor [6], the number of images is similar in scale. Compared to OpenFace, these features can be used to extract more raw facial features, e. g., to learn predictive facial movements from scratch.

OpenFace.
Facial features were also extracted from the cropped faces detected with MTCNN using OpenFace [4]. This toolkit provides a wide range of facial features. We extracted facial landmarks in both 2D (136 features) and 3D (204 features), 6 head pose features, 288 gaze positions, and the intensity and presence of 17 Facial Action Units (FAUs) each for the left side and centre.

Xception .
We use Xception [14] to provide features that capture the environment. Xception is a very deep, state-of-the-art network using residual blocks, which enable easier optimisation for large networks. This architecture won the 1st place on the ILSVRC 2015 classification task and other challenges. It is commonly used as a feature extractor for general vision features. To obtain the deep representations, we extract the output of the last fully connected layer from the pre-trained Xception network. As a result, a 2048-dimensional deep feature vector is provided for each frame.

GoCaR .
GoCaR [39] is a domain-specific visual feature extractor enabling the localisation of 28 car parts, such as door, steering wheel, headlights, and infotainment, with which the reviewer interacts inside and outside the vehicle. It is based on a modified YoloV3 framework [29] with a Darknet-53 backbone and is trained on a multi-label, multi-class real-world data set containing 15 003 vehicle images of 18 different BMW models with up to 100 different feature variants each. The coverage of a high number of feature variants is necessary to learn robust features, since cars have one of the highest possible numbers of product variants, e. g., the number of Mercedes E-Class equipment variations exceeds the order of 10^24 [26]. The extractor achieves a mean average precision of 67.57 %, ranging from 94 % for very distinctive parts such as grills to 14 % for less distinctive ones (e. g., roof window), on 1 000 extracted and manually labelled MuSe-CaR video frames. The provided GoCaR features are converted into an array of fixed size. For this purpose, we use the 10 objects with the highest confidence, and for each object we store the class (one-hot encoded), the confidence, and the localisation coordinates (x, y, width, and height). In total, this results in a feature vector of 10 * (27 + 7)-dimensions.
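A possible way to flatten a variable number of detections into a fixed-size array is sketched below. The dictionary keys, the zero-padding of unused slots, and the per-object layout (one-hot class, confidence, four box coordinates) are assumptions based on the description above; the exact layout and dimensionality of the released GoCaR features may differ.

```python
def encode_detections(detections, num_classes, top_k=10):
    """Flatten the top_k most confident detections into one fixed-size vector.

    detections: list of dicts with the (hypothetical) keys
    "class_id", "confidence", "x", "y", "width", "height".
    """
    slot_size = num_classes + 5  # one-hot class + confidence + 4 coordinates
    best = sorted(detections, key=lambda d: d["confidence"], reverse=True)[:top_k]
    vector = []
    for det in best:
        one_hot = [0.0] * num_classes
        one_hot[det["class_id"]] = 1.0
        vector.extend(one_hot)
        vector.extend(
            [det["confidence"], det["x"], det["y"], det["width"], det["height"]]
        )
    # Assumption: frames with fewer than top_k objects are zero-padded.
    vector.extend([0.0] * (slot_size * (top_k - len(best))))
    return vector
```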

OpenPose .
We extracted 18 2D pose keypoints 5 using the method proposed in [7], which yielded the best results in the COCO 2016 keypoints challenge [19]. We assume that at maximum, only one person is present in each frame. We use the pre-trained model provided by the authors in [7], trained on the COCO 2016 dataset [19]. The model consists of two branches of stacked CNNs, where one predicts 2D confidence maps for the keypoints of interest, and the other predicts Part Affinity Fields that contain information on the association of keypoints of the same individual amongst themselves. At each level, the outputs of each branch are concatenated and given as input to the higher level layer pair. In the end, we provide the 2D coordinates, as well as the corresponding confidence value of a keypoint being present, for each of the 18 keypoints.

Language
FastText [5] is a library for efficient learning of word embeddings. It is based on the skipgram model, where a vector representation is associated with each character n-gram. The model is trained on the English Common Crawl corpus (600B tokens). In comparison to other traditional word embeddings, such as word2vec [21] or GloVe [25], these sub-word chunks make it possible to calculate representations of words which were not part of the original training corpus (out-of-vocabulary). This is advantageous since we work with a domain-specific corpus including technical terms and model names, and it enables us to transform 96 % of words to word embedding vectors.

Alignment
The wide diversity of feature types from three modalities and the correspondingly different sampling rates lead to different lengths of the extracted features along the time axis. All continuous visual feature extractors (e. g., Xception, GoCaR) sample 4 frames per second, which corresponds approximately to the 250 ms labelling and the 250 ms audio sampling of DeepSpectrum and eGeMAPS (except low-level descriptors, which are sampled every 10 ms). Furthermore, human-focused features (e. g., VGGface, Facial Action Units) are only extracted from the frames in which a reviewer is visible. Recent work [42] has shown that even when advanced alignment mechanisms, such as attention heads, are incorporated in a multimodal neural network, the nets are more effective when the features are first aligned during pre-processing. Therefore, we provide for each sub-challenge non-aligned, label-aligned, and, additionally for the more text-related task MuSe-Topic, FastText-aligned features. If desired, the non-aligned features can be aligned by the participants using the corresponding timestamps (or start and end time of a segment for MuSe-Topic). The label-aligned features have exactly the same length (and timestamps) as the provided label files. We applied zero-padding to the frames where the feature type is not present or where unfavourable conditions prevented the extraction of features, e. g., OpenFace when no face appears or when only small faces appear in the original frame.
Only the FastText features are repeated for the duration of a word, and non-linguistic parts are also imputed with zero vectors. For the FastText alignment, the features are aggregated in such a way that for each FastText feature vector only one corresponding aggregated feature of any other type exists. This preparation should enable the participants to get started quickly, while still allowing for their own imputation procedures as well as unaligned modelling.
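The label alignment with zero imputation and the repetition of word embeddings over a word's duration can be sketched as follows. Timestamps are assumed to be integer milliseconds, and the data layouts (a dictionary from timestamp to feature vector, and `(start_ms, end_ms, embedding)` tuples for words) as well as the function names are hypothetical; they do not describe the format of the provided feature files.

```python
def align_to_labels(label_times_ms, feature_by_time_ms, dim):
    """Return one feature vector per label timestamp, zero-filled where missing."""
    zero = [0.0] * dim
    return [list(feature_by_time_ms.get(t, zero)) for t in label_times_ms]


def repeat_word_vectors(label_times_ms, words, dim):
    """Repeat each word embedding for every label timestamp inside the word.

    words: list of (start_ms, end_ms, embedding) tuples; timestamps outside
    any word (non-linguistic parts) receive a zero vector.
    """
    aligned = []
    for t in label_times_ms:
        vector = [0.0] * dim
        for start_ms, end_ms, embedding in words:
            if start_ms <= t < end_ms:
                vector = list(embedding)
                break
        aligned.append(vector)
    return aligned
```

Both functions return sequences with exactly as many entries as there are label timestamps, which mirrors the property stated above for the label-aligned features.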

BASELINE SYSTEMS
For each Sub-challenge, a series of state-of-the-art approaches have been applied, and for reproducibility, all resources are made freely available 6 . In the following, we describe the approaches in detail. An overview of all baseline results is given in Table 2, Table 3, and Table 4. For both the MuSe-Wild and MuSe-Trust Sub-Challenges, the paradigm is continuous prediction of emotional signals. For this, we have applied a Recurrent Neural Network (RNN) with self-attention, and a deep audio-to-target end-to-end approach. In addition to these models, we use Support Vector Machines (SVMs), a multimodal Transformer, and a fine-tuned NLP Transformer, Albert, to predict the classes of MuSe-Topic.

Early Fusion LSTM-RNN with Self-Attention
In order to address the sequential nature of the input features, we utilise a Long Short-Term Memory (LSTM)-RNN based architecture. The input feature sequences are fed into two parallel LSTM-RNNs with hidden state dimensionality equal to 40, to encode the two corresponding query and value vector sequences. A self-attention sequence is calculated by means of a query and key dot product using a sequence-wide attention window. The attention and query sequences are then concatenated. For the continuous-time tasks MuSe-Wild and MuSe-Trust, the resulting hidden vector for each time step is further encoded by a feed-forward layer that outputs a one-dimensional prediction sequence per prediction target. For the MuSe-Topic task, we instead apply global max-pooling to integrate the sequential information into one hidden state vector, which is then input into a feed-forward layer to provide the logits. In the former case, all the input samples are further segmented into 50 time-step sub-segments which are all used for training, whereas in the latter we pad/crop all sequences to 500 steps.
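The two sequence preparation steps mentioned at the end of the paragraph can be sketched as below. Whether the 50-step sub-segments overlap, and whether padding is added at the end of a sequence, is not stated above; non-overlapping segments and end-padding are assumptions of this example, as are the function names.

```python
def split_into_subsegments(sequence, length=50):
    """Cut a sequence into consecutive, non-overlapping sub-segments."""
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]


def pad_or_crop(sequence, length=500, pad_value=0.0):
    """Crop a sequence to `length` steps or pad it at the end with pad_value."""
    cropped = list(sequence[:length])
    return cropped + [pad_value] * (length - len(cropped))
```

For multi-dimensional features, `pad_value` would be a zero vector of the feature dimensionality rather than a scalar.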

End-to-End Learning
As our end-to-end baseline we use End2You [45]; an open-source toolkit for multimodal profiling by end-to-end deep learning [43,44]. For our purposes, we utilise three modalities, namely, audio, visual, and textual. Our audio model is inspired by a recently proposed emotion recognition model [46], and is comprised of a convolution recurrent neural network (CRNN). In particular, we use 3 convolution layers to extract spatial features from the raw segments. Our visual information is comprised of the VGGface features, where we use zero vectors when the face is not detected in a frame. Finally, as text features we use FastText , where we replicate the text features that span across several segments. We concatenate all uni-modal features and feed them to a one layer LSTM to capture the temporal dynamics in the data before the final prediction.

Multimodal Transformer
As baseline for the non-sequential predictions of MuSe-Topic, we choose the Multimodal Transformer (MMT) [42]. By using aligned and unaligned vision, language, and audio features for single label prediction, it outperformed state-of-the-art methods in a more text-focused Multimodal Sentiment Analysis setting. MMT merges multimodal time series using a feed-forward fusion process consisting of multiple crossmodal Transformer units. At the core of this network architecture are crossmodal attention modules which fuse multimodal features by directly attending to low-level features across all modalities. To predict topics, valence, and arousal we always utilise 3 feature sets, either of our three (tri), or of only two (bi) different modalities fed into the network. We noticed that after approximately 20 epochs the network converged. The model uses 5 crossmodal attention heads and an initial learning rate of 10^-3.

Albert
To reflect the current trend towards Transformer language models, such as Bidirectional Encoder Representations from Transformers (BERT) [10], we include one of the latest versions, Albert [17], as a purely text-based baseline model. The authors of Albert proposed parameter reduction techniques, so that the total memory consumption is lower while the training speed increases. These models supposedly scale better than the original BERT. The architecture is able to achieve state-of-the-art results on several benchmarks, despite having a relatively small number of parameters. For our purposes, we found supervised tuning on the train partition for 3 epochs with balanced class weights to have the best effect. We applied a learning rate of 10^-5 for the adjusted Adam Optimiser and set ε to 10^-8. With a sequence length of 300, the batch size has to be limited to 12 samples to be trained with 32 GB GPU memory.

Support Vector Machines
For the task of emotion prediction in the Sub-Challenge MuSe-Topic only, we also choose to include results obtained through the use of conventional and easily reproducible Support Vector Machines (SVMs). These experiments employ the Scikit-learn toolkit, with a LinearSVR classifier. No standardisation or normalisation was applied to any of the reported feature sets. The complexity parameter C was always optimised from 10^-5 to 1 during the development phase, and the best value for C is reported. In contrast to our other approaches, we retrain the model on a concatenation of the train and development sets to predict the final test set result.

MuSe-Wild
We evaluated several feature sets and combinations for the prediction of continuous arousal and valence (see Table 2 for detailed results). For the prediction of arousal, the LSTM-RNN with self-attention using LLDs as input features achieved the best result of all applied systems, with a CCC of .3088 on the devel set and .2884 on the test set. However, its combined metric (mean of valence and arousal) is considerably lower (CCC: .1931 on devel) due to its poor performance on the prediction of valence. Therefore, we define the end-to-end framework utilising FastText, VGGface, and audio representations learnt from the raw audio signal as our baseline, achieving a CCC of .2431 (on test) for the prediction of valence and .2706 (on test) for the prediction of arousal. We report a combined score of .2568 (on test) for this system.

MuSe-Topic
Table 3 shows the results of the baseline systems on the language-centric task of topic prediction. In line with recent research, the state-of-the-art NLP Transformer Albert, fine-tuned on the training set, achieved the best baseline result with 76.79 % (combined, on test), leaving the second best system, the Multimodal Transformer (MMT) utilising FastText, eGeMAPS, and Facial Action Units features (52.98 % on test), far behind. The most successful configuration of the LSTM + Self-attention, the only architecture not based on Transformers, trails the MMT by a further gap of about 15 percentage points (37.37 % on test), demonstrating the competitiveness of our baseline and the suitability of Transformers for this task. For the task of emotion (valence and arousal) prediction in the MuSe-Topic sub-challenge, we also report baseline results in Table 4. Here, the picture is more balanced with some systems failing to

Table 2: Reporting Arousal, Valence and Combined (0.5 · Arousal + 0.5 · Valence) for MuSe-Wild and Trustworthiness for MuSe-Trust, both using the concordance correlation coefficient (CCC). As feature sets, FastText (FT), eGeMAPS (Ge), DeepSpectrum (DS), GoCaR (Go), VGGface (VG), Xception (X), and all visual (aV) are fed into the models. Furthermore, the raw audio signal (RA) is used in End2You, and low-level descriptors (LLD) are utilised for MuSe-Trust in order to predict trustworthiness. All utilised features of MuSe-Wild and MuSe-Trust are aligned to the label timestamps by imputing missing values or repeating the word embeddings for FastText.
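
The concordance correlation coefficient and the combined MuSe-Wild score used in Table 2 can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the official evaluation script: it uses population (biased) variance and covariance over a single sequence, and how predictions are pooled across videos is not specified here. The function names are chosen for this example only.

```python
from typing import Sequence


def ccc(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Concordance correlation coefficient between two equal-length sequences.

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2)
    Population statistics (division by n) are an assumption of this sketch.
    """
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        # Both sequences are constant and identical: CCC is undefined.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2.0 * cov / denom


def combined_score(ccc_arousal: float, ccc_valence: float) -> float:
    """Combined MuSe-Wild score: 0.5 * arousal CCC + 0.5 * valence CCC."""
    return 0.5 * ccc_arousal + 0.5 * ccc_valence
```

Unlike Pearson correlation, the CCC penalises a constant offset between prediction and gold standard: a prediction that is shifted by one unit but otherwise perfectly correlated scores below 1. Applying `combined_score` to the baseline's test CCCs of .2706 (arousal) and .2431 (valence) gives 0.25685, in line with the reported combined score of .2568.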

MuSe-Trust
The results for the prediction of trustworthiness are depicted in Table 2. Similar to MuSe-Wild, the end-to-end baseline system using FastText, VGGface, and raw audio signals gave the best results, with a CCC of .4128 on test. The results may improve if the predicted valence and arousal signals are incorporated during training.
This can be accomplished in three ways: i) the model from MuSe-Wild is utilised to predict arousal and valence on MuSe-Trust; ii) the arousal and valence models are retrained on MuSe-Trust (we provide train and devel labels); or iii) all three are predicted in a multi-task fashion (one model, 3 outputs) on train and devel, and only trustworthiness is predicted on test. We opted for option (iii). When these signals are added to the end-to-end baseline system, the predictive power of the model is similar to that of the previous one, with a CCC of .3264 on the development set and .4119 on the test set.

CONCLUSIONS
In this paper, we introduced MuSe 2020, the first Multimodal Sentiment Analysis in Real-life Media challenge. MuSe 2020 utilises the MuSe-CaR multimodal corpus of emotional car reviews and comprises three Sub-challenges: i) MuSe-Wild, where the level of the affective dimensions of valence (corresponding to sentiment) and arousal has to be predicted from a ca. 35 hour data subset; ii) MuSe-Topic, where the domain-related conversational topic (10 classes) as well as three classes (low, medium and high) of valence and arousal have to be predicted from video parts containing the discussed topic; and iii) MuSe-Trust, where the level of continuous trustworthiness has to be predicted from features and/or affective annotations. We deliberately used open-source software to extract a wide range of feature sets in order to deliver the highest possible transparency and realism for the baselines. Besides the features, we also share the raw data and the developed code for our baselines on a public platform. Results indicate that: i) the level of affect in-the-wild is best predicted when the system is trained on the raw audio features; ii) for MuSe-Topic, (NLP-specific) Transformers are clearly superior when it comes to the prediction of topics, and no system clearly outperforms the others on the three-class valence and arousal prediction; and iii) in MuSe-Trust, adding valence and arousal contours as 'signals' in addition to other features is beneficial for the prediction of trustworthiness. The baselines also show the challenge ahead in mastering multimodal sentiment analysis, in particular when data are collected in user-generated, noisy environments. In the participants' and future efforts, we expect novel and exciting combinations of the modalities, potentially including linking modalities at earlier stages or more closely.

ACKNOWLEDGMENTS
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 115902 (RADAR CNS) and No. 826506 (sustAGE), the EPSRC Grant No. 2021037, and the Bavarian State Ministry of Education, Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B). We thank the sponsors of the Challenge, BMW Group and audEERING.