Learning to Detect Misleading Content on Twitter

The publication and spread of misleading content is a problem of increasing magnitude, complexity and consequences in a world that is increasingly relying on user-generated content for news sourcing. To this end, multimedia analysis techniques could serve as an assisting tool to spot and debunk misleading content on the Web. In this paper, we tackle the problem of misleading multimedia content detection on Twitter in real time. We propose a number of new features and a new semi-supervised learning event adaptation approach that helps generalize the detection capabilities of trained models to unseen content, even when the event of interest is of different nature compared to that used for training. Combined with bagging, the proposed approach manages to outperform previous systems by a significant margin in terms of accuracy. Moreover, in order to communicate the verification process to end users, we develop a web-based application for visualizing the results.


INTRODUCTION
Recent years have seen a tremendous increase in the use of social media platforms such as Twitter and Facebook as a means of sharing news content and multimedia. The simplicity of the sharing process has led to large volumes of news content propagating over social networks and reaching huge numbers of readers in a very short time. Multimedia content (images, videos) in particular can rapidly reach massive audiences and become viral because it is easily consumed and often carries a lot of entertainment value.

ICMR '17, June 6-9, 2017, Bucharest, Romania

Given the speed of the news spreading process and the competition among news outlets and journalists to publish first, it is only natural that the verification of content is often carried out in a superficial manner or even neglected. This leads to the online appearance and spread of large amounts of misleading multimedia content. In particular, when a news event breaks (e.g., a natural disaster) and media coverage is of primary importance, news professionals often turn to social media to source newsworthy content. It is exactly in this setting that the risk of misleading content becoming viral is the highest. As misleading (or, for the sake of brevity, fake), we define any post that shares multimedia content that does not faithfully represent the event that it refers to. This could include: a) content from a past event that is reposted as referring to the current event, b) content that is deliberately manipulated, or c) content that is falsely used to represent an aspect of the current event. In a similar way, as real, we define posts that share content that faithfully represents the event in question. There are in-between cases: for instance, when a post acknowledges the misleading nature of the content it shares or refers to it with a sense of humour, it is hard to categorize as fake or real; these are out of the scope of this work.
The impact of fake content being widely disseminated can be quite severe. For example, after the Malaysia Airlines flight disappeared in March 2014 (Figure 1), numerous fake images that soon became viral on social media raised false alarms that the plane had been detected. This deeply affected and caused emotional distress to people directly involved in the incident, such as the passengers' families. Examples such as this point to the need for means of identifying and debunking fake media content on social media. One of the first such attempts [8] used a supervised learning approach, in which a set of known fake and real tweets was used to train a model to distinguish between the two classes; experiments were conducted on a dataset around Hurricane Sandy and a very high detection accuracy was reported. Yet, the fact that content from the same event was used both for training and testing was found to give an overly optimistic sense of accuracy, calling into question the ability of the approach to generalize to content from different events [3].
To address the limitations of state-of-the-art solutions to the problem, we present a robust approach for detecting in real time whether a media item shared by a tweet is fake or real. The proposed fake detection approach uses a variety of content-based and contextual features for the social media post in question, and leverages part of its own predictions for retraining, following a semi-supervised learning paradigm, to adapt the model to unseen content. Experiments on a public annotated corpus of multimedia tweets demonstrate the effectiveness of the proposed approach. Additionally, we propose a visualization method for communicating the result of automatic analysis to end users in an intuitive way.

RELATED WORK
Multimedia forensics. Although the field of multimedia forensics has led to a multitude of methods for detecting digital manipulation in digital content [18,21], recent research has shown that tampered images found on the Web are very hard to detect [27,28]. Moreover, in a lot of cases, forensics techniques are insufficient, e.g. when the multimedia item is just a reposting from a past event. Indeed, past studies have demonstrated that more than half of the videos around trending topics are repostings or remixes of past content [26]. Also, the verification methods employed by journalists [23], e.g. looking into the Exif metadata of content or getting in touch with the person that published it, are often not applicable due to the constraints of popular social media platforms. For instance, Twitter and Facebook remove the Exif metadata from posted content.
Assessing content credibility in social media. Castillo et al. focused on automatic methods for assessing the credibility of a given set of tweets. In particular, they analysed microblog posts related to trending topics and classified them as credible or not credible based on a number of features [5]. A similar approach was presented by Gupta et al. [8], demonstrating high classification accuracy on a dataset of tweets collected around Hurricane Sandy. A thorough experimental study of information credibility on Twitter was also based on information propagation processes in the context of news events [6]. Two models were developed, one that decides whether an information cascade corresponds to a newsworthy event and another one that evaluates the trustworthiness of the cascade. In contrast with the aforementioned approaches, Martinez-Romo et al. [13] conducted a study for detecting malicious tweets in trending topics focusing on statistical linguistic analysis, taking into account exclusively the tweets without considering any information from users. Finally, O'Donovan et al. [16] performed an analysis of the utility of various features when predicting content credibility.
Verification services and systems. Ratkiewicz et al. developed the Truthy system [17], a web service for tracking political memes and misinformation on Twitter, focusing on political astroturf. Truthy collects tweets, detects a number of memes in them, and offers a web interface that lets users annotate those memes they consider "truthy". In recent years, fully automatic systems have been developed, such as TweetCred [7], a tool that computes credibility scores for a set of tweets, and Hoaxy [22], a platform for detecting and analysing online misinformation. Finally, semi-automatic systems have also been introduced, such as RumorLens [20], which combines human effort with computation to detect new rumours on Twitter, and TwitterTrails [14], which lets users investigate the propagation of a given rumour.

FAKE DETECTION FRAMEWORK
The proposed framework relies on two independent classification models built on the training data using two different sets of features, tweet-based (TB) and user-based (UB) features. A bagging technique is used when building both models. At prediction time, an agreement-based retraining strategy is employed (fusion), which combines the outputs of the two models in a semi-supervised learning manner, to increase the generalization capabilities of the framework given tweets from a new unknown event. The outcome of the verification is then visualized to end users. A corpus of labelled posts (described in Section 4) is necessary in order to build the classification models and to generate the visualizations. Figure 2 depicts the main components of the proposed framework. The implementation of the framework is publicly available on GitHub (footnote 1).

Feature Extraction
The selection of features used in our framework was carried out following a thorough study of the way in which news professionals, such as journalists, verify content on the Web. Based on relevant journalistic studies, such as [12], and the Verification Handbook [23], we have defined a set of features that are important for verification. These are not limited to the content itself, but also pertain to its source (the Twitter account that posted the content) and to the location where it was posted. We decided not to use any image/video forensics features following the conclusion of our recent study [27] that Twitter media content is not amenable to image forensics. This was also confirmed by our recent MediaEval participations [2,4], where the use of forensics features did not lead to consistent improvement. The feature extraction process produces a set of TB and UB features for each tweet (Table 1).

Table 1: Overview of verification features. Link-based features are extracted in the TB case for external links that tweets may share, and in the UB case for the URL included in the account profile. Features with an asterisk were proposed in [3,8] and will be denoted as Baseline Features (BF), while the full feature set (BF and newly proposed ones) will be denoted as Total Features (TF).

Tweet-based Features (TB)
  text-based: #words*, length of text*, #question marks*, #exclamation marks*, contains question mark*, contains exclamation mark*, has "please", has colon, contains happy emoticon*, contains sad emoticon*, #uppercase chars*
  language-specific: #pos senti words*, #neg senti words*, #slangs, #nouns, contains 1st pers. pron.*, contains 2nd pers. pron.*, contains 3rd pers. pron.*, readability
  Twitter-specific: #retweets*, #hashtags*, #mentions*, #URLs*, has external link
User-based Features (UB)
  user-specific: #friends*, #followers*, follower-friend ratio*, #media content, has profile image, has header image, has a URL*, has location, has existing location, has bio description, tweet ratio, account age, is verified*, #times listed*
Link-based Features (common for TB and UB)
  WOT score, in-degree centrality, harmonic centrality, alexa country rank, alexa delta rank, alexa popularity, alexa reach rank

Tweet-based features (TB): We consider four types of features related to tweets: a) text-based, b) language-specific, c) Twitter-specific, and d) link-based.
a) text-based: These are extracted from the text of the tweet, and capture characteristics such as the length of the tweet text and the number of words in it. They also include characteristics such as the number of question and exclamation marks and of uppercase characters, as well as binary features indicating the presence or absence of emoticons, special words ("please") and punctuation (colon).
b) language-specific: These are extracted for a predefined set of languages (English, Spanish, German), which are first detected using a language detection library (https://code.google.com/p/language-detection/). They include the number of positive and negative sentiment words in the text. For English we use the list by Jeffrey Breen (https://github.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107), for Spanish the adaptation of ANEW [19] and for German the Leipzig Affective Norms [10]. Additional binary features indicate whether the text contains personal pronouns (in the supported languages). An additional feature is the number of slang words in the tweet. This is extracted using lists of slang words in English (http://onlineslangdictionary.com/word-list/0-a/) and Spanish (http://www.languagerealm.com/spanish/spanishslang.php). For German, no slang list was found and hence no such feature is computed. Moreover, the number of nouns in the tweet text was also added as a feature; it is computed with the Stanford parser [11], for English only. Finally, to investigate whether the readability of the tweet text is related to its veracity, we use the Flesch Reading Ease method (http://simple.wikipedia.org/wiki/Flesch_Reading_Ease) to compute a readability score in the range [0, 100], with 0 representing very hard-to-read text and 100 very easy-to-read text. For tweets written in a language for which the above features cannot be extracted, we consider the corresponding values missing.
c) Twitter-specific: This set contains features related to the Twitter platform. These include the number of retweets, hashtags, mentions and URLs, and a binary feature expressing whether any of the URLs points to external (non-Twitter) resources.
d) link-based: These include features that provide information about the links that are shared through the tweet. This set of features is common to both the TB and UB sets, but in the latter it is defined in a different way (see the link-based category in UB features). For TB, depending on the existence of an external URL in the tweet, its reliability is quantified based on a set of Web metrics: i) the WOT score (https://www.mywot.com/), which is a way to assess the trust in a website using crowdsourced reputation ratings, ii) the in-degree and harmonic centralities (http://wwwranking.webdatacommons.org/more.html), computed based on the links of the Web graph, and iii) four Alexa metrics (rank, popularity, delta rank and reach rank) based on the rankings API (footnote 9).
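To make the text-based and Twitter-specific features more concrete, the following is a minimal sketch of how a few of them could be computed from the raw tweet text. It is an illustration only and not the released implementation: the function name, the feature keys, the emoticon lists and the regular expressions are all assumptions, and the number of retweets is omitted because it comes from tweet metadata rather than from the text.

```python
import re

# Hypothetical emoticon lists; the lists used by the framework are not given in the text.
HAPPY_EMOTICONS = (":)", ":-)", ":D")
SAD_EMOTICONS = (":(", ":-(")


def tweet_text_features(text):
    """Compute a few text-based and Twitter-specific features from raw tweet text."""
    return {
        # text-based features
        "length_of_text": len(text),
        "num_words": len(text.split()),
        "num_question_marks": text.count("?"),
        "num_exclamation_marks": text.count("!"),
        "contains_question_mark": "?" in text,
        "contains_exclamation_mark": "!" in text,
        "num_uppercase_chars": sum(ch.isupper() for ch in text),
        "has_please": re.search(r"\bplease\b", text, re.IGNORECASE) is not None,
        "has_colon": ":" in text,  # note: also true for URLs and emoticons
        "contains_happy_emoticon": any(e in text for e in HAPPY_EMOTICONS),
        "contains_sad_emoticon": any(e in text for e in SAD_EMOTICONS),
        # Twitter-specific features that can be read off the text
        "num_hashtags": len(re.findall(r"#\w+", text)),
        "num_mentions": len(re.findall(r"@\w+", text)),
        "num_urls": len(re.findall(r"https?://\S+", text)),
    }
```

For instance, calling `tweet_text_features("OMG!! Is this REAL? Please RT #Sandy @cnn")` yields eight words, two exclamation marks, one question mark, one hashtag and one mention.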
User-based features (UB): These are related to the Twitter account posting the tweet. We divide them into a) user-specific and b) link-based features.
a) user-specific: These include the user's number of friends and followers, the account age, the follower-friend ratio, the tweet ratio (number of tweets/day divided by account age) and a number of binary features: whether the user is verified by Twitter, whether there is a biography in his/her profile, whether the user declares his/her location using a free text field, whether the location text can be parsed into an actual location (footnote 10), whether the user has a header or profile image, and whether a link is included in the profile.
b) link-based: In this case, depending on the existence of a URL in the Twitter profile description, we apply the same Web metrics as the ones used in the link-based TB features. If there is no link in the profile, the values of these features are considered to be missing.
After feature extraction, the next steps include pre-processing, cleaning and transformation. To handle the issue of missing values on some of the features, we use linear regression for estimating their values: we consider the attribute with the missing value as a dependent (class) variable and apply linear regression for numeric features. The method cannot support the prediction of boolean values and hence those are left missing. Only feature values from the training set are used in this process. Data normalization is also performed to scale the numeric feature values to the range [-1, 1].
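The two pre-processing steps described above can be sketched as follows. This is a simplified illustration under stated assumptions: the regression uses a single predictor feature (the text does not specify which attributes serve as predictors), missing values are represented as `None`, and all function names are hypothetical. Both the regression coefficients and the scaling bounds are fitted on training data only, as in the text.

```python
def fit_simple_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept on complete training pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def impute(target, predictor, slope, intercept):
    """Fill missing (None) numeric target values from the predictor; keep observed ones."""
    return [slope * x + intercept if y is None else y
            for x, y in zip(predictor, target)]


def fit_minmax(train_values):
    """Return (min, max) of the observed training values of one feature."""
    observed = [v for v in train_values if v is not None]
    return min(observed), max(observed)


def scale(value, lo, hi):
    """Map a value linearly so that lo -> -1 and hi -> 1; missing values stay missing."""
    if value is None:
        return None
    if hi == lo:
        return 0.0
    return 2.0 * (value - lo) / (hi - lo) - 1.0
```

Note that a test value outside the range observed during training is mapped outside [-1, 1] by this sketch; whether such values should be clipped is a design choice that the text does not discuss.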

Building the classification models
We use the TB and UB features to build two independent Random Forest classifiers (CL1, CL2), each of which is based on the respective set of features. To further increase classification accuracy, we make use of bagging: we create m different subsets of tweets from the training set, each including an equal number of samples for each class (some samples may appear in multiple subsets), leading to the creation of m instances of CL1 and CL2 (m = 9 in our experiments), as shown in Figure 2. The final prediction for each of the test samples is calculated using the majority vote among the m predictions.
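The bagging scheme described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function names are hypothetical, the subset size is assumed to be the size of the smaller class, and each trained model is represented as a plain callable that maps a sample to a label (in the framework these would be Random Forest classifiers).

```python
import random
from collections import Counter


def balanced_subsets(samples, labels, m=9, seed=0):
    """Draw m class-balanced subsets of (sample, label) pairs from the training set.

    Sampling is without replacement inside a subset, but independent across
    subsets, so the same sample may appear in several subsets.
    """
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    size = min(len(items) for items in by_class.values())  # assumption
    subsets = []
    for _ in range(m):
        subset = []
        for label, items in by_class.items():
            subset.extend((s, label) for s in rng.sample(items, size))
        subsets.append(subset)
    return subsets


def majority_vote(predictions):
    """Return the most frequent label among the m predictions."""
    return Counter(predictions).most_common(1)[0][0]


def bagged_predict(models, x):
    """Apply every model instance to x and combine the outputs by majority vote."""
    return majority_vote([model(x) for model in models])
```

With two classes, an odd m (such as m = 9) guarantees that the vote cannot end in a tie.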

Agreement-based retraining
A key novelty in the proposed framework is an agreement-based retraining step (the fusion block in Figure 2) to improve the prediction accuracy for content associated with unseen events. This was motivated by an approach that was proposed for effectively tackling out-of-domain sentiment classification [25]. We combine the outputs of classifiers CL1, CL2 as follows: for each sample of the test set, we compare their predictions and, depending on their agreement, we divide the test set into the agreed and disagreed subsets. The instances of the agreed set are assigned the agreed label (fake/real), assuming that it is correct with high likelihood, and are used to build a new classifier to handle the disagreed instances. To this end, we use two retraining techniques. First, we select the most effective of the independent classifiers CL1, CL2 based on their performance on the training set during cross-validation. Then, we either use just the agreed samples to train the CL classifier (denoted as CL(i)), or we use the entire set of initial training samples extended with the set of agreed samples (denoted as CL(ii)). The goal is to adapt the initial model to the specific data characteristics of the new event. In that way, the model can predict more accurately the values of the samples for which CL1, CL2 did not initially agree. In the experimental section, we test both retraining variants.
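The agreement-based retraining step can be sketched as follows. This is a simplified illustration under several assumptions: classifiers are plain callables that map a sample to a label, `train_fn` is a hypothetical training routine that is assumed to already encapsulate the choice of the better-performing classifier type, and a toy nearest-neighbour learner on numeric samples stands in for the Random Forest used in the framework. The fallback when no samples are agreed upon is also an assumption.

```python
def train_nearest_neighbour(labelled):
    """Toy stand-in learner: predict the label of the closest training sample."""
    def predict(x):
        nearest = min(labelled, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict


def agreement_based_retraining(train, test_x, cl1, cl2, train_fn, variant="i"):
    """Label test samples, retraining on the samples where cl1 and cl2 agree.

    variant "i":  retrain on the agreed test samples only          (CL(i))
    variant "ii": retrain on the training set plus agreed samples  (CL(ii))
    """
    pred1 = [cl1(x) for x in test_x]
    pred2 = [cl2(x) for x in test_x]
    agreed = [(x, a) for x, a, b in zip(test_x, pred1, pred2) if a == b]
    if variant == "i":
        retrain_set = agreed
    elif variant == "ii":
        retrain_set = list(train) + agreed
    else:
        raise ValueError("variant must be 'i' or 'ii'")
    if not retrain_set:
        return pred1  # assumption: fall back to cl1 if nothing can be retrained
    retrained = train_fn(retrain_set)
    # Agreed samples keep the agreed label; disagreed ones go to the new model.
    return [a if a == b else retrained(x)
            for x, a, b in zip(test_x, pred1, pred2)]
```

For example, with `cl1 = lambda x: "fake" if x > 7.5 else "real"` and `cl2 = lambda x: "fake" if x > 3 else "real"`, the two classifiers disagree on the samples 4 and 7 of the test set `[1, 2, 4, 7, 8, 9]`; the retrained toy model then labels 4 as real and 7 as fake, based on their nearest agreed neighbours.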

Verification result visualization
The key idea for visualizing the results of the proposed verification process is to present the list of extracted features for the input tweet, and then, for a selected feature, to present its value in relation to the distribution that this feature has for real versus fake tweets, as computed with respect to the verification corpus (Section 4). Figure 3 illustrates a screenshot of this application, which is publicly available (http://reveal-mklab.iti.gr/reveal/fake/). In terms of usage, the end user first provides the URL or id of a tweet of interest, and then the application presents the extracted tweet- and user-based features and the verification result (fake/real) for the tweet in the form of a color-coded frame (red/green respectively). It also offers the possibility of inspecting the feature values. When a feature is selected, its value distribution appears (right column), separately for the fake and real tweets (side-by-side). Moreover, a textual description informs the user about the percentage of tweets of this class (fake or real) that have the same value for this feature. In that way, the investigator may better understand how the verification result is justified based on the individual values of the features in relation to the "typical" values that these features have for fake versus real tweets.

VERIFICATION CORPUS
Our fake detection models are based on a publicly available verification corpus (VC) of fake and real tweets. More specifically, this consists of tweets related to the 17 events of Table 2, comprising in total 193 cases of real images, 218 cases of misused (fake) images and two cases of misused videos, associated with 6,225 real and 9,596 fake tweets posted by 5,895 and 9,216 unique users respectively.
The corpus comprises a set of tweets T that is collected with the help of a set of keywords K for each of the 17 events. The ground truth labels (fake/real) of these tweets are based on a set of online resources, which discussed and debunked images and videos widely shared in the context of these events. Only resources from reputable news providers that adequately justified their decision about the veracity of each multimedia item were used. This led to a set of fake and real multimedia cases, denoted as I_F, I_R respectively, which were then used as seeds to create the reference verification corpus T_C ⊂ T. This includes exclusively tweets that contain at least one item from the two sets. In order not to restrict the tweets to only those that point to the exact seed URLs, a visual near-duplicate search technique was employed [24]. More specifically, the sets of fake and real images were used as visual queries and for each query it was checked whether each image tweet from T exists as an image item or a near-duplicate image item of the I_F or the I_R set. To ensure near-duplicity, a minimum threshold of similarity was empirically set, tuned for high precision. A small number of the images exceeding the threshold were manually found to be irrelevant to the ones in the seed set and were removed.
Several of the events, e.g., Columbian Chemicals, Passport Hoax and Rock Elephant, were actually hoaxes, hence all content associated with them is fake. Also, for several real events (e.g., MA flight 370) no real images (and hence no real tweets) were included in the dataset, since none came up as a result of the data collection.
As the aim of our work is to assess the generalization capability of the fake detection framework, we used every tweet in the corpus regardless of its language. The aim has been to use a comprehensive corpus, which contains the widest possible variety of fake tweets (even though this complicates the machine learning process due to missing feature values). Furthermore, we included content from different types of event. In terms of type of fake, we considered the following four categories: a) reposting of real: real photos from past events re-posted as being associated with a current event (Figure 4 (i)); b) reposting of synthetic: synthetic digital images, such as artworks or snapshots from movies, presented as real imagery about an event (Figure 4 (ii)); c) speculations: real photos from an ongoing event, expressing speculations regarding the association of persons or actions with the event (Figure 4 (iii)); d) digital tampering: digitally manipulated photos (Figure 4 (iv)).

Table 2: List of events in VC: For each event, we report the number of unique real (if available) and fake images (I_R, I_F respectively), unique tweets that shared those images (T_R, T_F) and Twitter accounts that posted those tweets (U_R, U_F).

From the corpus, we considered only unique posts by eliminating retweets. Finally, by manually checking the content of tweets, we ensured that no posts were included that featured funny/humorous content, nor posts that declared that their content is fake (both of which would be hard to classify as either real or fake).

EXPERIMENTAL STUDY

Overview
The aim of the conducted experiments was to evaluate the fake detection accuracy of different models on samples from new (unseen) events. We consider this an important aspect of a verification framework, as the nature of the untrustworthy (fake) tweets posted may vary across different events. Accuracy is computed as the ratio of correctly classified samples (N_c) over the total number of test samples (N): a = N_c / N. The employed evaluation scheme can be thought of as a kind of event-based cross-validation: for each event E_i of the 17 events in the VC, we use the remaining 16 events for training and E_i for testing. We denote each of these 17 splits as T_i. All models are built using Random Forests of 100 trees.
In addition, to compare the performance of our framework with methods that participated in the recently organized Verifying Multimedia Use task in the context of MediaEval [1], we use the split proposed by the task organizers (denoted as T18): events E1-E11 are used for training, and events E12-E17 for testing.
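The event-based cross-validation and the accuracy measure described above can be sketched as follows. The function names and the representation of the corpus as a list of per-tweet event identifiers are assumptions made for illustration.

```python
def leave_one_event_out(event_ids):
    """Yield (held_out_event, train_indices, test_indices) for every event.

    event_ids[i] is the identifier of the event that tweet i belongs to.
    """
    for held_out in sorted(set(event_ids)):
        train_idx = [i for i, e in enumerate(event_ids) if e != held_out]
        test_idx = [i for i, e in enumerate(event_ids) if e == held_out]
        yield held_out, train_idx, test_idx


def accuracy(y_true, y_pred):
    """Ratio of correctly classified samples over all test samples (a = N_c / N)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

Each yielded split corresponds to one of the 17 event-based splits: the tweets of the held-out event form the test set and the tweets of all other events form the training set.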

New Features and Bagging
We first assess the contribution of the new features and bagging to the method's accuracy. To this end, we build the CL1, CL2 classifiers with and without the bagging technique. To create the models without bagging, we selected each time an equal number of random fake and real samples for training. We applied this procedure both for the Baseline (BF) and Total Features (TF) (cf. Table 1 caption). Table 3 presents the average accuracy for each setting.
We observe that the use of bagging led to considerably improved accuracy for both CL1 and CL2. In addition, further improvements are achieved when using the TF features over BF. We see that bagging led to an absolute improvement of approximately 10% and 15% in the accuracy of CL1 and CL2 respectively (when using the TF features), while the use of TF features over BF led to an absolute improvement of approximately 18% on both classifiers (when bagging is used). Combined, the use of bagging and the newly proposed features led to an absolute improvement of more than 24% for both CL1 and CL2. Given the clear benefits of using bagging, in subsequent experiments, all reported results refer to classifiers with bagging.

Agreement-based retraining technique
We use the entire set of features (TF) for assessing the accuracy of the agreement-based retraining approach. Table 4 shows the scores obtained separately for each split. The first two columns present the agreement level and the accuracy of the classifiers on the agreed set. We observe that on average the predictions of the two classifiers (CL1, CL2) agree for the majority of tweets (Agreement (%) column); in particular, they agree on 70.65% of the tweets on average. On this set of tweets, the average accuracy (Agreed accuracy column) is extremely high (94.06%). One may also note that the higher the agreement level, the higher the achieved accuracy on the agreed set. The next two columns present the accuracy on the disagreed samples, when using the two variations of the retraining process (CL(i) and CL(ii) in Section 3.3). The next two columns show the results when combining the accuracy of the agreed and disagreed samples. On average, the first retraining variation, i.e. using only the samples of the new event for training, slightly outperforms the second. For comparison purposes, the last two columns of the table present the scores of the CL1 and CL2 classifiers trained with the bagging technique and applied independently. Those correspond to the standard supervised learning paradigm [3,8]. Comparing the average scores of the classifiers in the last two columns (88.34% and 75.69% respectively) with those of the agreement-based retraining technique (92.13% and 91%), one can see a clear improvement in terms of classification accuracy (approximately 4% when compared to the best CL1 configuration).
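For a single split, the relation between the agreement level, the accuracy on the agreed set and the accuracy on the disagreed set follows from the definition of accuracy as a ratio of counts. The symbols below are introduced here for illustration only and are not the notation of the tables: N_agr and N_dis are the numbers of agreed and disagreed test samples (N = N_agr + N_dis), a_agr and a_dis the accuracies on the two subsets, and p the agreement level.

```latex
a_{\text{overall}}
  = \frac{N_{\text{agr}}\, a_{\text{agr}} + N_{\text{dis}}\, a_{\text{dis}}}{N}
  = p\, a_{\text{agr}} + (1 - p)\, a_{\text{dis}},
\qquad p = \frac{N_{\text{agr}}}{N}
```

This identity holds per split. Since p differs from split to split, the averaged figures quoted above cannot be combined in the same way to reproduce the averaged overall accuracy.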

Performance on different languages
We also assessed the classification accuracy of the framework for tweets written in different languages, i.e. the extent to which the framework is language-dependent. We keep the five most used languages in the corpus (by number of tweets). Note that in many cases no language is detected, either because the tweet contains no text but just hashtags/URLs or because the text is too short to be handled by the language detector. For this reason, we also consider this category of tweets (denoted as NO-LANG), and thus compare between the following cases: English (EN), Spanish (ES), no language (NO-LANG), Dutch (NL) and French (FR). Table 5 shows the languages tested and the corresponding number of samples.
Using the full feature set TF, we calculate the accuracy on each split (T1-T18) separately on the samples of each language. Figure 5 shows the results for each split when using the agreement-based retraining technique, according to the first and second variation of the method respectively. In most cases, it appears that fake detection accuracy remains relatively stable independent of language. The highest accuracy scores are achieved for NO-LANG followed by English and Spanish. Accuracy is somewhat lower for French and Dutch. This is an encouraging finding since it indicates that the framework is reliable even for languages for which the language-specific features are not defined.
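The per-language breakdown can be sketched as a simple grouping of test samples by detected language. The function name and the convention that an undetected language is passed as `None` or an empty string are assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_language(languages, y_true, y_pred):
    """Compute accuracy separately for each detected language.

    Samples without a detected language are grouped under "NO-LANG".
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for lang, truth, pred in zip(languages, y_true, y_pred):
        key = lang if lang else "NO-LANG"
        totals[key] += 1
        hits[key] += int(truth == pred)
    return {key: hits[key] / totals[key] for key in totals}
```

Applied to the predictions of one split, the function returns one accuracy value per language that occurs in that split.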

Comparison with state of the art
We also compare our method with the ones submitted to the MediaEval 2015 Verifying Multimedia Use task. These include the systems by UoS-ITI [15], MCG-ICT [9], and CERTH-UNITN [2]. For each of those, we only compare against their best run (footnote 12). The comparison is done using the F1-score, which is the official metric of the task. According to the results (Table 6), the proposed method achieves the second best performance (F = 0.934), reaching almost equal performance to the best run by MCG-ICT [9] (F = 0.942). The latter, however, uses an approach that is tailored to the specifics of the dataset. In particular, MCG-ICT relies on a model that first clusters tweets into topics according to the multimedia resource that they contain. Then, it extracts topic-level features for building the fake detection classifier. It is important to note that the dataset of the task makes available a list of tweets, their associated multimedia item and label (fake/real). The way the dataset is structured makes it possible to apply the MCG-ICT approach. However, in a realistic setting, unseen tweets do not appear in clusters (except in the case of highly popular media items that are shared concurrently by numerous different posts), which makes the application of such an approach much more complex and its results questionable. In contrast, our method leads to comparable performance without suffering from such limitations.

Table 6
Method             F1-Score
UoS-ITI [15]       0.830
MCG-ICT [9]        0.942
CERTH-UNITN [2]    0.911
Proposed           0.934
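For reference, the F1-score used in this comparison is the harmonic mean of precision and recall. The sketch below computes it from lists of true and predicted labels; treating "fake" as the positive class is an assumption made here for illustration, and the function name is hypothetical.

```python
def f1_score(y_true, y_pred, positive="fake"):
    """Harmonic mean of precision and recall for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Unlike accuracy, this measure ignores correctly classified negatives, so it can differ noticeably from accuracy when the classes are imbalanced.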

Verification visualization
To demonstrate the utility of the web-based verification application, we present an example case study where the proposed visualization approach is used on a tweet that shared fake multimedia content in the context of the March 2016 terrorist attacks in Brussels. The tweet (Figure 6) claimed that the shared video depicted one of the explosions in Zaventem airport, but the video is actually from another explosion in a different airport a few years earlier. Indeed, the proposed classification framework flags the tweet as fake and presents the feature distributions, which give useful insights into the reasons for this decision. Three sample tweet- and user-based feature distributions are illustrated in the upper and lower part of Figure 6 respectively. For example, in the first plot, the number of hashtags for this tweet is shown to be zero and the respective bar is highlighted. The plot shows that 63% of all training tweets that have this value are fake, a fact that partially justifies the classification result. In the two following plots, which display the number of mentions and the text length, similar conclusions can be drawn about the veracity of the tweet. In the user-based feature value distributions, the date of creation, the number of friends and the followers/friends ratio seem to give some additional strong signals regarding the credibility of the account, and as a result the veracity of the posted tweet.
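The kind of statistic shown in these plots can be sketched as follows. The code computes, for one feature, the share of fake tweets among the training tweets that have a given value, and the per-class value distribution that is drawn side by side. The function names and the list-based data layout are assumptions and do not reflect the implementation of the web application.

```python
from collections import Counter


def fake_share_for_value(values, labels, value):
    """Share of fake tweets among the training tweets with this feature value."""
    matching = [label for v, label in zip(values, labels) if v == value]
    if not matching:
        return None  # the value never occurs in the training corpus
    return matching.count("fake") / len(matching)


def class_distribution(values, labels, cls):
    """Relative frequency of each feature value among the tweets of one class."""
    counts = Counter(v for v, label in zip(values, labels) if label == cls)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}
```

In the case study above, `fake_share_for_value` evaluated on the number of hashtags with `value=0` would correspond to the 63% figure reported by the plot.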

CONCLUSIONS AND FUTURE WORK
We presented a robust and effective framework for fake multimedia detection on Twitter. Using a specially collected verification corpus, we provided evidence of the high accuracy of the proposed framework over a number of events of different size and nature, as well as considerable improvements in accuracy as a result of the newly proposed features, the use of bagging, and the application of an agreement-based retraining method that outperforms standard supervised learning. We also demonstrated the utility of a novel visualization approach for explaining the verification result.
To use the proposed approach in real-time settings, one should be aware of the following caveat. The agreement-based retraining method requires a number of samples from the new event in order to be applied. Hence, for the first set of arriving items, it is not possible to rely on this improved step. Yet, the rate at which new items arrive in the context of breaking news events could quickly provide the algorithm with a sizeable set of tweets.
In the future, we are interested in looking further into the real-time aspects of fake content detection, and in conducting experiments that better simulate the fake content detection problem as an event evolves. We also plan to conduct user studies to test whether the proposed visualization is understandable and usable by news editors and journalists. Finally, we also plan to extend the framework to be applicable to content posted on platforms other than Twitter.

ACKNOWLEDGMENTS
This work has been supported by the REVEAL and InVID projects, under contract numbers 610928 and 687786 respectively, funded by the European Commission.