Identification of COVID-19 related Fake News via Neural Stacking

Identification of fake news plays a prominent role in the ongoing pandemic, impacting multiple aspects of day-to-day life. In this work we present a solution to the shared task titled COVID19 Fake News Detection in English, which placed 50th among 168 submissions and scored within 1.5% of the best-performing solution. The proposed solution employs a heterogeneous representation ensemble, adapted for the classification task via an additional neural classification head composed of multiple hidden layers. The paper includes detailed ablation studies that further illustrate the proposed method's behavior and possible implications. The solution is freely available at \url{https://gitlab.com/boshko.koloski/covid19-fake-news}


Introduction
Fake news can have a devastating impact on society. In times of a pandemic, each piece of information can play a significant role in everyone's life. Verifying whether a given piece of information is fake or real is crucial, and can to some extent be learned [10]. To solve this task, computers need the data represented in a numeric format from which they can extract patterns and make decisions. We propose a solution to this problem by employing various natural language processing and learning techniques.
The remainder of this work is structured as follows: Section 2 describes prior work in the field of fake-news detection. The provided data is described in Section 3. Section 4 explains our proposed problem representation approaches, while Section 5 introduces two different meta-models built on top of the basic representations listed in Section 4. The experiments and the results achieved are presented in Section 6. Finally, the conclusion and the proposed future work are given in Section 7.

Related Work
The fake-news text classification task [16] is defined as follows: given a text and a set of possible classes, fake and real, to which the text can belong, an algorithm is asked to predict the correct class of the text. Most frequently, fake-news text classification refers to the classification of data from social media. The early solutions proposed for this problem used hand-crafted features of the authors, such as word and character feature distributions. Interactions between fake and real news spread on social media gave the problem of fake-news detection a network-like nature [18]. Network-based modeling uncovered useful components of the fake-news spreading mechanism and led to the idea of detecting bot accounts [17].
Most of the current state-of-the-art approaches for text classification leverage large pre-trained models like the one of Devlin et al. [1], and show promising results for the detection of fake news [4]. However, for fake-news identification tasks, approaches that make use of n-grams and Latent Semantic Analysis [2] have also provided successful solutions (see Koloski et al. [5]). Further enrichment of text representations with taxonomies and knowledge graphs [19] promises improvements in performance.

Data description
In this paper we present a solution to a subset of the fake-news detection problem: the identification of COVID-19 related fake news [11,10]. The dataset consists of social media posts in English collected from Facebook, Twitter and Instagram, and the task is to determine for a given post whether it is real or fake in relation to COVID-19. The provided dataset is split into three parts: train, validation and test data. The distribution of data in each of the data sets is shown in Table 1.

Proposed method
The proposed method consists of multiple submethods that aim to tackle different aspects of the problem. On one side we focus on learning from the hand-crafted features of authors, and on the other we focus on learning the representation of the problem space with different methods.

Hand crafted features
Word based. The word based features consisted of the maximum and minimum word length in a tweet, the average word length, and the standard deviation of the word length in a tweet. Additionally, we counted the number of words beginning with an upper case letter and the number of words beginning with a lower case letter.
Char based. The character based features consisted of the counts of digits, letters, spaces, punctuation, hashtags and each vowel, respectively.
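The two feature groups above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the feature names, whitespace tokenization, and the use of the population standard deviation are assumptions made here for the example.

```python
import statistics
import string

VOWELS = "aeiou"

def word_features(text):
    """Word-level statistics of a single post (whitespace tokenization is an assumption)."""
    words = text.split()
    lengths = [len(w) for w in words] or [0]
    return {
        "max_word_len": max(lengths),
        "min_word_len": min(lengths),
        "mean_word_len": statistics.mean(lengths),
        # Population standard deviation; the source does not say which variant was used.
        "std_word_len": statistics.pstdev(lengths),
        "n_upper_initial": sum(w[0].isupper() for w in words),
        "n_lower_initial": sum(w[0].islower() for w in words),
    }

def char_features(text):
    """Character-level counts of a single post."""
    feats = {
        "n_digits": sum(c.isdigit() for c in text),
        "n_letters": sum(c.isalpha() for c in text),
        "n_spaces": sum(c.isspace() for c in text),
        # Note: '#' is part of string.punctuation, so it is counted here as well.
        "n_punct": sum(c in string.punctuation for c in text),
        "n_hashtags": text.count("#"),
    }
    lowered = text.lower()
    for vowel in VOWELS:
        feats[f"n_{vowel}"] = lowered.count(vowel)
    return feats

def handcrafted_features(text):
    """Concatenate both feature groups into one dictionary."""
    return {**word_features(text), **char_features(text)}

example = handcrafted_features("Stay home! #COVID19 cases rise in 3 states")
```

Each post is thus mapped to a fixed-length numeric description that a standard classifier can consume directly.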

Latent Semantic Analysis
Similarly to the solution of Koloski et al. [5] to the PAN2020 Fake News profiling task, we applied a low-dimensional space estimation technique. First we preprocessed the data by lower-casing the tweet content and removing the hashtags and punctuation. After that we removed the stopwords and obtained the final clean representation. From the cleaned text, we generated the POS-tags using the NLTK library [6]. For the feature construction we used the technique of Martinc et al. [8], which iteratively weights and chooses the best n-grams. We used two types of n-grams:
- Word based: n-grams of size 1 and 2
- Character based: n-grams of size 1, 2 and 3
We generated n features, with n/2 of them being word n-grams and n/2 character n-grams. We calculated TF-IDF on them and performed SVD [3]. With this last step we obtained the LSA representation of the tweets.
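The TF-IDF and SVD steps can be sketched with scikit-learn as follows. This is a simplified illustration under stated assumptions: the iterative n-gram weighting of Martinc et al. [8] is replaced by the simpler frequency-based `max_features` cut-off, hashtag removal is interpreted as stripping the "#" symbol, the POS-tagging step is omitted, and the helper names `clean` and `build_lsa` as well as the toy tweets are hypothetical.

```python
import re

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline

def clean(text):
    """Lower-case the text and strip punctuation (including the '#' symbol)."""
    return re.sub(r"[^\w\s]", "", text.lower())

def build_lsa(n_features=1000, n_components=2):
    """TF-IDF over word and character n-grams, followed by truncated SVD."""
    # Half of the feature budget goes to word n-grams, half to character n-grams.
    word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                 stop_words="english",
                                 max_features=n_features // 2)
    char_tfidf = TfidfVectorizer(analyzer="char", ngram_range=(1, 3),
                                 max_features=n_features // 2)
    union = FeatureUnion([("word", word_tfidf), ("char", char_tfidf)])
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    return make_pipeline(union, svd)

# Toy posts, invented for this example only.
tweets = [
    "Total confirmed cases reported today: 1200",
    "New tests and deaths reported in three states",
    "This video shows a miracle cure for the coronavirus!",
    "President claims the #vaccine is already available",
    "Number of new cases rises as testing expands",
]

lsa = build_lsa(n_features=200, n_components=2)
embedded = lsa.fit_transform([clean(t) for t in tweets])  # shape: (n_tweets, 2)
```

In a realistic setting both the feature budget n and the number of SVD components would be considerably larger and tuned on held-out data.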

Contextual features
We explored two different contextual feature embedding methods that rely on the transformer architecture. The first method uses already pretrained sentence transformers and embeds the texts in an unsupervised manner. The second method uses DistilBERT, which we fine-tune to our specific task.
Sentence transformers. For fast document embedding we used three different contextual embedding methods from the sentence-transformers library [14]. First, we applied the same preprocessing as shown in Figure 1, excluding only the POS tagging step. We then embedded every tweet with a given model to obtain its vector representation. On each representation we trained a Stochastic Gradient Descent (SGD) based learner, penalizing both the "linear" and "hinge" loss parameters. The parameters were optimized with a grid search, using 10-fold cross-validation for every tuple of parameters.
DistilBERT is a distilled version of BERT that retains best practices for training BERT models [15]. It is trained on a concatenation of English Wikipedia and the Toronto Book Corpus. To produce even better results, we fine-tuned the model on the train data provided by the organizers. Since BERT has its own text tokenizer and is not compatible with other tokenizers, we used that tokenizer to prepare the data for training and classification.
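The grid search over an SGD-based learner with 10-fold cross-validation can be sketched with scikit-learn as below. The embedding step is not reproduced: random Gaussian vectors stand in for the sentence embeddings, and the parameter grid (losses, penalties, and regularization strengths) is an assumption chosen for illustration, not the grid used in the experiments.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Stand-in for sentence embeddings: two Gaussian blobs, one per class.
X = np.vstack([
    rng.normal(-1.0, 1.0, size=(100, 16)),  # class 0 ("fake")
    rng.normal(1.0, 1.0, size=(100, 16)),   # class 1 ("real")
])
y = np.array([0] * 100 + [1] * 100)

# Hypothetical grid; every combination is scored with 10-fold cross-validation.
param_grid = {
    "loss": ["hinge", "modified_huber"],
    "penalty": ["l2", "elasticnet"],
    "alpha": [1e-4, 1e-3, 1e-2],
}

search = GridSearchCV(
    SGDClassifier(random_state=0, max_iter=1000),
    param_grid,
    cv=10,
    scoring="f1",
)
search.fit(X, y)

best_params = search.best_params_  # best tuple of parameters
best_f1 = search.best_score_       # mean F1 over the 10 folds
```

The same search can be repeated independently for each embedding model, so that every representation receives its own tuned linear classifier.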

tax2vec features
tax2vec [19] is a data enrichment approach that constructs semantic features useful for learning. It leverages background knowledge in the form of a taxonomy or knowledge graph and incorporates it into textual data. We added semantic features generated using one of the two approaches described below to the top 10000 word features according to the TF-IDF measure. We then trained a number of classifiers on this set of enriched features (Gradient boosting, Random forest, Logistic regression and Stochastic gradient descent) and chose the best one according to the F1-score calculated on the validation set.
Taxonomy based (tax2vec). Words from documents are mapped to terms of the WordNet taxonomy [13], creating a document-specific taxonomy, after which a term-weighting scheme is used for feature construction. Next, a feature selection approach is used to reduce the number of features.
Knowledge Graph based (tax2vec(kg)). Nouns in sentences are extracted with SpaCy and generalized using the Microsoft Concept Graph [9] via the "is a" relation. A feature selection approach is used to reduce the number of features.
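The idea of generalizing words through "is a" links and selecting a few of the resulting terms as features can be illustrated with a toy example. This sketch does not use the tax2vec library, WordNet, or the Microsoft Concept Graph: the taxonomy `TOY_TAXONOMY`, the raw-count weighting, and the document-frequency selection rule are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent), standing in for "is a" links.
TOY_TAXONOMY = {
    "vaccine": "medicine",
    "hydroxychloroquine": "medicine",
    "medicine": "substance",
    "tweet": "message",
    "video": "message",
}

def ancestors(word, taxonomy):
    """Follow "is a" links upwards and return all more general terms."""
    found = []
    while word in taxonomy:
        word = taxonomy[word]
        found.append(word)
    return found

def semantic_features(document, taxonomy):
    """Count how often each generalized term is reached from the document's words."""
    counts = Counter()
    for token in document.lower().split():
        counts.update(ancestors(token, taxonomy))
    return counts

def select_top(documents, taxonomy, k):
    """Keep the k generalized terms that occur in the most documents (a simple stand-in for feature selection)."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(sorted(set(semantic_features(doc, taxonomy))))
    return [term for term, _ in doc_freq.most_common(k)]

docs = ["new vaccine video", "hydroxychloroquine tweet", "vaccine trial results"]
selected = select_top(docs, TOY_TAXONOMY, k=2)
# One row per document, one column per selected semantic feature.
enriched = [[semantic_features(d, TOY_TAXONOMY)[t] for t in selected] for d in docs]
```

Columns such as these would then be appended to the TF-IDF word features before training the classifiers.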

Meta models
From the base models listed in Section 4 we constructed two additional meta-models by combining the previously discussed models.

Neural stacking
In this approach we learn a dense representation with a 5-layer deep neural network. For the inputs we use the following representations:

Linear stacking
The second approach to meta-learning used the predictions of simpler models as the input space. We tried two separate methods:
Final predictions. We considered the predictions from the LSA, DistilBERT, dbert, xlm, roberta and tax2vec models as the input. From the models' outputs we learned a Stochastic Gradient Descent based classifier with 10-fold CV. The learning configuration is shown in Figure 3.
Decision function-based prediction. In this approach we took the value of the given classifier's decision function as the input to the stacking vector. For the SVM based SGD we used the decision function and for the Logistic Regression we used the Sigmoid activation. The proposed architecture is similar to the architecture in Figure 3, with prediction values replaced by decision function values.
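Both stacking variants can be sketched with scikit-learn on synthetic data. The two base representations `X_a` and `X_b` below are random stand-ins for any two of the base models, and the choice of exactly two base learners is an assumption made to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
y = np.array([0] * (n // 2) + [1] * (n // 2))

# Two hypothetical representations of the same posts (stand-ins for base feature spaces).
X_a = rng.normal(0.0, 1.0, size=(n, 8)) + y[:, None]
X_b = rng.normal(0.0, 1.0, size=(n, 8)) + 0.5 * y[:, None]

# Base learners: an SVM-like SGD classifier (hinge loss) and a logistic regression.
svm_like = SGDClassifier(loss="hinge", random_state=0).fit(X_a, y)
logreg = LogisticRegression().fit(X_b, y)

def sigmoid(z):
    """Logistic function mapping a raw score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Variant 1: stack the final (hard) predictions of the base models.
Z_pred = np.column_stack([svm_like.predict(X_a), logreg.predict(X_b)])

# Variant 2: stack decision values; the logistic regression score is
# passed through the sigmoid, the SVM-like score is used as-is.
Z_dec = np.column_stack([
    svm_like.decision_function(X_a),
    sigmoid(logreg.decision_function(X_b)),
])

# Meta-learner evaluated with 10-fold cross-validation on the stacked inputs.
meta = SGDClassifier(random_state=0)
scores = cross_val_score(meta, Z_dec, y, cv=10, scoring="f1")
```

For brevity the base models here are fitted and applied on the same data. In practice, the inputs to a meta-learner are usually produced out-of-fold (for instance with `cross_val_predict`) so that it does not learn from leaked training labels.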

Experiments and results
This section describes model parameters, our experiments and the results of experiments as well as the results of the final submission.
We conducted the experiments in two phases. The experiment phases were aligned with the competition phases and were defined as the TDT phase and the CV phase. In the TDT phase the train and validation data are split into three subsets, while in the CV phase all data is concatenated and evaluated on 10 folds.

Train-development-test (TDT) split
In the first phase, we concatenated the train and the validation data and split it into three subsets: train (75%), dev (18.75%) and test (6.25%). On the train split we learned the classifier, which we validated on the dev set by measuring the F1-score. The best performing model on the dev set was finally evaluated on the test set. Achieved performance is presented in Table 3 and the best performances are shown in Figure 4. DistilBERT comes out on top in F1-score evaluation on all data sets in the TDT data split (to the extent that we feared overfitting on the train data), while hand-crafted features did not prove to be successful. Taxonomy based tax2vec feature construction trails DistilBERT's score, but using a knowledge graph to generalize the constructed features seemed to decrease performance significantly (tax2vec(kg)). Other methods scored well, giving us plenty of reasonably good approaches to consider for the CV phase.
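The reported proportions are consistent with two successive 75/25 cuts, since 0.25 × 0.75 = 0.1875 and 0.25 × 0.25 = 0.0625. Whether the split was actually produced this way is an assumption. The sketch below only illustrates the arithmetic, and the function name `tdt_split` is hypothetical.

```python
import random

def tdt_split(items, seed=0):
    """Split items into train/dev/test via two successive 75/25 cuts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut_train = int(0.75 * len(items))   # 75% of all data -> train
    rest = items[cut_train:]             # remaining 25%
    cut_dev = int(0.75 * len(rest))      # 75% of the remainder -> dev (18.75% overall)
    return items[:cut_train], rest[:cut_dev], rest[cut_dev:]  # last part -> test (6.25%)

train, dev, test = tdt_split(range(1600))
fractions = [len(part) / 1600 for part in (train, dev, test)]
```

For 1600 items this yields 1200, 300 and 100 instances, that is 75%, 18.75% and 6.25% of the data.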

CV split
In the second phase, the CV phase, we concatenated the data provided by the organizers and trained models with 10-fold cross-validation. The evaluation of the best-performing models is presented in Table 4.
During cross-validation, LSA performed consistently well. The tax2vec methods performed comparably, and this time the two variants scored very similarly.

Evaluating word features
To better understand the dataset and trained models we evaluated word features with different metrics to pinpoint features with the highest contribution to classification or highest variance.
Features with the highest variance. We evaluated word features within the train dataset based on variance in the fake and real classes and found the following features to have the highest variance:
"Real" class: number, total, new, tests, deaths, states, confirmed, cases, reported
SHAP extracted features. After training the models we also used Shapley Additive Explanations [7] to extract the most important word features for classification into each class. The following are results for the gradient boosting model:
Generalized features. We then used WordNet with a generalizing approach called ReEx (Reasoning with Explanations) to generalize the terms via the "is a" relation into the following terms:
"Fake" class: visual communication, act, matter, relation, measure, hypertext transfer protocol, attribute
"Real" class: physical entity, message, raise, psychological feature

Results
Results of the final submissions are shown in Table 5. DistilBERT appears to have overfitted the train data, on which it achieved very high F1-scores, but failed to perform well on the test data in the final submission. Our stacking method also failed to achieve high results in the final submission, being prone to predict "fake" news, as can be seen in Figure 5. On the other hand, the taxonomy based tax2vec data enrichment method and the LSA model both showed good results in the final submission, while our best-performing model used stacking, where we merged different neural and non-neural feature sets into a novel representation. With this merged model, we achieved an F1-score of 0.972 and ranked 50th out of 168 submissions.
In Figure 5 we present the confusion matrices of the models evaluated in the final submissions.

Conclusion and further work
In our attempt to tackle the problem of fake-news detection, we exploited different approaches and techniques. We constructed hand-crafted features that captured the statistical distribution of words and characters across the tweets. From the collection of both character and word n-grams found in the tweets we learned a latent space representation, potentially capturing relevant patterns. By employing multiple BERT-based representations we captured the contextual information and the differences between fake and real COVID-19 news. However, such learning showed that, even though it can have excellent results for other tasks, for tasks such as the classification of short texts it falls behind some more sophisticated methods. To overcome such pitfalls we constructed two different meta models, learned from the decisions of simpler models. The second model learned a new space from the document space representations of the simpler models by embedding them via a 5-layer neural network. This new space resulted in a very convincing representation of the problem space, achieving an F1-score of 0.9720 on the final (hidden) test set.
As further work, we suggest improving our methods by including background knowledge in the representations in order to obtain representations in which instances are more easily separable. We propose exploring the possibility of adding model interpretability with an attention based mechanism. Finally, as another add-on, we would like to explore how the interactions in the networks of fake news affect our proposed model representation.