Synthetic Misinformers: Generating and Combating Multimodal Misinformation

With the expansion of social media and the increasing dissemination of multimedia content, the spread of misinformation has become a major concern. This necessitates effective strategies for multimodal misinformation detection (MMD) that detect whether the combination of an image and its accompanying text could mislead or misinform. Due to the data-intensive nature of deep neural networks and the labor-intensive process of manual annotation, researchers have been exploring various methods for automatically generating synthetic multimodal misinformation, which we refer to as Synthetic Misinformers, in order to train MMD models. However, limited evaluation on real-world misinformation and a lack of comparisons with other Synthetic Misinformers make it difficult to assess progress in the field. To address this, we perform a comparative study on existing and new Synthetic Misinformers that involves (1) out-of-context (OOC) image-caption pairs, (2) cross-modal named entity inconsistency (NEI), and (3) hybrid approaches, and we evaluate them against real-world misinformation using the COSMOS benchmark. The comparative study showed that our proposed CLIP-based Named Entity Swapping can lead to MMD models that surpass other OOC and NEI Misinformers in terms of multimodal accuracy, and that hybrid approaches can lead to even higher detection accuracy. Nevertheless, after alleviating information leakage from the COSMOS evaluation protocol, low Sensitivity scores indicate that the task is significantly more challenging than previous studies suggested. Finally, our findings showed that NEI-based Synthetic Misinformers tend to suffer from a unimodal bias, where text-only MMD models can outperform multimodal ones.


Introduction
The proliferation of misinformation is a growing challenge in today's society, especially with the widespread use of social media and the Internet. Consequently, the automatic detection of misinformation has become an important challenge, with researchers exploring various methods for identifying false claims through natural language processing [1] and detecting manipulated images, such as DeepFakes, through computer vision techniques [2]. The aforementioned challenges primarily focus on individual modalities, necessitating the use of unimodal detection models. However, multimedia content has been shown to be more attention-grabbing and widely disseminated than plain text content [3]. Furthermore, the presence of an image can make a false statement more convincing to individuals [4], emphasizing the importance of multimodal misinformation detection (MMD).
MMD models, as seen in Fig. 1, are trained to identify whether an image and its accompanying caption in combination are accurate (truthful) or misleading (leading to misinformation). The image at the top of Fig. 1 shows a music festival with littered grounds, while the accompanying caption claims that the event occurred in June 2022, after a speech by environmentalist Greta Thunberg. In reality, even though the image depicts a Glastonbury festival, it was not taken in June 2022 after Greta Thunberg's speech, but rather in 2015, years before Greta Thunberg became a public figure. This case involves the manipulation of named entities (person and date) in order to frame a target public figure and her associated audience under a negative spotlight. The bottom image illustrates a railway bridge that collapsed into a body of water, with the caption claiming that the event took place in Kursk, Russia during the 2022 Russia-Ukraine war. A bridge was indeed damaged that day in Kursk; however, this image was actually taken in 2020 in Murmansk, Russia. This case illustrates the use of an out-of-context image, either by mistake or with the intention to exaggerate or downplay the severity of the event. From the above examples, it is evident that multimodal misinformation takes diverse forms, can be disseminated for various motives, and entails subtle cues that are challenging to discern.
Researchers have been exploring training deep neural networks for MMD [5]. Considering the data-intensive nature of training deep neural networks for MMD as well as the time-consuming and labor-intensive nature of manual annotation, researchers have been investigating methods for automatically generating synthetic multimodal misinformation, which we refer to as Synthetic Misinformers. These methods include the generation of out-of-context (OOC) image-text pairs or the creation of cross-modal named entity inconsistencies (NEI). OOC involves pairing an image with an incongruous caption [6] while NEI involves manipulating the named entities in otherwise truthful captions [7]. Previous works have relied on random sampling [8,6] or feature-informed sampling methods [9,10] for generating OOC, and on in-cluster random sampling [11] or rule-based random sampling [7] for generating NEI. Examples of generated OOC and NEI misinformation can be seen in Fig. 2.
Nevertheless, despite the efforts of previous studies, those have been limited by evaluating their methods on test sets generated by their own Synthetic Misinformers instead of real-world multimodal misinformation, the only exception being the work presenting the COSMOS dataset [6]. However, the authors of the latter work made use of a problematic evaluation protocol that suffers from information leakage, as seen in Fig. 3 (see Section 3.1.4 for further details). Additionally, prior studies did not compare their methods with other Synthetic Misinformers. The lack of comparison and of evaluation on real-world data hinders the ability of the research community to assess the progress made and determine the current state-of-the-art on MMD. To this end, we replicate multiple Synthetic Misinformers, fine-tune our Transformer-based MMD model (termed DT-Transformer) on the generated data and finally compare them on the COSMOS benchmark [6] that encompasses real-world multimodal misinformation. In our comparative study, we examine whether OOC or NEI are more representative of real-world misinformation. Our hypothesis is that, despite prior research treating them as separate tasks, both OOC and NEI are crucial components of effective MMD. For this reason, we also investigate a range of hybrid Synthetic Misinformers.
The main contributions of our work can be summarised as follows:
• We perform the first comparative study on Synthetic Misinformers, which offers a comprehensive evaluation of the current state-of-the-art on MMD and can provide guidance for future research in the field.
• We introduce "CLIP-based Named Entity Swapping", which demonstrates the highest multimodal accuracy among OOC and NEI approaches. Additionally, we demonstrate that hybrid Synthetic Misinformers can further enhance detection accuracy.
• Our findings highlight that although NEI-based methods may outperform OOC methods, they tend to suffer from a unimodal bias, where text-only models can outperform multimodal ones. Moreover, low Sensitivity scores, or Hit Rate for 'Falsified' pairs, indicate that the problem is significantly more challenging than previous works suggested. We offer recommendations on how future studies may address these challenges.

Related Work
Studies on multimodal misinformation detection (MMD) have focused on out-of-context image-language pairs (OOC) or cross-modal named entity inconsistencies (NEI). One common form of multimodal misinformation involves decontextualization: a legitimate image being paired with an out-of-context caption, creating a deceptive impression. Consequently, researchers have used random sampling [8,6] and feature-informed sampling methods [9,10] for generating OOCs. The MAIM dataset was created by randomly sampling among image-caption pairs collected from Flickr [8]. The authors developed a joint embedding with the use of deep representation learning and then calculated the image-caption consistency. Similarly, Aneja et al. [6] created the COSMOS training dataset by collecting truthful image-caption pairs from credible news websites and then matching captions with random images to create OOCs. The authors utilized self-supervised deep learning and evaluated their method on the COSMOS benchmark, consisting of real-world multimodal misinformation. However, random sampling cannot ensure that the image-caption pair will bear any relation and tends to generate easy negative samples that do not resemble realistic multimodal misinformation capable of deceiving humans. To this end, Luo et al. [9] created the NewsCLIPings datasets by utilizing the large cross-modal CLIP model [13] along with scene-learning and person-matching models in order to generate hard negative samples. Similarly, the Twitter-COMMs dataset was created by combining and applying CLIP-based sampling (to generate hard negatives) and in-topic random sampling (to resolve class imbalance) on data collected from Twitter, related to three topics: climate, COVID, and military vehicles [10].
On the other hand, NEI involves legitimate images being accompanied by a manipulated caption whose named entities (person, location, event) do not match the content or the context of the image. The "Multimodal Entity Image Re-purposing" (MEIR) dataset was created by clustering image-caption pairs based on 'relatedness' (location proximity, text similarity and image similarity) by using GPS coordinates, word2vec and VGG19 pre-trained on ImageNet, respectively. The authors then randomly swapped named entities of the same type between the current caption and another caption taken from the same cluster [11]. Similarly, the TamperedNews dataset was created by randomly replacing named entities with ones of the same type, given that the replaced person is of the same gender and/or country, locations are within high geographical proximity, and events belong to the same category (e.g. sport competitions or natural disasters) [7].
The above works either provide internal ablation [11,7,9,10] or comparison with simple baselines [11] and do not compare their methods with other Synthetic Misinformers. Moreover, studies that used some of the above datasets have mostly focused on incremental methodological improvements [14], such as the integration of sentiment [15] or evidence [16] in MMD models. With the exception of [6], prior works have not evaluated their methods on real-world misinformation but on data generated by their own Synthetic Misinformer. Therefore, there is no clear way for the research community to assess progress in the field, including the current state-of-the-art on MMD, which is the best way to generate training data for MMD, and whether OOC or NEI (or both) are better representatives of real-world misinformation. To address this gap, we perform a comparative study on various Synthetic Misinformers, covering OOC, NEI and hybrid methods, and evaluate them on real-world multimodal misinformation.
Finally, Luo et al. [9] argued that methods utilizing named entity manipulations may introduce linguistic biases. To investigate this, the authors trained a text-only BERT [17] model and achieved similar results to the multimodal models used in [7]. However, the latter extracted visual features from an ImageNet pre-trained ResNet and textual features from off-the-shelf fastText [18]. At the time of writing (2021 [9]), BERT-like models were considered among the state-of-the-art for text-based tasks while fastText was an older architecture, rendering the training protocols significantly different and thus not directly comparable. Therefore, it is not possible to conclude definitively about the existence of unimodal bias based solely on these results. To address this gap, we re-examine whether NEI methods suffer from unimodal bias, within a controlled training framework and an evaluation on real-world misinformation.


Problem Formulation
In this study we compare numerous Synthetic Misinformers, i.e. methods for generating synthetic multimodal misinformation, and evaluate them on real-world multimodal misinformation. The problem is defined as a binary classification task, where an (I, C) image-caption pair is either truthful or falsified. Truthful captions are collected from credible news sources while falsified ones are produced by a Synthetic Misinformer. Each (I, C) is encoded by a visual encoder E_V(·) and a textual encoder E_T(·) that produce the corresponding vector representations v_I ∈ R^(e×1) and t_C ∈ R^(e×1) for the image I and caption C respectively, where e is the encoder's embedding dimension. The extracted features are concatenated and passed through the multimodal detection deep neural network D(·), referred to as the Detector, whose parameters have to be optimized and its hyper-parameters tuned. Finally, the predictions of the trained Detector are evaluated against a test set consisting of real-world multimodal misinformation. In order to accurately and fairly compare various Misinformers, they need to share a common training and evaluation framework. Therefore, the (1) Encoder, (2) Detector, (3) optimization and hyper-parameter tuning and (4) evaluation process should remain constant during the comparative study while only the Misinformer method changes. This framework ensures that any change in performance will be the result of the Misinformer and not other factors. A high-level illustration of the proposed workflow can be seen in Fig. 4. In this section, we address each aspect individually.

Encoder
After generating a training dataset with a Synthetic Misinformer, we use an Encoder for extracting visual and textual features from all images and captions, which will be used to train the Detector. The first works on MMD mostly relied on convolutional neural networks pre-trained on ImageNet to extract features from images (namely VGG-19 [8,11] and ResNet50 [7,6]) and word embeddings to extract features from captions (namely word2vec [8,11] and fastText [7]). More recent approaches have utilized large-scale multimodal and cross-modal models, namely CLIP [9,10,19], VisualBERT [9] and VinVL [19], to extract both visual and textual features. In the aforementioned works, CLIP [13] tended to outperform other cross-modal methods (VinVL and VisualBERT) for MMD [9,10,19].
Contrastive language-image pre-training, or CLIP in short, is a cross-modal model trained to match the most relevant text to an image. Developed by Radford et al. [13] and trained on a large-scale dataset of approximately 4×10^8 image-text pairs, CLIP has proven to have powerful zero-shot capabilities, meaning that it performs well on tasks and domains that it was not explicitly trained for. In our study, we first perform an experiment using CLIP ViT-B/32 on the NewsCLIPings datasets, in order to compare our training pipeline with [9], but we also utilize the updated and improved CLIP ViT-L/14 version in the comparative study. CLIP ViT-B/32 produces an embedding vector of size e = 512 while ViT-L/14 produces e = 768. We use CLIP off-the-shelf and do not fine-tune it further due to computational resource constraints. Luo et al. [9] experimented with fine-tuning the whole or the top layers of CLIP-ResNet-50, but their results were mixed; fine-tuning could not consistently outperform the "frozen", off-the-shelf CLIP in all cases. Methods have been proposed for robustly fine-tuning large-scale cross-modal neural networks [20], but they are outside the scope of this study, since we do not attempt to reach the highest possible performance but primarily focus on providing a fair comparative study of various Synthetic Misinformers.
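At its core, CLIP scores an image-text pair by the cosine similarity of their embeddings, and the most similar caption wins. A minimal sketch of this matching step, using random stand-in vectors in place of actual CLIP ViT-L/14 features (all names here are illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, the score CLIP uses to match captions to images."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
e = 768  # embedding dimension of CLIP ViT-L/14
v_image = rng.standard_normal(e)          # stand-in for the image embedding E_V(I)
t_captions = rng.standard_normal((3, e))  # stand-ins for E_T(C) of 3 candidate captions

# Rank candidate captions by similarity to the image, as CLIP does zero-shot.
scores = [cosine_similarity(v_image, t) for t in t_captions]
best = int(np.argmax(scores))
```

In practice the vectors come from the frozen CLIP encoders; only the ranking step shown here is performed per pair.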

Detector
Previous works either calculate a cross-modal similarity score [8,7,6,14] or define a binary classifier on top [9,15,11,10]. A few works have also added fully-connected layers to analyse the extracted features before the final classification layer [15,11]. Instead, we consider the Transformer architecture [21] to be an even more appropriate choice for the Detector.
Our Transformer-based Detector (DT-Transformer in short) first concatenates the encoded captions and images and passes them through a Transformer architecture consisting of L layers that have h attention heads of embedding dimension d. Its output is then passed through a normalization layer, a dropout layer, a fully connected layer which is activated by the GELU function, a second dropout layer and a final binary classification layer.
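A minimal PyTorch sketch of the described architecture; the projection of the CLIP features down to dimension d and the mean-pooling over the two input tokens are our assumptions, as the text does not fix these details:

```python
import torch
import torch.nn as nn

class DTTransformer(nn.Module):
    """Sketch of the DT-Transformer Detector. Layer sizes follow one point of
    the paper's hyper-parameter grid; pooling and projection are assumptions."""

    def __init__(self, e: int = 768, d: int = 128, h: int = 2, L: int = 1, p: float = 0.1):
        super().__init__()
        self.proj = nn.Linear(e, d)  # project CLIP features to transformer dim d (assumption)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=h, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=L)
        self.norm = nn.LayerNorm(d)
        self.drop1 = nn.Dropout(p)
        self.fc = nn.Linear(d, d)
        self.act = nn.GELU()
        self.drop2 = nn.Dropout(p)
        self.cls = nn.Linear(d, 1)  # binary: truthful vs falsified

    def forward(self, v_img: torch.Tensor, t_cap: torch.Tensor) -> torch.Tensor:
        x = torch.stack([v_img, t_cap], dim=1)       # (B, 2, e): image and caption tokens
        x = self.encoder(self.proj(x)).mean(dim=1)   # (B, d), mean-pooled (assumption)
        x = self.drop1(self.norm(x))
        x = self.drop2(self.act(self.fc(x)))
        return self.cls(x).squeeze(-1)               # binary logits, shape (B,)
```

The two CLIP embeddings are treated as a two-token sequence so the attention layers can relate image and caption before classification.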

Optimization
Given that we define MMD as a binary classification task, the Detector is optimized based on the binary cross-entropy loss function. Due to differences in distribution, scale, complexity and other factors, we assume that the Detector may require different hyper-parameters to perform optimally with different training datasets. To this end, we tune the Detector's hyper-parameters based on the following grid search: L ∈ {1, 4} transformer layers of d ∈ {128, 1024} dimensions, h ∈ {2, 8} attention heads, and a learning rate of lr ∈ {1e-4, 5e-5}. The dropout rate is kept constant at 0.1 and the batch size at 512. The selected hyper-parameter grid amounts to 16 experiments for each Synthetic Misinformer dataset. This is clearly not exhaustive, but adding any more options would exponentially increase the required time and computational resources. Instead, our aim is to give each method the chance to reach an adequate and representative performance, even if it is not the global optimum that would be possible through exhaustive optimization. The Detector is optimized by the Adam optimizer for a maximum of 30 epochs with early stopping at 10 epochs. At the end, we retrieve the checkpoint with the highest validation accuracy and use it for the final evaluation on the test set.
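The grid above expands to 2×2×2×2 = 16 configurations per dataset; a short sketch of how the combinations can be enumerated (the dictionary layout is illustrative):

```python
from itertools import product

# Hyper-parameter grid from the paper: 16 configurations per dataset.
grid = {
    "L": [1, 4],         # transformer layers
    "d": [128, 1024],    # embedding dimension
    "h": [2, 8],         # attention heads
    "lr": [1e-4, 5e-5],  # learning rate
}

# One dict per configuration, e.g. {"L": 1, "d": 128, "h": 2, "lr": 1e-4}.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
assert len(configs) == 16
```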

Evaluation protocol
With the exception of Aneja et al. [6], previous works on Synthetic Misinformers have not evaluated their methods on real-world misinformation. Instead, they first propose a method for generating multimodal misinformation which they apply on a body of truthful image-caption pairs. After generating the dataset, the authors split it into training, validation and test sets and report the best performance on the test set. This only tells us how a model trained on a synthetic dataset will perform on a dataset generated by the same process, not how accurately it could potentially detect misinformation "in the wild". Moreover, prior works have not provided direct comparisons with other Synthetic Misinformers and therefore we cannot assess progress or the current state-of-the-art in the field. To the best of our knowledge, COSMOS [6] is the only publicly available, manually annotated benchmark for MMD. The COSMOS benchmark, also used in the MMSys'21 "Grand Challenge on Detecting Cheapfakes" [22], consists of 850 image-caption pairs from the fact-checking website SNOPES and 850 truthful image-caption pairs from credible news sources. Therefore, we will use the COSMOS evaluation set for our study.
Nevertheless, it is important to highlight certain problematic aspects of the evaluation protocol used in [6]. During evaluation, the authors provide a triplet of an image and two captions (I, C1, C2) and make a threshold-based decision by examining C1-C2 similarity and their overlap with the objects in the image. First, this protocol does not reflect how we encounter real-world misinformation, where an image is usually accompanied by a single caption or a small paragraph (e.g. on Twitter). More importantly, C2 in falsified instances is either an explanation of why C1 is false or the truthful caption for the image. In two examples taken from the COSMOS benchmark and shown in Fig. 3, C2 reads "Toronto Raptors just unveiled Black Lives Matter Buses and fans are impressed" and "Photograph showing a 'trampoline bridge' in Paris is a concept design for an architecture competition". This is a clear case of information leakage and does not reflect how we encounter misinformation in the real world. Fact-checkers are not usually presented with two separate bodies of text and asked to decide which is the correct one. Instead, they are usually presented with an image and a single body of text and have to determine whether the text is truthful and whether it accurately matches and describes the image. It is important to note that these are not outlier cases. We have manually examined hundreds of triplets from the COSMOS benchmark and Caption C2 consistently suffers from the same problem. Therefore, we do not consider the 88% detection accuracy reported by the authors to be representative [6]. For that reason, in this study, we only use the (I, C1) tuples from the COSMOS benchmark.

Synthetic Misinformers
We define three types of Synthetic Misinformers, namely methods that generate: (1) out-of-context (OOC) image-caption pairs, (2) cross-modal named entity inconsistency (NEI), where certain named entities in the caption are tampered with and do not correspond with the content of the image, and (3) hybrid approaches that combine both OOC and NEI misinformation. Table 1 displays the number of training samples produced by each Synthetic Misinformer (OOC or NEI) as well as the number of Truthful pairs.

Out-of-context misinformation
In order to create out-of-context (OOC) image-caption pairs, we first need a dataset of truthful pairs (I_a, C_a) and then a method for sampling an OOC image I_x or an OOC caption C_x. In this study we make use of the VisualNews dataset [12] that consists of 1,259,732 truthful (I_a, C_a) pairs collected from four credible sources (The Washington Post, USA Today, The Guardian and the BBC) covering 159 topics, namely: art and culture, world, law and crime, international relations, science and technology, sports, environment, elections and others. We use the VisualNews training set to generate training data and the VisualNews validation set to generate the validation data in order to avoid overlapping samples and information leakage. We experiment with the following OOC Synthetic Misinformer methods:
• Random sampling by caption (RS-C): for every actual (I_a, C_a) pair, sample a random caption C_x from the whole corpus. This process was used to generate the COSMOS training set [6], but we apply it on the VisualNews dataset instead.
• In-topic random sampling by caption (RSt-C): for every (I_a, C_a), sample a random caption C_x of the same topic as C_a (e.g. international politics, elections, environment, etc.). Using candidates from the same topic can increase the chance of relevance. A similar process was used in [10], but only as a means to mitigate class imbalance. We also define in-topic random sampling by image (RSt-I): sampling a random image I_x for an actual (I_a, C_a); and in-topic random sampling by alternating between image and caption (RSt-alt): choose at random whether to sample an I_x or a C_x.
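The in-topic alternating variant (RSt-alt) can be sketched as follows; the data layout and helper name are illustrative, not the paper's actual code:

```python
import random

def rst_alt(pairs, topics, seed=0):
    """RSt-alt sketch: for each truthful pair, randomly swap EITHER the image
    or the caption with one from another pair of the same topic.
    `pairs` is a list of (image_id, caption) tuples; `topics` maps index -> topic."""
    rng = random.Random(seed)
    by_topic = {}
    for i, t in topics.items():
        by_topic.setdefault(t, []).append(i)
    falsified = []
    for i, (img, cap) in enumerate(pairs):
        candidates = [j for j in by_topic[topics[i]] if j != i]  # same topic, different pair
        j = rng.choice(candidates)
        if rng.random() < 0.5:                 # swap the image: (I_x, C_a)
            falsified.append((pairs[j][0], cap))
        else:                                  # swap the caption: (I_a, C_x)
            falsified.append((img, pairs[j][1]))
    return falsified
```

Each truthful pair yields exactly one falsified pair, which keeps the generated dataset balanced.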

Cross-modal named entity inconsistency
In order to generate image-caption pairs that suffer from cross-modal named entity inconsistency, we need truthful (I_a, C_a) pairs and a method for sampling and swapping the named entities in C_a in order to create the falsified C_f. We use VisualNews and experiment with the following methods:
• In-topic random named entity swapping (R-NESt): for every (I_a, C_a), identify all entities in C_a and replace them with randomly sampled entities of the same type (person, location, organization, date, event, etc.) that belong to the same topic as C_a.
• The MEIR dataset as provided by the authors, but we extract features with CLIP ViT-L/14 instead of using VGG19 as in the original paper [11].
• We propose in-topic CLIP-based named entity swapping (CLIP-NESt) by image-image similarity (CLIP-NESt-I), caption-caption similarity (CLIP-NESt-C) or alternating between image-image and caption-caption similarity (CLIP-NESt-alt): for every (I_a, C_a) pair, identify the most similar (I_x, C_x) pair based on features extracted from CLIP and swap the entities of the same type between C_a and C_x in order to create C_f. C_x should have at least one named entity of the same type as C_a but with a different value (to avoid swapping a named entity with itself); otherwise we select the next most similar pair. As candidates we consider image-caption pairs from the same topic and use cosine similarity as the similarity metric. We use the SpaCy Named Entity Recognizer (NER), specifically the en_core_web_trf module, which exhibits a 0.90 F1-score for NER. Our rationale for proposing CLIP-NESt is that CLIP-based similarity will retrieve a semantically or thematically similar caption and, as a result, their entities are more likely to be related in some aspect. Therefore, swapping entities between similar captions will create more plausible misinformation than randomly sampled ones.
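A simplified sketch of the CLIP-NESt retrieval-and-swap loop, using stand-in embeddings and pre-extracted entity lists in place of actual CLIP ViT-L/14 features and spaCy NER output (function and variable names are illustrative):

```python
import numpy as np

def clip_nest(captions, entities, embeddings):
    """For each caption, find the most similar other caption (cosine similarity
    over the given embeddings) that has a same-type but different entity, and
    swap the entities to create the falsified caption.
    `entities[i]` is a list of (entity_type, entity_text) for caption i."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, -np.inf)  # never match a caption with itself
    falsified = []
    for i, cap in enumerate(captions):
        for j in np.argsort(sims[i])[::-1]:  # candidates, most similar first
            swap = next(((ei, ej) for etype, ei in entities[i]
                         for etype2, ej in entities[j]
                         if etype == etype2 and ei != ej), None)
            if swap:  # same-type, different entity found: perform the swap
                falsified.append(cap.replace(swap[0], swap[1]))
                break
    return falsified
```

In the actual method the embeddings are CLIP image or caption features (depending on the CLIP-NESt variant), candidates are restricted to the same topic, and entities come from en_core_web_trf.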

Hybrid methods
We also experiment with methods that combine both OOC and NEI misinformation, which we refer to as hybrid methods. We follow the same training process but, instead of binary classification, we train the Detector for multi-class classification with the use of the cross-entropy loss function. The Detector is trained to classify (I, C) pairs into three classes: Truthful, NEI or OOC. During evaluation on the COSMOS dataset, NEI and OOC predictions are mapped to Falsified pairs, since COSMOS is a binary dataset. In this study, we combine a few of the best performing methods: (1) R-NESt + CSt-alt, (2) R-NESt + NC/T-I and (3) CLIP-NESt-alt + CSt-alt. Our motivation for exploring hybrid methods is that real-world multimodal misinformation may not be adequately represented by OOC or NEI alone, but may require a combination of both. OOC and NEI methods produce balanced datasets, since they create one falsified pair for every truthful pair. On the other hand, some hybrid methods exhibit class imbalance; in such cases, we apply random down-sampling.
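The collapse of the Detector's three-class predictions onto the binary COSMOS labels can be sketched as follows (the class ordering is illustrative):

```python
# Hybrid training uses three classes; COSMOS evaluation is binary, so both
# NEI and OOC predictions count as Falsified.
CLASSES = ["truthful", "nei", "ooc"]

def to_binary(pred: int) -> str:
    """Map a three-class prediction index to the binary COSMOS label."""
    return "Truthful" if CLASSES[pred] == "truthful" else "Falsified"
```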

Experimental Results
Before starting with the comparative study, we wanted to examine whether our training pipeline, i.e. the choice of Encoder and Detector, was valid; whether it outperforms or at the very least competes with previous works. To this end, we compare our training pipeline with [9]. In Table 2 we observe that, while using the same Encoder (CLIP ViT-B/32), our DT-Transformer consistently outperforms [9] on all three NewsCLIPings datasets. Moreover, using the CLIP ViT-L/14 Encoder significantly surpasses ViT-B/32. We therefore proceeded with the comparative study using the DT-Transformer and the CLIP ViT-L/14 Encoder.
Table 3 shows the comparative study between various Synthetic Misinformers. Commencing with the OOC Misinformers, we observe that in-topic candidates can improve random sampling (+3% improvement), since image-caption pairs are more likely to be related when taken from the same topic (e.g. elections) than from completely random topics. Secondly, we observe that alternating between sampling (I_a, C_x) and (I_x, C_a) pairs improves the performance of RSt and CSt. Previous works on OOC Misinformers only sampled C_x captions. As a result, each image would appear twice in the dataset while a caption could appear once, twice or multiple times, depending on the sampling method. Even if minor, this process could lead to certain biases and imbalances. Similarly, we observe that alternating between image-image and caption-caption similarity can improve CLIP-NESt, presumably by generating more diverse image-caption pairs. Furthermore, our results show that feature-based negative sampling, including both CSt and NC, can surpass random negative sampling, due to its ability to generate hard negative OOC pairs, which more accurately reflect real-world misinformation. Finally, all multimodal OOC-based Synthetic Misinformers outperform their unimodal counterparts (image-only and text-only) and therefore do not suffer from a unimodal bias.
Shifting to NEI Misinformers, we see that both R-NESt and CLIP-NESt surpass MEIR [11]. The methodology could be a contributing factor, but we should also consider the difference in data scale, as MEIR consists of only 82,156 truthful pairs. Nevertheless, we also recognize certain problems and limitations. First, even the best performing methods have trouble accurately identifying the falsified pairs, scoring lower than 50% in terms of Falsified Hit Rate (Sensitivity) while having high Truthful Hit Rate scores (Specificity). This indicates that the task of multimodal misinformation detection is significantly more challenging than previous studies suggested, e.g. showcasing scores higher than 88% on the COSMOS dataset [6,14,15] while using a problematic evaluation protocol (discussed in Section 3.1.4). Finally, we found that unimodal text-only methods, such as CLIP-NESt-C, R-NESt and R-NESt + CSt-alt, can outperform their multimodal counterparts, with the latter scoring 59.1%, on a supposedly multimodal task. This suggests the existence of a unimodal bias in the dataset, which needs to be addressed in future studies.
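The Sensitivity and Specificity reported in Table 3 are the per-class hit rates; a minimal sketch, with 1 denoting Falsified and 0 Truthful:

```python
def hit_rates(y_true, y_pred):
    """Sensitivity = hit rate on Falsified (1); Specificity = hit rate on Truthful (0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    sensitivity = tp / max(sum(y_true), 1)                # correctly caught Falsified
    specificity = tn / max(len(y_true) - sum(y_true), 1)  # correctly kept Truthful
    return sensitivity, specificity
```

High Specificity with Sensitivity below 0.5, as observed here, means the Detectors accept truthful pairs reliably but miss more than half of the falsified ones.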

Conclusions
In this study we address the task of multimodal misinformation detection (MMD). More specifically, we examine and compare multiple methods that generate training data (Synthetic Misinformers) for MMD, producing either out-of-context image-text pairs (OOC) or cross-modal named entity inconsistencies (NEI). We perform a comparative study and evaluate all Synthetic Misinformers on the COSMOS benchmark, consisting of real-world multimodal misinformation. The comparative study illustrated that NEI methods tend, on average, to outperform OOC methods on the COSMOS benchmark. Moreover, our proposed CLIP-NESt-alt method reached the highest multimodal accuracy (56.9%) among NEI and OOC methods, having a 2.15% and 2.7% advantage over the next best performing method from each category respectively. Furthermore, we hypothesized that real-world misinformation is not solely captured by OOC or NEI instances separately but instead necessitates both. This is validated by the proposed hybrid approach (CLIP-NESt-alt + CSt-alt) achieving the highest multimodal accuracy (58.1%), showing a 2.47% improvement over CLIP-NESt-alt.
Nevertheless, low Sensitivity scores (Table 3) indicate that, under the corrected evaluation protocol, MMD is a significantly more challenging task than previous works suggested [6] and extensive further research is essential. Future studies could consider the integration of external evidence [16,23] or knowledge graphs [24], not only to improve detection accuracy but also to develop new Synthetic Misinformers that generate more realistic synthetic training data and, as a result, produce better Detectors. Moreover, experimentation with different modality fusion techniques can further improve performance [25,26]. Furthermore, our empirical results showed that NEI Misinformers tend to introduce a unimodal bias, leading to unimodal Detectors competing with or even outperforming multimodal ones. Named entity manipulations could create certain linguistic patterns, biases or shortcuts that render the visual information less important. Future studies could explore developing methods for generating de-biased NEI or learning strategies for reducing unimodal bias [27]. Moreover, task-specific modality fusion methods could potentially help mitigate this challenge [28]. Finally, the unimodal bias may not lie with the training process but with the evaluation dataset. The COSMOS benchmark was not collected with criteria in place to explicitly make it difficult for unimodal architectures. Future studies could explore and define relevant rules and criteria for collecting a more robust real-world MMD benchmark.

Figure 1: Multimodal misinformation detection (MMD) models attempt to identify whether an (Image, Caption) pair is truthful or misleading. The images and captions are taken from reuters.com.

Figure 2: Training data generated by two types of Synthetic Misinformers: one creates out-of-context (OOC) misinformation and the other produces cross-modal named entity inconsistencies (NEI). Given a truthful (I_a, C_a) image-caption pair, OOC samples an image I_x, creating (I_x, C_a), while NEI manipulates the named entities in C_a to create the falsified caption C_f. These examples were generated by CLIP-based sampling (CSt-alt) and CLIP-based named entity swapping (CLIP-NESt-alt) for OOC and NEI, respectively (see Section 3.2 for more details on these methods). The images and captions are taken from the VisualNews dataset [12].
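The named entity swapping shown in Figure 2 can be sketched roughly as follows: each named entity in a caption is replaced by a different entity of the same type, chosen by embedding similarity. The `ENTITY_BANK`, the toy entity types, and the random vectors standing in for CLIP text features are illustrative placeholders, not the authors' actual NER pipeline or retrieval procedure.

```python
import numpy as np

# Toy bank of named entities with types; a real pipeline would use an NER model.
ENTITY_BANK = {
    "PERSON": ["Angela Merkel", "Barack Obama", "Elon Musk"],
    "GPE": ["Paris", "Berlin", "Tokyo"],
}

rng = np.random.default_rng(0)
# Placeholder vectors standing in for CLIP text features of each entity.
EMB = {e: rng.standard_normal(8) for ents in ENTITY_BANK.values() for e in ents}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def swap_entity(caption, entity, etype):
    """Replace `entity` with the most similar *different* entity of the same type."""
    candidates = [e for e in ENTITY_BANK[etype] if e != entity]
    best = max(candidates, key=lambda e: cos(EMB[entity], EMB[e]))
    return caption.replace(entity, best)

falsified = swap_entity("Barack Obama visits Paris", "Barack Obama", "PERSON")
```

Applying the swap to a truthful caption yields a falsified caption C_f that is fluent but cross-modally inconsistent with the image.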

Figure 3: Examples of falsified (I, C1, C2) triplets from the COSMOS benchmark [6]. Caption C2 either provides the correct description for the image (left) or an explanation of why C1 is false (right). We consider C2 to be information leakage and exclude it from the evaluation protocol.
Similarly, we implement in-topic CLIP-based sampling by caption-to-caption similarity (CSt-C), by image-to-image similarity (CSt-I), or by alternating between image-image and caption-caption similarity (CSt-alt): we retrieve the most similar item (I_x or C_x) based on features extracted from CLIP. As candidates we consider (I_a, C_a) pairs from the same topic and use cosine similarity as the similarity metric.
• Finally, we experiment with three versions of the NewsCLIPings (NC) datasets, namely (1) NewsCLIPings Semantics / CLIP Text-Image (NC/T-I), (2) NewsCLIPings Semantics / CLIP Text-Text (NC/T-T) and (3) NewsCLIPings Merged / Balanced (NC/Bal), as provided by the authors [9].
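The in-topic CLIP-based sampling described above can be sketched as follows, with randomly generated vectors standing in for precomputed CLIP features and a toy topic assignment; the dimensions and data are illustrative, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs, dim = 6, 16
img_feats = rng.standard_normal((n_pairs, dim))  # stand-ins for CLIP image features
cap_feats = rng.standard_normal((n_pairs, dim))  # stand-ins for CLIP caption features
topics = [0, 0, 0, 1, 1, 1]                      # topic label of each (I_a, C_a) pair

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ooc_sample(anchor, feats):
    """Index of the most similar in-topic pair (excluding the anchor itself),
    comparing rows of `feats`: caption features for CSt-C, image features for CSt-I."""
    cand = [i for i in range(n_pairs) if topics[i] == topics[anchor] and i != anchor]
    return max(cand, key=lambda i: cos(feats[anchor], feats[i]))

# CSt-alt: alternate between caption-caption and image-image similarity per anchor.
sampled = [ooc_sample(a, cap_feats if a % 2 == 0 else img_feats) for a in range(n_pairs)]
# The OOC instance for anchor a keeps its caption but takes the sampled
# pair's image: (I_x, C_a).
```

Restricting candidates to the anchor's topic keeps the sampled image plausible for the caption, which makes the resulting OOC pairs harder than random mismatches.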

Figure 4: High-level overview of the proposed workflow. Truthful (I_a, C_a) image-caption pairs are manipulated by a Synthetic Misinformer, which generates Falsified pairs. Truthful and Falsified pairs are encoded by CLIP ViT-L/14 and used to train the Detector, which is optimized for binary classification when using one OOC or NEI dataset, or for multi-class classification when combining one OOC and one NEI method. Here, we showcase the hybrid Synthetic Misinformer CLIP-NESt-alt + CSt-alt.
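The Detector stage of the workflow can be approximated by a small Transformer encoder over the two CLIP feature vectors, treated as a 2-token sequence. This is a simplified stand-in: the layer count, pooling, and classification head here are assumptions, not the paper's exact DT-Transformer architecture; only the 768-dimensional input, matching CLIP ViT-L/14 features, is taken from the text.

```python
import torch
import torch.nn as nn

class DetectorSketch(nn.Module):
    """Toy stand-in for the Transformer-based Detector: image and caption CLIP
    features form a 2-token sequence that is encoded, mean-pooled, and
    classified (2 classes for Truthful/Falsified; more for hybrid training)."""
    def __init__(self, dim=768, num_classes=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, img_feat, cap_feat):
        tokens = torch.stack([img_feat, cap_feat], dim=1)  # (B, 2, dim)
        pooled = self.encoder(tokens).mean(dim=1)          # pool the 2 tokens
        return self.head(pooled)

model = DetectorSketch()
logits = model(torch.randn(4, 768), torch.randn(4, 768))  # a batch of 4 pairs
```

For the hybrid setting (e.g., CLIP-NESt-alt + CSt-alt), `num_classes` would be raised so the Detector distinguishes Truthful, OOC, and NEI instances in one multi-class objective.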

Table 1: Number of instances per class for the training datasets generated by different Synthetic Misinformers.

Table 2: Comparison between different Detectors and Encoders trained and evaluated on the three NewsCLIPings datasets. We report the overall Accuracy and the Hit Rate per class (Truthful and Falsified pairs), i.e., Specificity and Sensitivity, respectively. Our proposed Transformer-based Detector (DT-Transformer) consistently outperforms the original NewsCLIPings Detector, and using features from CLIP ViT-L/14 further improves performance.

falsified instances compared to the approximately 1.2M falsified data points produced by other Misinformers, which may be inadequate to correctly train the Detector. We also observe that the proposed CLIP-NESt-alt surpasses all other multimodal NEI and OOC methods, achieving 56.9% accuracy. Furthermore, combining R-NESt with CSt-alt achieved the same score, but using NC/I-T in conjunction with R-NESt did not perform as well, since NC/I-T consists of 226,564 samples and required heavily down-sampling the truthful and R-NESt samples (1M each), resulting in a notably smaller dataset. Finally, we observe that our hybrid CLIP-NESt-alt + CSt-alt method achieved the highest multimodal accuracy (58.1%).

Table 3: Comparative study between numerous Synthetic Misinformer methods evaluated on the COSMOS benchmark. We use the DT-Transformer Detector and the CLIP ViT-L/14 encoder. We report the Accuracy of unimodal (image-only, text-only) and multimodal Detectors, as well as the multimodal Hit Rate per class (Truthful and Falsified pairs), i.e., Specificity and Sensitivity, respectively. Bold denotes the overall highest accuracy, while underline denotes the highest multimodal accuracy.