The InVID Plug-in: Web Video Verification on the Browser

This paper presents a novel open-source browser plug-in that aims at supporting journalists and news professionals in their efforts to verify user-generated video. The plug-in, which is the result of an iterative design thinking methodology, brings together a number of sophisticated multimedia analysis components and third party services, with the goal of speeding up established verification workflows and making it easy for journalists to access the results of different services that were previously used as standalone tools. The tool has been downloaded several hundred times and is currently used by journalists worldwide, after being tested by Agence France-Presse (AFP) and Deutsche Welle (DW) journalists and media researchers for a few months. The tool has already helped debunk a number of fake videos.


INTRODUCTION
Verifying images and videos posted by eyewitnesses of an event on social networks, especially during breaking news events, or debunking "fake news", misinformation, disinformation or hoaxes, has become part of the daily routine in newsrooms. But those processes remain rudimentary, time-consuming and cumbersome for journalists; the latter have to manually generate screenshots while watching the video and use them to query reverse image search engines, to master several ever-changing tools, and to scroll down endless social media timelines to find related information, copies or clues that allow them to identify an eyewitness or an event, or to extract more knowledge about a media item.
Most of those skills and expertise are documented by several authoritative sources such as the Verification Handbook [33] 1 , or the recommendations from First Draft News 2 (a US-based nonprofit coalition aiming to provide practical and ethical guidance in how to find, verify and publish content sourced from the social Web). Nevertheless, very few tools really help journalists in their verification routines. In practice, journalists and investigators need to switch back and forth among a multitude of online tools and services, each of which addresses only a small part of the journalistic verification process.
By applying a design thinking methodology, and by observing and understanding journalistic workflows and the difficulties that professionals encounter when verifying information, we released (on July 3, 2017) in "open beta" the first version of a browser plug-in, designed as a verification "Swiss army knife". The tool, which has been released as open source software 3 , provides a unified view over a number of third party services and novel technologies developed within the InVID 4 and REVEAL 5 research projects, through a single graphical user interface, aiming to help journalists obtain additional related information about the content that they try to verify.

RELATED WORK
The problem of online information verification is very complex and touches upon a number of research fields, including media studies and journalism (e.g. best journalistic practices for verifying user-generated content [33]), social network analysis (e.g. rumour spread over social networks [28]), knowledge engineering and computational fact checking [38], multimedia analysis and forensics [13,34], and social media mining [6]. In this section, we focus on the areas that are most pertinent to the development of the presented plug-in, namely video fragmentation for keyframe selection (Section 2.2), multimedia forensics (Section 2.3) and context-based multimedia verification (Section 2.4). But we first start our discussion by presenting a number of related systems and services (Section 2.1) that are currently available on the market and try to address different aspects of the video verification problem.

Verification systems and services
A lot of tools are used in the journalistic verification process; these include search engines, online translators, video players and editing software, map services (e.g. Google maps, Street view, Bing maps, Open Street Map, Nokia Here, Yandex maps, Wikimapia.org) and other web services and applications (e.g. providing historical weather information). Journalists often rely on the YouTube DataViewer 6 of Amnesty International for performing reverse image search based on the video thumbnails, or simply take screenshots while watching the video and upload them on reverse image search services such as Google images. For still images, plug-ins like RevEye 7 or TinEye 8 (linked to the respective search engines) are also used. Yet, all those tools require experience and remain rudimentary and cumbersome to use (e.g. jumping from one tool to another to check a location or landmark). To our knowledge, there is currently no integrated solution for comprehensively addressing the verification needs when dealing with user-generated content.

Fragmentation and keyframe selection
A core operation for many video analysis applications, including video annotation, summarization and detection of near-duplicates, is the identification of the temporal structure of the video. The most common approach relies on the detection of the elementary parts of the video, called shots, which correspond to sequences of frames captured without interruption by a single camera. Several methods have been proposed to address this task (e.g. [1,3,11,36,37]), which is now considered solved.
Nevertheless, when dealing with user-generated videos the shot-level fragmentation is too coarse and fails to reveal much information about their structure, since these videos do not contain the typical video editing effects (e.g. for merging different pieces of video or adding transition effects) and are most commonly captured without interruption with the help of a single camera/smartphone, thus being single-shot videos. For this type of video a more fine-grained segmentation into sub-shots is appropriate, to identify the different visually coherent parts of the video. In this direction, several sub-shot segmentation algorithms have been introduced. Most of them are related to video summarization and keyframe selection (e.g. [9,15,19,27]), some of them focus on analyzing egocentric videos (e.g. [17,24,39]), others are used as a first step for detecting duplicates (e.g. [8]), or for supporting indexing and annotation of personal videos (e.g. [25]) or video rushes (e.g. [2,10,23,29]).
Driven by the needs of media experts for verifying the integrity and authenticity of video content under time-pressure (which is the typical case when verifying content about breaking news), we built a very fast method that fragments a single-shot video into sub-shots and extracts a set of representative keyframes. The latter can be then used for assessing the originality of the video content by means of reverse image search. Details about the developed sub-shot segmentation algorithm are given in Section 4.3.

Multimedia forensics
The field of multimedia forensics focuses on methods for detecting traces of tampering in multimedia and extracting information about the history of media items using both content and metadata. Image forensics is an established field, and a number of surveys and evaluations of state-of-the-art techniques are available [4,35,41]. The latter include methods that try to identify whether an image has been tampered with (tampering detection), attempt to deduce where the tampering has taken place (tampering localization), and try to detect other, generic and often innocuous operations that have taken place on the image, such as recompression, rescaling or global enhancements. Of the three, the most relevant for multimedia verification is tampering localization, since generic operations are often unrelated to verification, and tampering detection algorithms are not favored by experts, as they typically do not give explanations for their conclusions but operate as black boxes instead [42].
Video forensics is a younger field than its image-based counterpart, and has yielded more limited success [30,34]. Besides looking for tampered regions in frames, the fact that videos have a temporal aspect as well means that a significant amount of research is also devoted to detecting the addition or deletion of entire frames [7,14]. However, the field as a whole has not yet progressed enough to be usable in practical applications. For that reason, the presented browser plug-in leverages recent advances of image forensics algorithms only.

Context-based multimedia verification
Recent research has shown that multimedia forensics are very hard to apply and largely ineffective on content that is sourced from the Web or social media platforms. This is mainly due to the fact that the provenance of such content is to a great extent unclear and that different platforms (e.g. Twitter, Facebook) tend to transform and resave multimedia content in a way that is destructive for the forensic traces of content [40]. To this end, recent research has investigated the potential of leveraging additional signals (i.e. context) about the content of interest with the goal of determining its veracity.
Seminal works in this area have empirically tested the potential of using a supervised learning approach for detecting newsworthy and credible posts in social media (mostly on Twitter) [6,32]. In particular, different features have been considered: features related to the post (tweet) (including both general text features, e.g. n-grams, and Twitter-specific ones, e.g. hashtag- and URL-based), the author/user, topic-based and network- or propagation-based features. Several works that follow a very similar approach [5,16] confirmed on different datasets the potential and high accuracy of fake post detection when using post- and author-based features. Although such methods have shown great potential in automatically distinguishing between credible and fake tweets, in practice end users (journalists) are often reluctant to rely on algorithmic outcomes for deciding on the veracity of online content. To this end, the presented plug-in only computes some of the credibility features investigated by previous works and presents them to end users without providing any automatically produced score quantifying the veracity of an input video.

INVID PLUG-IN DESIGN APPROACH
In the InVID project, we adopted design thinking from the very beginning as a methodology to better respond to user needs and to implement a more iterative development process with end users. Apart from interviews with journalists dealing with user-generated content verification in their daily work, we also analyzed many real-life use cases of breaking news situations, where eyewitness content was playing a key role in those events' reporting. In particular, the participation of InVID, through the consortium partner AFP, in the CrossCheck initiative launched by First Draft News on the French presidential election was very valuable for observing the challenges faced by the teams of journalists trying to debunk rumors and fake news. This overall analysis was key to understanding the difficulties that journalists are facing and where an accurate usage of technology could help them save time and be more efficient. Several prototypes were made for different tools, such as Python scripts to trigger fast reverse image search on YouTube videos or to automate the advanced search on Twitter by time interval up to the minute. In the meantime, sophisticated multimedia analysis services such as video context analysis (Section 4.1), video keyframe selection (Section 4.3), and image forensics (Section 4.5) were integrated in the tool. While the goal of InVID is to develop a full knowledge verification platform to detect emerging stories and assess the reliability of newsworthy video files and content spread via social media, we decided to share our work with the journalistic community 9 by designing a browser plug-in wrapping up several tools in a single interface, implemented in HTML, CSS and JavaScript.
The browser plug-in appeared as the best solution to combine the available tools, to engage with the community of end users and to provide also some long lasting modules (Twitter advanced search, Metadata reader, Magnifier, YouTube Thumbnails reverse search), locally in the browser, always at the user's fingertips.

PLUG-IN VERIFICATION MODULES

Video context analysis
This module aims to assist analysts by providing contextual information about the video, which can often be exploited for verification. Although part of the information provided by this module can already be accessed by visiting the page where the video was published (e.g. YouTube), the module isolates and aggregates verification-relevant information and presents it to the investigator in a digestible format, organized in five categories: a) video metadata, b) channel metadata, c) comment analysis, d) external search, and e) Twitter timeline analysis.
9 Media studies and media education scholars have also shown great interest in using the plug-in.
Video and channel metadata are collected from the YouTube and Facebook APIs 10 , and presented in a compact form to the investigator. Besides the video name and description, these include information such as the video and channel view counts, video upload date and channel creation date (converted to GMT), plus any locations mentioned in the video description, extracted using the Named Entity Extraction functionalities of the Stanford CoreNLP library 11 (Figures 1 and 2). This contextual information aims to provide a first overview of the video context and to help spot discrepancies between the actual metadata and the associated claims, e.g. with respect to when and where the claimed event took place versus when and where the video was uploaded, and how old, reliable, and relevant the channel appears to be.
Comment analysis aims to assist the investigator by scanning through the comments and finding the ones that are potentially related to verification. These verification-related comments are currently extracted based on a list of keywords, such as "lies", "fake", "wrong", and "confirm" 12 . Comments that contain these keywords are marked as verification comments and presented in a compact form, helping investigators quickly sift through them and see whether some user has already identified some information indicating that the video is real or fake. Overall, comments can be an important source for verifying videos since they offer a view on the observations of other users. At the same time, they have limitations; newly appearing videos may not contain any useful comments for some time, as users are still trying to verify the video.
10 In the future, more video platforms with publicly accessible API will be supported.
11 http://stanfordnlp.github.io/CoreNLP/
12 The list is currently under review with the goal of expanding it and also translating it to additional languages.
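The keyword-based comment filtering described above can be illustrated with a minimal Python sketch. It is not the plug-in's actual implementation (the plug-in itself is written in JavaScript); the function name `verification_comments` is hypothetical, the keyword list contains only the four examples quoted in the text, and matching on word prefixes (so that "confirm" also matches "confirmed") is an assumption.

```python
import re

# Only the four example keywords quoted in the text; the real list is longer.
KEYWORDS = ("lies", "fake", "wrong", "confirm")

# Assumption: a keyword matches when it starts a word, case-insensitively,
# so "confirm" also matches "Confirmed" but "lies" does not match "applies".
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\w*",
    re.IGNORECASE,
)


def verification_comments(comments):
    """Return the comments that mention at least one verification keyword."""
    return [comment for comment in comments if _PATTERN.search(comment)]


if __name__ == "__main__":
    sample = [
        "This is FAKE, it was filmed in 2011",
        "Great footage",
        "Can anyone confirm the location?",
    ]
    for comment in verification_comments(sample):
        print(comment)
```

A simple keyword match of this kind is transparent to the investigator, which fits the paper's stated preference for presenting evidence rather than an automatic verdict.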

Paper Session
MuVer'17, October 27, 2017, Mountain View, CA, USA

Preliminary efforts in attempting to use comment features for detecting fake videos have shown that it may take at least 6 hours to have enough comments to come to a reliable conclusion [31]. Another functionality provided by the contextual verification module is external search using third party services. A first type of such services includes reverse image search platforms, such as Google and Yandex, that provide links to near-duplicate versions of a query image on the Web (if available). This aims to help the analyst find out whether the content is actually from an older event and is being reposted under a false context. To this end, the video thumbnails provided by the video platform API are sent to the respective reverse image search services. In addition, external search includes the search for posts of the input video on Twitter. The video URL is sent to Twitter as a query, and the search returns all posts sharing the video. This allows the investigator to evaluate Twitter activity around the video. As with other functionalities in this module, these steps could be taken by the investigator independently. However, integrating and presenting them on the same page offers a comprehensive platform that can significantly speed up the verification process.

Twitter search
Twitter is widely used by journalists to discover new information, especially during breaking news events. In the plug-in, we enhanced the Twitter advanced search by allowing the user to query this source by time interval up to the minute. This is done by automating the conversion of regular calendar dates into Unix timestamps; a trick that some journalists have so far performed manually through the Epoch Converter 13 website, which requires copying and pasting the number strings created by that website into the Twitter search panel using the "since" and "until" operators.
This extended functionality now offers journalists and media scholars an efficient and fast way to go back in time to the first tweets after a breaking news event or to document the Twitter coverage of past events. It also allows them to quickly change, if needed, the time interval to narrow or expand the search (Figure 3).
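The date-to-timestamp conversion described in this section can be sketched in Python as follows. This is an illustrative sketch and not the plug-in's code: the function names are hypothetical, dates are assumed to be expressed in UTC, and the exact operator spelling and query layout are assumptions, so the operator names (which the text gives as "since" and "until") are left as parameters.

```python
from datetime import datetime, timezone


def to_unix_timestamp(year, month, day, hour=0, minute=0):
    """Convert a calendar date and time (assumed to be UTC) to a Unix timestamp."""
    moment = datetime(year, month, day, hour, minute, tzinfo=timezone.utc)
    return int(moment.timestamp())


def build_query(keywords, start, end, since_op="since", until_op="until"):
    """Build a search string restricted to the interval [start, end].

    `start` and `end` are (year, month, day[, hour[, minute]]) tuples.
    The "operator:timestamp" layout is an assumption made for illustration.
    """
    return (
        f"{keywords} "
        f"{since_op}:{to_unix_timestamp(*start)} "
        f"{until_op}:{to_unix_timestamp(*end)}"
    )


if __name__ == "__main__":
    # A 30-minute window on the afternoon of June 19, 2017 (UTC).
    print(build_query("attack", (2017, 6, 19, 15, 0), (2017, 6, 19, 15, 30)))
```

Narrowing or widening the interval then amounts to changing the two tuples, which mirrors the quick adjustment of the time window mentioned above.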

Keyframe selection
This module selects a set of keyframes from a single-shot video by detecting visually coherent parts of it (i.e. sequences of frames having only a small and contiguous variation in their visual content) and extracting one representative keyframe from each part. The decomposition of a single-shot video into the aforementioned fragments (called sub-shots in the following) is based on the assessment of the visual resemblance of neighboring video frames with the help of the Discrete Cosine Transform (DCT), which is similar to the transformation applied when extracting the MPEG-7 Color Layout Descriptor [18]. As shown in Figure 4, the employed algorithm initially resizes each frame to m × m dimensions (step 1) and represents it as a sum of cosine functions oscillating at different frequencies via a two-dimensional DCT (step 2), forming an m × m matrix (m = 8 in Figure 4) where the top-left element corresponds to the DC coefficient (zero-frequency) and every other element moving from left to right and from top to bottom corresponds to an increase in the horizontal and vertical frequency by a half cycle, respectively. Next, the top-left r × r part (r < m) of the computed matrix (r = 3 in Figure 4) is kept, while high-frequency coefficients are discarded, thus removing information related to the visual details of the image (step 3). Finally, a matrix reshaping process is applied to piece together the rows of the extracted r × r sub-matrix into a single row vector (step 4), and the DC coefficient is then removed (step 5), forming a row vector of size r² − 1 that represents the image. The visual similarity between a pair of frames is estimated by computing the cosine similarity of their descriptor vectors. This process is applied to every pair of consecutively selected frames, where frames are selected via a fixed-step sampling strategy that keeps 3 equally distant frames per second.
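Steps 2 to 5 and the similarity computation can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the frame is assumed to be already resized to an m × m grayscale matrix (step 1 is omitted), an unnormalized DCT-II is used (the paper does not specify the normalization), and the function names and the handling of flat frames are hypothetical.

```python
import math


def dct2(block):
    """Unnormalized 2-D DCT-II of a square matrix given as a list of rows."""
    m = len(block)
    # basis[u][x] = cos(pi * (2x + 1) * u / (2m)), the 1-D DCT-II basis.
    basis = [[math.cos(math.pi * (2 * x + 1) * u / (2 * m)) for x in range(m)]
             for u in range(m)]
    # Row index v is the vertical frequency, column index u the horizontal one.
    return [[sum(block[y][x] * basis[v][y] * basis[u][x]
                 for y in range(m) for x in range(m))
             for u in range(m)]
            for v in range(m)]


def frame_descriptor(frame, r=3):
    """Keep the top-left r x r DCT coefficients, flatten by rows, drop the DC term."""
    coeffs = dct2(frame)
    low = [coeffs[v][u] for v in range(r) for u in range(r)]
    return low[1:]  # vector of size r*r - 1


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity; two (near-)zero vectors are treated as identical (assumption)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a < eps and norm_b < eps:
        return 1.0
    if norm_a < eps or norm_b < eps:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


if __name__ == "__main__":
    m = 8
    horizontal = [[x for x in range(m)] for _ in range(m)]  # left-to-right ramp
    vertical = [[y for _ in range(m)] for y in range(m)]    # top-to-bottom ramp
    d_h, d_v = frame_descriptor(horizontal), frame_descriptor(vertical)
    print(len(d_h), cosine_similarity(d_h, d_h), cosine_similarity(d_h, d_v))
```

Because the DC coefficient is dropped and cosine similarity ignores vector length, a uniform brightness or contrast change leaves the score essentially unchanged, while a change in spatial layout lowers it.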
After analyzing the entire set of selected frames the algorithm produces a series of similarity scores, which is smoothed (with the help of a sliding mean average window of size 3) for reducing the effect of sudden, short-term changes in the visual content of the video (such as the ones introduced after camera flashlights or slight hand movement of the camera holder). The turning points of the smoothed series are then identified by computing its second derivative, and each one of them signifies a change in the similarity tendency and therefore a sub-shot boundary. Through this process the algorithm indicates both sub-shots with minor or no activity, and sub-shots with gradually, but also consistently, changing visual content. As a final processing step, the representative keyframe selected for each sub-shot of the former type is its middle frame, while for the latter type of sub-shots the frame that corresponds to the point in time where the change of visual content is most pronounced is chosen. The selected keyframes are shown to the user of the toolkit, to allow performing reverse search through the Google image search engine.
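The smoothing and boundary detection steps can be sketched as follows. The text does not spell out how turning points are derived from the second derivative, so this sketch adopts one possible reading as an assumption: a boundary is reported wherever the discrete second difference of the smoothed series changes sign. The function names and the edge handling of the averaging window are likewise hypothetical.

```python
def smooth(scores, window=3):
    """Sliding mean over `window` samples; the window is truncated at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def second_difference(series):
    """Discrete second derivative, defined for the interior samples only."""
    return [series[i - 1] - 2 * series[i] + series[i + 1]
            for i in range(1, len(series) - 1)]


def turning_points(series):
    """Indices in `series` where the second difference changes sign (assumed rule)."""
    points = []
    previous_sign = 0
    for k, value in enumerate(second_difference(series)):
        sign = (value > 0) - (value < 0)
        if sign == 0:
            continue  # ignore exact zeros and wait for the next non-zero value
        if previous_sign != 0 and sign != previous_sign:
            points.append(k + 1)  # second_difference()[k] is centred on series[k + 1]
        previous_sign = sign
    return points


if __name__ == "__main__":
    similarity = [0.99, 0.98, 0.99, 0.90, 0.70, 0.60, 0.58, 0.59]
    print(turning_points(smooth(similarity)))
```

On real similarity scores a small tolerance around zero would probably be needed, since floating-point noise in an almost flat series can otherwise produce spurious sign changes.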

Keyframe magnifier and video metadata
Implicit knowledge from a scene depicted in a video keyframe, such as car plates, banners, signs, shop names, points of interest, etc. can be used by journalists to determine whether an image or a video keyframe is really related to the location where an event is allegedly taking place. To support the inspection of keyframes, the plug-in offers a "magnifier" feature (Figure 5) based on the Elevatezoom 14 JavaScript library. We have added two algorithms that the user may trigger: one to enhance the sharpness of the image and the other, a bicubic enhancement filter, to increase the image size without harming the image readability. This functionality allows journalists, if the quality of the image allows it, to detect meaningful details within the image to confirm a location or identity, or to spot pixel inconsistencies that may hint at possible tampering.
Furthermore, the plug-in includes an Exif metadata reader based on an Exif JavaScript library for still images and the MP4Box.js library for video in mp4 or m4v format. Stored metadata within the image or the video, such as creation or modification date, codecs, caption if any, geocoordinates, resolution, duration, etc. are displayed in a table.

Forensic analysis of keyframes
The image forensics module provides analysis tools for tampering detection in images by incorporating a number of state-of-the-art algorithms for tampering localization. The functionalities and interface of this module, described in detail in [42], are designed to cover the needs of news professionals for news-related image verification (Figure 6).
The module exposes the results of seven state-of-the-art algorithms that process the image and return a number of localization maps. These include Double JPEG Quantization [22], JPEG Ghosts [12], JPEG Blocking Artifact Inconsistencies [21], Median Filtering Noise Residue, Discrete Wavelet High Frequency Noise Variance [26], Error Level Analysis [20], and a novel algorithm, named GRIDS and developed within the REVEAL project 15 , aiming at detecting JPEG Blocking Grid Inconsistencies. Figure 6 shows the analysis results for a tampered image, where several of the algorithms have returned strong indications that the face may have been edited. The algorithms are chosen to cover, as widely as possible, a range of tampering traces, and are accompanied by descriptions and instructions on how to interpret the outputs, and visual examples of detections and non-detections. The aim is to assist investigators with no prior experience in image analysis, to take advantage of these tools. The service also provides a "magnifying glass" feature which allows investigators to explore details in the image or the output maps, and export their findings in an annotated PDF report that can be used for sharing their observations.

CASE STUDIES
The presented verification plug-in was used for two months in a professional environment by a dozen journalists at Agence France-Presse (AFP) as well as by Deutsche Welle (DW) social media journalists. For example, the plug-in made it possible to quickly debunk a fake video allegedly showing the robbery of a Manila (Philippines) casino in a hotel resort on June 2nd, 2017. The video in fact depicted an earlier casino robbery perpetrated in Surinam at the end of 2011. Checking the reverse image search results on Google after extracting the video keyframes was enough to debunk this example. Through the Twitter advanced search feature of the plug-in, it is possible to easily document breaking news events from the past and to track fake images or videos that were shared on Twitter during those events. A quick search on the afternoon of June 19, 2017 (using a "media" parameter as filter) returned an image of an arrest in London captured during a previous terror attack. The same image reappeared on Twitter during the failed Champs Elysées attack on that particular day. Earlier, during the already mentioned CrossCheck initiative on the French presidential election, the magnifier feature made it possible to prove that a jet, allegedly used by a candidate to go from one meeting to another, was in fact registered in the United States (see Figure 5).
15 https://revealproject.eu/
Last but not least, according to the analytics of the Google Chrome store, a week after the "open beta" release of the plug-in the number of total current users exceeded 400 and is now reaching 500 (Figure 7). Moreover, based on the received feedback about the tool via Twitter (indicative examples in Figure 8), other social media platforms or e-mail, the tool is currently used by journalists and experts worldwide, effectively supporting them in their efforts to debunk a number of fake videos.

CONCLUSIONS
We presented the design and underlying components of a novel browser plug-in aimed at helping journalists and news professionals with the verification of user-generated video content. The plug-in, which is also available as open source software, seamlessly integrates a number of sophisticated multimedia analysis features and several third party services that are used very often within verification workflows. Testing the plug-in in realistic settings showed that it offers a valuable integrated tool that can considerably speed up the video verification process for journalists and media scholars, as well as non-governmental organizations dealing with video verification. The plug-in will be updated in the coming months by enhancing the current features, for instance by refining the implementation of the enhancement (magnifier) feature for keyframes, by making better use of the extracted image and video metadata, and by extending the comment analysis to multiple languages.