Content-aware Detection of JPEG Grid Inconsistencies for Intuitive Image Forensics

The paper proposes a novel method for detecting indicators of image forgery by locating grid alignment abnormalities in JPEG compressed image bitmaps. The method evaluates multiple grid positions with respect to a fitting function, and areas of lower contribution are identified as grid discontinuities and possibly tampered areas. An image segmentation step is introduced to differentiate between discontinuities produced by tampering and those that are attributed to image content, making the output maps easier to interpret by suppressing non-relevant activations. Our evaluations, performed both on synthetically produced datasets and real-world tampering cases against seven methods from the literature, highlight the effectiveness of the proposed method in its ability to produce output maps that are clear and readable, and which can achieve successful detections on cases where other algorithms fail.


Introduction
With people nowadays spending increasing time looking at screens, it is no wonder that digital images have become an integral part of everyday life and, arguably, one of the most popular ways to convey a message. Exploiting the natural human tendency to give priority to visual information, digital images are widely utilized as a means to convince audiences, engage users, augment storytelling, and provide evidence in various domains from business and marketing to journalism and law, to name a few.
However, given the proliferation of image processing tools, accessible to both professionals and non-experts, the authenticity of a digital image cannot be taken for granted. The intentions behind manipulating images can vary from simple aesthetic enhancements to malicious alterations of important constituent parts of the image with the intent of misguiding viewers. A doctored image can influence the opinion of viewers and have serious consequences on people's beliefs and attitudes. As visual inspection is often not sufficient to detect forgeries, there has recently been a growing interest in developing algorithms to verify the authenticity and validate the integrity of digital images.
Image forgery detection techniques are often categorized into two classes: (i) active methods, which rely on an embedded digital signature of some sort that is encoded at the source side (e.g., automatically by the capturing device) and verified at the receiver's end; (ii) passive (blind) methods, that require no prior information but instead base their detection on the assumption that the tampering process may leave invisible but detectable traces on the image. Even though active methods can be very reliable, their use is not possible in situations where content from unknown or untrusted sources may contain important information [1]. In such cases the assessment of content authenticity is based on what has come to be referred to as intrinsic fingerprints, i.e. inherent traces left from various post-processing operations.
A closer look into the state of the art reveals that the challenge researchers face concerning blind image forensics originates from the multifaceted nature of the problem, since the type and salience of traces left by tampering depend on multiple factors, such as the type of tampering or the image format and compression parameters. A recent study presented a comprehensive evaluation of the state of the art in splicing localization [2]. Splicing occurs when parts of the original image are replaced by alien content and, together with inpainting and copy-moving, it is among the most common types of forgery. The study pointed out a large discrepancy between real-world cases of tampering and the benchmark datasets that are typically used in academic literature.
In this work we are interested in extending the arsenal of currently available tampering detection tools by contributing a novel method that is relevant to a wide range of real-world image forgeries, and practical for users that have no specialized training in interpreting forensic maps. To this end, we present a novel technique that searches for JPEG blocking artifact discontinuities as a sign of possible forgery, and detects what is arguably one of the most commonly performed tampering schemes: image splicing that breaks the original grid alignment either due to its placement or due to resampling transformations (scaling, rotation, etc.) of the spliced area. The proposed method extends a JPEG grid detection algorithm from the literature [3] by introducing two novelties:
• a grid alignment confidence measure designed to identify whether an image block follows the global grid pattern or violates it, either due to misalignment, distortion, or complete absence of encoding artifacts (Section 3.2);
• a content-aware filtering step designed to account for grid discontinuities caused by the image content, strengthening the method's localization ability and overall output interpretability (Section 3.3).
Figure 1 showcases the importance of these two novelties. By comparing the outputs produced by the proposed method (fourth row) with those produced when leaving out either of the two proposed novelties, i.e. the newly proposed grid alignment confidence measure (second row) and the content-aware filtering step (third row), it becomes clear that the accuracy and quality of the output maps improve considerably. The proposed method, hereafter referred to as CAGI (Content-Aware detection of Grid Inconsistencies), is evaluated against several state-of-the-art algorithms on three publicly available datasets, including both synthetic and real-world tampering cases, testing its classification ability, its localization effectiveness, and the readability of the produced outputs.
The experimental results highlight the method's robustness over the diverse tampering scenarios: the proposed method manages to outperform the competition in terms of overall accuracy in the combined evaluation criteria (Section 5.4), while also substantially contributing in terms of successful localizations of unique cases, i.e. cases that all other methods failed to detect. Java and MATLAB implementations of CAGI have been made publicly available as part of our Image Forensics Toolbox, alongside other state-of-the-art algorithms.
Figure 1: First row: Input images where the tampered area is marked with a red outline. Second row: Output maps produced without the newly proposed grid alignment confidence measure. Third row: Output maps produced without the content-aware filtering step. Fourth row: Output maps produced by the CAGI method.

Related Work
Many notable contributions have been made towards tackling diverse cases of image manipulation. One category of approaches includes algorithms based on machine learning, using appropriate features extracted from images and trained on samples of tampered and authentic images [4,5,6]. Others detect operation-specific traces (such as re-sampling) [7], make use of compression and coding artifacts [8,9,10], search for inconsistencies in the image traces produced by the capturing process [11,12], or search for physical inconsistencies such as illumination discontinuities [13,14]. A number of surveys present the evolution of the state of the art through time [15,16,1,17,2]. Here, we focus on methods for image splicing, organized by the type of trace they attempt to analyze for detecting forgeries. For each method, a three- or four-letter abbreviation is also given and used throughout the paper, following the conventions of [2].
Methods based on JPEG attributes: The method in [8] (BLK) is probably the most closely related to ours, since it also attempts to detect forgeries by locating inconsistencies in the JPEG blocking artifact. The image is filtered based on local derivatives, weak edges are detected, and their conformance with an aligned 8 × 8 grid is measured. A feature corresponding to the local strength of the blocking pattern is extracted. The feature's variations indicate local absence or misalignment of the grid, which can be considered an indication of tampering. In [9] (ADQ1), tampering localization is achieved by exploiting the characteristics of double Discrete Cosine Transform (DCT) quantization. When splicing an object on a JPEG image, the spliced region often loses its JPEG traces, due to rescaling, rotation, filtering, or other transformations. Thus, when resaving the forged image, the unspliced part will exhibit the traces of two compressions, while the spliced part will only have undergone one.
Methods based on DCT coefficients: In [18] (DCT), a simple, fast detection method looks for inconsistencies in JPEG DCT coefficient histograms. The method in [19] (ADQ2) first estimates the quantization table used by the first JPEG compression and then attempts to model DCT coefficient histogram periodicities. The method in [20] (ADQ3) performs Aligned Double Quantization inconsistency detection using SVMs trained on the distribution of DCT coefficients for various cases of single vs double quantization. The method in [20] (NADQ) searches for Non-Aligned Double Quantization traces, that is, cases where the JPEG grid has been shifted prior to the second compression. Finally, in [21] (GHO) the image is recompressed at multiple different quantizations and subtracted from the original, aiming to detect JPEG Ghosts, i.e. traces left in parts of the image for which past recompressions were performed at a different quality compared to the unspliced image.
Methods based on CFA interpolation pattern disturbances and noise patterns: The method in [12] (CFA1) looks for disturbances in the image Color Filter Array (CFA) interpolation patterns left by the image capturing process by modelling them as mixtures of Gaussian distributions. The work in [11] presents two algorithms (CFA2 and CFA3) also exploiting CFA patterns: the first emulates the CFA filtering process and localizes regions that diverge from the expected result, while the second isolates image noise using de-noising, and compares noise variance between interpolated and natural pixels. Finally, notable approaches based on noise information include the method presented in [22] (NOI1), where the local image noise is isolated by wavelet filtering and local variance discrepancies are treated as indicative of tampering, [23] (NOI2), where the local image noise variance is modeled using the properties of the kurtosis of frequency sub-band coefficients in natural images, and [24] (NOI3), where, following extraction of the high-frequency residual using a high-pass filter, the information is modeled using a co-occurrence descriptor, and inconsistencies in the local statistical properties of the descriptor are used to detect spliced regions.
Compared to the state-of-the-art, the proposed method (CAGI) aims to provide a tampering localization solution designed for robustness in realistic scenarios, while producing output maps that are easy to interpret. We specifically aim to achieve tampering localization for cases where the history in terms of acquisition, forgery, and post-forgery transformations of the images is unknown. The algorithm does not require metadata, JPEG compression parameters, or prior knowledge on the history of the image, nor does it require that the image is in raw format taken directly from the camera. It can operate on any file format, provided it has been compressed as JPEG in its past. The discrimination of the image areas that are aligned to the dominant grid pattern from those that break it is conducted through exhaustive search, taking also into account the contents of the image and their possible interference with the attempted modeling. This allows filtering out false activations and leads to overall clearer outputs. As will become apparent from the experimental study of Section 5, CAGI offers a higher level of versatility and overall performance compared to the state-of-the-art.

Method Description
Blocking artifacts appear as a regular pattern of visible block boundaries in a JPEG compressed image as a result of DCT coefficient quantization and the independent processing of the non-overlapping 8 × 8 blocks during the DCT. They are prominent in highly compressed images or images that have undergone multiple re-compressions and become more subtle as the compression quality factor (QF) increases. These artifacts ultimately lead to the formation of a block grid in the JPEG image bitmap, i.e. a pattern of weak horizontal and vertical edges recurring every 8 pixels, starting from the upper left corner of the image.
Even though the block grid should be detectable over the whole bitmap, the visual content of the image may interfere with the periodicity of the pattern, as, for instance, in i) areas that contain strong edges (artifacts appear around high-contrast edges producing a "halo" effect), ii) overexposed areas (where the soft grid pattern completely disappears), iii) underexposed areas (where the pattern is noticeably more subtle), and iv) textured areas containing patterns that resemble a grid. Furthermore, normal sensor noise introduced during image acquisition or any type of noise embedded in the image may also hinder the grid detection.
The following sections provide a detailed description of the proposed method that localizes tampered areas in an image by correlating them to grid alignment abnormalities in a JPEG compressed bitmap, while also accounting for false alarms derived from image content interference.

Detection of blocking artifacts
To detect the JPEG block grid, we extend the method proposed by Fan et al. [3]. Although the original intention of their work was to determine whether an image had been previously JPEG compressed (and to estimate the previous compression parameters), it presents an efficient and lightweight implementation for successfully detecting gradation discontinuities across block boundaries for JPEG compression with QF up to 95.
For a given grid, the method validates its detection if a certain confidence value is met, calculated via computations of differences of pixel values within a block and across block boundaries. The image is divided into N non-overlapping blocks of 8 × 8 pixels and for each block(i, j), the scores Z′(i, j) and Z″(i, j) are computed using Equation 1.
where A-H refer to pixel positions on a block as depicted in Figure 2. Then, two normalized histograms H_I and H_II are created based on the Z′ and Z″ scores across the image, and a confidence score K is computed using Equation 2.
where m is the number of histogram bins used in the implementation (see [3] for further details). Fan et al. [3] empirically found that for pixel values ranging from 0 to 1, K > 0.25 is an indicator of successful grid detection. We will be referring to the detected grid position using the coordinates of pixel A in block(1, 1), i.e. the default Grid Position (GP) for a JPEG compressed image is located at GP (4, 4). In case the grid has been shifted from its original position, e.g. due to image cropping, an investigation can be conducted by calculating the confidence score K for all coordinates of pixel A within the 8 × 8 block (the coordinates of pixels B-H change accordingly, keeping their relative positions).
In instance (b), all neighbouring pixels are actually sampled from inner-block regions, so the grid position is not detected at all, which is clearly reflected in the histogram plot. Even though not included in this example, the same goes for sampling only from border regions (e.g., GP (8, 4)). Instance (c) depicts a detection attempted at position GP (5, 4). The inner-block and border pixels are somewhat correctly sampled, i.e., pixels A-D are still within the inner-block region of the grid pattern while the border samples miss the grid intersection point by only one pixel in the vertical direction and thus partially meet the cross pattern (Figure 2.c). As a result, the respective histogram plot is very similar to the one of case (a), but the respective K detection score is lower. Finally, instance (d) showcases the symmetrical properties of the applied computations. A position search for A (8, 8) produces a plot identical to that of case (a), only here H_I and H_II are swapped, as is the sign of H_I − H_II.
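The grid position search described above can be illustrated with a short sketch. It assumes the formulation of Fan et al. [3] as we understand it: Z′ = |A − B − C + D| on the 2 × 2 pixels at the block centre, Z″ = |E − F − G + H| on the 2 × 2 pixels straddling the block corner (four pixels down and right of A), and K as the sum of absolute differences between the two normalized histograms. The function name `grid_score`, the default of 32 bins, and the choice of histogram range are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def grid_score(img, gi=4, gj=4, bins=32):
    """Unsigned score K for a candidate grid position (gi, gj), 1-based, of pixel A.

    Assumes a 2-D grayscale array large enough to hold at least one sample.
    """
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    # Top-left corner of every A-D quad; E-H sit 4 pixels further down/right.
    rows = np.arange(gi - 1, h - 5, 8)
    cols = np.arange(gj - 1, w - 5, 8)
    R, C = np.meshgrid(rows, cols, indexing="ij")

    def quad(dr, dc):
        # |top-left - top-right - bottom-left + bottom-right| of a 2x2 quad
        return np.abs(img[R + dr, C + dc] - img[R + dr, C + dc + 1]
                      - img[R + dr + 1, C + dc] + img[R + dr + 1, C + dc + 1])

    z1 = quad(0, 0).ravel()   # Z'  (pixels A-D)
    z2 = quad(4, 4).ravel()   # Z'' (pixels E-H)
    top = max(z1.max(), z2.max(), 1e-12)  # shared histogram range (assumption)
    h1, _ = np.histogram(z1, bins=bins, range=(0.0, top))
    h2, _ = np.histogram(z2, bins=bins, range=(0.0, top))
    h1 = h1 / h1.sum()
    h2 = h2 / h2.sum()
    return float(np.abs(h1 - h2).sum())

# Exhaustive search over all 64 candidate positions of pixel A:
# scores = np.array([[grid_score(img, i, j) for j in range(1, 9)] for i in range(1, 9)])
```

This sketch returns only the magnitude of the score; the sign convention used in the text (which histogram dominates) would have to be tracked separately.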
According to [3], the highest K should reveal the grid position. However, in our experiments with tampered and untampered images, K was found to be a poor grid location indicator, mainly because it could be heavily affected by image content, especially for images of low resolution (few blocks) or high quality compression (weaker grid pattern) and even more so for tampered images containing alien misaligned splices. In order to limit the possibility of high K scores being a result of image content, we propose a new grid confidence measure, namely K . This new confidence measure does not simply rely on the highest reported K score to locate the grid position but includes an additional verification step based on an expected pattern, arising among all calculated K scores in an image, at the correct grid location. Thus, our approach looks for a specific pattern in the values of K instead of simply taking the location where it is highest. More specifically, the expected pattern suggests that if the highest K score is found at position (i, j), then an equally high K score should be present at position (i + 4, j + 4) and low scores at positions (i + 4, j) and (i, j + 4). Furthermore, the K scores of different GP investigations remain high and positive as long as A-D are actually part of the inner block, while E-H are at the borders, or high but with a negative sign, if sampled inversely. If pixels are sampled being all in the same class (all inner-block or all border pixels), the respective K scores are expected to be low and their sign uncertain. Figure 4 demonstrates this emerging pattern. Figure 4.a illustrates a grid at position GP (4, 4) and the sampling instance (out of all possible 64) that will produce the highest K score with the correct sign.
Figure 4: … respective K-score patterns during the grid location investigation. Letters H and L stand for High and Low K scores, respectively. Positions marked with 1 indicate that the inner pixels A-D are correctly part of the inner region of the grid and E-H are along grid boundaries. Positions marked with 0 indicate the opposite.
Thus, after we calculate the values of K and locate the sampling instance that produces the highest one for a given image, we include an additional step in which we also evaluate whether the rest of the calculated K scores and their signs comply with the expected pattern. In Figures 4.b and 4.c the grid is shifted by two pixels in both directions. The grid detection process will locate the grid at GP (6, 6) (Figure 4.c) and the expected K score pattern will be adjusted as shown in Figure 4.e. Due to the symmetry of the sampling process, however, the investigation instance at position (2, 2) (Figure 4.b) will also produce the same absolute K score (with an opposite sign), as well as the same K score pattern (Figure 4.e). Given the above, we calculate an intermediate confidence score K based on Equation 3.
where K ∈ [0, 1] and K ∈ [0, 2] is calculated by Equation 2. The aim of K is to quantify the observed patterns in the values of K. Taking advantage of the pattern's symmetry, we may reduce the investigation of possible grid positions from 64 (8-by-8 window of positions) to just 16 (4-by-4 window) and identify the actual position by comparing the sign of K (i, j) to those of the original K(i, j) and K(i + 4, j + 4). For the remaining 16 possible grid positions we proceed by calculating how well they fit the expected sign patterns (see Figure 4, where 1 and 0 indicate positive and negative signs, respectively). Starting at position GP (i, j) and searching horizontally and vertically, most K signs should be positive, while for position A(i + 4, j + 4) most should be negative. Another measure S ∈ [0, 1] is used to evaluate how well each position fits this pattern, calculated as the number of positions having the correct sign divided by the total number of positions. Then, the final confidence score is formulated by Equation 4.
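The sign-agreement measure S lends itself to a very small sketch. Since the expected sign pattern is defined in Figure 4, it is passed in here as a boolean array rather than reconstructed; the function name `sign_agreement` and the convention that a score of exactly zero counts as negative are illustrative assumptions.

```python
import numpy as np

def sign_agreement(k_signed, expected_positive):
    """Fraction S in [0, 1] of positions whose K sign matches the expected pattern.

    k_signed          : array of signed K scores for the inspected positions
    expected_positive : boolean array, True where a positive sign is expected
    """
    k_signed = np.asarray(k_signed, dtype=float)
    expected_positive = np.asarray(expected_positive, dtype=bool)
    # Assumption: zero is treated as "not positive".
    correct = (k_signed > 0) == expected_positive
    return float(correct.mean())
```

For example, with scores `[[0.5, -0.2], [0.1, -0.3]]` and an expected pattern `[[True, False], [False, False]]`, three of the four signs agree, so S is 0.75.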
K is a score referring to the total image and its aim is to estimate the position of the JPEG grid. Once the position of the grid is fixed, we can also calculate the contribution K block of each individual 8 × 8 block to the overall K score. The calculations for K block follow that of K , only instead of using the normalized histograms of all sampled pixels of all blocks to calculate K (i.e. Equation 2), we compute individual K block scores for each block n as: and proceed with the calculations as above, to get the respective K block (n). K takes advantage of the lightweight implementation and effectiveness of the K measure and adds an extra level of detection robustness, while K block (n) allows the identification of image parts that present JPEG grid inconsistencies, which is the goal in detecting and localizing tampering operations. In Figure 5, for instance, the image blocks a1, a2, b1 and b2 present traces of two different grids (black for the original and orange for the result of misaligned image splicing), while blocks a3 and b3 carry only the original JPEG artifacts. The individual K block scores would not reveal the inconsistency because the sampled pixels do not happen to belong to both grids. K block however, would result in lower scores for the four tampered blocks compared to the untampered ones, since the expected pattern will not be equally strong in the respective K block -score pattern and sign evaluations.

Modelling compression artifacts
Consider the following typical case of splicing, in which a host image is JPEG compressed, an alien region is cut from another JPEG image, pasted into the host (not aligned with the original grid) and the composite image is re-compressed as JPEG. At the location where the tampering took place, the new image bitmap will carry overlapping grid artifacts.
In the ideal case, where the grid positions of the original host image and of the final compression are known and can be detected using K , we would only need to plot the heat map of the contribution of each 8 × 8 block to the maximization of K for i = 4, j = 4 (standard grid position of JPEG). Blocks scoring significantly lower than the rest would reveal the inconsistency in the main grid pattern, caused by the alien region. Unfortunately this is hardly ever the case, since the consistency of the blocking artifacts throughout the host image is easily disturbed by a variety of factors, such as strong edges, visual texture patterns, over/under exposure, etc.
To moderate the impact of such effects, we exploit all information gathered during the previous procedure. For each image block, we calculate the K block scores of the reduced set of the 16 possible grid coordinates (Equation 6), and then compute the mean block response (Equation 7). The final output is stored in the form of a heat map.

where H[k] is the Heaviside step function, and (i_x, j_x) is the pair of coordinates for one of the 16 candidate grid positions within the block.
[Figure 6, recoverable panel labels: Input Image; Divide image into 8×8 blocks; Rescale and, for every 8×8 block, mark its class; B3: Convert to HSV and threshold the V channel to classify blocks as over- or under-exposed; Help Map 3: Over-exposed | Under-exposed.]
… The upper A1 row shows the outputs reporting low K, while the lower row shows the higher-scoring K grid position searches.
All grids, even the ones with low K scores present strong responses at locations where the image contains high-contrast edges. This is to be expected, considering the halo effect (extra blockiness) that the JPEG compression introduces in such areas. On the other hand, low responses are consistent throughout the maps at the under-exposed (dark) image areas and at bright image regions (upper right corner), where the grid pattern is more subtly present. As we move from the least fitting to the best detected grid, the tampered region becomes visible as an area of lower responses, that cannot be justified solely on the basis of image content. Figures 6.A2 and 6.A3 are the calculated mean responses of all 16 grids per block and the responses of the best detected grid, respectively, after a spatial mean operation where pixel values in a non-overlapping 3 × 3 window are replaced by the window mean in order for isolated random responses (either high responses in a neighborhood of low responses or vice versa) to be smoothed out.
There may be blocks in the image which, due to their content, do not display any grid, and should not be taken into account. These blocks can be identified by the fact that they return consistently high values of K block for multiple different GP coordinates. We suppress the responses in these blocks by taking the difference between the two aforementioned maps, i.e. the mean response and the response of the best detected grid. This gives us Heat Map A (Figure 6), where areas with undetectable grids are suppressed while areas of grid pattern discontinuity are emphasized.
A second intermediate map is Heat Map B, aimed to be used later as a weighting factor in characterizing blocks as tampered or not. It is produced by inverting the best fitting grid map, so that locations of grid inconsistencies return high responses, and is depicted in Figure 6.
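The construction of the two intermediate maps can be sketched as follows, assuming the per-block responses of the 16 candidate grids are already available as an array of shape (16, rows, cols). The sign of the difference (mean response minus best-grid response) and the inversion rule (map maximum minus value) are assumptions chosen so that grid inconsistencies yield high values, as the text describes; the function names are hypothetical.

```python
import numpy as np

def window_mean_3x3(m):
    """Replace each non-overlapping 3x3 window by its mean (border windows may be smaller)."""
    m = np.asarray(m, dtype=float)
    out = np.empty_like(m)
    for r in range(0, m.shape[0], 3):
        for c in range(0, m.shape[1], 3):
            out[r:r + 3, c:c + 3] = m[r:r + 3, c:c + 3].mean()
    return out

def heat_maps(responses, best):
    """Return (Heat Map A, Heat Map B) from per-block responses of the 16 grids.

    responses : array of shape (16, rows, cols), one response map per candidate grid
    best      : index of the best detected grid
    """
    responses = np.asarray(responses, dtype=float)
    mean_map = window_mean_3x3(responses.mean(axis=0))
    best_map = window_mean_3x3(responses[best])
    heat_a = mean_map - best_map          # assumed order of the difference
    heat_b = best_map.max() - best_map    # assumed form of the inversion
    return heat_a, heat_b
```

With this convention, a block that responds strongly under every grid (for example, on a strong edge) gets a Heat Map A value that is low relative to a block where only the best grid loses its response.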

Content-aware filtering
Image content can severely interfere with the fragile traces of the JPEG block artifacts. It is evident while examining the heat map produced up to this stage ( Figure 6, Heat Map A) that an inexperienced user would have difficulty assessing the location of the actual tampering by inspecting the map.
In an effort to produce more reliable and interpretable outputs, we proceed with an extra computational step of coarse image segmentation based on image content. There are four types of disrupting image content we wish to be able to detect: 1. homogeneous areas, i.e. areas where the intensity gradient between neighboring pixels is near-zero, 2. over/under-exposed areas, 3. areas of high-edge contrast, and 4. areas of soft edges.
The problem with homogeneous areas is that, when JPEG encoding is applied on image parts of solid or near-solid colors that span multiple image blocks, the grid pattern is exceptionally weak or non-existent even for low quality encodings. Thus, to help us determine whether grid discontinuities (including complete absence or significantly weaker artifacts) are signs of tampering or simply homogeneous areas where the gradation of intensities between neighbouring pixels is very smooth, we produce a specialized map, depicted in Figure 6 as Help Map 1, in which we mark blocks that score consistently low (near-zero) over all 16 GP.
The classification of an area (in our case an 8 × 8 block) as over- or under-exposed is easily achieved by converting the image into the HSV space and using upper and lower thresholds, respectively, in the Value (V) channel. In our implementation, we empirically found that blocks with mean values higher than 95% of the channel maximum value (i.e. 0.95 · 255 ≈ 242 in our case) can be securely classified as over-exposed, while those with mean values lower than 5% can be classified as under-exposed (Help Map 3, Figure 6).
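A minimal sketch of this exposure classification is given below. It relies on the standard definition of the HSV Value channel as the per-pixel maximum of R, G and B, and assumes an 8-bit RGB input; the function name `exposure_maps` and the handling of partial border blocks (they are dropped) are illustrative choices.

```python
import numpy as np

def exposure_maps(rgb, hi=0.95, lo=0.05, vmax=255.0):
    """Classify 8x8 blocks as over- or under-exposed from the HSV Value channel.

    rgb : array of shape (H, W, 3) with values in [0, vmax]
    Returns two boolean arrays of shape (H // 8, W // 8).
    """
    rgb = np.asarray(rgb, dtype=float)
    v = rgb.max(axis=2)                      # HSV Value = max(R, G, B)
    h8, w8 = v.shape[0] // 8, v.shape[1] // 8
    # Mean V per 8x8 block; incomplete border blocks are ignored (assumption).
    block_mean = v[:h8 * 8, :w8 * 8].reshape(h8, 8, w8, 8).mean(axis=(1, 3))
    over = block_mean > hi * vmax
    under = block_mean < lo * vmax
    return over, under
```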
With respect to detecting areas of high-edge contrast and soft edges (points 3 and 4), we employ a novel efficient edge extraction scheme inspired by [26] that is able to adaptively classify the detected edges as salient or soft. To ensure consistent computational times and results, the input image is resized so that its largest dimension is 960 pixels (the smallest is scaled near-proportionately, while ensuring that it is a multiple of 8, to allow block-based tiling and filtering). The rescaled image is then tiled into non-overlapping 8 × 8 blocks that are independently processed by a set of 2-dimensional 8 × 8 edge detection kernels. The kernels are an adaptation of the kernel masks presented in [26]. In our implementation, the kernels are binary masks consisting of two regions (a dark and a light one), defining edges in 12 orientations at 15° increments. For each of these orientations, an appropriate number of instances represents all possible positions (2-pixel shifts) of the edge within the region of the kernel, resulting in a total of 58 kernels (Figure 6.B1).
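The resizing rule can be made concrete with a small helper. The text only states that the smaller dimension is scaled near-proportionately to a multiple of 8; rounding to the nearest multiple is an assumption here, and `rescaled_size` is a hypothetical name.

```python
def rescaled_size(width, height, target=960, block=8):
    """Target size with the largest side equal to `target` and the other a multiple of `block`."""
    def snap(value):
        # Nearest multiple of `block`, never smaller than one block (assumption).
        return max(block, int(round(value / block)) * block)

    if width >= height:
        return target, snap(height * target / width)
    return snap(width * target / height), target
```

For a 4000 × 3000 input this gives 960 × 720, and for a 1000 × 333 input it gives 960 × 320, a slight deviation from the exact aspect ratio in exchange for clean 8 × 8 tiling.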
Each image block B(i, j) is then processed by all 58 kernels in order to calculate an edge confidence score based on Equation 8.
where M_w and M_b are the numbers of white and black pixels in the kernel, respectively, and k̄_z(i, j) is the bitwise NOR for position (i, j) of kernel k_z, z ∈ [1, 58]. When all blocks have been processed by all kernels, the highest confidence score is stored for each block. In order to discriminate block edge responses into salient or soft, a thresholding step takes place. The image is divided into six areas (A-F), each of which is further divided into six sub-regions (a_1, a_2, ..., a_6, b_1, b_2, ..., b_6, ..., f_1, f_2, ..., f_6) as illustrated in Figure 7.
To determine a threshold value for each one of the smaller regions (second-level regions), we calculate (i) threshold T_img to be the mean confidence score over the whole image, (ii) thresholds (T_A, T_B, ..., T_F) to be the mean confidence scores of the tiles belonging to each first-level region, and (iii) (T_a1, ..., T_a6, T_b1, ..., T_b6, ..., T_f1, ..., T_f6) to be the mean confidence scores of each second-level region. Then, the threshold for each second-level region is selected to be the largest among the one calculated from the second-level region, the one calculated from the containing first-level region, and the overall image threshold. For instance, in the case of sub-region a_1, we would set T_a1 = max(T_a1, T_A, T_img).
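The two-level threshold selection can be sketched as below. The exact layout of the six areas and their six sub-regions is given in Figure 7; this sketch assumes a 2 × 3 split at both levels, which is an illustrative guess, and `adaptive_thresholds` is a hypothetical name. The input is the map of highest edge confidence scores, one value per block.

```python
import numpy as np

def _bounds(n, parts):
    """Integer split points dividing a length-n axis into `parts` near-equal pieces."""
    return np.linspace(0, n, parts + 1).round().astype(int)

def adaptive_thresholds(scores, rows=2, cols=3):
    """Per-block threshold map: max(sub-region mean, area mean, image mean)."""
    scores = np.asarray(scores, dtype=float)
    t_img = scores.mean()
    thr = np.full_like(scores, t_img)
    rb, cb = _bounds(scores.shape[0], rows), _bounds(scores.shape[1], cols)
    for i in range(rows):
        for j in range(cols):
            r0, r1, c0, c1 = rb[i], rb[i + 1], cb[j], cb[j + 1]
            area = scores[r0:r1, c0:c1]          # first-level region (A-F)
            if area.size == 0:
                continue
            t_area = area.mean()
            srb, scb = _bounds(area.shape[0], rows), _bounds(area.shape[1], cols)
            for p in range(rows):
                for q in range(cols):
                    sub = area[srb[p]:srb[p + 1], scb[q]:scb[q + 1]]  # second level
                    if sub.size == 0:
                        continue
                    thr[r0 + srb[p]:r0 + srb[p + 1],
                        c0 + scb[q]:c0 + scb[q + 1]] = max(sub.mean(), t_area, t_img)
    return thr
```

Taking the maximum of the three means keeps the threshold from dropping in regions with uniformly weak edges, so a locally "strongest" but globally weak edge is not promoted to salient.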
This thresholding process is important because it evaluates strong edges, not by an absolute number but locally, taking into account local image statistics. Applying the thresholding is crucial for the quality of the output maps, because these maps have scaled value ranges: this means that, in the absence of high-contrast edges, low-strength edges would be dominating the output heat map and would falsely indicate possible forgery. The proposed adaptive thresholding scheme scales the produced thresholds in relation to the overall contrast of the content and overcomes the issue.
The bottom part of Figure 6 illustrates the content-aware filtering part of the method. Specifically, Figure 6.B2 depicts the color-scaled illustration of the highest confidence scores C_k per block. Help Map 2 depicts the example maps resulting after the classification of the blocks as containing soft and strong edges, respectively. The first map presented under Help Map 3 shows the under-exposed blocks, while the second, being flat, indicates that no over-exposed blocks were found in this particular image.

Extracting the final output map
The final step of the method aims at producing a readable output, with clear contrast between tampered and untampered regions. To this end, it utilizes all intermediate information, i.e. Heat Maps A and B and Help Maps 1-3 (Figure 6).
In Heat Map A, blocks with high values probably belong to over/underexposed image regions, homogeneous regions or tampered regions, while blocks with low values are most likely unsuppressed responses of strong edges. Since the tampered region is expected to exhibit high values, we mark all blocks that range under the heat map mean as non-tampered (by temporarily setting them equal to zero). Next, we use Help Maps 1 and 3 to also mark blocks classified as homogeneous and over/under-exposed as non-tampered. The visualized output of this process is illustrated in Figure 6.C1.
The resulting map is then weighted by the inverse heat map of the best grid (Heat Map B), resulting in the heat map depicted in Figure 6.C2. This map could itself serve as the final output of the algorithm, as the highest values are expected to correspond to the tampered region. However, zero is an arbitrary choice of value with respect to the original value distribution of the heat map. As heat map visualizations are always relative in scale, if the original map values were high, the presence of zeros may result in an output that is almost binary, with zeroed regions on the one end, and all other blocks, tampered and untampered alike, on the other. To mitigate this issue, at the final step we replace all zeroed blocks with the mean value of those soft edge blocks (Help Map 2) that are not classified as homogeneous (Help Map 1). We have experimentally found this value to serve as a good approximation to the value range of untampered, non-zeroed regions. Zeroed and non-zeroed untampered block values are now brought to roughly the same range (Figure 6.C3), which should make the tampered region visually stand out in the heat map. The final output map, shown in Figure 6, is produced by mean filtering the map of Figure 6.C3.
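The steps of this section can be summarized in a short sketch operating on per-block maps of equal shape. Several details are assumptions made for illustration: the fill value is taken as the mean of the weighted map (Heat Map A times Heat Map B) over soft-edge, non-homogeneous blocks, and the final mean filter uses a 3 × 3 window with edge replication. The function names are hypothetical.

```python
import numpy as np

def box_mean_3x3(m):
    """3x3 mean filter with edge replication (window size is an assumption)."""
    p = np.pad(m, 1, mode="edge")
    h, w = m.shape
    return sum(p[r:r + h, c:c + w] for r in range(3) for c in range(3)) / 9.0

def final_map(heat_a, heat_b, homogeneous, exposed, soft_edge):
    """Combine Heat Maps A, B and Help Maps 1-3 into a readable output map."""
    heat_a = np.asarray(heat_a, dtype=float)
    heat_b = np.asarray(heat_b, dtype=float)
    homogeneous = np.asarray(homogeneous, dtype=bool)   # Help Map 1
    soft_edge = np.asarray(soft_edge, dtype=bool)       # Help Map 2 (soft edges)
    exposed = np.asarray(exposed, dtype=bool)           # Help Map 3 (over or under)

    # Blocks below the Heat Map A mean, homogeneous or badly exposed are "zeroed".
    keep = (heat_a >= heat_a.mean()) & ~homogeneous & ~exposed
    weighted = heat_a * heat_b                          # weighting by Heat Map B

    # Replace zeroed blocks by a value representative of untampered content.
    ref = soft_edge & ~homogeneous
    fill = weighted[ref].mean() if ref.any() else 0.0
    return box_mean_3x3(np.where(keep, weighted, fill))
```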

Inverse discontinuity detection
The proposed method, as described in section 3.2, assumes that the discontinuities will appear as areas of lower responses, in terms of K, relative to the rest of the image's responses, during the search for the best fitting grid. The relative strength of the responses is, however, strongly affected by the compression Quality Factor of the host (QF_h), the QF of the alien splice (QF_s) and the final compression QF of the composite image (QF_f).
Consider, for instance, the following relatively common scenarios: i) QF_h is high (weak artifacts), the splice comes from an image with QF_s < QF_h and, for the final compression quality, we have QF_f > QF_h > QF_s, and ii) the host image is compressed losslessly (QF=100), the splice is JPEG compressed, and the composite image is again saved in lossless format. In both of these cases, discontinuities will appear as areas of high K response relative to the overall low responses calculated in the image.
In order to account for cases like these, we introduced an additional branch to the method that produces a complementary output map. More specifically, at the last stage of the algorithm, when extracting the final output map, instead of filtering (marking as zero) the blocks in Heat Map A that fall below the map's mean, we now filter those that exceed that value. As before, we also mark the homogeneous and over/under-exposed blocks, and proceed by assigning to the zeroed blocks the mean value of the soft edge blocks (Help Map 2) that are not classified as homogeneous (Help Map 1).
The complementary output produced by this straightforward, inverted filtering can be presented to end users along with the original output allowing them to choose the most appropriate result based on visual inspection. We refer to this output as inv-CAGI.
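The only difference between the two branches is the direction of the comparison against the mean of Heat Map A. A minimal sketch of that selection step is given below; the function name and the `inverse` flag are hypothetical, and the maps are assumed to be block-level arrays of equal shape.

```python
import numpy as np


def non_tampered_mask(heat_a, homogeneous, exposed, inverse=False):
    """Blocks to be marked as non-tampered (zeroed) before weighting.

    inverse=False: standard branch, blocks below the mean are filtered.
    inverse=True:  inv-CAGI branch, blocks above the mean are filtered.
    """
    heat_a = np.asarray(heat_a, dtype=float)
    mean = heat_a.mean()
    by_value = heat_a > mean if inverse else heat_a < mean
    # Homogeneous and over/under-exposed blocks are filtered in both branches.
    return by_value | np.asarray(homogeneous, dtype=bool) | np.asarray(exposed, dtype=bool)
```

Running both branches on the same intermediate maps yields the two candidate outputs that are presented side by side to the user.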

Evaluation
We evaluate CAGI through a number of experiments, which provide insight into its potential for blind tampering localization. The first set of experiments demonstrates the effectiveness of the method for controlled scenarios on a synthetically tampered dataset from the literature. Next, we evaluate CAGI on more realistic scenarios, where details concerning the image history and applied transformations are unknown. In all evaluations the method is directly compared to seven state-of-the-art methods (Table 1).
With respect to the methods described in Section 2, ADQ1 was selected to represent approaches that base their detection on double quantization. ADQ1 has the advantage of being able to operate on images that had been compressed as JPEG, and were then decompressed and stored in PNG, which is the case in some datasets. In contrast, ADQ2, ADQ3 and NADQ can only operate using JPEG images as input, since they require specific information derived from the JPEG file, such as the decompression rounding residue or the quantization matrix used for the last compression. Since some datasets contain PNG images which carry the traces of past JPEG compressions but have already been decompressed, that information is essentially lost and these algorithms cannot work. Also, GHO is not part of the selected methods because it produces several output maps per case, thus requiring manual investigation of the changes between the different maps in order to locate the forgery. Finally, given the expected limited applicability of methods that search for disturbances in CFA patterns, only results from CFA1 are presented as indicative of such methods.

Table 1: Reference methods used for comparison.
Acronym | Description
DCT [18] | Looks for inconsistencies in the JPEG DCT coefficient histograms to detect possible tampering.
BLK [8] | Identifies possible tampering by locating inconsistencies in the JPEG blocking artifacts.
ADQ1 [9] | Tampering localization is achieved by exploiting the characteristics of double DCT quantization.
NOI1 [22] | Models image noise using wavelet filtering and treats localized variances as possible forgeries.
NOI2 [23] | Models image noise using the properties of the kurtosis of frequency sub-band coefficients in natural images.
NOI3 [24] | Computes a local co-occurrence map of the quantized high-frequency component of the image and locates inconsistencies in the local statistical properties.
CFA1 [12] | Models the Color Filter Array interpolation patterns as a mixture of Gaussian distributions and locates tampering based on detected disturbances.

Datasets
Table 2 lists the employed datasets. The first dataset employed in this study is the synthetic dataset by Fontani et al. [25]. It contains 4800 original and 4800 tampered images, which were generated by automatically extracting a fixed-size square from the center of the image and placing it back in the image, emulating the effects of a splice (e.g. removing the traces of JPEG compression, or changing the JPEG grid alignment). The tampered images of the dataset are split into four distinct classes, each containing a different type of forgery (Table 3). Thus, depending on the class, a forgery should theoretically be detectable by different combinations of Non-Aligned JPEG quantization, Aligned JPEG quantization and JPEG Ghost, while other algorithms may also be able to localize certain forgeries.
Next, we employ the First IFS-TC Image Forensics Challenge training set [27], a dataset containing user-submitted forgeries and their ground-truth masks. The dataset was designed to serve as a realistic benchmark (different types of tampering, unknown image history and possible post-tampering transformations). While images in this set are saved as PNG, it is likely most of them were originally in JPEG format, since they exhibit traces of past compressions (e.g. blocking artifacts or DCT coefficient histogram periodicities). Therefore, splices may be detectable using JPEG-based methods.
Finally, we experiment with the Wild Web Dataset [28], which contains 78 cases of real-world forgeries. As the forgeries have been circulating across various websites and social media platforms, there exist multiple versions of each forgery, due to resavings, croppings, and other transformations. The Wild Web Dataset was formed by collecting a large number of different versions of each forgery, resulting in a set of 10,646 images.

Evaluation Metrics
Table 3: Fontani et al. [25] synthetic dataset classes.
Class | Description | Traces
Class 1 | Region is cut from a JPEG image and pasted, breaking the 8x8 grid, into an uncompressed image; the result is saved as JPEG. | Misaligned JPEG compression
Class 2 | Region is taken from an uncompressed image and pasted into a JPEG image; the result is saved as JPEG. | Double quantization, JPEG ghost
Class 3 | Region is cut from a JPEG image and pasted into an uncompressed image in a position multiple of the 8x8 grid; the result is saved as JPEG. | JPEG ghost
Class 4 | Region is cut from a JPEG image and pasted (without respecting the original 8x8 grid) into a JPEG image; the result is saved as JPEG. | Misaligned JPEG compression, Double quantization, JPEG ghost

Each of the tested methods produces an output map (in the form of a heat map) that can be used to localize tampered areas in the image. For our first test, we evaluate the methods' ability to correctly classify tampered images following the methodology proposed in [28].
The datasets provide binary ground truth masks for all tampered images, while for each untampered image we use an artificial ground truth mask, similar to [21] and [25], which corresponds to a block of size 1/4 of each dimension, placed in the image center. The Kolmogorov-Smirnov (KS) statistic is used to compare the value distributions of the two regions of the mask (tampered/untampered):
$$\mathrm{KS} = \max_u \left| C_1(u) - C_2(u) \right|$$
where C_1(u) and C_2(u) are the cumulative probability distributions inside and outside the mask, respectively. If KS surpasses a threshold, a positive detection is declared. ROC curves are calculated by shifting the threshold for each algorithm, and evaluating how many images return positives in the tampered and untampered subsets. This methodology is appropriate for datasets that contain both tampered and untampered images, and sets a baseline against overestimation of a method's ability to localize tampering.
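The KS-based detection test can be sketched as follows, using the standard two-sample form of the statistic (the maximum absolute difference between the two empirical cumulative distributions). The function names are hypothetical, and the rounding used for the central block of the artificial mask is an assumption.

```python
import numpy as np


def center_mask(shape):
    """Artificial mask for untampered images: central block, 1/4 of each dimension."""
    rows, cols = shape
    h, w = max(rows // 4, 1), max(cols // 4, 1)
    top, left = (rows - h) // 2, (cols - w) // 2
    mask = np.zeros(shape, dtype=bool)
    mask[top:top + h, left:left + w] = True
    return mask


def ks_statistic(heat_map, mask):
    """Max absolute difference between the empirical CDFs inside/outside the mask."""
    heat_map = np.asarray(heat_map, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    inside = np.sort(heat_map[mask])
    outside = np.sort(heat_map[~mask])
    points = np.concatenate([inside, outside])
    c1 = np.searchsorted(inside, points, side="right") / inside.size
    c2 = np.searchsorted(outside, points, side="right") / outside.size
    return float(np.max(np.abs(c1 - c2)))


def roc_point(ks_tampered, ks_untampered, threshold):
    """True/false positive rates when a detection is declared for KS > threshold."""
    tpr = float(np.mean(np.asarray(ks_tampered) > threshold))
    fpr = float(np.mean(np.asarray(ks_untampered) > threshold))
    return tpr, fpr
```

Sweeping `threshold` over the observed KS values and collecting the resulting (FPR, TPR) pairs gives one ROC curve per algorithm.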
A far more precise metric, in terms of evaluating localization quality and output readability, is based on the pixel-wise agreement between the reference mask and the heat map produced by each method. In this case, only tampered images are evaluated, and the quality of the response is measured in terms of the achieved F-score (F1). This methodology requires the output maps to be thresholded prior to any evaluation. Since the range of values of the output maps varies for each algorithm, and in an effort to be fair, we first normalize all maps to the [0, 1] range and proceed by successively shifting the binarization threshold in 0.05 increments, calculating the achieved F1 score at every step. The performance is presented in the form of F1 curves.
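A minimal sketch of how such an F1 curve could be computed for a single image is shown below. The function name is hypothetical, min-max scaling is assumed for the normalization, and a pixel is assumed to be marked as tampered when its normalized value is greater than or equal to the threshold.

```python
import numpy as np


def f1_curve(heat_map, mask, step=0.05):
    """F1 score of the binarized heat map for thresholds 0, step, ..., 1."""
    heat_map = np.asarray(heat_map, dtype=float)
    mask = np.asarray(mask, dtype=bool)

    # Normalize the map to [0, 1] (assumed min-max scaling).
    span = heat_map.max() - heat_map.min()
    norm = (heat_map - heat_map.min()) / span if span > 0 else np.zeros_like(heat_map)

    thresholds = np.arange(0.0, 1.0 + step / 2, step)
    scores = []
    for t in thresholds:
        pred = norm >= t
        tp = np.sum(pred & mask)    # tampered pixels correctly marked
        fp = np.sum(pred & ~mask)   # untampered pixels marked as tampered
        fn = np.sum(~pred & mask)   # tampered pixels missed
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return thresholds, np.array(scores)
```

Averaging the per-image score arrays over a dataset gives the mean F1 value per binarization step, which is what an F1 curve plots.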

Experimental Results
This section includes the experimental results per dataset, evaluated with both the KS and F1 metrics. To keep the presentation compact and to the point, we focus on three of the reference methods that yield overall good results, while producing some of the clearest tampering localization heat maps. These are blocking artifact discontinuities (BLK), aligned double quantization (ADQ1) and SpliceBuster (NOI3). The experimental evaluation and comparison with the rest of the reference methods is given more concisely in section 5.4, where we discuss the overall performance. Outputs in the form of heat maps produced by all methods employed in this paper, on various images from the realistic datasets, are available in Figure 14.

Results on the synthetic dataset by Fontani et al.
The dataset by Fontani et al. is synthetically generated, allowing us to test the effectiveness of methods on different types of forgery. Figure 8(a) presents the experimental results using the first evaluation methodology over the whole collection. CAGI is overall one of the best performing methods together with BLK, achieving approximately 70% true positive rate at a 5% false positive rate. Figure 9 presents the per-class results for the CAGI, BLK, NOI3 and ADQ1 methods. In Classes 1 and 4, where the tampered images carry traces of misaligned JPEG compressions, i.e. the principle around which CAGI is designed, the method demonstrates competitive results and is only outperformed by ADQ1 in Class 4, where double quantization traces are also present. Interestingly, however, CAGI also manages to rank among the best performing methods for the other two classes.
Along with the class of tampering, a second factor comes into play concerning the robustness of the detection: the QF of the host image in relation to the final compression QF. The host images of this dataset were acquired in lossless format and (depending on the class) were compressed with varying compression qualities QF_1 (40-80). After the splicing operation, the resulting images are recompressed into JPEG.
In CAGI, discontinuities of the image grid appear as lower responding areas in the heat map of the best responding grid and the heat map of the mean response of all tested grids. Class 4 is completely in line with CAGI's design: the misaligned JPEG splice can generally be traced easily. For Class 1, the localization of the misaligned patch is also relatively easy to achieve when the host is compressed with a low QF. However, as the QF of the final compression increases, the area that was initially uncompressed (host) only gets light artifacts after compression, while the double pattern within the spliced region is also degraded. This makes the detection vulnerable to responses derived from content.
CAGI is much more robust for tampered images of Class 2, since it can rely on artifacts that are already present in the host. The tampered area has lower responses due to the higher final QF_2 compression. Misses occur only in cases of extreme content-related responses that the employed content-aware process fails to account for.
Class 3 is the most challenging for CAGI. The tampered patch is aligned to the grid created by the final compression and thus, for lower QF, both mean and best grid responses will highlight the tampered region with higher values. In this case the method produces inverted maps, compared to what it was designed for. Depending on the QFs and the image content, this may not be an issue; the tampered and untampered regions will simply appear inverted in the final output. In many cases, however, the operations that take place next, implemented with the intention of suppressing responses corresponding to high frequency content, may falsely treat the detection as edges.
Inverse CAGI (inv-CAGI), as described in Section 3.5, was implemented to account for such cases, which are in fact quite common. The curves in Figure 10 attest to the added value of the inv-CAGI variant of our method.
Moving on to the evaluations of the localization and readability quality of the maps, Figure 8(b) presents the mean F1 scores per binarization step over the whole Fontani et al. collection. The achieved localization is evaluated by the maximum mean F1 score for each method (at its respective best performing binarization threshold). CAGI once more achieves one of the best reported performances.
A good indicator of a method's result interpretability is when, for a wide range of binarization threshold values, the achieved F1 remains high. This suggests that the tampered and untampered image regions are characterized by significantly different values in the output maps. ADQ1, which essentially produces almost binarized outputs by design, is an indicative example of good readability. In the Fontani et al. dataset, ADQ1 manages to achieve good localization (mostly due to the very high performances in Classes 2 and 4) making it the best performing approach in the dataset. CAGI is a close second in terms of readability. On the other hand, BLK, which was the most competitive method in the previous evaluation, has significantly lower F1.

Results on the First IFS-Challenge dataset
The Challenge dataset, being the first attempt to produce a realistic benchmark, is much harder to tackle by any single method. The performance evaluations presented in Figure 11(a) support this statement: there are very few detections for most algorithms at a 0% false positive rate, and even when relaxing the threshold, the true positive detection rate increases slowly. Thus, any contribution in terms of unique detections and/or readable outputs is of great importance in this dataset (and any other realistic dataset). Figure 11(b) presents the calculated mean F1 scores on this dataset for all competing methods. Again, in comparison with the rest of the methods, CAGI reports one of the highest F1 scores and good readability, as it achieves high F1 scores over a wide range of thresholds. Table 4 reports the best localized detections achieved per method. The detection threshold was set to 0.7 and the search was performed for the best binarization step for each method. Unique corresponds to the number of detections exclusively achieved by that method. NOI3 has the greatest contribution in this dataset, followed by CAGI and BLK.

Results on the Wild Web dataset
As the Wild Web dataset does not contain untampered images, the evaluations can only be performed based on the pixel-level localization accuracy on the tampered images. Figure 12 reports the mean F1 scores calculated over the whole collection (10,646 images). Even though the values of F1 are very low for all methods, one should take into account the fact that the collection consists entirely of actual forgeries sourced from the Web. The dataset is organized into 78 cases of confirmed forgery. For each case, reverse-image search engines (Google and TinEye) were used to collect as many near-duplicate instances as possible from the Web. This means that the number of instances for each case varies. Some cases have as few as 7 instances, while others have more than 200. When a case that has many instances is not detectable by a method, it severely affects the calculated F1 score. Thus, the F1 curves should be seen as a general comparison tool for evaluating the localization and readability quality of the methods. In this light, CAGI is once more the top performing method.
The performance of methods in the Wild Web set is also evaluated in terms of achieved detections and contribution with unique cases. As in [28], a detection is classified as correct when, for at least one instance of a given sub-case, the method produces an F1 score higher than a set threshold. Table 5 reports the correctly localized case detections for F1 >= 0.7 and F1 >= 0.8. Detections corresponds to the number of cases detected by the respective method, Unique corresponds to the number of cases detected exclusively by that method, and PENS (Perfect ENsemble Sum) corresponds to a theoretical perfect ensemble, where at least one method achieved detection (i.e. essentially the total number of cases detected out of the initial 78).
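The three quantities reported per threshold (Detections, Unique, PENS) reduce to simple set operations. The sketch below assumes a hypothetical input structure in which, for every method, the best F1 score over all instances of each case has already been computed; the method names and scores in the usage example are made up for illustration and are not results from the paper.

```python
def summarize_detections(best_f1, threshold=0.7):
    """best_f1: {method: {case_id: best F1 over all instances of that case}}."""
    # Cases each method detects at the given F1 threshold.
    detected = {
        method: {case for case, f1 in cases.items() if f1 >= threshold}
        for method, cases in best_f1.items()
    }
    summary = {}
    for method, cases in detected.items():
        # Cases detected by any other method.
        others = set().union(*(d for m, d in detected.items() if m != method))
        summary[method] = {"detections": len(cases), "unique": len(cases - others)}
    # PENS: cases detected by at least one method.
    pens = len(set().union(*detected.values()))
    return summary, pens


# Illustrative (made-up) scores for three hypothetical methods and three cases.
example = {
    "method_a": {1: 0.90, 2: 0.75, 3: 0.10},
    "method_b": {1: 0.80, 2: 0.30, 3: 0.20},
    "method_c": {1: 0.10, 2: 0.20, 3: 0.72},
}
summary, pens = summarize_detections(example, threshold=0.7)
```

Note that the per-method detection counts generally add up to more than PENS, since a case detected by several methods is counted only once in the ensemble.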
The contribution of overall detections as well as unique detections for the CAGI method (and its variant inv-CAGI) is clearly highlighted by the results. Moreover, the results indicate that the detections (i.e. F1 scores exceeding the threshold) remain prominent for a wider range of thresholds compared to competing methods. This means that in the output maps of CAGI, the value difference between the tampered and the untampered area is greater, making the visual output more striking and easy to interpret by non-experts.

Overall performance
Figure 13 summarizes the recorded performance of all methods on the three employed evaluation criteria: i) the ability of a method to retrieve true positives of tampered images at a low level of false positives (KS@0.05); ii) the ability to achieve good localization of the tampered region within the image (F1); and iii) the readability of the produced heat map, i.e. a high distinction of assigned values for pixels belonging to tampered versus untampered regions, expressed as the range of different binarization thresholds that result in high F1 scores (> 70% of the respective maximum F1 score).
At this point some overall remarks concerning the last two criteria should be made. The proposed method not only performs among the top methods concerning the F1 score in all three datasets, but also has a good (wide) range of possible binarization thresholds that lead to a high F-score for all tested datasets. This attribute is of great importance since it could be leveraged within an automated binarization process. For instance, CAGI can be expected to produce close to optimum F1 score (localization) for a threshold of 0.4 or 0.5, and inv-CAGI for higher values of 0.8 and above. Other methods, e.g. BLK, have ranges that fall into completely different values in the three datasets, or have limited binarization levels to choose from (e.g. NOI2, NOI3).
[...] and calculate the average of the three metrics per method, so as to get an indication of the combined performance (detection, localization, readability) of all methods against one another, per dataset. Additionally, Table (c) reports the ranking results (winners, scores, ordered lists) as calculated using three popular aggregation methods from the literature: Borda count (footnote 3), Copeland (footnote 4), and Kemeny-Young (footnote 5).
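The readability criterion can be derived directly from an F1 curve. The sketch below is one possible reading of it, expressing the range as the fraction of tested binarization thresholds whose F1 exceeds 70% of the curve's maximum; the function name and this particular normalization are assumptions, and the numbers in the test values are illustrative only.

```python
def readability_range(thresholds, f1_scores, fraction=0.7):
    """Share of binarization thresholds whose F1 exceeds fraction * max(F1)."""
    best = max(f1_scores)
    if best <= 0:
        return 0.0  # no usable localization at any threshold
    good = [t for t, f1 in zip(thresholds, f1_scores) if f1 > fraction * best]
    return len(good) / len(thresholds)
```

A value close to 1 indicates that almost any threshold gives near-optimal localization, which is the property that makes a fixed binarization threshold viable.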
The proposed method is steadily among the top (first or second) in the averaged results, and the winner in six out of nine rankings (and a close second in the three remaining rankings). Overall, the results of Figure 13 attest to the robustness and versatility of the proposed method. The method manages to maintain high performance in all three tested metrics across all datasets. The method achieves a very good balance between detection and localization of forgeries, as well as high readability of its outputs. Another advantage of the method is that it does not require parameter selection (since a reasonable choice for the binarization threshold works very well across all datasets). This makes it ideal for use in practical settings by non-experts. In fact, the method has been integrated in a web-based image forensics service that has been co-designed and tested in realistic settings by journalists and media experts [29].
Footnote 3: Borda voting is a widely used scheme for ranking candidates. Suppose there are k candidates. Voters submit their rankings and candidates receive k-1 points for every first place, k-2 points for every second place, and so on. Since every point a candidate receives may be considered a head-to-head vote against some other candidate, Borda scores are equal to the total number of head-to-head votes a candidate receives.
Footnote 4: Copeland voting follows a paired-comparison scheme where candidates are scored by their win-loss record across all head-to-head competitions. The winner(s) are those that win the most pairwise runoffs.
Footnote 5: Kemeny-Young voting emphasizes decisive wins over smaller majority margins. More specifically, it finds the preference sequence that maximizes the Kemeny-Young score. For a sequence (K_1, ..., K_N), the Kemeny-Young score is the sum of C(K_i, K_j) over all 1 <= [...]
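To make the Borda and Copeland schemes concrete, the following sketch scores a set of candidates from a list of rankings. It assumes that each ranking is a complete ordering of the same candidates, best first (for example, one ranking per evaluation metric, which is our reading rather than something stated explicitly), and that Copeland scores are computed as pairwise wins minus pairwise losses. The candidate names are placeholders.

```python
from itertools import combinations


def borda(rankings):
    """k-1 points for a first place, k-2 for a second place, and so on."""
    k = len(rankings[0])
    scores = {candidate: 0 for candidate in rankings[0]}
    for ranking in rankings:
        for place, candidate in enumerate(ranking):
            scores[candidate] += k - 1 - place
    return scores


def copeland(rankings):
    """Pairwise wins minus pairwise losses over all head-to-head contests."""
    candidates = list(rankings[0])
    scores = {candidate: 0 for candidate in candidates}
    for a, b in combinations(candidates, 2):
        a_wins = sum(r.index(a) < r.index(b) for r in rankings)
        b_wins = len(rankings) - a_wins
        if a_wins > b_wins:
            scores[a] += 1
            scores[b] -= 1
        elif b_wins > a_wins:
            scores[b] += 1
            scores[a] -= 1
    return scores


# Three placeholder candidates ranked by three voters (best first).
rankings = [["X", "Y", "Z"], ["X", "Z", "Y"], ["Y", "X", "Z"]]
borda_scores = borda(rankings)
copeland_scores = copeland(rankings)
```

In this toy example both schemes order the candidates X, Y, Z; the two can disagree when one candidate collects many narrow pairwise wins while another collects fewer but more lopsided ones.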

Conclusion
The paper presented a novel tampering localization method based on JPEG blocking artifacts discontinuities for detecting splices. The key design goals for the method have been high robustness over a variety of forgery cases, achievement of successful detections in cases where other algorithms fail, and the generation of "clean" outputs that are easy to interpret by non-experts.
Experiments were performed on both synthetic datasets and realistic/real tampering cases, and the proposed method was directly compared to seven state-of-the-art techniques, representing different classes of forensic analysis. Experimental results across all datasets demonstrated that the method is robust in terms of localization accuracy and readability of the produced outputs.
More importantly, since the reported detections contributed by our method during the experimental evaluation include many unique cases (i.e. where other algorithms fail), we conclude that including it in an ensemble forensics analysis system would significantly improve its detection performance. This is a direction we plan to explore in the future.

Figure 14: Heat maps produced by the methods. Input images in columns 1-4 are taken from the Challenge dataset and columns 5-7 from the Wild Web dataset. For the proposed method, the outputs shown for input images 1-4 and 6 are produced by CAGI, while for images 5 and 7 they were produced by inv-CAGI. The tampered part in each case is drawn using a white outline on all heat maps.