LANBIQUE: LANguage-based Blind Image QUality Evaluation

Image quality assessment is often performed with deep networks that are fine-tuned to regress a human-provided quality score for a given image. This approach often lacks generalization capabilities: while highly precise on images drawn from a similar distribution, it may yield lower correlation on unseen distortions. In particular, such networks perform poorly when images corrupted by noise, blur, or compression have been restored by generative models. As a matter of fact, the evaluation of these generative models is often performed by providing anecdotal results to the reader. In the case of image enhancement and restoration, reference images are usually available. Nevertheless, using signal-based metrics often leads to counterintuitive results: highly natural, crisp images may obtain worse scores than blurry ones. Conversely, No-reference image assessment may rank images reconstructed with GANs higher than the original undistorted images. To avoid time-consuming human-based image assessment, semantic computer vision tasks may be exploited instead. In this article, we advocate the use of language generation tasks to evaluate the quality of restored images. We refer to our assessment approach as LANguage-based Blind Image QUality Evaluation (LANBIQUE). We show experimentally that image captioning, used as a downstream task, may serve as a method to score image quality, independently of the distortion process that affects the data. Captioning scores align better with human rankings than classic signal-based or No-reference image quality metrics. We provide insights on how the corruption of local image structure by artefacts may steer image captions in the wrong direction.


INTRODUCTION
In the past years, models able to generate novel images by implicitly sampling from the data distribution have been proposed [16]. While these models are extremely appealing, generating, for example, photo-realistic faces [22] or landscapes [35], they are hard to evaluate. Often, anecdotal qualitative examples are presented to the reader with little quantitative and objective evidence, and how to evaluate generative models is still a matter of debate. The idea of using a computer vision classifier to evaluate the veracity of a generated image was first proposed in Reference [39]. The authors propose the Inception Score (IS), which is obtained by applying the Inception model [42] to every generated image to obtain the conditional label distribution p(y|x). Realistic images should contain one or a few well-defined objects, therefore leading to a low entropy in the conditional label distribution p(y|x). An improved evaluation metric, named Fréchet Inception Distance (FID), has been proposed by Reference [18]. The authors show that FID is more consistent than the Inception Score with increasing disturbances and with human judgment. FID performs better as an evaluation metric, since it also exploits the statistics of the real images.
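For concreteness, the definition of IS can be sketched in a few lines. This is an illustrative implementation of the score's formula, IS = exp(E_x[KL(p(y|x) || p(y))]), not the reference code of [39]; it assumes the classifier's conditional label distributions are already available as a matrix.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from a matrix of conditional label distributions
    p(y|x), one row per generated image.

    probs: (N, K) array; each row is a probability distribution over K labels.
    """
    probs = np.asarray(probs, dtype=np.float64)
    p_y = probs.mean(axis=0)  # marginal label distribution p(y)
    # Per-image KL divergence KL(p(y|x) || p(y)); zero entries contribute 0.
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

With maximally confident, diverse predictions over K classes the score approaches K; with uniform predictions it equals 1, matching the intuition that realistic, varied images yield a high IS.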
Recently, References [6, 25, 41] have specifically addressed methods to evaluate Generative Adversarial Networks (GANs). Reference [41] proposed two methods that evaluate the diversity and quality of generated images using classifiers trained and tested on generated images. In Reference [5] the authors trained an Auxiliary Classifier GAN to generate new distorted samples, used to train a shallow quality evaluator, thereby addressing the lack of data in standard datasets. In Reference [6] a discussion of 24 quantitative and 5 qualitative measures for evaluating generative models is provided, including IS and FID, image retrieval, and classification performance.
Apart from generating new images, GANs can be effectively used to enhance the visual quality of images that have been subjected to some degradation, such as noise or compression. In this use case, the generator network is conditioned on the degraded input, and it produces an enhanced version. In Reference [25] it is observed that many existing image quality assessment (IQA) algorithms do not correctly assess GAN-generated content, especially when considering textured regions; this is due to the fact that although GANs generate very realistic images that may look like the original one, they match it poorly when considering pixel-based metrics. The proposed metric, called SSQP (Structural and Statistical Quality Predictor), is based on the "naturalness" of the image.
Subjective metrics, such as the Mean Opinion Score, are obtained by presenting images to several human evaluators and asking each for a subjective score of the image quality. Such a means of measuring image quality is possibly the best choice, but it has the obvious drawback of requiring human annotators, with the related cost in terms of time and money to rank a high volume of data.
Regarding the evaluation of image enhancement methods, semantic computer vision tasks have only recently been proposed for image quality assessment. The motivation behind this choice is twofold. On the one hand, since images are often processed by algorithms, it is intrinsically interesting to evaluate the performance of such algorithms on degraded and restored images; in this regard, it has to be noted that MPEG leads an activity on Video Coding for Machines (VCM), which aims to standardize video codecs for the case where videos are consumed by algorithms. On the other hand, we assume that semantic computer vision tasks lead to a more robust evaluation protocol. In previous works, object detection and segmentation have been used to assess image enhancement [13, 14, 51].
In this article, we introduce a novel image quality assessment method based on language models. To the best of our knowledge, language has never been used to evaluate the quality of images. We refer to the new approach as LANguage-based Blind Image QUality Evaluation (LANBIQUE). Figure 1 shows the gist of the proposed approach: the effects of image compression lead to a wrong caption for the image on the left with respect to the original high-quality image on the right; captioning an image obtained by enhancing the compressed image with a GAN-based approach (center) leads to a caption that is very similar to the caption of the high-quality image. The main contributions of our work are the following:
• LANBIQUE shows consistency across different captioning algorithms [2, 11] and language similarity metrics. Interestingly, improving the language generation model also improves the correlation between our score and MOS.
• Experiments show that LANBIQUE does not suffer from the drawbacks of common Full-reference and No-reference metrics when evaluating GAN-enhanced images, and it keeps a high accordance with human scores both for compressed images and for images restored via deep learning.
In this extended version, we propose the following improvements with respect to Reference [15]:
• We show that LANBIQUE can also be used for distortions different from JPEG compression.
• We test LANBIQUE on the larger and more diverse PieAPP dataset, showing strong results against both learning-based and non-learning-based methods.
• Finally, the basic version of LANBIQUE is extended to work without a reference image. To this end, we employ a blind restoration GAN, which can restore images without knowledge of the type or intensity of the distortion, to recover a pseudo-reference image.
The rest of the article is organized as follows: In Section 2, we describe the related works. In Section 3, we briefly discuss prior GAN-based image restoration approaches. In Section 4, we describe LANBIQUE in detail. In Section 5, we show experimental results of LANBIQUE in different settings and on different datasets. In Section 6, we draw conclusions about our approach.

RELATED WORK
Full-reference quality assessment. When dealing with image restoration tasks, a reference image is often available to perform evaluation. Full-reference image quality assessment is an evaluation protocol that uses a reference version of an image to compute a similarity. Popular metrics are Peak Signal-to-Noise Ratio (PSNR) and Mean Squared Error (MSE). However, these metrics have often been criticized because they are not consistent with the human-perceived quality of images [49]. SSIM, a metric of structural similarity, has been proposed to overcome this limitation. Unfortunately, as will be shown in the following, even SSIM is too simplistic to capture the human-perceived quality of images; moreover, distortion metrics have been shown to be at odds with high perceptual quality. Blau and Michaeli [4] propose a generalization of rate-distortion theory that takes perceptual quality into account, and they study the three-way tradeoff between rate, distortion, and perception. The authors show that aiming at high perceptual quality leads to an elevation of the rate-distortion curve and thus requires sacrificing either the distortion or the rate of the algorithm.
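The two signal-based metrics mentioned above can be computed as follows; this is a minimal sketch assuming 8-bit images represented as numeric arrays, with PSNR expressed in decibels over the peak value.

```python
import numpy as np

def mse(ref, img):
    """Mean Squared Error between a reference image and a test image."""
    ref = np.asarray(ref, dtype=np.float64)
    img = np.asarray(img, dtype=np.float64)
    return float(np.mean((ref - img) ** 2))

def psnr(ref, img, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB; higher means closer to the reference."""
    err = mse(ref, img)
    if err == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / err)
```

As the article argues, such pixel-wise scores say nothing about perceptual quality: a crisp GAN reconstruction that shifts texture details can score lower than a uniformly blurry image.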
No-reference quality assessment. No-reference image assessment techniques are devised for the realistic scenario in which image quality must be estimated without access to an original high-quality or uncompressed version of the image itself. Recent No-reference image quality assessment methods are based on natural scene statistics (NSS), computed in the spatial domain. Instead of extracting distortion-specific statistics, such as the amount of blur or ringing in an image, they look at the statistics of locally normalized luminance to estimate the loss in image naturalness. These metrics are designed and optimized to be highly correlated with human subjective metrics. Pei and Cheng [36] train a random forest for IQA using features extracted from difference of Gaussian (DoG) bands and demonstrate that it correlates highly with the human visual system. Lukin et al. [29] fuse the outcomes of several quality assessment systems by training a neural network. Kim and Lee [24] propose a Full-reference framework that aims to learn human visual sensitivity by leveraging distorted images, objective error maps, and subjective scores. Bosse et al. [7] propose a learned approach for image quality assessment that incorporates an optional joint optimization of weighted-average patch aggregation, implementing a method for pooling local patch qualities into a global image quality. In Reference [28] Liu et al. address the lack of data in standard IQA datasets with a siamese network that learns from rankings, obtaining impressive results. In the past few years, with the advent of new large datasets [19, 37] for image quality assessment, No-reference and Full-reference transformer-based approaches have been deployed, obtaining very high performance [10, 52].

IMAGE RESTORATION
Even if this work does not propose novel image restoration approaches, to make the article self-contained we formalize here the image restoration, or enhancement, task. The main motivation that led us to work on an alternative to standard image quality assessment is the poor performance of standard IQA methods on images that have been enhanced by GANs, e.g., for denoising [23, 44], deblurring [43, 53], or compression artefact removal [14, 30, 45]. Furthermore, we leverage image restoration as a tool to extend the capabilities of LANBIQUE to evaluate images that lack an uncorrupted high-quality counterpart, extending our approach to the No-reference scenario, as shown in Section 4.3.
Problem formulation. Given some image processing algorithm D, such as JPEG image compression, a distorted image is defined as I_LQ = D(I_HQ), where I_HQ is a high-quality image undergoing the distortion process; image enhancement aims at finding a restored version of the image I_R ≈ G(I_LQ). In this work, we use two image enhancement networks: one that is specific to JPEG artefacts [14], and a more generic approach that can work without prior knowledge of the degradation [47].
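The formulation above can be illustrated with a toy example, where a uniform quantizer plays the role of the distortion D and a trivial bin-centre shift stands in for the learned restorer G. Both functions are illustrative placeholders, not the actual models of [14, 47].

```python
import numpy as np

def distort(i_hq, qstep=32):
    """A toy distortion process D: uniform quantization of pixel values,
    loosely mimicking the information loss of heavy compression."""
    return (np.asarray(i_hq, dtype=np.float64) // qstep) * qstep

def restore(i_lq, qstep=32):
    """A toy restorer G: shift each value to its quantization-bin centre.
    A trained model would be optimized so that G(D(I_HQ)) ~ I_HQ."""
    return np.asarray(i_lq, dtype=np.float64) + qstep / 2.0
```

Even this placeholder "restorer" reduces the expected reconstruction error, which is the sense in which G approximately inverts D.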
In Reference [14] Galteri et al. learn a generative model G that, conditioned on the input distorted images, is optimized to invert the distortion process D, so that G ≈ D⁻¹. Their generator architecture is loosely inspired by Reference [17]. They employ LeakyReLU activations and 15 residual layers in a fully convolutional network. The final image is obtained by nearest-neighbor upsampling of a convolutional feature map followed by a stride-one convolutional layer, to avoid the grid-like patterns possibly stemming from transposed convolutions.
The weights ψ of the discriminator network D are learned by minimizing the standard adversarial discriminator loss

L_D(ψ) = −log D_ψ(I_HQ) − log(1 − D_ψ(I_R)),

where I_HQ is the uncompressed or high-quality image, I_R is the restored image created by the generator, and I_LQ is a compressed image. The generator is trained combining a perceptual loss with the adversarial loss:

L_G = L_p + λ L_adv,

where L_adv is the standard adversarial loss

L_adv = −log D_ψ(I_R),

which rewards solutions that are able to mislead the discriminator, and L_p is a perceptual loss based on the distance between images, computed by projecting I_HQ and I_R onto a feature space through some differentiable function ϕ and taking the Euclidean distance between the two feature representations:

L_p = ‖ϕ(I_HQ) − ϕ(I_R)‖₂².

They employ a generator inspired by Reference [17], with a residual architecture using LeakyReLU activations, Batch Normalization [20], and nearest-neighbor upsampling layers to recover the original size [33], together with a fully convolutional discriminator. In Reference [14] it has been shown that using a GAN approach, instead of directly training the network for image enhancement, results in improved subjective perceptual similarity to the original images and, more importantly, in much improved object detection performance. Qualitative examples of the GAN and direct training methods are shown in Figure 2.

Fig. 3. Overview of LANBIQUE. An image is first processed by an object detector; each box feature is then fed to a captioning model [2, 11]; then a metric for captioning evaluation is used to score the quality of the image. In this example, a highly corrupted JPEG image yields a low CIDEr score of 0.631.
Real-ESRGAN [47] is a more recent approach that has the advantage of not requiring advance knowledge of the type or intensity of the distortion to restore an image. In Reference [47] Wang et al. introduce a high-order degradation modeling process to better simulate complex real-world degradations. Differently from Reference [14], they use a U-Net discriminator with spectral normalization to increase discriminator capability and stabilize the training dynamics. As in ESRGAN [48], the generator is built from several residual-in-residual dense blocks (RRDB).

EVALUATION PROTOCOL
Classic Full-reference image quality evaluation methods rely on the similarity between an image that has been processed by some algorithm D and a reference undistorted image. Consider the use case of enhancing an image that was compressed: GANs are a good solution, since they are great at filling in realistic high-frequency details in image enhancement tasks; in this case, the resulting enhanced image is compared to the reference. Unfortunately, when using classical MSE-based Full-reference metrics such as SSIM and PSNR, GAN-restored images yield lower scores, as can be seen in Table 2, although they appear "natural" and pleasant to human evaluators, as also seen in the examples of Figure 2. For this reason, in References [13, 14] semantic tasks are used to evaluate the quality of restored images. Measuring the performance of a semantic task such as detection on restored images gives us an understanding of the "correctness" of output images. Given some semantic task (e.g., object detection), a corresponding evaluation metric (e.g., mAP), and a dataset, the evaluation protocol consists of measuring the variation of that metric on different versions of the original image. Interestingly, this evaluation methodology gives hints on which details are better recovered by GANs.
In certain cases, however, detection describes scene semantics in a very approximate fashion: detectors usually do not degrade for object classes that are clearly identifiable by their shape, since even strong distortions of the image are not able to hide such features. The gain in image quality provided by GANs, according to object detection-based evaluation, resides in producing high-quality textures for deformable objects (e.g., cats, dogs).
In this article, we advocate the use of a language generation task for evaluating image enhancement. The idea is that captioning maps the semantics of images into a much finer and richer label space represented by short sentences. To obtain a correct caption from an image, many details must be identifiable.

Evaluation with Reference Captions
We devise the following evaluation protocol for image enhancement. We pick an image captioning algorithm A; image captioning is the task of generating a sequence of words, possibly grammatically and semantically correct, describing the image in detail. Given a set of reference captions S and the caption A(I) generated from an input image I, we measure their similarity with a language metric D:

LANBIQUE(D, A; I, S) = D(A(I), S).

We look at the performance of a captioning algorithm A on different versions of a dataset (e.g., COCO): compressed, original, and restored. The pipeline of this evaluation approach is depicted in Figure 3.
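As a sketch, the protocol reduces to composing a captioner A with a language metric D. Below, the captioner is passed in as a function and a simple unigram F1 stands in for CIDEr; both are illustrative assumptions, not the actual models and metrics used in the article.

```python
def caption_similarity(candidate, references):
    """Toy stand-in for a captioning metric D (e.g., CIDEr):
    mean unigram F1 between the candidate and each reference caption."""
    cand = set(candidate.lower().split())
    scores = []
    for ref in references:
        r = set(ref.lower().split())
        overlap = len(cand & r)
        if overlap == 0:
            scores.append(0.0)
            continue
        precision = overlap / len(cand)
        recall = overlap / len(r)
        scores.append(2 * precision * recall / (precision + recall))
    return sum(scores) / len(scores)

def lanbique(metric, captioner, image, references):
    """LANBIQUE(D, A; I, S) = D(A(I), S)."""
    return metric(captioner(image), references)
```

Running the same captioner over compressed, restored, and original versions of an image and comparing scores is all the protocol requires: a degraded image that misleads the captioner receives a lower score.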
In particular, we analyze results from two highly performing captioning methods [2, 11] that combine a bottom-up model of the visual entities and their attributes in the scene with a language decoding pipeline. Both methods are trained over several steps, incorporating semantic knowledge at different levels of granularity. In particular, the bottom-up region generator is based on Faster R-CNN [38], which is built on a feature extractor pre-trained on ImageNet [12] and then fine-tuned to predict object entities and their attributes using the Visual Genome dataset [26]. In Reference [2], further knowledge is incorporated into the model by training the caption generation model with a first LSTM acting as a top-down visual attention model and a second-level LSTM acting as a language model. Meshed-memory transformers [11] share the exact same visual backbone as Reference [2] but exploit a stack of memory-augmented visual encoding layers and a stack of decoding layers to generate caption tokens.
No matter how captioning models are optimized, our results show that the behavior of the captioning model for image quality assessment is consistent over several metrics, as shown in Table 1.
Captioning is evaluated with several specialized metrics measuring the word-by-word overlap between a generated sentence and the ground truth [34], in certain cases including the ordering of words [3], considering n-grams and not just words [27,46], and the semantic propositional content (SPICE) [1]. These metrics evaluate the similarity with respect to a set of reference captions S, which is usually composed of five references.

Evaluation without Reference Captions
Unfortunately, in most cases reference captions are not available, as they must often be collected with great expense of effort and resources; in fact, standard datasets used for image quality evaluation do not include captions. However, it is possible to evaluate any kind of test image with our language-based approach by modifying the pipeline. The idea is that the reference image is of high enough quality to provide a valid caption for the evaluation of LANBIQUE. We caption the reference image I_HQ using the same captioner A we use for the test image I, and then we follow the same procedure described previously:

LANBIQUE-NC(D, A; I, I_HQ) = D(A(I), A(I_HQ)).

This evaluation approach is represented in Figure 4. Since we change the evaluation pipeline with respect to the previous case, there may be a drawback with respect to the original version of the approach: modern captioners provide just one description per image, which means that the metric D is computed between just two sentences. However, this does not significantly affect the performance of our approach, provided that A generates high-quality captions.

No-reference Evaluation
In this section, we show how our approach can be extended to work in a No-reference setting. On many occasions, we may not have a high-quality image available to compare with the one to be tested. For this reason, we modify our language-based pipeline by adding a blind restoration module R. We assume that the images to be tested are corrupted by one or a combination of unknown distortions that are responsible for a global reduction of the visual quality. In this extended model, our aim is to restore the corrupted input image I and use the enhanced version as the reference image. After this operation is completed, we feed both the corrupted image and the restored one to the same captioning module to generate their text descriptions, and finally we compute the score with some language metric D: LANBIQUE-NR(D, A, R; I) = D(A(I), A(R(I))).
This No-reference approach is depicted in Figure 5. Typically, image distortions are not known a priori, so it would be difficult to train many networks capable of handling all possible combinations of corruption processes and then select the best one for a specific restoration. For this reason, we choose to train a single network following a degradation model, so that it can restore most types of distorted images and recover their original quality as well as possible. To ensure a good output quality, we employed Real-ESRGAN [47] as the restoration module. We modified the original model by adding JPEG2000 to the training procedure, and then fine-tuned a pre-trained version of the network on the newly introduced distortion.
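The No-reference pipeline is a simple composition, with the blind restorer R producing the pseudo-reference. The sketch below uses placeholder functions for the actual modules (the captioner of [11], Real-ESRGAN as R, and CIDEr as D).

```python
def lanbique_nr(metric, captioner, restorer, image):
    """LANBIQUE-NR(D, A, R; I) = D(A(I), A(R(I))).

    metric:    language similarity D(candidate, references)
    captioner: A, maps an image to a caption string
    restorer:  R, blind restoration module yielding the pseudo-reference
    """
    pseudo_reference = restorer(image)
    return metric(captioner(image), [captioner(pseudo_reference)])
```

A lightly distorted input and its restoration yield near-identical captions (high score), while a heavily distorted input diverges from its restored version (low score), as discussed below.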
In most cases, the recovered images represent a solid reference for our evaluation model, as they are very close to real images from the point of view of human perception. In this setup, LANBIQUE-NR assigns high scores to slightly distorted images, as their reconstruction is likely very close perceptually, and so are the generated captions. Highly distorted images, however, are transformed into better-quality images that differ significantly from the input. In this case, the captions of the two versions may differ much more, thus leading to lower scores on the language metrics.

Subjective Evaluation
In this evaluation, we assess how images obtained with the selected GAN-based restoration method [14] are perceived by a human viewer, evaluating in particular the preservation of details and the overall quality of an image. In total, 16 viewers participated in the test, a number that is considered sufficient for subjective image quality evaluation tests [50]; no viewer was familiar with image quality evaluation or the approaches proposed in this work. A Single-Stimulus Absolute Category Rating (ACR) experimental setup has been developed using avrateNG, a tool designed to perform subjective image and video quality evaluations. We asked participants to evaluate image quality using the standard 5-value ACR scale (1 = bad, up to 5 = excellent). A set of 20 images was chosen from the COCO dataset, selecting for each image three versions: the original image, a JPEG compressed version with QF = 10 (a low quality factor, i.e., strong compression), and the restored version of the QF = 10 compressed image; this results in a set of 60 images. Each image was shown for 5 seconds, preceded and followed by a grey image, also shown for 5 seconds. Considering our estimation of the test completion time, we chose this number of images to keep each session under 30 minutes, as recommended by ITU-R BT.500-13 [21].
To select this small sample of 20 images so that it is as representative as possible of the whole dataset D in terms of captioning performance, we operate the following procedure. Let μ*(v) and σ²*(v) be the mean and variance of a captioning metric score (in this case, CIDEr) on the candidate subset for a given version v of the images, and let μ(v) and σ²(v) be the mean and variance of the same metric computed on the whole set of images D. V_i is the set of different versions of an image i in the smaller dataset D*, namely: JPEG compressed at QF = 10 (referred to as JPEG 10 in the following), its GAN reconstruction, and the original uncompressed image. We iteratively extract 20 random image IDs, without repetition, yielding a set D* out of the whole 5,000-image test set from the Karpathy split. We attempt to minimize

e_μ = Σ_{v ∈ V_i} |μ*(v) − μ(v)| and e_σ² = Σ_{v ∈ V_i} |σ²*(v) − σ²(v)|

by iteratively resampling images until we find e_μ and e_σ² such that e_μ ≤ 10⁻³ and e_σ² ≤ 10⁻⁴. The selected images contain different subjects, such as people, animals, man-made objects, nature scenes, and so on. Both the order of presentation of the tests for each viewer and the order of appearance of the images were randomized.

Table 1. For each metric, we denote whether higher (↑) or lower (↓) is better. JPEG q indicates a JPEG compressed image with QF = q (e.g., 10), while REC q indicates the corresponding reconstruction using Reference [14]. Captions created from reconstructed images obtain a better score for every metric.
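The subset-selection procedure can be sketched as rejection sampling; function and variable names below are ours, and the scores dictionary is assumed to hold the per-image captioning scores (e.g., CIDEr) for each version.

```python
import random

def select_subset(scores, versions, k=20, tol_mu=1e-3, tol_var=1e-4,
                  max_iters=10000, seed=0):
    """Draw k image ids until the subset's per-version mean/variance of a
    captioning score matches the full set within the given tolerances.

    scores: dict mapping version name -> {image_id: score}
    """
    rng = random.Random(seed)
    ids = list(scores[versions[0]].keys())

    def stats(vals):
        m = sum(vals) / len(vals)
        return m, sum((x - m) ** 2 for x in vals) / len(vals)

    full = {v: stats(list(scores[v].values())) for v in versions}
    for _ in range(max_iters):
        subset = rng.sample(ids, k)  # k ids, without repetition
        e_mu = sum(abs(stats([scores[v][i] for i in subset])[0] - full[v][0])
                   for v in versions)
        e_var = sum(abs(stats([scores[v][i] for i in subset])[1] - full[v][1])
                    for v in versions)
        if e_mu <= tol_mu and e_var <= tol_var:
            return subset
    return None  # no sufficiently representative subset found
```

The tolerances mirror the thresholds stated in the text; in practice the loop terminates quickly because many random 20-image subsets already track the full-set statistics closely.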

Results on JPEG Artefacts
First, we study in detail the behavior of LANBIQUE on a single distortion. This way, we can easily control the amount of image corruption and evaluate the behavior of our metric on GAN restored images.
Results with reference captions. To use a dataset of images with a set of associated captions, we selected the 5,000-image test set from the Karpathy split of the COCO dataset [9]. The images have been compressed at different JPEG Quality Factors (QF) and then reconstructed using the GAN approach of Reference [14]. In Table 1, we report results of LANBIQUE using various captioning metrics D. Interestingly, all metrics show that captions over reconstructed images (REC rows) are better than captions computed over compressed images (JPEG rows). This shows that the image details compromised by strong compression induce errors in the captioning algorithm. However, the GAN approach is able to recover an image that is not only pleasant to the human eye but also recovers details that are relevant to a semantic algorithm. In Figure 1, we show the difference between the captions generated by Reference [2] over original, compressed, and restored images. A human would likely succeed in producing an almost correct caption for highly compressed images; nevertheless, state-of-the-art algorithms are likely to make extreme mistakes that are instead not present on reconstructed images.
In Figure 6, we show the performance of the captioning algorithms in terms of the CIDEr measure on the same test split of compressed and restored images, considering different JPEG quality factors. The captioner proposed in Reference [11] outperforms Reference [2], as expected; but interestingly, we may observe that the range of CIDEr values of Reference [11] is significantly wider than that of Reference [2]. We argue that this could be considered a strong feature of our evaluation approach, as a wider range of values may imply that a good captioner is able to predict image quality in a finer manner than weaker captioning algorithms. Figure 7 shows the bottom-up captioning process performed on an image used in the subjective evaluation. The left image shows the JPEG 10 version, while the right one shows the GAN reconstruction. The images show the bounding boxes of the detected elements. In the first case, the wrong detections of indoor elements such as "floor" and "wall" are the likely reasons for the wrong caption, as opposed to the correct recognition of a "white wave" and "blue water" in the GAN-reconstructed image.

Fig. 6. CIDEr scores using Reference [2] (purple) and Reference [11] (yellow) on compressed and restored images for different QFs from MS-COCO.

Fig. 7. Bottom-up detection process of captioning on two images: (left) JPEG compressed; (right) GAN reconstruction. Note that several mistaken detections on the left image are avoided in the right one. In particular, on the left, "surfboard" is missed and "white floor" and "blue wall" are wrongly detected. These two indoor details are the ones that likely misled the captioning.
Results without reference captions. A common setting used to evaluate image enhancement algorithms is Full-reference image quality assessment, where several image similarity metrics measure how much a restored version differs from the uncorrupted original image. This kind of metric, measuring pixel-wise value differences, is likely to favor MSE-optimized networks, which are usually prone to producing blurry and poorly detailed images. In certain cases, it is not possible to use Full-reference quality metrics, e.g., when no original image is available; No-reference metrics, which typically evaluate the "naturalness" of the image being analyzed, are used instead. In the same setup used previously, we perform experiments using NIQE and BRISQUE, two popular No-reference metrics for images. Interestingly, these metrics tend to favor GAN-restored images over the original uncompressed ones. Most surprisingly, NIQE and BRISQUE obtain better results when we reconstruct the most degraded versions of the images (QF from 10 to 20), and these values increase as we reconstruct less degraded images. We believe that BRISQUE and NIQE favor the crisper images with high-frequency patterns that are distinctive of GAN-based image enhancement, which are typically stronger when heavily distorted images are reconstructed.

Table 2. For each metric, we denote whether higher (↑) or lower (↓) is better. JPEG q indicates a JPEG compressed image with QF = q (e.g., 10), while REC q indicates the corresponding reconstruction using Reference [14]. NIQE and BRISQUE rate GAN images better than the ORIGINAL. SSIM always rates restored images worse than compressed ones. PSNR shows negligible improvement. Reference [11] and CIDEr have been used by LANBIQUE-NC as language model and language metric, respectively.
In Table 2, we report results on COCO for Full-reference and No-reference indexes. In this setup, we compress the original images at different QFs and then restore them with a QF-specific artefact removal GAN. We use the caption generated from the uncompressed image as ground truth, as in Table 3. The results show that, for restored images, PSNR accounts for a slight improvement, while SSIM scores are lower than those of the compressed counterparts. This is an expected outcome, as Reference [14] shows that state-of-the-art results on PSNR can be obtained only when MSE is optimized, and on SSIM only if that metric is optimized directly. Nonetheless, as can be seen in Figure 2, GAN-enhanced images are more pleasant to the human eye; therefore, we should not rely just on PSNR and SSIM for GAN-restored images. LANBIQUE, using Reference [11], is in line with LPIPS [54]. Unfortunately, LPIPS, as shown in Table 3, has low correlation with scores determined by human-perceived quality.
Correlation with Mean Opinion Score. In Figure 8 (left), subjective evaluation results are reported as Mean Opinion Scores (MOS) in box plots, showing the quartiles of the scores (box), while the whiskers show the rest of the distribution. The plots are made for the original images, the images compressed with JPEG at QF = 10, and the images restored with the GAN-based approach of Reference [14] from the heavily compressed JPEG images. The figure shows that the GAN-based network is able to produce images that are perceptually of much higher quality than the images from which they originated; the average MOS is 1.15 for JPEG images, 2.56 for the GAN-based approach, and 3.59 for the original images. The relatively low MOS obtained even by the original images is related to the fact that COCO images have a visual quality that is much lower than that of datasets designed for image quality evaluation. To give better insight into the distribution of MOS scores, Figure 8 (right) shows the histograms of the MOS scores for the three types of images: orange for the original images, green for the JPEG compressed images, and blue for the restored images.
We further show that our language-based approach correlates with perceived quality using an IQA benchmark test on the LIVE dataset [40], which consists of 29 high-resolution images compressed at different JPEG qualities for a total of 204 images. For each LIVE image, a set of user scores is provided, indicating the perceived quality of the image; however, no caption is provided in this dataset. For this reason, we take the output sentences of captioning approaches over the undistorted image as the ground truth for the language similarity measures, following the LANBIQUE-NC protocol presented in Section 4.2. In Table 3, we show the Pearson correlation score of different captioning metrics and of other common Full-reference quality assessment approaches. The experiment shows an interesting behavior of our approach in terms of correlation. First, we observe that each captioning metric has a correlation index that is higher than, or at least comparable with, the other Full-reference metrics. In particular, METEOR and CIDEr perform better than the other metrics independently of which captioning algorithm is used; in the following experiments, LANBIQUE, LANBIQUE-NC, and LANBIQUE-NR have been computed using the CIDEr metric. Moreover, we observe that the correlation significantly improves if we employ a better-performing captioner. In this case, the visual features used by the two captioning techniques are exactly the same; the main difference lies in the overall language generation pipeline of the approaches. Hence, we argue that language is effectively useful for quality assessment: the more a captioning algorithm is capable of providing detailed and meaningful captions, the better we can use the generated sentences to formulate good predictions about the quality of images.
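The LANBIQUE-NC protocol described above can be sketched in a few lines. This is only an illustrative sketch: the captioner is a stand-in callable (here a lookup table rather than a real captioning network), and CIDEr is replaced by a simplified unigram-overlap F1 so that the example stays self-contained.

```python
def unigram_f1(reference, hypothesis):
    """Simplified caption-similarity score: F1 over unique-word overlap.
    A stand-in for CIDEr, used here only to keep the sketch self-contained."""
    ref, hyp = set(reference.lower().split()), set(hypothesis.lower().split())
    overlap = len(ref & hyp)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def lanbique_nc(reference_img, distorted_img, captioner, sim=unigram_f1):
    """LANBIQUE-NC: caption the undistorted reference to obtain a pseudo
    ground truth, then score the distorted image's caption against it."""
    return sim(captioner(reference_img), captioner(distorted_img))

# Toy captioner: a lookup table standing in for a real captioning network
captions = {
    "ref.png": "a dog runs on the beach",
    "jpeg_qf10.png": "a blurry shape on sand",
}
score = lanbique_nc("ref.png", "jpeg_qf10.png", captions.__getitem__)
print(score)  # low similarity: compression has corrupted the semantic content
```

The key design point is that no human-written caption is needed: the reference image itself supplies the ground-truth sentence.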
To better understand what metric could be used instead of human evaluation, we computed the correlation coefficient with MOS for BRISQUE [31], NIQE [32], and the proposed LANBIQUE, over all versions of the images. As shown in Table 4, it turns out that a fine-grained semantic task such as image captioning is the best proxy (highest correlation) of real human judgment. Figure 9 shows a captioning example from the COCO images used in the subjective quality evaluation experiment. On the left, we show a sample compressed with JPEG at QF = 10; in the middle, the version restored with the approach of Reference [14]; and on the right, the original image. It can be observed that the caption of the restored image correctly describes the image content, on par with the caption obtained on the original image. Instead, the caption of the highly compressed JPEG image is completely unrelated to the image content, probably due to object detection errors.

Results on All Distortions
We further show the performance of our approach in Full-reference image quality assessment on other types of distortion. In this experiment, we keep using the LIVE dataset, as it contains images corrupted with other processes, such as Gaussian blur, fast fading, JPEG2000, and white noise, and we also add the recent large-scale PieAPP dataset.

Results on LIVE.
We repeat the same experiment done for JPEG images on the LIVE dataset, first considering each distortion separately and then all the distortions together. In Table 5, we show the Pearson score for LANBIQUE and several Full-reference approaches. As we can see, our approach underperforms on each distortion except JPEG, while SSIM and LPIPS are consistent despite the diversity of the degradation processes. This is somewhat expected, as blur and white noise tend not to harm detection significantly unless applied at high intensity. Fast fading, moreover, is to be considered a local distortion: objects may not be corrupted at all, thus leading to unchanged detection performances and consequently low correlation scores for our assessment approach. As expected, LANBIQUE-NR obtains a lower score than LANBIQUE-NC: the latter is in fact an upper bound for the No-reference version, since the No-reference protocol would require a perfect blind restoration method, capable of recovering the reference images, to obtain the same score.
However, we experience a totally different scenario when the distortions are evaluated all together. For each IQA approach we have tested, there is a significant drop in the correlation coefficient with respect to the single-distortion experiments. We argue this happens because the scores within a single distortion type are well correlated with human judgment, but across multiple distortion classes there is a larger discrepancy between them, which lowers the overall score. Our approach, instead, does not suffer from this phenomenon: its performance in these conditions is consistent with, if not higher than, the single-distortion case. Moreover, our language-based approach slightly outperforms the other measures on the same data and under the same conditions.

Results on PieAPP.
Finally, we use the more recent large-scale dataset of Reference [37]. Prashnani et al. collected a very large dataset, increasing the number of distortions with respect to existing IQA benchmarks. Moreover, they designed the testing procedure differently: instead of collecting multiple subjective scores from a set of users, they rely on the fact that for humans it is easier to tell which of two distorted images I_A, I_B is closer to an undistorted reference I_R. Image pairs are then labelled with the percentage of users that preferred I_A over I_B; an even split between the two populations means that both images are equally different from the reference I_R. Starting from 200 reference images and combining a diverse set of 75 distortions, with 44 distortions in the training set and 31 in the test set that are distinct from the training set, the PieAPP dataset accounts for a total of 77,280 pairwise comparisons for training (67,200 inter-type and 10,080 intra-type). In Table 6, we report results in terms of Kendall's Rank Correlation Coefficient (KRCC), KRCC = 2/(n(n−1)) Σ_{i<j} sign(x_i − x_j) sign(y_i − y_j); Pearson's Linear Correlation Coefficient (PLCC, or ρ(X, Y) as defined in Equation (9)); and Spearman's Rank Correlation Coefficient (SRCC), ρ(R(X), R(Y)), where R(X) are the ranks of sample X.
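The three correlation coefficients above can be sketched in a few lines of pure Python (in practice, scipy.stats provides kendalltau, pearsonr, and spearmanr; the tie-free rank computation below is a simplification):

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def krcc(x, y):
    """Kendall's Rank Correlation: concordant minus discordant pairs,
    normalized by the number of pairs n(n-1)/2."""
    n = len(x)
    s = sum(sign(x[i] - x[j]) * sign(y[i] - y[j])
            for i in range(n) for j in range(i + 1, n))
    return 2.0 * s / (n * (n - 1))

def plcc(x, y):
    """Pearson's Linear Correlation Coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def srcc(x, y):
    """Spearman's Rank Correlation: Pearson correlation of the ranks
    (assumes no ties, for simplicity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    return plcc(ranks(x), ranks(y))

# A monotone but non-linear relation: rank correlations are exactly 1,
# while the linear correlation is slightly below 1
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]
print(krcc(x, y), srcc(x, y), plcc(x, y))
```

The rank-based KRCC and SRCC are the natural choices for PieAPP, where labels are pairwise preferences rather than absolute scores.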
Interestingly, both images and types of distortion do not overlap between training and testing. In Table 6, we show how our LANBIQUE-NC approach (using CIDEr and Reference [11]) ranks with respect to non-learning (top) and learning-based (bottom) approaches. We call a method non-learning when the algorithm does not rely in any way on supervision for the IQA task: our approach exploits learned deep networks and features, but those are not the result of training on PieAPP or on any other IQA dataset. The lower portion of the table, instead, reports methods [7, 8, 24, 29] that are specifically trained to score image similarity. Very interestingly, our LANBIQUE-NC approach is consistently better than any non-learned image similarity metric and outperforms References [7, 37], with Reference [7] being the closest competitor.

CONCLUSION
In this work, we propose LANBIQUE, a new approach to evaluate image quality using language models. Existing metrics based on the comparison of the restored image with an undistorted version may give counter-intuitive results; conversely, naturalness-based scores may in certain cases rank restored images higher than the original ones.
We show that, instead of signal-based metrics, semantic computer vision tasks can be used to evaluate the results of image enhancement methods. Our claim is that a fine-grained semantic computer vision task can be a great proxy for human-level image judgment. Indeed, we find that employing algorithms mapping input images to a finer output label space, such as captioning, leads to more discriminative metrics.
LANBIQUE is capable of evaluating the quality of images corrupted by different distortions, and its performance is comparable to other image quality assessment methods: LANBIQUE-NC achieves better KRCC than all non-learning-based methods and is also better than most of the methods that exploit some sort of supervision to perform IQA. Moreover, we have modified our evaluation pipeline to transform our original solution into a No-reference method, and we have demonstrated that it keeps performing fairly on standard benchmarks. Finally, we have tested LANBIQUE on a large-scale dataset that contains unknown distortions. Despite the lack of learning and of knowledge on the data, our approach outperforms every baseline that does not use learning for the evaluation, and it is comparable to most of the learned approaches on the same data.
As a final note, we would like to remark that our approach will continuously improve thanks to the advancement of image captioning and enhancement networks. Indeed, we have shown that, without changing the visual features, switching to a better captioning algorithm yields higher performance. Moreover, since image enhancers keep gaining quality and LANBIQUE-NC can be considered an upper bound for LANBIQUE-NR, the gap between the performance of these two methods will shrink.