Visual Interestingness Prediction: A Benchmark Framework and Literature Review

In this paper, we report on the creation of a publicly available, common evaluation framework for image and video visual interestingness prediction. We propose a robust data set, the Interestingness10k, with 9831 images and more than 4 h of video, interestingness scores determined based on more than 1M pair-wise annotations of 800 trusted annotators, some pre-computed multi-modal descriptors, and 192 system output results as baselines. The data were validated extensively during the 2016–2017 MediaEval benchmark campaigns. We provide an in-depth analysis of the crucial components of visual interestingness prediction algorithms by reviewing the capabilities and the evolution of the MediaEval benchmark systems, as well as of prominent systems from the literature. We discuss overall trends, influence of the employed features and techniques, generalization capabilities and the reliability of results. We also discuss the possibility of going beyond state-of-the-art performance via an automatic, ad-hoc system fusion, and propose a deep MLP-based architecture that outperforms the current state-of-the-art systems by a large margin. Finally, we provide the most important lessons learned and insights gained.


SIFT Scale invariant feature transform, SVM Support vector machines, SMR Supervised manifold regression, VOD Video on demand, VSEM Visual-semantic embedding model.

Introduction
Recent advances in automatic analysis of multimedia information go beyond the annotation and prediction of concrete, tangible and objective concepts, such as the presence of specific objects or scene understanding. Motivated by the richness of computer applications where human interaction is central, researchers now also concentrate on the prediction of subjective concepts, related to human behaviour and perception, such as visual memorability, Squalli-Houssaini et al. (2018), induced emotions, Mo et al. (2018), or visual aesthetics, Carballal et al. (2019). When addressing human reactions and perception assessment of multimedia content, an important role is played by the individual observer: personal preferences, personality, cultural background and many other subjective factors. This is an additional challenge to devising automatic machine learning algorithms, as it requires ground truth data specifically adapted to this human-oriented task.
In this work, we address this challenge and discuss resources and approaches for one of the most popular subjective concepts of visual information, namely visual interestingness, Constantin et al. (2019). Interestingness has been defined and studied for some time, starting with Berlyne's work in psychology, Berlyne (1949), which classifies interest as a defining factor for human motivation and behaviour. Later, Berlyne (1960, 1970) identifies factors that induce or influence interest, including novelty, complexity, uncertainty and conflict. The high degree of subjectivity associated with interestingness is visible from some of its definitions, a crucial role in determining interestingness being assigned to the observer. For example, situational interest is defined by Hidi and Anderson (1992) as "the appealing effect of an activity or learning task on an individual". Chamaret et al. (2016) define interestingness as "the quantification of the ability of an image to induce interest in a user".
In some psychological studies, interestingness has been considered as an emotion, Silvia (2005, 2009), and included in the knowledge emotions category that is related to the comprehension process. Interest has been shown to be a product of two appraisal structures: novelty-complexity (interest shown for new and complex events) and coping potential (the ability to understand an event). Further studies have also revealed subjective differences between the perception of interestingness, based on personality traits, e.g., subjects that had high values for their openness were more influenced by the novelty-complexity appraisal structure, McCrae (2007).
In automated, computational approaches, the concept of interestingness is projected in two perspectives, Constantin et al. (2019): visual interestingness, which is related to the aforementioned definitions, and social interestingness, which is related to social media concepts such as popularity, virality, number of likes on social platforms, shares, etc. These concepts, although they may seem correlated, depending on the use case and data, proved to be, in fact, weakly correlated at best, typically negatively correlated, Hsieh et al. (2014). Items with high impact on social networks are not necessarily interesting from a visual perspective.
In this context, this work focuses on the concept of visual interestingness and proposes a publicly available, common evaluation framework, for the prediction of image and video visual interestingness. Proposed resources include large annotated data (the Interestingness10k data set) and evaluation protocols, as well as an in-depth study of benchmark and state-of-the-art approaches, with the objective of providing relevant baselines for a complete practitioner's guide. To disambiguate the information need, we adopt a real-world, Video on Demand (VOD), use case scenario, employed by Technicolor. A computational system should be capable of automatically selecting movie images/parts which are considered to be the most interesting ones for the underlying movie, Demarty et al. (2016). The proposed resources have been validated during the 2016 and 2017 MediaEval Benchmarking Initiative for Multimedia Evaluation. We strongly believe that this type of overview contribution that creates useful insights into its field has a significant impact and helps shape the research directions. We follow the best practices from the literature, like the evolution of the PASCAL Visual Object Classes data set in Everingham et al. (2015), the ILSVRC benchmark, the TRECVid shot boundary detection track in Smeaton et al. (2010), the TRECVid content-based video copy detection benchmark in Awad et al. (2014), the ImageCLEF automatic medical annotation data sets in Deselaers et al. (2008), the multimodal person discovery in broadcast TV benchmark in Poignant et al. (2017), and the ImageCLEF biomedical image retrieval systems in Kalpathy-Cramer et al. (2015).
Some of the most important insights to take away from our study can be summarized as follows: (i) Interestingness entails a high degree of annotator subjectivity; (ii) What is interesting in an image? Analysis of annotator data reveals some specific patterns such as colored and aesthetic frames, and the presence of people; (iii) System performance for prediction is much lower than for more objective tasks, such as object detection or scene classification. Even humans, while significantly surpassing machine performance, do not achieve perfect prediction; (iv) Current state-of-the-art deep neural networks, while achieving good performance, are not the top prediction performers; (v) What do deep neural networks learn? Grad-CAM analysis shows an explicit focus on the main subject, but also on the area around it. The presence of people also triggers activation around the faces; (vi) Late fusion and ensemble systems represent a good option with implicitly higher performance than single systems of any type.
The remainder of the article is structured as follows. Section 2 presents the state of the art and positions our contribution. Section 3 describes the proposed data set, including the annotation protocol. Section 4 presents the recommended evaluation protocol. Section 5 presents an in-depth analysis of benchmark and state-of-the-art systems: overall capabilities, employed descriptors, prediction methods, generalization capabilities, and reliability analysis. Section 6 investigates the performance of several state-of-the-art deep neural networks on the proposed data. Section 7 discusses the possibility of boosting performance by building an ad-hoc system on top of existing baselines and proposes a deep MLP-based solution. Section 8 concludes the paper and discusses future perspectives.

Previous Work
We review the relevant literature on the resources available to benchmark and develop visual interestingness prediction algorithms. For a comprehensive study of computational approaches for interestingness prediction, the reader is referred to our previous contribution, Constantin et al. (2019). Interestingness data sets have been created with the goal of predicting either image or video interestingness. A summary is presented in Table 2.
For instance, the Scene categories data set created by Gygli et al. (2013) is built on top of the Oliva and Torralba (2001) data set. The authors use the original 2688 images, initially selected for scene recognition, and added binary (yes/no) interestingness annotations via crowd-sourcing on Amazon Mechanical Turk (https://www.mturk.com/). On average, each image was annotated by 11.9 subjects. Another relevant example is the visInterest data set, Soleymani (2015). It is composed of 1005 images covering different topics and extracted from real-world photos from Flickr. Annotations were also carried out via crowd-sourcing on Amazon Mechanical Turk. Besides interestingness, these data also come with the annotation of other subjective concepts, e.g., quality, comprehensibility.
For videos, Jiang et al. (2013) propose a data set consisting of 420 YouTube advertisement videos extracted from 14 different categories and 1200 Flickr videos from 15 different categories. The average duration across all the videos is around 53 s. Grabner et al. (2013) create a webcam-based data set from publicly available webcam streams. It contains visual scenes from highways, public squares, urban scenes, etc. The data set consists of 20 different webcam sequences recorded at 1 frame/second, with 159 images each. The interestingness annotations were carried out by 46 trusted annotators, and the interestingness score of an image is taken to be the fraction of people who marked it as interesting.
Another relevant initiative is the gifInterest data set developed by Gygli and Soleymani (2016). It addresses the prediction of GIF media interestingness and is based on the Video2GIF data set and on the Tumblr data set of Bakhshi et al. (2016). In total, it proposes 2739 image sequences with an average duration of 4.25 s (at 11 frames/second). Annotations were collected via crowd-sourcing on Amazon Mechanical Turk.
Although existing resources are definitely valuable and address several useful use case scenarios, we propose a more comprehensive collection of resources, i.e., both annotated data and baseline systems, which were already validated in benchmark campaigns.
We identify the following main contributions over the current state of the art: (i) We publicly release a consistent annotated data set, i.e., Interestingness10k, composed of 9831 images and 9831 short videos (up to 4 h), annotated for visual interestingness by trusted annotators. Apart from image and video visual interestingness, the data also allow the study of the correlation between the two. To the best of our knowledge, this is the most complete common evaluation framework available so far; (ii) We provide an in-depth analysis of the crucial aspects of visual interestingness prediction algorithms by investigating the capabilities and evolution of existing systems (e.g., analysis of relevant approaches from the MediaEval benchmark and from literature, influence of the employed features and fusion techniques, influence of deep learning approaches, generalization capabilities). This is again the first comprehensive study covering all these core aspects. It is a practitioners' guide for best practice in this field and also a strong baseline; (iii) We investigate the possibility of creating automatic, ad-hoc systems, based on existing baselines, that would allow boosting state-of-the-art performance. In this context, we propose a new deep MLP-based fusion scheme that exceeds current performance by a large margin.
Fig. 1 Evolution of the number of published research papers referring to visual interestingness (search made via Google Scholar using "visual interestingness", "image interestingness", "video interestingness", "media interestingness" and "interestingness prediction" as keywords)
We analyzed the importance of this particular topic in the research community by quantifying the number of papers published on this subject between 2010 and 2019. Results are presented in Fig. 1. The search was conducted via Google Scholar using the following keywords: "visual interestingness", "image interestingness", "video interestingness", "media interestingness" and "interestingness prediction". Results were filtered to remove irrelevant articles. Although not exhaustive, it is a good approximation of the general trend. Since 2016, the first year of the interestingness prediction task at MediaEval, the number of research papers published on the subject has grown substantially, and remained high in 2019 even though the task ended. This shows the positive impact of these data, as well as the increased interest in this subject.

Relation to Previous Work
Some preliminary contributions of this work have been published and readers can refer to those works for more information: Demarty et al. (2016, 2017a), short papers briefly presenting the data, metrics and evaluation methodologies for the 2016 and 2017 MediaEval benchmark campaigns; Demarty et al. (2017b), a book chapter presenting and analyzing the results of the 2016 MediaEval benchmark.

Interestingness10k Data Set
We present the proposed data set, its composition and annotation, inter-annotator agreement analysis and the precomputed content descriptors provided with the data.

Composition
Interestingness10k is a large-scale collection of images and video sequences extracted from Creative Commons Hollywood-like movie trailers and excerpts that allow redistribution. Trailers provide a high diversity of content with a good balance between interesting scenes and common scenes, which typically alternate to increase the excitement. Therefore, they are more effective for generating benchmark data. Not least, having the data publicly available is a requirement for a useful benchmark. This would not be possible with data extracted from copyrighted movies.
The data are divided into two parts: (i) one for image visual interestingness prediction which consists of keyframes extracted from video shots (middle frames), and (ii) one for video visual interestingness prediction which consists of individual video shots. Although images and video sequences are issued from the same data, predicting interestingness for images and videos are different tasks. Motion is characteristic for video content and affects visual perception differently compared to a static image. Being composed this way, the data will allow the analysis of the correlation between the two. Each part is also divided into a development set (devset) intended for training the methods and a test set (testset) for the actual evaluation. An overview of the data is presented in Table 1. For the 2016 data, all samples are collected from 78 movie trailers. The devset consists of 5054 images and 5054 videos extracted from 52 trailers. The testset consists of 2342 images and 2342 videos extracted from 26 trailers. The 2017 data are built incrementally on the 2016 data. The devset data are the full 2016 data set, i.e., 7396 images and 7396 videos extracted from 78 trailers. The testset data consist of 2435 images and 2435 videos extracted from 26 trailers and 4 full movie excerpts. We decided to include also longer segments, e.g., the video samples extracted from the 4 movie excerpts are on average 11.4 seconds long compared to around 1-2 seconds for the others. Interestingness10k provides a total of 9831 images and 9831 short videos extracted from 104 trailers and 4 movie excerpts. In Table 2, we compare our data with the most relevant data sets from literature (see also Sect. 2). For the image data, Interestingness10k has the advantage of providing the greatest number of images. Also, the annotations are performed by trusted annotators. For the video data, Interestingness10k provides the greatest number of sequences as well.
The average duration of the video samples is slightly shorter than for the other data but it is consistent with the task. Interestingness10k is also the only data set to provide annotations for both image and video predictions.
Algorithm 1: Proposed adaptive square design annotation approach.
Initialization: assign items randomly in a matrix;
Processing:
repeat
    Perform single annotation round with multiple annotators according to the item pairs given by the square (across rows and columns);
    Compute BTL scores for the new annotations;
    Re-arrange the matrix so that items are ranked according to their BTL scores, and placed in a spiral. This arrangement ensures that similar items are compared row-wise and column-wise;
until convergence;
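The matrix steps of Algorithm 1 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not specified in the text: the number of items is a perfect square, the spiral runs inward from the top-left corner, and the function names `spiral_layout` and `square_design_pairs` are hypothetical. It is an illustration of the idea, not the implementation used for the annotations.

```python
import math
from itertools import combinations


def spiral_layout(ranked_items):
    """Place items (best-ranked first) along an inward spiral of a square grid.

    Assumption: len(ranked_items) is a perfect square and the spiral starts
    at the top-left corner; the direction used in the paper is not stated.
    """
    side = math.isqrt(len(ranked_items))
    if side * side != len(ranked_items):
        raise ValueError("this sketch assumes a perfect-square number of items")
    grid = [[None] * side for _ in range(side)]
    top, bottom, left, right = 0, side - 1, 0, side - 1
    items = iter(ranked_items)
    while top <= bottom and left <= right:
        for c in range(left, right + 1):          # top edge, left to right
            grid[top][c] = next(items)
        for r in range(top + 1, bottom + 1):      # right edge, downwards
            grid[r][right] = next(items)
        if top < bottom:
            for c in range(right - 1, left - 1, -1):  # bottom edge, right to left
                grid[bottom][c] = next(items)
        if left < right:
            for r in range(bottom - 1, top, -1):      # left edge, upwards
                grid[r][left] = next(items)
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return grid


def square_design_pairs(grid):
    """Return the unordered pairs of items that share a row or a column."""
    pairs = set()
    for row in grid:
        pairs.update(frozenset(p) for p in combinations(row, 2))
    for col in zip(*grid):
        pairs.update(frozenset(p) for p in combinations(col, 2))
    return pairs
```

With 9 items ranked 0 to 8, the layout places neighbours in rank next to each other along the spiral, and only the pairs sharing a row or a column are sent to the annotators in that round.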

Annotations
Annotations were performed manually by trusted human assessors, i.e., experts with good understanding of the required task. Annotations are binary, i.e., either the content is interesting or not. Since image visual interestingness prediction differs from video interestingness prediction, the two annotation tasks were carried out separately.

Annotation Protocol
We employed a pair-wise comparison approach, i.e., the human assessors were provided with two competing samples at a time, rather than annotating individual items, a method well suited for gathering subjective annotations in similar scenarios, as presented by Salesses et al. (2013). This provides several advantages. Firstly, it is more reliable as the annotator is asked to do a relatively easy cognitive task, i.e., simply comparing two items. In theory, assigning an absolute rating for a single item requires the annotator to compare to the full set of previously seen items, or at least to keep in mind some complicated set of decisions, Yang and Chen (2011). Secondly, for independent items, different annotators may use different scales and the assessments are not easily comparable, Ovadia (2004). Finally, it has been shown that pairwise comparisons are less influenced by the order in which the annotations are displayed compared to a direct rating, Yannakakis and Hallam (2011). To comply with the underlying use case scenario, annotators were instructed to select the image/video that would be more likely to make them watch the entire source movie.
The main drawback of a pair-wise comparison approach is the impossibility of exploring all possible combinations of two items, especially when dealing with such a large data set. There are however several approximations possible which converge to similar results. We started from the adaptive square design method, Li et al. (2013), where the items are placed in a square and only pairs on the same row or column are compared. This reduces the number of comparisons from n(n−1)/2 for all pairs, to n(√n − 1), where n is the number of items. The Bradley-Terry-Luce (BTL) model (Bradley and Terry 1952) was used to convert the paired comparison data to a scalar value. We modified the original adaptive square design setup so that comparisons were made by many users simultaneously until all the required pairs had been annotated. The proposed algorithm is depicted in Algorithm 1.
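To make the conversion from paired comparisons to scalar values concrete, the following is a minimal sketch of fitting BTL scores with the standard minorization-maximization (MM) iteration. The fitting procedure actually used for the data set is not described here, so the iteration scheme, the `smoothing` pseudo-count and the function name `fit_btl` are assumptions made for illustration only.

```python
def fit_btl(wins, n_items, smoothing=0.5, iters=500, tol=1e-10):
    """Estimate Bradley-Terry-Luce scores from pairwise outcomes.

    wins[(i, j)] is the number of times item i was preferred over item j.
    `smoothing` (an assumption of this sketch) adds virtual wins in both
    directions to every observed pair so that no score collapses to zero.
    Returns scores normalised to sum to 1.
    """
    total_wins = [0.0] * n_items
    n_comp = {}  # unordered pair (i < j) -> number of comparisons
    for (i, j), w in wins.items():
        total_wins[i] += w
        key = (min(i, j), max(i, j))
        n_comp[key] = n_comp.get(key, 0.0) + w
    for (i, j) in list(n_comp):
        total_wins[i] += smoothing
        total_wins[j] += smoothing
        n_comp[(i, j)] += 2 * smoothing

    p = [1.0 / n_items] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n in n_comp.items():
            d = n / (p[i] + p[j])
            denom[i] += d
            denom[j] += d
        # MM update: wins of item i divided by its weighted comparison count.
        new_p = [total_wins[i] / denom[i] if denom[i] > 0 else p[i]
                 for i in range(n_items)]
        norm = sum(new_p)
        new_p = [x / norm for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p
```

Under the BTL model, the probability that item i is preferred over item j is p[i] / (p[i] + p[j]), so sorting items by the fitted scores yields the ranking used to re-arrange the matrix in the next round.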
For the annotations, we used 5 rounds which proved to be sufficient to achieve good convergence. The final interestingness decision was based on sorting the BTL values and finding a threshold value. We used a heuristic rule to find the boundary between the interesting and non-interesting items, i.e., normalizing the BTL values for each movie separately and using the assumption that the BTL distribution is a sum of interesting and non-interesting sample distributions. For more details about the protocol, see Demarty et al. (2017b).

Annotation Statistics
The image data set was annotated by 270 annotators (average age 25.2±9), of whom 70.9% were male and 29.1% were female. Annotators came from 17 different countries around the world, mainly from Europe (79.6%) and Asia (18.5%). On average, each annotator annotated 1976 different image pairs. The video data set was annotated by 526 annotators (average age 30.3±12.5). The gender distribution was similar to the one for the image data, with 66.7% males and 33.3% females. Annotators were spread over 35 countries, distributed slightly differently compared to the image annotators, namely 74.5% came from Europe, 15% from Asia, 8.7% from America and 1.7% from the rest of the world. On average, each annotator annotated 1030 video pairs, which is approximately half the number of image pairs. The reason is the significantly longer time required to view the videos.
Given the high subjectivity of the task, it is interesting to assess the annotators' agreement. To do so, there is a high diversity of metrics available, e.g., Percent Agreement, Krippendorff's alpha, Fleiss' kappa, Randolph's kappa, Hayes and Krippendorff (2007). Depending on the data type and size, their characteristics, and the number of raters per sample, not all metrics are suitable and equivalent. In our case, we have a large collection of annotations with 533,520 pair annotations for images and 541,780 pair annotations for videos. Not all pairs were viewed by the same annotators, but all of them had votes from at least two different annotators. Inter-rater agreement measures such as Fleiss' kappa or Randolph's kappa are particularly appropriate in such a configuration. Furthermore, the annotations are not equally spread between the two categories, i.e., interesting and not interesting. We observed a bias towards the not interesting class for both images and videos, with only a few samples with high interestingness levels. In the adopted pair-wise comparison protocol, no constraints were imposed to spread the data equally between the two classes.
In such cases, where raters don't know a priori the number of cases that should be distributed into each category, Randolph's kappa proved to be a good alternative to the fixed-marginal multirater Fleiss' kappa, Randolph (2005). Marginals are considered to be fixed when raters know a priori the quantity of samples that should be distributed into each class. In that sense, Randolph's kappa is seen as a free-marginal multirater kappa, adapted to a non-symmetric distribution of the data between classes.
The computation of Randolph's kappa, when considering two annotators per pair, led to a value of 0.556 for the image data set and 0.519 for the video data set. Randolph's kappa is in the range of [−1; 1], with 1 being a perfect agreement and negative values meaning no agreement between raters (other than what would be expected by chance). Therefore, we reach a reasonable agreement on both the image and the video data sets. For the sake of comparison, we also computed the Percent Agreement and obtained 76.9% for the image data set and 75% for the video data set. This reconfirms a reasonable inter-rater agreement for both data sets, considering the high subjectivity of the interestingness concept.
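As an illustration of the agreement measure discussed above, here is a minimal sketch of a free-marginal multirater kappa in the spirit of Randolph (2005). It assumes k = 2 categories by default (for a pair, which of the two items was chosen) and at least two raters per item; the function name and the toy data are hypothetical, and the sketch is not meant to reproduce the values reported for the data set.

```python
from collections import Counter


def free_marginal_kappa(ratings, n_categories=2):
    """Free-marginal multirater kappa.

    ratings: one list of category labels per item, one label per rater.
    Observed agreement is the mean, over items, of the proportion of
    agreeing rater pairs; chance agreement is 1 / n_categories.
    """
    per_item = []
    for labels in ratings:
        n = len(labels)
        if n < 2:
            raise ValueError("each item needs at least two raters")
        counts = Counter(labels)
        agreeing = sum(c * (c - 1) for c in counts.values())
        per_item.append(agreeing / (n * (n - 1)))
    p_observed = sum(per_item) / len(per_item)
    p_chance = 1.0 / n_categories
    return (p_observed - p_chance) / (1.0 - p_chance)
```

For example, with two raters and four pairs where the raters agree on three of them, the observed agreement is 0.75 and the kappa is (0.75 − 0.5) / 0.5 = 0.5.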
In Figs. 2 and 3, we illustrate several examples of both images and videos, annotated as interesting as well as non-interesting. Interesting content is visibly more colored, better centered on pleasant people, less blurred and contains interesting actions.

Content Descriptors
To address a broader community, the data come with several pre-computed, general purpose, content descriptors for visual and audio information.
Visual information We propose the following visual descriptors: Dense SIFT, Lowe (2004), HoG, Dalal and Triggs (2005), LBP, Ojala et al. (2002), GIST, Oliva and Torralba (2001), Color Histogram, AlexNet layers, Krizhevsky et al. (2012), and C3D layers, Tran et al. (2015). Dense SIFT features were computed using densely sampled frame patches instead of point of interest detectors, with a codebook of 300 codewords used in the quantization process, as described in Lazebnik et al. (2006). HoG descriptors were computed over densely sampled patches and, following the work of Xiao et al. (2010), were concatenated in order to create a higher dimensional feature. From the AlexNet model we extracted the fc7 and prob layers, according to the work of Jiang et al. (2015), and from the C3D model the fc6 layer.
Audio information We propose the MFCC features, computed over 32 ms windows with 50% overlap, where cepstral vectors are concatenated with their first and second derivatives.
Mid-level information To account for a higher-level description, we propose a human presence detector. Face detection was computed via HoG and tracking was done via the approach proposed by Danelljan et al. (2014).

Evaluation Methodology
For benchmarking image/video visual interestingness prediction, we recommend certain metrics. These were used during the 2016 and 2017 MediaEval benchmarks. Of course, evaluation on these data is not restricted to these metrics, but they provide a solid baseline. There is also an official split between the training data (devset) and testing data (testset), presented in Table 1. This allows systems to be compared under the same conditions. The systems should train their parameters on the devset and perform the actual evaluation on the testset.
We expect the systems to predict a confidence score corresponding to the degree of visual interestingness for each item. The higher the score, the more interesting it should be. In line with this, we recommend two related metrics: the overall mean Average Precision (mAP) and the mean Average Precision over the 10 highest ranked items (mAP@10). mAP is a widely used metric for retrieval tasks, proven to be stable in such scenarios (Buckley and Voorhees 2017). It is computed as the mean value over the average precision scores for each source trailer in the testset. This metric fits the VOD use case where images/videos should be selected to be the most interesting for representing the underlying content. mAP@10 was proposed to better reflect the selection of a small set of candidate images/videos. The metrics are computed using the standard trec_eval software tool (https://trec.nist.gov/trec_eval/).
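The two recommended metrics can be sketched as follows. This is an illustrative Python version, not the official tool: the function names are hypothetical, and the normalisation of the truncated score (dividing by the smaller of k and the number of interesting items) is one common convention that may differ from the one applied by trec_eval.

```python
def average_precision(ranked_labels, k=None):
    """Average precision of a ranked list of binary labels (1 = interesting).

    ranked_labels must be sorted by decreasing predicted confidence.
    If k is given, only the k highest ranked items are considered.
    """
    n_relevant = sum(ranked_labels)
    if n_relevant == 0:
        return 0.0
    cutoff = len(ranked_labels) if k is None else min(k, len(ranked_labels))
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels[:cutoff], start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    # Assumption: the truncated variant is normalised by min(k, n_relevant).
    denominator = n_relevant if k is None else min(k, n_relevant)
    return precision_sum / denominator


def mean_average_precision(per_trailer, k=None):
    """Mean of the average precision computed separately for each trailer.

    per_trailer maps a trailer identifier to a list of (score, label) pairs.
    """
    ap_values = []
    for items in per_trailer.values():
        ranked = sorted(items, key=lambda pair: pair[0], reverse=True)
        ap_values.append(average_precision([label for _, label in ranked], k))
    return sum(ap_values) / len(ap_values)
```

Calling `mean_average_precision(runs)` gives the overall mAP, while `mean_average_precision(runs, k=10)` gives the mAP@10 variant under the stated normalisation assumption.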

Baseline Systems
We provide an in-depth analysis of various systems, both from the MediaEval benchmark, as well as state-of-the-art systems from literature which were evaluated on Interestingness10k. We reference the different year data as Year.Type, where Type is the modality and Year the specific year, e.g., 2016.Image refers to the 2016 image prediction data. The data were presented in Table 1. We overview a total of 192 systems, as presented in Table 3.
Systems are evaluated using the official devset-testset split and also the official metrics, i.e., mAP for the 2016 data and mAP@10 for the 2017 data. For comparison between different data sets, we use general mAP.
We provide an analysis of overall performance and system evolution, employed descriptors, prediction methods, generalization capabilities and, finally, analyze the reliability of the system ranking results. Statistical significance of the main hypotheses is tested using the Mann-Whitney-U test, Mann and Whitney (1947).
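For readers unfamiliar with the test, the following is a minimal pure-Python sketch of a two-sided Mann-Whitney U test using midranks and the normal approximation with tie correction. It omits the continuity correction and exact small-sample p-values, so it is an illustration under those assumptions rather than the procedure used for the reported results; in practice a library routine such as `scipy.stats.mannwhitneyu` is a more robust choice.

```python
import math


def midranks(values):
    """Return 1-based ranks with ties averaged, and the tie correction term."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    tie_term = 0.0
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for idx in range(i, j + 1):
            ranks[order[idx]] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    return ranks, tie_term


def mann_whitney_u(x, y):
    """Return (U statistic of x, two-sided p-value from the normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, tie_term = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return u_x, 1.0  # all observations tied: no evidence of a difference
    z = (u_x - mean_u) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value
```

Here `x` and `y` would be, for instance, the mAP values of the systems evaluated on two different data sets.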

Analysis of the Overall Performance
We analyze the general trends and performance of existing approaches. A boxplot representation of the results is presented in Fig. 4.
The first observation is the fact that no methods stood out as outliers, i.e., with significantly higher or lower performance, compared to the others. There is then an obvious trend of increasing performance from 2016 to 2017. The best mAP performance for image prediction increases by 25.75% from the 2016.Image to the 2017.Image data. For video prediction, the improvement is similar, namely 22.75%, from a mAP value of 0.1815 on 2016.Video data (Almeida 2016), to 0.2228 on 2017.Video data. The median mAP for 2017.Image and 2017.Video, 0.2550 and 0.1877, respectively, both surpass the maximum values recorded for 2016.Image and 2016.Video. The observation is also true when analyzing only the runs that were officially submitted as part of the MediaEval benchmark, with the image and video prediction registering a growth of 31.63% and 15.37%, respectively (Mann-Whitney-U p < 0.001 for images and videos). The reason behind this could be the improvement of the systems, their increased specialization for an interestingness related task, the effect of a larger number of samples in the training data sets and better annotations.
Comparing the prediction of visual interestingness for images and videos, results show that images achieve a higher mAP, but also a wider spread of the results, meaning more diversity, i.e., the standard deviation is 0.0264 for 2016.Image data vs. 0.0120 for 2016.Video data, and 0.0476 for 2017.Image data vs. 0.0134 for 2017.Video data.
To stress the upper limit performance, we also assess the results of three human runs, obtained via the human annotators (see the red dots in Fig. 4). To compute those, we followed the annotation protocol described in Sect. 3.2.1. The best achieved results are: for the 2016.Image data a mAP of 0.5058, for the 2016.Video data a mAP of 0.4066, for the 2017.Image data a mAP@10 of 0.5403 and a mAP of 0.6661, and for the 2017.Video data a mAP@10 of 0.4140 and a mAP of 0.4897. What is interesting to notice is that these human assessors did not lead to 100% precision, as the overall aggregated annotations would do. This clearly indicates the high subjectivity of such a task and inherently a variation in the perception of the data.

Analysis of the Employed Features
We further analyze the impact of the employed content descriptors and fusion schemes on the performance.
Single modality features. Overall, 72% of the analyzed systems (139 systems) use only one modality. Some have achieved very good performance. For instance, the best overall performance on the 2016.Image data set, a mAP of 0.2485, is achieved with a combination of standard visual features (Datta et al. 2006; Li and Chen 2009; Ke et al. 2006) with early and late fusion schemes. Motion features were the best overall performers on the 2016.Video data. Almeida (2016) achieves a mAP of 0.1815 using histograms of motion patterns (Almeida et al. 2011) in different learning-to-rank strategies, thus taking into account the full spatio-temporal representation of the videos.
Conceptual features are a special class of descriptors that represent higher level concepts, positively or negatively correlated with interestingness. Even though few systems have implemented such features, only 12% (23 systems), they achieve some of the top results. Examples are features capturing emotions, e.g., SentiBank, Borth et al. (2013), employed by Xu et al. (2016), and features representing the visual-semantic space, e.g., image-captioning based, Kiros et al. (2014), employed by Berson et al. (2017). The best result on 2016.Image data from the MediaEval benchmark was achieved by Liem (2016), who uses HSV histograms augmented with the presence and areas of faces. It achieves a mAP of 0.2336. On the 2017.Video data the best mAP@10 is 0.0827, achieved by Ahmed et al. (2017) at MediaEval. The authors use genre as a predictor for movie interestingness, developing a system that creates genre predictors based on layers extracted from deep neural networks like VGG-16 (Simonyan and Zisserman 2014) and SoundNet (Aytar et al. 2016). The proposed system uses the MovieScope data set (Sivaraman and Somappa 2016) as additional training information.
Deep features are now the state of the art in many classification tasks. They were also widely used, either as unimodal features or as part of multimodal fusion approaches, accounting for 59% of the analyzed systems (114 systems). Examples are the use of AlexNet fc7 and prob layers in Erdogan et al. (2016), or last layers of VGG in Lam et al. (2016). Overall, several deep feature-based systems achieved the best performance, either individually or in multimodal combinations. The highest mAP on 2016.Image data achieved during the MediaEval benchmark is 0.2336 and is obtained by Shen et al. (2016). The authors employed fc7 layer features from CaffeNet (Jia et al. 2014), where data are re-sized and center cropped to preserve the aspect ratio. The authors also performed a mean image subtraction for normalization. Another example is the approach of Parekh et al. (2018), which achieved the best overall mAP@10 on 2017.Image data, i.e., 0.156. The authors use the fc7 layer of AlexNet as input for their DNN ranking. Other approaches use deep features in fusion schemes. The best mAP@10 on 2017.Image data achieved during the MediaEval benchmark is 0.1385. Permadi et al. (2017) employed standard visual features like LBP and HoG in combination with AlexNet fc7 features. The best overall result on the 2017.Video data, mAP@10 0.093, is achieved by Wang et al. (2018) fusing standard visual features (color histogram, denseSIFT, GIST, HoG and LBP), audio features (IS10, Eyben et al. 2010) and layers of deep networks (AlexNet, C3D and InceptionV3).
Some form of feature fusion was employed by 54% of the analyzed systems (104 systems). Early fusion was used in 41% of the cases (78 systems), while late fusion was used in 25% of the cases (48 systems); some systems use a combination of the two. Dimensionality reduction schemes were used in 18% of the cases (34 systems). Notable performance is achieved with PCA, in Rayatdoost and Soleymani (2016), with the third best result on 2016.Image data at MediaEval (mAP 0.1710), and with NMMP and SMR, in Liu et al. (2017), with the second best result on 2017.Image data at MediaEval (mAP@10 0.1369). For a detailed analysis of the impact of dimensionality reduction on the prediction, the reader can refer to .
Temporal feature aggregation was also explored. Several methods were tested for creating video-level descriptors from individual frame descriptors. Overall, 67 of the 115 systems (58%) dealing with video interestingness use this type of feature aggregation. However, most of them rely on traditional statistics, such as the average and the standard deviation. For instance, Liu et al. (2017) obtain the second best mAP during MediaEval for 2016.Video, with a mAP of 0.1735. Median is used in , obtaining a mAP@10 of 0.0732 on 2017.Video data. This is the third best   (2017), who achieved a mAP@10 of 0.0628 on 2017.Video data, or the use of temporal integration via LSTM (Hochreiter and Schmidhuber 1997) architectures in Shen et al. (2017), with a mAP of 0.1706 on 2016.Video data.

Analysis of the Prediction Methods
Next, we analyze the employed techniques and their capabilities. Numerous approaches have been experimented with; nevertheless, some trends can be identified. We propose an analysis at two levels of detail, classifying the methods (i) according to the problem formulation, and (ii) according to the specific class of techniques. While some of the classes defined in the following sections may not be mutually exclusive, our intention here is not to establish a strict taxonomy but to identify the most prominent approaches and to understand their performance and general trends.
Fig. 6 Analysis of the employed methods: Year.Type represents the year of the data (2016 or 2017) and their type (Image or Video). We plot mAP for all methods. We provide two levels of detail: (i) per problem formulation, and (ii) per specific method. We represent both the participating systems from the MediaEval benchmark and state-of-the-art approaches from the literature (marked with a red circle)

Problem Formulation
We identify the following main approaches: (i) classification, (ii) regression, (iii) ranking, and (iv) hybrid, i.e., combining more than one approach. Results are presented in Fig. 6. Overall, more than 52% of the approaches use classification. These systems tend to achieve the highest performance on the image data.
SVM was the most popular choice among the analyzed approaches, representing 30% of them. It is used by two of the top runs on the 2016.Image data, and by the top run on the 2017.Video data. For instance, Shen et al. (2016) built a simple yet efficient system that uses visual features (CNN features from the last layer of CaffeNet) classified with SVMs. It achieves a mAP of 0.2336 on the 2016.Image data during the MediaEval competition. This score was further improved by using SVMs to learn the association between various image description techniques (related to subjective properties, such as aesthetics, style, and image composition) and interestingness. The system is boosted via a late fusion approach, outperforming the best results from the MediaEval benchmark with a mAP of 0.2485. On the 2017.Video data, the top score at MediaEval is obtained by Ahmed et al. (2017), who use deep audio-visual features to generate mid-level concepts representing movie genres, i.e., action, drama, horror, romance, and sci-fi. These genre distributions serve as the input for a binary SVM classifier, achieving a mAP@10 of 0.0827.
DNNs represent 28% of the analyzed approaches, making them the second most used approach. Surprisingly, none of the best MediaEval benchmark systems use DNNs. However, there are state-of-the-art approaches that outperform the best benchmark results. For instance, Parekh et al. (2018) provide the best overall results on 2017.Image data, achieving a mAP@10 of 0.156. The authors train a DNN that takes as input pairs of CNN representations of images and predicts which image of the pair is more interesting. The process is carried out for all possible pairs within each video, followed by a ranking algorithm.
Ranking approaches account for 13% of the analyzed approaches. Almeida (2016) uses a set of learning-to-rank algorithms for predicting the interestingness of videos via only visual feature representations (HMP). The classification is performed with a majority voting scheme over the prediction of 4 pairwise learned rankers, namely: Ranking SVM, RankNet, RankBoost, and ListNet. It achieves the best results in the MediaEval competition on the 2016.Video data, i.e., mAP 0.1815.
Regression approaches account for 12% of the analyzed systems, while also accounting for some top runs. For instance, Permadi et al. (2017) achieve the best MediaEval results on the 2017.Image data, a mAP@10 of 0.1385. The authors use a logistic regression trained on an early fusion representation of various features, i.e., Color Histogram (HSV), LBP, HoG, GIST, denseSIFT, AlexNet features and contextual descriptors.
Hybrid approaches, combining more than one type of method, account for almost 6% of the total analyzed systems. While these methods did not achieve notable results during the MediaEval benchmark, some of the state-of-the-art approaches provide notable results. Wang et al. (2018) provide the best overall results on 2017.Video data, with a mAP@10 of 0.093. The authors investigate the use of a learning-to-rank DNN via a Siamese network, and a reinforcement ranking based on a Markov decision process. To boost the results, descriptors are aggregated using early fusion: visual descriptors (GIST, LBP, HoG, Color Histogram, denseSIFT), deep features (AlexNet, InceptionV3, C3D), and acoustic features (energy, pitch, jitter and shimmer). A late fusion is finally used to aggregate the decisions of the two ranking models.
Shallow NN-based methods are less used and account for almost 6% of the analyzed systems. While they are in general less effective than the other approaches, one approach stood out. Berson et al. (2017) use semantic and contextual information via CNN features and image-captioning based features with metadata extracted from IMDb. The authors investigate different combinations of features trained via a simple MLP network, achieving a mAP@10 of 0.1054 on the 2017.Image data.
Distance-based approaches account for 4% of the total number of analyzed systems. For instance, Liem (2016) employs a heuristic approach based on the occurrence of people in video shots. The author's assumption is that clear human faces should attract viewers' attention, causing larger empathy. The classification quantifies the average of the histogram intersection between the HSV histograms of the detected faces, the mean HSV of all frames with detected faces within a shot, and the area of the detected faces' bounding boxes. The scores are then sorted, followed by thresholding. It achieves a mAP of 0.2336 on the 2016.Image data and 0.1558 on the 2016.Video data.
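As an illustration of the histogram intersection measure mentioned above, the following minimal Python sketch compares two histograms after L1 normalization. The function name, the 4-bin histograms, and the normalization step are assumptions made for this example; they are not taken from Liem (2016).

```python
def histogram_intersection(h1, h2):
    """Intersection of two histograms after L1 normalisation.

    Returns a value in [0, 1], where 1 means identical distributions.
    The normalisation step is an assumption of this sketch.
    """
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    s1, s2 = sum(h1), sum(h2)
    if s1 == 0 or s2 == 0:
        return 0.0  # an empty histogram shares no mass with anything
    return sum(min(a / s1, b / s2) for a, b in zip(h1, h2))


# Hypothetical 4-bin hue histograms (illustrative values only).
face_hist = [4, 10, 6, 0]
shot_hist = [5, 5, 5, 5]
score = histogram_intersection(face_hist, shot_hist)
```

A higher score indicates that the two colour distributions overlap more; how such scores are averaged and thresholded in the original system is not reproduced here.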
Ensemble learning approaches are poorly represented. We find only one approach tested on the 2016.Image data but without any notable results.
Similarly, statistical approaches, e.g., Markov decision based, were used by only one system on the 2017.Video data, but without any notable results.
Table 5 presents an analysis of the average mAP achieved for the categories of methods presented in the previous sections. For the image data, approaches based on DNNs and shallow NNs stand out, with average mAP scores of 0.2460 and 0.2405, respectively. This result is particularly interesting as the best performing type of method, DNN, also has a high number of runs. While SVM is the most used approach, it is outperformed by many of the other approaches. For the video data, on the other hand, hybrid approaches and SVM-based approaches stand out as the best performers, with average mAP scores of 0.1867 and 0.1822, respectively. Unlike for the image data, hybrid systems appear to be the best performing type of method here, which could be the result of the inherently multi-modal nature of videos.

Generalization Capabilities
Interestingness has been shown to be either positively or negatively correlated with other subjective concepts, Constantin et al. (2019). It is therefore interesting to study whether systems are able to generalize well from other concepts or data, and even between images and videos. In this experiment we analyze these aspects.

Concept Generalization
We analyze how visual interestingness prediction generalizes between different concepts and, therefore, types of data. We identified the following situations: (i) no generalization, i.e., the systems were trained solely on the Interestingness10k data, without the use of other external data; (ii) pre-trained extractors, i.e., systems are trained on data unrelated to interestingness, like object recognition data sets, and used directly, usually as features in a classifier, to predict interestingness; (iii) fine-tuned systems, i.e., systems are first trained on data unrelated to interestingness and then retrained on the Interestingness10k data to predict visual interestingness; (iv) correlated systems, i.e., systems are trained on other data from positively or negatively correlated domains, e.g., memorability, aesthetics, emotion prediction, and then used to predict interestingness, either directly or via fine-tuning. Pre-trained extractors, with 88 systems (45.8%), represent the most common type of system, even more popular than systems that do not use any kind of generalization (42.2%). Several deep neural network architectures were used by these extractors, including AlexNet, VGG and C3D.
Fine-tuned systems were mainly obtained by fine-tuning popular deep neural networks, accounting for 17 systems in total (8.9%), 8 of them addressing image interestingness and 9 of them video interestingness. For instance, Erdogan et al. (2016) achieve a mAP of 0.2125 on the 2016.Image data. The authors fine-tune the AlexNet model. The last softmax layer is replaced with a regression layer, using Euclidean loss. Training is carried out for 2000 epochs and only the weights of the final fully connected layer are updated during this process. Ahmed et al. (2017) achieve the best results on 2017.Video data with a mAP of 0.2094, which is also the best result recorded during the MediaEval benchmark. The authors create a genre prediction system for video and audio information using the VGG and SoundNet models, trained on the MovieScope data set (Sivaraman and Somappa 2016). The final retrained system is able to infer video interestingness starting from the genre prediction network. During the training process, keyframes were used as representatives for the entire video shot. Another approach, developed by Vasudevan et al. (2016), uses a deep visual semantic embedding model developed and trained on 0.5 million samples from the MSR Clickture data set (Hua et al. 2013), used to infer semantic proximity between text and images. This network uses a series of LSTM layers for encoding textual information and convolutional and fully connected layers for image processing. During the fine-tuning process, the title of the movie and the keyframes are embedded in the same space, and ranking is achieved based on the distance between the textual and image embeddings. This approach scored a mAP of 0.1952 on the 2016.Image data.
Fig. 7 Analysis of the generalization capabilities: methods developed on the provided data (none), methods pre-trained on unrelated data (pre-trained), methods pre-trained and then re-trained on provided data (fine-tuned), methods pre-trained on related data and used directly (correlated). mAP values are presented for all the methods, with image prediction depicted in blue and video prediction in red. We represent both MediaEval benchmark systems and state-of-the-art approaches from the literature
Finally, 6 systems (3.1%), 3 for image prediction and 3 for video prediction, use correlated system approaches. For image prediction, Shen et al. (2016) achieve a mAP of 0.2315 on the 2016.Image data. The authors create a shallow MLP-based system with one hidden dense layer with 1000 neurons and ReLU activation. This system is initially trained on a data set of 0.2 million images extracted with the Flickr API, based on their Flickr social interestingness score. The data set is evenly balanced with regard to socially interesting and non-interesting samples. While social interestingness and visual interestingness are different concepts, they can exhibit some degree of correlation given their subjective nature (Hsieh et al. 2014). The best performing model is trained afterwards on the 2016.Image data, and some additional resampling and upsampling steps are taken to induce class balance.
For the upsampling strategy, interesting samples are replicated by a factor of 5 to 13, with the optimum result being achieved for an upsampling factor of 11. This approach also represents the best result attained during the MediaEval competition, with a mAP of 0.2336. For the resampling strategy, the authors randomly select samples, based on a preset probability of interesting samples being selected. Probabilities between 0.3 and 0.6 are tested, and the optimum result, a mAP of 0.2315, is achieved with a resampling parameter of 0.6. Other approaches include the one proposed by Erdogan et al. (2016), who retrain the fully connected weights of the memorability model MemNet (Khosla et al. 2015) for 3000 epochs, thus achieving a mAP of 0.2121. For both image and video prediction, Xu et al. (2016) employed SentiBank-based systems in their approach, trained on Flickr images (Borth et al. 2013), without fine-tuning the systems on Interestingness10k data. For the 2016.Image data, the authors achieve a mAP of 0.229, while for the 2016.Video data the result is 0.154. Previous works have shown positive correlation between emotional content and visual interestingness (Gygli et al. 2013).
Figure 7 shows a comparison between the results obtained by different generalization strategies. It is interesting to notice that, for image interestingness prediction, the pre-trained extractor systems perform significantly better than the other types of methods. The average mAP for pre-trained systems is 0.2405, while for the no generalization systems the average mAP is 0.2208 (Mann-Whitney-U p < 0.05). However, the same conclusion was not statistically significant for the video data. While the other strategies did not present top results, an interesting experiment is conducted by Vasudevan et al. (2016). As mentioned before, their network, once re-trained on 2016.Image data, achieves a mAP of 0.1952. However, the same deep visual semantic embedding system trained only on the 0.5 million text-image pairs achieves a mAP of just 0.1866, while the addition of 7.5 million text-image pairs from the MSR Clickture data set surprisingly further decreases the mAP to 0.1858. This experiment shows the importance of fine-tuning on Interestingness10k data and the performance advantage it can bring.
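The upsampling and resampling strategies described above for inducing class balance can be sketched as follows. This is a minimal illustration, not the implementation of Shen et al. (2016): the function names, the label convention (1 for interesting, 0 otherwise), and the reading of the upsampling factor as the number of copies kept of each interesting sample are all assumptions.

```python
import random


def upsample(samples, labels, factor):
    """Replicate each interesting sample (label 1) so it appears `factor` times."""
    out = []
    for x, y in zip(samples, labels):
        copies = factor if y == 1 else 1
        out.extend([(x, y)] * copies)
    return out


def resample(samples, labels, p_interesting, n_draws, rng):
    """Draw n_draws samples with replacement.

    Each draw picks an interesting sample with probability p_interesting,
    otherwise a non-interesting one (assumed reading of the strategy).
    """
    pos = [x for x, y in zip(samples, labels) if y == 1]
    neg = [x for x, y in zip(samples, labels) if y == 0]
    drawn = []
    for _ in range(n_draws):
        if rng.random() < p_interesting:
            drawn.append((rng.choice(pos), 1))
        else:
            drawn.append((rng.choice(neg), 0))
    return drawn


# Hypothetical toy data: one interesting image among three.
images = ["img_a", "img_b", "img_c"]
labels = [1, 0, 0]
balanced_up = upsample(images, labels, factor=11)
balanced_re = resample(images, labels, p_interesting=0.6, n_draws=100,
                       rng=random.Random(0))
```

Both variants change only the training distribution; the evaluation data would be left untouched.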

Image to Video Generalization
We analyze how image visual interestingness prediction can generalize to video prediction. We target identical systems, e.g., systems that use the same set of features, pre-processing, training and post-processing for both tasks. This analysis also incorporates video systems that use simple statistical approaches to create a video descriptor out of image descriptors, such as taking average or median values across the entire set of frames and generating a single, video-wise descriptor. In total, 10 systems fall into this category. Figure 8 presents the achieved mAP on video prediction vs. image prediction. The Pearson correlation coefficient is ρ = 0.546, indicating that the two are correlated. However, this can also be explained by the data themselves being correlated, i.e., images are extracted from the videos.
Nevertheless, although this is not a statistical proof, we do not rule out the possibility of adapting image prediction systems to video prediction and vice versa. This was also explored in some previous work, e.g., Liu et al. (2009), where systems are adapted to both tasks.
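For reference, the Pearson correlation coefficient used in this comparison can be computed as in the sketch below. The paired mAP values are hypothetical placeholders and do not reproduce the systems shown in Fig. 8; the function name is likewise an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-system scores: image mAP paired with video mAP.
image_map = [0.20, 0.22, 0.18, 0.23]
video_map = [0.15, 0.17, 0.14, 0.16]
rho = pearson(image_map, video_map)
```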

Long Versus Short Videos
The 2017.Video data include some videos that are longer than average (see Sect. 3.1), with an average duration of 11.4 seconds compared to around 1-2 seconds for the others. We analyze here how prediction performance differs between these two length groups. Results suggest that longer videos are predicted better: the average mAP@10 on the 1-2 s videos is 0.0562, while the average mAP@10 for the 11.4 s videos is 0.0751.

Reliability Analysis
We analyze the reliability of the MediaEval benchmark rankings for the Interestingness10k data. The general idea is to study how stable the rankings are by sampling the testing data set in different ways.
Systems are ranked using an evaluation metric based on comparing their responses to the ground truth for a set of queries $q \in Q$. If we denote the score achieved by system $A$ with $\lambda_{Q,A}$, and the score received by a different system $B$ with $\lambda_{Q,B}$, we say that system $A$ is better than system $B$ if $\lambda_{Q,A} > \lambda_{Q,B}$. If this ranking is reliable, it could be replicated with another set of queries $Q'$, so that $\lambda_{Q',A} > \lambda_{Q',B}$ still holds (Urbano et al. 2013).
Ranking stability was investigated by randomly sampling equally sized pairs $(Q', Q'')$ of query subsets from all testset queries $Q$. Next, the system rankings based on $Q'$ can be compared with those based on $Q''$. Urbano et al. (2013) suggest several reliability indicators for performing this comparison, and show that most of them are highly correlated. We selected two measures for this study, representative of two different types of measures: relative sensitivity (score-based) and Kendall's rank correlation (rank-based). In addition, we also calculate a weighted variant of Kendall's rank correlation.
Relative sensitivity $\delta_r$ is defined as the minimum difference $(\lambda_{Q',A} - \lambda_{Q',B}) / \max(\lambda_{Q',A}, \lambda_{Q',B})$ that needs to be observed with $Q'$ such that the differences with $Q''$ have the same sign at least 95% of the time. For a stable system, relative sensitivity tends to 0, and Sanderson and Zobel (2005) suggest $\delta_r = 0.25$ as a reasonable limit for judging reliability.
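One possible reading of this definition is sketched below: given relative score differences observed for the same system pairs on two query subsets, we search for the smallest observed threshold above which the sign agreement reaches 95%. The function names, the input format, and the threshold search are assumptions of this sketch and may differ from the procedure of Urbano et al. (2013).

```python
def relative_difference(score_a, score_b):
    """(a - b) / max(a, b) for non-negative scores such as mAP."""
    m = max(score_a, score_b)
    return 0.0 if m == 0 else (score_a - score_b) / m


def relative_sensitivity(observations, agreement=0.95):
    """Smallest threshold t such that, among observations with |d1| >= t,
    d1 and d2 share the same sign in at least `agreement` of the cases.

    observations: list of (d1, d2), the relative differences of one system
    pair on the first and second query subset. Returns None if no
    threshold satisfies the criterion.
    """
    candidates = sorted({abs(d1) for d1, _ in observations})
    for t in candidates:
        kept = [(d1, d2) for d1, d2 in observations if abs(d1) >= t]
        agreeing = sum(1 for d1, d2 in kept if d1 * d2 > 0)
        if agreeing / len(kept) >= agreement:
            return t
    return None


# Hypothetical observations: only the smallest difference flips sign.
obs = [(0.05, -0.02), (0.10, 0.08), (0.30, 0.25), (0.40, 0.35)]
delta_r = relative_sensitivity(obs)
```

In this toy example only differences of at least 0.10 are reproduced on the second subset, so `delta_r` comes out as 0.10.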
In contrast, Kendall's rank correlation $\tau$ considers only the systems' ranks and not their specific scores (Abdi 2007). It depends only on the number of inversions of pairs of objects that would be needed to transform the ranking induced by $Q'$ into the one induced by $Q''$. The value of $\tau$ ranges from 1 (identical rankings) to -1 (inverse ranking). Voorhees (1998) suggests $\tau = 0.9$ as a reasonable limit for judging the ranking reliable.
Finally, we also compute the weighted Kendall's rank correlation $\tau_w$ (Vigna 2015). Here, exchanges of highly ranked objects are considered more influential than exchanges of low ranked objects. We consider that this is well-motivated in this case, as the worst systems are performing essentially randomly, and their ranking can thus be deemed somewhat arbitrary. We used the additive hyperbolic weighting as suggested by Vigna (2015).
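The unweighted Kendall's rank correlation can be sketched by counting concordant and discordant system pairs, as below. This is the simple tau-a variant without tie correction; the function name and the example scores are assumptions, and the weighted variant of Vigna (2015) is not covered by this sketch.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same systems.

    A pair of systems is concordant when both lists order it the same way
    and discordant when the order is inverted; ties count as neither.
    """
    n = len(scores_a)
    if n != len(scores_b) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical mAP scores of three systems on two query subsets:
# the second and third systems swap places.
tau = kendall_tau([0.10, 0.20, 0.30], [0.10, 0.30, 0.20])
```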
The Interestingness10k testset data contain around 2400 video shots, which are extracted from 26 videos for 2016 and 30 for 2017. In order to have statistically independent subsamples, we opted to sample among the set of videos, as shots from the same video cannot be considered statistically independent. We have subsampled in decrements of one, so that if the total number of videos is $N$, we have proceeded to randomly generate pairs of $N-1$ movies, $N-2$, and so on. For each subsample size we report average scores calculated across 50 randomly generated pairs. Figure 9 shows the reliability scores for each data set and modality according to the official metric. In all plots, the horizontal axis indicates the subsampling percentage, while the vertical axis indicates the average reliability score. The reliability limits $\tau = 0.9$ and $\delta_r = 0.25$ are indicated with horizontal red dotted lines.
We can observe that $\tau \geq 0.9$ is reached with $N-1$ or $N-2$ subsampling for images, but not for videos. For videos, only the weighted variant $\tau_w$ barely reaches 0.9 at $N-1$ subsampling, indicating that video ranking was less reliable than image ranking. In contrast, the relative sensitivity limit, which also takes into account the score values, is easily reached in all cases even at lower sampling sizes (at 50% sampling or even smaller). The only exception is the 2017.Video data, where the limit is reached only when sampling 25 videos (83%). Finally, we can observe that both Kendall's scores tend to 1 and the relative sensitivity tends to 0 as the number of evaluated queries increases.
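The video-level subsampling used in this reliability analysis can be sketched as follows. We assume here that the two subsets of a pair are drawn independently from the full set of videos; the function names and the shot-to-video mapping are hypothetical.

```python
import random


def sample_video_subset_pairs(video_ids, subset_size, n_pairs=50, seed=0):
    """Generate n_pairs pairs of random video subsets of equal size.

    Sampling is done at video level, since shots from the same video
    are not statistically independent.
    """
    rng = random.Random(seed)
    ids = sorted(video_ids)
    return [(rng.sample(ids, subset_size), rng.sample(ids, subset_size))
            for _ in range(n_pairs)]


def shots_of(videos, shot_to_video):
    """Collect the shots that belong to the selected videos."""
    chosen = set(videos)
    return [shot for shot, vid in shot_to_video.items() if vid in chosen]


# Example with N = 26 videos and subsets of N - 1 videos.
pairs = sample_video_subset_pairs(range(26), subset_size=25)
# Hypothetical mapping from shot identifiers to video identifiers.
toy_shots = {"shot_0": 0, "shot_1": 0, "shot_2": 1, "shot_3": 25}
subset_shots = shots_of([0, 25], toy_shots)
```

System rankings would then be computed on each subset of a pair and compared with a reliability indicator, with the scores averaged over the pairs.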

State-of-the-art Deep Neural Networks
To account for current state-of-the-art deep neural network capabilities, we evaluate the performance of three recent image and video classification architectures, which were fine-tuned on the Interestingness10k data. For the image data, we selected the ResNeXt-101-32x48d (Xie et al. 2017), PNASNet-5 (Liu et al. 2018), and ResNet-50 (He et al. 2016) architectures, augmented with best practices as presented in Touvron et al. (2019); and for the video data, the GSM-InceptionV3 En3 (Sudhakaran et al. 2020), IR-CSN-152, and R(2+1)-18 (Tran et al. 2018) architectures. The achieved results are summarized in Table 6.
For image classification, we have followed the training protocol in Touvron et al. (2019). In this context, we have fine-tuned all three architectures starting from the provided weights, trained on 940 million public images with 1.5k hashtags matching 1000 ImageNet1K synsets (Yalniz et al. 2019) and fine-tuned on the ImageNet1K data set. We adopt the set of good practices proposed by the authors, namely data augmentation (resizing the images, random horizontal shift of the center crop, horizontal flip and color jittering), the inclusion of batch normalization layers, and classification of the images at several resolutions with averaging of the classification scores. The best results were achieved by FixResNeXt-101-32x48d, in both the 2016 and 2017 scenarios.
Fig. 9 Reliability scores of the system rankings: Year.Type represents the year of the data (2016 or 2017) and its type (Image or Video). X-axis is the subsampling percentage (sampling is performed at movie level) and y-axis is the reliability score. Relative sensitivity scores are marked with •; Kendall's tau and weighted Kendall's tau use distinct markers. The reliability limits for the scores $\tau = 0.9$ and $\delta_r = 0.25$ are indicated with horizontal red dotted lines. For reference, at 100% subsampling, we trivially have perfect reliability as we would compare identical subsets
Table 6 Performance of state-of-the-art deep neural network architectures when trained on the Interestingness10k data (bestME stands for best method from the MediaEval benchmark and bestSoA for the best method from the literature that was tested on these data)
We follow the good practices recommended by the authors, which include a random patch cropping strategy, variable clip length, and temporal jittering. For GSM-InceptionV3 En3 (Sudhakaran et al. 2020), we followed the training protocol provided by the authors, including the fusion of three variants of different clip lengths.
The best results were achieved by GSM-InceptionV3 En3, with a mAP score of 0.1738 for the 2016 data, and a mAP@10 score of 0.0821, for the 2017 data.
The analysis of the results shows that these deep neural networks do not achieve the best results. While in a few cases, e.g., FixResNeXt-101, the best results from the MediaEval benchmark have been surpassed, none of the tested networks managed to surpass the current state of the art in media interestingness. Given that the selected networks represent the current state of the art in their corresponding domains, i.e., image and video classification tasks (Goyal et al. 2017), the intuition is that more specialised approaches are required to cope with this highly subjective task.
To understand how deep learning algorithms interpret the visual samples and thus how they attempt to predict interestingness, we computed the Grad-CAM maps via Grad-CAM (Selvaraju et al. 2017) and Guided Backpropagation (Sprin-  Fig. 10. Results show that in many cases, the model focuses on the main subject, but predominantly more on elements adjacent to it, showing an inclination for detecting the context that surrounds the main subject. This is also true for human subjects, as the Grad-CAM analysis shows network activation on human faces, but also many times around the face. We theorize that this concentration of useful features on and around faces may represent a positive influence on the final results, as faces convey emotions.

Super-system Design
In this final experiment, we investigate the possibility of exploiting the power of many systems to create a state-of-the-art super-system. The idea is to use an automatic, ad-hoc fusion strategy to exploit the advantages of each individual system. We show that, although individual systems are powerful and declared state-of-the-art, greater performance can still be achieved by fusing system outputs. Though some of these state-of-the-art systems already include fusion strategies, our proposed ad-hoc fusion incorporates the entire set of systems used during the MediaEval competition, and therefore a larger set of system outputs. To achieve this goal, we investigate several standard approaches, such as late fusion and boosting, and, in the end, introduce a new fusion scheme based on a deep multilayer perceptron architecture with dense layers.

Evaluation Setup
Ensembling typically requires tens of systems to be able to boost the performance. In practice, it is basically impossible to implement or retrieve such a number of systems from the authors, considering also that they would have to be re-run in the very same conditions. There are also no best practices in this respect in the literature. The only approaches that do so use a very reduced number of inducers, e.g., less than 10 (Li et al. 2016). We therefore adopted a compromise that allows us to use all the system runs submitted to the MediaEval benchmark, by experimenting solely on the testset. We use two split scenarios: (i) 75% training and 25% testing (RSKF75), and (ii) 50% training and 50% testing (RSKF50). Split samples are randomized and 100 partitions are generated. The official metrics are computed as average values over these partitions.
Although this approach looks more disadvantageous than training the systems on the entire devset, because the number of training items is significantly lower, we consider the results a good lower-bound indicator of what the performance of late fusion would be. The following small experiment highlights the differences between the two training scenarios. We re-run our systems submitted to the MediaEval 2017 Interestingness task under the new testset split conditions. As expected, results for the RSKF75 split are better than the ones for the RSKF50 split. However, the drop in performance is significant when compared with the original results attained by training on the entire devset and testing on the testset. Thus, our system's mAP@10 results decreased from 0.0555 (original devset/testset) to 0.0295 (RSKF75) for the image data, and from 0.0732 (original devset/testset) to 0.0314 (RSKF75) for the video data.
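The repeated random splitting of the testset described in this setup can be sketched as follows. This is a plain random split without stratification, which is an assumption of the sketch; the function names, the seed, and the toy item count are illustrative.

```python
import random


def random_partitions(n_items, train_fraction, n_partitions=100, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)
    n_train = round(n_items * train_fraction)
    for _ in range(n_partitions):
        idx = list(range(n_items))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]


def average_over_partitions(metric_values):
    """Average a metric (e.g., mAP) computed once per partition."""
    return sum(metric_values) / len(metric_values)


# Example: 75% / 25% splits of a toy testset with 20 items.
rskf75 = list(random_partitions(n_items=20, train_fraction=0.75))
rskf50 = list(random_partitions(n_items=20, train_fraction=0.50))
```

A fusion model would be fitted on the training indices of each partition and scored on its test indices, with the official metric averaged over the 100 partitions.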

Approaches
We experiment with the following approaches: late fusion, boosting and the proposed MLP-based architecture, which are presented in the next sections.

Late Fusion
We investigate the possibility of using standard late fusion techniques (Kittler et al. 1996). We experiment with producing an aggregated visual interestingness score via the minimum (LFmin), maximum (LFmax), mean (LFmean), and median (LFmedian) values of all interestingness scores of all the individual systems.
We also investigate a learning strategy via a weighted mean of system outputs (LFweight), where the weights are determined by the rank of the system in comparison with the other systems. Given that some systems may negatively affect the aggregated prediction, we use only the top-$N$ systems, where $N$ is empirically determined. For each individual sample, the aggregated visual interestingness score is determined as $\sum_{i=1}^{N} w_i \cdot s_i$, where $N$ is the total number of systems taken into account, $w_i$ is the weight assigned to the $i$-th system according to its rank, and $s_i$ is its interestingness score. $N$ is set to 2, 3, 5, 10, 20, and the total number of systems. Weights are computed as $w_i = 1 - i \cdot \alpha$, where $\alpha$ is varied between 0.01 and 0.5.
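The late fusion rules described above can be sketched in a few lines. The function names and the example scores are assumptions; the weighted variant follows the rank-based weights as stated in the text, with rank 1 denoting the best system, and is left unnormalized as in the formula.

```python
from statistics import mean, median


def late_fusion(scores, rule):
    """Aggregate the scores of all systems for one sample.

    rule is one of "min", "max", "mean", "median".
    """
    rules = {"min": min, "max": max, "mean": mean, "median": median}
    return rules[rule](scores)


def weighted_fusion(scores_by_rank, top_n, alpha):
    """Rank-weighted fusion over the top_n systems.

    scores_by_rank[0] is the score of the best-ranked system. The weight
    of the system at rank i (1-based) is 1 - i * alpha.
    """
    top = scores_by_rank[:top_n]
    return sum((1 - i * alpha) * s for i, s in enumerate(top, start=1))


# Hypothetical scores of three systems for one sample, best system first.
sample_scores = [0.9, 0.5, 0.1]
lf_median = late_fusion(sample_scores, "median")
lf_weight = weighted_fusion(sample_scores, top_n=2, alpha=0.1)
```

Because the same weights are applied to every sample, the missing normalization does not change the resulting ranking of samples.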
Overall, LFweight had the best performance. For the RSKF50 configuration, the 2017.Video data represent the exception, where LFmean had better results, with a mAP@10 of 0.0872. LFweight performed best in the following situations: on 2016.Image data, mAP of 0.2499 (using top N = 10 systems, α = 0.08), on 2016.Video data, mAP of 0.1915 (using top N = 10 systems, α = 0.1), and on 2017.Image data, mAP@10 of 0.1567 (using top N = 20 systems, α = 0.06). For the RSKF75 configuration, LFweight performed best on: 2016.Image data, mAP of 0.2519 (using top N = 2 systems, α = 0.25), on 2016.Video, mAP of 0.1929 (using top N = 10 systems, α = 0.09), on 2017.Image data, mAP@10 of 0.1532 (using top N = 10 systems, α = 0.11), and finally on 2017.Video, mAP@10 of 0.0893 (using top N = 10 systems, α = 0.08). While the late fusion combinations outperformed the MediaEval best results, in some cases, e.g., on 2017.Image and 2017.Video data, there are state-of-the-art systems that had better scores. Figure 11 presents the comparison of the best two performing late fusion systems with the other approaches.

Boosting
Boosting schemes are widely used for enhancing the performance of weak learners by aggregating them into a stronger classifier (Han et al. 2016; Opitz et al. 2017; Son et al. 2015). We experimented with two established strategies, namely AdaBoost (Freund and Schapire 1997) and Gradient Boosting (Friedman 2001). We experimented with various combinations of systems based on their individual performance, from the worst performers to the best ones.
AdaBoost performed best under the RSKF75 configuration on 2016.Image data, mAP of 0.2677 (aggregating systems ranked 8 to 10), on 2017.Image data, mAP@10 of 0.1674 (aggregating systems ranked 5 to 19), and on 2017.Video data, mAP@10 of 0.1129 (aggregating systems ranked 19 to 21). Under the RSKF50 configuration, the best results are on 2016.Video data, mAP of 0.1987 (aggregating systems ranked 8 to 19). Gradient Boosting performed best under the RSKF50 configuration on 2016.Image data, mAP of 0.2463 (aggregating systems ranked 1 to 20), and on 2017.Video data, mAP@10 of 0.0961 (aggregating systems ranked 15 and 16). Under the RSKF75 configuration, the best results are on 2016.Video data, mAP of 0.2209 (aggregating systems ranked 4 to 7). Overall, under the RSKF75 configuration, boosting systems surpassed both the best MediaEval results and the state-of-the-art results, while under the RSKF50 configuration, some state-of-the-art systems had better results. Figure 11 presents the comparison of the best two performing boosting systems with the other approaches.

Proposed MLP Architecture
We introduce a simple, yet efficient, fusion scheme that uses a deep MLP architecture. Our approach is motivated by the property of dense layers to at least weakly discover patterns and correlations between the individual systems' decisions. We aim to model the bias learned by each system and the correlations between the biases, to perform retrieval robustly and improve the overall performance of the aggregated system.
After experimenting with several architectures, we settled on the following configuration of 10 layers: 5 dense layers (ReLU activation) with a batch normalization layer between each consecutive pair (4 in total), followed by a single-layer linear perceptron (sigmoid activation) that infers the final interestingness score. The architecture of the network is depicted in Fig. 12.
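The layer arrangement described above can be sketched as a forward pass in plain Python. This is an illustrative sketch only: the hidden width of 8 units, the Glorot-style initialization, the use of batch statistics in the normalization step, and all function names are assumptions, since the text specifies only the number and type of layers.

```python
import math
import random


def dense(batch, weights, biases):
    # batch: list of input rows; weights: [n_in][n_out]; biases: [n_out]
    return [[sum(x * w_row[j] for x, w_row in zip(row, weights)) + biases[j]
             for j in range(len(biases))] for row in batch]


def relu(batch):
    return [[max(0.0, v) for v in row] for row in batch]


def batch_norm(batch, eps=1e-5):
    # Normalize each feature over the batch (no learned scale/shift in this sketch).
    n = len(batch)
    cols = list(zip(*batch))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [[(v - m) / math.sqrt(var + eps)
             for v, m, var in zip(row, means, variances)] for row in batch]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_layer(n_in, n_out, rng):
    limit = math.sqrt(6.0 / (n_in + n_out))  # assumed Glorot-uniform initialization
    weights = [[rng.uniform(-limit, limit) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out


def build_fusion_mlp(n_systems, width=8, seed=0):
    rng = random.Random(seed)
    sizes = [n_systems] + [width] * 5  # 5 dense hidden layers (width is hypothetical)
    hidden = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
    output = init_layer(width, 1, rng)  # final single-unit sigmoid layer
    return hidden, output


def forward(model, batch):
    hidden, (w_out, b_out) = model
    x = batch
    for i, (w, b) in enumerate(hidden):
        x = relu(dense(x, w, b))
        if i < len(hidden) - 1:  # 4 batch normalization layers in total
            x = batch_norm(x)
    return [sigmoid(row[0]) for row in dense(x, w_out, b_out)]


# Two samples, each described by the scores of 3 input systems.
model = build_fusion_mlp(n_systems=3)
predictions = forward(model, [[0.1, 0.5, 0.9], [0.8, 0.2, 0.4]])
```

Counting the 5 dense layers, the 4 batch normalization layers and the output layer gives the 10 layers mentioned above.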
In the training phase, the network takes as input the interestingness prediction scores of the systems to be aggregated, to learn complex joint decisions. All trainable weights of the network are optimized together using stochastic gradient descent with the Adam optimizer (Kingma and Ba 2015), with the following parameters: lr = 0.001, β1 = 0.9, β2 = 0.999, ε = 1e-08. The loss function is set to the standard binary cross-entropy. The network was trained for 200 epochs with a batch size of 64. To simulate the benchmark scenario, we optimize the network according to accuracy and test it with the official benchmark metrics, while creating splits on the test set in both RSKF50 and RSKF75 configurations. Given a new set of images/videos, the network treats the input systems as untrained raters to model the common visual interestingness level shared between them.
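For reference, the two training ingredients named above, the binary cross-entropy loss and the Adam update rule, can be written out as follows. This is a minimal Python sketch over flat lists of parameters, using the hyperparameter values reported in the text as defaults. The function names, the `state` dictionary and the clipping constant inside the loss are assumptions made for the example.

```python
import math


def binary_cross_entropy(y_true, y_pred, clip=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, clip), 1.0 - clip)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def adam_step(params, grads, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-08):
    """One Adam update over a flat list of parameters.

    state holds the step counter "t" and the moment estimates "m" and "v".
    """
    state["t"] += 1
    t = state["t"]
    updated = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Exponential moving averages of the gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction for the zero-initialized moments.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        updated.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return updated


# One update of a single parameter with gradient 0.5.
state = {"t": 0, "m": [0.0], "v": [0.0]}
new_params = adam_step([1.0], [0.5], state)
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
```

On the first step, the bias-corrected update has a magnitude close to lr regardless of the gradient scale, which is one reason Adam is relatively insensitive to the scaling of the input scores.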
We analyze the results obtained with the investigated fusion schemes and with the proposed MLP-based system, and compare them with the best systems from the MediaEval benchmark and from the literature. Results are summarized in Fig. 11. Overall, the aggregated systems clearly provide better results than the best individual systems. This is expected, given that they exploit the advantages of several different systems. However, the proposed MLP-based learning strategy allows for a significant boost in performance. On 2016.Image and 2016.Video data, it improves the best results from a mAP of 0.2485 to 0.3459, and from 0.1815 to 0.2985, respectively. On 2017.Image and 2017.Video data, it improves the best results from a mAP@10 of 0.156 to 0.2646, and from 0.093 to 0.3202.
Fig. 11 Super-system design: baseline is a random ranking where samples are ranked randomly 5 times and mAP averaged, bestME and bestSoA are the best performers from the MediaEval benchmark and from the literature, respectively (in particular, these are trained on the entire devset), LF stands for late fusion, boost for boosting, and MLP is the proposed Multi-Layer Perceptron scheme. We indicate the type of dataset split for the presented results: orig indicating the original split and RSKF50 and RSKF75 indicating the two generated splits. Results presented for RSKF50 and RSKF75 are computed as average values over 100 random partitions
Fig. 12 Overview of the proposed MLP-based fusion scheme: 1 input layer followed by 4 pairs of dense/batch normalization (BN) layers, 1 dense layer, and 1 single-layer linear perceptron used for predicting the final interestingness score
The improvement is dependent on the amount of training data used for the MLP, the top results being obtained for the RSKF75 configuration. Nevertheless, good improvement is achieved in the RSKF50 configuration as well.

Limitations
To understand the limitations of our approach, we empirically analyzed the results and discuss here some of the common misclassification cases. For certain types of visual samples, the inducers that we use as input to our fusion system display a correlated positive or negative bias, and the late fusion approach is not able to suppress this bias. Figure 13 illustrates some examples. We observe a number of darker interesting images that are incorrectly classified as non-interesting, with their interestingness score often being lower than 0.1. This may be the result of the inducer algorithms not having enough visual information to correctly score these particular samples. Among the false positive examples, some outdoor non-interesting images (compared to their class representatives), usually containing groups of people, are assigned a high interestingness score, typically greater than 0.5. This may indicate that the inducer algorithms tend to pay more attention to visual samples that contain people and are therefore biased in those particular cases.

Conclusions and Open Questions
The prediction of visual interestingness is a research topic of increasing importance in the multimedia community, with practical applications in advertising, social media, education, media recommendation and many more. In this work, we introduced a publicly available, common evaluation framework for image and video visual interestingness prediction. It consists of a robust data set, with 9831 images and more than 4 h of video, and interestingness scores determined from over 1M pair-wise annotations of 800 trusted annotators.
To account for baseline systems, we provide an in-depth analysis of the crucial components of visual interestingness prediction by reviewing the capabilities and the evolution of 192 validated systems (129 from the MediaEval benchmark and 63 state-of-the-art systems from the literature). We analyze overall capabilities, influence of the employed features and techniques, and generalization capabilities. For the 129 ranked systems of the MediaEval benchmark, computed relative sensitivity, Kendall's rank correlation, and weighted Kendall's rank show good reliability of the results. We also discuss the possibility of going beyond state-of-the-art performance via an automatic, ad-hoc system fusion, and propose a deep MLP-based architecture that outperforms the state-of-the-art systems by a large margin.
We summarize below the most important lessons learned and insights gained, as well as identify the remaining open questions and perspectives.

Overall System Performance
Over the analyzed systems, without taking into account our system fusion experiments, the highest precision for image visual interestingness prediction is obtained via a learning-to-rank DNN using both deep features and a deep ranking approach (Parekh et al. 2018), with a mAP of 0.3125. For video, the highest precision is achieved via a late fusion between a learning-to-rank Siamese network and a reinforcement ranking based on a Markov decision process, using both visual and audio descriptors, with a mAP of 0.2228. Globally speaking, results are modest compared to other classification and regression tasks, and are similar to the ones achieved in early object classification, e.g., see early TRECVid campaigns. As the results show, video prediction is more challenging than image prediction. Data, annotations and techniques still need to evolve to address this subjective task. Nevertheless, there is an ascending trend, as performance increased significantly over the years: in 2016, the best mAP is 0.2336 for images (Liem 2016) and 0.1815 for videos (Almeida 2016); in 2017, it reaches 0.3075 for images (Permadi et al. 2017) and 0.2094 for videos (Ahmed et al. 2017); and in 2018, 0.3125 for images (Parekh et al. 2018) and 0.2228 for videos. Progress is therefore continuous.

Content Representation and Methods
It is interesting to see the rich diversity of approaches, from data representation to the prediction methods. The most popular content description categories are unimodal representations, accounting for 72% of the analyzed methods, followed by deep features (employed alone or as part of multimodal and fusion systems) with 59%. The most popular methods are SVMs used as classifiers, accounting for 30% of the analyzed methods, followed by DNNs with 28%, and ranking techniques with 13%. The best performing system is, of course, a combination of the two, i.e., description scheme and prediction approach. Some of the best image prediction systems are proposed by Parekh et al. (2018). They use a unimodal approach via AlexNet fc7 layer features and a learning-to-rank DNN, achieving a mAP of 0.3125. For video, the best systems are the approaches of Wang et al. (2018). They use either LBP-based features alone, or an early fusion of deep features extracted from InceptionV3, AlexNet and C3D with traditional visual and audio features, together with Siamese networks, achieving a mAP from 0.2131 to 0.2228. We would also like to highlight the best performing SVMs: for images, Permadi et al. (2017), via a polynomial kernel SVM, mAP of 0.3052, and for video, Ahmed et al. (2017), via a linear kernel SVM, mAP of 0.2122. The best performing ranking approaches are, for images, Almeida and Savii (2017), via RankBoost, mAP of 0.271, and for video, Almeida and Savii (2017), via rankSVM, mAP of 0.1877. It is worth noting that, depending on the data, state-of-the-art results are not necessarily obtained using deep learning, although deep learning is predominant.

Generalization Capabilities
Annotated data are too scarce to meet the requirements of current deep neural networks. Despite the efforts to release more and more annotated data, this is not sustainable in the mid term. Systems have to find alternate solutions for training the algorithms. Unsupervised techniques, although very appealing, are still too incipient for this type of subjective task. A viable immediate alternative is to borrow data from adjacent domains and use transfer learning techniques. We noticed an encouraging trend in this direction. 45% of the analyzed systems used at least a pre-trained extraction generalization scheme, i.e., systems are trained on data unrelated to interestingness, like object recognition data sets, and used directly, usually as features in a classifier, to predict interestingness. These systems were the overall state-of-the-art performers. 9% of the systems went further and used fine-tuning approaches, i.e., systems are first trained on data unrelated to interestingness and then retrained on the Interestingness10k data. For image prediction, Erdogan et al. (2016) obtain the best performance via a fine-tuned AlexNet network, with a mAP of 0.2125. For video prediction, Ahmed et al. (2017) obtain the best performance via a video genre classification system, with a mAP of 0.2094. Significantly fewer, 3%, are correlated approaches, i.e., systems trained on external data from correlated domains, e.g., memorability or aesthetics, which are then used to predict interestingness. For image prediction, the best approach uses a social interestingness prediction system trained on Flickr data (Shen et al. 2016), mAP of 0.2336. For video prediction, the best approach uses SentiBank features, based on emotional content, achieving a mAP of 0.154.

Ad-hoc Fusion
Another important observation is that, regardless of how good a system is, fusing the results of several systems, even ones with average individual performance, increases performance. After experimenting with several fusion techniques, namely standard late fusion of system scores, boosting techniques that use weak learners, and a proposed deep MLP-based system fusion, we were able to boost performance in almost every situation. The proposed MLP system achieves a maximum improvement of 105% on image prediction over state-of-the-art results, improving mAP@10 from 0.156 to 0.3202, and of 184% on video prediction, improving mAP@10 from 0.093 to 0.2646. The inherent disadvantage is the significantly higher computational complexity of the aggregated system. However, good performance was obtained by fusing relatively few systems, on the order of tens, e.g., 30-40. With current hardware acceleration and parallel computing, this is a feasible alternative.
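The relative improvements quoted above follow from the standard formula (improved - baseline) / baseline. A short Python check, where the function name is chosen for this example only, is:

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Score pairs as quoted in the paragraph above.
gain_image = relative_improvement(0.156, 0.3202)
gain_video = relative_improvement(0.093, 0.2646)
print(int(gain_image), int(gain_video))
```

Truncating the two values to integers gives the 105% and 184% figures reported in the text.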

Recommendations to System Performance
During our analysis, some approaches stood out when compared with the others. For instance, when analyzing modalities, deep and traditional visual features show promising results. However, a more obvious outlier is represented by late fusion systems. On average, the performance of such systems was better, both for image and for video data (average mAP over the analyzed systems of 0.2416 and 0.1878, respectively). This observation is reinforced by the good performance of hybrid classifiers on video data, which use more than one type of classifier (as presented in Sect. 5.3.2), and also by the top performance of our proposed late fusion MLP system. The intuition is that this may be an effect of the inherently subjective and multi-modal nature of interestingness. Furthermore, while deep learning-based systems do not necessarily represent the state-of-the-art performers, they do present some interesting results. For instance, when analyzing the average performance of the method categories, deep neural networks achieved the highest average score for image data. Regarding the performance of modern DNN approaches, tested in Sect. 6, while these methods do not outperform the state of the art, some of these networks, such as GSM-InceptionV3-En3 (Sudhakaran et al. 2020) and FixResNeXt-101-32x48d (Touvron et al. 2019), achieve very high scores. Finally, some good training practices are studied in Sects. 5.3.2 and 5.4.1. For instance, when extracting features from a deep semantic embedding model, Vasudevan et al. (2016) achieve better results when fine-tuning the semantic model with Interestingness10k data, as opposed to directly extracting the embeddings. Other good practices involve using external data from correlated domains like social interestingness and emotional content. This type of data augmentation, paired with data upsampling on Interestingness10k images, contributed, for example, to the best mAP score on 2016.Image data achieved during the MediaEval competition (Shen et al. 2016).

System Performance
Although a great many methods have been tried, with various feature representations, fusion techniques and transfer learning, top performance is only around a mAP of 0.31 and 0.22, for image and video prediction, respectively. Current performance on video prediction is significantly lower than for images. Video prediction is still incipient and requires significant improvements. At the annotation level, one lead is to deepen the understanding of the concept of interestingness and visual information by exploring more related subjective concepts. Psychological user studies revealed many concepts related to interestingness that have great potential in improving its understanding, e.g., novelty, coping potential, complexity, comprehensibility. Interestingness prediction is a multifaceted problem and should be approached from a more interconnected perspective. For a comprehensive analysis of the correlation between interestingness and other concepts, from the psychological, experimental and computer vision points of view, we refer the reader to Constantin et al. (2019). At the methods level, temporal information remains largely unexplored for video prediction. Therefore, a future lead is to augment prediction models using temporal-based models, whether they are based on new DNN architectures or on temporal aggregation of features, for better encoding of video information. Another lead is to explore the attention mechanism in DNN architectures, so as to focus the interestingness prediction on certain regions of the image and video. A small region of an image may attract more of the viewer's interest than the image as a whole.

Ground Truth Data
Another open challenge is the generation of meaningful training data. Deep learning models again proved to be state-of-the-art performers; therefore, there is a need for more annotated data. Given the subjectivity of the task, the annotation is not as straightforward as, for example, object annotation. Everybody understands what a chair or a tree looks like, but what is interesting is not the same for everybody. This is clearly visible in the Interestingness10k annotations. Although we used expert annotators, i.e., human assessors that were given thorough guidance on the task and scientific problem, the annotator agreement was average to good, with a kappa value of 0.556 and 0.519, for images and videos, respectively. The annotation mechanism, e.g., pairwise comparisons, user studies, especially for videos, should be further investigated and, again, perhaps explored in correlation with other subjective properties.

Unsupervised Learning
Unsupervised generation of data has proved feasible for many classification systems. Significant progress has been made via auto-encoders and generative adversarial networks (GAN). However, it has not yet been explored for the generation of images according to how they are perceived. The closest experiments are for generating human faces with different emotions. This would be a pioneering direction to explore, i.e., training GANs to automatically generate data with different levels of interestingness.