Exploring Deep Fusion Ensembling for Automatic Visual Interestingness Prediction

In the context of the ever growing quantity of multimedia content from social, news and educational platforms, generating meaningful recommendations and ratings now requires a more advanced understanding of their impact on the user, such as their subjective perception. One of the important subjective concepts explored by researchers is visual interestingness. While several definitions of this concept are given in the current literature, in a broader sense, this property attempts to measure the ability of audio-visual data to capture and keep the viewer’s attention for longer periods of time. While many computer vision and machine learning methods have been tested for predicting media interestingness, overall, due to the heavily subjective nature of interestingness, the precision of the results is relatively low. In this chapter, we investigate several methods that address this problem from a different angle. We first review the literature on interestingness prediction and present an overview of the traditional fusion mechanisms, such as statistical fusion, weighted approaches, boosting, random forests or randomized trees. Further, we explore the possibility of employing a stronger, novel deep learning-based, system fusion for enhancing the performance. We investigate several types of deep networks for creating the fusion systems, including dense, attention, convolutional and cross-space-fusion networks, while also proposing some input decoration methods that help these networks achieve optimal performance. We present the results, as well as an analysis of the correlation between network structure and overall system performance. Experimental validation is carried out on a publicly available data set and on the systems benchmarked during the 2017 MediaEval Predicting Media Interestingness task.


Introduction
Given the prevalence of multimedia data associated with the current online environment and the immense quantity of data uploaded by both amateur and professional content creators, the need for in-depth understanding of the uploaded data has emerged. Automatic classification and recommendation systems that correctly understand both user preferences and the quality of the hosted multimedia content are needed in order to help users navigate online platforms. The research and development communities are currently giving increasing attention to the study of subjective content properties, seeking to understand how visual content affects viewers and to tune their algorithms accordingly. This represents a shift in research focus from previous directions, such as understanding the content of images and videos via objective properties such as object detection [1] and scene classification [2].
Visual interestingness represents one of the most popular concepts currently being studied, being defined as the capacity of "holding or catching attention" in the Oxford Dictionary of English [3]. Berlyne's initial studies in psychology [4] show that interest heavily influences human behaviour and motivation, while more recent works that study the interestingness of images [5] show that interest and the willingness to view and study a media sample are positively correlated. Many researchers also point out the importance of other factors in creating and maintaining interest [6,9], like novelty, coping potential, arousal and aesthetic quality. From an emotional perspective, Silvia [6,7] includes interest among the class of emotions that relate to comprehension, exploration and learning. In this context, it is easy to understand why researchers and developers are starting to focus their efforts on the prediction of multimedia interestingness. An interestingness value assigned to each media item can represent the difference between a video being recommended to users if it fits their viewing profile and being forgotten, and the accurate assessment of this subjective concept can generate more user engagement and satisfaction. On the other hand, it would represent a useful tool for content creators, be they online creators, professors selecting their media samples for classes or advertising agencies, as it could select the most appropriate media samples for distribution out of a large collection of images and videos. Finally, it is important to note that in the current literature the notion of "interestingness" is used to describe two different concepts: social interestingness, which is usually related to social media concepts like popularity and virality, and visual interestingness, which is defined as the capacity of media samples to attract and maintain viewer attention.
Previous work in this domain has shown these concepts to be both positively [10] and negatively [11] correlated; the link between the concepts is therefore still an open research direction. However, throughout the rest of this chapter, we will use "interestingness" as a synonym for visual interestingness.
In this chapter we explore the possibility of employing a set of ensembling methods for interestingness prediction, by implementing deep neural networks as the primary ensembling function. To the best of our knowledge, this type of approach presents a high degree of novelty, as deep neural networks are used as inducers in the current state-of-the-art literature, not as the primary ensemble function. Our approach consists of several architectures that include dense, attention, convolutional and the novel cross-space-fusion layers, as well as two input decoration methods that help analyze correlations between similar inducers. Our methods are tested on the publicly available Interestingness10k dataset [19], validated during the 2017 MediaEval Predicting Media Interestingness task [13]. With regard to media interestingness, [8] represents an in-depth literature review of interestingness and covariate concepts, analyzing these concepts and their correlations from psychological, user-centric and computer vision perspectives, while [19] represents a review of the MediaEval Predicting Media Interestingness task, analyzing the best practices, methods, user annotation statistics and the data itself. From an ensembling perspective, three papers introduce some of the deep neural network architectures that we will deploy in this work: [30,19,31]. The code corresponding to the proposed methods we will present is available online, developed in Python 3 using the Keras 2.2.4 and Tensorflow 1.12 libraries.
The rest of this chapter is organized as follows. Section 2 analyzes the current state-of-the-art, with regard to both interestingness prediction and late fusion systems. In Section 3 we present the methods we propose for media interestingness prediction. Section 4 presents the results and their analysis, pointing out trends and general suggestions with regard to system performance. Finally, Section 5 concludes the chapter and discusses future developments.

Previous Work
This section discusses and analyzes the current state-of-the-art with regards to two main topics: the advances in the prediction and classification of media interestingness and the most important late fusion methods currently used in the literature, while also presenting some arguments that advocate the deployment of late fusion schemes for interestingness prediction.

Media Interestingness
From a computer vision perspective, media interestingness prediction, usually referring to prediction in image or video samples, is gaining considerable traction in the community, with a significant increase in the number of papers published on this subject in recent years [19]. However, this is still considered an open research direction, as methods that improve results are constantly being published. One of the main difficulties in predicting interestingness comes from the subjectivity of interest among human annotators. Consequently, lower annotator agreement and a lesser degree of separation between interesting and non-interesting samples may be expected when designing a media interestingness dataset or computer vision methods that tackle this issue. Several methods of measuring interest in humans have been used. For example, for the Interestingness10k [19] dataset, annotators are shown pairs of images or videos and are asked to select which of the two samples is more interesting for them, and asked to also consider that "the selected video excerpts/key-frames should be suitable in terms of helping a user to make his/her decision about whether he/she is interested in watching a movie" [52].
MediaEval benchmark: https://multimediaeval.github.io/. Code for the proposed methods: https://github.com/cmihaigabriel/DeepFusionSystem_v2.
Early works in interestingness prediction employ several types of traditional visual features. Gygli et al. [20] use novelty, aesthetics and general preference as cues for image interestingness. Novelty is encoded with the help of a Local Outlier Factor approach, aesthetics via a set of descriptors that encode colorfulness, arousal, complexity, contrast and edge distribution, and general preference is computed by analyzing raw RGB (Red-Green-Blue color space) values, SIFT [47] and GIST [48] features and color histograms. For the prediction of video interestingness, Jiang et al. [21] use visual, audio and high-level attributes in a Ranking-SVM (Support-Vector Machine) approach. The authors show that the multi-modal fusion of audio and visual features, consisting of color histograms, SIFT, GIST, MFCC [49], Self-Similarities [50], and Spectrogram SIFT [51], obtains the best result, with a prediction accuracy of 71.4%. Similar methods, which calculate different concepts with the help of traditional descriptors, are also used by Grabner et al. [22]. The performance of Sentiment features [23] and C3D models [24] is compared by Gygli and Soleimani [10], and, interestingly, sentiment features achieve better results, with a Spearman's rank correlation of $\rho = 0.53$. Another interesting conclusion comes from Fan et al. [25], showing that the fusion of several sources of data improves system performance.
While these studies present interesting approaches, it is difficult to compare them and propose a set of ideas that would increase the chances for a good performance, given their use of different datasets, splits and development conditions. In this context, the MediaEval 2016 and 2017 Predicting Media Interestingness competitions [12,13] address this problem, by creating a common evaluation framework, consisting of a dataset of images and videos with human-annotated interestingness values, common splits and evaluation metrics for the participating teams and open availability for the data. A large number of systems were submitted to the two editions of the benchmarking competition (60 systems for the image task and 69 for the video task), and further systems were published outside of the competition, in state-of-the-art papers (17 image processing systems and 46 video processing systems) [19]. While there are many diverse approaches, one noteworthy aspect is that the top results for both tasks can be considered rather low, especially when compared with other more traditional and objective tasks such as object detection or scene classification. For example, the best results achieved during the benchmarking competitions with regard to the official metric, Mean Average Precision (MAP), are MAP = 0.3075 in the image prediction task, by Permadi et al. [26], and MAP = 0.2094 in the video prediction task, by Ben-Ahmed et al. [27]. These results are further improved outside of the competition, Parekh et al. [28] obtaining a result of MAP = 0.3125 for the image task and Wang et al. [29] obtaining MAP = 0.2228 for the video task. However, a study on the annotation process published by Constantin et al. [19] shows that human annotators also do not achieve near-perfect scores, considering that the best performing annotators never scored above MAP = 0.7. This further reinforces the idea that the subjectivity of such a task represents one of its main challenges.
While the approaches are diverse and a large number of systems are used for image and video predictions in the context of the MediaEval competition, one of the noticeable trends is that many of the top performing systems use some sort of fusion scheme. In general fusion is defined as "a technology to enable combining information from several sources in order to form a unified picture" [53], therefore it involves combining the power of multiple detection systems in order to create a better final system. For the methods analyzed in this context, fusion is applied at feature level (also called early fusion), at decision level (also called late fusion or ensemble learning) or a combination of the two.

Ensembling Systems
Late fusion, also known as ensembling or decision-level fusion, consists of a set of initial predictors, called inducers, that are trained and tested on the dataset, and whose prediction outputs are combined in the final step in order to create a new and improved set of predictions. These systems have a long history and are shown to be particularly useful in scenarios where the performance of single-system approaches is not considered satisfactory. While their usefulness is proven even in some traditional tasks, such as video action recognition [32], recently there is a noticeable trend of employing such approaches in subjective tasks that seek to analyze the human perception of multimedia data. Some examples of this trend include the prediction of media memorability [33], violence detection in videos [34], emotional content analysis [35], and media interestingness prediction [29].
One important theoretical aspect of ensembling systems is formulated by Wolpert [36], stating that, given an ensemble of inducers, trained in a similar way, it is improbable that the prediction outputs of these inducers are completely uncorrelated. Thus, promoting a high level of diversity in the inducer set may improve the final result of the ensemble. Recently, Liu et al. [37] show that ensemble error may decrease as the inducer error decreases and inducer diversity increases. These aspects and many more are analyzed in depth in several ensembling literature review papers [39,38].
Regarding the ensembling functions, i.e., the methods that are used in combining inducer prediction outputs, while there is a high variety among them, deep neural networks still represent a novelty for this domain. To the best of our knowledge, our work in using deep neural networks as the primary ensembling function is one of the first attempts in this direction. So far ensembling functions are dominated by simple statistical methods [46], such as late fusion via weighted arithmetic mean calculation, voting systems, etc. Other more complex approaches employ methods that require an initial learning step, including Boosting approaches such as AdaBoost [40], Gradient Boosting [41] or XGBoost [42], Bagging [43] or Random Forests [44]. While these approaches have been successfully implemented in several tasks, our assumption is that, with the introduction of deep neural networks as the main ensembling function, late fusion results will significantly improve. In our work we will use two approaches as comparison baselines for our proposed prediction method, namely statistical methods and boosting.
One example of a statistical approach is weighted late fusion. Under this scheme, given a set of inducer methods, $F = [f_1, f_2, \ldots, f_N]$, that create a set of prediction outputs denoted $Y = [y_1, y_2, \ldots, y_N]$, the goal of a weighted late fusion approach is to create a set of weights, $W = [w_1, w_2, \ldots, w_N]$, that, once applied to the prediction outputs $Y$, yield a better predictor for the dataset that is being studied. In other words, weighted late fusion creates a new prediction output, denoted $\hat{y}$, computed as a weighted combination of the inducer outputs $y_j$ with the weights $w_j$. The goal of this approach is to minimize the prediction error $e$, so that the error of the new prediction output satisfies $\hat{e} < e_j$, $j \in [1, N]$. Several types of strategies can be employed in choosing the values of $W$. The most common strategy involves ordering the vector $F$ according to inducer performance, i.e., $e_1 < e_2 < \ldots < e_N$. This allows systems to assign higher weights to better inducers, thus making sure that the top performing inducers dictate the final result. Working under the assumption that the vector is ordered, several weighting schemes can then be defined.
Boosting approaches represent another important class of ensemble learning techniques. In general, boosting can be defined as an iterative way of adding inducers into a final ensemble system, while updating the weights assigned to each inducer as more inducers are added to the system. While there are major differences between different boosting approaches, such as AdaBoost and Gradient Boosting, the overarching idea is the sequential training of inducer weights, i.e., trying to adjust the learning process so that it can correct preceding errors.
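To make the weighted late fusion scheme described above concrete, the following is a minimal sketch and not the chapter's implementation. It assumes the fused output is the weighted arithmetic mean of the inducer outputs, and it uses a hypothetical rank-based weighting (`rank_weights`, weight 1/k for the inducer ranked k-th) purely for illustration, since the chapter's own weighting schemes are not reproduced here.

```python
def weighted_late_fusion(predictions, weights):
    """Weighted arithmetic mean of the inducer outputs for one sample."""
    if len(predictions) != len(weights):
        raise ValueError("one weight per inducer is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * y for w, y in zip(weights, predictions)) / total


def rank_weights(n):
    """Hypothetical scheme: the inducer ranked k-th (best first) gets 1/k."""
    return [1.0 / k for k in range(1, n + 1)]


# Outputs of three inducers for one sample, ordered best inducer first.
# The values are made up for illustration.
sample_outputs = [0.9, 0.6, 0.3]
fused = weighted_late_fusion(sample_outputs, rank_weights(3))
```

With equal weights the function reduces to the plain average, which is the simplest statistical fusion baseline.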
AdaBoost identifies weaknesses in the inducers in each learning step, represented by misclassified data points, and assigns higher internal weights to those points, under the assumption that this will allow the next classifiers in the ensembling scheme to correct these errors. Therefore, given a set of data points, $x_i$, $i \in [1, M]$, initially all the weights for these data points are set to $w_i = 1/M$. The total error $e_j$ is then calculated for each individual inducer $f_j$, $j \in [1, N]$, with the help of a function $I$ that outputs 1 for a true positive or negative prediction and 0 for a false positive or negative one, while $C$ represents the new classification rule created by the ensembling scheme. Also, given the factor $\alpha_j$ for each inducer, the system updates the data point weights accordingly. Thus, considering $K$ as the set of the possible prediction classes associated with the prediction task, the new output is obtained by applying the classification rule $C$ over the classes in $K$.
Gradient boosting, on the other hand, does not focus on individual data points, but on finding the difference between prediction sets and ground truth data. Therefore, the goal of this method is the minimization of the loss function $L(\hat{y}, g)$, where $\hat{y}$ represents the prediction output of the method, while $g$ represents the ground truth values for the given samples. Practically, the goal is to create a new ensembling function $\hat{F}$ that best approximates the ground truth of the dataset. While going through consecutive calls of the training loop, gradient boosting methods seek to apply gradient descent for optimizing the ensembling result. The final version of the ensembling function $\hat{F}$ can therefore be expressed as a weighted sum computed over a set of approximation functions $h$, starting from the initial version $\hat{F}_0$ of this function, where $T$ represents the number of training steps. At each step, the function is updated based on its previous values.
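For reference, the two boosting schemes discussed above are commonly written as follows. These are the standard textbook forms (the multi-class SAMME variant of AdaBoost and generic gradient boosting), given here as an assumption about what the text describes and not as a reproduction of the chapter's own equations; the notation ($g_i$ for the ground truth label of $x_i$, $\gamma_t$ for the step weights) is chosen for this sketch. In these forms the indicator $I(\cdot)$ equals 1 when its argument holds and 0 otherwise.

```latex
% AdaBoost (SAMME form): inducers f_j, data points x_i with labels g_i,
% data-point weights w_i, class set K.
\begin{align}
e_j &= \frac{\sum_{i=1}^{M} w_i \, I\big(g_i \neq f_j(x_i)\big)}{\sum_{i=1}^{M} w_i} \\
\alpha_j &= \log\frac{1 - e_j}{e_j} + \log\big(|K| - 1\big) \\
w_i &\leftarrow w_i \cdot \exp\Big(\alpha_j \, I\big(g_i \neq f_j(x_i)\big)\Big) \\
C(x) &= \arg\max_{k \in K} \sum_{j=1}^{N} \alpha_j \, I\big(f_j(x) = k\big)
\end{align}

% Gradient boosting: loss L, approximation functions h_t, T training steps.
\begin{align}
\hat{F} &= \arg\min_{F} \; \mathbb{E}\big[L\big(F(x), g\big)\big] \\
\hat{F}(x) &= \hat{F}_0(x) + \sum_{t=1}^{T} \gamma_t \, h_t(x) \\
\hat{F}_t(x) &= \hat{F}_{t-1}(x) + \gamma_t \, h_t(x), \qquad
\gamma_t = \arg\min_{\gamma} \sum_{i=1}^{M} L\big(\hat{F}_{t-1}(x_i) + \gamma \, h_t(x_i), \, g_i\big)
\end{align}
```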

Deep Ensembling
In a general sense, ensembling systems are represented by an algorithm or function $\mathcal{F}$ that, given a set of dataset samples denoted $S$ and a series of algorithms denoted $F$, uses the classification or regression outputs of all the algorithms, called inducers, and by combining them can create a new output for each of the samples. Individual elements of the sample set can be represented as $s_i$, $i \in [1, M]$, forming a vector $S = [s_1, s_2, \ldots, s_M]$, while the series of algorithms can be represented by a set of functions $f_j$, $j \in [1, N]$, forming a vector $F = [f_1, f_2, \ldots, f_N]$. Therefore, a matrix $Y$ (see Equation 11) that contains elements $y_{i,j}$, $i \in [1, M]$ and $j \in [1, N]$, can be constructed, containing the prediction outputs of each inducer for each individual sample, where each row represents inducer outputs for a certain sample:
$$Y = \begin{bmatrix} y_{1,1} & y_{1,2} & \cdots & y_{1,N} \\ y_{2,1} & y_{2,2} & \cdots & y_{2,N} \\ \vdots & \vdots & \ddots & \vdots \\ y_{M,1} & y_{M,2} & \cdots & y_{M,N} \end{bmatrix}$$
Obtaining the final ensembled prediction output for a single sample $s_i$ consists of using the $[y_{i,1}, y_{i,2}, \ldots, y_{i,N}]$ inducer output vector as input for the ensembling function $\mathcal{F}$, thus obtaining the final prediction value $\hat{y}_i$. This entire process is presented in Figure 1. While some variants of the ensembling methods can be represented by simple mathematical functions, e.g., calculating the average value of the inducer output vector, other functions can be more complex and can require a preliminary learning stage, such as boosting methods, as shown in Section 2.2. We propose a different perspective in which the ensembling function is represented by deep neural networks that will process inducer prediction output values.
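The general framework above can be illustrated with a short sketch, assuming the simplest ensembling function mentioned in the text (the average of the inducer output vector). The function name `ensemble` and the values in the matrix are illustrative only; any learned fusion function taking one row of inducer outputs could be passed in place of `mean`.

```python
from statistics import mean


def ensemble(prediction_matrix, fusion_fn):
    """Apply fusion_fn to each row (one sample, N inducer outputs)."""
    return [fusion_fn(row) for row in prediction_matrix]


# Rows are samples (M = 3), columns are inducers (N = 4).
# The values are made up for illustration.
Y = [
    [0.10, 0.20, 0.30, 0.40],
    [0.80, 0.60, 0.70, 0.90],
    [0.50, 0.50, 0.50, 0.50],
]
fused = ensemble(Y, mean)  # one fused prediction per sample
```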
It is also interesting to note that, while in more complex cases, such as multi-label regression, the predictions created by the inducers do not represent a single value, as one output probability is assigned to each of the possible labels, in our case inducers output a single value, representing the degree of interestingness assigned to each image or video sample. Therefore the $y_{i,j}$ values are uni-dimensional.
With this general framework in mind, we will present in the following sections some new perspectives, consisting of several types of deep neural networks that are used as ensembling functions for the task of predicting media interestingness. Our assumption in this case is that DNNs are able to better understand the patterns and biases that individual inducers have towards the samples in the dataset. Our proposed DNN models will only use the inducer outputs in determining the final prediction score, so image and video samples will not be fed into the ensemble algorithm.
We investigate four types of DNN architectures as follows: (i) a dense layer-based approach, which is then augmented with (ii) attention layers, (iii) convolutional layers, and finally, (iv) the Cross-Space-Fusion (CSF) layer, a novel approach designed for parsing inducer vectors. While the first two types of network do not need any special data pre-processing, the latter two, namely convolutional and CSF, are designed to process data based on its spatial arrangement and to understand how adjacent elements in a matrix can be interpreted in order to obtain a prediction. While this is heavily exploited in images and videos by convolutional layers, inducer output vectors have no intrinsic spatial arrangement and correlation, and therefore some data pre-processing and decoration schemes that create spatial information are necessary for these two final types of neural networks, which we will present along with the implementation of the respective DNN models. One of the main reasons we theorize that such structures are able to create better ensembling systems is the ability of neural networks to accurately use various types of input data and classify this data into output predictions. While not directly attempting to model human behaviour and understanding of visual interestingness, we believe these models are able to model inducer behaviour and understanding, thus being able to learn the positive and negative biases of inducers towards visual samples. Thus, while the approaches presented here are centered around the prediction of visual interestingness, they are domain-independent and are useful in other tasks as well [31].

Dense Networks
Dense networks composed of fully connected (or dense) layers arguably represent one of the most popular DNN implementations. Given the innate ability of dense layers to correctly detect patterns in the input data and accurately classify samples, we theorize that, by using a set of connected dense layers, our proposed method will be able to accurately learn the correlation between inducer biases [14], allowing combinations of inducers to support or dismiss their predictions, based on the patterns the network learns. Another component of the final network is represented by the addition of batch normalization layers [15] between the individual dense layers, with the role of improving the network's learning process and speeding it up. Several variations of the dense network setup are tested, in order to ensure optimal performance. We present the optimal network architecture search method in Algorithm 1. We therefore change the depth of the network, by testing various numbers of layers (5, 10, 15, 20, 25), and the width of the network, by changing the number of neurons per layer (25, 50, 500, 1000, 2000). The third parameter in this search algorithm is represented by the presence or absence of batch normalization layers. Also, in Algorithm 1, the function called for each configuration has the role of both creating the network according to the three variable parameters and training and testing the created network. A schematic view of the dense network architecture is presented in Figure 2.
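The architecture search over depth, width and batch normalization can be sketched as an exhaustive grid search. This is an assumption about how Algorithm 1 enumerates configurations, not a reproduction of it. The callback `train_and_score` is hypothetical: in practice it would build, train and test one network for the given configuration (for example with Keras) and return a score where higher is better. A toy scoring function stands in for real training so that the sketch runs on its own.

```python
from itertools import product

DEPTHS = (5, 10, 15, 20, 25)          # number of dense layers
WIDTHS = (25, 50, 500, 1000, 2000)    # neurons per layer
BATCH_NORM = (False, True)            # batch normalization absent / present


def search_architecture(train_and_score):
    """Return (best_score, best_config) over the full grid.

    train_and_score is a hypothetical callback taking
    (depth, width, use_batch_norm) and returning a score to maximize.
    """
    best_score, best_config = float("-inf"), None
    for depth, width, use_bn in product(DEPTHS, WIDTHS, BATCH_NORM):
        score = train_and_score(depth, width, use_bn)
        if score > best_score:
            best_score, best_config = score, (depth, width, use_bn)
    return best_score, best_config


def toy_score(depth, width, use_bn):
    """Stand-in for real training, used only so the sketch is runnable."""
    return -abs(depth - 10) - abs(width - 500) / 100 + (0.5 if use_bn else 0.0)


best_score, best_config = search_architecture(toy_score)
```

The grid contains 5 x 5 x 2 = 50 configurations, so each call to the real training function would be one full train and test cycle.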

Attention Augmented Dense Networks
Though computational attention mechanisms [16] were initially predominantly used in works that dealt with text processing and translation, they were quickly adopted in other domains, including computer vision [17]. In a general sense, attention mechanisms have the role of understanding and detecting the parts in the input space that are most important for the final prediction stage and assigning higher weights to the important parts. While in a general computer vision setting these mechanisms would infer the most important parts in images or videos, the intuition in our ensembling system is that the attention layer will create a set of weights that will indicate the relevance of each of the values from the inducer output vector $[y_{i,1}, y_{i,2}, \ldots, y_{i,N}]$. The implementation we choose for our experiments consists of a soft attention layer inserted into the dense architecture presented in Section 3.1, as presented in Figure 3. Using the notation in Equation 12, which represents the network input space for a single sample $s_i$ as $x_i = [y_{i,1}, y_{i,2}, \ldots, y_{i,N}]$, and denoting the soft attention vector as $a_i$, with values between 0 and 1, the system will create an appropriate attention mask $\hat{x}_i$, computed as the element-wise product of the input vector and the attention vector, $\hat{x}_i = x_i \odot a_i$, as shown in Equation 13. The learning process for the attention mechanism is based on a supervised back-propagation approach: ...
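A minimal sketch of the soft attention step follows. It assumes the attention vector is produced by a softmax over a learned linear projection of the input, which is one common way to obtain values between 0 and 1; the chapter does not state the exact activation, so this choice, the function names and the parameter shapes are assumptions. The weights and biases stand in for parameters that would be learned by back-propagation.

```python
import math


def softmax(scores):
    """Numerically stable softmax; outputs lie in (0, 1) and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(inputs, weights, biases):
    """Return (attention, masked) for one sample's inducer outputs.

    weights is an N x N matrix and biases a length-N vector; both are
    placeholders for learned parameters.
    """
    scores = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]
    attention = softmax(scores)
    # Attention mask: element-wise product of input and attention vector.
    masked = [a * x for a, x in zip(attention, inputs)]
    return attention, masked


x = [0.2, 0.9, 0.4]            # inducer outputs for one sample (made up)
W = [[0.0, 0.0, 0.0]] * 3      # zero weights give uniform attention
b = [0.0, 0.0, 0.0]
attention, masked = soft_attention(x, W, b)
```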

Convolutional Augmented Dense Networks
Convolutional networks represented a big step forward for deep learning in the field of computer vision, aided by the advancement of hardware processing power and software libraries that allow such networks to be easily deployed and lower the processing time, starting with AlexNet's performance at the ILSVRC 2012 benchmarking competition [18]. While the shape of the input space is not important, as one, two or three dimensional convolutional networks have been implemented, they all rely on detecting and learning local correlations between adjacent elements in the input space. More to the point, convolutions can be represented by a set of filters of pre-determined shape that cover and process the entire input space. While this approach performs well for images and videos, which intrinsically have a spatial arrangement and correlation in the input space, in our particular case the order of the inducer prediction outputs in the vectors does not have any intrinsic spatial correlation, and, furthermore, at this stage no relationships between individual inducers are calculated. Therefore, we must create these correlations and relationships, via a process we call input decoration. Our assumption in this case is that, by creating the decorated input vector for convolutional processing for each sample and applying convolutional filters to this new input, we would be able to create a system where similar inducers can be arranged in close spatial proximity and can support or revoke their prediction decisions based on their spatial relations. Two problems must therefore be solved in order to introduce convolutions into the ensemble networks: (i) find a criterion for detecting similarity between inducers, and (ii) create a spatial arrangement based on the similarity.
For the first problem, similarity between individual inducers can be calculated with the help of the official metric used for measuring system performance in the task. While in the case of interestingness mean average precision at 10 elements is used (mAP@10), in a generalized approach the metric can be expressed as a function $\mathcal{M}$ that takes two vectors as input (either ground truth data and prediction data or two prediction vectors from two separate systems), and outputs a value of similarity between them, denoted $c$. In other words, given a general form for a prediction vector $P_j = [y_{1,j}, y_{2,j}, \ldots, y_{M,j}]$, which represents the prediction vector created by inducer $f_j$ for all the samples in the dataset, the similarity value between two inducers $f_j$ and $f_k$ can be calculated as $c_{j,k} = \mathcal{M}(P_j, P_k)$, as presented in Equation 14. Finally, by ordering the vector of similarity scores between an inducer and all other inducers, we can create a list of the most similar inducers for each of the inducers.
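The similarity ranking described above can be sketched as follows. The chapter uses the task's official metric as the function M; here a simple stand-in (1 minus the mean absolute difference between two prediction vectors) is used instead, because applying mAP@10 to two prediction vectors requires choices the text does not spell out. The function names and the data are illustrative assumptions.

```python
def similarity(u, v):
    """Stand-in for the metric M: 1 minus the mean absolute difference."""
    return 1.0 - sum(abs(a - b) for a, b in zip(u, v)) / len(u)


def most_similar(predictions, top_k):
    """For each inducer, list (index, score) of its top_k most similar peers."""
    ranking = []
    for i, p_i in enumerate(predictions):
        scores = [
            (j, similarity(p_i, p_j))
            for j, p_j in enumerate(predictions)
            if j != i
        ]
        # Highest similarity first.
        scores.sort(key=lambda pair: pair[1], reverse=True)
        ranking.append(scores[:top_k])
    return ranking


# One prediction vector per inducer (3 inducers, 4 samples); values made up.
P = [
    [0.1, 0.9, 0.4, 0.6],
    [0.2, 0.8, 0.4, 0.6],
    [0.9, 0.1, 0.5, 0.5],
]
neighbours = most_similar(P, top_k=1)
```

Swapping `similarity` for the task metric only requires a function with the same two-vector signature.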
The second problem involves using the similarity values calculated at the previous step, and decorating the predictions for each sample based on these values. The decorated input vector for a sample $s_i$ is presented in Equation 15, and is composed of centroids, denoted $Z_1, Z_2, \ldots, Z_N$, built around the initial inducer prediction output values. The elements in each centroid $Z_j$ are as follows: (i) the central element, $y_{i,j}$, represents the initial value, (ii) the similarity scores for the first four most similar inducers, denoted $c_{1,j}, \ldots, c_{4,j}$, and (iii) the prediction outputs of those same four inducers, denoted $r_{1,j}, \ldots, r_{4,j}$. The decorated array will represent the new input for the convolutional ensembling system, as presented in Figure 4. Finally, the array is processed by the convolutional layers, centroid by centroid. Equation 16 shows this process for a single centroid $Z_j$, where the centroid is element-wise multiplied with the weights in the convolutional filter. The final step involves, in our case, an average pooling layer that will output a single element for the convolutional step, representing the average value of the element-wise multiplication result matrix. In a simple case where only one convolutional filter is employed, the input to the dense layers will be similar to the initial input, with each inducer output value replaced by the result of the convolution process for the inducer's centroid. Finally, several setups will be tested for the convolutional architecture, which include different numbers of convolutional filters: 1, 5 or 10 filters. This allows the network to assess more than one type of correlation between the inducers.
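The centroid construction and the convolution step described above can be sketched as follows. The 3 x 3 layout used here (similar-inducer outputs on the corners, similarity scores on the edges, the inducer's own output in the center) is an assumption made for illustration; the exact arrangement is the one given in the chapter's Equation 15 and Figure 4. The filter values stand in for learned convolutional weights.

```python
def build_centroid(output, neighbour_outputs, neighbour_scores):
    """Build a 3 x 3 centroid around one inducer output.

    The placement of outputs (r) and scores (s) is an assumed layout.
    """
    r, s = neighbour_outputs, neighbour_scores
    return [
        [r[0], s[0], r[1]],
        [s[1], output, s[2]],
        [r[2], s[3], r[3]],
    ]


def convolve_centroid(centroid, kernel):
    """Element-wise product with a 3 x 3 filter, then average pooling."""
    products = [
        c * k
        for c_row, k_row in zip(centroid, kernel)
        for c, k in zip(c_row, k_row)
    ]
    return sum(products) / len(products)


# Made-up values: one inducer output, four neighbour outputs, four scores.
centroid = build_centroid(0.8, [0.7, 0.9, 0.6, 0.8], [0.95, 0.9, 0.85, 0.8])

# A filter that keeps only the central element (scaled by 9 to undo pooling).
center_only = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
value = convolve_centroid(centroid, center_only)
```

With the `center_only` filter the result equals the original inducer output, which matches the remark in the text that a single filter yields an input similar to the undecorated one.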

Cross-Space-Fusion Augmented Dense Networks
With the introduction of convolutional layers in the network, a method that can process the similarities between inducers has been created. However, convolutional networks are created with image processing as their main objective and use the same filters for processing the entire image, and therefore would, in the case of ensembling systems, share the same weights between different centroids. While this does represent a step forward in processing inducer correlation, our assumption is that correlations between inducers are different for each individual inducer, and therefore weights should not be shared between centroids. Given this assumption, we propose the creation of a novel type of DNN layer, which we name the "Cross-Space-Fusion", or CSF, layer. The implementation of the CSF layer is based on creating a new input decoration method and the creation of the layer itself. A few architectural decisions must be taken in order to fully exploit the correlation data we generate and overcome the possible limitations of convolutional processing. First of all, as shown in Equation 16, inducer outputs and similarity scores are not processed together, each one of them being multiplied separately with its corresponding convolutional weight. This may break the correlation between the two elements and make it harder to process and learn in the neural network. Secondly, the same possible issue would appear no matter what type of convolutional layer we would use, as three-dimensional convolutional layers do not process correlations inter-dimensionally. Therefore, we propose a novel input decoration method that creates an additional, third dimension, which separately stores similar inducer outputs and similarity scores. The CSF layer also needs to process these details across the third dimension of the array, processing inducer outputs and corresponding similarity scores together, while using the same function M presented in Equation 14 for calculating similarity scores.
Finally, as previously mentioned, we must take into account that regular convolutional filters may not be optimal for learning correlations, as these may differ from centroid to centroid. Thus, a larger number of parameters must be designed into the CSF layer and, while this may place a strain on the neural network, the number of added parameters is still small, especially when compared with the depth and width of the dense architecture.
Given the particularities of this approach, Equation 17 presents the new version of the decorated input, where represents the matrix of prediction outputs from the 8 most similar inducers for an inducer , while represents their respective similarity score, calculated with the help of the M function. These two matrices create the third dimension of the decorated input, as shown in Figure 5. Similar to the convolutional approach, in this example, the 1, and 1, pair represents the prediction output and similarity score of the system most similar to inducer , 2, and 2, the second most similar, and so on. While this decoration scheme clearly allows more similar inducers to be added to the system than a comparable convolutional approach, the question of their utility for this task remains and will be analyzed, as the new data inserted into the system may be noisy, or little real correlation may exist between the systems. Algorithm 3 presents this input decoration algorithm. It is worth noting that, in the case of the CSF approach, the shape of the decorated input array changes once more, from (3 × , 3) in the convolutional approach to (3 × , 3, 2), doubling in size. After the decoration step, the input is fed into the CSF layer. For each ( , ) group of centroids, the network must create and learn a set of weights that can combine the initial inducer prediction with the prediction outputs and similarity scores grouped in the centroids. Thus, the CSF layer contains a set of and parameters that must be learned. Equation 18 describes the operations that are performed by the CSF layer, where are used for controlling the prediction output of each inducer and parameters are used for controlling the prediction outputs and similarity scores for the inducers similar to .
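The decoration step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not Algorithm 3 itself: `similarity` is a hypothetical stand-in for the M function of Equation 14 (here one minus the mean absolute difference between two inducers' outputs), all function names are illustrative, and the 3 × 3 spatial layout of each centroid is omitted, so every inducer simply receives two parallel lists, one with the outputs of its most similar inducers and one with their similarity scores.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def similarity(a: Vector, b: Vector) -> float:
    """Hypothetical stand-in for M: 1 - mean absolute difference."""
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def rank_neighbours(train_outputs: List[Vector], i: int, k: int = 8,
                    sim: Callable[[Vector, Vector], float] = similarity
                    ) -> List[Tuple[int, float]]:
    """Return (index, score) for the k inducers most similar to inducer i.

    train_outputs[j] holds the predictions of inducer j on the training samples.
    """
    scored = [(j, sim(train_outputs[i], train_outputs[j]))
              for j in range(len(train_outputs)) if j != i]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def decorate(sample: Vector, neighbours: List[List[Tuple[int, float]]]
             ) -> List[Tuple[Vector, Vector]]:
    """Decorate one sample (one prediction per inducer).

    For every inducer, return the pair (neighbour outputs, similarity scores),
    i.e. the two planes of the third dimension described in the text.
    """
    decorated = []
    for i in range(len(sample)):
        outputs = [sample[j] for j, _ in neighbours[i]]
        scores = [score for _, score in neighbours[i]]
        decorated.append((outputs, scores))
    return decorated
```

The neighbour ranking is computed once from training predictions and reused for every sample, which keeps the decoration itself a simple lookup.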
       1, · + 1, · 1, · 1, 2 2, · + 2, · 2, · 2, 2 3, · + 3, · 3, · 3, 2 8, · + 8, · 8, · 8, 2 4, · + 4, · 4, · 4, 2 7, · + 7, · 7, · 7, 2 6, · + 6, · 6, · 6, 2 5, · + 5, · 7, · 5, 2 Figure 6 presents an outline of this approach. As presented, the final step in the CSF augmentation part of the method is represented by the addition of an average pooling layer, thus obtaining an input of equal dimensions as the initial one for the dense architecture. Also, given the number of inducers , the final number of parameters in the CSF layer is 16 × , with 8 × and parameters. As previously mentioned, we must also take into account the possibility that the addition of so many similar inducers in the centroid could add noise to the input and damage the final result. Thus, we decide to test two different setups for the CSF architecture: 4 , where we only populate the ( , ) centroid pairs with the top-4 most similar inducers, and 8 , where the centroid pairs are completely populated with 8 inducers. It is important to note that, while our experiments may show a preference for one of

Experimental Setup
This section will present the main components of the experiment and how these components interact. We will describe the training protocol employed for the experiments, the dataset and the evaluation protocol used for obtaining the results.

Training Protocol
The common component in all the methods presented in Section 3 is the dense deep neural network architecture. Our experiments therefore start by finding an optimal dense architecture with regard to the depth and width of the network and the positive or negative influence of batch normalization layers, using the values presented in Section 3.1. This is done by collecting the prediction outputs of the entire set of inducers and feeding them into the different variations of the dense architecture, a step described in Algorithm 1. In the following steps, the optimal dense network is augmented with attention, convolutional and CSF layers. For the convolutional and CSF layers, the input, consisting of the prediction outputs, is first decorated, according to Algorithm 2 for the convolutional approach and Algorithm 3 for the CSF approach. Each variation of the network is trained for 50 epochs, using a batch size of 64 samples, a mean squared error loss function and the Adam [45] optimizer with a learning rate of 0.01. We are interested in identifying the optimal dense architecture, given the set of search parameters, as well as the effect of augmenting the dense network with the three types of layers: attention, convolutional and CSF.
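The training schedule described above can be summarised in a short framework-independent sketch. The dictionary only restates the hyperparameters given in the text; the key names, the helper functions and the per-epoch reshuffling are assumptions made for illustration, and the actual optimizer update is left to whichever deep learning library is used.

```python
import math
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Hyperparameters stated in the training protocol (key names are illustrative).
TRAINING_PROTOCOL = {
    "epochs": 50,
    "batch_size": 64,
    "loss": "mean_squared_error",
    "optimizer": "adam",
    "learning_rate": 0.01,
}


def iter_minibatches(samples: Sequence[T], batch_size: int, epochs: int,
                     seed: int = 0) -> Iterator[List[T]]:
    """Yield mini-batches for every epoch, reshuffling between epochs (assumed)."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            yield [samples[j] for j in order[start:start + batch_size]]


def updates_per_run(n_samples: int, batch_size: int, epochs: int) -> int:
    """Number of optimizer updates performed for one network variation."""
    return epochs * math.ceil(n_samples / batch_size)
```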

Dataset
For our experiments we use the latest version of the Interestingness10k [19] dataset, validated and used during the MediaEval 2017 Predicting Media Interestingness task [13]. The dataset is composed of 9,831 images and videos, split between 7,396 samples in the development set (devset) and 2,435 samples in the testing set (testset). Participants in the benchmarking competition were tasked with developing and training their media interestingness prediction methods on the devset, running the systems on the testset samples and submitting their testset predictions to the task organizers for performance calculation.
Given the high number of systems submitted at the benchmarking competition, i.e., 33 for the image task and 42 for the video task, and the considerable amount of research and work that went into creating them, we consider these systems as ideal candidates for being used as inducers in our proposed method. With the help and collaboration of the task organizers, we gathered participant submission files and used them as input into our systems. However, given the fact that participants only submitted predictions for the testset samples and the inherent problems in recreating such a large number of diverse systems, we are bound to only use those predictions and create a new evaluation protocol that will be used in training our systems, based only on the samples that are featured in the testset.
We therefore have to create a new set of data splits, and choose to use two protocols for this: (i) RSKF75, featuring a random stratified k-fold that uses 75% of the samples for training and 25% for testing, and (ii) RSKF50, generating 50% training samples and 50% testing. It is important to note that, in order to avoid any "lucky" data splits that would create an unfair advantage for our approach, the split samples are randomized, and experiments are repeated with different random splits, generating 100 partitions for each network architecture variation. Therefore, the results we present in Section 6 are average values calculated over the 100 partitions. System performance is calculated by using the official metric of the MediaEval benchmarking competition, i.e., MAP@10.
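A minimal sketch of this evaluation protocol is given below, using only the Python standard library. Several points are assumptions rather than details confirmed by the text: the splits are implemented as repeated stratified random partitions seeded by the partition index, the function names are illustrative, and average precision is normalised by the number of relevant items found in the top 10 (other tools normalise differently). The grouping of samples over which MAP@10 is averaged is likewise left to the caller.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(labels: Sequence[Hashable], train_fraction: float,
                     seed: int) -> Tuple[List[int], List[int]]:
    """Randomly split sample indices, keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train: List[int] = []
    test: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


def repeated_splits(labels: Sequence[Hashable], train_fraction: float,
                    n_partitions: int = 100) -> List[Tuple[List[int], List[int]]]:
    """Generate n_partitions different random stratified partitions."""
    return [stratified_split(labels, train_fraction, seed)
            for seed in range(n_partitions)]


def average_precision_at_k(ranked_relevance: Sequence[int], k: int = 10) -> float:
    """AP@k for one ranked list of 0/1 relevance labels (best-ranked first)."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance[:k], start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(rankings: Sequence[Sequence[int]],
                                k: int = 10) -> float:
    """MAP@k: the mean of AP@k over all ranked lists."""
    return sum(average_precision_at_k(r, k) for r in rankings) / len(rankings)
```

Under this sketch, RSKF75 corresponds to `repeated_splits(labels, 0.75)` and RSKF50 to `repeated_splits(labels, 0.50)`, with the reported score being the mean of the metric over the 100 partitions.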

Experimental Results
This section presents the experimental results, including a comparison with a set of baseline systems and a set of baseline ensembling approaches, and identifies the best performing architectures.

Baseline Systems
In order to correctly position and analyze the results of the proposed methods, we compare them with a few methods from the literature, including (i) the best performers at the MediaEval competition, (ii) the best overall performers on the Interestingness10k dataset, and (iii) a set of traditional ensembling methods.
The best performers from the MediaEval competition also represent inducers for our systems, and an important target for the proposed systems. For the image prediction task we have the system developed by Permadi et al. [26], with a MAP@10 performance of 0.1385, while for video prediction we have the system developed by Ben-Ahmed et al. [27], with a MAP@10 performance of 0.0827. The overall performers consist of methods that are published outside the MediaEval venue, but use the same benchmarking protocol and metrics. For the image task we have the work of Parekh et al. [28], with a performance of MAP@10 = 0.156, while for the video task, Wang et al. [29] achieve MAP@10 = 0.093. The final set of baseline systems consists of traditional ensembling methods that we created using the same protocol and set of inducers as our proposed methods. Several types of ensembling methods are tested, starting with simple strategies [46] like taking the maximum value of inducer prediction outputs (LFMax), average and mean values (LFAvg and LFMean), and weighted average (LFWeight), but also more complex approaches that involve learning steps, like AdaBoost [40] (BAda) and Gradient Boosting [41] (BGrad).
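The simple late fusion strategies named above reduce to one-line combination rules over the inducer predictions for a sample. The sketch below illustrates three of them; the function names mirror the labels in the text, but the source of the weights in `lf_weight` is an assumption (for instance, each inducer's score on held-out data), as the text does not specify it.

```python
from statistics import mean
from typing import Sequence


def lf_max(predictions: Sequence[float]) -> float:
    """LFMax: keep the highest inducer prediction."""
    return max(predictions)


def lf_avg(predictions: Sequence[float]) -> float:
    """LFAvg: unweighted average of the inducer predictions."""
    return mean(predictions)


def lf_weight(predictions: Sequence[float], weights: Sequence[float]) -> float:
    """LFWeight: average weighted by a per-inducer weight (assumed given)."""
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)
```

None of these rules learns anything about how inducers relate to each other, which is the gap the deep fusion architectures aim to fill.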

Results
The results are presented in Table 1. At first glance, it is important to note that the proposed systems surpass every baseline system, including the best performing baseline ensembling system, which for both images and videos is the AdaBoost approach. Furthermore, the best performing variant of the proposed systems increases performance by a large margin. Taking into account the RSKF75 split, the increases are as follows: for the image subtask, 148.08% over the best MediaEval system, 73.09% over the best overall system and 105.25% over the best traditional ensembling system, while for the video subtask these values are 241.59%, 203.76% and 150.22% respectively.
Table 1 Results on the two interestingness prediction tasks: image and video. Systems are divided into baseline best performers from MediaEval and from the literature (b), best baseline ensembling performance (e) and proposed systems (p), and according to the split the systems employ (original or RSKF50 and RSKF75). The best results with regards to the official metric (MAP@10) are presented in bold.
With regard to the overall best performing proposed method, results vary: the convolutional approach has the best results on the image task using the RSKF75 split (MAP@10 = 0.3436) and on the video task using the RSKF50 split (MAP@10 = 0.1692), while the CSF approach has the best results in the other two cases, obtaining a MAP@10 value of 0.2403 for image prediction under the RSKF50 setup and 0.2825 for video prediction under the RSKF75 setup. It is also important to note the architecture variations that led to these results, i.e., the optimal dense, convolutional, and CSF architecture setups. For image prediction, the optimal dense architecture uses 10 layers with 1,000 neurons per layer and no batch normalization, achieving MAP@10 values of 0.2316 for RSKF50 and 0.3355 for RSKF75, while the best performing convolutional architecture uses 5 filters. Also, the best performing CSF setup in this case is 4 . For video prediction, the optimal dense setup is composed of 25 layers with 2,000 neurons per layer and features batch normalization, achieving MAP@10 values of 0.1563 for RSKF50 and 0.2677 for RSKF75. With regard to the convolutional architecture, the best setup again features 5 convolutional filters, while 4 again represents the best setup for the CSF layer. While the dense network performance is very good, the augmentation process with attention and especially convolutional and CSF layers further improves the results.
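The relative increases quoted above follow from the usual relative-change formula, as the short sketch below shows for the scores given in the text. The function name is illustrative. Because the published MAP@10 values are rounded to four decimals, the recomputed percentages agree with the reported ones only up to a small rounding difference in the last digit.

```python
def relative_increase(new: float, reference: float) -> float:
    """Relative increase of `new` over `reference`, in percent."""
    return 100.0 * (new - reference) / reference


# Image task, RSKF75: best proposed system vs. best MediaEval system.
image_vs_mediaeval = relative_increase(0.3436, 0.1385)

# Video task, RSKF75: best proposed system vs. best MediaEval system
# and vs. the best system from the literature.
video_vs_mediaeval = relative_increase(0.2825, 0.0827)
video_vs_literature = relative_increase(0.2825, 0.093)
```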

One final observation with regard to network setup is presented in Table 2. During our experiments, we observed that there are certain points at which the network stops learning and reaches saturation. While Table 2 presents a particular setup, for the video task with batch normalization layers and the RSKF75 split, the same behaviour is observable regardless of the task, of the presence of batch normalization layers or of the split. In the example presented, increasing the number of neurons past 1,000 while keeping the number of layers constant at 5 only decreases the performance, and the same is true when increasing the number of layers past 20 while maintaining a constant number of 25 neurons per layer. Most importantly, this seems to indicate that the optimal network setup is not outside the set of values we tested in our experiments. Another important point is that the proposed method performs well even in the most basic setup (5 layers, 25 neurons per layer), scoring a MAP@10 value of 0.2414, which is 9.82% lower than the best performing dense architecture, but still significantly better than all the selected baseline methods.

Conclusions
This work presents the creation and deployment of a series of deep neural network based ensemble systems, used in the prediction of image and video interestingness. The latest Interestingness10k dataset is used in our experiments, a dataset that was previously used and validated during the MediaEval 2017 Predicting Media Interestingness task. Though a large number of systems use this dataset, both during the MediaEval benchmarking competition and outside it, in different journals and conferences, system performance is generally low when compared with other tasks, i.e., a maximum MAP@10 performance of 0.1985 for image interestingness prediction and 0.093 for video prediction.
While very high, near-perfect performance is not necessarily expected for such tasks, where annotator subjectivity plays an important role, we theorize that the implementation of ensemble systems can increase overall performance. Furthermore, the exploration of deep neural networks as ensembling functions presents a high degree of novelty, as the current literature shows that they are only employed as inducers and not as ensembling functions. Different network setups are presented and tested, including architectures based on dense, attention, convolutional and CSF layers. We present the theoretical background for implementing these architectures as ensembling functions and introduce input decoration algorithms that allow inducer prediction output data to be used, and inducer correlations to be learned, with the help of these architectures.
Experimental results show a significant increase in performance over state-of-the-art systems. Our proposed methods show a 148.08% increase in performance in the image prediction task over the best MediaEval system and 73.09% over the best state-of-the-art system, while in the video prediction task the increase is even higher: 241.59% and 203.76%. Furthermore, the proposed ensemble methods are compared with some traditional ensembling methods implemented under the same conditions and perform significantly better, with increases of 105.25% for the image task and 150.22% for the video task. While it is certainly possible that better results could be achieved with other network setups, featuring a different number of layers or neurons, or different architectures, we believe the advantages of deep fusion systems to be thoroughly demonstrated. Given the results, it is still unclear which of the two inducer correlation based architectures (convolutional or CSF) performs better for this task, with top results being split between them. However, it is important to note that inducer correlation processing did indeed improve the results of both the dense networks and the attention-based networks, thus indicating the validity of inducer correlation calculation, input decoration and correlation processing.
Finally, another important point, not only for our proposed methods, but for ensembling systems in general, is the analysis of the deployability of the proposed systems. While using a late fusion approach can be cost intensive, considering that inducers must be trained, tested and run individually, and a final ensembling step performed before the final prediction is provided, there are cases where developing a late fusion system can become a necessity. Critical infrastructure applications, where very accurate prediction results are a constant need, represent a good example, but, closer to the domain of interestingness prediction, applications where single-method approaches do not perform well, due to inherent multi-modality or complexity of the concept that is being predicted, represent another good example. While deploying an ensembling method may prove to be more costly, it may also be one of the only methods that achieves market-level performance, allowing the introduction of new features that can greatly increase user satisfaction. In this case we also consider the possibility that lowering the number of inducers may not affect system performance to a high degree, therefore trading an insignificant amount of performance for higher execution speed and lower hardware demands. While the creation of an inducer selection method is still an open question for our approach, we propose that future developments could address this problem by analyzing inducer correlations or by testing performance in a recursive leave-one-out scenario.
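One possible reading of the recursive leave-one-out idea is a greedy backward elimination, sketched below. This is a hypothetical illustration of the future direction, not a method evaluated in this work: `score` is an assumed callable that would train and evaluate a fusion system on a subset of inducers and return its MAP@10, and `tolerance` is an assumed bound on the total performance loss that is considered acceptable.

```python
from typing import Callable, FrozenSet, Iterable


def recursive_leave_one_out(inducers: Iterable[str],
                            score: Callable[[FrozenSet[str]], float],
                            tolerance: float) -> FrozenSet[str]:
    """Greedily drop inducers while the score stays within `tolerance`
    of the score obtained with the full set of inducers."""
    selected = frozenset(inducers)
    baseline = score(selected)
    while len(selected) > 1:
        # Evaluate every subset obtained by leaving one inducer out.
        trials = [(score(selected - {name}), name) for name in sorted(selected)]
        best_score, dropped = max(trials)
        if baseline - best_score > tolerance:
            break  # Removing any further inducer costs too much performance.
        selected = selected - {dropped}
    return selected
```

Each round requires one evaluation per remaining inducer, so the cost grows quadratically with the number of inducers; a correlation-based pre-filter could reduce the number of candidates first.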