Subclass deep neural networks: re-enabling neglected classes in deep network training for multimedia classification

During minibatch gradient-based optimization, the contribution of observations to the updating of the deep neural network's (DNN's) weights for enhancing the discrimination of certain classes can be small, even though these classes may still have a large generalization error. This happens, for instance, due to overfitting, i.e., for classes whose error on the training set is negligible, or simply when the contributions of the misclassified observations to the updating of the weights associated with these classes cancel out. To alleviate this problem, a new criterion is proposed for identifying the so-called "neglected" classes during the training of DNNs, i.e., the classes that stop being optimized early in the training procedure. Moreover, based on this criterion, a novel cost function is proposed that extends the cross-entropy loss using subclass partitions to boost the generalization performance of the neglected classes. In this way, the network is guided to emphasize the extraction of features that are discriminant for the classes that are prone to being neglected during the optimization procedure. The proposed framework can be easily applied to improve the performance of various DNN architectures. Experiments on several publicly available benchmarks, including the large-scale YouTube-8M (YT8M) video dataset, show the efficacy of the proposed method.


Introduction and related work
Deep neural networks (DNNs) have shown breakthrough performance in many machine learning problems and are currently seeing significant commercial deployment in application domains such as multimedia understanding, self-driving cars, the IoT and others. State-of-the-art DNNs for classification tasks consist of a series of weight layers, nonlinear activation functions and downsampling operators, topped by an output layer typically equipped with a sigmoid or softmax activation function modeling c categorical probability distributions [16,21]. An important aspect in the design of a DNN is the choice of the cost function and the optimization algorithm. The cross-entropy (CE) loss and stochastic gradient descent (SGD) combined with the back-propagation (BP) algorithm for updating the DNN parameters are almost always the sole choice in practice [12]. The great success of these DNNs rests on their extraordinary ability to extract nonlinear features at different layers, guided by the SGD-BP algorithm, in order to transform a set of c (possibly) nonlinear classification tasks in the input space of the DNN into c linear ones in the input space of the output layer. More specifically, for the ith output node a gradient update in the correct direction is generated, whose length is proportional to the training error of the ith class, guiding the overall network to extract the desired features and to produce a linearly separable subspace for the ith classification task. In [19], it is shown that applying the CE loss with gradient descent on separable data converges to the max-margin solution at a logarithmic convergence rate. Moreover, it is shown that the above analysis is also valid in deep networks if, after a certain number of iterations, the weight vectors of the last weight layer are assumed fixed and the class distributions at its output are considered linearly separable (or piecewise linearly separable).
However, as we show in this paper, not all weight vectors in the last layers yield a linearly separable problem simultaneously, and thus not all class-separating hyperplanes converge to the max-margin solution at the same rate. Instead, there is an antagonism, where the extraction of discriminant features for certain classes is emphasized during the optimization of the DNN, while other classes are partially neglected, yielding a "less" linearly separable problem in the input space of the output layer for these classes and thus a suboptimal separating margin.
The inability of DNNs to treat all classes fairly during the training procedure has been studied mostly in the context of class-imbalanced learning [15]. In contrast, the identification of classes receiving little attention during training, as described above, is a relatively unexplored topic. To this end, a new criterion for identifying such neglected classes is proposed. This criterion computes the contribution of positive and negative observations to the gradient update of the weight vectors in the output layer and combines the computed quantities to form a stable measure of the likelihood that the underlying class is going to be neglected. Moreover, in order to turn the attention of the DNN to the identified neglected classes, we resort to a subclass partitioning strategy. Subclass-based classification techniques have been successfully used in the shallow learning paradigm. In [10], learning vector quantization (LVQ) is used to find a set of cluster centers for each class, and classification is performed by finding the closest cluster center. In [7], mixture discriminant analysis (MDA) fits a Gaussian mixture density to each class, extending linear discriminant analysis (LDA) to the non-normal setting. In [5], nonlinear classification problems are solved by splitting the original set of classes into subclasses and embedding the binary problems in a problem-dependent subclass error-correcting output codes (SECOC) design. In [20,6], a set of kernel subclass discriminant analysis techniques is proposed to deal with nonlinearly separable subclasses, and it is shown that the optimum kernel parameters can be identified more easily by exploiting the subclass partitions.
Motivated by the above works, a subclass DNN (SDNN) framework is proposed, where the neglected classes are augmented and partitioned into subclasses, and subsequently a novel subclass CE (SCE) loss, which emphasizes the separation of subclasses belonging to different classes, is applied to train the network. In this way, the network is trained to derive a piecewise linear subspace for the neglected classes, imposing a less strict requirement on the extraction of nonlinear features for these classes. Thus, the DNN is trained more effectively with respect to the neglected classes, increasing its overall generalization performance. The novel SDNN framework is compared with state-of-the-art approaches on three popular benchmarks (CIFAR10, CIFAR100 [11] and SVHN [14]) and on the large-scale YT8M video dataset [1], for the tasks of multiclass and multilabel classification, respectively. The results show that in most cases the proposed SDNNs obtain significant performance improvements.
The rest of the paper is structured as follows: Section 2 presents the proposed method and Section 3 describes the experimental evaluation. Conclusions are drawn in Section 4.

Identification of neglected classes
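Suppose a DNN with a sigmoid (SG) output layer of c units, trained with the CE loss on minibatches of n observations. Let $\mathbf{x}_\kappa$ denote the input to the SG layer for the $\kappa$th observation of the minibatch, $y_{i,\kappa} \in \{0, 1\}$ its label with respect to class $\omega_i$ (one if $\mathbf{x}_\kappa \in \omega_i$ and zero otherwise), and $q_{i,\kappa} = \sigma(h_{i,\kappa})$ the output of unit $i$, where $\sigma$ is the sigmoid function and the pre-activation $h_{i,\kappa}$ is linear in the weight vector $\mathbf{w}_i$. The CE loss is
$$L = -\frac{1}{n} \sum_{\kappa=1}^{n} \sum_{i=1}^{c} \big( y_{i,\kappa} \ln q_{i,\kappa} + (1 - y_{i,\kappa}) \ln (1 - q_{i,\kappa}) \big). \qquad (3)$$
Under this framework, the weight vector associated with the ith class is updated at each iteration as
$$\mathbf{w}_i \leftarrow \mathbf{w}_i - \eta\, \mathbf{g}_i, \qquad \mathbf{g}_i = \frac{1}{n} \sum_{\kappa=1}^{n} \zeta_{i,\kappa}\, \mathbf{x}_\kappa, \qquad (4)$$
where $\mathbf{g}_i$ is the gradient of $L$ with respect to $\mathbf{w}_i$, $\eta$ is the learning rate and $\zeta_{i,\kappa} = q_{i,\kappa} - y_{i,\kappa}$. Noting that $q_{i,\kappa} \in [0, 1]$, we observe that $\zeta_{i,\kappa} \in [-1, 1]$, with $\zeta_{i,\kappa} \approx 0$ when unit $i$ of the SG layer provides the right answer for the label of $\mathbf{x}_\kappa$, and $|\zeta_{i,\kappa}|$ approaching 1 as the likelihood of unit $i$ providing a wrong answer increases. These properties of $\zeta_{i,\kappa}$ aid the correct operation of the gradient-based learning approach, i.e., they shrink the gradient in (4) when the right answer is obtained and provide a strong gradient otherwise, forcing the overall network to act quickly in order to correct the mislabeled observations. However, this is not always the case. For instance, since the contributions of different observations to the sum in (4) may cancel out, the gradient may shrink even though many observations are misclassified. To see this, we rewrite the gradient as
$$\mathbf{g}_i = \boldsymbol{\delta}_i = \check{\boldsymbol{\delta}}_i - \hat{\boldsymbol{\delta}}_i,$$
$$\hat{\boldsymbol{\delta}}_i = \frac{1}{n} \sum_{\mathbf{x}_\kappa \in \omega_i} (-\zeta_{i,\kappa})\, \mathbf{x}_\kappa, \qquad (7)$$
$$\check{\boldsymbol{\delta}}_i = \frac{1}{n} \sum_{\mathbf{x}_\kappa \notin \omega_i} \zeta_{i,\kappa}\, \mathbf{x}_\kappa, \qquad (8)$$
where $\hat{\boldsymbol{\delta}}_i$, $\check{\boldsymbol{\delta}}_i$ equal zero when the positive and negative observations, respectively, are classified correctly. Note that the weights $-\zeta_{i,\kappa}$ for $\mathbf{x}_\kappa \in \omega_i$ and $\zeta_{i,\kappa}$ for $\mathbf{x}_\kappa \notin \omega_i$ are non-negative and at most one, and thus $\hat{\boldsymbol{\delta}}_i$, $\check{\boldsymbol{\delta}}_i$ are weighted means of the target and non-target class observations, respectively, each observation being weighted by the likelihood, derived from the DNN, that it does not belong to its true category. When $\hat{\boldsymbol{\delta}}_i$, $\check{\boldsymbol{\delta}}_i$ are close to each other, the overall gradient $\boldsymbol{\delta}_i$ approaches zero and $\mathbf{w}_i$ remains relatively unchanged, even though many observations are still not classified correctly by unit $i$.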
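The cancellation effect discussed above can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function name `gradient_terms`, the toy minibatch and the pre-activation form w·x + b are assumptions made for illustration. It splits the gradient of one sigmoid unit into its target-class and non-target-class parts.

```python
import math


def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))


def gradient_terms(w, b, xs, ys):
    """Split the minibatch gradient of one sigmoid unit into two parts.

    xs: feature vectors at the input of the output layer.
    ys: 0/1 labels of the observations for this unit's class.
    Returns (g, d_pos, d_neg) with g = d_neg - d_pos.
    """
    n, f = len(xs), len(w)
    d_pos = [0.0] * f  # target-class term, weights -zeta in [0, 1]
    d_neg = [0.0] * f  # non-target-class term, weights zeta in [0, 1]
    for x, y in zip(xs, ys):
        q = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        zeta = q - y
        part, weight = (d_pos, -zeta) if y == 1 else (d_neg, zeta)
        for k in range(f):
            part[k] += weight * x[k] / n
    g = [dn - dp for dn, dp in zip(d_neg, d_pos)]
    return g, d_pos, d_neg


# Toy minibatch: identical features, opposite labels, untrained unit (q = 0.5).
xs = [[1.0, 2.0], [1.0, 2.0]]
ys = [1, 0]
g, d_pos, d_neg = gradient_terms([0.0, 0.0], 0.0, xs, ys)
print(g, d_pos, d_neg)
```

In this toy case both parts equal [0.25, 0.5], so the gradient is exactly zero although neither observation is classified with any confidence.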
When this undesired effect appears, the network gradually stops optimizing the weights of the layers below for extracting discriminant features associated with such "neglected" classes, paying more attention to improving the training classification rates of classes that still produce a strong gradient at each iteration. A unit $i$ whose target-class and non-target-class gradient terms $\hat{\boldsymbol{\delta}}_i$, $\check{\boldsymbol{\delta}}_i$ are large while the difference between them is small reflects a high likelihood that the associated class is not getting the required attention and is going to be neglected in subsequent iterations. Based on the analysis above, every $\tau$ minibatch iterations we compute a measure $\theta_i$ (9) for estimating how likely class $i$ is to be neglected, where $\hat{\boldsymbol{\delta}}_{i,l}$, $\check{\boldsymbol{\delta}}_{i,l}$ are the gradient terms (7), (8) at the $l$th minibatch iteration, $\|\cdot\|$ is the vector norm operator and $p$ is the current iteration. The most neglected class $\hat{\imath}$ is then identified using a simple argmax rule, $\hat{\imath} = \arg\max_i \theta_i$.
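As an illustration only, the sketch below uses an assumed ratio-style score that is not necessarily the measure used in this work: the summed norms of the two gradient parts over the last τ iterations, divided by the summed norms of their differences. The names `neglection_score` and `most_neglected` and the small constant `eps` are hypothetical. It only shows how a score that is high for large but mutually cancelling parts can be combined with an argmax rule.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def neglection_score(pos_terms, neg_terms, eps=1e-12):
    """Assumed stand-in score over the last tau minibatch iterations.

    pos_terms[l], neg_terms[l]: target and non-target gradient parts of one
    class at iteration l. The score grows when both parts are long but
    their difference (the actual gradient) is short.
    """
    total = sum(norm(p) + norm(q) for p, q in zip(pos_terms, neg_terms))
    net = sum(norm([b - a for a, b in zip(p, q)])
              for p, q in zip(pos_terms, neg_terms))
    return total / (net + eps)


def most_neglected(scores):
    """Argmax rule: index of the class with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Class 0: large parts that almost cancel. Class 1: one-sided gradient.
scores = [
    neglection_score([[3.0, 4.0]], [[3.0, 3.9]]),
    neglection_score([[3.0, 4.0]], [[0.0, 0.0]]),
]
print(most_neglected(scores))
```

With these toy values class 0 is selected, since its two parts are long yet nearly identical.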

SDNNs
The major consequence of neglecting a class during the optimization procedure is that the trained DNN fails to learn an appropriate feature mapping in which the neglected classes are linearly separable. To alleviate this unwanted behavior, we propose the use of a clustering algorithm to derive a subclass partition for those classes that are prone to being neglected. By exploiting this partition, it is expected to be generally easier for the DNN to learn a nonlinear mapping in which the subclasses are linearly separable. Under this framework, the easiest way to extend the CE criterion would be to treat each subclass as a class. However, such a loss treats all misclassifications of an observation to non-target subclasses equally, without examining which non-target subclasses belong to the target class of the observation and which do not. To this end, we propose the following loss in order to favor the separability of those subclasses that correspond to different classes:
$$L_{\mathrm{SCE}} = -\frac{1}{n} \sum_{\kappa=1}^{n} \sum_{i=1}^{c} \sum_{j=1}^{H_i} \big( y_{i,j,\kappa} \ln q_{i,j,\kappa} + (1 - y_{i,\kappa}) \ln (1 - q_{i,j,\kappa}) \big), \qquad (11)$$
where $H_i$ is the number of subclasses of class $i$, $y_{i,j,\kappa}$ is the label of the $\kappa$th training observation in the batch associated with the $j$th subclass of class $i$, i.e., $y_{i,j,\kappa}$ equals one if $\mathbf{x}_\kappa \in \omega_{i,j}$ and zero otherwise, and $h_{i,j,\kappa}$, $q_{i,j,\kappa}$ are the input and output of the activation function of the $(i, j)$ unit associated with $\mathbf{x}_\kappa$. Note that in the second summand of (11) the class label $y_{i,\kappa}$ is utilized instead of the subclass label $y_{i,j,\kappa}$ in order to emphasize the separation of subclasses belonging to different classes, as explained above.
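One way to read the loss described above is sketched below in plain Python. This is an interpretation under stated assumptions, not the authors' code: the first summand uses the subclass label, the second uses the class label, and the function name `subclass_ce`, the nested-list layout and the `use_class_label` switch are hypothetical. Outputs are assumed to lie strictly inside (0, 1).

```python
import math


def subclass_ce(q, sub_labels, class_labels, use_class_label=True):
    """Subclass cross-entropy over a minibatch (assumed form).

    q[k][i][j]: sigmoid output of unit (i, j) for observation k, in (0, 1).
    sub_labels[k][i][j]: 1 if observation k is in subclass j of class i.
    class_labels[k][i]: 1 if observation k is in class i.
    With use_class_label=False every subclass is treated as its own class.
    """
    total = 0.0
    for q_k, s_k, c_k in zip(q, sub_labels, class_labels):
        for i, units in enumerate(q_k):
            for j, q_ij in enumerate(units):
                # Label used in the second summand: class or subclass level.
                y_second = c_k[i] if use_class_label else s_k[i][j]
                total += s_k[i][j] * math.log(q_ij)
                total += (1 - y_second) * math.log(1.0 - q_ij)
    return -total / len(q)


# One observation in subclass (0, 0); the sibling unit (0, 1) also fires.
q = [[[0.8, 0.9], [0.1, 0.1]]]
sub_labels = [[[1, 0], [0, 0]]]
class_labels = [[1, 0]]
sce = subclass_ce(q, sub_labels, class_labels)
plain = subclass_ce(q, sub_labels, class_labels, use_class_label=False)
print(sce, plain)
```

In this toy case the plain subclass-as-class loss penalizes the strong response of the sibling subclass unit, while the class-label variant does not, so `sce` is the smaller of the two values.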

Subclass partitioning and augmentation
Any clustering algorithm and augmentation approach can be applied to derive a subclass division of the neglected classes. However, for large-scale datasets such as YT8M [1], it may be infeasible to use computationally demanding clustering approaches such as k-means. To this end, the lightweight approach described in Algorithm 1 is proposed for partitioning the observations of the ith class into two subclasses. It is based on the distance of each class observation to $\mathbf{m}$, the mean over all observations in the training set, which is used as a representation of the rest-of-world class. Moreover, data augmentation can be performed on the neglected classes by applying to each observation $\mathbf{x}$ the feature-space extrapolation proposed in [3], which is controlled by a parameter $\lambda \in [0, 1]$ and uses $\bar{\mathbf{x}}_{i,1}$, $\bar{\mathbf{x}}_{i,2}$, the observations of class $i$ with the largest and smallest distance from $\mathbf{m}$, respectively. Using the approach described in this section, both class partitioning and augmentation can be performed very efficiently on-line, without the need to load the whole dataset or large parts of it in memory.
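Algorithm 1 is not reproduced in this section, so the following is only a hypothetical sketch of a distance-based split into two subclasses. The splitting threshold (the mean distance of the class observations to the global mean) and the names `mean_vector`, `dist` and `split_by_distance` are assumptions, not the authors' procedure.

```python
import math


def mean_vector(xs):
    """Component-wise mean of a list of equally sized vectors."""
    n = len(xs)
    return [sum(col) / n for col in zip(*xs)]


def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def split_by_distance(class_xs, m):
    """Assign each class observation to subclass 0 (near m) or 1 (far from m)."""
    d = [dist(x, m) for x in class_xs]
    threshold = sum(d) / len(d)  # assumed split point
    return [0 if di <= threshold else 1 for di in d]


# Toy training set; m stands in for the rest-of-world class.
all_xs = [[1.0, 1.0], [3.0, 3.0], [2.0, 2.0], [9.0, 9.0], [0.0, 0.0]]
m = mean_vector(all_xs)
class_xs = [[2.0, 2.0], [3.0, 3.0], [9.0, 9.0]]
subclass_ids = split_by_distance(class_xs, m)
print(m, subclass_ids)
```

Because only distances to a single fixed vector are needed, a split of this kind can be computed in one pass over the class observations, and the mean itself can be accumulated as a running sum.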

Validation of the neglection criterion
In order to verify the validity of the proposed criterion, we train and evaluate a VGG16 network for 420 epochs on the CIFAR10 dataset and record the testing correct classification rate $\mathrm{CCR}_i$, the neglection measure $\theta_i$, and the gradient vectors $\boldsymbol{\delta}_i$, $\hat{\boldsymbol{\delta}}_i$ and $\check{\boldsymbol{\delta}}_i$ for each epoch and class $i$, $i = 0, \dots, 9$. The exact details of the network architecture and the training procedure are provided in Section 3.2. The recorded values of $\theta_i$ and $\mathrm{CCR}_i$ are shown in Figure 1, while the lengths of the three gradients between epochs 100 and 200 are depicted in Figure 2. We observe the following: i) There is a clear correlation between the generalization error rate and the neglection criterion. More specifically, as shown in Figure 1, the neglection values can be used to rank the classes in terms of their expected generalization performance. ii) From the CCRs, the classes can be roughly categorized into two groups, i.e., one group with classes 3 and 5, which attain a rather low CCR, and another group with the rest of the classes, which have better CCRs. Looking at the temporal segment from epoch 0 to epoch 30, we observe that the classes of the first group clearly exhibit a smaller rate of CCR increase, while the majority of the classes in the second group almost attain their steady-state condition during this period. Moreover, after the 10th epoch a CCR gap of more than 10% in absolute terms is observed between the first and second group, which stabilizes after the 230th epoch. Exactly the same conclusions can be drawn from the evolution of the $\theta_i$ values, where in this case a gap of 1 unit between the two groups is observed after the 30th epoch. iii) The norm of the gradient update $\boldsymbol{\delta}_i$ alone, or of its contributing parts $\hat{\boldsymbol{\delta}}_i$, $\check{\boldsymbol{\delta}}_i$, exhibits high fluctuations and a rather noisy behavior, and its direct observation does not provide any valuable information concerning the generalization performance of the classes during the training procedure.
From the above analysis we can see that a group of classes is neglected during the optimization procedure and that the proposed criterion can be used to identify these classes, verifying the theoretical analysis in Section 2.1.

Multiclass classification using SDNNs
Datasets For the experimental evaluation of the proposed approach on the problem of multiclass classification, the following three datasets are used: i) The CIFAR-10 and CIFAR-100 datasets [11] consist of 60000 32 × 32 color images each, drawn from 10 and 100 classes, respectively. Both datasets are divided into a training and a test partition of 50000 and 10000 images, respectively. ii) The street view house numbers (SVHN) dataset [14] contains 630420 color images of 32 × 32 pixel resolution, similar to the CIFAR datasets. They depict house numbers extracted from Google Street View images, i.e., each image belongs to one of ten classes. The dataset is split into a training, a testing and an extra partition of 73257, 26032 and 531131 images, respectively. Following the standard procedure for this dataset, the training and extra partitions are combined in our experiments to form a new training partition.
Experimental setup Two modern DNN architectures are used for the evaluation of the proposed approach, namely, the VGG16 [16] with batch normalization after every convolutional layer, and two variants of the wide residual networks (WRN) [21], depending on the dataset. Specifically, a WRN with depth 28, widening factor 10 (WRN-28-10) and dropout rate of 0.4 is used for the CIFAR datasets, and the WRN-16-8 with a dropout rate of 0.3 is employed for the SVHN dataset. These two WRN architectures are employed because they have exhibited state-of-the-art performance on the above datasets [4]. All networks are trained for 200 epochs using the CE loss (3), minibatch SGD with Nesterov momentum of 0.9, batch size of 128, weight decay of 0.0005, and an exponential learning rate schedule set to decrease at the 60th, 120th and 160th epoch. For the CIFAR datasets, the initial learning rate is set to 0.1 and reduced by a factor of 0.1 according to the learning rate schedule above, while for the SVHN dataset an initial learning rate of 0.01 and a reduction factor of 0.2 are used. The images are normalized per channel to zero mean and unit variance, and data augmentation is performed during training following [4], i.e., 4-pixel zero-padding and random cropping, horizontal mirroring with 50% probability, and cutout of size 16 × 16 and 8 × 8 for the CIFAR-10 and CIFAR-100 datasets, respectively. The SVHN dataset undergoes the same normalization; however, only 20 × 20 cutout is used to augment it.
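The learning rate schedule described above, in which the rate is multiplied by a fixed factor at epochs 60, 120 and 160, can be sketched as a small helper. The function name `learning_rate` is hypothetical, and counting epochs from zero with the drop applied at the start of the milestone epoch is an assumption.

```python
def learning_rate(epoch, initial_lr, factor, milestones=(60, 120, 160)):
    """Learning rate in a given epoch for a multiplicative step schedule."""
    # Number of milestones already reached (epochs assumed to start at 0).
    drops = sum(1 for m in milestones if epoch >= m)
    return initial_lr * factor ** drops


# CIFAR setting from the text: initial rate 0.1, factor 0.1.
cifar_rates = [learning_rate(e, 0.1, 0.1) for e in (0, 60, 120, 160)]
# SVHN setting from the text: initial rate 0.01, factor 0.2.
svhn_rates = [learning_rate(e, 0.01, 0.2) for e in (0, 60, 120, 160)]
print(cifar_rates, svhn_rates)
```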
The subclass VGG16 (SVGG16) and WRN (SWRN) are created as follows. The original VGG16 and WRN are first run for 30 epochs in order to compute a reliable neglection score $\theta_i$ for each class. In this way, 2 classes from CIFAR10 and SVHN (20% of the total classes) and 10 classes from CIFAR100 (10% of the total classes) with the highest $\theta_i$ values are selected, i.e., the classes with labels 3, 5 from CIFAR10, 1, 3 from SVHN and 0, 11, 18, 35, 53, 55, 62, 69, 72, 88 from CIFAR100. In order to alleviate any class imbalance problems resulting from the partitioning into subclasses, the selected classes are first doubled in size using the augmentation method described in [9], and then the k-means algorithm is applied to create two new subclasses from each class. The augmented datasets are then used to train SVGG16 and SWRN using the SCE loss (11) and the training procedure described above for the conventional networks. Learning is performed using the training partition of the datasets, and the performance of each method is measured using the correct classification rate (CCR) over all classes achieved by the trained network on the test set.
All networks are implemented in PyTorch, extending the code provided in [4,21], and the experimental evaluation is performed on a PC with an Intel i7 3770K @ 3.5 GHz, 32 GB RAM, Windows 10, and an Nvidia GeForce GTX 1080 Ti GPU.

Results
The evaluation results in terms of CCR and training times in hours are shown in Table 1. The testing times are only a few seconds in all cases (ranging from 5 s for VGG16 on CIFAR10 to 12 s for SWRN on SVHN). From the obtained results we can see that the proposed SVGG16 and SWRN outperform the conventional networks on all datasets, with differences in performance from 0.21% (SWRN over WRN on SVHN) to ≈ 2.5% (SVGG16 over VGG16 on CIFAR100). Considering that the CCRs obtained with the WRN combined with cutout regularization [4] are currently among the state-of-the-art performances, even the small improvements obtained with the proposed approach are considered significant. Moreover, we observe that the training time overhead caused by the application of the subclass approach is negligible for the medium-size CIFAR datasets, and relatively small for the much larger SVHN dataset.

Multilabel classification using SDNNs
Dataset The YT8M [1] is utilized to evaluate the proposed approach for the task of multilabel classification. This is the largest publicly available multilabel video dataset, consisting of 6134598 videos annotated with one or more labels from 3862 classes (3.4 labels per video on average). To facilitate the comparison of different classification techniques, the dataset is already divided into a training, an evaluation and a testing partition, consisting of 3888919, 1112356 and 1133323 videos, respectively. Visual and audio feature vectors in $\mathbb{R}^{1024}$ and $\mathbb{R}^{128}$, respectively, are already provided at video-level as well as at frame-level granularity. The data is stored in TensorFlow's TFRecord file format (3844 shards for each data partition and granularity level), which offers very efficient import and preprocessing functionalities for large-scale datasets.
Experimental setup For the evaluation, a rather simple convolutional neural network (CNN) is utilized, with a convolutional, a max-pooling, a dropout and an SG layer of c outputs. The convolutional layer consists of 64 one-dimensional (1D) filters and is equipped with a rectification (ReLU) nonlinearity. Each filter has a receptive field of size 3 and stride 1, and zero padding is applied in order to preserve the spatial size of the input signals. The max-pooling layer employs a filter of size 2 and stride 2, while a keep-rate of 0.7 is used for the dropout layer. The CE loss (3) combined with the minibatch SGD-BP algorithm and weight decay of 0.0005 is used for training the CNN. The training is performed over 5 epochs with an exponential learning rate schedule, an initial learning rate of 0.001, a learning rate decay of 0.95 in every epoch, and a batch size of 512. For the construction of the subclass CNN (SCNN), the CNN above is initially trained on the training set for 1/3 of an epoch in order to obtain a neglection value $\theta_i$ (9) for each YT8M class, and the 386 classes with the highest $\theta_i$ are selected, i.e., 10% of the total number of classes. The selected classes are then partitioned into $H_i = 2$ subclasses using the efficient on-line algorithm described in Algorithm 1, avoiding the loading of the whole dataset or large parts of it in memory, which would be infeasible for the YT8M dataset. Moreover, data augmentation is performed on the neglected classes using the extrapolation technique described in Section 2.3, setting λ = 0.5. In this way the number of observations in each subclass partition is doubled. The resulting SCNN is trained using the proposed SCE loss (11) and the training procedure described for the conventional CNN. For completeness, a standard logistic regression (LR) classifier is also evaluated using the same training procedure with an initial learning rate of 0.001.
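The layer sizes implied by this setup can be checked with a few lines of arithmetic. The sketch below assumes that the 1024-dimensional video-level visual vector is fed to the network as a single-channel 1D signal and that the learning rate is multiplied by 0.95 once per epoch; the helper names are hypothetical and nothing here is taken from the authors' code.

```python
def conv1d_out_len(length, kernel, stride, padding):
    """Output length of a 1D convolution with symmetric zero padding."""
    return (length + 2 * padding - kernel) // stride + 1


def pool1d_out_len(length, kernel, stride):
    """Output length of a 1D max-pooling layer without padding."""
    return (length - kernel) // stride + 1


def lr_at_epoch(epoch, initial_lr=0.001, decay=0.95):
    """Exponentially decayed learning rate (epochs assumed to start at 0)."""
    return initial_lr * decay ** epoch


input_len = 1024                                               # visual feature size
conv_len = conv1d_out_len(input_len, kernel=3, stride=1, padding=1)
pool_len = pool1d_out_len(conv_len, kernel=2, stride=2)
flat_features = 64 * pool_len                                  # 64 filters
num_neglected = 3862 // 10                                     # 10% of the classes
print(conv_len, pool_len, flat_features, num_neglected)
```

Padding of one sample per side keeps the convolution output at 1024 samples, pooling halves it to 512, and the 64 filters therefore yield 32768 inputs to the output layer under these assumptions.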
We performed experiments with the video-level visual features, as well as with audio-visual features produced by concatenating the video-level visual and audio feature vectors. In all cases L2-normalization was applied. The models are trained and evaluated using the YT8M training and validation sets, respectively. The labeling information for the testing set is not provided, and for this reason it is excluded from the evaluation. Nevertheless, as reported in relevant works [8], the performance difference between the validation and test sets is negligible. The evaluation metrics of the YT8M Video Understanding Challenge [1] are used to report our results, namely, Hit@1, precision at equal recall rate (PERR), mean average precision (mAP), and global average precision at 20 (GAP@20), with the latter being the official metric of the YT8M challenge for ranking the participating teams. The models are implemented in TensorFlow and the evaluation is performed on the same PC used in Section 3.2.
Results The evaluation results in terms of Hit@1, PERR, mAP, GAP@20 and training time ($T_{tr}$) in minutes for each method are shown in Table 2. Moreover, in Table 3 we show state-of-the-art results achieved by single-model approaches on YT8M. From the analysis of the obtained results we observe the following: i) The SCNN attains the best results, outperforming the conventional CNN by 1% and 1.5% GAP using the visual and audio-visual features, respectively. Both networks outperform the standard LR. ii) By exploiting the audio information, both CNN and SCNN attain a significant performance gain of more than 3%. On the other hand, a degradation in performance is observed for the LR model, which most likely does not have the capacity to exploit the additional discriminant information provided by the audio modality. iii) As shown in Table 3, our SCNN method, achieving a GAP of 82.2%, performs on par with the best single-model approaches reported in [8,13,2,18].
This is an excellent performance considering that our SCNN exploits only the video-level feature vectors provided by the YT8M dataset, in contrast to the top performers in the competition, which additionally exploit the frame-level visual features and build upon stronger and much more computationally demanding feature vector descriptors such as Fisher Vectors, VLAD, BoW, and others [2,18]. We should also note that the best performing approach [17] in the YT8M competition achieved a GAP score of 88.9%. However, this is achieved using an ensemble of classifiers and a variety of feature descriptors (e.g. NetVLAD, FVNet, DBoF), whose extraction and use would increase the computation requirements by at least an order of magnitude; thus this approach cannot be fairly compared with our proposed approach, which creates a single model using the video-level descriptors already provided in the YT8M dataset.
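For readers unfamiliar with the simpler metrics used in the evaluation above, the sketch below shows how Hit@1 and PERR are commonly computed for multilabel predictions. It is an illustrative plain-Python version with hypothetical function names and toy scores, not the official challenge code, and it assumes that every video has at least one ground-truth label.

```python
def hit_at_1(scores, labels):
    """Fraction of videos whose top-scoring class is a ground-truth label."""
    hits = 0
    for s, truth in zip(scores, labels):
        top = max(range(len(s)), key=lambda c: s[c])
        hits += top in truth
    return hits / len(scores)


def perr(scores, labels):
    """Precision at equal recall rate.

    For a video with k ground-truth labels, take its k top-scoring classes
    and measure the fraction that is correct, then average over videos.
    """
    total = 0.0
    for s, truth in zip(scores, labels):
        k = len(truth)
        ranked = sorted(range(len(s)), key=lambda c: s[c], reverse=True)[:k]
        total += sum(1 for c in ranked if c in truth) / k
    return total / len(scores)


# Two toy videos with three classes; labels are sets of class indices.
scores = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.1]]
labels = [{0, 1}, {1}]
hit = hit_at_1(scores, labels)
precision_err = perr(scores, labels)
print(hit, precision_err)
```

In the toy example the top prediction is correct for both videos, while only one of the two top-ranked classes of the first video is a true label, which lowers PERR but not Hit@1.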