A data-aware dictionary-learning based technique for the acceleration of deep convolutional networks

The deployment of high-performing deep learning models on platforms of limited resources is currently an active area of research. Among the main directions followed so far, pre-trained neural networks are accelerated and compressed by appropriately modifying their structure and/or parameters. Capitalizing on a recently proposed codebook of a special structure that can be utilized in the frame of the so-called weight-sharing methods, this paper describes a "data-aware" technique for designing such a codebook. The performance of the technique, in terms of the observed representation error and classification accuracy versus the achieved acceleration ratio, is demonstrated on the VGG16 and ResNet18 models, pre-trained on the ILSVRC2012 dataset.


I. INTRODUCTION
Nowadays, there is an increasing interest in deploying Deep Neural Network (DNN) models on (mobile) platforms of limited resources [1]. As high-end DNN models consist of tens, or even hundreds, of millions of parameters [2], the design of smaller/faster, yet still high-performing, DNN models has recently been considered a key objective. To address this, two main lines of research can be identified [2]. In the first line, new compact, smaller DNN models are designed, or searched for in a design space, for the applications at hand (e.g., SqueezeNet [3], MobileNets [4], and EfficientNet [5]). In the second line, existing pre-trained, highly performing DNN models (e.g., Visual Geometry Group (VGG) [6], Residual Net (ResNet) [7], etc.) are transformed into new, smaller models by utilizing Model Compression and Acceleration (MCA) techniques. The latter line of research enables high-performing, pre-trained DNN models to be utilized also as backbone modules in DNN models designed for different (but similar in nature) applications. For example, the convolutional layers of VGG (trained initially for image classification) constitute the core modules of the Fast R-CNN object detector [8].
The MCA-related literature has grown considerably in recent years, and there are numerous surveys that provide a comprehensive overview of the area ([2], [9], [10], [11]). To roughly organize the relevant literature: some works propose the pruning (removal) of unimportant parameters, which are then not considered during inference (e.g., filters [12], [13]). Other works utilize weight sharing, by reducing the bit-width of the involved parameters or by increasing their common representations (e.g., in a scalar [14], or vector/product [15], [16], [17], [18], [19] quantization fashion). These common representations constitute a so-called codebook, with a twofold aim. First, the network can be compressed, as fewer parameters need to be stored. Second, it can be accelerated, as the involved calculations are performed using only the codebook and not the original (considerably more numerous) parameters. Finally, other works employ tensor/matrix decompositions, factorizing the involved quantities by utilizing, for instance, low-rankness [20].
In this paper, we capitalize on a recently proposed codebook of a special structure that can be utilized in the frame of the so-called weight-sharing methods [19]. In [19], the problem was treated from a "data-agnostic" perspective, namely by directly approximating the original kernel parameters. Here, we revisit the problem of designing this codebook by introducing a "data-aware" technique, i.e., a technique that utilizes part of the available dataset in order to accelerate / compress the targeted DNN model. For this, the proposed technique minimizes the representation error at the output of a single convolutional layer, instead of the quantization error considered in [19]. The performance of the new technique, in terms of representation error and classification accuracy, is assessed when applied to the VGG16 and ResNet18 DNN models, pre-trained on the ILSVRC2012 dataset, and compared with [16] and [19]. The results demonstrate an improvement in the representation error at the output of the convolutional layers, which ultimately results in a smaller accuracy loss for the accelerated models.

II. PRELIMINARIES
Let us first define the linear operation performed at a convolutional layer between an input volume $\mathbf{X} \in \mathbb{R}^{m \times m \times N}$ and the $k$-th kernel volume $\mathbf{W}_k \in \mathbb{R}^{p \times p \times N}$, $k = 1, \ldots, M$. To simplify notation, and without loss of generality, we assume square-shaped input feature maps and kernel filters, as well as a stride of 1. We also adopt a contiguous numbering format for the spatial coordinates $(n, l)$ of the involved tensors, following any numbering convention (e.g., a row- or column-major order). Using this convention, we can write the convolution operation in terms of dot-products between input and kernel vectors, as follows:

$$U_k[i] = \sum_{j=1}^{p^2} \mathbf{x}_{i,j}^T \mathbf{w}_{k,j}, \qquad (1)$$

where $U_k[i]$ denotes the $i$-th element (residing at the $i$-th spatial coordinate) of the $k$-th convolutional output, $\mathbf{x}_{i,j} \in \mathbb{R}^{N \times 1}$ is the $j$-th (depth-wise) input vector around the $i$-th spatial position of the input, while $\mathbf{w}_{k,j} \in \mathbb{R}^{N \times 1}$ denotes the $j$-th (depth-wise) vector of the $k$-th kernel, using the same numbering convention. It is stressed here that the involved vectors are defined along the $N$ channels of the input and kernel volumes. By adopting the product quantization framework [15], the $N$-Dimensional (D) vector space is partitioned into $S$ $\bar{N}$-D subspaces, with $\bar{N} = N/S$, so that the $s$-th subspace spans dimensions $[(s-1)\bar{N} + 1, \ldots, s\bar{N}]$, $s = 1, \ldots, S$. Then, by partitioning the vectors $\mathbf{x}_{i,j}$ and $\mathbf{w}_{k,j}$ defined in Eq. (1) as $\mathbf{x}_{i,j} = [(\mathbf{x}_{i,j}^1)^T, \ldots, (\mathbf{x}_{i,j}^S)^T]^T$ and $\mathbf{w}_{k,j} = [(\mathbf{w}_{k,j}^1)^T, \ldots, (\mathbf{w}_{k,j}^S)^T]^T$, respectively, where each of the sub-vectors lies in an $\bar{N}$-D space, (1) can be rewritten as

$$U_k[i] = \sum_{j=1}^{p^2} \sum_{s=1}^{S} (\mathbf{x}_{i,j}^s)^T \mathbf{w}_{k,j}^s, \qquad (2)$$

where the inner sum denotes the contribution of the $s$-th subspace to the $k$-th convolutional output, at position $i$. For subspace $s$, the goal of product quantization is to perform vector quantization on the $Mp^2$ kernel sub-vectors lying in this subspace, i.e., cluster them into $K_s \ll Mp^2$ clusters and represent each sub-vector by the centroid (or representative) of the cluster it belongs to.
This way, the original dot-products, between the input and the M p 2 kernel sub-vectors, are approximated by the ones between the input and the K s representatives.
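To make the subspace decomposition concrete, the following minimal sketch (with illustrative sizes and random data, not code from the paper) verifies that the per-subspace sums of Eq. (2) reproduce the full dot-product of Eq. (1):

```python
import numpy as np

# Split an N-dim dot-product into S subspace contributions, as in Eq. (2).
# All names and sizes here are purely illustrative.
N, S = 16, 4          # channel dimension and number of subspaces
Ns = N // S           # per-subspace dimension N/S

rng = np.random.default_rng(0)
x = rng.standard_normal(N)   # one depth-wise input vector x_{i,j}
w = rng.standard_normal(N)   # one depth-wise kernel vector w_{k,j}

full = x @ w                 # Eq. (1): full dot-product
per_subspace = sum(x[s*Ns:(s+1)*Ns] @ w[s*Ns:(s+1)*Ns] for s in range(S))

assert np.isclose(full, per_subspace)   # Eq. (2) equals Eq. (1)
```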

III. ADOPTED CODEBOOK STRUCTURE
In this section, we describe the adopted codebook structure (as proposed in [19]), which can be viewed as a special case of the general Dictionary Learning (DL) problem [21], and compare it against the conventional approach (referred to as Vector Quantization (VQ) in the following).
Specifically, the conventional VQ-based and the proposed DL-based kernel approximations can be defined as follows:

$$\mathbf{W} \approx \mathbf{C}\,\boldsymbol{\Gamma} \;\; \text{(VQ)}, \qquad \mathbf{W} \approx \mathbf{D}\,\boldsymbol{\Lambda}\,\boldsymbol{\Gamma} \;\; \text{(DL)}, \qquad (3)$$

where the columns of $\mathbf{W} \in \mathbb{R}^{\bar{N} \times p^2 M}$ and $\boldsymbol{\Gamma} \in \mathbb{R}^{K_{vq} \times p^2 M}$ (or $\mathbb{R}^{K_{dl} \times p^2 M}$ in the DL case) contain the sub-vectors of all kernel volumes (of a particular subspace) and the assignment vectors, respectively. Matrix $\mathbf{C} \in \mathbb{R}^{\bar{N} \times K_{vq}}$ denotes the representatives (or cluster centroids) in the VQ approximation, whereas $\mathbf{D} \in \mathbb{R}^{\bar{N} \times L_{dl}}$ and $\boldsymbol{\Lambda} \in \mathbb{R}^{L_{dl} \times K_{dl}}$ denote the dictionary and the matrix of sparse coefficients, respectively, for the DL approximation. Each column of $\boldsymbol{\Gamma}$ has exactly one non-zero element, equal to 1, meaning that each column of $\mathbf{W}$ is approximated by one column of the codebook $\mathbf{C}$ in the VQ case and one column of the codebook $\mathbf{D}\boldsymbol{\Lambda}$ in the DL case.
Thus, in the conventional case, the $Mp^2$ original sub-vectors are approximated by $K_{vq} \ll p^2 M$ representatives, using the codebook $\mathbf{C}$, while in the DL case they are approximated via $K_{dl}$ representatives contained in the codebook $\mathbf{D}\boldsymbol{\Lambda}$, which, in turn, are obtained as linear combinations of at most $\alpha$ atoms from a dictionary of size $L_{dl}$, with $L_{dl} < K_{dl} \ll p^2 M$. It should be noted that, due to the linearity of the operations performed in the convolutional layer, the sparse coefficients in $\boldsymbol{\Lambda}$ need only be applied to the convolution between the input and the dictionary atoms in $\mathbf{D}$, instead of to the atoms themselves. This endows the proposed approximation scheme with the flexibility to use a number of representatives $K_{dl}$ that is several times larger than $K_{vq}$, while restricting the size of the dictionary (so that $L_{dl} \ll K_{vq}$), thus reducing the number of "dense" operations. This enables the proposed technique to achieve a significantly better approximation of the desired convolutional output, for the same target accelerations, as demonstrated by our experimental results.
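The structural difference between the two codebooks can be sketched as follows; all sizes and the random factors are purely illustrative placeholders, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
Ns, M, p = 8, 64, 3              # subspace dim, #kernels, kernel size (illustrative)
n_cols = p * p * M               # kernel sub-vectors in one subspace
K_vq, K_dl, L_dl, alpha = 32, 64, 16, 3

W = rng.standard_normal((Ns, n_cols))

# VQ: every column of W is replaced by one of K_vq centroids, W ~ C @ Gamma.
C = rng.standard_normal((Ns, K_vq))
a_vq = rng.integers(0, K_vq, size=n_cols)
Gamma_vq = np.zeros((K_vq, n_cols))
Gamma_vq[a_vq, np.arange(n_cols)] = 1.0      # 1-sparse assignment columns
W_vq = C @ Gamma_vq

# DL: the K_dl representatives are themselves alpha-sparse combinations of
# L_dl dictionary atoms, i.e. the codebook is D @ Lam and W ~ D @ Lam @ Gamma.
D = rng.standard_normal((Ns, L_dl))
Lam = np.zeros((L_dl, K_dl))
for k in range(K_dl):                        # at most alpha non-zeros per column
    idx = rng.choice(L_dl, size=alpha, replace=False)
    Lam[idx, k] = rng.standard_normal(alpha)
a_dl = rng.integers(0, K_dl, size=n_cols)
Gamma_dl = np.zeros((K_dl, n_cols))
Gamma_dl[a_dl, np.arange(n_cols)] = 1.0
W_dl = D @ Lam @ Gamma_dl
```

Note that the DL codebook here uses twice as many representatives ($K_{dl} = 2 K_{vq}$) while its "dense" part $\mathbf{D}$ is half the size of $\mathbf{C}$.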

A. Computational complexity
Let us now examine the computational complexity of the original convolutional layer and its two approximate versions, based on the VQ and DL approaches respectively, and define the acceleration gains induced by the approximations. Specifically, by arranging the input sub-vectors of a particular subspace in the columns of a matrix $\mathbf{X} \in \mathbb{R}^{\bar{N} \times m^2}$, the dot-products entailed by the original convolution operation can be obtained as the following matrix product:

$$\mathbf{Y} = \mathbf{X}^T \mathbf{W}, \qquad (4)$$

where $\mathbf{W}$ contains the kernel sub-vectors (as defined in (3)). In the VQ case, the original $\mathbf{W}$ is approximated via the codebook $\mathbf{C}$, meaning that the approximate $\mathbf{Y}$ is obtained as:

$$\mathbf{Y} \approx (\mathbf{X}^T \mathbf{C})\,\boldsymbol{\Gamma}, \qquad (5)$$

namely, it involves calculating the dot-products between the input and the codewords, and then substituting the results appropriately, as indicated by the columns of $\boldsymbol{\Gamma}$. Accordingly, for the DL-based approximation scheme, we can write:

$$\mathbf{Y} \approx (\mathbf{X}^T \mathbf{D})\,\boldsymbol{\Lambda}\,\boldsymbol{\Gamma}, \qquad (6)$$

meaning that, in this case, the approximate dot-products are obtained via a two-stage operation: first, the dot-products between the input and the dictionary atoms are calculated and, subsequently, their linear combinations are obtained according to the columns of $\boldsymbol{\Lambda}$ (the application of $\boldsymbol{\Gamma}$ amounts to a mere rearrangement and entails no arithmetic). Following (4), (5), (6), it is not difficult to show that the computational complexities, in terms of Multiply-and-Accumulate (MAC) operations per subspace, for the original and the approximate versions of a convolutional layer, using VQ and DL respectively, can be expressed as follows ([16], [19]):

$$T_{orig} = m^2 \bar{N} p^2 M, \qquad (7)$$
$$T_{vq} = m^2 \bar{N} K_{vq}, \qquad (8)$$
$$T_{dl} = m^2 (\bar{N} L_{dl} + \alpha K_{dl}). \qquad (9)$$
The achieved acceleration ratio is defined as the ratio of the original over the accelerated computational complexity and, for the two rival parameter-sharing approaches examined in this paper, takes the following form:

$$r_{vq} = \frac{T_{orig}}{T_{vq}} = \frac{p^2 M}{K_{vq}}, \qquad (10)$$
$$r_{dl} = \frac{T_{orig}}{T_{dl}} = \frac{\bar{N} p^2 M}{\bar{N} L_{dl} + \alpha K_{dl}}. \qquad (11)$$

Finally, of particular interest is also the relative complexity between the proposed DL-based and the VQ approach, which also acts as a guideline for selecting the free parameters of the proposed technique. To this end, it is not difficult to see that the VQ- and DL-based approximate layers entail the same computational complexity (i.e., yield the same acceleration ratio) when the respective parameters satisfy the following equality:

$$L_{dl} = K_{vq}\left(1 - \frac{c\,\alpha}{\bar{N}}\right), \qquad (12)$$

where $c > 1$ is a size coefficient between the DL-based and the VQ-based codebooks, i.e., $K_{dl} = c\,K_{vq}$ holds.
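As a quick numerical check of the equal-complexity condition, the following sketch (with illustrative layer sizes, not values from the paper's experiments) evaluates the per-subspace MAC counts and acceleration ratios:

```python
# Illustrative layer sizes; the equal-complexity condition then fixes L_dl.
m, Ns, p, M = 56, 8, 3, 64          # feature-map size, subspace dim, kernel size, #kernels
K_vq, c, alpha = 32, 2, 3
K_dl = c * K_vq
L_dl = K_vq * (1 - c * alpha / Ns)  # equal-complexity condition -> 8.0 here

T_orig = m**2 * Ns * p**2 * M             # MACs of the original layer (per subspace)
T_vq = m**2 * Ns * K_vq                   # MACs of the VQ approximation
T_dl = m**2 * (Ns * L_dl + alpha * K_dl)  # MACs of the DL approximation

r_vq = T_orig / T_vq                # p^2 M / K_vq = 18.0
r_dl = T_orig / T_dl                # equal to r_vq by construction
```

With these values the DL codebook holds twice as many representatives as the VQ one at exactly the same MAC budget.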

IV. THE PROPOSED DATA-AWARE SOLUTION
Let us collect the $j$-th vector from all kernels in the columns of a matrix $\mathbf{W}_j \in \mathbb{R}^{N \times M}$, as $\mathbf{W}_j = [\mathbf{w}_{1,j} \cdots \mathbf{w}_{M,j}]$, $j = 1, \ldots, p^2$, and let us define:

$$\mathbf{u}(i) = \sum_{j=1}^{p^2} \mathbf{W}_j^T \mathbf{x}_{i,j}, \qquad (13)$$

where $\mathbf{u}(i) = [U_1[i], \ldots, U_M[i]]^T$ contains the outputs from all kernels at the $i$-th spatial location. By sampling $I$ spatial locations (from the same or different output volumes), we obtain $I$ input sub-volumes of dimension $p \times p \times N$. Let us, also, collect the $j$-th vector from all input sub-volumes in the columns of a matrix $\mathbf{X}_j \in \mathbb{R}^{N \times I}$, i.e., $\mathbf{X}_j = [\mathbf{x}_{1,j} \cdots \mathbf{x}_{I,j}]$, $j = 1, \ldots, p^2$. Furthermore, by defining the $I \times M$ matrix $\mathbf{U} = [\mathbf{u}(1) \cdots \mathbf{u}(I)]^T$, we have:

$$\mathbf{U} = \sum_{j=1}^{p^2} \mathbf{X}_j^T \mathbf{W}_j = \sum_{j=1}^{p^2} \sum_{s=1}^{S} (\mathbf{X}_j^s)^T \mathbf{W}_j^s, \qquad (14)$$

where $\mathbf{W}_j^s$ and $\mathbf{X}_j^s$ contain the rows lying in the $s$-th subspace of $\mathbf{W}_j$ and $\mathbf{X}_j$, respectively. Under the DL approximation defined in (3), we are seeking the factorization of each $\mathbf{W}_j^s$ in (14) as $\mathbf{W}_j^s \approx \mathbf{D}^s \boldsymbol{\Lambda}^s \boldsymbol{\Gamma}_j^s$. In a data-aware fashion, this can be achieved via the following minimization problem:

$$\min_{\{\mathbf{D}^s, \boldsymbol{\Lambda}^s, \boldsymbol{\Gamma}_j^s\}} \left\| \mathbf{U} - \sum_{j=1}^{p^2} \sum_{s=1}^{S} (\mathbf{X}_j^s)^T \mathbf{D}^s \boldsymbol{\Lambda}^s \boldsymbol{\Gamma}_j^s \right\|_F^2, \qquad (15)$$

subject to each column of $\boldsymbol{\Lambda}^s$ having at most $\alpha$ non-zero elements and each column of $\boldsymbol{\Gamma}_j^s$ having a single non-zero element, equal to 1.
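For concreteness, a minimal sketch of evaluating the data-aware objective in (15), for a single subspace ($S = 1$) and given factors, could look as follows; all sizes are illustrative and the factors are random placeholders rather than optimized quantities:

```python
import numpy as np

# Evaluate the objective of (15) for S = 1 and random placeholder factors.
rng = np.random.default_rng(2)
Ns, I, M, p2 = 8, 100, 16, 9        # subspace dim, sampled locations, kernels, p^2
L_dl, K_dl = 12, 24

X = [rng.standard_normal((Ns, I)) for _ in range(p2)]   # input matrices X_j
W = [rng.standard_normal((Ns, M)) for _ in range(p2)]   # kernel matrices W_j
U = sum(X[j].T @ W[j] for j in range(p2))               # sampled layer output, I x M

D = rng.standard_normal((Ns, L_dl))                     # dictionary
Lam = rng.standard_normal((L_dl, K_dl))                 # coefficients (dense here)
Gam = [np.eye(K_dl)[:, rng.integers(0, K_dl, size=M)]   # 1-sparse assignments
       for _ in range(p2)]

U_hat = sum(X[j].T @ D @ Lam @ Gam[j] for j in range(p2))
err = np.linalg.norm(U - U_hat) / np.linalg.norm(U)     # relative Frobenius error
```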
The minimization of (15) is carried out in a coordinate-descent fashion, by optimizing one set of parameters at a time while keeping all others fixed. In the following subsections, we briefly describe this strategy, as well as the proposed solutions to the sub-problems entailed by it.

A. Sparse coding
To minimize (15), the optimization with respect to the $\boldsymbol{\Lambda}^s$'s is first carried out in an alternating fashion among $s$, keeping all other unknowns fixed. Focusing on a particular subspace $t$, the cost function of (15) can be written as

$$\left\| \mathbf{y}_t - \sum_{i=1}^{K_{dl}} \mathbf{G}_i^t \boldsymbol{\lambda}_i^t \right\|_2^2, \qquad (16)$$

where $\mathbf{Y}_t$ gathers all terms of (15) apart from the ones related to subspace $t$, $\mathbf{y}_t = \mathrm{vec}(\mathbf{Y}_t) \in \mathbb{R}^{MI \times 1}$, $\mathbf{G}_i^t \in \mathbb{R}^{MI \times L_{dl}}$, and $\boldsymbol{\lambda}_i \in \mathbb{R}^{L_{dl} \times 1}$. The summation terms in (16) are obtained via the identity $\mathrm{vec}(\mathbf{A}\mathbf{X}\mathbf{B}) = (\mathbf{B}^T \otimes \mathbf{A})\,\mathrm{vec}(\mathbf{X})$, where $\otimes$ denotes the Kronecker product. The cost function in (16), along with the relevant sparsity constraint, can be minimized using the standard OMP algorithm, alternating among the $\boldsymbol{\lambda}_i^t$'s, $\forall i$.
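A textbook OMP routine of the kind referred to above can be sketched as follows; this is a generic implementation with illustrative data, not the authors' code:

```python
import numpy as np

def omp(G, y, alpha):
    """Greedy OMP: approximate y with at most `alpha` columns of G."""
    residual, support = y.copy(), []
    lam = np.zeros(G.shape[1])
    for _ in range(alpha):
        corr = np.abs(G.T @ residual)
        corr[support] = -np.inf                 # do not reselect columns
        support.append(int(np.argmax(corr)))
        # re-fit the coefficients on the current support (least squares)
        coef, *_ = np.linalg.lstsq(G[:, support], y, rcond=None)
        residual = y - G[:, support] @ coef
    lam[support] = coef
    return lam

# toy demo: y is an exact 2-sparse combination of the columns of G
rng = np.random.default_rng(3)
G = rng.standard_normal((20, 10))
y = 2.0 * G[:, 1] - G[:, 7]
lam = omp(G, y, 2)
```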

B. Dictionary update
In this sub-problem, the optimization is carried out similarly for each $\mathbf{D}^t$. The cost function, in this case, can be written as

$$\left\| \mathbf{y}_t - \mathbf{H}\,\mathrm{vec}(\mathbf{D}) \right\|_2^2,$$

where the matrix $\mathbf{H}$ collects the (fixed) input-, coefficient-, and assignment-dependent terms, obtained via the same Kronecker-product identity, and $\mathbf{d}_i$, the $i$-th column of $\mathbf{D}$, denotes the $i$-th dictionary atom. It is noted that the superscript $t$ was dropped for simplicity.
Thus, the dictionary-update step of the proposed technique takes the form of the following Quadratically Constrained Quadratic Program (QCQP):

$$\min_{\mathbf{d}} \; \|\mathbf{y}_t - \mathbf{H}\mathbf{d}\|_2^2 \quad \text{s.t.} \quad \|\mathbf{d}_i\|_2^2 = 1, \; i = 1, \ldots, L_{dl}, \qquad (17)$$

where $\mathbf{d} = \mathrm{vec}(\mathbf{D})$. Though (17) constitutes a non-convex, NP-hard problem in general, it can be tackled via the technique of Semidefinite Relaxation (SDR) [22], [23], which translates a QCQP with an unknown vector $\mathbf{x}$ into a convex trace-minimization problem in which the unknown is a matrix $\mathbf{X}$. Such Semidefinite Programs (SDPs) can be solved using standard software packages like CVX [24]. Finally, it is noted that, depending on the input size, the target acceleration, and the selection of parameters, $\mathbf{H}$ can become quite large. Fortunately, in our case, $\mathbf{H}$ is also extremely sparse, a property that can be exploited to speed up the involved computations.
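As a rough illustration only, and NOT the SDR solution adopted in the paper, a common heuristic for dictionary updates of this kind is an unconstrained least-squares step via the same Kronecker-product identity, followed by renormalizing the atoms and absorbing the scales into the coefficients (which leaves the product $\mathbf{D}\boldsymbol{\Lambda}$, and hence the cost, unchanged); all names and data below are illustrative:

```python
import numpy as np

# Heuristic sketch (not the paper's SDR step): least-squares dictionary update
# using vec(A D B) = (B^T kron A) vec(D), then unit-normalize the atoms and
# push the scales into the coefficient matrix so that D @ Lam is preserved.
def dictionary_update_ls(U, X, Lam, Gam, shape):
    p2 = len(X)
    H = sum(np.kron((Lam @ Gam[j]).T, X[j].T) for j in range(p2))
    d, *_ = np.linalg.lstsq(H, U.reshape(-1, order="F"), rcond=None)
    D = d.reshape(shape, order="F")
    norms = np.linalg.norm(D, axis=0)
    return D / norms, Lam * norms[:, None]      # unit-norm atoms, rescaled coeffs

# toy demo with random placeholder data (single subspace)
rng = np.random.default_rng(4)
Ns, I, M, L_dl, K_dl, p2 = 4, 30, 6, 5, 8, 2
X = [rng.standard_normal((Ns, I)) for _ in range(p2)]
Lam0 = rng.standard_normal((L_dl, K_dl))
Gam = [np.eye(K_dl)[:, rng.integers(0, K_dl, size=M)] for _ in range(p2)]
U = rng.standard_normal((I, M))
D1, Lam1 = dictionary_update_ls(U, X, Lam0, Gam, (Ns, L_dl))
```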

C. Assignment update
Let us first write the cost function in (16) as follows:

$$\sum_{k=1}^{M} \left\| \mathbf{y}_k^t - \sum_{j=1}^{p^2} \mathbf{F}_j^t \boldsymbol{\gamma}_{j,k}^t \right\|_2^2, \qquad (18)$$

where $\mathbf{y}_k^t \in \mathbb{R}^{I \times 1}$ denotes the $k$-th column of $\mathbf{Y}_t$, while $\boldsymbol{\gamma}_{j,k}^t \in \mathbb{R}^{K_{dl} \times 1}$ is the $k$-th column of $\boldsymbol{\Gamma}_j$, which holds the assignment vector for the $j$-th sub-vector lying in the $t$-th subspace of the $k$-th kernel. Observing (18), the assignment vectors can be updated for each kernel separately, leading to minimization problems of the following form:

$$\min_{\{\boldsymbol{\zeta}_j\}} \left\| \mathbf{y}_k^t - \sum_{j=1}^{p^2} \mathbf{F}_j^t \boldsymbol{\zeta}_j \right\|_2^2, \qquad (19)$$

where $\mathbf{F}_j^t = (\mathbf{X}_j^t)^T \mathbf{D}^t \boldsymbol{\Lambda}^t \in \mathbb{R}^{I \times K_{dl}}$, under the constraint that each $\boldsymbol{\zeta}_j$ has a single non-zero element, equal to one. Taking into account this constraint, the goal of the assignment-update problem is to select one column $\mathbf{f}_{j,i}^t$ from each $\mathbf{F}_j^t$, $j = 1, \ldots, p^2$ (thereby determining the position of the non-zero unity element in each $\boldsymbol{\gamma}_{j,k}^t$), so that the sum of all $\mathbf{f}_{j,i}^t$'s best approximates $\mathbf{y}_k^t$. This is a problem of combinatorial complexity, which can be tackled via a greedy (OMP-like) algorithm, similar to the one used in the sparse-coding step.
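A simple greedy variant of this column-selection step can be sketched as follows; this is one possible instantiation with illustrative data, not necessarily the authors' exact algorithm:

```python
import numpy as np

def greedy_assign(F_list, y):
    """Pick one column from each F_j so that the running sum tracks y (greedy)."""
    residual = y.astype(float).copy()
    picks = []
    for F in F_list:
        # choose the column of F_j closest to the current residual
        errs = np.linalg.norm(residual[:, None] - F, axis=0)
        i = int(np.argmin(errs))
        picks.append(i)
        residual = residual - F[:, i]
    return picks

# toy demo: p^2 = 4 matrices F_j with K_dl = 5 columns each (illustrative)
rng = np.random.default_rng(6)
F_list = [rng.standard_normal((15, 5)) for _ in range(4)]
y = rng.standard_normal(15)
picks = greedy_assign(F_list, y)
```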

V. EXPERIMENTAL RESULTS
In this section, the performance of the proposed technique is evaluated and compared against the data-agnostic version presented in [19], as well as against both versions of the conventional VQ approach used in [16]. Our experiments are based on pre-trained versions of two state-of-the-art DNNs for image classification, namely, VGG16 [6] and ResNet18 [7], using the training and validation datasets of ILSVRC2012 [25] for obtaining the original conv-layer responses, as well as for fine-tuning and accuracy evaluation purposes.

A. Experiment I. Representation error
In the first experiment, we evaluate the performance of the rival techniques with respect to the achieved representation error for individual conv-layers, namely, the error between the original layer output and the approximate output obtained using the kernel approximations defined in (3), for a range of target accelerations. To this end, we obtain the original layer output $\mathbf{U}$ and the corresponding input sub-volumes $\mathbf{X}$ (as defined in (14)) by randomly sampling the layer's output and input, respectively, for $I = 1000$ images from the ILSVRC2012 training dataset. For the DL-based approach, the (proposed) data-aware approximation was obtained via the solution of (15), while the data-agnostic variant was obtained as in [19]. The latter variant was also used as an initial solution, $\mathbf{D}_0, \boldsymbol{\Lambda}_0, \boldsymbol{\Gamma}_0$, for the proposed data-aware technique. Moreover, the four free parameters of the technique, namely, the subspace dimension $\bar{N}$, the number of representatives $K_{dl}$, the size of the dictionary $L_{dl}$, and the sparsity level $\alpha$, were selected according to the procedure outlined in [19]. Specifically, in this experiment, the subspace dimension was set to $\bar{N} = 8$, which is a typical value in the relevant bibliography, while the parameter values $c = 2$, $\alpha = 3$ were selected after some experimentation. Finally, the procedures presented in [16] were followed for the VQ variants.
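The representation error referred to above can be measured, for instance, as the relative Frobenius error between the original and approximate layer responses; a minimal sketch with synthetic placeholder tensors (the normalization is our assumption, the paper does not spell out the exact metric):

```python
import numpy as np

# Relative Frobenius error between original (U) and approximate (U_hat)
# layer responses; the tensors below are synthetic placeholders.
def relative_error(U, U_hat):
    return np.linalg.norm(U - U_hat) / np.linalg.norm(U)

rng = np.random.default_rng(5)
U = rng.standard_normal((1000, 64))              # e.g., I sampled responses, M kernels
U_hat = U + 0.05 * rng.standard_normal(U.shape)  # a hypothetical approximation
err = relative_error(U, U_hat)                   # roughly 0.05 here
```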

B. Experiment II. Accuracy loss
In this experiment, we apply the proposed technique to the "full-model" acceleration of ResNet18. To this end, we followed a stage-wise acceleration approach (as described in [16], [19]), with each stage involving accelerating (and fixing) one or more layers of the network and, subsequently, fine-tuning (i.e., re-training) the remaining original layers. For fine-tuning and performance assessment, we used the training and validation datasets of ILSVRC2012, respectively. In order to expedite the process, we divided the initial training dataset into smaller subsets (of 100 images per class) for fine-tuning purposes. By following this procedure, we accelerated ResNet18 by 10× and 20×, with an increase of the top-5 classification error by only 0.9% and 2.1%, respectively. These preliminary results are in line with the outcome of Experiment I and further confirm the quality of the proposed technique.

VI. CONCLUSIONS
A new data-aware, weight-approximation technique was proposed in this paper. The original problem was treated in a coordinate-descent fashion, and solutions for the entailed sparse-coding, dictionary-update, and assignment sub-problems were presented. The preliminary results on two state-of-the-art pre-trained DNN models reveal the superior performance of the technique against its data-agnostic variant, as well as against both VQ counterparts.