Structured Pruning of LSTMs via Eigenanalysis and Geometric Median for Mobile Multimedia and Deep Learning Applications

In this paper, a novel structured pruning approach for learning efficient long short-term memory (LSTM) network architectures is proposed. More specifically, the eigenvalues of the covariance matrix associated with the responses of each LSTM layer are computed and utilized to quantify the layers' redundancy and automatically obtain an individual pruning rate for each layer. Subsequently, a Geometric Median based (GM-based) criterion is used to identify and prune in a structured way the most redundant LSTM units, realizing the pruning rates derived in the previous step. The experimental evaluation on the Penn Treebank text corpus and the large-scale YouTube-8M audio-video dataset for the tasks of word-level prediction and visual concept detection, respectively, shows the efficacy of the proposed approach.


I. INTRODUCTION
Deep learning (DL) is currently becoming a game changer in most industries, ranging from media and mobile communications to health care and security [1]-[5]. However, it is well known that top-performing DL models consist of millions or billions of parameters, and their deployment to resource-limited platforms such as smartphones and other mobile devices is challenging [1].
Structured network pruning has been identified as a remedy to the above problem [6]-[9]. However, the structured recurrent neural network (RNN) pruning techniques introduced so far (i.e. [8], [9]) require the modification of the loss function in order to impose sparsity constraints, which may lead to numerical instabilities and performance reduction when a high degree of sparseness is pursued [10]. In order to address the above limitations, we follow a different path, taking advantage of recent advances in structured deep convolutional neural network (DCNN) pruning [6], [11]. Firstly, the eigenanalysis of the sample covariance matrix computed using an LSTM layer's responses is utilized to quantify correlations among units in the same LSTM layer and automatically derive the pruning rates at layer level, as is done, for instance, for convolutional layers in [11]. Subsequently, a GM-based criterion, which has shown superior performance in comparison to criteria exploiting sparsity-inducing constraints in the domain of structured DCNN pruning [6], is used to identify and prune the most replaceable RNN structures according to the pruning rates computed above. Experimental results show that the proposed approach provides competitive performance on the Penn Treebank text corpus [12] and the YouTube-8M audio-video dataset [13] for the tasks of word-level prediction in text and concept detection, respectively.
Source code is made publicly available at: https://github.com/bmezaris/lstm_structured_pruning_geometric_median
The rest of the paper is structured as follows: Section II reviews related work on pruning. Section III details the proposed method. The experimental evaluation is described in Section IV and conclusions are drawn in Section V.
II. RELATED WORK
DNN compression and acceleration approaches can be roughly categorized into quantization, low-rank approximations, knowledge distillation and pruning [1], [2], [14]. The latter is currently receiving increasing attention, mainly because the methods falling in this category can achieve high compression rates while maintaining stable model performance. In the following, we briefly review several pruning approaches in order to put ours into context.
Pruning techniques typically consist of the definition of the elementary network structures as candidates for pruning, an importance estimation criterion to rank the above structures, and a pruning strategy defining how pruning is performed [6]-[9], [11], [15]. Depending on whether a pruning approach removes individual network weights or well-defined network components, it is characterized as unstructured or structured, respectively. Concerning pruning strategies, most approaches either prune a pre-trained model or incorporate pruning into the training procedure. Another important aspect of pruning is whether the layers' pruning rates are fixed or obtained automatically [11]. The latter can usually provide more efficient architectures and is closely related to the network architecture search paradigm [16].
The major advantage of structured pruning techniques over unstructured ones is that the former do not require the use of special-purpose accelerators [8], [15] and thus can take advantage of cheap, widely available devices such as conventional GPUs. The structured pruning of DCNNs has been studied extensively in the literature [6], [7], [11]. In contrast, structured RNN pruning is a much less investigated topic [8], [9]. More specifically, in [8] intrinsic sparse structures (ISSs) of LSTMs are defined and a Group Lasso-based (GL-based) approach is used for ISS pruning. Similarly, the authors in [9] utilize the L0 norm to constrain network parameters and subsequently prune the ISS components which are close to zero, achieving higher pruning rates than [8]. Both of the above works utilize sparsity-inducing regularizers to modify the loss function, which may lead to numerical instabilities and suboptimal solutions for certain network architectures [10]. To avoid these issues, and inspired by best practices in the DCNN structured pruning domain, we quantify the redundancy at layer level using the eigenanalysis of the covariance matrix formed by the layer's responses (e.g. similar to the way it is done in [11] for DCNN filters), and utilize a GM-based criterion [6] to rank and prune the most replaceable LSTM structures in each layer.

A. Formulation
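Suppose an annotated training dataset $\mathcal{X}$ of $N$ sequences and $C$ classes, where the matrix $X_\kappa = [x_{\kappa,1}, \dots, x_{\kappa,T}] \in \mathbb{R}^{F \times T}$ represents the $\kappa$th sequence of length $T$ (without loss of generality it is assumed that all sequences have the same length), $y_\kappa \in \mathbb{R}^{C}$ is its class indicator vector, whose $\iota$th element is 1 if $X_\kappa$ belongs to class $\iota$ and zero otherwise, $x_{\kappa,t}$ is the $t$th feature vector of the $\kappa$th sequence, and $F$ is the input space dimensionality. A DNN consisting of $L$ LSTM layers is utilized for learning the above classes. The computations in the $l$th LSTM layer with respect to the $\kappa$th input sequence at a specified time step $t$ are performed as [17]

$$i^{[l]}_{\kappa,t} = \sigma\big(W^{[l]}_{ix} x^{[l]}_{\kappa,t} + W^{[l]}_{ih} h^{[l]}_{\kappa,t-1} + b^{[l]}_{i}\big),$$
$$f^{[l]}_{\kappa,t} = \sigma\big(W^{[l]}_{fx} x^{[l]}_{\kappa,t} + W^{[l]}_{fh} h^{[l]}_{\kappa,t-1} + b^{[l]}_{f}\big),$$
$$o^{[l]}_{\kappa,t} = \sigma\big(W^{[l]}_{ox} x^{[l]}_{\kappa,t} + W^{[l]}_{oh} h^{[l]}_{\kappa,t-1} + b^{[l]}_{o}\big),$$
$$g^{[l]}_{\kappa,t} = \tanh\big(W^{[l]}_{gx} x^{[l]}_{\kappa,t} + W^{[l]}_{gh} h^{[l]}_{\kappa,t-1} + b^{[l]}_{g}\big),$$
$$c^{[l]}_{\kappa,t} = f^{[l]}_{\kappa,t} \odot c^{[l]}_{\kappa,t-1} + i^{[l]}_{\kappa,t} \odot g^{[l]}_{\kappa,t},$$
$$h^{[l]}_{\kappa,t} = o^{[l]}_{\kappa,t} \odot \tanh\big(c^{[l]}_{\kappa,t}\big),$$

where the superscript $l$ is the layer index (i.e. $l = 1, \dots, L$); $i^{[l]}_{\kappa,t}$, $f^{[l]}_{\kappa,t}$, $o^{[l]}_{\kappa,t}$, $g^{[l]}_{\kappa,t}$, $c^{[l]}_{\kappa,t}$ and $h^{[l]}_{\kappa,t}$ are the $H^{[l]}$-dimensional input gate, forget gate, output gate, input update, unit state and hidden state vectors; $H^{[l]}$ is the number of the layer's units; $x^{[l]}_{\kappa,t} \in \mathbb{R}^{F^{[l]}}$ is the layer's input vector at time step $t$; $\sigma(\cdot)$ and $\tanh(\cdot)$ are applied elementwise and $\odot$ denotes the elementwise product; and $W^{[l]}_{ix}, W^{[l]}_{fx}, W^{[l]}_{ox}, W^{[l]}_{gx} \in \mathbb{R}^{H^{[l]} \times F^{[l]}}$, $W^{[l]}_{ih}, W^{[l]}_{fh}, W^{[l]}_{oh}, W^{[l]}_{gh} \in \mathbb{R}^{H^{[l]} \times H^{[l]}}$, $b^{[l]}_{i}, b^{[l]}_{f}, b^{[l]}_{o}, b^{[l]}_{g} \in \mathbb{R}^{H^{[l]}}$ are the layer's weight matrices and vectors. Based on the above formulation, the goal of structurally pruning LSTM architectures can be stated as follows: given a target pruning rate $\theta \in (0, 1)$ for the overall network, estimate the pruning rate $\theta^{[l]}$ and subsequently select the least significant (in terms of their influence on the overall network classification performance) $\theta^{[l]} H^{[l]}$ units to prune at layer $l$, so that the target pruning rate $\theta$ is met for the overall network.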
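As an illustration of the layer computations referred to above, the following is a minimal NumPy sketch of one standard LSTM time step [17]. The function name `lstm_step`, the gate ordering (input, forget, output, update) and the horizontal stacking of the four gate blocks inside `W_x` and `W_h` are assumptions made for this example; they are not taken from the authors' code.

```python
import numpy as np

def sigmoid(a):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step for a single sequence.

    Assumed (hypothetical) layout:
      W_x: (H, 4F), four column blocks for gates i, f, o, g
      W_h: (H, 4H), four column blocks for gates i, f, o, g
      b:   (4H,),   four consecutive segments for gates i, f, o, g
    """
    F, H = x_t.shape[0], h_prev.shape[0]
    pre = []
    for k in range(4):
        # Pre-activation of gate k from the input and the previous hidden state.
        pre.append(W_x[:, k * F:(k + 1) * F] @ x_t
                   + W_h[:, k * H:(k + 1) * H] @ h_prev
                   + b[k * H:(k + 1) * H])
    i, f, o = sigmoid(pre[0]), sigmoid(pre[1]), sigmoid(pre[2])
    g = np.tanh(pre[3])
    c = f * c_prev + i * g   # unit state
    h = o * np.tanh(c)       # hidden state
    return h, c
```

With this layout, the jth row of `W_x` and `W_h` holds all weights that produce the jth unit, which is the grouping that structured unit pruning operates on.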

B. Computation of layer's pruning rate
Due to its high representational power, the hidden state vector $h^{[l]}_{\kappa,T}$ of any LSTM layer $l$ at the last time step has often been used for representing the overall sequence at the output of the layer (e.g. see [18]). Based on this fact, the whole training set at the output of the $l$th layer is represented using the data matrix

$$Z^{[l]} = [z^{[l]}_{1}, \dots, z^{[l]}_{N}] \in \mathbb{R}^{H^{[l]} \times N},$$

where for simplicity we set $z^{[l]}_{\kappa} = h^{[l]}_{\kappa,T}$. Given $Z^{[l]}$, the sample covariance matrix associated with the responses of the $l$th layer can be computed using

$$S^{[l]} = \frac{1}{N} \sum_{\kappa=1}^{N} \big(z^{[l]}_{\kappa} - m^{[l]}\big)\big(z^{[l]}_{\kappa} - m^{[l]}\big)^{T}, \tag{9}$$

where $m^{[l]} = \frac{1}{N} \sum_{\kappa=1}^{N} z^{[l]}_{\kappa}$ is the sample mean vector. The above matrix is symmetric positive semidefinite and thus has real nonnegative eigenvalues, which can be efficiently computed using appropriate techniques [19]. Sorting the eigenvalues of $S^{[l]}$ into descending order and normalizing them to sum to one, we can represent them as

$$\lambda^{[l]}_{1} \geq \lambda^{[l]}_{2} \geq \dots \geq \lambda^{[l]}_{H^{[l]}} \geq 0,$$

where $\sum_{i=1}^{H^{[l]}} \lambda^{[l]}_{i} = 1$. As explained in [11], the set of eigenvalues provides insight into the correlation of the responses produced by the different units of the layer. An eigenvalue close to zero implies that the variables along the corresponding principal component of $S^{[l]}$ are linearly dependent. Therefore, the situation where all the energy in the output of a layer (represented by its hidden state vectors) is accumulated in only a small fraction of the eigenvalues indicates that there are many redundant units in this layer. Based on the above analysis, we proceed to express the layer pruning rate with respect to the derived eigenvalues. To this end, two sets of variables (among them $\zeta$) are defined from the eigenvalues and a parameter $\alpha \in [0, 1]$, which defines the amount of energy to keep at the output of a layer and is thus closely related to the layer's pruning rate. The pruning rate $\theta^{[l]}$ of the $l$th layer is then computed from these variables for a given $\alpha$. Thus, our goal is now to identify $\alpha$ by solving a single-variable optimization problem. Because $\alpha$ is bounded in $[0, 1]$, this problem can be efficiently solved using an appropriate iterative method.
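The idea of turning eigenvalue energy into per-layer pruning rates can be sketched as follows. This is only one plausible instantiation, written under assumptions that are not confirmed by the text: each layer keeps the smallest number of leading eigenvalues whose cumulative normalized energy reaches $\alpha$, the layer pruning rate is the fraction of remaining units, and $\alpha$ is found by bisection so that the overall pruned fraction does not exceed the target $\theta$. The function names are hypothetical.

```python
import numpy as np

def normalized_eigenvalues(Z):
    """Z: (H, N) last-step hidden states of one layer.
    Returns the covariance eigenvalues in descending order, scaled to sum to one."""
    S = np.cov(Z, bias=True)               # (H, H), 1/N normalization
    lam = np.linalg.eigvalsh(S)[::-1]      # eigvalsh returns ascending order
    lam = np.clip(lam, 0.0, None)          # remove tiny negative round-off
    return lam / lam.sum()

def layer_pruning_rate(lam, alpha):
    """Assumed rule: keep the fewest leading eigenvalues with energy >= alpha."""
    cum = np.cumsum(lam)
    kept = min(int(np.searchsorted(cum, alpha)) + 1, lam.size)
    return 1.0 - kept / lam.size

def solve_alpha(lams, theta, iters=50):
    """Bisection on alpha in [0, 1]; the overall pruned fraction is
    nonincreasing in alpha, so the search is well defined."""
    sizes = np.array([lam.size for lam in lams])
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        alpha = 0.5 * (lo + hi)
        rates = np.array([layer_pruning_rate(lam, alpha) for lam in lams])
        overall = (rates * sizes).sum() / sizes.sum()
        if overall > theta:
            lo = alpha     # pruning too much: keep more energy
        else:
            hi = alpha
    rates = np.array([layer_pruning_rate(lam, hi) for lam in lams])
    return hi, rates
```

In this sketch a layer whose energy is concentrated in a few eigenvalues receives a high pruning rate, while a layer with a flat spectrum is left largely intact, which matches the redundancy argument made above.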

C. Minibatch computation of sample covariance matrix
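The computation of the sample covariance matrix for each LSTM layer requires a large amount of memory for retaining the layer's output across all steps of an epoch, and may even be infeasible when processing large-scale datasets on devices with limited computational resources. To this end, we propose a minibatch algorithm for the computation of this matrix, as explained in the following. For simplicity of illustration, let us consider the case that the dataset is split into two partitions, one consisting of the $\bar{N}$ sequences processed so far and the other being the new minibatch of $\tilde{N}$ sequences, i.e. $N = \bar{N} + \tilde{N}$. Dropping the superscript layer index $[l]$ for simplicity, the data matrix at the output of any layer can then be represented as

$$Z = [\bar{Z}, \tilde{Z}] = [z_{1}, \dots, z_{\bar{N}}, z_{\bar{N}+1}, \dots, z_{N}],$$

where the block matrices $\bar{Z}$, $\tilde{Z}$ contain the hidden state vectors corresponding to the already processed sequences and the minibatch of new sequences, respectively. The sample mean vector can then be written as

$$m = \frac{1}{N}\big(\bar{N}\bar{m} + \tilde{N}\tilde{m}\big),$$

where $\bar{m} = \frac{1}{\bar{N}} \sum_{\kappa=1}^{\bar{N}} z_{\kappa}$ and $\tilde{m} = \frac{1}{\tilde{N}} \sum_{\kappa=\bar{N}+1}^{N} z_{\kappa}$. On the other hand, the sample covariance matrix (9) can be decomposed as

$$S = \Sigma - m m^{T},$$

where $\Sigma = \frac{1}{N} \sum_{\kappa=1}^{N} z_{\kappa} z_{\kappa}^{T}$. The latter can be further expressed as

$$\Sigma = \frac{1}{N}\big(\bar{N}\bar{\Sigma} + \tilde{N}\tilde{\Sigma}\big),$$

where $\bar{\Sigma} = \frac{1}{\bar{N}} \bar{Z}\bar{Z}^{T}$ and $\tilde{\Sigma} = \frac{1}{\tilde{N}} \tilde{Z}\tilde{Z}^{T}$. Hence, only $\bar{N}$, $\bar{m}$ and $\bar{\Sigma}$ need to be retained between minibatches.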
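The minibatch update described above can be sketched in a few lines of NumPy. The class name `RunningCovariance` and its interface are hypothetical; the sketch assumes the 1/N normalization of the covariance, so that it equals the second-moment matrix minus the outer product of the mean.

```python
import numpy as np

class RunningCovariance:
    """Accumulates the sample mean and second-moment matrix over minibatches,
    so the full data matrix never has to be held in memory."""

    def __init__(self, dim):
        self.n = 0                           # sequences processed so far
        self.mean = np.zeros(dim)            # running sample mean
        self.second = np.zeros((dim, dim))   # running (1/n) * sum of z z^T

    def update(self, Z_new):
        """Z_new: (dim, n_new) hidden state vectors of the new minibatch."""
        n_new = Z_new.shape[1]
        n_tot = self.n + n_new
        m_new = Z_new.mean(axis=1)
        sigma_new = Z_new @ Z_new.T / n_new
        # Weighted combination of old and new statistics.
        self.mean = (self.n * self.mean + n_new * m_new) / n_tot
        self.second = (self.n * self.second + n_new * sigma_new) / n_tot
        self.n = n_tot

    def covariance(self):
        """Sample covariance (1/N normalization) of everything seen so far."""
        return self.second - np.outer(self.mean, self.mean)
```

A typical use is to call `update` once per training minibatch and `covariance` whenever pruning rates are recomputed. Subtracting the outer product of the mean can lose precision when the mean is large relative to the spread of the data, so a centered update may be preferable in that regime.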

D. LSTM unit importance estimation and pruning
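Without loss of generality, we examine the unit selection and pruning procedure for a popular LSTM architecture, consisting of a bidirectional LSTM (BLSTM) [20] with layer indices $l = (1,1), (1,2)$ for the forward and backward LSTMs, respectively, and a regular LSTM with layer index $l = 2$, as shown in Fig. 1. For each layer, the weight matrices can be stacked to form the following block matrices

$$W^{[l]}_{x} = [W^{[l]}_{ix}, W^{[l]}_{fx}, W^{[l]}_{ox}, W^{[l]}_{gx}] \in \mathbb{R}^{H^{[l]} \times 4F^{[l]}}, \quad W^{[l]}_{h} = [W^{[l]}_{ih}, W^{[l]}_{fh}, W^{[l]}_{oh}, W^{[l]}_{gh}] \in \mathbb{R}^{H^{[l]} \times 4H^{[l]}},$$

where the blocks with subscripts $i$, $f$, $o$, $g$ are the input and recurrent weight matrices of the input gate, forget gate, output gate and input update, respectively, and

$$W^{[l]} = [W^{[l]}_{x}, W^{[l]}_{h}] = [w^{[l]}_{1}, \dots, w^{[l]}_{H^{[l]}}]^{T} \in \mathbb{R}^{H^{[l]} \times Q},$$

where $w^{[l]}_{j} \in \mathbb{R}^{Q}$ is the $j$th row of $W^{[l]}$, directly related to the $j$th unit of the $l$th layer, and $Q = 4(H^{[l]} + F^{[l]})$.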
Based on the above formulation, an importance score $\eta^{[l]}_{j}$ for each unit in the $l$th layer can be derived using a GM-based function, which has shown excellent performance in DCNN pruning [6], [7],

$$\eta^{[l]}_{j} = \sum_{i=1}^{H^{[l]}} \big\| w^{[l]}_{j} - w^{[l]}_{i} \big\|_{2}.$$

The value $\eta^{[l]}_{j}$ quantifies the dissimilarity between the $j$th unit and all other units in the layer. Therefore, a small $\eta^{[l]}_{j}$ denotes that on average this unit is highly correlated with the other units in the layer and thus can be discarded safely without harming the classification performance of the network.
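A GM-based score of this kind can be sketched as follows, assuming the sum-of-Euclidean-distances form used for filter pruning in [6]; the function names are hypothetical. Pairwise distances are obtained from the Gram matrix to avoid building an H x H x Q tensor.

```python
import numpy as np

def gm_importance(W):
    """W: (H, Q) matrix whose jth row collects all weights of unit j.
    Returns eta (H,), where eta[j] is the sum of Euclidean distances
    between row j and every row of W."""
    sq = (W ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (W @ W.T)   # squared distances
    dist = np.sqrt(np.maximum(d2, 0.0))                # guard against round-off
    return dist.sum(axis=1)

def units_to_prune(W, rate):
    """Indices of the floor(rate * H) units with the smallest score,
    i.e. the units that are most replaceable by the others."""
    n_prune = int(rate * W.shape[0])
    return np.argsort(gm_importance(W))[:n_prune]
```

For example, given three units with weight rows `[0, 0]`, `[0, 0.1]` and `[10, 10]`, the second row has the smallest total distance to the others and is selected first, whereas the outlying third row is kept.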
Proceeding to the definition of weight structures for structurally pruning the network, we learn ISSs for both LSTMs and BLSTMs by extending the approach presented in [8], as explained in the following. Let us suppose that the $k$th and $r$th hidden states of the forward and backward LSTM, respectively, have been selected to be removed, as shown in Fig. 1. Then, the $k$th row of $W^{[1,1]}_{x}$, $W^{[1,1]}_{h}$ and the $r$th row of $W^{[1,2]}_{x}$, $W^{[1,2]}_{h}$, contributing to the generation of these states, should be removed as well, as shown by the four white horizontal lines in the figure. Moreover, due to their correspondence to the connections receiving the above hidden states from the previous time step, the $k$th and $r$th column of each matrix block of $W^{[1,1]}_{h}$ and $W^{[1,2]}_{h}$, respectively, should also be removed, as shown by the eight vertical lines in the matrix blocks of the BLSTM in the figure. Finally, the $k$th and $r$th columns of the four matrix blocks in $W^{[2]}_{x}$ receiving the above states are also set to zero, as shown by the eight white vertical lines in the weight matrices of the second layer LSTM in Fig. 1.
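The row and column removals described above can be sketched for the simpler case of a unidirectional LSTM layer followed by a second LSTM layer. The helper names and the horizontal four-block layout of the weight matrices are assumptions made for this example, biases are omitted, and the next layer's columns are deleted here rather than set to zero.

```python
import numpy as np

def drop_block_columns(W, idx, n_blocks=4):
    """Remove column(s) idx from each of n_blocks equally wide column blocks."""
    blocks = np.split(W, n_blocks, axis=1)
    return np.concatenate([np.delete(b, idx, axis=1) for b in blocks], axis=1)

def prune_units(W_x, W_h, W_x_next, idx):
    """Remove the units in idx from a layer with H units and F inputs.

    W_x:      (H, 4F)   input weights of the pruned layer
    W_h:      (H, 4H)   recurrent weights of the pruned layer
    W_x_next: (H2, 4H)  input weights of the following layer
    """
    W_x = np.delete(W_x, idx, axis=0)                            # rows producing the units
    W_h = drop_block_columns(np.delete(W_h, idx, axis=0), idx)   # rows, then recurrent inputs
    W_x_next = drop_block_columns(W_x_next, idx)                 # inputs of the next layer
    return W_x, W_h, W_x_next
```

For a BLSTM, the same operation would be applied separately to the forward and backward weight matrices, with the backward unit indices offset in the next layer's input blocks if the two hidden states are concatenated; that offset depends on the implementation and is not specified here.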

1) Penn Treebank (PTB): This is one of the most widely used datasets for evaluating the performance of statistical language models [12]. It consists of 1086k tokens in ASCII format and 10k classes (i.e. unique tokens). It is partitioned into training, validation, and testing sets with 930k, 74k and 82k tokens, respectively.
2) YouTube-8M (YT8M): The large-scale YT8M video dataset is utilized to evaluate the proposed approach for the task of audiovisual concept detection [13]. This dataset consists of 3862 classes (semantic concepts) and 6134598 videos. Visual and audio feature vectors have been pre-extracted and are provided at frame level (1 frame per second) with dimensionality 1024 and 128, respectively.

B. Setup
The proposed method, called hereafter ISS-GM, is evaluated against ISS-GL [8] and ISS-L0 [9] on the PTB dataset. In this experiment, a two-layer stacked LSTM model [21] is utilized, and the training procedure described in [9] is followed. For the evaluation on YT8M, a variant of the BLSTM architecture presented in Section III-D is utilized to compare ISS-GM and ISS-GL (the software implementation of ISS-L0 is not provided in [9] and for this reason ISS-L0 is not included in this experiment). In more detail, the forward and backward layers of the BLSTM consist of 512 units each, while 1024 units are used for the LSTM layer. Each video is represented with a feature vector sequence of length T = 300. Both models are trained for 10 epochs using cross-entropy (CE) loss with minibatch SGD, a batch size of 256, and an exponential learning rate schedule with an initial learning rate of 0.0002 and a learning rate decay of 0.95 at every epoch; pruning is applied every 200 training steps.
The performance evaluation on the PTB and YT8M datasets is performed using the per-word perplexity (PPL) and the global average precision at 20 (GAP@20) [13], respectively. ISS-GM is implemented in PyTorch and TensorFlow for the evaluation on PTB and YT8M, respectively. For the ISS-GL method, the TensorFlow code provided in [8] is adapted for the YT8M experiments. The evaluation is performed on an Intel i7-3770K PC with 32 GB RAM, Windows 10, and an Nvidia GeForce GPU (GTX 1080 Ti).
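For reference, the two evaluation measures can be sketched as follows. Per-word perplexity is taken as the exponential of the average negative log-likelihood of the target words. For GAP@20 the sketch assumes the commonly used YT8M definition [13]: the top 20 predictions of every video are pooled, sorted by confidence, and the precision at each correct prediction is averaged over the total number of ground-truth positives. Function names are hypothetical and the sketch is not the official evaluation code.

```python
import numpy as np

def per_word_perplexity(log_probs):
    """log_probs: natural-log probabilities assigned to the target words."""
    return float(np.exp(-np.mean(log_probs)))

def gap_at_k(scores, labels, k=20):
    """scores, labels: (num_videos, num_classes); labels are 0/1."""
    top = np.argsort(-scores, axis=1)[:, :k]          # top-k classes per video
    rows = np.arange(scores.shape[0])[:, None]
    s = scores[rows, top].ravel()                     # pooled confidences
    y = labels[rows, top].ravel()                     # pooled correctness flags
    y = y[np.argsort(-s, kind="stable")]              # sort by confidence
    precision = np.cumsum(y) / np.arange(1, y.size + 1)
    # Assumed normalization: all ground-truth positives in the evaluation set.
    return float((precision * y).sum() / labels.sum())
```

Lower PPL and higher GAP@20 indicate better performance.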

C. Results
The experimental results in terms of PPL on the PTB dataset are shown in Table I. Table II depicts GAP@20 rates and training times in hours per epoch (T_tr) for the evaluation on the YT8M dataset with pruning rates of 30% and 70%. From the obtained results we conclude the following: i) The proposed ISS-GM achieves the best performance in all experiments. More specifically, on the PTB dataset a small but significant PPL gain is obtained using ISS-GM (considering that ISS-L0 is the previous state-of-the-art approach), while on the YT8M dataset a GAP@20 improvement of approximately 1% is attained over ISS-GL for both the 30% and 70% pruning rates. ii) ISS-GM exhibits a high degree of robustness against large pruning rates, making it suitable for compressing deep networks and allowing their deployment in mobile and other resource-constrained environments. For instance, a performance drop of only 0.21% and 1.23% is observed on the YT8M dataset for the 30% and 70% pruning rates, respectively. iii) Concerning training times, we observe that ISS-GM is approximately two times slower than ISS-GL in the YT8M experiment, mainly because ISS-GM computes the eigenvalues of the covariance matrix for each layer every time the pruning procedure is applied. However, considering that the training is performed off-line, this time overhead is considered insignificant.

V. CONCLUSION
In this paper, a new LSTM structured pruning approach was proposed that utilizes the sample covariance matrix of each layer's responses and a GM-based criterion to automatically derive pruning rates at layer level and compress the network, making it more suitable for deployment in mobile or other resource-constrained environments. The proposed approach was evaluated on two datasets for the tasks of word-level prediction in text and concept detection in audiovisual sequences, providing competitive performance at high pruning rates.