Multi-Task Learning at the Mobile Edge: An Effective Way to Combine Traffic Classification and Prediction

Mobile traffic classification and prediction are key tasks for network optimization. Most works in this area present two main drawbacks. First, they treat the two tasks separately, thus requiring high computational capabilities. Second, they perform data mining on the information collected from the data plane, which is unsuitable for the mobile edge. To bridge this gap, this paper tailors a Multi-Task Learning model running directly at the edge of the network to anticipate information on the type of traffic to be served and the resource allocation pattern requested by each service during its execution. Our study exploits data mining on the control channel of an operative mobile network to also reduce storage and monitoring processing. Different configurations of neural networks, which adopt autoencoders (i.e., the Undercomplete Autoencoder or the Sequence-to-Sequence Autoencoder) as key building blocks of the proposed Multi-Task Learning methodology for common feature representations, are investigated to evaluate the impact of the observation window of traffic profiles on classification accuracy, prediction loss, complexity, and convergence. The comparison with conventional single-task learning approaches, which do not use autoencoders and tackle the classification and prediction tasks separately, clearly demonstrates the effectiveness of the proposed Multi-Task Learning approach under different system configurations.

Through these algorithms, a system can scrutinize data and deduce knowledge: hidden patterns in the training data are identified and used to analyze unknown information and drive the execution of a given task (typically classification, prediction, or clustering) [1]. To improve these capabilities, deep learning further enables the mining of valuable information from data coming from heterogeneous sources and automatically unveils hidden correlations that would be too complex for human experts to extract [2]. Recently, ML-based solutions have been applied to the mobile networking domain [3], where the growing diversity and complexity of mobile network architectures have made the monitoring and management of the multitude of network elements intractable [4], [5]. At the same time, networking researchers have recognized the importance of deep learning and its ability to solve specific problems in current and future generations of mobile systems [2], [6], [7]. In line with this emerging research trend, in this paper we investigate the potential of deep learning for mobile traffic classification and prediction, which are key tasks for network optimization. In fact, the envisaged architecture of the fifth generation (5G) of mobile broadband systems will integrate new technology components (e.g., massive MIMO, mm-Wave communication, network slicing, vehicular networks, more and broader frequency bands), a higher variety of devices (e.g., smartphones, sensors, and different types of machines), and a larger number of services (typical broadband services, as well as advanced applications such as extended reality and automated driving) with tighter latency requirements, so that resource allocation is expected to reach unprecedented complexity [8]–[10]. In this context, network optimization frameworks may be supported by deep learning algorithms, which, when properly tailored, may anticipate information on: i) the type of traffic to be served, e.g.
its main characteristics in terms of bandwidth and latency requirements (i.e., traffic classification), and ii) the resource allocation pattern requested by each service throughout its duration (i.e., traffic prediction).
Most of the literature in this field treats traffic classification and prediction separately [11]–[22] (please see Section II for further details). Instead, we propose a Multi-Task Learning (MTL) approach [23], which reduces the number of training samples to be learned by the two tasks and leads to performance improvements compared with learning them individually [24].
At the same time, it is important to remark that offloading the huge amount of data generated at the edge to the cloud is intractable in 5G scenarios, since it would cause severe network congestion.

0018-9545 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
Therefore, it is highly preferable that deep learning algorithms run at the edge of the network and give online support to optimization frameworks, so that decisions can be taken promptly and the proper management actions triggered (e.g., radio resource scheduling, cell selection, and sleep mode enabling, to name a few) [25]–[27]. Almost all the approaches presented in the current state of the art implement data mining on the huge amount of information collected at the network or application layers of the data plane. Differently, the proposed MTL model considers data belonging to the control plane, as recently investigated, and it is trained with information extracted from the Physical Downlink Control CHannel (PDCCH) of an operative mobile network in Spain. The rationale behind the choice of the control channel is twofold. First, the volume of control messages from the control plane is much smaller than the user traffic from the data plane (which may also be encrypted), leading to fast and efficient classification and prediction, which are still evaluated on the derived data plane information. Specifically, the classification task registers an accuracy of up to 99% and the prediction task ensures a Mean Square Error (MSE) lower than 10^-3. Second, the algorithm runs at the radio interface, which allows fast execution of the two tasks directly at the edge. In summary, the original contributions of this work are:
- Comparison with conventional single-task learning approaches for traffic classification and prediction, that do not use autoencoders and tackle classification and prediction tasks separately.
The remainder of the paper is as follows. In Section II we introduce the related work in this area and identify the gaps which we intend to fill with this paper. Section III is dedicated to the proposed MTL approach, including the design criteria and the data processing for training.
In Section IV we analyze and compare the performance achieved by the proposed MTL architectures and by the single-task models for traffic classification and prediction used as benchmarks. Finally, Section V concludes the paper and outlines future research activities.

II. STATE OF THE ART
As already anticipated in the Introduction, ML has recently been applied to the mobile networking domain [3]. Possible applications include radio access technology selection [34], malware detection [35], development of networked systems [36], energy saving [37], panoramic video streaming [38], and cloudlet activation for scalable Mobile Edge Computing [39]. Several approaches, based on Support Vector Machine and Random Forest algorithms, have also been conceived to identify applications or smartphone types starting from the observation of encrypted communication flows [40]–[43]. Nevertheless, mobile data are usually generated by heterogeneous sources, exhibit non-trivial spatio-temporal patterns, and often embrace high volumes of diverse information [44]. Flow characteristics are also prone to quickly becoming outdated and need to be frequently updated [45]. Under these complex and dynamic conditions, ML algorithms generally fail to automatically extract and use the key features describing the investigated flows [6]. On the contrary, deep learning methods have demonstrated the ability to overcome traditional ML approaches thanks to their native ability to successfully support traffic analysis and accurately characterize traffic dynamics [6], [8], [45]–[50]. Unfortunately, mobile networking and deep learning problems have mostly been explored independently, and only recently have crossovers between the two research areas emerged.
Reference deep learning solutions for traffic classification leverage Convolutional Neural Networks with one-dimensional [11]–[13] or two-dimensional [13], [14] convolutional layers, Stacked Autoencoders with five stacked layers [12]–[14], Multi-Layer Perceptrons (MLPs) with one [13] or two hidden layers [13], [14], and standard or hybrid Long Short-Term Memory (LSTM) combined with two-dimensional convolutional layers [13]. However, only [13] focuses on mobile networks. Among its other important investigations, the work in [13] also demonstrates how deep neural networks guarantee greater accuracy levels than conventional ML approaches in mobile networks. On the other hand, deep learning also outperforms baseline approaches for traffic prediction, including the conventional Auto Regressive Integrated Moving Average scheme [6]. Here, reference methodologies are based on densely connected Convolutional Neural Networks with two-dimensional convolutional layers [15] and LSTM [16]–[19], as well as on their combination [20]–[22], which can extract spatial and temporal correlations of data through the convolutional operation and LSTM memory cells, respectively. In this case, all of the reviewed contributions focus on mobile networks.
The analysis of the state of the art on deep learning strategies highlights that traffic classification and prediction are generally treated separately. In other words, classification and prediction are achieved by means of two separate single-task models. Unfortunately, this represents an important drawback because their parallel execution involves the training of different learning architectures, as well as an inevitable increase in computational requirements [24].
The MTL approach solves the aforementioned issue, while often reaching greater performance levels when compared with single-task approaches [24], [51]. Differently from the single-task scheme, MTL basically embraces a learning architecture that extracts common feature representations from the training data and jointly executes multiple, but related, tasks. Therefore, MTL emerges as a suitable solution for meeting the computational and memory constraints affecting mobile networks [6]. Valuable contributions in this direction are presented in [28], [29], where an MTL architecture is designed to implement multiple tasks related only to traffic classification. Unfortunately, they do not address traffic prediction and do not focus on mobile networks.
Another important consideration emerging from the scientific literature is that all the investigated contributions perform data mining on the messages exchanged over the data plane (i.e., traffic volume/load collected at the network or application layers, equipped with application labels for the classification task). Therefore, considering the huge amount of data handled by mobile systems, the reviewed methodologies cannot be applied to the control plane and require high computational and memory capabilities, thus becoming unfeasible for the mobile edge.
The goal of this paper is to adopt an MTL architecture at the edge of the network to jointly classify mobile services and forecast future traffic demands. Our study exploits data mining on the unencrypted control channel of an operative mobile network to properly characterize the mobile traffic at the radio interface, avoiding the collection of data plane information (i.e., traffic volume/load and application labels) and reducing storage and monitoring processing. Even though the data mining is performed on the control plane, the accuracy of the classification and prediction tasks is still evaluated on the derived data plane information. Interesting contributions in this direction address traffic pattern analysis and classification [30]–[32] and traffic prediction [33] through data mining performed on the PDCCH. The proposed solutions, however, are not based on the MTL approach.
In this work, we still pursue the idea that traffic classification and prediction at the radio interface can enable advanced Quality of Service and Quality of Experience enforcement policies based on a priori knowledge of application behaviors. Thus, network operators can configure and manage network resources in a more intelligent and effective manner thanks to the knowledge extracted by deep learning algorithms. Nevertheless, differently from the current state of the art, and to the best of our knowledge, we are the first to formulate a methodology that applies MTL to classify and predict mobile traffic at the mobile edge.
To conclude, Table I summarizes the goals and the methodologies followed by the scientific contributions reviewed in this section.

III. THE PROPOSED MULTI-TASK LEARNING APPROACH
The developed methodology originates from the consideration that any active session can be described, at the radio link level, through a traffic profile reporting the amount of data exchanged between the base station and the mobile terminal over time, simply referred to as the radio utilization pattern. Therefore, by observing such a profile during a time interval T, it is possible to classify the application type to which the investigated session belongs (task 1) and predict the radio utilization pattern that the session will experience in the upcoming time instants (task 2). This goal is achieved through an MTL architecture running directly at the edge of a mobile network (Fig. 1). Without loss of generality, the contribution focuses on the downlink communication. However, the whole approach can be applied to the uplink as well.
To facilitate the understanding of the notation adopted in what follows, a summary of symbols is reported in Table II. Following these initial considerations, the proposed MTL approach is grounded in the feature learning representation concept [24], according to which the features for a common representation of our input (i.e., traffic profiles) are extracted and jointly used to execute the two tasks (i.e., classification and prediction). In particular, the conceived methodology uses an autoencoder to obtain the common feature representations of the input data because it can directly accomplish this operation without requiring knowledge of the data distribution or the explicit identification of a particular structure [49]. The classification and prediction tasks are then executed through softmax and fully-connected layers, respectively. Accordingly, the autoencoder is the key building and enabling block of the proposed MTL methodology, effectively allowing the joint execution of the classification and prediction tasks.
As depicted in Fig. 1, the outcomes of the proposed scheme can be exploited to implement advanced methodologies for the management and optimization of mobile networks. Our approach is conceived to process data directly at the edge, so that the right actions may be triggered quickly and locally. Possible strategies that may benefit from the implementation of our architecture range from radio resource scheduling and admission control, mobility management, and energy saving mechanisms, to network slicing and the dynamic placement of virtualized functions, as well as the optimization of computing resources at both the edge and the core network (see Fig. 1). Note, however, that the rest of this Section focuses on the MTL approach and the reference dataset taken into account for training purposes. Any other considerations related to network optimization aspects remain out of the scope of this work and will be addressed in the future.

A. The Training Dataset
Since our approach is intended to work at the mobile edge, data exchanged through the radio interface are needed to train our model. An operator owning the mobile infrastructure can simply retrieve this information and use it for both the training and operating phases. In our case, however, we use the dataset created in our previous work [32], which consists of traffic traces containing the Downlink Control Information (DCI) messages carried within the PDCCH with a time granularity of 1 ms. This information is used by the eNodeB to communicate scheduling information to the connected mobile terminals. DCI messages are unencrypted and can be decoded by a specific hardware/software tool called Online Watcher for LTE (OWL) [52]. A key characteristic of the training dataset is that it is gathered from the control channel, which simplifies the monitoring system, assures fast data processing, and reduces the required storage capacity thanks to the limited volume of data.
The captured traces are generated by different applications running in a mobile terminal under our control and attached to an operative mobile network in Spain. Six different applications grouped in three categories have been tested: YouTube and Vimeo for video-streaming, Spotify and Google Music for audio-streaming, and Skype and WhatsApp Messenger for video-call. We selected these applications because, according to recent Ericsson [53] and Cisco [54] reports, they generate more than 80% of the mobile data traffic and require optimal resource management due to their strict quality requirements. The proposed approach, however, can be safely applied to other mobile network scenarios with a different set of applications and services, only requiring a new training procedure. Also, after an effective training, our methodology is extendable to any number of classes because it is general and not restricted to a specific use-case (see Section IV-E for more details).
Among the several parameters extracted from the DCI messages, we used the Transport Block Size (TBS), which specifies the length of the packet burst to be sent to/from the considered mobile terminal in the current time slot [55]. TBS values are then processed to generate the radio utilization patterns describing the amount of data exchanged between the base station and the mobile terminal over time, with a granularity of 1 s.
Formally, let r_D be the number of traffic sessions collected in a period of time equal to Δ. In this work, r_D = 11574 and Δ = 60 s. The distribution of the sessions among the considered application categories is reported in Fig. 2. The original training dataset contains a matrix D and a vector c of labels. In particular, the input matrix D describes the captured traffic profiles (also referred to as radio utilization patterns) of r_D different sessions over an amount of time equal to Δ. Thus, the matrix D has dimension r_D × Δ, where r_D and Δ are the number of rows (traffic sessions) and the number of columns (time instants) in D. The vector c of labels, with dimension r_D × 1, contains the application type of the monitored sessions. For example, given the i-th investigated session, d_{i,j} ∈ D is the amount of data delivered across the radio interface during the j-th time slot and c_i ∈ c is the label describing the application type of the i-th session. All the values stored in D are normalized within the range [0, 1] to accelerate the training convergence [56].
The training dataset has been conveniently pre-processed to be managed by our deep learning models. For the sake of clarity, the pre-processing procedure is depicted in Fig. 3. A new matrix M is generated from D, whose rows represent observation windows of duration T. The resulting matrix M has dimension r_D(Δ − T + 1) × T. The vector c is used to generate a new set of labels, namely l, describing the application type associated with each portion of the investigated session stored in M. The vector l has dimension r_D(Δ − T + 1) × 1. A set of new column vectors, namely m_{T+1}, m_{T+2}, and m_{T+3}, each with dimension r_D(Δ − T + 1) × 1, is generated from D to store the amount of data exchanged between the base station and the mobile terminal after the observation window T.
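The sliding-window construction of M, l, and the target vectors can be sketched as follows. This is an illustrative NumPy reimplementation, not the authors' code: the function name build_windows, the toy data, and the restriction to window positions that leave enough look-ahead samples for the targets are our own assumptions.

```python
import numpy as np

def build_windows(D, c, T, lookahead=1):
    """Slide a window of length T over each row (session) of D.

    Returns the window matrix M, the per-window label vector l, and the
    target vectors m_{T+1}, ..., m_{T+lookahead} (one value per window
    for each step after the observation window). Only window positions
    for which all 'lookahead' future samples exist are kept, a slight
    simplification of the r_D(Delta - T + 1) windows of the paper.
    """
    r_D, Delta = D.shape
    n_win = Delta - T - lookahead + 1          # windows per session
    M, l = [], []
    targets = [[] for _ in range(lookahead)]
    for i in range(r_D):
        for s in range(n_win):
            M.append(D[i, s:s + T])            # observation window
            l.append(c[i])                     # label inherited from the session
            for k in range(lookahead):
                targets[k].append(D[i, s + T + k])  # value after the window
    return np.array(M), np.array(l), [np.array(t) for t in targets]

# toy example: r_D = 2 sessions, Delta = 6 s, already normalized to [0, 1]
D = np.array([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
              [0.6, 0.5, 0.4, 0.3, 0.2, 0.1]])
c = np.array([0, 2])                           # application-type labels
M, l, (m_T1,) = build_windows(D, c, T=3, lookahead=1)
print(M.shape)   # (6, 3): 2 sessions x 3 window positions, window length 3
print(m_T1)      # the value right after each observation window
```

In practice the same construction is repeated for m_{T+2} and m_{T+3} by raising the lookahead, and the resulting rows are shuffled before the 80/20 training/validation split described next.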
Finally, 80% of M is used as the training set, while the remaining 20% is used as the validation set. The number of rows of the matrix M selected as training set, on which performance will be listed and evaluated, is simply denoted by r_{M,tr}.

B. Components of the Developed MTL Model

Fig. 4 shows the proposed MTL model, embracing three main components: autoencoder, classifier, and predictor. Each component presents specific input and output parameters.
1) The Autoencoder: It is a particular Artificial Neural Network (ANN) implementing two key functionalities. Given an input m_i = {m_{i,1}, ..., m_{i,T}}, that is, a row of the matrix M, the encoder generates the corresponding feature representation, namely h_i, which then allows the joint execution of the two tasks. Specifically, h_i ∈ H acts as a compression of the input data [49] and is referred to as the codeword in the next sections. The decoder, in turn, provides a reconstruction of the input data, namely m̂_i = {m̂_{i,1}, ..., m̂_{i,T}}, starting from this feature representation. The autoencoder uses the sigmoid activation function for the output layer and the Rectified Linear Unit (ReLU) for the other layers [6], with weights that are properly configured during the training phase.
This work investigates two different autoencoder schemes:
- the Undercomplete Autoencoder, leveraging regular densely-connected neural network layers based on the MLP [57]. In particular, the MLP is a fully-connected, feedforward neural network with low computational complexity.
- the Seq2Seq Autoencoder, which manages the encoder and decoder functionalities through LSTM [58]. The LSTM is a popular variant of Recurrent Neural Networks (RNNs) that can extract long-range temporal dependencies through input, forget, and output gates, while mitigating the vanishing and exploding gradient problems. This type of neural network is well suited to processing time series because the output of each memory cell may depend on the entire sequence of previous cell states [6], [13], [59]. Given the intrinsic temporal relations in mobile traffic data, an LSTM-based architecture appears to be the logical choice, at the cost of higher computational complexity.
To train the two types of autoencoder, the weights are iteratively updated in order to minimize the MSE loss function L_A, formally defined as [57], [60]:

L_A = (1 / (r_{M,tr} · T)) · Σ_{i=1..r_{M,tr}} Σ_{j=1..T} (m_{i,j} − m̂_{i,j})²   (1)

As shown in Fig. 4, the common feature representation h_i generated by the autoencoder is provided to both the classifier and the predictor for driving the classification and prediction tasks.
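As an illustration of the undercomplete scheme and of the reconstruction loss L_A, the following NumPy sketch runs a single forward pass of a hypothetical 8:5 MLP autoencoder with randomly initialized (i.e., untrained) weights; in the actual system the weights are learned via backpropagation in Keras, and biases and multiple hidden layers may be present.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

T, code_size, r_M_tr = 8, 5, 32                # an 8:5 configuration (toy batch)
M = rng.random((r_M_tr, T))                    # toy normalized windows in [0, 1]

# randomly initialized weights; in practice they are learned to minimize L_A
W_enc = rng.standard_normal((T, code_size)) * 0.1
W_dec = rng.standard_normal((code_size, T)) * 0.1

H = relu(M @ W_enc)                            # codewords h_i (ReLU hidden layer)
M_hat = sigmoid(H @ W_dec)                     # reconstruction (sigmoid output)

L_A = np.mean((M - M_hat) ** 2)                # the MSE loss of Eq. (1)
print(H.shape, float(L_A) >= 0.0)
```

The undercomplete structure is visible in the shapes: the codeword matrix H has fewer columns (5) than the input windows (8), forcing the network to learn a compressed representation.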
2) The Classifier: It maps the feature representation h_i to a learned label l̂_i describing the application type of the investigated session. To this end, it uses the softmax layer, based on the softmax activation function [6], working with a number of classes (i.e., the considered application types) equal to 3, even if our methodology is extendable to any number of classes.
The softmax layer of the classifier is configured by penalizing the MSE loss function L_C between the true label l_i associated with the input data m_i and the learned label l̂_i associated with the feature representation h_i:

L_C = (1 / r_{M,tr}) · Σ_{i=1..r_{M,tr}} (l_i − l̂_i)²   (2)

Once configured, the classifier accuracy A_C quantifies the percentage of correct classifications with respect to the total number of classifications [61]:

A_C = (number of correct classifications / total number of classifications) · 100   (3)

3) The Predictor: It predicts the amount of data that a given session is expected to exchange with the base station after the observation window T, that is: m̂_{i,T+1} stored in m̂_{T+1}, m̂_{i,T+2} stored in m̂_{T+2}, m̂_{i,T+3} stored in m̂_{T+3}, and so on. It makes use of a fully-connected layer with the ReLU activation function [6].
The predictor is configured in order to minimize the MSE loss function L_P, formulated for T + 1 s as [62]:

L_P = (1 / r_{M,tr}) · Σ_{i=1..r_{M,tr}} (m_{i,T+1} − m̂_{i,T+1})²   (4)

Of course, the prediction loss, which measures the difference between the true and the predicted amount of exchanged data, is expected to increase with the time distance between the latest value of the observed traffic profile and the predicted one.
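To make the interplay of the two task-specific heads concrete, the sketch below feeds hypothetical codewords h_i through an untrained softmax classification head and a ReLU prediction head and evaluates the quantities of Eqs. (2)-(4). The layer sizes, random weights, and one-hot labels are illustrative assumptions, not the paper's configuration; in the real model both heads share the encoder and are trained jointly.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))   # numerically stable
    return e / e.sum(axis=1, keepdims=True)

n, code_size, n_classes = 4, 5, 3
H = rng.random((n, code_size))                     # codewords from the encoder

W_cls = rng.standard_normal((code_size, n_classes))
W_prd = rng.standard_normal((code_size, 1))

l_hat = softmax(H @ W_cls)                         # class probabilities (softmax head)
m_hat_T1 = relu(H @ W_prd).ravel()                 # predicted load at T + 1 (ReLU head)

l_true = np.eye(n_classes)[[0, 1, 2, 0]]           # one-hot true labels (toy)
m_true = rng.random(n)                             # true load values (toy)

L_C = np.mean((l_true - l_hat) ** 2)               # Eq. (2)
L_P = np.mean((m_true - m_hat_T1) ** 2)            # Eq. (4)
A_C = 100.0 * np.mean(l_hat.argmax(1) == l_true.argmax(1))  # Eq. (3)
print(l_hat.sum(axis=1))                           # each row sums to 1
```

Note how both heads consume the same codeword matrix H: this sharing is what distinguishes the MTL model from the single-task baselines, which each learn their own representation.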

IV. PERFORMANCE EVALUATION
The conceived MTL architectures have been implemented in Keras, a high-level neural network API written in Python running on top of TensorFlow [63], and simulations have been executed on an Intel Core i7 CPU with 16 GB of RAM. Moreover, different configurations of neural networks are investigated to quantify the impact of the observation window, T, on the classification accuracy, A_C, and the prediction loss, L_P. Once the best solutions are selected, we present a complete analysis of the classification and prediction performance, together with a complexity and convergence analysis. Describing the ratio between the size of the input layer and the size of the hidden layers in the form X:Y for neural networks with one hidden layer and X:Y:Z for neural networks with two hidden layers, the investigated configurations include 8:5, 8:6, 8:8, and 8:5:3. The observation window T is chosen in the range from 5 to 20 s. Regarding the autoencoder, the size of the codeword is also set to different values (please see Tables III and IV for further details).
The training phase for all the components belonging to the designed MTL architectures is carried out over 200 epochs. The Adam optimizer is used to iteratively update the network weights based on the training data [64].
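For completeness, a minimal NumPy sketch of a single Adam update step [64] is given below; the paper relies on the Keras implementation, and the toy quadratic objective is our own example chosen only to show the update converging.

```python
import numpy as np

def adam_step(w, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient and
    its square, with bias correction, drive an adaptive step size."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad               # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2          # second-moment estimate
    m_hat = m / (1 - b1 ** t)                  # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)

# toy use: minimize f(w) = (w - 3)^2, whose gradient is 2 (w - 3)
w = np.array(0.0)
state = (np.zeros_like(w), np.zeros_like(w), 0)
for _ in range(2000):
    w, state = adam_step(w, 2 * (w - 3.0), state, lr=0.05)
print(float(w))   # converges toward 3
```

In the actual system the gradients come from backpropagating L_A, L_C, and L_P through the network rather than from a closed-form derivative.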
To provide further insight, a comparison with baseline single-task learning architectures, which do not use the autoencoder and tackle traffic classification and prediction separately, is presented as well. In particular, the reference single-task architectures selected for the cross-comparison are based on LSTM because, as stated in Section III-B, this type of neural network is suitable for processing time series. Furthermore, given the wide adoption of LSTM in state-of-the-art deep learning models (e.g., [13], [16]–[19]), an LSTM-based architecture appears the logical choice for the comparison (single-task learning) schemes, as well as for the MTL approach. Working with the same training dataset and adopting the same set of symbols, the single-task classifier and the single-task predictor are depicted in Fig. 5.

A. Selection of Suitable MTL Architectures
Autoencoder loss, L A , classification accuracy, A C , and prediction loss, L P , achieved for all the configurations of the designed MTL architectures are reported in Tables III and IV. The same performance indexes obtained with single-task approaches are reported in Table V. For both MTL and single-task architectures and for each observation window T , these results are used to select the configurations that ensure the best performance.
Regarding the conceived MTL architectures, the analysis involves multiple objectives, namely the maximization of A_C and the minimization of L_P. To this end, a performance metric, P_MTL, is defined as a weighted linear sum of the results obtained for each task:

P_MTL = α · A_C + (1 − α) · L̃_P   (5)

where the weight α may assume an arbitrary value from 0 to 1 [65], [66]. Since the higher the loss, the lower the performance, min-max normalization is performed on L_P to properly combine the two metrics [61]:

L̃_P = 100 · (L_Pmax − L_P) / (L_Pmax − L_Pmin)

where L_Pmin and L_Pmax are the minimum and maximum prediction losses reported in Tables III, IV, and V, so that the worst performance (L_P = L_Pmax) corresponds to a normalized value of 0 and the best performance (L_P = L_Pmin) to a normalized value of 100.
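The selection rule of Eq. (5) can be sketched as follows. The configuration names match those used in the paper, but the accuracy and loss values are illustrative placeholders, not the measured results of Tables III-V.

```python
def p_mtl(A_C, L_P, L_P_min, L_P_max, alpha=0.8):
    """Weighted multi-objective score of Eq. (5): classification accuracy
    (in %) combined with a min-max-normalized prediction loss, where
    L_P = L_P_max maps to 0 and L_P = L_P_min maps to 100."""
    L_P_norm = 100.0 * (L_P_max - L_P) / (L_P_max - L_P_min)
    return alpha * A_C + (1.0 - alpha) * L_P_norm

# toy configurations: (accuracy %, prediction loss); values are illustrative
configs = {"8:5": (95.0, 2e-3), "8:6": (97.0, 5e-3), "8:8": (99.0, 9e-3)}
losses = [L for _, L in configs.values()]
L_min, L_max = min(losses), max(losses)

scores = {name: p_mtl(a, L, L_min, L_max) for name, (a, L) in configs.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

With α = 0.8, the score weights accuracy four times as heavily as the normalized loss, so a configuration with slightly lower accuracy can still win if its prediction loss is markedly smaller, which is exactly the trade-off the selection in Tables III and IV encodes.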
Figs. 6 and 7 show the performance of the MTL configurations that register the highest P_MTL metric as a function of α, for MTL-U and MTL-S2S, respectively. These figures help to identify the suitable values of α to be used for the selection of the best MTL configurations. The reported curves demonstrate that α ≤ 0.5 and α = 1 cannot be used for this purpose. In fact, if α ≤ 0.5, the multi-objective metric P_MTL suggests selecting configurations that register low classification accuracy. On the contrary, when α = 1, the multi-objective metric P_MTL suggests selecting configurations that register higher prediction losses, especially when T increases. Other values of α provide similar outcomes. Thus, the rest of this paper considers the best configurations of the proposed MTL architectures selected with α = 0.8. They are highlighted in Tables III and IV. Regarding the single-task approaches, the best configurations are simply selected as those that offer the best performance for each T. Also in this case, they are highlighted in Table V.
In general, we note that the performance of both MTL and single-task approaches improve when T increases because more data are used to make decisions. Focusing the attention on the proposed MTL model, there is not a precise relationship between MTL performance and codeword size: while MTL-S2S always achieves the best performance with the biggest codeword size, the same consideration cannot be done for MTL-U.  Fig. 8 depicts the classification accuracy of the selected architectures as a function of T . As already anticipated, the performance always improves when T increases because all the learning architectures can use a higher number of training data to perform session classification. It is also evident that the single-task approach registers lower accuracy levels, ranging from 92.52% to 97.73%. On the contrary, better results are registered by the proposed MTL architectures: in this case, it is possible to reach an accuracy level up to 99.64%. The conducted study also demonstrates that MTL-S2S achieves higher classification accuracy for each T .  Classification performance can be further investigated through the F-score [61] index. Theoretically, the higher the F-score value, the better the ability of the classifier to make proper decisions. The results summarized in Table VI generally confirm what already discussed. In fact, F-score improves when T increases, and the single-task approach always registers the lowest F-score values. Regarding the proposed MTL architectures, an exception is reported when T = 10 s: in that case, even if MTL-U registers the highest F-score, it achieves a lower classification accuracy than MTL-S2S because of a higher error rate for a specific application type (see the study on the confusion matrices proposed below).

B. Classification Performance
To analyze which classes are mismatched in the classification process, the confusion matrices are provided in Fig. 9 for each T . In general, both MTL architectures tend to misclassify video-streaming sessions as audio-streaming ones. Nonetheless, this misclassification rate decreases as T increases. When T = 5 s, in fact, 14% of video-streaming sessions are wrongly classified as audio-streaming by MTL-U, and 13% by both MTL-S2S and the single-task classifier. These percentages drop to 2% for MTL-U, 1% for MTL-S2S, and 4% for the single-task classifier when T = 20 s. Even in this case, the proposed MTL architectures always provide better results than the single-task approach. Going into more detail, MTL-S2S presents the highest percentage of correctly classified sessions for each T , except for T = 10 s. When T = 10 s, as anticipated by the F-score analysis, MTL-U reports a lower A_C than MTL-S2S, yet a higher F-score. The confusion matrices explain why: the percentages of video-streaming sessions correctly classified by MTL-U (see Fig. 9(b), on the left) and MTL-S2S (see Fig. 9(b), in the middle) are 94% and 92%, respectively.

C. Prediction Performance
Fig. 10 shows the prediction loss registered at the time instants T + 1 s, T + 2 s, and T + 3 s. First of all, it is evident that the curves for T + 3 s are incomplete: in this case, the training process always fails when T = 5 s. As expected, the prediction loss decreases with the observation window T , because the learning architectures have more training data on which to base the prediction. Regarding the predictions performed at both T + 1 s and T + 2 s, MTL-S2S and MTL-U always register the best and the worst performance levels, respectively. On the other hand, when the prediction is made at T + 3 s, the single-task approach slightly exceeds the prediction losses registered by MTL-U.
In summary, MTL-S2S always guarantees the lowest prediction losses, at the cost of higher complexity (see Section IV-D). MTL-U registers the worst performance when the prediction is made at T + 1 s and T + 2 s. The single-task approach exhibits intermediate performance levels at T + 1 s and T + 2 s, but registers the highest prediction losses at T + 3 s. The obtained results also confirm the ability of the LSTM, exploited in both MTL-S2S and the single-task scheme, to suitably process time series by taking into account the temporal sequence of TBS values.
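As an illustration of how the per-horizon prediction losses of Fig. 10 can be computed, the sketch below evaluates a mean squared error between predicted and observed TBS values separately at each future time instant. This assumes an MSE-style loss (the paper's exact loss definition may differ), and all array names and TBS values are hypothetical.

```python
import numpy as np

def horizon_losses(y_true, y_pred):
    """Mean squared prediction loss per future time instant.

    y_true, y_pred: arrays of shape (num_sessions, num_horizons),
    where column h holds the TBS value at T + (h+1) s.
    Returns one MSE value per horizon (T+1 s, T+2 s, ...).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return ((y_true - y_pred) ** 2).mean(axis=0)

# Hypothetical TBS values (in bits) for 3 sessions over horizons T+1..T+3
actual = [[1000, 1100, 1200],
          [ 900,  950, 1000],
          [1200, 1150, 1300]]
predicted = [[1010, 1080, 1150],
             [ 910,  960, 1020],
             [1190, 1160, 1250]]
print(horizon_losses(actual, predicted))
```

A loss that grows with the horizon index, as in this toy example, mirrors the behaviour reported above: predictions further in the future (T + 3 s) are harder than those at T + 1 s.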

D. Complexity and Convergence Analysis
The complexity of the selected learning architectures is evaluated by measuring the number of trainable parameters: the higher the number of parameters, the higher the complexity level. Results are summarized in Table VII. Firstly, it is evident that the complexity of all the investigated learning architectures increases as the observation window T grows. MTL-S2S always registers the highest complexity. The single-task approach, based on LSTM, also exhibits high complexity because of the structure of its LSTM cells. On the contrary, MTL-U guarantees the lowest complexity for each observation window T .
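The gap between LSTM-based and fully connected architectures in Table VII follows directly from the standard parameter-count formulas: an LSTM cell carries four gates, each with its own input weights, recurrent weights, and bias, so its count grows roughly fourfold compared with a dense layer of the same width. A minimal sketch, with illustrative layer sizes not taken from the paper:

```python
def dense_params(n_in, n_out):
    # Weight matrix plus bias vector of one fully connected layer.
    return n_in * n_out + n_out

def lstm_params(n_in, n_hidden):
    # Four gates (input, forget, cell, output), each with input weights,
    # recurrent weights, and a bias: 4 * (n_hidden*(n_in + n_hidden) + n_hidden).
    return 4 * (n_hidden * (n_in + n_hidden) + n_hidden)

# Illustrative sizes: 1 input feature (e.g., a TBS sample), 64 hidden units.
print(dense_params(1, 64))   # 128
print(lstm_params(1, 64))    # 16896
```

With these hypothetical sizes, the LSTM layer needs over a hundred times the parameters of the dense layer, which is consistent with the high complexity attributed above to the LSTM-based single-task scheme and to MTL-S2S.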
The convergence analysis evaluates the performance of the investigated learning architectures (including autoencoder loss, classification accuracy, and prediction loss) as a function of the number of epochs considered during the training phase. Fig. 11 shows the autoencoder loss versus the number of epochs: MTL-S2S has the slowest convergence, while providing the lowest autoencoder loss. Fig. 12 depicts the classification accuracy versus the number of epochs: while the proposed MTL architectures reach similar performance, the single-task approach always registers the longest convergence time. Finally, Fig. 13 shows the prediction loss versus the number of epochs: here, MTL-S2S achieves the lowest prediction losses, at the cost of a slower convergence.
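Convergence time, as discussed above, can be quantified for instance as the first epoch after which the training loss stops improving by more than a small tolerance, in the spirit of an early-stopping criterion. The sketch below is one such convention (not necessarily the one used in the paper), and the per-epoch loss sequences are made up for illustration.

```python
def convergence_epoch(losses, patience=3, min_delta=1e-3):
    """Return the 1-based epoch after which the loss has not improved
    by at least min_delta for `patience` consecutive epochs."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses, start=1):
        if best - loss > min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch - patience
    return len(losses)  # never plateaued within the given epochs

# Hypothetical per-epoch losses for a fast and a slow learner
fast = [1.0, 0.4, 0.2, 0.2, 0.2, 0.2, 0.2]
slow = [1.0, 0.9, 0.8, 0.7, 0.6, 0.6, 0.6, 0.6]
print(convergence_epoch(fast), convergence_epoch(slow))
```

Under this criterion the "fast" curve converges at epoch 3 and the "slow" one at epoch 5, mimicking the trade-off noted above, where MTL-S2S converges later but to a better loss.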

E. A Further Evaluation With More Classes
As described in Section III-A, the proposed MTL approach can be applied to scenarios with a higher number of classes. To provide further insight, the training dataset considered in this work allowed us to evaluate the performance of the proposed methodology on the six available application classes: YouTube, Vimeo, Spotify, Google Music, Skype, and WhatsApp Messenger. Specifically, differently from the original investigation, applications belonging to the same service category have been treated as separate classes. We tested the MTL-S2S configurations that achieved the best performance in the analysis limited to the three service categories. Figs. 14 and 15 depict the classification accuracy and the prediction loss of the MTL-S2S and single-task schemes with six classes as a function of T . The obtained results further confirm that the proposed MTL approach outperforms the baseline single-task scheme also in scenarios with a higher number of classes. Compared with the previous case, however, accuracy levels are lower because several applications (especially those of audio-streaming type) exhibit very similar traffic patterns, and distinguishing them becomes increasingly difficult as the observation window T decreases.
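The relation between the six application classes and the three service categories of the original investigation can be expressed as a simple label mapping, sketched below. The video- and audio-streaming category names follow the service types discussed in the text; the label for Skype and WhatsApp Messenger is our assumption, as the paper does not name it in this excerpt.

```python
# Assumed application-to-category mapping; the "real-time-communication"
# label is a hypothetical placeholder for the Skype/WhatsApp category.
APP_TO_CATEGORY = {
    "YouTube": "video-streaming",
    "Vimeo": "video-streaming",
    "Spotify": "audio-streaming",
    "Google Music": "audio-streaming",
    "Skype": "real-time-communication",
    "WhatsApp Messenger": "real-time-communication",
}

def to_category(app_label):
    """Collapse a six-class application label into its service category."""
    return APP_TO_CATEGORY[app_label]

print(sorted(set(APP_TO_CATEGORY.values())))
```

Collapsing fine-grained labels this way only merges classes, which is one reason the six-class task is harder than the three-category one: applications within a category (e.g., Spotify vs. Google Music) must now be told apart.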

V. CONCLUSION
This work has tailored a Multi-Task Learning model for traffic classification and prediction at the mobile edge, which leverages data mining from the Physical Downlink Control Channel and two types of autoencoders (i.e., the Undercomplete Autoencoder and the Sequence to Sequence Autoencoder) exploited as key building blocks for obtaining common feature representations. Different configurations of neural networks have been trained with a real dataset collected from an operative mobile network in Spain. Moreover, a wide set of simulations has investigated the performance of the developed approach in terms of classification accuracy, prediction loss, complexity, and convergence. A cross-comparison with respect to conventional single-task learning schemes, which do not use autoencoders and are generally investigated in the current state of the art for traffic classification and prediction, has also demonstrated that: i) the Multi-Task Learning architectures, leveraging the autoencoders, always guarantee higher performance than the single-task learning approach; ii) the Multi-Task Learning architecture based on the Sequence to Sequence Autoencoder always achieves the highest classification accuracy and the lowest prediction losses, at the cost of higher complexity and convergence time. Further research activities will exploit the conceived methodology to design advanced techniques for mobile network optimization, ranging from radio resource scheduling and admission control, mobility management, and energy saving mechanisms, to network slicing and dynamic placement of virtualized functions.