Mobile Traffic Classification Through Physical Control Channel Fingerprinting: A Deep Learning Approach

The automatic classification of applications and services is an invaluable feature for new generation mobile networks. Here, we propose and validate algorithms to perform this task, at runtime, from the raw physical control channel of an operative mobile network, without having to decode and/or decrypt the transmitted flows. Towards this, we decode Downlink Control Information (DCI) messages carried within the LTE Physical Downlink Control CHannel (PDCCH). DCI messages are sent by the radio cell in clear text and, in this article, are utilized to classify the applications and services executed at the connected mobile terminals. Two datasets are collected through a large measurement campaign: one labeled, used to train the classification algorithms, and one unlabeled, collected from four radio cells in the metropolitan area of Barcelona, Spain. Among the considered approaches, our Convolutional Neural Network (CNN) classifier provides the highest classification accuracy, of 98%. The CNN classifier is then augmented with the capability of rejecting sessions whose patterns do not conform to those learned during the training phase, and is subsequently utilized to attain a fine grained decomposition of the traffic for the four monitored radio cells, in an online and unsupervised fashion.


I. INTRODUCTION
Wireless mobile technology is advancing at a fast pace, through better display resolutions, larger memories, higher communication speeds, etc., and, with that, come more demanding requirements in terms of supported data rates [1], [2], new services, and a higher network responsiveness across diverse physical contexts [3].
In this work, we design and evaluate, via a proof-of-concept implementation, non-intrusive tools for the online estimation of Long Term Evolution (LTE) cellular activity, i.e., the type of traffic that users exchange with their serving base station. As we quantify, our technology allows one to infer with high accuracy the service (e.g., audio, video, etc.) and the application (e.g., Skype, Vimeo, YouTube, etc.) that are being used by the connected mobile users. This is accomplished by decoding LTE downlink/uplink control channel messages (i.e., radio-link level data), which are transmitted in clear text, without breaking any security protocol (encryption). We believe that this brings a high value along several dimensions:
• first, our tool permits a better understanding of spectrum needs across time and space. Note that there are limited means to non-intrusively monitor user density and traffic demand in real time. These measurements are key for the correct dimensioning of future mobile systems, and for the investigation of new data communication and processing techniques at the network edge. In fact, although network operators gather such data through their base stations, they seldom release it due to privacy concerns, so these measures are almost never available for research purposes to university, public or private research labs. We only found a limited number of available datasets including mobile data traffic [4], [5], but they are often incomplete, poorly documented, and contain aggregated data, often averaged hourly or across a certain geographic area. Our technique allows the extraction of data flows with a granularity of one second, isolating the amount of data, the type of service and the application exploited by each mobile user within an LTE cell. We believe that this has a high value for researchers within the mobile networking space, who will be able to create their own rich datasets for research purposes, to spur the advancement of future edge communication (and computing) technology;
• second, our work sheds new light on possible attacks that may be carried out by exploiting the LTE control channel data. In fact, the ability to track user density across time and space by just deploying a number of sniffers within a city may also be exploited maliciously, for example to infer details about mobility and user density. While such analysis, so far, has been carried out with data from network operators, e.g., the Telecom Italia mobile challenge [6], our tool allows the extraction of fine grained LTE data in full autonomy. This may represent a privacy concern.
A large body of work exists in the area of mobile traffic classification (see Section VI for an in depth discussion of the related work). The key challenge of existing classification algorithms consists in the identification, and in the subsequent computation, of a number of representative features. These features are then used to train algorithms that classify the data flows at runtime. Most of the surveyed approaches leverage some domain knowledge, which is utilized to manually obtain the feature set, i.e., by a skilled human expert. However, the use of deep learning techniques has recently paved the way to automatic feature discovery and extraction, often leading to superior performance. For example, in [7] encrypted traffic is categorized through deep learning architectures, proving their superior performance with respect to shallow neural network classifiers. The authors of [8] present a mobile traffic super-resolution technique to infer narrowly localized traffic consumption from coarse measurements: a deep-learning architecture combining Zipper Network (ZipNet) and Generative Adversarial neural Network (GAN) models is proposed to accurately reconstruct spatio-temporal traffic dynamics from measurements taken at low resolution. In [9], the identification of mobile apps is carried out by automatically extracting features from labeled packets through Convolutional Neural Networks (CNNs), which are trained using raw Hypertext Transfer Protocol (HTTP) requests, achieving a high classification accuracy. We stress that the work in these papers, like the majority of the other techniques discussed in Section VI, uses statistical features obtained from application or Internet Protocol (IP) level information for both service and app identification, along with UDP/TCP port numbers.
The solution here presented sharply departs from previous approaches, as it performs highly accurate traffic classification directly from radio-link level data, without requiring any prior knowledge and without having to decode and/or decrypt the transmitted data flows. The proposed classifiers enable fully automated, over the air, and detailed traffic profiling in mobile cellular systems, making it possible to infer, in an online fashion, the radio resource usage of typical service classes. To do this, we leverage OWL [10], a tool that allows decoding the LTE Physical Downlink Control CHannel (PDCCH), where control information is exchanged between the LTE Base Station (eNodeB) and the connected User Equipments (UEs). Specifically, we decode the Downlink Control Information (DCI) messages carried in the PDCCH, which contain radio-link level settings for the user communication (e.g., modulation and coding scheme, transport block size, allocated resource blocks). From DCI data, we create two datasets: 1) a labeled dataset, used to train service and app classification algorithms, where labeling is made possible by injecting an easily identifiable watermark into the application flows generated by a terminal under our control; 2) an unlabeled dataset, used for traffic profiling purposes, which is populated by monitoring, for a full month, mobile traffic from four operative radio cell sites with different demographic characteristics within the metropolitan area of Barcelona, Spain.
For the traffic analysis, we focus on a few services and applications that dominate the radio resource usage, but the approach can be readily extended to further services and applications. We directly use raw DCI data as input to deep learning classifiers (automatic feature extraction), achieving accuracies as high as 98% for both mobile service and app identification tasks. Moreover, we propose a novel technique to use our best classifier in unsupervised settings, to profile the mobile traffic from operative radio cell sites at runtime. Our tool allows for a fine grained and automated analysis of user traffic from real deployments, using radio link messages transmitted over the PDCCH. The developed classification algorithms, as well as our experimental results, are highly novel within the traffic monitoring literature, which only provides hourly and aggregated measures for typical days [11], [12], and where traffic profiling is performed from UDP/TCP, IP or above-IP flows, e.g., [7]-[9].
In summary, the original contributions of this work are:
• Mobile Data Labeling: we present an original and effective approach to automatically label LTE PDCCH DCI data traces. This approach is utilized for six mobile apps, to create a unique correspondence between the software programs (the apps) and the session identifiers that were assigned to them by the eNodeB. The result is a labeled dataset of real DCI data from selected applications, i.e., YouTube, Vimeo, Spotify, Google Music, Skype and WhatsApp video calls.
The paper is organized as follows. Section II presents the experimental framework and the proposed methodology used to obtain the two datasets. Section III introduces the two classification problems, namely, service and app identification, and presents the classification algorithms. The performance of these classification algorithms is assessed in Section IV. In Section V, the CNN classifier is augmented with the capability of rejecting out-of-distribution sessions, and the mobile traffic from four selected cell sites of an operative mobile network in Spain is decomposed over a full day. The related work on mobile traffic classification is reviewed in Section VI, and some concluding remarks are provided in Section VII.

II. DATASET CREATION
Fig. 1 shows the different building blocks of the experimental framework that has been developed to populate the unlabeled and labeled datasets. Briefly, the data measurement and collection block acquires data from the LTE PDCCH channel to extract the relevant DCI information. Data preparation, instead, processes the gathered DCI data so that it can be used for training and classification purposes.

A. Data Collection System
In LTE, the eNodeB communicates scheduling information to the connected UEs through the DCI messages that are carried within the PDCCH with a time granularity of 1 ms. While the actual user content is sent over encrypted dedicated channels, i.e., the Physical Uplink/Downlink Shared Channel (PUSCH/PDSCH, respectively), the PDCCH is transmitted in clear text and can be decoded. To process DCI data, we have adapted the OWL monitoring tool [10]. A Software-Defined Radio (SDR) has been programmed to acquire the PDCCH via an open-source software sitting on top of the srsLTE library [13], which makes it possible to synchronize and monitor the channel over a specified LTE bandwidth. The SDR is connected to a PC that performs the actual decoding of DCI data: in our experimental settings, we used a low cost Nuand BladeRF x40 SDR and an Intel mini-NUC, equipped with an i5 2.7 GHz multi-core processor, a 256 GB Solid State Drive (SSD) and 18 GB of RAM.
Decoded DCI messages for a connected UE contain the following scheduling information [14]:
• Radio Network Temporary Identifier (RNTI),
• Resource Block (RB) assignment,
• Modulation and Coding Scheme (MCS).
DCI messages use RNTIs to specify their destination. RNTIs are 16-bit identifiers that are employed to address UEs in an LTE cell. They are used for different purposes, such as to broadcast system information (SI-RNTI), to page a specific UE (P-RNTI), to carry out a random access procedure (RA-RNTI), and to identify a connected user, i.e., the cell RNTI (C-RNTI). In this work, we are interested in the C-RNTI, which is temporarily assigned when the UE is in RRC (Radio Resource Control) CONNECTED state, to uniquely identify it inside the cell. The C-RNTI can take any unreserved value in the range [0x003D, 0xFFF3]. Once the C-RNTI is assigned to a connected UE, the DCI information directed to this terminal is sent using this C-RNTI, which is transmitted in clear text as part of the PDCCH. Hence, knowing the C-RNTI allows tracking a specific connected user within the radio cell. Assuming that the C-RNTI is known (see Section II-C), the following information about the ongoing communication for this UE can be extracted from its DCI data:
• Number of allotted resource blocks: in LTE, an RB represents the smallest resource unit that can be allocated to any user. The number of resource blocks that are assigned to a UE (N_RB) is derived from the DCI bitmap.
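The per-user tracking described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: it checks whether an RNTI falls in the C-RNTI range quoted in the text and filters a decoded DCI stream (assumed here to be a list of dicts with an `rnti` field) by that identifier.

```python
# Illustrative sketch: identify and track a connected user by its C-RNTI.
# The range below follows the text ([0x003D, 0xFFF3]); the DCI record
# layout (a dict with an 'rnti' field) is an assumption for illustration.
def is_c_rnti(rnti: int) -> bool:
    """Return True if the 16-bit identifier lies in the C-RNTI range."""
    return 0x003D <= rnti <= 0xFFF3

def filter_dci_by_rnti(dci_stream, c_rnti):
    """Keep only the DCI records addressed to the given C-RNTI."""
    return [msg for msg in dci_stream if msg["rnti"] == c_rnti]
```

Once the C-RNTI of a terminal is known, this filter isolates its full scheduling trace from the cell-wide PDCCH stream.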
• Modulation order and code rate: the MCS is a 5-bit field that determines the modulation order and the code rate that are used, at the physical layer, for the transmission of data to the UE.
• Transport Block Size (TBS): the TBS specifies the length of the packet to be sent to the UE in the current Transmission Time Interval (TTI). It is derived from a lookup table by using the MCS and N_RB, see [14].
In this work, we demonstrate that monitoring the downlink and uplink TBS information of a given UE enables service and app classification with high accuracy.
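The lookup step can be sketched as follows. Note that the table entries below are hypothetical placeholders, not the actual 3GPP values from [14]; only the mechanism (MCS maps to a TBS index, which together with N_RB selects a table entry) reflects the text.

```python
# Hedged sketch of the TBS derivation. The numeric values in both tables
# are HYPOTHETICAL placeholders for illustration; the real entries are
# specified in the 3GPP tables referenced as [14] in the text.
MCS_TO_ITBS = {0: 0, 1: 1, 2: 2}        # MCS index -> TBS index (subset)
TBS_TABLE = {                            # (I_TBS, N_RB) -> TBS in bits
    (0, 1): 16, (0, 2): 32,
    (1, 1): 24, (1, 2): 56,
    (2, 1): 32, (2, 2): 72,
}

def transport_block_size(mcs: int, n_rb: int) -> int:
    """Derive the TBS (bits per TTI) from the decoded MCS and N_RB."""
    return TBS_TABLE[(MCS_TO_ITBS[mcs], n_rb)]
```

Summing the per-TTI TBS values of a C-RNTI over one-second bins yields the downlink/uplink traces used by the classifiers.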

B. Unlabeled Dataset
Using the DCI collection system just described, four cell sites of a Spanish mobile network operator in the metropolitan area of Barcelona have been monitored for a full month. The selected eNodeBs are located in areas having different demographic characteristics and land uses, so as to diversify the captured traffic in terms of service and app behavior. We have named the datasets according to the corresponding neighborhood: PobleSec (mainly residential area), Born (mixed residential, transport and leisure area), Castelldefels (mixed suburban and campus area), Camp Nou (mixed residential and stadium area). In total, we have collected more than 68 GB of DCI data from the LTE PDCCH. Fig. 2 shows the locations of the four monitored sites, along with that of the data collection system. After the data collection, the signaling associated with each active C-RNTI is extracted from the PDCCH DCI data stream, and is prepared for the classifier. During this step, we discarded short-length traces, which are mainly due to signaling, paging and background traffic. These accounted for less than 3% of the total traffic in the monitored radio cells.

C. Labeled Dataset
A labeled dataset is obtained by running specific services and apps at a mobile terminal under our control, detecting its C-RNTI within the PDCCH channel and finally associating the corresponding DCI trace with a label, which links it to the service/app that is executed at the UE. Generating data sessions is easy, and boils down to running a specific app from a device that we control, and that is connected to the monitored eNodeB. The difficult part is to identify the generated data flow among those carried by the PDCCH channel, which contains DCI information for all the connected UEs within the radio cell. We made this labeling possible by injecting a watermark into the traffic generated by the controlled UE, so that it could be easily identified among all other users.

1) Data preparation and watermarking
The data preparation procedure is divided into two phases: 1) the identification of the C-RNTI corresponding to the controlled UE, 2) the extraction and labeling of the corresponding DCI trace. In the LTE PDCCH channel, each UE is identified by the C-RNTI, which uniquely identifies the mobile terminal within the radio cell. This identifier is temporary, i.e., it changes after short inactivity periods. This is done to prevent the plain tracking of mobile users, since the PDCCH is sent unencrypted. To allow traffic labeling (i.e., user identification), we introduced a watermark into the traffic generated through our mobile terminal. This watermark amounts to producing, for each application, a regular pattern: any given application (e.g., YouTube) is run for a pre-defined amount of time (60 seconds in our measurements); then, a pause interval of fixed duration is inserted before running the app for a further 60 seconds. We loop this over time, obtaining a duty cycled activity pattern that is easily distinguishable from all the other activity traces within the radio cell. Through this watermarking procedure, we could successfully associate our UE with the corresponding C-RNTI from the DCI. Also, we split the traces into different sessions thanks to the duty cycled pattern, where subsequent sessions are separated by the pause interval (of fixed duration). The label, corresponding to the application that is being executed at the mobile terminal, was finally associated with the extracted DCI data.
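The watermark matching step can be sketched as follows. This is our reconstruction with assumed parameter names (the 60 s active interval is from the text; the pause length and number of cycles are placeholders): build the expected on/off template and pick the C-RNTI whose per-second activity correlates best with it.

```python
import numpy as np

def duty_cycle_template(active_s=60, pause_s=10, cycles=5):
    """Expected watermark pattern: 1 while the app runs, 0 during pauses.
    active_s is from the text; pause_s and cycles are assumed values."""
    one_cycle = [1.0] * active_s + [0.0] * pause_s
    return np.array(one_cycle * cycles)

def best_matching_rnti(activity_by_rnti, template):
    """activity_by_rnti: dict mapping C-RNTI -> per-second activity array
    (at least as long as the template). Returns the C-RNTI with the
    highest Pearson correlation against the watermark template."""
    scores = {rnti: np.corrcoef(act[: len(template)], template)[0, 1]
              for rnti, act in activity_by_rnti.items()}
    return max(scores, key=scores.get)
```

In practice the matching RNTI stands out sharply, since no ordinary user produces such a strictly periodic on/off pattern.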
In our measurement campaign, we have recorded and labeled about M = 10,000 mobile sessions, gathering the scheduling information contained in the DCI messages for selected apps. We considered three data-intensive services: video streaming, audio streaming and real-time video calling, which represent classes producing a considerable amount of traffic and taking most of the network resources [1]. For each service type, we chose two popular applications: Spotify and Google Music for audio streaming, YouTube and Vimeo for video streaming, while for video calling we picked two instant-messaging applications, namely, Skype and WhatsApp Messenger.
A large measurement campaign was conducted to expose the mobile terminal running the selected apps to different radio link conditions, thus obtaining a comprehensive dataset. In particular, the UE was placed in two different locations (termed B1 and B2 in Fig. 2a) within the Castelldefels radio cell to experience different received signal qualities (-84 dBm and -94 dBm for B1 and B2, respectively), and in the Camp Nou eNodeB during football matches, to capture data in high cell load conditions. Fig. 3 shows a few radio resource usage patterns collected for the selected apps. Some similarities can be recognized within the same service class. For example, audio and video streaming present similar behaviors. Also, significant differences can be observed between the radio resource usage of real-time video calls (Skype and WhatsApp Video) and the other apps. Video and audio streaming applications use up a high amount of radio resources at the beginning of the sessions, buffering most of the content into the terminal memory. Real-time video calling, instead, entails a continuous transmission and a more constant usage of radio resources throughout the sessions. Note that the amount of data exchanged in the uplink direction is significant only for this service class, since a video call requires a bidirectional communication.

D. Synchronous and Asynchronous Sessions
Through the watermarking approach and the splitting procedure, we obtained a labeled dataset where each session, depending on the service, presents patterns similar to those shown in Fig. 3. Assuming that the beginning and the end of each session are known is rather optimistic, as in a real measurement setup we have no means to accurately track these instants. Put another way, it is unlikely that the LTE PDCCH measurements and the application run on the UE will be temporally synchronized. Synchronizing the measurements with the beginning of each session would facilitate the classification task, since most of the generated traffic is buffered on the terminal at the beginning, see Fig. 3, and this behavior is a distinctive feature that is easy to discriminate.
To ensure the applicability of our classifiers to real world (asynchronous) cases, we account for asynchronous sessions, entailing that the classification algorithm has no knowledge about the instants where the sessions begin and end. Specifically, each session is split using a sliding window of length W seconds, moved rightwards from the beginning of the session with a stride of S seconds, see Fig. 4. The split sessions (asynchronous sessions), of W seconds each, represent the input data to our classification algorithms. Note that W and S are hyper-parameters of the proposed classification frameworks.
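The windowing step above can be sketched in a few lines, assuming the one-second granularity described in Section II (so W and S are counted in samples):

```python
import numpy as np

def split_session(trace, W, S):
    """Split a session trace of shape (T, 2) -- downlink and uplink TBS
    samples at one-second granularity -- into asynchronous sub-sessions
    of length W with stride S. Returns an array of shape (n, W, 2)."""
    T = trace.shape[0]
    starts = range(0, T - W + 1, S)
    return np.stack([trace[s:s + W] for s in starts])
```

Stacking the windows of all M labeled sessions produces the input tensor fed to the classifiers of Section III.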

E. Sessions Correlation over Time
As a sanity check, we verify the soundness of the watermarking strategy: our aim is to understand whether the transmission of user data in the form of duty-cycled patterns may affect the way in which the eNodeB handles the communication from our terminal, e.g., through some advanced channel reservation mechanism. In that case, in fact, our watermarking strategy would be of little use, as it would introduce scheduling artifacts that do not occur in real-life conditions. To verify this, we evaluated the Pearson correlation between the initial session (i.e., when the UE connects to the LTE PDCCH for the first time and is assigned a new C-RNTI) and the following ones. Fig. 5 shows that, for each of the three services, the correlation is high only when we compare the first session with itself (n = 0). Instead, low values are observed between the first session and the following ones (n >= 1), indicating that the behavior of the eNodeB scheduler is not affected by the repetitive actions (i.e., the duty-cycled activity) performed at the UE side.

III. CLASSIFICATION PROBLEM

A. Problem Definition
Let M be the number of windowed sessions obtained through the data preparation procedure of Section II-C, W the window size, and D = 2 the number of communication directions (downlink and uplink). We define the input dataset tensor X, of size M x W x D, where the m-th row vector x_m contains the trace of W TBS samples for session m, for both the downlink and uplink directions (D = 2).
A classifier estimates a function c : X -> Y, where the output matrix Y has size M x K, with K representing the number of classes. The row vector y_m = c(x_m) = [y_m1, ..., y_mK] contains the probabilities that session m belongs to each of the K classes, with sum_k y_mk = 1. The final output of the classifier is class k*, where k* = argmax_k (y_mk). The following classification objectives are addressed:
O1) Service identification: to classify the collected sessions into K = 3 classes, namely, audio streaming, video streaming and video calls;
O2) App identification: to identify which app is run at the UE. In this case, the number of output classes is K = 6, namely, Spotify, Google Music, YouTube, Vimeo, Skype and WhatsApp Messenger.
Next, we present the considered classification algorithms, grouping them into two categories: those based on artificial neural networks and those based on standard machine learning techniques (referred to here as benchmark classifiers).
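As a small worked example of the output stage just defined: each row of Y is a probability vector over the K classes, and the predicted class is the argmax of that row.

```python
import numpy as np

def predict_classes(Y):
    """Y: (M, K) matrix of class probabilities; each row sums to 1.
    Returns the index k* = argmax_k(y_mk) for every session m."""
    assert np.allclose(Y.sum(axis=1), 1.0)  # valid probability rows
    return np.argmax(Y, axis=1)
```

For the service task (O1) the returned indices range over the 3 classes, and over the 6 app classes for task O2.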

B. Deep Neural Networks
Next, we describe how we tailored three neural network architectures to solve the above traffic classification problem, namely, Multilayer Perceptron (MLP), Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).

1) Multilayer Perceptron
A multilayer perceptron is a feed-forward and fully-connected neural network architecture. The term "feed-forward" refers to the fact that the information flows in one direction, from the input to the output layer. An MLP is composed of, at least, three layers of nodes: an input, a hidden and an output layer. A directed graph connects the input with the output layer and each neuron in the graph uses a non-linear activation function to produce its output. Links are weighted and the backpropagation algorithm is utilized to train the network in a supervised fashion, i.e., to find the set of network weights that minimize a certain error function, given an input set of examples and the corresponding labels. For further details, see [15].
The MLP that we use for mobile traffic classification has three fully connected layers. The first layer MLP_1 contains N_MLP1 = 128 neurons, the second layer MLP_2 has N_MLP2 = 64 neurons and the third layer MLP_3 is fully connected, with K neurons and a softmax activation function to produce the final output. The output of MLP_3 is the class probability vector y_m.
All neurons in layers MLP_1 and MLP_2 use a leaky version of the Rectified Linear Unit (ReLU) activation function (leaky ReLU). Leaky ReLUs help solve the vanishing gradient problem, i.e., the fact that the error gradients that are backpropagated during the training of the network weights may become very small (zero in the worst case), preventing the correct (gradient based) adaptation of the weights. To prevent this from happening, leaky ReLUs have a small negative slope for negative values of their argument [16]. To train the presented MLP architecture, we use the RMSprop gradient descent algorithm [17], by minimizing the categorical cross-entropy loss function L(w), defined as [15]

L(w) = - sum_{x_m in B} sum_{k=1}^{K} t_k(x_m) log y_mk(w, x_m),    (1)

where t(x_m) = [t_1(x_m), ..., t_K(x_m)] contains the class labels associated with the input trace x_m, i.e., t_k = 1 if x_m belongs to class k and t_k = 0 otherwise (1-of-K coding scheme). Vector w contains the MLP weights and y_mk(w, x_m) is the MLP output obtained for input x_m. Eq. (1) is iteratively minimized using the training examples in the batch set B of X, where B is changed at every iteration so as to span the entire input set X.
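The categorical cross-entropy over a batch can be computed numerically as follows (a NumPy sketch of our own; the actual training uses the Keras implementation):

```python
import numpy as np

def categorical_cross_entropy(T, Y, eps=1e-12):
    """Batch loss of Eq. (1). T: (|B|, K) one-hot labels (1-of-K coding);
    Y: (|B|, K) predicted class probabilities. eps avoids log(0)."""
    return float(-np.sum(T * np.log(Y + eps)))
```

The loss is (near) zero when every predicted probability mass sits on the true class, and grows as the predictions diverge from the labels, which is what the RMSprop updates minimize.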

2) Recurrent Neural Networks
Recurrent Neural Networks (RNNs) have been conceived to extract features from temporal (and correlated) data sequences. Long Short-Term Memory (LSTM) networks are a particular type of RNN, introduced in [18]. They are capable of tracking long-term dependencies in the input time series, while solving the vanishing-gradient problem that affects standard RNNs [19].
The capability of learning long-term dependencies is due to the structure of the LSTM cells, which incorporates gates that regulate the learning process. The neurons in the hidden layers of an LSTM are Memory Cells (MCs). An MC has the ability to retain or forget information about past input values (whose effect is stored into the cell states) by using structures called gates, which consist of a cascade of a neuron with sigmoidal activation function and a pointwise multiplication block. Thanks to this architecture, the output of each memory cell possibly depends on the entire sequence of past states, making LSTMs suitable for processing time series with long time dependencies [18]. The input gate of a memory cell is a neuron with sigmoidal activation function (sigma). Its output determines the fraction of the MC input that is fed to the cell state block. Similarly, the forget gate processes the information that is recurrently fed back into the cell state block. The output gate, instead, determines the fraction of the cell state output that is to be used as the output of the MC, at each time step. Gate neurons usually have sigmoidal activation functions (sigma), while the hyperbolic tangent (tanh) activation function is usually adopted to process the input and for the cell state. All the internal connections of the MC have unitary weight [18].
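One time step of the memory cell described above can be written out explicitly. This is the standard LSTM formulation of [18] in our own notation, not code from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. x: input vector; h_prev, c_prev: previous cell
    output and cell state; W, U, b: dicts of input weights, recurrent
    weights and biases for the gates 'i' (input), 'f' (forget),
    'o' (output) and the candidate state 'g'."""
    pre = {k: W[k] @ x + U[k] @ h_prev + b[k] for k in "ifog"}
    i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    g = np.tanh(pre["g"])        # candidate cell state (tanh, as above)
    c = f * c_prev + i * g       # gates decide what to retain or forget
    h = o * np.tanh(c)           # output gate scales the cell output
    return h, c
```

The sigmoidal gates output values in (0, 1), so they act as soft switches on how much past state is kept, how much new input enters, and how much of the state is exposed as output.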
The proposed RNN based traffic classification architecture is shown in Fig. 6. In our design, we consider three stacked layers, combining two LSTM layers and a final fully connected output layer. The first and the second layer (RNN_1 and RNN_2, respectively) have N_RNN1 = N_RNN2 = 180 memory cells. The fully connected layer RNN_3 uses the softmax activation function and its output consists of the class probability estimates, as described in Section III-B1.

3) Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are feed-forward deep neural networks differing from fully connected MLPs in the presence of one or more convolutional layers. At each convolutional layer, a number of kernels is used. Each kernel is composed of a number of weights and is convolved across the entire input signal. Note that the kernel acts as a filter whose weights are re-used (shared weights) across the entire input, and this makes the network connectivity structure sparse, i.e., a small set of parameters (the kernel weights) suffices to map the input into the output. This leads to a considerably reduced computational complexity with respect to fully connected feed-forward neural networks, and to a smaller memory footprint. For more details the reader is referred to [20].
CNNs have proven to be excellent feature extractors for images and inertial signals [21], and here we show their effectiveness for the classification of mobile traffic data. The CNN architecture that we designed for this purpose is shown in Fig. 7. It has two main parts: the first four layers perform convolutions and max pooling in cascade, while the last two are fully connected layers. The first convolutional layer CNN_1 uses one dimensional kernels (1 x 5 samples), performing a first filtering of the input and processing each input vector (rows of X) separately; the activation functions are linear and the number of convolutional kernels is N_CNN1. The second convolutional layer, CNN_2, uses one dimensional kernels (1 x 5 samples) with non-linear hyperbolic tangent activation functions, and the number of convolutional kernels is N_CNN2 = 64. Max pooling is separately applied to the outputs of CNN_1 and CNN_2 to reduce their dimensionality and increase the spatial invariance of the features [21]. In both cases, a one-dimensional pooling with a kernel of size 1 x 3 is performed. A third fully connected layer, CNN_3, performs dimensionality reduction and has N_CNN3 = 32 neurons with leaky ReLU activation functions. This layer is used in place of a further convolutional layer to reduce the computation time, with a negligible loss in accuracy. The last (output) layer CNN_4 is fully connected with softmax activation functions, and returns the class probability estimates, see Section III-B1.

[Table I (benchmark classifier configurations): Linear SVM -- c: penalty parameter for the error term; extended to multi-class with one-vs-rest [22]. Logistic Regressor -- penalty = L2, c = 1; c: inverse of the regularization strength; extended to multi-class with one-vs-rest [22]. Nearest Neighbours -- K: number of neighbors for queries; p: distance metric parameter, with p = 2 amounting to the Euclidean distance [23]. Random Forest -- n. estimators = 10 (number of trees in the forest); max depth = 5 (maximum depth of a tree); criterion = entropy (function to measure the quality of a split of subsets) [25]. Gaussian Processes -- Radial Basis Function (RBF) used as kernel; sigmoid function sigma used to "squash" the nuisance function; Laplacian method used to approximate the non-Gaussian posterior [26].]
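The convolution and pooling stages can be illustrated with a minimal NumPy sketch. This is our reconstruction, using a length-5 kernel and non-overlapping 1 x 3 max pooling as in CNN_1/CNN_2; the 'valid' border handling (no padding) is an assumption:

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1-D sliding-window filtering of signal x with a weight
    kernel (cross-correlation, as in deep learning frameworks)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel)
                     for i in range(len(x) - k + 1)])

def max_pool1d(x, size=3):
    """Non-overlapping 1 x 3 max pooling: keep the maximum of each
    group of `size` consecutive samples."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)
```

Because the same five kernel weights are reused at every input position, the layer has far fewer parameters than a fully connected one, which is the parameter-sharing advantage noted above.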

C. Benchmark classifiers
Other standard classification schemes have been tailored to the considered tasks O1 and O2. The selected algorithms are: Linear Logistic Regression, K-Nearest Neighbours and Linear SVM, as examples of linear classifiers; Random Forest, as an ensemble learning method; and Gaussian Processes, as an instance of Bayesian approaches. The implementations of Linear Logistic Regression, K-Nearest Neighbours and Linear SVM are based on [22], [23] and [24], respectively. The Random Forest implementation is based on [25], whereas for the classifier based on Gaussian Processes we refer to [26]. Configuration parameters and implementation details for the benchmark classifiers are provided in Table I.

IV. SUPERVISED TRAINING AND COMPARISON OF TRAFFIC CLASSIFICATION ALGORITHMS

The performance tests have been carried out on an Intel Core i7 machine, with 32 GB of RAM and an NVIDIA GTX 980 GPU card. We divided the dataset, featuring 10,000 labeled DCI sessions, into training and validation sets with a split ratio of 70%-30%. These sets are balanced, as they contain the same percentage of traces for all classes. The classification algorithms have been implemented in Python. We used the Keras library on top of the TensorFlow backend for the implementation of the deep NNs. For the benchmark classifiers, we used the popular scikit-learn library.
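The balanced 70%-30% split can be sketched as a stratified split that preserves the class proportions in both sets (our reconstruction; an equivalent result is obtained with scikit-learn's stratified splitting utilities):

```python
import numpy as np

def stratified_split(X, y, train_frac=0.7, seed=0):
    """Split (X, y) into training and validation sets, keeping the same
    fraction of every class in both sets (balanced split)."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for k in np.unique(y):
        idx = np.flatnonzero(y == k)
        rng.shuffle(idx)                      # randomize within the class
        cut = int(round(train_frac * len(idx)))
        train_idx.extend(idx[:cut])
        val_idx.extend(idx[cut:])
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]
```

Stratifying matters here because the accuracy comparison across classifiers assumes both sets contain the same percentage of traces per class.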

A. Performance Metrics
The classification performance is assessed through the following metrics:
1) Accuracy: the ratio between the number of correctly classified sessions and the total number of sessions.
2) Precision P: defined, for each class, as the ratio between the true positives T_p and the sum of true positives and false positives F_p, i.e., P = T_p / (T_p + F_p).
3) Recall R: defined, for each class, as the ratio between the true positives T_p and the sum of true positives and false negatives F_n, i.e., R = T_p / (T_p + F_n).
4) F-Score F: the harmonic mean of precision P and recall R, i.e., F = 2PR / (P + R).
Note that the above definitions of precision and recall apply to a single class. However, tasks O1 and O2 both have a number of classes K > 2, namely, K = 3 and K = 6 for service and app identification, respectively. Thus, precision and recall are separately calculated for each of the K classes, and their average is shown in the following numerical results.
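The four metrics above, with the per-class averaging just described, can be computed as follows (our sketch; scikit-learn's metrics module provides equivalent functions):

```python
import numpy as np

def macro_metrics(y_true, y_pred, K):
    """Accuracy, and precision/recall/F-score averaged over the K classes.
    y_true, y_pred: integer label arrays."""
    acc = float(np.mean(y_true == y_pred))
    P, R = [], []
    for k in range(K):
        tp = np.sum((y_pred == k) & (y_true == k))   # true positives
        fp = np.sum((y_pred == k) & (y_true != k))   # false positives
        fn = np.sum((y_pred != k) & (y_true == k))   # false negatives
        P.append(tp / (tp + fp) if tp + fp else 0.0)
        R.append(tp / (tp + fn) if tp + fn else 0.0)
    P, R = float(np.mean(P)), float(np.mean(R))
    F = 2 * P * R / (P + R) if P + R else 0.0        # harmonic mean
    return acc, P, R, F
```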

B. Comparison of Classification Algorithms
1) Accuracy and Algorithm Training
Tables II and III summarize the obtained performance metrics for the deep NNs and the benchmark classifiers for app and service identification, respectively. First, we focus on the results for synchronous sessions. In general, better performance is achieved through deep NNs (+13.8% on app identification, +8.7% on service classification). Moreover, we observe a significant performance gap between the service and app classification tasks, due to the higher number of classes of the latter: the performance gap is higher than 8% for the benchmark classifiers and ranges from 2 to 6% for NNs. Furthermore, RNN and CNN architectures achieve an accuracy of about 99% for the service identification task (O1) and higher than 95% for the app identification task (O2). The algorithm based on Gaussian Processes performs the best among the benchmark classifiers. In general, the higher the complexity (i.e., the number of parameters and, for NNs, of hidden layers), the higher the performance. The only exception to this is provided by CNNs, which present the highest accuracy but use a small number of parameters. This fact confirms the high efficiency of convolutions in processing large amounts of data with complex temporal structure, and the effectiveness of parameter sharing. CNNs require only 6% of the parameters used by RNNs, while achieving a better accuracy. This also translates into faster training: in Fig. 8, we show the accuracy as a function of the number of epochs for the training and validation sets for RNNs and CNNs. The number of epochs required by the CNNs to reach an accuracy higher than 90% is fewer than 20 (Fig. 8b), whereas for RNNs convergence is achieved only after 30 epochs (Fig. 8a).

2) Confusion Matrix
A deeper look at the performance of CNNs is provided by the confusion matrices of Fig. 9, whose rows and columns respectively represent true and predicted labels, and whose values are normalized between 0 and 1. For the service classification task (Fig. 9a), CNNs only misclassify the video streaming sessions: 2% of those are labeled as video calls. For the app identification task (Fig. 9b), errors (4%) mainly occur for Skype and WhatsApp video calls. These errors are understandable, as these are both interactive real-time video applications and, as such, their traffic patterns bear similarities. The lowest performance is found for Vimeo traces, for which 88% of the sessions are correctly classified. Here, our CNN-based classifier confuses them with the other video applications, for both streaming (YouTube, 3%) and real-time calling (WhatsApp and Skype, 6% and 3%, respectively).
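A row-normalized confusion matrix of the kind shown in Fig. 9 can be built in a few lines; this is a generic sketch, not the paper's plotting code.

```python
# Row-normalized confusion matrix as in Fig. 9: rows are true labels,
# columns are predicted labels, and each non-empty row sums to 1.
import numpy as np

def confusion_matrix_normalized(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    # avoid division by zero for classes absent from y_true
    return cm / np.where(row_sums == 0, 1, row_sums)
```

The diagonal entries are the per-class recalls, which is why Fig. 9 directly exposes the 88% value for Vimeo.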

3) Asynchronous Sessions Results
As shown in the last two columns of Tables II and III, the algorithms' accuracy is affected by asynchronous sessions. As expected, we observe a general decrease in the accuracy for all the algorithms (−6.0% for service identification, −7.7% for app identification, on average). However, for both classification problems, neural network-based approaches are more robust to the asynchronous case, showing a performance degradation of −4.3%, while the degradation of standard algorithms is −8.4%.

4) Impact of Different Window Sizes
Fig. 11 shows the classification accuracy as a function of the window size, W. For the app identification task, 40 seconds suffice for CNNs and RNNs to reach accuracies higher than 90%, with negligible additional improvements for longer observation periods. Periods shorter than 40 seconds provide less accurate results. Similar trends are observed for the service classification task. However, in this case, after 20 seconds the accuracy of CNNs and RNNs is already higher than 90%, due to the smaller number of classes. In summary, the ability of CNNs and RNNs to extract representative statistical features from a session grows with the input data length. In our tasks, deep NN algorithms become very effective as monitoring intervals get longer than 20 seconds.
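Operationally, the observation window amounts to truncating each session's DCI time series (one sample per second per direction in our setup). A minimal sketch, with array shapes chosen for the example only:

```python
# Truncating a session to an observation window of W seconds.
# Shapes are illustrative: rows are one-second samples, the two
# columns are the downlink and uplink directions.
import numpy as np

def apply_window(session, w):
    """session: array of shape (N, 2); returns the first w samples."""
    return session[:w]

session = np.random.rand(120, 2)   # a 120-second dummy session
windowed = apply_window(session, 40)
```

Sweeping `w` over {10, 20, 40, ...} and re-evaluating the classifier reproduces the kind of accuracy-vs-window curve shown in Fig. 11.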
V. UNSUPERVISED TRAFFIC PROFILING
Next, we analyze the mobile traffic exchanged within the four selected cell sites (see Section II-B). The traffic load is modeled in terms of aggregated traffic dynamics and type of service requests over the 24 hours of a day. The identification of mobile traffic, for each of the considered services, is performed using the trained classifiers from Section III with the unlabeled dataset. Formally, for each eNodeB, the corresponding unlabeled dataset is stored into the tensor X, with size M × N × D, where M corresponds to the number of monitored RNTIs (sessions), N to the number of collected samples per session, and D = 2 is the number of communication directions (downlink and uplink). Given X as input, the classifier c computes the output Y, whose analysis provides a detailed characterization of the mobile user requests for the eNodeB within the monitored time span. Vectors x_m and y_m = c(x_m) respectively indicate the m-th sample of X and Y. In this paper, we restrict our attention to the unsupervised classification of services, and use the CNN classifier, as it yields the highest accuracy.
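The data layout just described can be sketched as follows. The stand-in classifier below is a toy placeholder for the trained CNN, and all sizes are illustrative.

```python
# Sketch of the unlabeled-data layout: one tensor X of shape (M, N, D),
# with M monitored RNTIs (sessions), N samples per session and D = 2
# directions (downlink, uplink). The classifier below is a toy
# placeholder standing in for the trained CNN c(.).
import numpy as np

M, N, D = 500, 60, 2                 # illustrative sizes only
X = np.random.rand(M, N, D)          # dummy per-second rate samples

def classifier_c(x, num_classes=3):
    # toy stand-in: maps mean downlink activity to one of K=3 services
    return int(np.mean(x[:, 0]) * num_classes) % num_classes

Y = np.array([classifier_c(X[m]) for m in range(M)])
```

In the real pipeline, `classifier_c` is the trained CNN and `Y` collects one softmax-based service label per monitored session.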

A. Aggregated Traffic Analysis
Fig. 12 shows the aggregated traffic demand for the four selected eNodeBs over the 24 hours of a typical day, where each curve has been normalized with respect to the maximum traffic peak that occurred during the day for the corresponding eNodeB. The four traffic profiles have different trends, which depend on the characteristics of the served area (demographics, predominant land use, etc.), as confirmed by [27]. PobleSec is a residential neighborhood and, as such, presents traffic peaks during the evening, at 5 and 11 pm. Born is instead a downtown district with a mixed residential, transport and leisure land use. Two peaks are detected: the highest is at lunchtime, around 2 pm, whereas the second one is at dinner time, from 9 pm. This traffic behavior is likely due to the many restaurants and bars in the area. CampNou is mainly residential and presents a profile similar to PobleSec. However, the Barcelona FC stadium is located in this area, and three football matches took place during the monitored period (events started at 8:45 pm and ended at 10:45 pm). As expected, a higher traffic intensity is observed during the football match hours. In particular, we registered a high amount of traffic exchanged between 7 pm and 1 am, i.e., before, during and after the events. This behavior is probably due to the movement of people attending the matches. Castelldefels is a suburban, sparsely populated area with a university campus. The traffic variation suggests a typical office profile, with traffic peaks at 10 am and 5 pm. However, in this radio cell the amount of traffic exchanged is the lowest observed across all eNodeB sites, i.e., 6.8 Gb/hour in the peak hours. The highest traffic intensity was measured in CampNou, reaching a peak of 106.1 Gb/hour (29.5 Mb/s on average). Intermediate peak values are detected in PobleSec and Born, amounting to 49.7 Mb/s and 46.1 Mb/s, respectively. The only common pattern among the four areas is the low traffic intensity at night, between 2 am and 7 am.
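The per-site peak normalization used for the curves of Fig. 12 is a one-liner per eNodeB; the hourly values below are synthetic placeholders, only the two peak magnitudes quoted in the text are real.

```python
# Normalizing each eNodeB's hourly traffic curve by its own daily peak,
# as done for Fig. 12. Hourly values are synthetic; only the peak
# scales (6.8 and 106.1 Gb/hour) come from the measurements.
import numpy as np

rng = np.random.default_rng(0)
hourly_traffic = {                       # Gb/hour over 24 hours
    "Castelldefels": rng.random(24) * 6.8,
    "CampNou": rng.random(24) * 106.1,
}
normalized = {cell: v / v.max() for cell, v in hourly_traffic.items()}
```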

B. Traffic Decomposition at Service Level
The set of applications that we have labeled is restricted to those apps and services that dominate the radio resource usage. However, additional apps may also be present in the monitored traffic, such as Facebook, Instagram, Snapchat, etc. These apps generate mixed content, including audio-streaming, video-streaming and video-calling. Additional service types may also be generated by, e.g., web browsing and file downloading. While in the present work the classifiers were not trained to specifically track these apps, for a robust classification outcome it is desirable that the audio and video streams they generate are either captured and classified into the correct service class, or at least flagged as unknown.
To locate those traffic patterns for which our classifier may produce inaccurate results, in our analysis we additionally account for the detection of out-of-distribution (OOD) sessions, i.e., of DCI traces that show traffic dynamics different from those learned at training time. To identify these "statistical outliers", the DCI data from each new session, x_m, is fed to the CNN and the corresponding softmax output vector y_m = [y_m1, . . ., y_mK]^T is used to discriminate whether x_m is OOD or not, following the rationale in [28], [29]. In detail, the k-th softmax output corresponds to the probability estimate that a given input session x_m belongs to class k, i.e., y_mk = Prob(x_m ∈ class k), with k = 1, . . ., K. The classifier chooses the class k* that maximizes this probability, i.e.,

k* = argmax_{k=1,...,K} y_mk.    (5)

If a new app, not considered in the training phase, generates sessions having similar characteristics to those in the training set, namely, audio-streaming, video-streaming or real-time video-calls, we expect the CNN to generalize well and return similar vectors at the output of the softmax layer. That is, the softmax vector that is outputted at runtime for the new app should be sufficiently "close" to the output learned by the classifier from the labeled dataset, as the new signal bears statistical similarities with those learned in the training phase.
In this case, it makes sense to accept the session and classify it as belonging to class k*. Otherwise, the session is classified as OOD. The problem at hand boils down to assessing whether the softmax output y_m belongs to the statistical distribution learned by the CNN or it is an outlier. This amounts to performing outlier detection in a multivariate setting, with y_m ∈ [0, 1]^K, Σ_k y_mk = 1. Among the many algorithms that can be used to this purpose, we adopt the method based on Kernelized Spatial Depth (KSD) functions of [30], as it is lightweight and does not require the direct estimation of the probability density function (pdf) of the softmax output layer, which is a critical point, as good estimates require training over many points. Briefly, for a vector y ∈ R^K, we define the spatial sign function as S(y) = y/‖y‖ if y ≠ 0 and S(y) = 0 if y = 0, where ‖y‖ = (y^T y)^{1/2}. Given the set Y_k of softmax vectors obtained from the training sessions of class k, the sample spatial depth associated with a new softmax output vector y_m is:

D(y_m, Y_k) = 1 − (1/|Y_k|) ‖ Σ_{y∈Y_k} S(y − y_m) ‖.    (6)

Note that D(y_m, Y_k) ∈ [0, 1] provides a measure of centrality of the new point y_m with respect to the points in the training set Y_k. In particular, if D(y_m, Y_k) = 1, it follows that Σ_{y∈Y_k} S(y − y_m) = 0 and the new point is said to be the spatial median of set Y_k, i.e., it can be thought of as the "center of mass" of this set. Hence, the spatial depth attains the highest value of 1 at the spatial median and decreases to zero as y_m moves away from it. The spatial depth can thus be used as a measure of "extremeness" of a new data point with respect to a set. In [30], the spatial depth of Eq. (6) is kernelized, which means that distances are evaluated using a positive definite kernel map. A common choice, which we also use in this paper, is the generalized Gaussian kernel κ(x, y), which provides a measure of similarity between x and y.
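The sample spatial depth of Eq. (6) is a few lines of numpy; this sketch follows the definitions above directly.

```python
# Sample spatial depth D(y_m, Y_k) = 1 - ||mean of spatial signs||,
# following Eq. (6): depth 1 at the spatial median, -> 0 for outliers.
import numpy as np

def spatial_sign(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else np.zeros_like(v)

def spatial_depth(y_m, Y_k):
    """y_m: query vector; Y_k: array of training vectors (rows)."""
    signs = np.array([spatial_sign(y - y_m) for y in Y_k])
    return 1.0 - np.linalg.norm(signs.mean(axis=0))
```

At the spatial median the unit vectors pointing to the training points cancel out, giving depth 1; far away they all point the same way and the depth approaches 0.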
Noting that the squared norm can be expressed through inner products, kernelizing the sample spatial depth amounts to expanding Eq. (6) and replacing the inner products with the kernel function κ. This returns the sample KSD function (Eq. (4) in [30]).
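Carrying out this expansion, ‖Σ_y S(y − y_m)‖² becomes a double sum of kernel evaluations, since ⟨φ(y) − φ(y_m), φ(z) − φ(y_m)⟩ = κ(y, z) − κ(y, y_m) − κ(z, y_m) + κ(y_m, y_m). The sketch below implements this kernelized depth with an isotropic Gaussian kernel; the paper's actual Σ (the per-class covariance of the generalized Gaussian kernel) is replaced here by a single bandwidth `sigma` for illustration.

```python
# Kernelized spatial depth (KSD), obtained by expanding Eq. (6) and
# replacing inner products with a kernel. An isotropic Gaussian kernel
# with bandwidth `sigma` stands in for the paper's generalized Gaussian
# kernel with per-class covariance.
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    d = x - y
    return np.exp(-(d @ d) / (2.0 * sigma ** 2))

def ksd(x, Y, kernel=gaussian_kernel):
    n = len(Y)
    kxx = kernel(x, x)
    # feature-space distances ||phi(x) - phi(y)||
    dist = np.array([np.sqrt(max(kxx - 2 * kernel(x, y) + kernel(y, y), 0.0))
                     for y in Y])
    total = 0.0
    for i, y in enumerate(Y):
        for j, z in enumerate(Y):
            if dist[i] == 0.0 or dist[j] == 0.0:
                continue  # x coincides with a training point in feature space
            num = kxx + kernel(y, z) - kernel(x, y) - kernel(x, z)
            total += num / (dist[i] * dist[j])
    return 1.0 - np.sqrt(max(total, 0.0)) / n
```

With a linear kernel κ(x, y) = x·y this reduces to the plain spatial depth of Eq. (6).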
Session classification procedure in an unsupervised setting: the CNN classifier is augmented through the detection of OOD sessions, as follows:
• Initialization: for each class k = 1, . . ., K in the service/app identification task, a number of softmax output vectors is computed by the trained CNN using the sessions in the training set.
Tuning the OOD threshold t: for each class k, S_k is obtained from the training dataset. We recall that S_k is used to compute the covariance matrix associated with the adopted Gaussian kernel, which models the contours of the pdf of the output softmax vectors. The threshold t ∈ [0, 1] is instead used by the outlier detection algorithm to gauge the (kernelized) distance between the center of mass of set S_k and a new softmax vector, acquired at runtime. If t = 1, the kernelized spatial depth of the new point will always be smaller than or equal to t and all points will be rejected (marked as outliers). This is of course of no use. However, as we decrease t towards 0, more and more points will be accepted, until, for t = 0, no rejections occur. So, t determines the selectivity of the outlier detection mechanism: the higher t, the more selective the algorithm is. For our numerical evaluation, once the sets S_k are obtained for all classes k, we set this threshold by picking the highest value of it, t*, for which all the softmax vectors belonging to the test set are accepted, i.e., none of them is marked as an outlier (OOD). In other words, this is equivalent to making sure that the F-Score obtained over the test set from our trained CNN without the OOD mechanism enabled equals that of the CNN classifier augmented with the OOD detection capability. As t* is the highest value of t for which all the data in the test set are correctly classified as valid, our approach amounts to tuning the threshold in such a way that the outlier detection algorithm will be as selective as possible, while correctly treating all the data in the test set. In Fig. 13, we show the F-Score as a function of t for the CNN algorithm with and without OOD detection. Threshold t* = 0.48 corresponds to the highest value of t for which the F-Score remains at its maximum, i.e., at the end of the flat region.
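The threshold rule above (reject a point when its kernelized depth is at most t, and pick the largest t accepting the whole test set) can be sketched as follows; the depth values are placeholders, not measured data.

```python
# Tuning the rejection threshold t*: a point is rejected (OOD) when its
# KSD value is <= t, so the largest threshold accepting every test-set
# softmax vector sits just below the minimum test-set depth.
import numpy as np

def tune_threshold(test_depths, eps=1e-6):
    """test_depths: KSD values of the test-set softmax vectors."""
    return float(np.min(test_depths)) - eps

depths = np.array([0.62, 0.48, 0.55, 0.71])   # illustrative depths only
t_star = tune_threshold(depths)
```

In the paper's evaluation this procedure lands on t* = 0.48, the right edge of the flat region in Fig. 13.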
Experimental analysis of eNodeB traffic: in Fig. 14, the traffic decomposition into the considered service classes is shown for the four selected eNodeBs using t* = 0.48. The percentage of sessions identified as OOD, i.e., those for which the classifier is uncertain, is also reported at the top of each bar. Common characteristics are observed in all the considered deployments:
• The most used service is video-streaming, with typical shares ranging from 50% to 80%. This confirms the measurements in [1] and [2].
• The least used service is video-call, whose share is typically between 5% and 10%, whereas audio-streaming takes 21% of the total traffic load.
• OOD sessions are consistently well below 8%. Note that this share accounts for all those apps that are not tracked by our classifier, such as texting, web browsing, and file transfers.
Through the proposed service identification approach, we can accurately characterize, at runtime, the used services. Moreover, the traffic decomposition at service level allows one to make some interesting considerations on the land use. For example, in a typical residential area (PobleSec) the audio-streaming service is the one used the least across the four monitored sites, with an average of 16.4%. Instead, in a typical office and university neighborhood (Castelldefels), audio-streaming has the highest traffic share across all sites (22% on average). Born and CampNou, which are two leisure districts, present a similar traffic distribution across the day. We finally remark that, while the traffic profiling results are shown using a time granularity of one hour, our classification tool allows for traffic decomposition at much shorter timescales, i.e., on a per-session basis.

VI. RELATED WORK
The most common classification methods in the literature leverage UDP/TCP port analysis and/or packet inspection.
UDP/TCP port analysis relies on the fact that most Internet applications use well-known Transmission Control Protocol (TCP) or User Datagram Protocol (UDP) port numbers. For instance, the authors of [31] define a mobile traffic classifier as a collection of rules, including destination IP addresses and port numbers. Based on these rules, application-level mobile traffic identification is performed by deploying a dedicated classification architecture within the network, and measurement agents at the mobile devices. However, port-based schemes hardly work in the presence of applications using dynamic port numbers [32].
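In its simplest form, port-based classification is a lookup against the well-known port registry, which makes its failure mode under dynamic ports evident:

```python
# Minimal illustration of port-based classification: a lookup from
# well-known TCP/UDP ports to application-layer protocols. Anything
# on a dynamic or non-standard port falls through to "unknown".
WELL_KNOWN_PORTS = {
    80: "HTTP",
    443: "HTTPS",
    53: "DNS",
    25: "SMTP",
    22: "SSH",
}

def classify_by_port(dst_port):
    return WELL_KNOWN_PORTS.get(dst_port, "unknown")
```

An app negotiating an ephemeral port (e.g., above 49151) is simply invisible to this scheme, which is the limitation noted in [32].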
A scheme based on deep packet inspection is presented in [33]. The authors of this paper devise a technique for Code Division Multiple Access (CDMA) traffic classification, using correlation-based feature selection along with a decision tree classifier trained on a labeled dataset (which is labeled via deep packet inspection). The algorithm in [34] extracts application layer payload patterns, and performs maximum entropy-based IP-traffic classification exploiting different Machine Learning (ML) algorithms, such as Naive Bayes, Support Vector Machines (SVMs) and partial decision trees. However, payload-based methods entail significant complexity and computational load [32]. Furthermore, many mobile applications adopt encrypted data transmission due to security and privacy concerns, which renders packet inspection approaches ineffective.
Classification schemes for encrypted flows utilize traffic statistics, extracting meaningful features from the observed traffic patterns. For example, the authors of [35] propose a classification system for mobile apps based on a Cascade Forest algorithm, exploiting features related to: i) the mobile traces, ii) the Cascade Forest algorithm, iii) connection-oriented and connection-less protocols, and iv) encrypted and unencrypted flows. Along the same lines, a classification approach that combines state-of-the-art classifiers for encrypted traffic analysis is put forward in [7].
Another interesting work is presented in [32], where the authors classify service usage from mobile messaging apps by jointly modeling user behavioral patterns, network traffic characteristics, and temporal dependencies. The framework is built upon four main blocks: traffic segmentation, traffic feature extraction, service usage prediction, and outlier detection. When traffic flows are short and the defined features do not suffice to fully describe the traffic pattern, Hidden Markov Models (HMMs) are exploited to capture temporal dependencies and enhance the classification accuracy.
The authors of [36] show that a passive eavesdropper is capable of identifying fine grained user activities for Android and iOS mobile apps by solely inspecting IP headers. Their technique is based on the intuition that the highly specific implementation of an app may leave a fingerprint on the generated traffic in terms of, e.g., transfer rates, packet exchanges, and data movement. For each activity type, a behavioral model is built; then, K-means and SVM are respectively used to reveal which model is the most appropriate, and to classify the mobile apps.
Automatic fingerprinting and real-time identification of Android apps from their encrypted network traffic is presented in [37]. IP-based feature extraction and supervised learning algorithms are the basis of a framework featuring six classifiers, obtained as variations of SVMs and Random Forests (RFs). RFs have also been considered in [38], where the authors claim that the sole use of packet-based features does not suffice to classify the traffic generated by mobile apps. As a solution, they use a combination of packet size distributions and communication patterns.
Recent works exploit NNs [7]-[9]. In [9], mobile apps are identified by automatically extracting features from labeled packets through CNNs, which are trained using raw HTTP requests. In [7], encrypted traffic is classified using deep learning architectures (feed forward, convolutional and recurrent neural networks) for Android and iOS mobile apps, with and without using TCP/UDP ports. The authors of [8] combine Zipper Networks (ZipNet) and Generative Adversarial Networks (GANs) to infer narrowly localized and fine grained traffic generation from coarse measurements.
A systematic framework is devised in [39] for the comparison of different techniques, where deep learning is proposed as the most viable strategy. The performance of the deep learning classifiers is thoroughly investigated based on three mobile datasets of real human users' activity, highlighting the related drawbacks, design guidelines, and challenges. Several survey papers dealing with deep learning techniques applied to traffic classification can be found in [40], [41] and [42]. The authors of [40] overview general guidelines for classification tasks, present some deep learning techniques and how they have been applied to traffic classification, and finally address open problems and future directions. The survey in [41] presents a deep learning-based framework for mobile encrypted traffic classification, reviewing existing work according to dataset selection, model input design, and model architecture, and highlighting open issues and challenges. Finally, a comprehensive and thorough study of the crossovers between deep learning and mobile networking research is provided in [42], where the authors discuss how to tailor deep learning to mobile environments. Current challenges and open future research directions are also discussed.
We stress that most of the works in the literature, with the exception of [7]-[9] and [39], classify mobile traffic based on manual feature extraction, and all the papers that we surveyed process network or application level data. Our work departs from prior art as we leverage the feature extraction capabilities of deep neural networks and classify mobile data gathered from the physical channel of an operative mobile network, at runtime, and without access to application data or TCP/UDP port numbers.

VII. CONCLUSIONS
In this paper, we have presented a framework that allows highly accurate classification of applications and services from radio-link level data, at runtime, and without having to decrypt dedicated physical layer channels. To this end, we decoded the LTE Physical Downlink Control Channel (PDCCH), where Downlink Control Information (DCI) messages are sent in clear text. Through DCI data, it is possible to track the data flows exchanged between the serving cell and its active users, extracting features that allow the reliable identification of the apps/services that are being executed at the mobile terminals. For the classification of such traffic, we have tailored deep artificial Neural Networks (NNs), namely, Multi-Layer Perceptrons (MLPs), Recurrent NNs and Convolutional NNs, comparing their performance against that of benchmark classifiers based on state-of-the-art supervised learning algorithms. Our numerical results show that NN architectures outperform the other approaches in terms of classification accuracy, with the best accuracy (as high as 98%) being achieved by CNNs. As a major contribution of this work, labeled and unlabeled datasets of DCI data from real radio cell deployments have been collected. The labeled dataset has been used to train and compare the classifiers. For the unlabeled dataset, we have augmented the CNN with the capability of detecting input DCI data that do not conform to the patterns learned during the training phase: the corresponding sessions are detected and associated with an unknown class. This increases the robustness of the CNN classifier, allowing its use, at runtime, to perform fine grained traffic analysis of radio cell sites from an operative mobile network. To summarize, the main outcomes of our work are: 1) a methodology to extract DCI data from the PDCCH channel, and for the use of such data to train traffic classifiers, 2) the fine tuning and a thorough performance comparison of classification algorithms, 3) the design of a novel technique
for the fine grained and online traffic analysis of communication sessions from real radio cell sites, and 4) the discussion of the traffic distribution resulting from such analysis from four selected sites of a Spanish mobile operator, in the city of Barcelona.

Fig. 1: Experimental framework adopted for the creation of the unlabeled and labeled datasets.
(a) Castelldefels: suburban area with a university campus. (b) Camp Nou: mainly residential area with the Barcelona FC stadium. (c) Born: mixed residential, transport and leisure area.

Fig. 2: Maps of the Barcelona metropolitan areas where the measurement campaign took place for the creation of the unlabeled and labeled datasets. In the maps, the eNodeB location is denoted by A, whereas the data collection system and the mobile terminal are marked as B. In Castelldefels, the mobile terminal has been placed in two different locations (B1 and B2).

Fig. 3: Traffic pattern snapshots showing the normalized data rate for different applications as a function of time.

Fig. 8: Accuracy vs. number of epochs for training and validation sets for the app identification task.

Fig. 9: Confusion matrices for the CNN algorithm: (a) service identification, (b) app identification.
Accuracy for app identification.

Fig. 13: Finding threshold t* using the CNN with (solid line) and without (dashed line) the OOD detection mechanism.
If properly chosen, the contours of KSD should closely follow those of the (actual) underlying statistical distribution. Σ is learned, for each class k, from the training vectors in Y_k, and for the following results we picked Σ = Σ_2 in [30].

Fig. 14: Traffic decomposition at service level for the four monitored eNodeBs during the 24 hours of a day.

TABLE I: Configuration parameters for the benchmark classifiers.

TABLE II: Classifiers comparison for the app identification task.

TABLE III: Classifiers comparison for the service identification task.
The softmax vectors are stored in the set Y_k. Note that, being the result of a supervised training of the CNN, we know that the vectors in Y_k are all generated by a distribution that is correctly learned during the supervised training phase.
• Feature extraction through the pre-trained CNN: at runtime, as a new DCI vector x_m is measured, it is inputted into the pre-trained CNN, obtaining the corresponding softmax output y_m.
• Classification and OOD detection: vector y_m is used with Eq. (5) to assess the most probable class k*. At this point, Algorithm 1 of [30] is utilized to assess whether y_m is an outlier. In case the vector is classified as an outlier, it is assigned to the OOD class, otherwise it is assigned to class k*.
Some final remarks are in order. The outlier detection algorithm uses a threshold t ∈ [0, 1], which allows exploring the tradeoff between false alarm rate and detection rate. Instead, the covariance matrix Σ controls the decision boundary for rejecting vectors, driving the tradeoff between the local and global behavior of KSD.