User mobility inference and clustering through LTE PDCCH data analysis

The high penetration of mobile services provides an ample set of data generated by users and network elements. The analysis of such data yields insights into the behaviour of users and their experienced quality, and can be used by mobile operators to improve their mobile networks. In this paper, we design methods to infer user mobility patterns, estimate their channel quality and cluster them based on their Modulation and Coding Scheme (MCS) time evolution. In detail, we propose: i) a mapping between MCS and SNR, useful to assess the quality of the transmissions, ii) a method for deriving users’ approximate velocities and categorising user mobility, and iii) a hierarchical clustering algorithm using Dynamic Time Warping, able to generate meaningful user clusters according to communication length and quality. We apply the proposed methods to real traces collected from more than one week of observations of three operative base stations in Spain. We observe that our solutions successfully provide relevant information about users’ mobility and their channel quality, making them suitable for improving the understanding and planning of LTE resources by mobile network operators.


I. INTRODUCTION
The use of mobile data communications has become massive in the era of smartphones and the Internet of Things. With the increasing pressure of more users and devices connected than ever before and a growing concern about reducing energy consumption to the highest possible extent, data analysis has become a cornerstone in the optimisation and engineering of future mobile equipment and networks [1]. Applying data analysis techniques to LTE networks enables proactive and self-organised network management. For example, it is important to be able to predict when a base station will be overloaded or underloaded in order to properly manage radio resources among users or trigger energy saving modes, respectively. For that purpose, it is key to study user mobility patterns and estimate the quality required by the requested services.
Among different data analysis techniques, Machine Learning has been extensively used for different mobile communication tasks (see [2] for a complete survey). In particular, unsupervised clustering techniques have been used as part of pipelines for user classification tasks [3] [4] [5], and also to learn mobility patterns [6]. In [7], an algorithm to group base stations into different clusters is proposed using aggregated data. K-means [8] and Self-Organising Maps (SOM) [9] have typically been used for these tasks.
In this work, we use the information on the Modulation and Coding Scheme (MCS) included in the DCI messages of the Physical Downlink Control Channel (PDCCH) from an operative LTE network in Spain to infer user mobility patterns, estimate their channel quality and cluster them based on their MCS time evolution. The main contributions of this paper are as follows. First, we propose a mapping between MCS and SNR and find a relation between the number of transmitting users in a base station and the quality they experience. Second, we use the approximate SNR values to infer users' mobility patterns. Finally, a distance-based hierarchical clustering framework is used to identify users receiving similar communication patterns over time. The proposed method has the main advantage of being able to cluster time series of different lengths, and it uses a distance metric specifically suited for time series. Furthermore, the adoption of hierarchical clustering [10] implies that the number of clusters can be chosen after the clustering procedure is done.
This article is structured as follows. In Section II, we present the data used along with some considerations regarding data preprocessing. In Section III, we present the theory and procedures used for mobility inference and user clustering, and the results of applying such procedures to the presented data appear in Section IV. Finally, a conclusion is given in Section V.

II. DATA AND PREPROCESSING CONSIDERATIONS
The dataset used in this paper was part of the EU-MSCA-SCAVENGE data challenge [11]. The data consists of traces recorded from the PDCCH (Physical Downlink Control Channel) of three eNBs (eNodeBs), henceforth referred to as stations "A", "B", and "C", located in three different areas of the city of Barcelona, with over a week of observations for each station. The PDCCH is the channel that an LTE eNB uses, among other purposes, to communicate the resource scheduling information to transmitting UEs (User Equipments) both in uplink and downlink.
The dataset is composed of 4 columns: Timestamp, Transmission direction (UL/DL), Modulation and Coding Scheme (MCS) index (from 0 to 31) and Temporal User Identifier (C-RNTI).
The MCS is a number (in LTE, ranging from 0 to 31) that is related to a pair of Modulation and Code Rate. LTE uses different digital modulations (QPSK, 16-QAM, 64-QAM, 256-QAM) and different code rates to adapt the transmission to the channel quality between an eNB and a user at any time, so that re-transmissions of data are minimised [12]. MCS indices are allocated by the eNB as a function of the Channel Quality Indicator (CQI) sent by the UE. The CQI depends on the SNR experienced by the user, which, among other factors, is generally correlated with the distance between the eNodeB and the UE. Hence, the information on the assigned MCS may be used to infer user mobility patterns [12].
The C-RNTI is an LTE 16-bit number which, among its several purposes, helps in identifying a UE in shared channels such as the PDCCH [12]. Being temporary, C-RNTIs do not uniquely identify users: since there is only a limited number of RNTIs, two users who communicate sufficiently distant in time might share the same C-RNTI value. The expiration time of C-RNTIs is an eNB vendor-specific parameter and is therefore unknown in the dataset. The analysis of the distribution of typical user communication lengths for different values of this threshold suggested 1 hour as a reasonable value. This avoids the unrealistic cases of too many very short or very long communications. After the data preparation procedures, we can associate each user $i$ with a vector $\overrightarrow{MCS}_i$ containing the MCS values observed during its communication.
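The session-splitting step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: it assumes that the 1-hour threshold is applied as a maximum silence between two consecutive messages of the same C-RNTI, and the column names (`timestamp`, `crnti`, `mcs`) and the sample rows are hypothetical.

```python
import pandas as pd

def split_sessions(df, gap="1h"):
    """Label each message with a session id (assumed rule: a C-RNTI that
    stays silent for longer than `gap` is treated as a new user)."""
    df = df.sort_values(["crnti", "timestamp"]).copy()
    # Time elapsed since the previous message of the same C-RNTI
    delta = df.groupby("crnti")["timestamp"].diff()
    # A session starts at the first message of a C-RNTI or after a long silence
    new_session = delta.isna() | (delta > pd.Timedelta(gap))
    df["session"] = new_session.cumsum()
    return df

# Synthetic rows, for illustration only
trace = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-01 10:00:00", "2019-01-01 10:00:05",
        "2019-01-01 12:30:00", "2019-01-01 10:00:01",
    ]),
    "crnti": [100, 100, 100, 200],
    "mcs": [12, 14, 5, 20],
})

sessions = split_sessions(trace)
# One variable-length MCS vector per inferred user
mcs_vectors = sessions.groupby("session")["mcs"].apply(list)
```

In this toy trace, C-RNTI 100 is split into two users because its third message arrives more than one hour after the second one.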

A. SNR and mobility analysis
The approach to infer user mobility relies on the estimation of users' SNR, which, in turn, can be used to estimate their velocity by analysing its variation over time. We propose a mapping from MCS to SNR consisting of two steps: an MCS-CQI mapping based on LTE specifications [12] and a CQI-SNR mapping partly based on [13]. Table I shows the proposed MCS-SNR mapping. The mapping is based on the assumption that the SNR follows a normal distribution and thus that, within each modulation order, medium CQI levels are more likely to be found than low or high CQI levels. Hence, the mapping overrepresents medium CQI levels (which are mapped to more MCS levels) and underrepresents low and high CQI levels (which are mapped to fewer MCS levels), as can be seen in Table I. Two SNR values are provided for each CQI: $SNR_{best}$, the SNR in the best configuration scenario, and $SNR_{worst}$, the SNR in the worst configuration scenario.
The velocity of user $i$ at its measurement $j$, $v_i^j$, is computed using Equation (1). We first compute the users' SNR vectors $\overrightarrow{SNR}_i$ from the users' MCS vectors $\overrightarrow{MCS}_i$ by applying Table I. The variation of SNR ($\Delta SNR_i^j$) and of time ($\Delta t_i^j$) are then needed. To guarantee that the SNR variation is significant, it is computed so that it accounts for the SNR margin of uncertainty. The ratio between these two variables is then multiplied by the constant $d_{1dB}$ to obtain the user's velocity, where $d_{1dB}$ is the variation of the distance for a change of 1 dB in the SNR. $d_{1dB}$ is derived using Equation (2) for a given frequency.
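As a rough illustration of the velocity estimate described above (the SNR variation divided by the time variation, scaled by $d_{1dB}$), consider the sketch below. It is not the authors' implementation. It assumes that variations are taken between consecutive measurements, reports the magnitude of the velocity, does not model the SNR uncertainty margin, and uses a hypothetical value of `d_1db` rather than one derived from Equation (2).

```python
import numpy as np

def estimate_velocity(snr_db, t_s, d_1db):
    """Approximate speed (m/s) between consecutive measurements.

    snr_db : SNR samples of one user, in dB
    t_s    : measurement times, in seconds (strictly increasing)
    d_1db  : distance variation (m) for a 1 dB change in SNR
             (hypothetical value here, not taken from the paper)
    """
    snr_db = np.asarray(snr_db, dtype=float)
    t_s = np.asarray(t_s, dtype=float)
    d_snr = np.diff(snr_db)  # SNR variation between consecutive samples
    d_t = np.diff(t_s)       # time variation between consecutive samples
    # Assumption: only the magnitude of the SNR change matters for the speed
    return d_1db * np.abs(d_snr) / d_t

v = estimate_velocity(snr_db=[10, 12, 12, 9], t_s=[0, 2, 4, 5], d_1db=3.0)
avg_speed = v.mean()
```

With these illustrative inputs the per-step speeds are 3, 0 and 9 m/s, and their mean is the kind of average speed later used for the mobility classification.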

B. User clustering based on MCS patterns
The objective of our clustering procedure is to separate the different users' $\overrightarrow{MCS}_i$ vectors into groups based on their communication behaviour (i.e., session length and MCS dynamics). It is important to recall that our dataset represents the different MCS values of a given C-RNTI over the duration of the user's communication. Hence, the resulting time series differ in length, and it is impossible to apply the most classical clustering algorithms (K-means, SOM) without modifying the vectors first. Instead, we employ a pairwise distance-based clustering using Dynamic Time Warping (DTW) as the distance function. DTW is able to calculate distances between sequences of different lengths, and has been widely used in the time series analysis literature [14] [15] [16].
The pairwise distance matrix obtained from applying DTW to the observations can then be fed into a hierarchical clustering algorithm, which defines a hierarchy of clusters, so the number of clusters can be defined a posteriori. A linkage matrix is created to store the clusters at each step. Here, Ward's variance minimization algorithm [10] is used to calculate the minimum distance between the observations and the clusters. After that, the number of clusters is decided and the cluster assigned to each observation is obtained. The whole procedure is presented in Algorithm 1. In the pseudocode, linkage refers to the hierarchical clustering linkage function implemented in libraries for popular programming languages such as Python, Matlab or Mathematica, in which the usage of Ward's algorithm needs to be specified. For any 1 ≤ k ≤ n, where n is the number of users, k clusters can be retrieved using the output linkage matrix Z. All of the data analysis and procedures were implemented in Python, mainly relying on the data manipulation library pandas and the machine learning and scientific computing libraries SciPy and scikit-learn.
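The clustering pipeline described above (pairwise DTW distances, Ward linkage, then a cut into k clusters) might look like the following sketch. It is not a reproduction of Algorithm 1: the DTW implementation is a plain dynamic-programming version with the absolute difference as local cost, which is an assumption, and the toy MCS vectors are hypothetical. Note also that SciPy documents Ward linkage as strictly defined for Euclidean distances, so feeding it DTW distances is a pragmatic choice.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def cluster_users(mcs_vectors, k):
    """Return the linkage matrix Z and a cluster label for each vector."""
    n = len(mcs_vectors)
    # Condensed pairwise distances in the order SciPy expects (i < j, row-major)
    condensed = np.array([
        dtw_distance(mcs_vectors[i], mcs_vectors[j])
        for i in range(n) for j in range(i + 1, n)
    ])
    Z = linkage(condensed, method="ward")
    # The number of clusters is chosen only now, after the hierarchy is built
    labels = fcluster(Z, t=k, criterion="maxclust")
    return Z, labels

# Toy MCS vectors of different lengths (illustrative values only)
vectors = [[5, 5, 5], [5, 6, 5, 5], [25, 26, 25], [25, 25, 26, 25, 25]]
Z, labels = cluster_users(vectors, k=2)
```

Because `Z` stores the whole hierarchy, calling `fcluster` again with a different `t` yields another number of clusters without recomputing the DTW distances, which are the expensive part.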
A. SNR and mobility analysis
We first analyse the effect of the number of competing users on their SNR. The correlation between the two phenomena is quite clear: a higher number of competing users leads to a worse SNR. Moreover, it is also possible to note a regular daily pattern: a higher number of messages is always exchanged around midday or early afternoon in eNB A and eNB C, whereas the peaks occur during the evening in eNB B.
In Figure 2, a different visualisation of the same SNR analysis is presented. In this case, the average SNRs are plotted against the logarithm of the number of incoming communications in a 2D scatter plot in which each point represents 60 minutes of transmissions. Given the limited spectrum available in wireless communication systems, only a limited number of users can be served without degrading their service quality. Thus, there is a region in which the SNR does not primarily depend on the number of users, and a point from which the SNR starts to drop with an increasing number of users. This point marks the beginning of what we define as the saturation region. In Figure 2, the left side is the area in which an increase in the number of messages does not lead to a decrease in the SNR of the user communications, which in general remains constant. On the contrary, on the right side, the SNR starts to decrease with $\log_{10}(n_t)$, where $n_t$ is the number of transmissions. The limit is at around $4 \times 10^4$ transmissions for eNB A and around $3.5 \times 10^5$ for eNB B and eNB C. However, for eNB B and eNB C, some points inside the non-saturated region lie well below the mean SNR of that region (circled in orange). This means that, even though the number of transmissions is theoretically supported by the base station and no quality degradation is caused by the number of users in the eNB, something else is impairing the channel quality. eNB B and eNB C might be covering a more complicated area than eNB A in terms of morphology, where conditions that are damaging for wireless transmission are present (e.g. narrower streets, a higher number of urban obstacles, etc.).
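The points of such a scatter plot can be obtained with a simple hourly aggregation. The sketch below is only an illustration under assumptions: it expects a hypothetical `snr_db` column that already holds the SNR mapped from the MCS, and the sample rows are synthetic.

```python
import numpy as np
import pandas as pd

def hourly_load_vs_snr(df):
    """One row per 60-minute window: number of transmissions and mean SNR."""
    hourly = (
        df.set_index("timestamp")["snr_db"]
        .resample("60min")
        .agg(["count", "mean"])
    )
    # Windows without any transmission carry no information for the plot
    hourly = hourly[hourly["count"] > 0].copy()
    hourly["log10_n_t"] = np.log10(hourly["count"])
    return hourly.rename(columns={"count": "n_t", "mean": "mean_snr_db"})

# Synthetic rows, for illustration only
trace = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-01 10:05", "2019-01-01 10:30",
        "2019-01-01 10:59", "2019-01-01 12:10",
    ]),
    "snr_db": [10.0, 12.0, 14.0, 5.0],
})

points = hourly_load_vs_snr(trace)
```

Plotting `mean_snr_db` against `log10_n_t` then gives one point per hour, from which the onset of the saturation region can be inspected visually.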
In order to better understand the evolution of the user transmission quality over time, we have introduced four quality categories: Poor, Fair, Good, and Excellent. A Poor quality transmission experience is defined as one in which $SNR_{worst}$ is below 0 dB. For Fair quality, $SNR_{worst}$ is at or above 0 dB and below 6 dB. Good quality represents an $SNR_{worst}$ value equal to or above 6 dB and below 12 dB. Finally, Excellent quality means that $SNR_{worst}$ meets or exceeds 12 dB. These thresholds can be translated to MCS using Table I. Each user MCS sample $MCS_i^j$ is mapped into its corresponding quality tag as defined above, and Figure 3 shows the evolution of quality tags over time. When the number of messages exchanged is high, the predominant quality is Poor: the interference between users becomes an important factor, and the SNR is lower. When fewer users are competing, Good and Excellent quality are more predominant. This can be more clearly noticed for eNB B.
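The four quality categories defined above translate directly into a threshold lookup. The sketch below works on $SNR_{worst}$ values in dB; the step from MCS to $SNR_{worst}$ through Table I is not included, and the function and constant names are hypothetical.

```python
import numpy as np

QUALITY_LABELS = np.array(["Poor", "Fair", "Good", "Excellent"])
THRESHOLDS_DB = [0.0, 6.0, 12.0]  # lower bounds of Fair, Good and Excellent

def quality_tag(snr_worst_db):
    """Map SNR_worst values (dB) to the four quality categories."""
    # np.digitize returns i such that THRESHOLDS_DB[i-1] <= x < THRESHOLDS_DB[i],
    # so a value exactly on a threshold falls into the higher category
    idx = np.digitize(snr_worst_db, THRESHOLDS_DB)
    return QUALITY_LABELS[idx]

tags = quality_tag([-3.0, 0.0, 5.9, 6.0, 11.9, 12.0, 20.0])
```

Counting the resulting tags per time window would give a view of how quality evolves over time.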
Finally, the average speed of the users is calculated based on the velocity estimation procedure explained in the previous section. With this value, a user is classified as static or moving, depending on whether its velocity is close to 0 m/s or greater than 0 m/s, respectively. Within the moving category, users are divided into 'pedestrian' and 'vehicular' depending on whether their estimated velocity is below or above 10 km/h (2.8 m/s). The results are reported in Table II. It must be noted that, due to the simplicity of its underlying FSL model, this mobility assessment procedure might provide inaccurate data in specific situations of high channel quality disruptions. However, it is simple to implement, and the results presented above have shown that it is useful when performing general mobility assessments.
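The static, pedestrian and vehicular split described above can be written as a small decision function. This is a sketch under assumptions: the text only says that static users have a velocity "close to 0 m/s", so the tolerance `static_tol_mps` is a hypothetical parameter, as is the choice of treating exactly 10 km/h as vehicular.

```python
PEDESTRIAN_MAX_MPS = 10 / 3.6  # 10 km/h expressed in m/s (about 2.8 m/s)

def mobility_class(avg_speed_mps, static_tol_mps=0.1):
    """Classify a user from its average estimated speed in m/s.

    static_tol_mps is an assumed tolerance for "close to 0 m/s";
    the source does not state the value that was used.
    """
    if avg_speed_mps <= static_tol_mps:
        return "static"
    if avg_speed_mps < PEDESTRIAN_MAX_MPS:
        return "pedestrian"
    return "vehicular"

classes = [mobility_class(s) for s in (0.0, 1.0, 5.0)]
```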

B. User clustering based on MCS patterns
The clustering procedure in Algorithm 1 is applied separately to random subsets of N = 10000 communications for each of the three eNBs. We take k = 10 clusters in all cases to demonstrate the results, but this number can be easily adjusted after the algorithm has run thanks to the notion of distance between clusters, which makes the method useful for practical applications. Figure 5 shows the resulting dendrograms (truncated after 10 clusters) for each case. The number of users in each cluster and the distance between clusters are also plotted. The position on the y axis of every fork marks the distance between clusters. If fewer than 10 clusters are needed, they can be merged starting from bottom to top. Colours represent proximity, with same-coloured clusters being closer to each other than to clusters of other colours.
In order to assess the validity of the proposed clustering algorithm, we plot the mean and standard deviation of the average communication quality and communication length for each cluster at each eNB in Figure 6. We observe that close clusters are similar in observation length, and the difference between them is the mean MCS. This means that the hierarchical clustering separates observations first based on length and then based on quality. This is in agreement with the fact that DTW assigns higher distances to vectors that have different lengths. This helps in separating outliers such as clusters 7-10 in eNB B and 4-6 in eNB C. Once these are separated, the clustering is able to distinguish between different MCS values for vectors of similar length, as seen, for example, in clusters 8-10 in eNB A, 1-6 in eNB B and 1-3 in eNB C. This reasoning helps us verify that the performed clustering of users is meaningful and could be useful for additional downstream tasks such as classification, prediction, quick analysis of data from new eNBs, etc.
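A per-cluster summary of the kind used for this validation can be computed as sketched below. This is an illustration only: it uses the mean MCS of each vector as the quality measure and the number of samples as a proxy for communication length, both of which are assumptions, and the vectors and labels are toy values.

```python
import numpy as np
import pandas as pd

def summarise_clusters(mcs_vectors, labels):
    """Mean and standard deviation of per-user mean MCS and length, by cluster."""
    per_user = pd.DataFrame({
        "cluster": labels,
        "mean_mcs": [np.mean(v) for v in mcs_vectors],
        # Assumption: the number of samples stands in for communication length
        "length": [len(v) for v in mcs_vectors],
    })
    return per_user.groupby("cluster")[["mean_mcs", "length"]].agg(["mean", "std"])

# Toy vectors and labels (illustrative only)
vectors = [[5, 5, 5], [7, 7, 7], [25, 25, 25, 25, 25, 25]]
labels = [1, 1, 2]
summary = summarise_clusters(vectors, labels)
```

The standard deviation is undefined (NaN) for clusters with a single member, which is itself a quick way to spot very small outlier clusters.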

V. CONCLUSIONS
In this paper, we have presented methods to analyse the mobility of users, evaluate their channel quality and cluster them into groups given a set of prerecorded PDCCH traces from operative LTE eNBs. The relationship between transmission quality and user mobility has also been studied. We have introduced a mapping between MCS and SNR to assess the quality of the transmissions, a method for deriving users' approximate velocities and categorising user mobility, and a hierarchical clustering algorithm using Dynamic Time Warping that is able to generate meaningful user clusters according to communication length and quality. The hierarchical clustering algorithm is additionally able to detect data outliers by separating them into very small clusters. Applying the proposed methods to data coming from three different real eNBs, we observed that they successfully provide relevant information and insights about users' mobility and their channel quality, making them suitable for improving the understanding and planning of LTE resources by mobile network operators.