New Cluster Selection and Fine-grained Search for k-Means Clustering and Wi-Fi Fingerprinting

Wi-Fi fingerprinting is a popular technique for Indoor Positioning Systems (IPSs) thanks to its low complexity and the ubiquity of WLAN infrastructures. However, this technique may present scalability issues when the reference dataset (radio map) is very large. To reduce the computational costs, k-Means Clustering has been successfully applied in the past. However, it is a general-purpose algorithm for unsupervised classification. This paper introduces three variants that apply heuristics based on radio propagation knowledge in the coarse and fine-grained searches. Due to the heterogeneity both on the IPS side (including radio map generation) and in the network infrastructure, we used an evaluation framework composed of 16 datasets. The best proposed k-means variant provided better general positioning accuracy and a significantly lower computational cost (around 40% lower) than the original k-means.


I. INTRODUCTION
The user's position is key for many current applications and services [1]. While GNSS receivers embedded in modern smartphones enable positioning outdoors, GNSS-denied scenarios such as indoors, where humans spend more than 80% of their time [2,3], require other technological solutions.
Wi-Fi fingerprinting is a popular technique for position estimation due to its low deployment costs and the simplicity of the positioning algorithm [4]. The notion behind this technique is that a fingerprint, i.e., the Received Signal Strength (RSS) from the nearby Access Points (APs), is representative of the position where it was taken. For a fingerprint taken at an unknown position (operational fingerprint), its position can be computed using the k-Nearest Neighbour (k-NN) algorithm and a dataset with reference fingerprints taken at known positions.
Although this solution is widely used, the distance to all the reference fingerprints must be calculated to get the k nearest fingerprints and estimate the final position. Thus, it might suffer from scalability problems if the positioning algorithm is run on a low-profile device (e.g., a smart watch) or provided by a server accessed by multiple concurrent users. Some authors have applied clustering models to group similar fingerprints of the radio map [5,6,7,8]. The computation of the nearest neighbors is then split into two searches: the coarse search and the fine-grained search. The coarse search calculates the similarity of the operational fingerprint to all the cluster representatives, whereas the fine-grained search calculates the similarity of the operational fingerprint to all the reference fingerprints belonging to the selected cluster.
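As an illustration of the k-NN estimation described above, the following minimal sketch ranks the reference fingerprints by their distance to the operational fingerprint and averages the positions of the k closest ones. The Euclidean distance, the unweighted centroid, the function name `knn_position` and the toy radio map are assumptions made for the example only; they are not the configuration evaluated in this paper.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def knn_position(op_fp, radio_map, k=3):
    """Estimate a position as the centroid of the k closest reference fingerprints.

    radio_map is a list of (rss_vector, position) pairs (hypothetical layout).
    """
    # Rank every reference fingerprint by its RSS-space distance to op_fp.
    ranked = sorted(radio_map, key=lambda ref: dist(op_fp, ref[0]))
    nearest = [position for _, position in ranked[:k]]
    # Average the coordinates of the selected neighbours, axis by axis.
    return tuple(sum(axis) / len(nearest) for axis in zip(*nearest))


# Toy radio map: two APs, three reference points along a corridor.
radio_map = [
    ((-40.0, -80.0), (0.0, 0.0)),
    ((-45.0, -75.0), (2.0, 0.0)),
    ((-80.0, -40.0), (10.0, 0.0)),
]
print(knn_position((-42.0, -78.0), radio_map, k=2))  # (1.0, 0.0)
```

Note that the `sorted` call touches every entry of the radio map, which is the scalability concern that motivates clustering.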
Some alternative approaches to clustering use knowledge on the radio signal propagation to filter the radio map on the fly and reduce the computational costs. Some approaches identify the strongest AP in the operational fingerprint and then restrict the comparison to either the reference fingerprints where that AP was detected [9,10,11] or the reference fingerprints where that AP was also the strongest one [12]. In general, those filters present a trade-off between the accuracy and cost dimensions, i.e., the smaller the reduced/filtered radio map is, the worse the positioning error is. Current IPSs require solutions that provide better compromises between the two dimensions.
Although k-Means provides a good trade-off between the two dimensions, we identified two main problems. First, computing the similarity to all the clusters (coarse search) for every positioning request is inefficient if the number of clusters and the environment area are both too large [11]. Second, the fingerprints might not be equally distributed among the clusters. The fine-grained search in clusters much larger than the rest may degrade the benefits obtained from clustering.
We introduce three new, more computationally efficient variants of k-means clustering based on knowledge about signal propagation. The main contributions of this paper are:
• A new computationally-efficient way to reduce the clusters in the coarse search.
• Two new computationally-efficient ways to further reduce the reduced radio maps in the fine-grained search.
• A reproducible evaluation that comprises an extensive comparison on different scenarios.
The remainder of this paper is organized as follows. Section II briefly reviews related works on clustering and Wi-Fi fingerprinting. Section III describes the integration of the k-means clustering algorithm in Wi-Fi fingerprinting and our proposed variants. Section IV introduces the experimental setup and shows the empirical results. Section V draws the main conclusions of this work.
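The two strongest-AP filters mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the sentinel `NOT_DETECTED` for APs that were not heard, the function names and the toy radio map are hypothetical and are not taken from the cited works.

```python
NOT_DETECTED = -105.0  # assumed placeholder RSS (dBm) for APs that were not heard


def strongest_ap(fp):
    """Index of the AP with the highest RSS value in a fingerprint."""
    return max(range(len(fp)), key=lambda i: fp[i])


def filter_detected(op_fp, radio_map):
    """Keep the references in which the operational strongest AP was detected."""
    ap = strongest_ap(op_fp)
    return [ref for ref in radio_map if ref[ap] > NOT_DETECTED]


def filter_same_strongest(op_fp, radio_map):
    """Keep the references whose own strongest AP is the operational strongest AP."""
    ap = strongest_ap(op_fp)
    return [ref for ref in radio_map if strongest_ap(ref) == ap]


radio_map = [
    (-50.0, -70.0, NOT_DETECTED),
    (-80.0, -60.0, -90.0),
    (NOT_DETECTED, -65.0, -55.0),
]
op_fp = (-48.0, -75.0, NOT_DETECTED)
print(len(filter_detected(op_fp, radio_map)))        # 2
print(len(filter_same_strongest(op_fp, radio_map)))  # 1
```

The second filter keeps fewer references than the first, which illustrates the trade-off noted above: a smaller filtered radio map is cheaper to search but is more likely to discard useful neighbours.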
978-1-7281-6455-7/20/$31.00 © 2020 IEEE

II. RELATED WORK
Given that Wi-Fi fingerprint matching and large radio map sizes account for important computational loads [13,14], several authors applied approaches that solve the load issue while also maintaining or improving the positioning accuracy. Some authors tackled the issue using general-purpose unsupervised learning models. They applied the divide and conquer approach and broke down the whole radio map into smaller pieces. This is the case of clustering approaches like k-means [5,6] and Affinity Propagation [7,8]. In contrast to clustering, other authors proposed optimization heuristics based on their knowledge on signal propagation and Wi-Fi fingerprinting [11,12,15,16]. Most of the heuristics are based on the fact that the RSS value indicates, to some extent, the distance of the measurement device (e.g., smartphone or smartwatch) to the AP.
Shin et al. [5] proposed a tracking system that automatically builds a labeled topological map and estimates the users' location. In their place learning stage, they applied k-means to automatically organize the spaces in an unknown environment. According to the authors, the clustered topological radio map could determine the division of the operational area.
Abdullah et al. [17] slightly modified the k-means model by applying the Bregman divergence as distance for clustering formation, but still used the Euclidean distance for cluster determination in the online phase. The authors tested their proposal in terms of positioning accuracy against the original k-means and Affinity Propagation in a medium sized area.
Cramariuc et al. [18] tested k-means using Euclidean distance in the coordinate space and Affinity Propagation using Log-Gaussian distance in the feature (RSS) space for clustering formation in large multi-floor environments. They stated that the Affinity Propagation based on Log-Gaussian RSS distance obtained the largest time reductions while the k-means based on Euclidean coordinates distance obtained the best error, when compared with each other and with a non-clustering weighted k-NN approach.
Park et al. [19] tested k-means using Euclidean distance in the feature space for clustering formation in a small environment. The cluster determination in the online phase used a probability distance.
Anuwatkun et al. [20] tried k-means using Euclidean distance in the feature space for clustering formation in a small environment. Instead of using the RSS values directly, the authors used the strength difference among the APs.
In contrast to the previous works, k-means has also been used for coordinate-based clustering [18,21], floor-wise fingerprint clustering [6] and, even, to cluster the positions of the list of nearest neighbors provided by the k-NN algorithm [22].
All the previous papers have something in common: the knowledge on the radio signal propagation seems not to be fully exploited to, for instance, reduce the computational load of selecting the best cluster.

III. k-MEANS AND PROPOSED VARIANTS TO REDUCE THE COMPUTATIONAL LOAD IN THE ON-LINE STAGE
The k-means method [23] automatically divides the feature space into k non-overlapping regions (clusters) represented by their centroids (the mean of the cluster's fingerprint vectors). Cluster generation starts with random centroids, which are iteratively adapted by minimizing the intra-cluster distances. The algorithm minimizes the variance of the samples that fall within each cluster.
In this work, we used the enhanced cluster initialization procedure proposed in Arthur et al. [24] rather than the completely random one. Note that the improved initialization is also stochastic and the resulting clusters depend on the initial cluster representatives.
The information from the clusters is integrated in Wi-Fi fingerprinting using two phases:
• The off-line phase, which executes k-means over the reference fingerprints, obtaining k clusters. We could say that k-means provides a local version of the radio map for every cluster.
• The on-line phase, which finds the reference fingerprints most similar to the operational fingerprint in two steps. The first step selects the cluster whose centroid is the most similar to the operational fingerprint. The second step performs a fine-grained search on the selected cluster's fingerprints.
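The two phases above can be sketched in a few lines. This is a simplified illustration, not the implementation evaluated in the paper: it uses plain Lloyd iterations seeded with the first k points instead of the enhanced initialization used in this work, the Euclidean distance, and hypothetical function names (`kmeans`, `two_step_search`).

```python
from math import dist


def kmeans(points, k, iters=20):
    """Off-line phase: plain Lloyd iterations (first k points seed the centroids)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its closest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


def two_step_search(op_fp, points, centroids, labels, k_nn=1):
    """On-line phase: coarse search over centroids, then k-NN inside the chosen cluster."""
    best = min(range(len(centroids)), key=lambda c: dist(op_fp, centroids[c]))
    members = [i for i, label in enumerate(labels) if label == best]
    members.sort(key=lambda i: dist(op_fp, points[i]))
    return best, members[:k_nn]


points = [(-40.0, -80.0), (-80.0, -40.0), (-42.0, -78.0), (-78.0, -42.0)]
centroids, labels = kmeans(points, k=2)
print(two_step_search((-41.0, -80.0), points, centroids, labels))  # (0, [0])
```

Only the fingerprints of the selected cluster are compared in the second step, which is where the savings over plain k-NN come from.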
Under ideal conditions (uniform distribution of samples among the clusters) and choosing $k = \sqrt{n}$, the best asymptotic computation time of the cluster-based fingerprinting method is $O(\sqrt{n})$, where $n$ is the number of samples in the radio map, as shown in Figure 1.
Although k-means and k-NN are commonly used together, the meaning of the variable k in both models is quite different. It stands for the number of nearest neighbors to perform a supervised classification/regression in k-NN, whereas it stands for the number of clusters generated by the unsupervised algorithm in k-means.
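The square-root bound stated above can be checked with a short count of distance computations. The sketch below assumes perfectly balanced clusters and counts one comparison per centroid in the coarse search and one per cluster member in the fine-grained search; the cost function $C(k)$ is notation introduced here for illustration.

```latex
\begin{align*}
C(k) &= \underbrace{k}_{\text{coarse search}} + \underbrace{\frac{n}{k}}_{\text{fine-grained search}}, \\
\frac{dC}{dk} &= 1 - \frac{n}{k^{2}} = 0 \;\Longrightarrow\; k = \sqrt{n}, \\
C(\sqrt{n}) &= 2\sqrt{n} = O(\sqrt{n}).
\end{align*}
```

For comparison, plain k-NN needs $n$ comparisons per positioning request. When the clusters are unbalanced, the fine-grained term can be much larger than $n/k$.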
In the off-line stage, the three variants we propose determine the clusters (and their centroids) using k-means. In addition, they analyse the clusters to find information that is relevant for improving the search times in the on-line stage.

A. Proposed Variant 1: Improved coarse search
As in the traditional fingerprint model, a scalability problem may occur if the number of clusters is large. Computing the similarity of the operational fingerprint to all the clusters might be too inefficient. We propose an improved coarse search.
In the off-line stage, this variant finds a function $f_1$ that maps an AP to the set of clusters that are relevant for it, storing all the mappings. A cluster is said to be relevant for the $i$-th AP if the cluster contains at least one fingerprint $fp = (r_1, \ldots, r_{n_a})$ for which $|r_{\max} - r_i| \le \rho$, $1 \le i \le n_a$, where $n_a$ is the number of detected APs, $r_{\max}$ is the strongest RSS value of $fp$ and $\rho$ is a predefined threshold. The APs that do not map to empty sets are marked as operative.
In the on-line stage, for an operational fingerprint, the operative AP that reports the strongest RSS signal is determined. The function $f_1$ is then used to get a cluster set for that AP using the pre-calculated mappings. Later, the cluster selection in the coarse search is performed on that cluster set, using the common approach of selecting the cluster whose centroid is the most similar to the operational fingerprint. This variant performs the fine-grained search by applying k-NN directly over the selected cluster's fingerprints.
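A minimal sketch of the mapping and of the reduced coarse search described above is given below. The names `build_f1` and `coarse_search`, the dictionary representation of the mapping, the Euclidean distance and the fallback used when no operative AP is present in the operational fingerprint are assumptions made for illustration.

```python
from math import dist


def build_f1(fingerprints, labels, rho):
    """Off-line: map each AP index to the set of clusters that are relevant for it."""
    f1 = {}
    for fp, cluster in zip(fingerprints, labels):
        r_max = max(fp)  # strongest RSS value of this fingerprint
        for ap, rss in enumerate(fp):
            if abs(r_max - rss) <= rho:
                f1.setdefault(ap, set()).add(cluster)
    return f1  # the keys are the operative APs


def coarse_search(op_fp, centroids, f1):
    """On-line: closest centroid among the clusters of the strongest operative AP."""
    operative = [ap for ap in range(len(op_fp)) if ap in f1]
    if operative:
        strongest = max(operative, key=lambda ap: op_fp[ap])
        candidates = f1[strongest]
    else:
        # Assumed fallback (not described in the text): consider every cluster.
        candidates = range(len(centroids))
    return min(candidates, key=lambda c: dist(op_fp, centroids[c]))


fingerprints = [(-40.0, -80.0, -90.0), (-42.0, -44.0, -90.0),
                (-85.0, -45.0, -88.0), (-90.0, -50.0, -52.0)]
labels = [0, 0, 1, 2]
centroids = [(-41.0, -62.0, -90.0), (-85.0, -45.0, -88.0), (-90.0, -50.0, -52.0)]
f1 = build_f1(fingerprints, labels, rho=3)
print(f1)                                                # {0: {0}, 1: {0, 1, 2}, 2: {2}}
print(coarse_search((-88.0, -55.0, -50.0), centroids, f1))  # 2
```

In the last call the strongest AP maps to a single cluster, so only one centroid distance is computed instead of three.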

B. Proposed Variant 2: Soft-filtered fine-grained search
The k-means model does not guarantee that generated clusters are balanced. Therefore, we improved in this variant the fine-grained search for oversized clusters.
The second variant adds to the first variant a filtering step in the fine-grained search. The filtering is applied to oversized clusters, i.e., those whose number of fingerprints exceeds $4 \cdot n/c$, where $n$ is the number of reference fingerprints in the entire radio map and $c$ is the number of clusters.
In the off-line stage, this variant determines an additional function $f_2$ for oversized clusters. This function maps an AP and a cluster to the subset of the fingerprints that are relevant to that AP and belong to that cluster. In this function, a fingerprint is deemed relevant for an AP if it contains a valid RSS value for the AP.
In the on-line stage, the AP is determined and a cluster is selected as explained for Variant 1. If the cluster is oversized, $f_2$ is then used for that cluster and AP to obtain the subset of fingerprints where the fine-grained search is performed, i.e., over which the k-NN is applied. Otherwise, the fine-grained search is applied as explained for Variant 1.
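The soft filtering described above can be sketched as follows. The sentinel `NOT_DETECTED` used to decide whether an RSS value is valid, the function names and the toy data are assumptions for illustration only.

```python
NOT_DETECTED = -105.0  # assumed placeholder RSS (dBm) for APs that were not heard


def oversized_clusters(labels, n_clusters, factor=4):
    """Clusters holding more than factor * n / c fingerprints."""
    limit = factor * len(labels) / n_clusters
    sizes = {}
    for label in labels:
        sizes[label] = sizes.get(label, 0) + 1
    return {cluster for cluster, size in sizes.items() if size > limit}


def build_f2_soft(fingerprints, labels, oversized):
    """Off-line: map (AP, oversized cluster) to members with a valid RSS for that AP."""
    f2 = {}
    for idx, (fp, cluster) in enumerate(zip(fingerprints, labels)):
        if cluster not in oversized:
            continue
        for ap, rss in enumerate(fp):
            if rss > NOT_DETECTED:
                f2.setdefault((ap, cluster), []).append(idx)
    return f2


def fine_candidates(ap, cluster, labels, f2, oversized):
    """On-line: indices of the fingerprints to compare in the fine-grained search."""
    if cluster in oversized:
        return f2.get((ap, cluster), [])
    return [i for i, label in enumerate(labels) if label == cluster]


# Toy radio map: 18 fingerprints, 10 clusters, cluster 0 holds 9 of them.
fingerprints = ([(-50.0, NOT_DETECTED)] * 5 + [(-60.0, -70.0)] * 4
                + [(-90.0, -40.0)] * 9)
labels = [0] * 9 + list(range(1, 10))
oversized = oversized_clusters(labels, n_clusters=10)
f2 = build_f2_soft(fingerprints, labels, oversized)
print(oversized)                                       # {0}
print(fine_candidates(1, 0, labels, f2, oversized))    # [5, 6, 7, 8]
```

For the oversized cluster, the fine-grained search for the second AP covers four fingerprints instead of nine; clusters of regular size are searched in full.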

C. Proposed Variant 3: Hard-filtered fine-grained search
The third variant is based on the second one, defining $f_2$ in a more restrictive way. For this variant, a fingerprint from a cluster is only considered relevant for the $i$-th AP if the fingerprint $fp = (r_1, \ldots, r_{n_a})$ satisfies $|r_{\max} - r_i| \le \rho$, $1 \le i \le n_a$, with $n_a$, $r_{\max}$ and $\rho$ as defined for $f_1$ in Variant 1.
In the on-line stage, the coarse and fine-grained searches are applied as explained for Variant 2.
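Under the same illustrative assumptions as before (hypothetical function name and toy data), the hard filter only changes the relevance test used to build the mapping for oversized clusters:

```python
def build_f2_hard(fingerprints, labels, oversized, rho):
    """Off-line: map (AP, oversized cluster) to members where the AP is within rho of the strongest RSS."""
    f2 = {}
    for idx, (fp, cluster) in enumerate(zip(fingerprints, labels)):
        if cluster not in oversized:
            continue
        r_max = max(fp)  # strongest RSS value of this fingerprint
        for ap, rss in enumerate(fp):
            if abs(r_max - rss) <= rho:
                f2.setdefault((ap, cluster), []).append(idx)
    return f2


fingerprints = [(-50.0, -52.0, -90.0), (-50.0, -70.0, -90.0), (-80.0, -49.0, -51.0)]
labels = [0, 0, 0]
f2 = build_f2_hard(fingerprints, labels, oversized={0}, rho=3)
print(f2)  # {(0, 0): [0, 1], (1, 0): [0, 2], (2, 0): [2]}
```

Every fingerprint in this toy cluster contains a valid RSS value for all three APs, so the soft filter of Variant 2 would keep all of them for any AP, whereas the hard filter keeps at most two.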

A. Experimental Setup
Clustering has been explored many times in the IPS literature. However, the diversity in implementation details, evaluation criteria and evaluation scenarios prevents credible comparisons using the reported results. Thus, we created an experimental setup that includes the k-NN as core IPS, two sets of hyperparameters for k-NN (Simple Configuration and Best Configuration), 3 variants for k-means, 16 datasets and 10 execution runs. The clusters have been randomly generated ensuring that k-means and the 3 variants share the same initialization for each dataset and execution run.
The hyperparameters for k-NN are the RSS data representation, the value of k and the distance function [25]. Simple Conf. stands for k = 1, Manhattan distance and positive data representation. Best Conf. stands for the hyperparameter configuration that reported the lowest positioning error for a dataset after evaluating 144 alternatives.
The datasets were collected at Tampere University [6,18,26], University Jaume I [27,28], University of Mannheim [29], and University of Minho. Supplementary materials, with method implementation and dataset explanation, are available in Zenodo [30] for research reproducibility.
Finally, the results collected for this paper are the mean 3D positioning error ($\epsilon_{3D}$) and the computational time ($\tau_{DB}$) resulting from processing all the operational fingerprints. Due to the heterogeneity of the datasets, we report the normalized values, $\tilde{\epsilon}_{3D}$ and $\tilde{\tau}_{DB}$, against the results from a baseline method (plain k-NN with the Simple Configuration). Due to the length limit, we report the average of the normalized values for the 16 datasets.
Table I shows the results for four models of Wi-Fi fingerprinting based on k-NN: (1) plain k-NN, without any optimization; (2) Moreira, which applies the heuristic proposed by Moreira et al. [12]; (3) Gallagher, optimized as proposed by Gallagher et al. [11]; and (4) k-means. For the latter model, we considered 3 values of k for k-means: 25, rfp1 = $\sqrt{n}$ and rfp2 = $n/25$, where $n$ is the number of reference samples. For all models, the Best Configuration provides significantly better accuracy than the Simple Configuration at the expense of a significantly higher computational cost. The Best Configuration includes computationally expensive distance metrics, such as the Log-Gaussian distance [18], in some datasets.
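The normalization described above can be sketched as a per-dataset ratio against the baseline followed by an average over datasets. The function name, the dataset labels and the numeric values below are made up for illustration and are not results from this paper.

```python
def normalized_mean(method, baseline):
    """Average over datasets of the method's value divided by the baseline's value."""
    ratios = [method[name] / baseline[name] for name in method]
    return sum(ratios) / len(ratios)


# Hypothetical mean 3D errors (in metres) for two made-up datasets.
baseline_error = {"DS_A": 4.0, "DS_B": 10.0}
method_error = {"DS_A": 2.0, "DS_B": 5.0}
print(normalized_mean(method_error, baseline_error))  # 0.5
```

A normalized value below 1 means that the method improves on the baseline for that metric, so datasets with very different absolute errors or run times contribute equally to the average.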

B. Results
As expected, the plain k-NN model reports the largest computational times. The Moreira model provides the lowest general computational cost in the two configuration cases. However, it provides the highest mean positioning error. In contrast, the Gallagher model has an accuracy similar to the plain k-NN model, but its time cost is reduced only to a third at best.
The solutions based on the k-means model provide a good trade-off between the accuracy and time cost dimensions. Although their mean accuracies are slightly worse than those obtained for the other models, their mean computational cost is reduced more than ten times. Figures 2-4 introduce additional analyses on the clusters generated by k-means, considering all evaluated operational fingerprints.
Figure 2 shows the number of clusters involved in the coarse search, which is fixed when the same k is used in all datasets. However, the number of clusters varies when it depends on a heuristic. For k = rfp1, the majority of coarse searches involve more than 50 clusters, reaching almost 150 in some cases. A similar behavior is obtained for k = rfp2, where the coarse search involves more than 200 clusters in 22% of cases.
Figure 3 shows that the number of fingerprints in the fine-grained search is usually low, fewer than 200 in the vast majority of cases. For k = 25, the fine-grained search involves more than 800 reference samples in 19.2% of cases. A heavy fine-grained search might occur when the dataset is large and k is too low, but also when the fingerprints are not equally distributed among the clusters.
Fig. 4. Histogram of the ratio a/e, where a is the number of fingerprint comparisons (fine-grained search in k-means) and e is the number of comparisons to be performed if the clusters had the same size for a radio map (ratios > 4 in red).
Figure 4 shows that the relative cluster size with respect to the expected size (i.e., an equally distributed partition with n/c samples per cluster) is usually around 1. However, it is more than 4 times the expected size in 20.8% (k = 25), 10.8% (k = rfp1) and 7.8% (k = rfp2) of cases.
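The ratio analysed in Figure 4 can be computed as sketched below, where the actual size of the cluster selected for each operational fingerprint is divided by the expected size n/c of a balanced partition. The function names and the toy cluster assignment are assumptions for illustration.

```python
from collections import Counter


def size_ratios(selected_clusters, labels, n_clusters):
    """Ratio a/e for each operational fingerprint, given its selected cluster."""
    expected = len(labels) / n_clusters  # e: size of a perfectly balanced cluster
    sizes = Counter(labels)              # a: actual size of each cluster
    return [sizes[cluster] / expected for cluster in selected_clusters]


def share_above(ratios, threshold=4.0):
    """Fraction of operational fingerprints whose ratio exceeds the threshold."""
    return sum(ratio > threshold for ratio in ratios) / len(ratios)


# Toy case: 18 reference fingerprints, 10 clusters, cluster 0 holds 9 of them.
labels = [0] * 9 + list(range(1, 10))
selected = [0, 0, 1, 2]  # cluster chosen for each of four operational fingerprints
ratios = size_ratios(selected, labels, n_clusters=10)
print(share_above(ratios))  # 0.5
```

Because large clusters are also selected more often, a few oversized clusters can account for a considerable share of the positioning requests.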
k-means provides unbalanced subsets of the radio map, especially in complex datasets with multiple devices and a non-regular spatial distribution of reference points.
Table II and Figure 5 show the general results for the three proposed variants under different parametrization conditions, namely k for k-means and ρ for the relevance calculation.
According to the general results shown in the table, Variant 3 always provides the lowest general computational cost. This makes sense as it applies the improved coarse search introduced in Variant 1 and a more restrictive filtering in the fine-grained search than Variant 2. However, Variant 1 reports the best general results for the IPS with the Best Configuration, whereas Variant 2 is better for the Simple Configuration. For the three values of k, the variants improve the original k-means in both dimensions, as shown in Figure 5.
Regarding the value of k for k-means, there is still a trade-off between the value of k and the results. However, the improved coarse and fine-grained searches make the differences between k = rfp1 and k = rfp2 insignificant in terms of positioning accuracy for the Simple Configuration. In general, the lowest computational load is provided when k = rfp2.
The threshold value ρ of the proposed variants has a significant impact on the results. The time cost increases as ρ increases. The ρ value indicates how restrictive or permissive the relevance function is for the coarse-search filtering. Furthermore, neither very high nor very low ρ values are suitable. The lowest threshold (ρ = 0, solid triangles in Figure 5) is too restrictive and relevant fingerprints are discarded for the fine-grained search, whereas the highest threshold (ρ = 12) is too permissive so that outliers are included in the position computation.
If we balance the results of all the proposed alternatives, including the different parameters and base IPS configurations, it seems that the proposed Variant 2 with ρ = 3 is a good choice. This particular variant with that threshold value significantly improves the traditional k-means in both dimensions (positioning error and computational time) independently of the value of k (for k-means).

V. CONCLUSIONS
This paper introduced three new variants to improve the coarse and fine-grained search in Wi-Fi fingerprinting when k-means clustering is used to partition the full radio map. The proposed Variant 2, with an improved coarse search and a soft-filtered fine-grained search, seems to be a good choice in terms of positioning accuracy and computational costs.
The optimization of the coarse search makes it more computationally efficient, especially when the number of clusters is large. As a side effect, removing non-relevant clusters reduces the presence of outlier centroids and, therefore, the positioning accuracy is slightly improved. The proposed filtering at coarse search based on relevant clusters works when it is neither too restrictive nor too permissive (e.g., ρ = 3).
The generated clusters may significantly differ in size. The time cost of the fine-grained search depends on the cluster into which the operational fingerprint falls. Some clustering benefits might be lost if the cluster is oversized. Variants 2 and 3 successfully deal with this issue, reducing the computational cost of the traditional k-means to almost a half.
Finally, we consider that this work is just the first step to improve the accuracy of k-means in Wi-Fi fingerprinting problems. Machine learning models, such as k-means and k-NN, were designed for general-purpose problems and, therefore, might not totally fit Wi-Fi fingerprinting. The indoor positioning community should try to have a better understanding of the machine learning models in order to introduce some specific knowledge about, for instance, the signal propagation. Including this knowledge about the strongest AP has improved k-means in both dimensions (positioning error and computational time) in our work. As future work, we envision the definition of more refined variants, a comprehensive dataset-wise analysis and the inclusion of other well-known clustering models.