Image and WLAN Bimodal Integration for Indoor User Localization

Recently, we have witnessed the increasing prevalence of wearable cameras, some of which feature Wireless Local Area Network (WLAN) connectivity, and an abundance of mobile devices equipped with on-board camera and WLAN modules. Motivated by this trend, this work presents an indoor localization system that leverages both imagery and WLAN data to enable and support a wide variety of envisaged location-aware applications, ranging from ambient and assisted living to indoor mobile gaming and retail analytics. The proposed solution integrates two complementary localization approaches, one based on WLAN and another based on image location-dependent data, using a fusion engine. Two fusion strategies are developed and investigated to meet different requirements in terms of accuracy, run time, and power consumption. The first is a lightweight threshold-based approach that combines the location outputs of two localization algorithms: a WLAN-based algorithm that processes signal strength readings from the surrounding wireless infrastructure using an extended Naive Bayes approach, and an image-based algorithm that follows a novel approach based on a hierarchical vocabulary tree of SURF (Speeded Up Robust Features) descriptors. The second fusion strategy employs a particle filter algorithm that operates directly on the WLAN and image readings and also includes prior position estimation information in the localization process. Extensive experimental results using real-life data from an indoor office environment indicate that the proposed fusion strategies perform well and are competitive against standalone WLAN and image-based algorithms, as well as alternative fusion localization solutions.


INTRODUCTION
Different measurement types currently available from modern commercial portable hardware (cellphones, tablets) are able, if properly fused, to offer diverse information and lead to improved accuracy in indoor localization [1], [2], [3]. This work addresses indoor localization using Wireless Local Area Network (WLAN) technology in conjunction with image sensing. Whilst Global Navigation Satellite Systems (GNSS), such as the Global Positioning System (GPS), have become synonymous with user localization, their robustness and availability under certain conditions is questionable. For instance, outdoors, satellite signals can be affected by obstacles, multipath propagation and tall buildings, which inevitably lead to high location errors. In addition, satellite signals are weak or totally blocked inside buildings.
WLAN technology has demonstrated promising performance in indoor localization; however, it requires accurate modeling of the complex indoor multipath propagation environment and of varying signal obstructions or reflections due to motion [4], [5], [6]. Researchers have also investigated image-based localization, for example in [7], which is associated with its own challenges such as occlusion, changes in lighting, noise and blur. Many localization methods in the literature are based on the hybridization or fusion of Ultra-Wide Band (UWB) and WLAN, WLAN and Radio Frequency (RF) tags (indoors), and GPS and WLAN (outdoors) [8], [9], [10], [11]. Presently, there is a limited number of localization solutions based on the fusion of RF and image sensing methods [5], [12], [13], [14], [15].
The motivation for combining WLAN and image data to infer user location indoors is that these are fundamentally different and complementary sensor modalities, which in combination may provide rich information on the observed scene and mitigate errors associated with each individual modality. In fact, there are a number of dynamic adjustments during system operation that can be made to meet diverse application-specific requirements in terms of positioning error and computational complexity, the latter being directly linked to battery depletion on mobile devices. Moreover, modern sensor-rich smartphones can easily be employed as WLAN and image data acquisition hubs, e.g., see the Campaignr micro-publishing platform [16]. Such a localization system can be oriented towards the context-aware needs and capabilities of a user and becomes extremely useful for a multitude of applications, including ambient assisted living (i.e., assistive technologies for memory and visually impaired individuals), tourist-oriented services that enhance user experience in museums and galleries, indoor gaming, in-shop advertisement and coupon distribution, as well as health and daily life monitoring. For example, for memory-impaired people, taking and using images in the localization framework works as a memory prosthesis: these images can be automatically segmented and clustered into specific events during a particular time-frame or activity, thus allowing people to recall different aspects of their daily lives. Other possible uses of the proposed system include navigation assistance for visually impaired people, health and daily life monitoring especially for the elderly, and indoor localization for enhancing indoor vehicle/robot autonomy.
In this work, the problem of efficiently integrating WLAN Received Signal Strength (RSS) and image information for indoor localization is addressed by two fusion strategies. The high-level block diagram of the proposed system architecture is depicted in Fig. 1. All the algorithms for WLAN localization, image localization and fusion are executed by a central unit in a Location Server that resides on the network side, for example the standard laptop used in our experimental setup in Section 7. The WLAN- and image-equipped Mobile Device (e.g., smartphone, robot, etc.) collects the measurements and forwards them to the Location Server. We adopt this device-assisted approach to avoid heavy computation on the device that may drain the battery quickly, although the proposed algorithms could run on the mobile device, in a fully device-based architecture, as long as battery and storage space (for storing the fingerprint and image databases) are not critical. During localization, an image of the surrounding environment (e.g., captured by a smartphone's camera) and the RSS values from WLAN Access Points (APs) in the vicinity, referred to as a fingerprint, are provided as inputs to the system.
In the late fusion approach (flow shown in solid lines), the WLAN Localization component computes a location by matching the input RSS fingerprint against the location-tagged fingerprints that have been collected in advance and stored in the fingerprint database. Similarly, the Image Localization component compares the input image with the location-tagged images that span the entire area of interest and are stored in the image database. Then, the Fusion Engine employs the Threshold-based component, which combines the WLAN-based and image-based locations to output the final user location.
Alternatively, in the early fusion approach (flow shown in dashed lines), the WLAN and image readings are fed directly into the Fusion Engine, which employs the Particle filter component to fuse the location-dependent data with the aid of an underlying user mobility model that introduces prior location information into the localization process. Thus, contrary to the late fusion approach, the computation of intermediate locations by dedicated localization components is avoided. In this work, we focus on the combination of WLAN and image data without exploiting other data (e.g., inertial and magnetic) that are available on modern sensor-rich smartphones. We demonstrate that reasonable accuracy can be achieved with only these two modalities, while outperforming other similar solutions. Nevertheless, the proposed system could be further enhanced by incorporating such sensor data into the particle filter, enriching the underlying kinematic model with more accurate information on the displacement and orientation of the particles.
Besides extending the Naive Bayes approach of [17] to build our WLAN localization algorithm, the main contributions of this work are the following.
- For image-based localization, we introduce a novel algorithm that follows an interest point-based approach and employs a variation of a hierarchical vocabulary tree to efficiently match query images with training images.
- For fusion, two design options are considered to optimally combine WLAN and camera sensory data, namely a light-weight threshold-based scheme and a flexible particle filter algorithm.
- The use of location quality indicators is explored for dynamically enabling/disabling a modality acquisition and location computation path in a hybrid fashion to meet different requirements. For instance, if the number of sensed WLAN APs in the measured RSS fingerprint is small (indicating that the WLAN-based location might be inaccurate), then the image sampling and localization path can be enabled to deliver the desired accuracy (otherwise it is disabled to extend battery lifetime).
- The trade-offs between the WLAN and image modalities, as well as the fusion options, are investigated and compared in terms of positioning error, computational complexity, and power consumption. The resulting insights lead to useful guidelines and best practices for optimizing the operation of such a fusion localization solution.

This paper is structured as follows. Section 2 overviews related work on indoor localization. Section 3 describes our WLAN-based localization method, while Section 4 introduces the novel image-based localization approach. Threshold-based fusion is presented in Section 5, while fusion by means of a particle filter is described in Section 6. Section 7 describes the experimental setup and data collection process. A selection of results is presented and compared in Section 8. Time efficiency pertaining to the different options is analyzed in Section 9, followed by a comparison with other methods in Section 10. Finally, conclusions and directions for future work are outlined in Section 11.
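To make the hybrid enabling/disabling idea described above concrete, the following Python fragment sketches one possible gating rule: the image acquisition/localization path is enabled only when a location quality indicator (here, the number of sensed APs in the RSS fingerprint) suggests that the WLAN-based estimate may be inaccurate. The threshold value and all names are illustrative assumptions, not taken from this work.

```python
# Illustrative hybrid gating sketch (assumed threshold, not from the paper).

MIN_APS = 5  # assumed quality threshold on the number of sensed APs

def choose_paths(rss_fingerprint):
    """Decide which modality paths to run for one localization request."""
    enough_aps = len(rss_fingerprint) >= MIN_APS
    return {
        "wlan": True,               # WLAN is always sampled (cheap)
        "image": not enough_aps,    # imaging only when WLAN looks unreliable
    }
```

In this sketch, a fingerprint with few APs triggers the (more power-hungry) image path, while a well-covered fingerprint keeps the camera path disabled to extend battery lifetime.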

WLAN-Based Localization
WLAN has been a very popular technology for indoor location determination, mainly due to the ease of collecting and fingerprinting signal strength measurements with wireless mobile devices from the ubiquitous WLAN infrastructure inside buildings; see [18] and references therein for a survey of recent advances. Many localization solutions complement WLAN with inertial sensors (i.e., accelerometer, gyroscope, magnetometer, barometer) and floorplan maps, like the Anyplace indoor navigation service [19], or further augment it with ambient light and sound signals as in SurroundSense [20]. This trend has been spurred by the availability of sensor-rich mobile devices.
UnLoc is an unsupervised indoor localization system that leverages smartphone sensors to compute the displacement and direction of users to avoid the need for war-driving for populating the fingerprint database [21]. LiFS utilizes the spatial relation of RSS fingerprints, so that the collected fingerprints are distributed according to collectors' mutual distances in real world [22]. WILL is another system that combines WLAN fingerprints with user movements to infer user location without site survey or knowledge of AP locations [23].
Authors in [24] present a WLAN-based system that employs principal component analysis in an efficient mechanism for replacing sets and subsets of available APs. In [25], by reducing both the volume of collected data and the number of data collection points, the radio map can be successfully rebuilt using an interpolation approach. Along the same line, SEAMLOC uses a novel interpolation algorithm, based on the specification of robust, range and angle-dependent likelihood functions [4]. Authors in [26] discuss the reduction of severe fluctuations of RSS and propose a scheme that efficiently extracts the signal for user localization.
In this work, for the WLAN-based localization module we build upon and extend the Naive Bayes approach [17]. We explicitly modelled signal strength distributions coming from available APs together with distributions of frequency of appearance of these APs, and eventually used them in our WLAN-based indoor localization framework, as described in Section 3.

Vision-Based Localization
Vision-based localization has drawn attention due to the rich information contained in image measurements, its passive nature, and the fact that vision provides most of the human sensory information. A few methods employ a visual vocabulary tree using Scale Invariant Feature Transform (SIFT) features [27], [28], while others such as [29], [30] use landmarks to implement indoor localization. The landmarks represent features or groups of features detected from the images. During the search phase, features detected in the query image are matched to the landmarks.
Based on a series of images or video sequences one is able to construct a topological map, and then to refine it by employing learning vector quantization [31]. In the online phase, similar regions in the query image are detected using a nearest neighbor rule.
Localization based on stereo-imaging has also been studied as stereo images can provide depth insights for 3D reconstruction [29], [32]. An indoor localization algorithm based on an efficient database search using robust matching algorithms is presented in [33].
What differentiates our approach from other image-based localization solutions is the employment of a verification mechanism, based on bidirectional image feature matching within a vocabulary tree framework, which refines the location predictions and thus improves the final user location estimate. Moreover, we employ a heuristic to fix the cluster centers of the vocabulary tree.

Hybrid and Fusion-Based Localization
Even though some early vision-based localization systems used only image processing and matching techniques, several recent solutions rely on the combination of location estimates derived with camera and other technologies. For instance, authors in [13] propose a particle filter for fusing positioning information from cellular base stations and images.
Alternatively, imagery and other sensory data can be directly fused to determine user location. For example, RAVEL (Radio And Vision Enhanced Localization) fuses visual information coming from cameras and WLAN readings [34], while [35] proposes a camera-assisted region-based magnetic field fingerprinting technique. Going further by fusing more sensors, Travi-Navi is a vision-guided navigation system that employs magnetic field distortions and WLAN signals to achieve robust and effective user indoor tracking [36]. Similarly, the system proposed in [37] combines opportunistic WLAN signals and magnetic field readings together with camera-based positioning in areas with fewer magnetic disturbances to assist magnetic field positioning. In our previous work [2], we introduced a WLAN-based algorithm and an image matching framework to support image-based localization coupled with a simple hybrid process.
Bayesian filtering is a powerful tool for processing and fusing location-dependent data from diverse sources. For instance, Kalman filter is used for target tracking in collaborative camera sensor networks [38], while an error-state Kalman filter is proposed in [39] that combines measurements from moving vision sensors and radio ranging equipment to estimate user position over time.
Particle filter is a sequential Monte Carlo method based on Bayesian inference that enables fusion of heterogeneous measurements, non-linear relationships between measurements and the target state and estimates non-Gaussian posterior distributions [40]. In the context of indoor localization, apart from [1], [21], and [36], authors in [41] employ particle filter for fusing inertial sensory data on android phones.
Fusion of data from a network of security cameras and RSS fingerprint observations is presented in [14] to enable the simultaneous tracking of multiple individuals inside indoor environments. Another work addresses object tracking with a solution that consists of a camera recording method based on color features of the target and a WLAN-based localization algorithm [5].
Authors in [15] describe an object tracking scheme that employs a sensor fusion approach composed of visual and location information estimated from WLAN signal strength values. Switching between fusion-based (i.e., image and WLAN) and purely WLAN-based location is decided as follows: in areas where an image can be taken the system gives priority to fusion, whereas in areas that images cannot cover priority is given to WLAN.
In [12] the authors discuss an approach that combines WLAN-based localization and static camera tracking. The purpose of fusing WLAN and video data is to reduce localization error in the rooms where there is a camera, in contrast to using only WLAN that still offers room level accuracy when no cameras are present.
Our work is closest to the systems discussed in [15] and [12]; however, the proposed solution employs a novel image-based localization algorithm and fuses image and WLAN signals by means of either a threshold-based or a particle filter algorithm, trading off positioning error against computational time and energy consumption depending on the application scenario. Compared to the previously mentioned approaches, these two fusion methods are complementary in terms of how they integrate the sensing modalities and interpret results, and we also propose hybrid fusion as a viable way of trading off the efficiency and accuracy of such a localization system.

WLAN-BASED LOCALIZATION
Probabilistic WLAN localization techniques based on fingerprinting start with the acquisition of training observations consisting of RSS information at Calibration Points (CP) distributed along a dense grid throughout the building [17], [42]. To calculate the probability of a user being at a particular CP given the RSS values that he/she observes, we employ a Naive Bayes method, which represents an extension of the Bayes and Naive Bayes classifiers. This algorithm takes into account the RSS values of WLAN APs and also the frequency of the appearance of these APs.
A signature for each CP is defined as a set of W distributions of RSS values from W APs and a distribution representing the number of appearances of the W APs received at this CP. Let C ∈ {1, 2, ..., K} denote the CP random variable, where K is the number of CPs; X_m ∈ {1, 2, ..., W} represents the m-th AP random variable; and Y_m ∈ {s_1, ..., s_V} is the RSS value received from the m-th AP, where W is the number of APs, M is the number of APs in an observation, and V is the number of discrete RSS values. It is not necessary that each AP produces receivable signals at each CP; indeed, whether an AP signal can be received at a CP can vary with time depending on the state of the radio channel. The joint distribution factorizes as

    P(C, X_1, Y_1, ..., X_M, Y_M) = P(C) ∏_{m=1}^{M} P(X_m | C) P(Y_m | C, X_m).   (1)

Using the Naive Bayes approach and one testing observation o = {(x_m, y_m)}_{m=1}^{M}, the likelihood that the user is at location c can be written as

    P(c | o) ∝ P(C = c) ∏_{m=1}^{M} P(X_m = x_m | C = c) P(Y_m = y_m | C = c, X_m = x_m).   (2)

Based on (2), we can obtain a ranking of the CPs according to P(c | o), i.e., the first, second, third and so on CP where the user is most likely located.
In the absence of any other information, the a priori probability distribution of the user location, P(C = c), is presumed to be uniform. The distribution of AP x given a location c, P(X_m = x | C = c), is multinomial, while the probability of signal strength y given location c and AP x, P(Y_m = y | C = c, X_m = x), is a normalized histogram. Using the identity function I(·, ·) in a maximum likelihood estimation framework, the sufficient statistics are

    I(c^(n), c) I(x_m^(n), x),   (5)

in which we observe the frequency of appearance of the APs, while in

    I(c^(n), c) I(x_m^(n), x) I(y_m^(n), y),   (6)

we also take into account the corresponding RSS values, which are eventually used to calculate the conditional probabilities of APs and signal strengths. We evaluate these probabilities as follows. The probability of AP x given location c is

    P(X_m = x | C = c) = Σ_n Σ_m I(c^(n), c) I(x_m^(n), x) / Σ_n Σ_m I(c^(n), c),   (7)

while the probability of signal strength y given location c and AP x is

    P(Y_m = y | C = c, X_m = x) = Σ_n Σ_m I(c^(n), c) I(x_m^(n), x) I(y_m^(n), y) / Σ_n Σ_m I(c^(n), c) I(x_m^(n), x).   (8)

These are estimates of the signature parameters, for every AP and for every RSS value that can be observed from that AP. We rescale the corresponding probabilities of the candidate CPs in (2) to sum to one and denote the new values as the CP confidences, p_i. To calculate the final user location we use the Minimum Mean Square Error (MMSE) estimation algorithm

    r_W = Σ_{i=1}^{K} p_max_i · CP_max_i,

where the first, second, ..., K-th ranked CP positions, their corresponding confidence values, and the user location output are denoted by CP_max1, CP_max2, ..., CP_maxK, p_max1, p_max2, ..., p_maxK, and r_W, respectively.
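As an illustration, the Naive Bayes scoring in (2) combined with the MMSE location estimate can be sketched in Python. This is a toy sketch under stated assumptions, not the authors' implementation: `P_x` and `P_y` stand for pre-estimated probability tables P(X = x | C = c) and P(Y = y | C = c, X = x), and `cp_xy` holds the CP coordinates; all names are illustrative.

```python
import numpy as np

def locate_wlan(observation, P_x, P_y, cp_xy):
    """observation: list of (ap_index, rss_index) pairs for one fingerprint.
    Returns the MMSE location estimate over the CP grid."""
    K = P_x.shape[0]
    log_lik = np.zeros(K)                        # uniform prior P(C = c)
    for x, y in observation:
        log_lik += np.log(P_x[:, x] + 1e-12)     # P(X_m = x | C = c)
        log_lik += np.log(P_y[:, x, y] + 1e-12)  # P(Y_m = y | C = c, X_m = x)
    p = np.exp(log_lik - log_lik.max())
    p /= p.sum()                                 # rescaled CP confidences p_i
    return p @ cp_xy                             # MMSE: sum_i p_i * CP_i

# Toy example with K=2 CPs, W=2 APs, V=3 discrete RSS bins:
P_x = np.array([[0.7, 0.3], [0.2, 0.8]])
P_y = np.tile(np.array([1 / 3, 1 / 3, 1 / 3]), (2, 2, 1))
cp_xy = np.array([[0.0, 0.0], [4.0, 0.0]])
r_W = locate_wlan([(0, 1)], P_x, P_y, cp_xy)
```

Working in log space avoids numerical underflow when many APs contribute to the product in (2).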

NOVEL IMAGE-BASED LOCALIZATION
For image-based localization, we use a feature point-based approach that employs a variation of a vocabulary tree supported by bidirectional matching to obtain the user location; see Fig. 2. Besides extending the vocabulary tree concept, three novel contributions are proposed:
- the use of quantized features and a two-branch vocabulary tree to speed up the setup and the localization process;
- a re-estimation procedure for fixing the cluster centers of the hierarchical vocabulary tree of SURF features, as part of an extended k-means algorithm;
- a bidirectional matching approach used to reorder locations previously ranked by the vocabulary tree-based method.

Speeded Up Robust Features (SURF) is an image interest point detector and descriptor, robust to lighting, viewpoint changes, and changes in scale [43]. It uses a Haar wavelet approximation of the determinant of the Hessian blob detector

    H(l, σ) = [ L_ξξ(l, σ)  L_ξz(l, σ) ; L_ξz(l, σ)  L_zz(l, σ) ],

where L_ξξ(l, σ) is the convolution of the Gaussian second-order derivative ∂²g(σ)/∂ξ² with the image I at point l, and similarly for L_ξz(l, σ) and L_zz(l, σ); ξ and z denote the orthogonal coordinate axes of a two-dimensional Cartesian coordinate system associated with the image.
A SURF interest point is selected at a distinctive location in the image (T-junctions, corners, blobs) and its neighborhood is represented by a descriptor vector. Haar wavelet responses in the ξ and z directions are calculated within a circle of radius 6σ around the interest point (σ is the scale at which the interest point was detected). The horizontal and vertical responses within a sliding orientation window are summed and yield a local orientation vector; the longest such vector among all windows gives the orientation of the interest point. Then, a square region of size 20σ around the interest point is split into 16 small sub-squares (4 × 4 within the square). For each sub-square, Haar wavelet responses dξ and dz at 5 × 5 regularly spaced sample points are computed and summarized by the four sums

    (Σ dξ, Σ dz, Σ |dξ|, Σ |dz|),

giving a SURF descriptor vector of length 64 (16 sub-squares × 4 values) for that interest point. Every interest point in the first image can be compared to every point in the second image by calculating the Euclidean distance between their descriptor vectors. A pair (match) is detected if the distance to the nearest neighbor is less than T times the distance to the second nearest neighbor. Since this measure is asymmetrical, matching is also performed from the second to the first image, and matches that appear in both directions are called bidirectional matches (see Fig. 3).
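The ratio test and bidirectional matching described above can be sketched as follows. This is an illustrative Python fragment, not the authors' code: descriptors are rows of (here low-dimensional, in practice 64-D SURF) vectors, and the ratio threshold `T` is an assumed value.

```python
import numpy as np

def ratio_matches(desc_a, desc_b, T=0.7):
    """Return {index_in_a: index_in_b} matches passing the ratio test."""
    matches = {}
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        nearest, second = order[0], order[1]
        # Accept only if the nearest neighbor is clearly better than the
        # second nearest (distance ratio test).
        if dists[nearest] < T * dists[second]:
            matches[i] = int(nearest)
    return matches

def bidirectional_matches(desc_a, desc_b, T=0.7):
    fwd = ratio_matches(desc_a, desc_b, T)
    bwd = ratio_matches(desc_b, desc_a, T)
    # Keep only pairs that match in both directions.
    return {i: j for i, j in fwd.items() if bwd.get(j) == i}
```

The bidirectional intersection discards asymmetric matches, which is what makes it useful as a verification step.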
The SURF features from all database images are associated with the image and the CP of their origin. The features are split into two groups (denoted +1 and −1, respectively) based on the sign of the Laplacian, which halves the search time. For each group, we create a hierarchical tree by clustering the descriptor vectors using the extended k-means algorithm repeatedly. This partitioning of U features into k disjoint subsets S_j, each containing U_j features, minimizes the sum-of-squares criterion

    J = Σ_{j=1}^{k} Σ_{u ∈ S_j} || l_u − μ_j ||²,

where l_u is a vector representing the u-th data point and μ_j is the geometric centroid of the data points in S_j. The algorithm consists of the following re-estimation procedure. Initially, the features are assigned at random to the sets. In step 1, the centroid is computed for each set. In step 2, every feature is assigned to the cluster whose centroid is closest to that feature. These two steps are alternated until a stopping criterion is met. For the first two or even three levels of the hierarchical tree, the k cluster centers are found by calculating the mean value of several previously calculated cluster centers.

(Fig. 2 shows a block diagram of the image-based localization: after extracting the SURF features from an input image, we propagate the features through the vocabulary tree to obtain the ranked CP locations; the ranking list is refined using bidirectional matching; employing the MMSE algorithm on the refined ranking list and corresponding confidence values yields the user location.)

For a query image, its SURF descriptor vectors and the corresponding signs of the Laplacian are extracted, and a match for each descriptor vector is found using the +1 or −1 hierarchical tree (see Fig. 4). Since each match is labeled with the image and location from which it was extracted, it casts one vote for its associated location. After each descriptor vector has voted for a location, the locations are ranked from most likely to least likely.
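The repeated k-means tree construction and the descriptor voting can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the branching factor, depth, stopping criterion, and all names are assumptions, and the center re-estimation heuristic for the upper levels is omitted.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Basic Lloyd-style k-means: alternate centroid update / reassignment."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def build_tree(X, cps, k=2, depth=2):
    """Recursively cluster descriptors; leaves keep their CP labels."""
    if depth == 0 or len(X) <= k:
        return {"leaf": True, "cps": cps}
    centers, labels = kmeans(X, k)
    children = [build_tree(X[labels == j], cps[labels == j], k, depth - 1)
                for j in range(k)]
    return {"leaf": False, "centers": centers, "children": children}

def vote(tree, d, votes):
    """Propagate a query descriptor down the tree; its leaf votes for CPs."""
    while not tree["leaf"]:
        j = int(np.argmin(np.linalg.norm(tree["centers"] - d, axis=1)))
        tree = tree["children"][j]
    for cp in tree["cps"]:
        votes[cp] = votes.get(cp, 0.0) + 1.0 / len(tree["cps"])
```

After all query descriptors have voted, sorting `votes` by value yields the ranked CP list that the bidirectional matching stage then refines.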
A verification stage is then employed, using bidirectional matching to reorder the top 5 previously ranked locations. First, the ranking obtained by the descriptor vectors is weighted by the normalized bidirectional matching location scores and normalized again, thus associating normalized votes with each CP. A confidence is assigned to each CP, denoted by q_i, defined as the ratio of the normalized votes associated with that CP to the total number of normalized votes. Similar to Section 3, we calculate the user location, denoted by r_I, using the MMSE estimation algorithm

    r_I = Σ_{i=1}^{K} q_max_i · CP_max_i.

THRESHOLD-BASED FUSION METHOD
To perform threshold-based fusion, we take the confidences p_i and q_i from both sensing modalities, P (WLAN) and Q (image), into account. Here, i refers to a given CP. The first-ranked, second-ranked, third-ranked, etc. sorted confidences are denoted by p_max1, p_max2, p_max3, etc., respectively (and similarly for Q). Let us define P_ij = p_max_i − p_max_j and, similarly, Q_ij = q_max_i − q_max_j. We used separate training and validation datasets to derive the fusion function and to define the threshold values. Observing P_12 and Q_12 over many confidence pairs derived from the validation dataset, we concluded that for values of P_12 and/or Q_12 beyond some reliably large thresholds, the nearest CP (location) was the first-ranked one, based either on P or Q (or both). These reliably large thresholds, denoted by T_P and T_Q for the P and Q modalities respectively, are derived from the corresponding validation datasets (denoted by VAL_P and VAL_Q). Moreover, we deduced that introducing multiplication (p_i · q_i) and/or addition (p_i + q_i) functions under some conditions, i.e., using more thresholds, can decrease the positioning error even further; however, to avoid over-fitting we did not eventually use the additional multiplication/addition. We also found that the ranking of the correct location did not fall below certain positions in either set of rankings (the a_P-th position for P and the a_Q-th position for Q). If neither threshold condition is satisfied, we take the ranking of the modality to which min(a_P, a_Q) corresponds. The fusion process thus yields a fusion confidence f_i for each CP, where b_i denotes the confidence of the method to which min(a_P, a_Q) corresponds.
Similar to Sections 3 and 4, we calculate the user location, denoted by r_Ft, using the MMSE estimation algorithm

    r_Ft = Σ_{i=1}^{K} f_max_i · CP_max_i.

This threshold-based fusion approach is evaluated in Section 8 using a test dataset that is different from the training and validation datasets.
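The threshold-based decision logic can be sketched as follows. This is an illustrative Python fragment, not the authors' code: the threshold and rank-bound values are invented for the example, and the returned ranked confidences would then feed the MMSE step.

```python
# Sketch of the threshold-based fusion rule (illustrative values).

def fuse(p, q, T_P, T_Q, a_P, a_Q):
    """p, q: confidences sorted by descending rank for WLAN (P) and image (Q).
    Returns the confidence ranking selected by the fusion rule."""
    if p[0] - p[1] >= T_P:       # P_12 beyond its threshold: trust WLAN
        return list(p)
    if q[0] - q[1] >= T_Q:       # Q_12 beyond its threshold: trust image
        return list(q)
    # Otherwise fall back to the modality with the better worst-case rank,
    # i.e., the one to which min(a_P, a_Q) corresponds.
    return list(p) if a_P <= a_Q else list(q)

# Example: WLAN confidence gap exceeds T_P, so the WLAN ranking is kept.
f = fuse([0.6, 0.2, 0.2], [0.4, 0.35, 0.25], T_P=0.3, T_Q=0.3, a_P=3, a_Q=5)
```

The selected confidences f_i would then be combined with the corresponding CP positions via the MMSE estimator, exactly as in the single-modality cases.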

FUSION BASED ON PARTICLE FILTER
In this approach, the sequentially arriving RSS and image measurements are fused with location predictions from the user kinematic model using a Sampling Importance Resampling (SIR) particle filter [44]. The particle filter method projects the state of the user to be tracked (particles) one step ahead. This is followed by RSS and image measurement acquisition to assign weights to particles and generate a probability distribution. Next, the motion and measurement models are introduced together with the description of the particle filter.

Motion Model and Kinematic Prior Propagation
The position and velocity of a user at time step k = 1, ..., K within a building interior is described by the state vector X_k = [x_k, ẋ_k, c_k, ċ_k]^T, where (x_k, c_k) is the user position in the x and c dimensions and (ẋ_k, ċ_k) are the corresponding velocities. The user motion is described as

    X_k = A X_{k-1} + B v_k,   (17)

where A is the constant-velocity state transition matrix, B maps the process noise onto the state, and dt is the time difference between state transitions. Q is a diagonal process noise covariance matrix, and v_k denotes a zero-mean, unit-variance Gaussian process that models velocity errors. The model in (17) is associated with the kinematic prior distribution P(X_k | X_{k-1}).
Particles are then defined, representing realizations of possible user states X_{h,k}, h = 1, ..., H, where H is the total number of particles used. Each particle h = 1, ..., H is propagated one step ahead using the kinematic model in (17) as

    X_{h,k} = A X_{h,k-1} + B v_k,

which is equivalent to sampling from the kinematic prior distribution P(X_k | X_{h,k-1}).
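A minimal sketch of this propagation step, assuming a standard constant-velocity transition matrix and Gaussian noise injected into the velocity components (the exact matrices and noise scales of (17) are as in the paper; the values here are illustrative):

```python
import numpy as np

def propagate(particles, dt=1.0, sigma_v=0.5, rng=None):
    """Propagate particle states X = [x, x_dot, c, c_dot] one step ahead."""
    rng = rng or np.random.default_rng(0)
    # Constant-velocity state transition matrix (assumed form).
    A = np.array([[1, dt, 0, 0],
                  [0, 1,  0, 0],
                  [0, 0,  1, dt],
                  [0, 0,  0, 1]], dtype=float)
    # Zero-mean Gaussian velocity noise v_k enters the velocity components.
    noise = np.zeros_like(particles)
    noise[:, [1, 3]] = sigma_v * rng.standard_normal((len(particles), 2))
    return particles @ A.T + noise
```

With `sigma_v = 0` this reduces to deterministic constant-velocity motion; the noise term is what makes each call a sample from the kinematic prior.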

Measurement Models
The training measurement set is used as information on the correspondence of measurements to user locations, which we have incorporated in the WLAN and image measurement models detailed in the following.

WLAN Measurements Model
The signal strength measurements y_m^(n), m = 1, ..., M, n = 1, ..., N of the training set, collected at CP i = 1, ..., K at location (x_CP,i, c_CP,i), are assumed to be Gaussian distributed with mean and variance estimated from the training set measurements as

    μ̂_{i,m} = (1/N) Σ_{n=1}^{N} y_m^(n),   σ̂²_{i,m} = (1/N) Σ_{n=1}^{N} (y_m^(n) − μ̂_{i,m})².

The received signal measurement likelihood P_rss,i for each CP, given measurements from M APs, is then formed from the per-AP Gaussian densities evaluated at the measured RSS values, combined with the AP appearance probabilities in (7).
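The Gaussian RSS model can be sketched as follows. This is an illustrative Python fragment, not the authors' code: per-CP, per-AP mean and variance are estimated from training fingerprints, and the likelihood of a measurement is the product of per-AP normal densities (the AP appearance term is omitted here for brevity); all names are assumptions.

```python
import numpy as np

def fit_rss_model(train):
    """train[i][m] -> array of N RSS samples for CP i and AP m."""
    mu = {(i, m): s.mean()
          for i, aps in train.items() for m, s in aps.items()}
    var = {(i, m): s.var() + 1e-6  # small floor to avoid zero variance
           for i, aps in train.items() for m, s in aps.items()}
    return mu, var

def rss_likelihood(y, i, mu, var):
    """P_rss,i: product over APs m of N(y_m; mu_{i,m}, var_{i,m})."""
    lik = 1.0
    for m, y_m in y.items():
        mean, v = mu[(i, m)], var[(i, m)]
        lik *= np.exp(-(y_m - mean) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    return lik

# Toy example: two CPs, one AP; the measurement matches CP 0's training data.
train = {0: {0: np.array([-40.0, -42.0, -41.0])},
         1: {0: np.array([-70.0, -68.0, -69.0])}}
mu, var = fit_rss_model(train)
y = {0: -41.0}
```

As expected, the CP whose training mean is closest to the measured RSS receives the higher likelihood.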

Image Measurements Model
For the image measurements, the training set is used to provide normalized votes q_i on the CP from which a given image y_img in the current time step is likely to have been obtained, as described in Section 4. The normalized votes can therefore be interpreted as how likely it is that the test image was taken at a certain location. The image measurement likelihood is then approximated by the normalized votes as

    P_img,i = q_i,   (22)

which is used as a probability distribution indicating the probability that an image was taken at each of the locations (x_CP,i, c_CP,i), i = 1, ..., K.

Particle Weighting with Measurements
During localization, the particles are assigned weights based on RSS and image measurements generated from a true user state X_k = [x_k, ẋ_k, c_k, ċ_k]^T at location (x_k, c_k), which is unknown to the tracker.

Likelihood Based on RSS Measurements
An RSS measurement for the current time step, due to the true user position (x_k, c_k), is taken from the test dataset. The CP (x_CP,i*, c_CP,i*) nearest to (x_k, c_k) for which test set received signal measurements exist is identified, with index i* = argmin_i ||[x_k, c_k] − [x_CP,i, c_CP,i]||²₂. Then, a measurement from the test dataset, denoted y_rss,m, is selected for each AP corresponding to CP i*. In addition, for each particle's proposed location (x_h,k, c_h,k), the location for which training measurement data exist is identified as (x_CP,ĩ_h, c_CP,ĩ_h), where ĩ_h = argmin_i ||[x_h,k, c_h,k] − [x_CP,i, c_CP,i]||²₂, and the RSS likelihood P_rss,h,m is evaluated for each particle h and each AP m using the Gaussian model of CP ĩ_h.

Likelihood Based on Image Measurements
Image measurements that arise from the true user state are taken from the image test dataset. The index of the CP nearest to the true user state where images were collected is identified as i* = argmin_i ||[x_k, c_k] − [x_CP,i, c_CP,i]||²₂. Then, an image, denoted y_img, is drawn uniformly at random from the test dataset of CP i*. For each particle's proposed state, the location for which training set image measurements exist is identified as (x_CP,ĩ_h, c_CP,ĩ_h), where ĩ_h = argmin_i ||[x_h,k, c_h,k] − [x_CP,i, c_CP,i]||²₂, and the likelihood for each particle h based on the image measurements is taken as the normalized votes in (22), i.e., P_img,h = q_ĩ_h.

Particle Weighting
Considering both RSS and image measurements and assuming that they are mutually independent, the weight of each particle using the RSS and image likelihoods is $w_{h,k} = P_{\text{img},h} \prod_{m=1}^{M} P_{\text{rss},h,m}$, which is followed by particle resampling [44]. The particle filter fusion algorithm is outlined in Table 1. We used 600 particles; using more did not improve accuracy. In Section 8 we evaluate the performance of this algorithm; the positioning error is given as the Euclidean distance between the estimated position in (26) and the true user position.
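The weighting-and-resampling step above can be sketched as follows, assuming the per-particle likelihoods have already been computed; multinomial resampling stands in for the resampling scheme of [44], and all names are illustrative:

```python
import numpy as np

def weight_and_resample(p_img, p_rss, rng=None):
    """Weight particles as w_h = P_img,h * prod_m P_rss,h,m, normalize,
    and resample with replacement (multinomial resampling).

    p_img : (H,) image likelihood per particle
    p_rss : (H, M) RSS likelihood per particle and AP
    Returns the indices of the resampled particles."""
    rng = np.random.default_rng(0) if rng is None else rng
    w = np.asarray(p_img) * np.prod(np.asarray(p_rss), axis=1)
    w = w / w.sum()                       # normalize weights to sum to 1
    H = len(w)
    return rng.choice(H, size=H, p=w)     # draw H particles with replacement
```

Particles that explain both the image and the RSS measurements well are duplicated, while poorly matching particles die out.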

EXPERIMENTAL SETUP
For our experiments we use 36 offices (see Fig. 5, which shows most of the offices employed) in the School of Electronic Engineering, Dublin City University, Ireland. Within each office we use 5 CPs, denoted A, B, C, D, E, which are placed at the four corners of the office and at its center (see Fig. 6). In this work, we use the RSS Indicator (RSSI) value reported by the WLAN adapter, which is defined as the absolute RSS value. RSS data were captured using the InSSIDer 4 software. An observation consists of RSS readings from up to 14 WLAN APs. Note that these APs are part of the university infrastructure for the provision of wireless connectivity; we did not install any additional APs.
We gathered 120,000 images, of which 83,000 were used for training and 17,000 for testing, and 210,000 signal strength observations, of which 160,000 were used for training and 20,000 for testing. To derive the threshold values and the fusion function in the threshold-based fusion method, we used an independent set of 20,000 images and 30,000 signal strength observations as the validation dataset. During image and RSS data collection, the user was standing still at the CPs. During the training stage, image and WLAN data were collected at the CPs, while during the validation and testing stages they were collected at arbitrary points.
Offices are next to each other and look very similar inside, thus resulting in very challenging data for both WLAN and image-based localization methods (see examples in Fig. 7). Each CP is associated with several datasets, using data taken at different times of the day and on different days, to demonstrate the robustness of the localization approaches. During localization, the user collected one image and one RSS fingerprint from the test dataset, and we investigated the distance between the estimated and the true user location.

PERFORMANCE EVALUATION
We assess the performance of our system in terms of the positioning error, defined as the Euclidean distance between the true and estimated user locations. Specifically, we report the mean positioning error $E_p$ together with the 95 percent confidence interval given by $E_p \pm 1.96\,s/\sqrt{n}$, where $s$ denotes the standard deviation of the positioning error and $n$ is the number of test samples. Essentially, the 95 percent confidence interval indicates the range within which the true mean positioning error lies with high certainty. Later, our system is compared against other solutions with respect to positioning error and computation time.
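The reported statistic can be computed as follows (a minimal sketch; the function name is illustrative):

```python
import math

def mean_error_ci(errors):
    """Mean positioning error with its 95% confidence half-width,
    E_p +/- 1.96 * s / sqrt(n), where s is the sample standard deviation."""
    n = len(errors)
    mean = sum(errors) / n
    s = math.sqrt(sum((e - mean) ** 2 for e in errors) / (n - 1))
    half = 1.96 * s / math.sqrt(n)
    return mean, half
```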

Effect of Number of CPs Per Office
In this experiment, we vary the number of CPs per office, while fixing the number of WLAN APs to 14 and the number of training RSS fingerprints per CP to 600. The statistics of the positioning error for the different methods are depicted in Fig. 8. In particular, the height of each bar and the whiskers indicate the mean value $E_p$ and the 95 percent confidence interval, respectively.
It is clear that the fusion of WLAN and images improves accuracy compared to using either WLAN or images as standalone localization methods. In each case, $E_p$ decreases as the number of CPs per office increases. However, this comes at the expense of higher data collection effort and time for populating the database with training RSS fingerprints and images; Fig. 8 thus provides a guideline for this trade-off. Even though it pays off in terms of positioning error to survey more CPs per room (e.g., $E_p$ drops from around 4 m to 2 m when 5 CPs, instead of 1 CP, are considered), some applications might tolerate higher error but have strict setup time constraints (e.g., data collection completed in a few hours, rather than a few days).

Effect of Number of APs
In this case, 5 CPs per office are used and we vary the number of WLAN APs, while the number of training RSS fingerprints and the number of images per CP are equal to 600 and 32, respectively. Fig. 9 shows the trend of $E_p$ for an increasing number of APs. The Particle filter and Threshold-based fusion methods are considerably better than the standalone WLAN localization method, reaching around 2 m with 8 APs. For reference, our Image method achieves $E_p = 2.7$ m.
As expected, $E_p$ does not decrease significantly when more than 8 APs are considered. This is in line with what has been reported in the related literature: a few APs (i.e., a low dimension of the RSS fingerprints) are not enough to sufficiently distinguish between locations, while using more APs beyond a certain point (i.e., a high dimension of the RSS fingerprints) does not improve the discriminative capability of the fingerprints. Moreover, factors such as the inherent uncertainty in RSS data, due to noise and measurement errors, and especially the modeling errors that result from the finite number of calibration points, place a limit on the achievable localization accuracy that cannot be overcome by increasing the number of APs. This result suggests that reasonably accurate localization can be achieved in new WLAN deployments with a lower budget and quicker installation.

Effect of Number of Training Images
In this experiment, we consider 14 WLAN APs, 5 CPs per office, and 600 training RSS fingerprints per CP. The bar chart in Fig. 10 illustrates the improvement in $E_p$ as the number of training images per CP increases. Increasing the number of images per CP is achieved by equally using more images per orientation at every CP. As expected, the performance of all methods improves when more training images are considered, and the fusion methods consistently attain lower $E_p$, by around 1 m, compared to the standalone Image localization method. The WLAN method does not depend on the number of training images and delivers $E_p = 2.4$ m in all cases. This accuracy level is reached by the fusion methods using 16 training images per CP, while doubling the number of images further reduces $E_p$ by 0.5 m. Therefore, there is again a trade-off between the positioning error and the setup time of the system, which increases significantly when more training images are captured.

Effect of Number of Training RSS Data
In this experiment, we consider 14 WLAN APs, 5 CPs per office, and 32 training images per CP. Fig. 11 shows how $E_p$ decreases when the amount of training RSS fingerprints increases. Clearly, the two fusion methods utilizing both RSS and image data outperform the WLAN method in all cases. The Image localization method achieves $E_p = 2.7$ m.
Similarly to the number of training images discussed previously, there is a trade-off between the positioning error and the setup time of the system that increases significantly when more training RSS fingerprints are collected in every CP. For instance, collecting 600 RSS fingerprints per CP, instead of 400, only improves E p by around 0.4 m for the fusion methods.

Particle Filter for a Dynamic Scenario
The particle filtering method performs data fusion from multiple heterogeneous sources producing measurements that have a non-linear relationship to the target state. In addition, the particle filtering method is capable of incorporating target kinematic information into the estimation process. Moreover, the particle filter can handle measurements that arrive asynchronously or at irregular intervals: the belief on the target state is propagated by regularly updating the state using the kinematic prior, and the predictions on the target state are corrected with new measurements whenever they become available.
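The prediction step driven by the kinematic prior can be sketched as follows for the state $[x, \dot{x}, c, \dot{c}]$ under a constant-velocity model; the additive Gaussian velocity noise is an assumption of this sketch, not necessarily the exact process model used:

```python
import numpy as np

def predict(particles, dt, vel_noise_std, rng=None):
    """Propagate particle states [x, x_dot, c, c_dot] one time step under a
    constant-velocity kinematic prior, perturbing the velocities with
    additive Gaussian noise (illustrative process model)."""
    rng = np.random.default_rng(0) if rng is None else rng
    F = np.array([[1, dt, 0, 0],      # x     <- x + dt * x_dot
                  [0, 1,  0, 0],      # x_dot <- x_dot
                  [0, 0,  1, dt],     # c     <- c + dt * c_dot
                  [0, 0,  0, 1]], dtype=float)
    out = particles @ F.T
    out[:, [1, 3]] += rng.normal(0.0, vel_noise_std, size=(len(out), 2))
    return out
```

When no measurement arrives, this step alone keeps the belief moving; when a measurement does arrive, the weighting step corrects the prediction.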
In the results presented so far, where the target kinematic properties did not include a high uncertainty due to the semi-stationary pattern of motion, the particle filter method did not achieve significant accuracy improvement as compared to the threshold-based approach. To demonstrate the effectiveness of the particle filter in a dynamic scenario, we rearranged the order of the collected data appropriately to emulate a user walking along a path that passes from one office to the next assuming a typical user walking speed. We produced 13 different trajectories of the same path by considering test measurements from nearby randomly selected locations within each office. We fixed algorithm-specific parameters to their optimal values, i.e., 5 CPs, 14 APs, 32 images, 600 RSS fingerprints per CP, and 600 particles as in the above experiments. In this case, the average of the mean positioning errors pertaining to these 13 trajectories is 1.566 m with a confidence interval of 0.32 m compared to 1.908 m in static localization, as shown in Figs. 8, 9, 10, and 11.

Hybrid Fusion
As images are one of the most energy-consuming data sources, we would like to avoid using them continuously and instead employ them only when necessary, e.g., when the WLAN method is not expected to have good accuracy. Therefore, we investigate the effect on positioning error when images are used only if the number of sensed APs in the observed RSS fingerprint is less than 3. The intuition is that, according to the analysis in Section 8.2, the positioning error of the WLAN method degrades significantly below that value. Thus, the hybrid approach provides a practical way for indoor localization where users move around freely, enjoying reliable WLAN RSS-based location information, and stop to take a picture of the surroundings when the system detects that the user is at an unknown location or in a region with few detected APs that may result in poor WLAN RSS-based localization. We modified the original Particle filter and Threshold-based fusion methods, which use the captured image in every localization test, to create their hybrid variants that use images sporadically based on the number of sensed APs. In other words, the hybrid methods compute the user location using only WLAN RSS data most of the time and fuse it with image data only if one or two APs are sensed in the RSS fingerprint. Table 2 reports $E_p$ for the hybrid fusion methods. The hybrid methods deliver a higher positioning error, by around 0.5 m, compared to the original fusion methods. However, this is compensated by time efficiency due to the less frequent sampling and processing of images during localization, as analyzed in Section 9.
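The gating logic of the hybrid variants can be sketched as follows; `wlan_locate` and `fused_locate` are hypothetical placeholders for the WLAN-only and full fusion localization paths:

```python
def hybrid_localize(fingerprint, wlan_locate, fused_locate, min_aps=3):
    """Hybrid fusion gate: if fewer than `min_aps` APs are sensed in the
    RSS fingerprint (entries are RSS values or None for unsensed APs),
    acquire an image and run the full fusion; otherwise use the cheaper
    WLAN-only estimate."""
    sensed = [rss for rss in fingerprint if rss is not None]
    if len(sensed) < min_aps:
        return fused_locate(fingerprint)   # image acquired only here
    return wlan_locate(fingerprint)
```

The expensive image pipeline is thus triggered only for the fraction of tests with one or two sensed APs.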

COMPUTATION TIME
A series of experiments was conducted to compare the computation time of the five localization solutions, namely the WLAN, the Image, the Threshold-based fusion, the Particle filter fusion, and the hybrid fusion methods. Note that, for brevity, we report results only for the hybrid variant of the Particle filter fusion solution. We used the same laptop with the specifications described in Section 7, and all algorithms were implemented in Java.
Computation times are calculated over a number of tests and we report the average value, assuming that each test is equally time consuming. We denote by $t_W$ and $t_I$ the average computation times for obtaining the user's location with the WLAN and image modalities, given by $t_W = t_a^W + t_l^W$ and $t_I = t_a^I + t_l^I$, where $t_a$ and $t_l$ denote the times required for the acquisition of the corresponding modality and for the localization process, respectively. Acquisition of a WLAN RSS fingerprint takes, in our case, $t_a^W = 1$ s, while determining the user location with the WLAN localization algorithm takes $t_l^W = 0.058$ s; thus, the whole process takes $t_W = 1.058$ s in total. On the other hand, the acquisition of a $640 \times 480$ pixel image takes $t_a^I = 0.21$ s, while the localization process takes $t_l^I = 3.962$ s, giving $t_I = 4.172$ s in total. For the threshold-based fusion, one has to take one RSS observation and one image for each test and subsequently use the image-based and WLAN-based location results in the fusion process. In the case of particle filter fusion, only the acquisitions of the WLAN and image data are taken into account, in addition to the particle filter localization process. For the hybrid fusion, which is based on the particle filter, an image is acquired only when the WLAN-based location is considered unreliable (i.e., when fewer than 3 APs are sensed in the measured fingerprint), in addition to the localization time of the hybrid fusion; its average computation time therefore includes the image acquisition time weighted by $r$, the percentage of tests in which an image was acquired. In our tests we observed $r = 21.5\%$. Table 3 summarizes the computation time for each localization method. Among the fusion methods, we observe that the Hybrid method is the most power efficient; however, this comes at the expense of around 0.4 m degradation in positioning error compared to the Particle filter fusion method, as shown before in Table 2.
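The hybrid timing model described above (WLAN acquisition always paid, image acquisition only in the fraction $r$ of tests) can be written compactly; the separate localization-time term `t_l` is an assumption of this sketch:

```python
def avg_time_hybrid(t_a_w, t_a_i, t_l, r):
    """Average per-test computation time of the hybrid fusion: the WLAN
    acquisition t_a_w is always incurred, the image acquisition t_a_i only
    in the fraction r of tests where an image is taken, plus the
    localization time t_l."""
    return t_a_w + r * t_a_i + t_l
```

With the measured values $t_a^W = 1$ s, $t_a^I = 0.21$ s, and $r = 21.5\%$, the acquisition part alone averages about 1.045 s per test.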
The latter method proves to be the most demanding in terms of run time. Therefore, the Threshold-based fusion method is a good compromise between time-efficiency and positioning error.
We note that using a more powerful computer or server, instead of a laptop, and also applying techniques to parallelize the computations for the particles (now they are performed sequentially) could easily reduce the time to compute the user's location from a few seconds to less than one second. Thus, the proposed system would be applicable to practical real-life applications for localizing individuals that move at normal walking speed inside a building. In this case, the latency of the system (i.e., the time required to compute the user location) depends on the time to acquire a WLAN measurement (i.e., 1 s).
Moreover, the energy consumption of a particular method running on a device increases monotonically with the method's computation time. Due to lack of space, we omit the corresponding results on the energy consumption of the aforementioned methods running on the laptop.

COMPARISON WITH OTHER METHODS
The proposed fusion methods are compared against two state-of-the-art methods presented in [15] and [12].
The system in [15] uses WLAN-based localization, which follows a Naive Bayes approach, together with image-based localization. In particular, the system estimates the user's location from the scanned WLAN RSS values using a modified version of the centroid algorithm; see [15] for more details. In the image-based localization, a fusion algorithm is employed based on images extracted from video using the FFmpeg application. The target is modeled as a simple three-dimensional cylindrical object, using a single camera with a multiview perspective, and the images captured from the cameras are reduced to two-dimensional planar images.
In [12], WLAN-based localization is achieved by comparing RSS fingerprints in the database (collected offline) with the RSS fingerprint taken by a client (observed online). Similarly to our approach, if an AP is present in the observed fingerprint at the location of the device, but not present in a database fingerprint, then the match between them should be low and a penalty is applied to handle this. The same applies if an AP is missing from the observed fingerprint but present in the database fingerprint. In the image/video-based part, foreground segmentation is employed, followed by extraction of human shapes, which are then mapped onto the floor plan, as explained in detail in [12]. When both WLAN and camera data are available, the two measurements are combined with a naive Bayesian approach. In the following, we compare the Particle filter and Threshold-based fusion methods with the methods in [15] and [12] in terms of mean positioning error $E_p$.
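The penalty-based fingerprint matching described for [12] can be sketched as follows; the dictionary representation and the penalty value are illustrative assumptions, not taken from [12]:

```python
def fingerprint_distance(observed, stored, penalty=100.0):
    """Distance between an observed and a stored RSS fingerprint, each a
    dict mapping AP id -> RSS value. APs present in one fingerprint but
    missing from the other incur a fixed penalty, lowering the match, in
    the spirit of the scheme in [12]."""
    aps = set(observed) | set(stored)
    d = 0.0
    for ap in aps:
        if ap in observed and ap in stored:
            d += (observed[ap] - stored[ap]) ** 2
        else:
            d += penalty   # AP sensed on only one side
    return d
```

The stored fingerprint with the smallest distance then yields the WLAN-based location estimate.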
In the first experiment, we consider 5 CPs per office, while the number of training images and RSS fingerprints per CP is 32 and 600, respectively. Fig. 12 shows the performance comparison as we vary the number of APs. The proposed fusion methods outperform the methods in [15] and [12]. In particular, E p is lower by around 1.5 m and 1 m when only one AP or two APs are considered, respectively. Even though the positioning error does not drop considerably when more than 8 APs are considered, fusion methods are still more accurate in the mean by about 0.5 m when 8 or more APs are used.
In the second experiment, we fix the number of APs to 14 and vary the number of training RSS fingerprints per CP. The results are depicted in Fig. 13. We observe similar behavior for all methods and the proposed fusion methods deliver positioning error below 3 m when 300 RSS training fingerprints are used.
Finally, we present results related to computation time in Table 4. We observe that our Threshold-based fusion method is very efficient and achieves significant savings in terms of computation time compared to competing methods in [15] (i.e., 41 percent reduction in computation time) and [12] (i.e., 35 percent reduction). On the other hand, the Particle filter fusion method outperforms the method in [15], but is slightly worse than the method in [12]. In this case, the hybrid version of the Particle filter fusion method performs better than [12] in terms of time, while still providing lower positioning error.

CONCLUSION
In this work we investigate the combination of two complementary data sources for indoor localization and propose a novel image-based localization algorithm, as well as two strategies for fusing either the locations induced by WLAN and image localization algorithms or the raw WLAN and image measurements directly. The results demonstrate that the fusion methods achieve lower positioning error than any individual modality, while outperforming competing fusion approaches.
Both our fusion methods deliver similar accuracy; however, they have different features that make each of them the preferred solution depending on the application scenario. For instance, the Threshold-based fusion method is more light-weight, i.e., it has lower computational complexity, resulting in lower run time and energy consumption. On the other hand, it requires the collection of a separate validation dataset and subsequent fine-tuning for selecting algorithm-specific thresholds, thus increasing the system setup time. In this case, the more flexible Particle filter fusion method can be used instead. The flexibility of the particle filter algorithm is demonstrated when it is used as a hybrid fusion approach able to trade off positioning error with a reduction in computational time. Finally, it can fuse measurements from additional heterogeneous sources (beyond image and RSS data) if available. Future work will investigate the use of dynamic confidence-based weighting between the WLAN and image modalities in both fusion approaches. Such an adaptive fusion scheme is expected to further improve the positioning error at no additional time-energy cost. In addition, the use of different WLAN bands at 2.4 GHz and 5 GHz, Bluetooth beacons, and a fusion of WLAN, IMU, and image data can improve both the performance and the versatility of our localization system. A possible research direction would be to leverage WLAN Channel State Information (CSI) instead of RSS, as in the Dynamic-MUSIC algorithm [6], to further improve WLAN-based localization accuracy due to the higher resolution of the CSI measurements.
Milan D. Redžić received the MSc degree in electronic engineering from the Faculty of Electrical Engineering, University of Belgrade, Serbia, in 2006, and the PhD degree in electronic engineering from CLARITY: Centre for Sensor Web Technologies, Dublin City University, Ireland, in 2012. He is currently a principal video intelligence consultant in the Huawei Ireland Research Center in Dublin, Ireland, working on deep learning for visual recognition related projects. Before that, he was with the IBM Connections Lab and SAP Predictive Analytics (both in Dublin, Ireland), where he was involved in machine-learning and big data related topics. During his PhD studies, and afterwards as a post-doctoral researcher, he worked on different location-sensing projects in both indoor and outdoor scenarios. His research interests include indoor/outdoor localization, deep learning, computer vision, multi-sensor fusion, and similarity measures.
Christos Laoudias received the engineering diploma in computer engineering and informatics and the MSc degree in integrated hardware and software systems from the University of Patras, Greece, in 2003 and 2005, respectively, and the PhD degree in computer engineering from the University of Cyprus, in 2014. He is currently a senior research fellow at the KIOS Research Center, University of Cyprus, contributing to various projects related to localization, tracking, and navigation in telecommunication and smart camera networks. Before that, he was leading the geolocation technology research in the Huawei Ireland Research Center, Dublin. During his doctoral studies, he was involved in several award-winning indoor localization prototype systems and received the Alpha Bank Cyprus Award for "Creative Research and Innovation". His research interests include positioning and tracking, fault-tolerant algorithms, mobile and pervasive computing, and location-based services.