Towards Robust Loop Closure Detection in Weakly Textured Environments using Points and Lines

SLAM approaches rely on loop closure strategies to correct the inconsistencies of the generated map. These inconsistencies are mainly caused by the effect of sensor noise in odometry sources. In visual SLAM, loop detection typically relies on the repeated detection and matching of texture-based keypoints. Weakly textured environments, however, can lead to scenes lacking this kind of point and, hence, to poorly performing loop detectors. An alternative for these environments is the use of geometrical cues such as line segments, which are frequently present in human-made, structured environments. In this work, we introduce a novel appearance-based loop closure detection method that integrates lines and points to enhance performance in these scenarios. For this purpose, we build an incremental Bag-of-Binary-Words scheme for each visual cue to retrieve previously seen images from the two complementary perspectives. Furthermore, we rely on a late fusion strategy to combine the image candidates resulting from both visual vocabularies. An effective mechanism to group similar images close in time is applied next to reduce the effort of the image candidate search. Finally, we propose a novel scheme to geometrically validate the loop candidates, integrating lines into the procedure. The proposed approach compares favourably with other state-of-the-art methods on several datasets.


I. INTRODUCTION
Simultaneous Localization and Mapping (SLAM) addresses the problem of building a map of the environment while, at the same time, localizing the robot within the generated map. These approaches typically depend on loop closure strategies, which, by identifying previously seen places, correct the accumulated position error and re-localize the robot when the tracking system fails. When images are involved in this association procedure, this process is referred to as appearance-based loop closure detection [1]-[4].
Many visual SLAM approaches rely on points as visual features [5]. Despite their impressive results in highly-textured scenarios, their performance degrades in weakly-textured environments, where it is typically difficult to find large sets of point features. In this context, some visual SLAM systems have recently combined points and lines in the loop closure stage [6], [7]. However, these works rely on off-line Bag-of-Words (BoW) models [8]-[10]. This kind of approach requires a pre-training step whenever the environment changes with regard to the available, pre-trained visual vocabulary. To overcome this shortcoming, our proposal adopts an incremental dictionary-based approach [1]-[3], [11], [12] that avoids the pre-training. Furthermore, to solve the unavoidable spatial verification process for loop hypothesis validation, our solution relies only on 2D image data, contrary to other studies that require 3D information supplied by either a stereo camera or a previous mapping process [6], [7].
This work is partially supported by EU-H2020 projects BUGWRIGHT2 (GA 871260) and ROBINS (GA 779776), and by projects PGC2018-095709-B-C21 (MCIU/AEI/FEDER, UE), and PROCOE/4/2017 (Govern Balear, 50% P.O. FEDER 2014-2020 Illes Balears). This publication reflects only the authors' views and the European Union is not liable for any use that may be made of the information contained therein.
Summing up, this work proposes a novel appearance-based loop closure detection system that achieves a high number of loop detections by combining points and lines. As commented above, we take advantage of an on-line BoW model, based on binary descriptors [3], [9], [10], which reduces the computational effort and avoids the classical training stage of off-line schemes. Two visual dictionaries, one for each type of visual feature, are maintained. To combine the information obtained from each vocabulary, we employ a late fusion strategy based on a ranked voting system. Finally, we introduce a novel and faster alternative to the traditional RANSAC method for the spatial verification stage, which is in charge of discarding the false positives that the visual vocabularies return as loop candidates. The proposed loop closing approach is validated using multiple datasets, recorded under different environmental conditions, and it is compared against several state-of-the-art methods.

II. LOOP CLOSURE DETECTION
In this section, we introduce our loop closure detection approach. First, we detect keypoints and lines in each sampled image and compute binary descriptors for each of them. These descriptors are then used to obtain a list of the most similar images from each visual vocabulary. The two resulting candidate lists are fused using a ranked voting system which integrates visual similarities from both visual perspectives. To prevent consecutive images from competing with each other as loop closure candidates, we group them using the concept of dynamic islands [3], and a representative image of the best island is selected as the loop candidate. Finally, this image is assessed geometrically against the query image by using points and lines: if the number of inliers resulting from the spatial verification process is higher than a threshold, the loop is accepted; otherwise it is rejected.

A. Image Description
An image $I_t$ sampled at time $t$ is described as $\phi(I_t) = \{P_t, L_t\}$, where $P_t$ is a set of local keypoint descriptors and $L_t$ is a set of line descriptors extracted from the image. Point detection and description are performed using ORB [13], while line segments are detected using LSD [14] and described using a binary form of LBD [15]. The set of the $m$ point descriptors found in image $I_t$ is defined as $P_t = \{d^t_0, d^t_1, \ldots, d^t_{m-1}\}$, whereas the set of the $n$ line descriptors in $I_t$ is defined as $L_t = \{l^t_0, l^t_1, \ldots, l^t_{n-1}\}$. As will be shown later, the combination of these two descriptors enhances the retrieval results in a wider range of scenarios than using points alone. This is because some environments may be described more distinctively using lines than points (e.g. weakly-textured, structured scenes), or vice versa.

B. Retrieval of Loop Closure Candidates
Loop closure candidates are obtained using OBIndex2 [3], which combines an incremental Bag-of-Binary-Words (BoBW) scheme with an inverted file to rapidly obtain similar images. OBIndex2 efficiently manages an increasing number of visual words using a hierarchical tree structure. In our proposal, we maintain two instances of OBIndex2: one for points and one for lines. When an image $I_t$ is available, its features are used to retrieve the list of the most similar images from the two visual dictionaries: on the one hand, the list of the $m$ most similar images using points, $C^t_p = \{I^t_{p_0}, \ldots, I^t_{p_{m-1}}\}$, and, on the other hand, the list of the $n$ most similar images using lines, $C^t_l = \{I^t_{l_0}, \ldots, I^t_{l_{n-1}}\}$. These lists are sorted according to their associated scores $s^t_p(I_t, I^t_j)$ and $s^t_l(I_t, I^t_j)$, which are based on a term frequency-inverse document frequency (tf-idf) scoring scheme. Next, scores are min-max normalized to the range [0, 1] [3], which compensates for the differences in range caused by the distribution of the visual words in each vocabulary. Finally, we limit the number of candidates per list by filtering out those images whose normalized score $s^t_k$ is lower than a threshold.
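The normalization and filtering step described above can be sketched as follows. This is a minimal illustration, not the OBIndex2 implementation: the function names, the dictionary-based input, the threshold value and the handling of the degenerate case where all scores are equal are assumptions made for the example.

```python
def min_max_normalize(scores):
    """Map raw similarity scores to [0, 1].

    scores: dict mapping image id -> raw (e.g. tf-idf based) score.
    """
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: when all candidates score the same, treat them as
        # equally (maximally) similar instead of dividing by zero.
        return {img: 1.0 for img in scores}
    return {img: (s - lo) / (hi - lo) for img, s in scores.items()}


def filter_candidates(scores, threshold):
    """Normalize, drop candidates below the threshold, sort by descending score."""
    normalized = min_max_normalize(scores)
    kept = [(img, s) for img, s in normalized.items() if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


# Hypothetical raw scores returned by one vocabulary for the query image.
raw_point_scores = {12: 8.0, 40: 5.0, 41: 2.0}
candidates = filter_candidates(raw_point_scores, threshold=0.3)
```

The same routine would be applied independently to the point list and to the line list, so that both reach the fusion stage on a comparable scale.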

C. Fusion of Lists of Candidates
The next step is to merge the two candidate lists $C^t_p$ and $C^t_l$ to obtain a joint perspective of the retrieved loop closure candidates. To this end, we rely on a late fusion approach [16] by means of a ranked voting system using the Borda count [17], a simple data fusion method based on democratic election strategies. In our proposal, a voter is defined for each visual dictionary. Each voter emits an ordered list of candidates $C^t_k$, and the two lists may differ in size. The number of candidates $c$ that vote in each list is the minimum length of the two candidate lists. Next, the top-$c$ images of each list $C^t_k$ are ranked with a score $b_k$ defined as:

$$b_k(I^t_j) = (c - j)\, s^t_k(I_t, I^t_j), \tag{1}$$

where $j$ denotes the order of the image $I_j$ in the list $C^t_k$ and $s^t_k(I_t, I^t_j)$ is the normalized score of the image in that list. For each image that appears in both lists, a combined Borda score $\beta$ is computed as the geometric mean of the individual scores using equation 2:

$$\beta(I^t_j) = \sqrt{b_p(I^t_j)\, b_l(I^t_j)}. \tag{2}$$

The geometric mean allows us to reduce the influence of false positive image candidates that can appear in one of the lists.
The resulting list $C^t_{pl}$ thus combines the information of the two visual vocabularies.
Next, to deal with the fact that some environments mostly exhibit one type of feature, images that only appear in one of the lists are also placed into $C^t_{pl}$, although penalized by a constant factor. Finally, $C^t_{pl}$ is sorted according to the scores $\beta(I^t_j)$ of all the retrieved image candidates.
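As an illustration of this fusion step, the following sketch ranks the top-c images of each list, combines shared images through the geometric mean and penalizes images seen by a single voter. It rests on several assumptions that are not taken from the text: the rank weight is taken as (c - j) with j starting at 0, the penalty is applied as a multiplicative constant (`penalty`, a hypothetical parameter) to the single available score, and the function names are invented for the example.

```python
import math


def borda_scores(ranked, c):
    """Score the top-c entries of one voter's list.

    ranked: list of (image id, normalized score), sorted by descending score.
    Assumption: rank weight (c - j), with j the 0-based position in the list.
    """
    return {img: (c - j) * s for j, (img, s) in enumerate(ranked[:c])}


def fuse_candidates(points_ranked, lines_ranked, penalty=0.5):
    """Late fusion of the point and line candidate lists."""
    # The number of voting candidates is the length of the shorter list,
    # so an empty list yields no fused candidates in this sketch.
    c = min(len(points_ranked), len(lines_ranked))
    b_p = borda_scores(points_ranked, c)
    b_l = borda_scores(lines_ranked, c)

    fused = {}
    for img in set(b_p) | set(b_l):
        if img in b_p and img in b_l:
            # Geometric mean of the two individual Borda scores.
            fused[img] = math.sqrt(b_p[img] * b_l[img])
        else:
            # Assumption: single-list images keep their score scaled by a constant.
            fused[img] = penalty * b_p.get(img, b_l.get(img))
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical normalized candidate lists for one query image.
points = [(10, 1.0), (11, 0.6), (30, 0.4)]
lines = [(11, 1.0), (50, 0.5)]
fused = fuse_candidates(points, lines)
```

In this toy case image 11 is supported by both vocabularies and ends up first, even though image 10 has the highest point-only score.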

D. Computation of Dynamic Islands
An additional temporal consistency verification procedure is performed next to prevent consecutive images from competing with each other as loop candidates. To this end, we rely on the concept of dynamic islands [3]. A dynamic island $\Upsilon^m_n$ is a group of images whose timestamps range from $m$ to $n$. A set of islands is built for each image $I_t$. To build this set, the images $I_i \in C^t_{pl}$ are evaluated sequentially. If the timestamp of an image lies in the $[m, n]$ interval of an existing island, the image is associated with that island $\Upsilon^m_n$. If its timestamp does not overlap with any of the existing islands, a new island is created. After processing all images in $C^t_{pl}$, a global score $g$ is computed for each island as:

$$g(\Upsilon^m_n) = \frac{\sum_{i=m}^{n} \beta(I^t_i)}{n - m + 1}. \tag{3}$$

The resulting set of islands $\Gamma_t$ is sorted in descending order according to $g$. This global score represents the average of the Borda scores, hence integrating point and line information from all images associated with an island. Next, a representative island $\Upsilon^*(t)$ is selected among the set of resulting islands to determine which area of the environment is the most likely to close a loop with $I_t$. For this purpose, iBoW-LCD [3] relies on the concept of priority islands. Priority islands are defined as the islands of $\Gamma_t$ that overlap in time with the island selected at time $t-1$, $\Upsilon^*(t-1)$. In iBoW-LCD, the island finally selected corresponds to the priority island with the highest score $g$, if any. This selection is based only on the appearance of the images. Nonetheless, in some weakly textured environments, this policy can fail due to perceptual aliasing, leading to incorrect island associations. To overcome this problem, in this proposal, an island is retained for the next time step only if the final selected loop candidate passes the spatial verification procedure, as explained in the next section. Once the best island $\Upsilon^*(t)$ is identified, the image $I_c$ of $\Upsilon^*(t)$ with the highest Borda score $\beta$ is selected and used in the next stage to validate the loop.
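The island construction and selection logic can be sketched as below. The text does not state how the limits of a newly created island are chosen, so this example assumes a fixed margin around the image timestamp (`half_width`, a hypothetical parameter). The global score is assumed to be the sum of the Borda scores divided by the island span, with images that were not retrieved contributing zero. Class and function names are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Island:
    m: int                                       # first timestamp covered
    n: int                                       # last timestamp covered
    members: dict = field(default_factory=dict)  # timestamp -> Borda score

    def contains(self, ts):
        return self.m <= ts <= self.n

    def overlaps(self, other):
        return self.m <= other.n and other.m <= self.n

    def score(self):
        # Assumption: average over the island span [m, n].
        return sum(self.members.values()) / (self.n - self.m + 1)

    def best_image(self):
        return max(self.members, key=self.members.get)


def build_islands(candidates, half_width=2):
    """candidates: list of (timestamp, Borda score), processed sequentially."""
    islands = []
    for ts, beta in candidates:
        for island in islands:
            if island.contains(ts):
                island.members[ts] = beta
                break
        else:
            # No existing island covers this timestamp: create a new one.
            islands.append(Island(max(0, ts - half_width), ts + half_width, {ts: beta}))
    islands.sort(key=Island.score, reverse=True)
    return islands


def select_island(islands, previous=None):
    """Prefer the best-scoring island overlapping the previously retained one.

    The caller passes previous=None when the last loop candidate failed
    spatial verification, so that no priority is carried over.
    """
    if not islands:
        return None
    if previous is not None:
        priority = [isl for isl in islands if isl.overlaps(previous)]
        if priority:
            return priority[0]  # islands are already sorted by score
    return islands[0]


# Hypothetical fused candidates (timestamp, Borda score) for one query image.
islands = build_islands([(100, 1.2), (101, 0.9), (40, 1.0), (99, 0.3)])
best = select_island(islands)
```

Calling `select_island(islands, previous=Island(39, 43))` instead returns the lower-scoring island around timestamp 40, which is the behaviour priority islands are meant to provide when the previous detection was verified.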

E. Spatial Verification
Loop closure detection methods based on BoW schemes rely only on appearance and ignore the spatial arrangement of the image features, which can result in false detections. To address this problem, a geometric verification procedure is performed to validate the selected candidate $I_c$. To implement the spatial verification step, RANSAC is typically used to fit a specific transformation model between images [1], [3]. Although quite robust, RANSAC is still affected by a large number of outliers. To mitigate this, the Nearest Neighbour Distance Ratio (NNDR) [18] test can be applied before RANSAC to pre-filter certain incorrect matches. However, this test only considers the image appearance and, hence, when using line features, a large number of correct line matches can be discarded due to the similarity between descriptors. This issue arises particularly in low-textured environments, where few points are detected and lines become the prominent visual feature. New structural matching constraints, such as Local Geometric Support (LOGOS) [19] or Grid-based Motion Statistics (GMS) [20], have recently been introduced to deal with this issue. These methods determine the set of inliers between images without requiring either RANSAC or the ratio test. They are based on the existing relationships between local feature neighbourhoods and, thus, they achieve a higher number of matches per frame, resulting in fewer false positive loop closure detections.
In this work, we introduce an alternative use of GMS that is able to deal with lines. In short, we employ a point representation for each of the two end-points of a line segment, so that a line is regarded as a correct match if GMS accepts one of the two end-points. If the global number of matches produced by GMS is higher than a threshold, then the loop candidate is accepted; otherwise it is rejected. As will be shown in the experiments, this alternative version of GMS offers a good balance between performance and computation time.
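The line-aware acceptance rule can be sketched as follows, treating GMS itself as a black box that returns the indices of the end-point matches it accepts. The indexing convention (end-point matches 2i and 2i+1 belong to line match i), the reading of "one of the two end-points" as "at least one", and the assumption that the global count is the number of accepted point matches plus the number of accepted line matches are choices made for this example, not details given in the text.

```python
def accepted_line_matches(num_line_matches, accepted_endpoint_ids):
    """Return the line matches supported by at least one accepted end-point.

    Assumption: end-point matches 2*i and 2*i + 1 correspond to the two
    end-points of line match i.
    """
    lines = {e // 2 for e in accepted_endpoint_ids}
    return sorted(i for i in lines if 0 <= i < num_line_matches)


def verify_loop(num_point_inliers, num_line_matches, accepted_endpoint_ids, min_inliers):
    """Accept the loop candidate if the global match count exceeds the threshold."""
    line_inliers = accepted_line_matches(num_line_matches, accepted_endpoint_ids)
    total = num_point_inliers + len(line_inliers)
    return total > min_inliers, total


# Hypothetical GMS output: 10 accepted point matches, 3 putative line matches,
# and accepted end-point matches 0, 1 (both ends of line 0) and 5 (one end of line 2).
accepted, total = verify_loop(10, 3, {0, 1, 5}, min_inliers=11)
```

Line 0 is counted once even though both of its end-points are accepted, and line 1 is discarded because neither of its end-points passed.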

III. EXPERIMENTAL RESULTS
This section reports on a set of experiments to validate the proposed approach. We also compare the performance of our approach with other state-of-the-art methods. As usual, the evaluation is performed in terms of precision-recall (P-R). To evaluate the combination of points and lines proposed in this approach, we have selected several publicly available datasets of different nature: from weakly-textured scenes, which usually contain more lines than points, to highly-textured scenes with the opposite characteristics, as well as intermediate cases. The datasets considered for the evaluation are: CityCentre [21] (CC), KITTI 00 [22] (K00), KITTI 06 [22] (K06) and Lip6Outdoor [1] (L6O). For each dataset, we use the ground truth from the original authors except for the KITTI sequences, where the ground truth provided by [23] is employed. All experiments were performed on an Intel Core i7-9750H (2.60 GHz) processor with 16 GB RAM.
A. General Performance
Figure 1 illustrates the loop closures detected using points, lines and the combination of both, together with the ground truth, for the L6O dataset. As can be observed, the combination of both features increases the number of loop closure detections. However, this combination does not increase the processing time per image in comparison with the use of only points, as in [3]. The average time to process an image of the K00 dataset in iBoW-LCD is 432.38 ms, while our proposal needs 387.82 ms. This can be attributed to the parallel execution of some parts of our algorithm. Figure 2 shows the P-R curves obtained for each dataset using either GMS (top) or RANSAC (bottom) as the method for the spatial verification procedure. Although GMS does not achieve a recall as high as RANSAC, it is more reliable for a SLAM system, where false positives are critical. This fact is observed in the precision axis, where lower values at the maximum recall indicate a higher number of false positives. Another advantage of GMS over RANSAC is its lower computation time, as can be observed in Table I, where spatial verification times for both cases are shown. Table II shows the maximum recall achieved at 100% precision for each dataset. The proposed method is compared to other off-line and on-line approaches. The reported results come from the original works, except for [6], whose results have been obtained using the default parameters and the visual vocabularies provided by its authors. Results that are not available are indicated as n.a. The proposed method provides in most cases a higher recall than the other solutions. Furthermore, our proposal outperforms the results reported by [6], which is perhaps the most similar solution to our method.

TABLE II. Maximum recall at 100% precision for each dataset

Method              CC      K00     K06     L6O
Bampis [4]          71.14   96.53   n.a.    58.32
Gálvez-López [9]    31.61   n.a.    n.a.    n.a.
Mur-Artal [10]      43.03   n.a.    n.a.    n.a.
Cummins [8]         38.77   49.2    55.34   n.a.
Gomez-Ojeda [6]     n.a.    75.9    56.9    n.a.
Tsintotas [12]      n.a.    97.50   n.a.    50.0
Tsintotas [11]      n.a.    93.2    n.a.    n.a.
Angeli [1]          n.a.    n.a.    n.a.    23.59
Gehrig [24]         n. …
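For reference, the metric reported in Table II, the maximum recall at 100% precision, can be computed from a P-R sweep as sketched below. The counts in the example are invented for illustration and are not results from this work; the function names and the convention of returning a precision of 1.0 when nothing is detected are likewise assumptions.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive and false negative counts."""
    # Assumption: with no detections at all, precision is taken as 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def max_recall_at_full_precision(operating_points):
    """operating_points: iterable of (tp, fp, fn), one per acceptance threshold."""
    recalls = []
    for tp, fp, fn in operating_points:
        precision, recall = precision_recall(tp, fp, fn)
        if precision == 1.0:
            recalls.append(recall)
    return max(recalls, default=0.0)


# Hypothetical counts obtained by sweeping the inlier threshold on a
# sequence with 100 ground-truth loop closures.
counts = [(40, 0, 60), (70, 0, 30), (85, 5, 15)]
best_recall = max_recall_at_full_precision(counts)
```

In this toy sweep the third operating point has the highest recall but includes false positives, so the reported figure would be the recall of the second one.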

IV. CONCLUSIONS
This paper introduces an appearance-based loop closure detection method that combines points and lines to achieve a higher number of loop closure identifications, especially in weakly textured environments. This is accomplished by means of a dual BoBW scheme, one for each visual feature, which quickly supplies similar images from both perspectives. Then, a ranked voting system is used to merge both lists of candidates. To validate the loop candidate hypothesis, we propose a geometric check stage whose main component is a modified version of GMS, adapted to deal with both points and lines. Experimental results to validate our approach have been reported, showing that our proposal compares favourably against several state-of-the-art methods.