Mining Minimal Map-Segments for Visual Place Classifiers

In visual place recognition (VPR), map segmentation (MS) is a preprocessing technique used to partition a given view-sequence map into place classes (i.e., map segments) so that each class has good place-specific training images for a visual place classifier (VPC). Existing approaches to MS implicitly or explicitly assume that map segments have a certain size, or that individual map segments are balanced in size. However, recent VPR systems have shown that very small, important map segments (minimal map segments) often suffice for a VPC, and that the remaining large, unimportant portion of the map should be discarded to minimize map maintenance cost. Here, a new MS algorithm that can mine minimal map segments from a large view-sequence map is presented. To solve the inherently NP-hard problem, MS is formulated as a video-segmentation problem and the efficient point-trajectory-based paradigm of video segmentation is used. The proposed map representation was implemented with three types of VPC: deep convolutional neural network, bag-of-words, and object class detector, and each was integrated into a Monte Carlo localization (MCL) algorithm within a topometric VPR framework. Experiments using the publicly available NCLT dataset thoroughly investigate the efficacy of MS in terms of VPR performance.


I. INTRODUCTION
In visual place recognition (VPR), map segmentation (MS) is a preprocessing method used to partition a given view-sequence map into place classes (i.e., map segments), so that each class has good place-specific training images for a visual place classifier (VPC). This MS problem has become an important research topic in the community of robotic mapping and localization [1]-[3], because of the growing interest in VPCs (e.g., deep VPC [4]). Current approaches to MS implicitly or explicitly assume that map segments have a certain size, or that individual map segments are balanced in size [5]. However, recent VPR systems [6] have shown that very small, important map segments (minimal map segments) often suffice for a VPC, and that the remaining large, unimportant portion of the map should be discarded to minimize map maintenance cost. Such important and unimportant map segments are very unbalanced in size, which makes it more difficult to apply the existing MS approaches.
In this work, a new MS algorithm is presented that can mine minimal map segments from a large view-sequence map (Fig. 1). To solve the inherently NP-hard problem, MS was formulated as a video-segmentation problem, and the efficient point-trajectory-based paradigm of video segmentation was utilized. The MS task consists of online and offline sub-tasks. Online, a mapper robot navigates the target environment while incrementally constructing a point trajectory graph in real time by integrating per-frame optical flows [7] and object proposals [8]. Offline, it aims to partition the graph into important minimal map segments and the remaining large, unimportant portion. Thus, the resulting segments should be very unbalanced in size. The predominant approach, spectral clustering-based point trajectory segmentation [9], relies on the assumption of balanced segment sizes, and is thus not suited for such unbalanced size settings. To address this issue, we adopt the minimum cost multicut algorithm derived from the field of image/video segmentation [10]-[12]. This makes it possible to specify not only positive but also negative affinities between important and unimportant segments, and to avoid joining a small segment into a large neighboring segment.
Furthermore, VPCs were implemented using the mined map segments (i.e., place classes), and the VPR performance was investigated thoroughly. Specifically, three case studies were conducted on three different VPCs, by plugging each into a Monte Carlo localization algorithm [13] within a topometric localization framework [14]. In the first case, a deep convolutional neural network (CNN) [15] was introduced as a VPC by treating each map segment as (a set of) place-specific training images. In the second case, state-of-the-art bag-of-words (BOW)-based loop closure detection [16] was introduced as an appearance-based VPC by treating each map segment as place-specific visual words. In the third case, an object class detector (OCD) technique [8] was introduced as a segment class detector by treating each map segment as class-specific training images. For training, a self-supervised learning method is proposed, by which the object bounding boxes (OBBs) of the training images can be automatically annotated. The developed VPC systems were evaluated in challenging cross-season VPR scenarios [17], using the publicly available NCLT dataset [18]. The experimental results showed that the proposed approach frequently achieved VPR performance comparable to the state-of-the-art approaches even though it only used minimal map segments.

II. APPROACH
Fig. 2 illustrates the overview of our VPR system. As shown, the MS system consists of online and offline sub-tasks. In the online sub-task, the mapper robot incrementally constructs a point trajectory graph (Section II-A), while it navigates the target environment by integrating per-frame optical flows and object proposals (Section II-B). In the offline sub-task, it partitions the constructed graph into small important segments (i.e., place classes) and unimportant large segments to discard (Section II-C). Because the proposed map representation maintains topological information only for such a small portion of the map, the topometric localization framework [14] was adopted rather than metric or topological localization (Section II-D).

A. Graph-Based Map Segmentation
In our graph-based MS framework, a given view-sequence map is interpreted as a point trajectory graph $G = (V, E)$. $V$ is a set of vertices, each of which represents a point trajectory or an object proposal. $E$ is a set of weighted edges, each of which represents the affinity between a vertex pair. Then, the MS problem is formulated as partitioning the graph into an optimal number of segments by minimizing the overall cost in terms of the edge weights $c_e$:
$$y^* = \arg\min_{y \in Y} \sum_{e \in E} c_e y_e, \qquad (1)$$
where $Y \subseteq \{0, 1\}^E$ is the set of all possible multicuts and $y_e = 1$ indicates that edge $e$ is cut. Note that the number of map segments $C$ is obtained as the number of connected components in the resulting multicut $y^*$. Vertices of $V$ are based on spatial-temporal curves called point trajectories. This design choice is motivated by the fact that such point trajectories are reliably estimated by optical flow techniques (Section II-B). Moreover, recent research on VPR [6], motion segmentation [19], and novelty detection [20] showed that such a point trajectory is often a stable part of the environment (e.g., landmarks).
Weights of edges in $E$ can be either positive or negative. Positive edges connect vertices that should be joined in graph partitioning. Negative edges are those that should be cut. As shown in the video segmentation literature, positive edges are useful for segmenting out important foreground objects (e.g., landmarks) from the background (e.g., Fig. 4a top). However, recent works on VPR such as [6] showed that not all parts of a foreground object are equally salient (e.g., Fig. 4a middle) and that many scenes have no foreground object (e.g., Fig. 4a bottom). In such general cases, negative edges are useful for dividing foreground/background regions into small salient subregions (i.e., landmarks) and the remaining large non-salient subregions. Consequently, both positive and negative edges are useful and are maintained in our approach (Section II-B).
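To make the formulation concrete, the following minimal sketch evaluates a candidate multicut on a toy graph, assuming the standard multicut objective (the sum of edge weights over cut edges). The vertex names, weights, and helper functions are hypothetical and are not taken from the paper's implementation.

```python
def multicut_cost(edges, y):
    """Return sum_e c_e * y_e for edges given as (u, v, c_e) and labels y_e in {0, 1}."""
    return sum(c * cut for (_, _, c), cut in zip(edges, y))


def segments_from_multicut(vertices, edges, y):
    """Label each vertex with the connected component it falls into once cut edges are removed."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for (u, v, _), cut in zip(edges, y):
        if not cut:  # uncut edges keep their endpoints in the same segment
            parent[find(u)] = find(v)
    return {v: find(v) for v in vertices}


# Toy path graph (hypothetical values): one negative edge separates two pairs.
vertices = ["t1", "t2", "t3", "t4"]
edges = [("t1", "t2", 2.0), ("t2", "t3", -1.5), ("t3", "t4", 1.0)]
y = [0, 1, 0]  # cut only the negative edge

cost = multicut_cost(edges, y)
labels = segments_from_multicut(vertices, edges, y)
num_segments = len(set(labels.values()))  # the number of map segments C
```

On a path graph every labeling is a valid multicut; on graphs with cycles, a labeling is valid only if no cycle contains exactly one cut edge.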

B. Incremental Graph Construction
The mapper robot incrementally builds the point trajectory graph by incorporating real-time image measurements during the navigation. This is realized by two sub-tasks. One is incremental estimation of graph vertices. The other is incremental evaluation of the affinity between each vertex pair.
For trajectory estimation, the KLT tracker [7] was adopted. KLT is one of the most widely used optical flow estimation techniques in robotics [21] and autonomous driving [22]. Formally, $N = 1{,}500$ features are created at the initial frame and tracked over successive frames. When a few $N'$ ($\ll N$) features are lost (because of occlusions, limited field of view, or background clutter), $N'$ new features are initialized, and the lost features are replaced by the new ones.
For affinity evaluation, a semantic cue from the object class detector is used. In preliminary studies [23]-[25], the effectiveness of other possible cues (color, spatial, and motion cues) was also evaluated for affinity evaluation. Color is an effective cue for foreground object detection [23], but it is often neither invariant nor consistent under varying outdoor illumination conditions. Spatial cues, or the distance between object locations, are useful for image segmentation [24], but determining an appropriate threshold according to individual object sizes is inherently a difficult problem, which makes the graph segmentation unstable. Motion is an effective cue for motion segmentation [25], but it is difficult to segment the relative motions of static objects (i.e., map segments) in our application domain of MS. On the other hand, object bounding boxes (OBBs) from general-purpose object class detectors [8], [26], [27] provide stable semantic cues to join or separate segments. Moreover, one can expect OBBs to provide an additional spatial cue; that is, two point trajectories belonging to the same OBB can be considered to be spatially close to each other. Such an OBB-based semantic cue was recently used to enhance point feature matching in a different context of 3D reconstruction [28].
Based on the above consideration, semantic OBB measurements from the object class detector are used for affinity evaluation. More formally, an edge is inserted between a newly arrived OBB (node) and each point trajectory (node) that belongs to the OBB. Whether a trajectory being tracked belongs to a newly arrived OBB can be easily checked using a few simple arithmetic and logical operations. For simplicity, the affinity value of every edge is set to 1. Although typical object detectors provide not only OBBs but also predictions of their object classes, it was decided not to use this additional semantic cue, because even state-of-the-art object detectors frequently fail to predict correct class labels, especially for nearly unseen objects. Following this decision, a tiny YOLO detector [29] was chosen, because it provides rapid class-agnostic OBB detection.
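The membership check and edge insertion described above can be sketched as follows. This is a minimal illustration that assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) in pixels and the current image position of each tracked trajectory; the function names and data layout are hypothetical.

```python
def point_in_box(point, box):
    """Check with simple comparisons whether an image point lies inside an axis-aligned box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def add_obb_edges(edges, obb_id, box, tracked_points, affinity=1.0):
    """Append one (OBB, trajectory, affinity) edge per trajectory currently inside the box.

    Edges only ever connect an OBB node to a trajectory node, so the graph stays bipartite.
    """
    for traj_id, point in tracked_points.items():
        if point_in_box(point, box):
            edges.append((obb_id, traj_id, affinity))
    return edges


# Hypothetical frame: three tracked trajectories and one newly arrived OBB.
tracked = {"t0": (10, 10), "t1": (50, 60), "t2": (200, 5)}
edges = add_obb_edges([], "obb0", (0, 0, 100, 100), tracked)
```

Because every affinity is fixed to 1 here, the per-edge cost carries no detector confidence; a different weighting would be a separate design choice.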

C. Graph Partitioning
The remaining problem is how to solve the optimization problem in Eq. 1 to partition the graph into the optimal number of segments. A natural optimization approach would be to apply spectral clustering [10], [30]-[33] or its recent variants such as multi-label graph-cut [34] or unbalanced energy [35]. Although these methods can easily specify which trajectories should belong to the same segment, they do not specify which should be separated. Therefore, they are not suited for sub-graphs of unbalanced size. However, recently developed minimum-cost multicut approaches (e.g., image segmentation [36]-[38], pixel graphs [39], motion trajectory segmentation [12], and pedestrian tracking [40]) can explicitly represent not only positive but also negative affinities between vertices, which act as a repulsive force between segments [9]. More formally, we adopt the heuristic in [41], which partitions the graph with complexity $O(n^2 \log n)$. Importantly, the sizes of segments can be controlled by subtracting a bias $c_o$ from the edge weights (i.e., $c_e \leftarrow c_e - c_o$) prior to MS. In this study, the bias $c_o$ was set to the 20% highest weight over all the edges in the graph, which yields map segments approximately 1/5 the size of the original map, as demonstrated in the experiments in Section III.
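The effect of the bias can be illustrated with a small greedy sketch. The code below uses greedy additive edge contraction, a simple heuristic for the multicut problem; whether it matches the heuristic of [41] is an assumption, and the toy weights and function names are hypothetical. It is written for clarity and does not attain the stated complexity.

```python
def subtract_bias(weights, c_o):
    """Apply c_e <- c_e - c_o to every edge weight."""
    return {edge: c - c_o for edge, c in weights.items()}


def greedy_additive_contraction(vertices, weights):
    """Greedily merge the endpoints of the most positive edge until none is positive.

    weights maps frozenset({u, v}) -> c_e (no self-loops). Returns vertex -> segment label.
    """
    label = {v: v for v in vertices}
    members = {v: {v} for v in vertices}
    w = dict(weights)
    while w:
        edge, best = max(w.items(), key=lambda kv: kv[1])
        if best <= 0:
            break  # joining any remaining pair would not lower the cost
        u, v = tuple(edge)
        del w[edge]
        # Redirect edges of v to u, summing the weights of parallel edges.
        for other in [e for e in w if v in e]:
            (x,) = other - {v}
            key = frozenset((u, x))
            w[key] = w.get(key, 0.0) + w.pop(other)
        for m in members.pop(v):
            label[m] = u
            members[u].add(m)
    return label


vertices = ["a", "b", "c", "d"]
weights = {
    frozenset(("a", "b")): 3.0,
    frozenset(("b", "c")): 1.0,
    frozenset(("c", "d")): 2.5,
    frozenset(("a", "d")): 0.5,
}
unbiased = greedy_additive_contraction(vertices, weights)
biased = greedy_additive_contraction(vertices, subtract_bias(weights, 1.2))
```

With the original weights all four vertices end up in one segment, whereas subtracting a bias of 1.2 turns the two weak edges negative and leaves two smaller segments, {a, b} and {c, d}.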

D. VPR with Mined Minimal Map Segments (MMMs)
VPCs were implemented using the mined map segments (i.e., place classes) and each was integrated into the MCL [13] within the topometric localization framework [14]. For simplicity, the drift-free motion model [42] was assumed, and the number of particles $N = D/D_o$ was set according to the map size in terms of travel distance D = 100 m normalized by a constant $D_o = 1.0$. At the initialization step $t = 0$, the $N$ particles $\{p_i\}_{i=1}^{N}$ are uniformly distributed over the entire map trajectory with travel distance $D$, as in previous research [43]. At each time step, the MCL steps of motion update and perception update are performed. The confidence score $\Delta L(p)$ for ego-location hypothesis $p$ output by the VPC is normalized to ensure $\sum_p \Delta L(p) = 1$, and is used to update the likelihood $L(p)$ in the form $L(p) \leftarrow L(p) + \Delta L(p)$.
VPR performance is evaluated by a ranked list of the ego-location hypotheses at the goal location with respect to the ground truth (i.e., GPS). In the spirit of Monte Carlo simulation, the robot navigation with MCL is iterated for $N' = D/100$ different start locations separated by 100 m, and the resulting $N'$ ranked lists at the goal locations are summarized into the Top-X accuracy performance index (X = 10, 20, 50, 100, and 200), using non-maximum suppression (NMS) [44] to obtain a less redundant hypothesis set. That is, an ego-location hypothesis $p$ is suppressed if a higher-ranked hypothesis $p'$ already occupies the location: $|p - p'| < 10$ [m].
For VPC, three different methods were implemented: deep CNN, BOW, and OCD. The CNN method formulates the VPC as a classification task, and employs a deep VPC with the VGG16 CNN architecture [15]. In this case, each map segment is treated as place-specific training images. For learning and prediction, map images are resized to 256×256 before being input to the CNN. In addition, a different setting is also considered, where the above training images are cropped by the bounding boxes of the class-specific point trajectories before being resized. This variant is termed "PartCNN" and was also tested, as described in Section III. To avoid instability in training, an image from a map segment is excluded from the training set if its bounding box (before being resized) is smaller than 100 pixels in width or in height, for both the CNN and PartCNN.
The BOW method formulates the VPC as a BOW image retrieval task, and it employs the state-of-the-art BOW loop closure detection framework from previous work [16]. In this case, each map segment is treated as place-specific visual words. This BOW framework is based on ORB features [45], the TF-IDF scoring scheme [46], and the ratio test [47] with a novel incremental vocabulary [48], and the retrieval outputs are further refined by island-based place clustering [49] and RANSAC-based geometric verification. In previous work, we studied this BOW framework in a different context of simultaneous mapping and localization [24]. In the current study, it was necessary to slightly modify the framework and implement mapping (i.e., learning) and localization (i.e., prediction) as two separate processes. As in the PartCNN method, a variant, PartBOW, was also considered and tested, where cropped training subimages are used in place of the non-cropped original images.
The OCD method formulates the VPC as an alternative image retrieval task using a bag of segment classes (in place of BOW) as the cue, and it employs the state-of-the-art object class detector from a prior study [8]. In this case, each map segment is treated as class-specific training images. Unlike the pre-trained generic object detector (in Section II-B), a new detector is fine-tuned on the training images to predict the place class directly. The fine-tuning task requires annotations in the form of OBBs. In the present approach, such an OBB can be approximated by the bounding box of the class-specific point trajectories projected onto the image plane. These OBBs can be automatically computed as a byproduct of our graph-based MS. As in the CNN and PartCNN methods, training images with very small bounding boxes are discarded. Once the detector is trained, the bag of segment classes is predicted and then used to index/search an inverted file. The original annotated class labels could be used for indexing, but it was found that the predicted class labels work better in practice. Such segment-class-based indexing is an extremely compact $\log C$-bit (∈ [6,8] in the experiments) representation for a subimage.
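The initialization and perception update can be sketched as below. This is a minimal illustration, assuming particles are indexed by travel distance along the map trajectory and that the VPC returns one non-negative score per particle; the function names are hypothetical.

```python
def init_particles(travel_distance, spacing=1.0):
    """Place N = D / D_o particles uniformly along the map trajectory (positions in meters)."""
    n = int(travel_distance / spacing)
    return [i * spacing for i in range(n)]


def perception_update(likelihood, scores):
    """Normalize VPC scores so they sum to 1, then apply L(p) <- L(p) + dL(p)."""
    total = sum(scores)
    if total <= 0:
        return list(likelihood)  # no evidence in this frame: leave L unchanged
    return [l + s / total for l, s in zip(likelihood, scores)]


particles = init_particles(100.0, 1.0)
updated = perception_update([0.0, 0.0, 1.0], [1.0, 3.0, 0.0])
```

Normalizing the scores before accumulation keeps every frame's total contribution equal to 1, so a frame with uniformly high raw scores does not dominate the likelihood.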
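The suppression rule and the Top-X index can be illustrated with a short sketch. It assumes, for simplicity, that each hypothesis is a scalar position in meters along the trajectory with an associated score; the helper names and sample values are hypothetical.

```python
def nms_1d(hypotheses, radius=10.0):
    """Keep hypotheses in descending score order, dropping any within `radius` of a kept one."""
    kept = []
    for pos, score in sorted(hypotheses, key=lambda h: h[1], reverse=True):
        if all(abs(pos - k) >= radius for k, _ in kept):
            kept.append((pos, score))
    return kept


def top_x_accuracy(ranked_lists, ground_truths, x, tol=10.0):
    """Percentage of runs whose top-x hypotheses contain one within `tol` of the ground truth."""
    hits = 0
    for ranked, gt in zip(ranked_lists, ground_truths):
        if any(abs(pos - gt) < tol for pos, _ in ranked[:x]):
            hits += 1
    return 100.0 * hits / len(ranked_lists)


# Hypothetical run: (position [m], score) pairs at the goal location.
hyps = [(100.0, 0.9), (104.0, 0.8), (250.0, 0.7), (95.0, 0.6)]
ranked = nms_1d(hyps)
top1 = top_x_accuracy([ranked], [255.0], x=1)
top2 = top_x_accuracy([ranked], [255.0], x=2)
```

In this example the hypotheses at 104 m and 95 m are suppressed by the higher-ranked one at 100 m, so the correct hypothesis near 250 m moves up to rank 2.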
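A bag-of-segment-classes index can be sketched as follows. This minimal example assumes that each map image is described by the set of predicted segment class IDs and that retrieval ranks map images by the number of classes shared with the query; the voting score, the rounding of the code length to whole bits, and all names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def bits_per_class(num_classes):
    """Number of whole bits needed to store one of `num_classes` segment class IDs."""
    return math.ceil(math.log2(num_classes))


def build_inverted_file(map_images):
    """Map each segment class ID to the set of map images in which it was predicted."""
    index = defaultdict(set)
    for image_id, classes in map_images.items():
        for c in classes:
            index[c].add(image_id)
    return index


def search(index, query_classes):
    """Rank map images by the number of segment classes they share with the query."""
    votes = Counter()
    for c in set(query_classes):
        for image_id in index.get(c, ()):
            votes[image_id] += 1
    return votes.most_common()


# Hypothetical map: image IDs with their predicted segment class IDs.
map_images = {"img0": [3, 7], "img1": [7], "img2": [5]}
index = build_inverted_file(map_images)
result = search(index, [7, 3])
```

Each posting stores only a class ID, which is what makes the representation compact compared with storing visual-word histograms per subimage.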

III. EXPERIMENTS
The MS approach was demonstrated using the publicly available NCLT dataset [18]. The main goal of the experiments was to evaluate the MS algorithm in terms of the performance of VPR using MMMs.

A. Dataset and Performance Index
The NCLT dataset is a large-scale, long-term autonomy dataset for robotics research collected at the University of Michigan's North Campus by a Segway robotic platform. Recently, this dataset has been widely used in the robotics community as an experimental benchmark for various tasks, such as map merging [50]. The data used in the current study include view image sequences along the vehicle's trajectories acquired by the front-facing camera of the Ladybug3, together with GPS. Specifically, four datasets collected across four different seasons were used: "2012/1/22 (WI)", "2012/3/31 (SP)", "2012/8/4 (SU)", and "2012/11/17 (AU)". The image size was 1232×1616. Fig. 3 shows the experimental environment and examples of viewpoint trajectories in the dataset.
VPR performance was evaluated by Top-X accuracy [%] according to the viewpoint hypotheses at the goal location in MCL. A correct hypothesis is defined as a viewpoint hypothesis whose distance to the ground-truth GPS viewpoint is less than 10 m. For evaluation, test view-sequences whose overlap ratios to seen viewpoints were lower than 80%, or whose overlap ratios to minimal map segments were lower than 10%, were discarded from the test set.

B. Implementation Details
For KLT, features in previous work [51] are employed with maxCorners=200, qualityLevel=0.05, minDistance=5.0. The number of features per frame was set to 1500. For YOLO [8], the dimension of the training network and the batch size were modified to 256×256×3 and 32, respectively. The initial learning rate was 0.001 and was reduced on plateau (by a factor of 0.1, patience 3). Early stopping (patience 10) was used. For VGG16 [15], the batch size, the number of epochs, and the number of validation samples were set to 32, 10, and 10, respectively. For MCL [13], the number of hypotheses was proportional to the map size, and approximately $10^4$. For MS [5], the approach of equal-length subsequences is used as a default baseline method. The proposed graph-construction algorithm yields a bipartite graph rather than a complete graph. That is, no edge exists between point trajectory vertex pairs. This is an important property because dealing with the huge number (e.g., $(10^6)^2$) of trajectory vertex pairs is computationally intractable. As a result of MS, the number of classes was 131, 127, 94, and 119 for the seasons "WI", "SP", "SU", and "AU", respectively, while the number of trajectory vertices per class was 288.8±367.4, 381.4±291.6, 229.5±183.0, and 200.1±114.29, respectively. Fig. 4 shows example results of the proposed MS method for the training set "SU". It can be seen that coherent and salient map segments were successfully obtained. MS was successful even when there was no foreground object (e.g., Fig. 4a bottom), and even when not all parts of foreground objects were salient (e.g., Fig. 4a middle).

C. MS Results
The map-maintenance cost is described by two quantities, $R_i$ and $R_p$. $R_i$ is the number of map images that belong to any mined map segment, normalized by the total number of map images. $R_p$ is the number of pixels that belong to any mined map segment, normalized by the number of pixels in the map images. Whether a pixel on an image belongs to a map segment was checked by using the bounding box of the map segment projected onto the image. The results of the evaluation were $R_i$ = 26.9%, 29.0%, 21.2%, and 28.2%, and $R_p$ = 4.93%, 6.20%, 3.74%, and 7.02% for seasons "WI", "SP", "SU", and "AU", respectively. For those map images that belong to any map segment, the mean and standard deviation of the , and 24.9±24.7, for seasons "WI", "SP", "SU", and "AU", respectively.
Although the MS was performed independently for different seasons, the mined map segments (e.g., landmarks) were expected to be invariant across seasons to some extent. To investigate the amount of such invariance, the similarity of the map segments between a query season and a reference season was evaluated. For simplicity, each map segment was represented as a set of discretized viewpoints, and the Jaccard index of the sets was evaluated between different seasons. For discretization, a grid of cells with size 10 × 10 m was employed and a cell ID was used as the discretized viewpoint. The similarity between each query segment and its most similar reference segment was investigated for all 12 possible query-reference season pairs. The result of the evaluation is summarized as follows. (1) The ratio of query segments with zero similarity values ranged from 0.468 to 0.771 for the 12 season pairs. (2) The maximum, mean, and median similarities of the other segment pairs with non-zero similarity values ranged from 0.601 to 1, 0.202 to 0.367, and 0.149 to 0.333, respectively, for the 12 season pairs.
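The cross-season similarity measure can be sketched as follows, assuming each map segment is given as a list of planar viewpoint coordinates in meters. The function names and sample coordinates are hypothetical.

```python
def discretize(viewpoints, cell=10.0):
    """Map (x, y) viewpoints in meters to the set of grid-cell IDs they fall into."""
    return {(int(x // cell), int(y // cell)) for x, y in viewpoints}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets (0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match_similarity(query_segment, reference_segments, cell=10.0):
    """Similarity between a query segment and its most similar reference segment."""
    q = discretize(query_segment, cell)
    return max((jaccard(q, discretize(r, cell)) for r in reference_segments), default=0.0)


# Hypothetical segments from two seasons.
query = [(1.0, 1.0), (12.0, 3.0), (25.0, 4.0)]
references = [[(5.0, 5.0), (15.0, 5.0)], [(100.0, 100.0)]]
similarity = best_match_similarity(query, references)
```

Here the query occupies three cells and the first reference segment two of them, giving a similarity of 2/3; the second reference segment shares no cell and contributes zero.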

D. MS Performance
Tab. I shows results for cross-season VPR for all the 12 paired live/map seasons. In the table, BOW, CNN, OCD, PartBOW, and PartCNN are the VPR systems using different types of VPC method, as described in Section II-D. MMM-X (X ∈ {BOW, CNN, OCD}) are VPR results using only MMMs. As can be seen, the proposed MMM-BOW and MMM-CNN had comparable performance to those of BOW and CNN, which require the entire original map. In these experiments, the proposed framework successfully captured the stable part of the map (e.g., Fig. 4a) and these stable map segments acted as useful landmarks in VPR. However, MMM-OCD could not perform well in the current experiments. This was mainly the result of the high mis-detection rate of the object class detector. In addition, the cross-season scenario (i.e., trained and tested in different seasons) was very challenging even for the state-of-the-art YOLO detector. Based on the above results, it is concluded that the proposed approach frequently yielded comparable VPR performance to the state-of-the-art VPR methods even though it used only MMMs.

IV. CONCLUSIONS AND FUTURE WORKS
A point-trajectory-based MS framework for mining minimal map segments from a view-sequence map was proposed. A computationally tractable minimum-cost multicut-based MS algorithm was proposed that can specify not only positive but also negative affinities between important and unimportant segments, to avoid joining a small segment into a large neighboring segment. It can take advantage of optical flows and object proposals, which increases computational efficiency in the online task of constructing a point trajectory graph. Our approach has been shown to be effective in providing good place-specific training images for a VPC.
In future work, it is planned to expand the range of MS to include different map models (e.g., bird's eye view maps [1] and 3D maps [52]), which were not considered here. Additionally, the point trajectory model could represent a wide range of segmentation cues, such as the color, spatial, and motion cues discussed in Section II-B. It is planned to explore a more general framework for combining different segmentation cues to improve robustness against noise (e.g., occlusions, limited field of view, or background clutter), as well as other general-purpose image/video segmentation techniques [53]. The proposed MS framework automatically finds a compact set of landmarks (i.e., minimal map segments); however, the compactness might be improved by introducing landmark selection techniques [54].