Knowledge Transfer from Map to DNN: Use of Graph Convolutional Neural Network for Augmenting Visual Robot Self-localization System

Abstract—Graph-based scene models have been receiving increasing attention as flexible and descriptive scene models for visual robot self-localization. In a typical self-localization application, the objects, object features, and object relationships in an environment map are described respectively by the nodes, node features, and edges of a scene graph, which is then matched against a query scene graph by a graph matching engine. However, the overhead for computation, storage, and communication is proportional to the number and feature dimensionality of the graph nodes, and can be significant in large-scale applications. In this study, we observe that the graph convolutional neural network (GCN) has the potential to become an efficient tool to train and predict with a graph matching engine. However, it is non-trivial to translate a given visual feature into a proper graph feature that contributes to good self-localization performance. To address this issue, we introduce a new knowledge transfer (KT) framework, which employs an arbitrary self-localization model as a teacher to train the student, a GCN-based self-localization system. Our KT framework enables lightweight storage/communication by using the teacher's compact output signals as training data. Results on the RobotCar dataset show that the proposed method outperforms existing comparison methods as well as the teacher self-localization system.


I. INTRODUCTION
Graph-based scene models have been receiving increasing attention as flexible and descriptive scene models for visual robot self-localization. In a typical self-localization application, the objects, object features, and object relationships in an environment map are described respectively by the nodes, node features, and edges of a scene graph, which is then matched against a query scene graph by a graph matching engine. Such a scene graph model is sufficiently general and applicable to various types of scene data. For example, in [1], an input scene is segmented into semantic segments, which serve as graph nodes and are connected with their neighbors via graph edges. An alternative representative example is to model a view-sequence as a scene graph whose nodes are image frames and whose edges connect successive image frames [2]. In our experiments, we also focus on such a view-sequence-based scene graph representation (Fig. 1).
This paper is concerned with the scalability of a graph-based representation to large-scale applications, such as long-term map-learning [3]. Firstly, the storage cost for a scene graph is proportional to the number and dimensionality of graph nodes (i.e., #graphs, #nodes per graph, and the dimensionality of node features), and grows rapidly with the environment size. Moreover, the computational cost of the graph matching engine is proportional to the graph size and often requires approximations such as dimension reduction to reach reasonable computation speed. Therefore, a new framework is desired that improves efficiency without sacrificing accuracy of a scene-graph-based self-localization system.

In this study, we observe that the graph convolutional neural network (GCN) [4] has the potential to become an efficient tool to train and predict with a graph matching engine. GCN is a recently developed and one of the most widely used graph neural networks. In GCN, a graph-convolutional operation is introduced to produce graph features, which are then passed to a graph-summarizing operation to produce higher-order graph features. GCN has been successfully applied to various kinds of graph data applications, including chemical reactivity [5] and web-scale recommender systems [6]. The GCN training and prediction process is computationally efficient, with complexity in the order of O(m + n), where m and n are the numbers of edges and nodes.

Our work has been supported in part by JSPS KAKENHI Grant-in-Aid for Scientific Research (C) 17K00361 and (C) 20K12008. The authors are with the Graduate School of Engineering, University of Fukui, Japan. {takedakoji00, tanakakanji}@gmail.com
From the perspective of visual robot self-localization, a non-trivial issue is how a robot can translate a given visual feature into a proper graph feature that contributes to good self-localization performance. This is non-trivial because typical visual features are originally designed for the visual self-localization task and can often deteriorate when directly used as graph features. We tackle this issue by introducing a novel knowledge transfer (KT) framework, which employs an arbitrary self-localization model as a teacher to train the student, a GCN-based self-localization system. Our KT strategy is inspired by the standard KT framework of knowledge distillation [7]. Our feature learning strategy is derived from the field of multimedia information retrieval (MMIR) [8].
Our contributions in this paper are summarized as follows. (1) We are the first to study the use of GCN for augmenting self-localization performance while suppressing computation, storage, and communication costs. (2) We formulate a versatile framework for feature learning by introducing a novel teacher-to-student knowledge transfer model. (3) Results on the RobotCar dataset show that the proposed method clearly outperforms existing comparison methods as well as the teacher self-localization system.

II. RELATED WORK
Visual robot self-localization is one of the most important issues in mobile robotics and has been studied in many different contexts, including multi-hypothesis pose tracking [9], map matching [10], image retrieval [11], and view-sequence matching [3]. Our self-localization scenario is most closely related to the view-sequence matching scenario, which takes a short-term live view-sequence as a query and searches for the corresponding part of the map view-sequence.
Unlike many existing works, the proposed approach formulates self-localization as a classification problem, which (1) partitions the robot workspace into place classes, (2) trains a visual place classifier from class-specific training sets, and (3) predicts the place class of a given query image with the pre-trained classifier. In this line of research, it is straightforward to train a deep convolutional neural network (DCN) as a visual place classifier, as demonstrated in our previous study [12]. More recently, in [13], a DCN was successfully used for visual place classification in the alternative context of 3D point-cloud-based self-localization with the scan-context image representation. However, the current study differs from these existing studies in the following two aspects. (1) We focus on the problem of graph-based view-sequence representation, which can deal with interactions between image frames. (2) We further address knowledge transfer from a teacher self-localization model to the student GCN-based self-localization system.
The graph neural network (GNN) has attracted recent interest in the pattern recognition community as a flexible and efficient model for pattern recognition and machine learning. GCN is the most widely used GNN, generalizing the traditional convolution to graph-structured data. In previous studies, GCN has been successful in applications where traditional DCNs are very inefficient or not applicable [5], [6], [14]. In this study, on the other hand, we revisit the traditionally important application of visual robot self-localization and aim to augment and outperform existing solutions.

A. System Overview
We formulate self-localization as a classification problem [12], which consists of three distinct stages: (1) Place partitioning that aims to partition the robot workspace into a collection of place classes; (2) Mapping (i.e., training) that takes a visual experience with ground-truth viewpoint information [12] collected in the workspace as training data and trains a visual place classifier; (3) Self-localization (i.e., testing) that takes a query graph representing a short-term live view-sequence with length T , and predicts the place class.
We suppose that the original training data is no longer available at the test stage, but only the trained classifier is available, to save the overall long-term storage cost.
We do not rely on any post-verification stage, such as RANSAC post-verification [15], in order to reduce the overall computational burden. Nevertheless, experimental results will show that the proposed framework is surprisingly robust against outliers in measurements.
We use a standard grid-based place partitioning method to define place classes [12]. First, a 2D regular grid is imposed on the robot workspace (i.e., the moving plane). Then, each grid cell is viewed as a place class. It should be noted that the place partitioning could be improved by introducing state-of-the-art place partitioning techniques, as we demonstrated in our recent study [16].

The proposed framework consists of three modules:
1) A scene graph descriptor that translates an input length-T view-sequence to a scene graph.
2) A knowledge transfer module that transfers knowledge from a teacher self-localization model to the student GCN-based self-localization system.
3) A supervised learning module that takes length-T view-sequences as training samples and trains a classifier that predicts the place class given a query sample.
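As a concrete illustration, the grid-based place partitioning can be sketched as follows. This is a minimal sketch: `make_grid_classifier` and its workspace bounds are hypothetical names and values chosen for illustration, not part of the paper's implementation.

```python
def make_grid_classifier(lat_min, lat_max, lon_min, lon_max, rows, cols):
    """Map a (lat, lon) viewpoint to a place-class index on a regular grid.

    Each cell of a rows x cols grid imposed on the workspace bounding box
    is one place class. Bounds here are illustrative placeholders.
    """
    def to_class(lat, lon):
        # Clamp to the last row/column so points on the upper bound stay valid.
        r = min(int((lat - lat_min) / (lat_max - lat_min) * rows), rows - 1)
        c = min(int((lon - lon_min) / (lon_max - lon_min) * cols), cols - 1)
        return r * cols + c
    return to_class

# A 14x17 grid, as in the experiments (Sec. IV), over a toy workspace.
to_class = make_grid_classifier(0.0, 1.4, 0.0, 1.7, 14, 17)
```

Each training image is then assigned the class of the grid cell containing its ground-truth viewpoint.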

B. Graph Matching Engine
These three modules are detailed in the subsequent subsections III-C, III-D, and III-E. We follow a supervised learning procedure to train the scene graph classifier. In the mapping stage, a collection of overlapping sub-sequences with the same length T is sampled from the visual experience and divided into place-class-specific training sets according to the available viewpoint information and the pre-defined place partitioning.
We emphasize that the entire training set can be discarded once the GCN classifier has been trained. Since our framework uses overlapping sub-sequences as training data, the total data size, as well as the number of graph nodes, can be significantly larger than the original training view-sequence. Nevertheless, the training data has no impact on the storage overhead once it has been compressed into the GCN classifier.
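The sampling of overlapping sub-sequences into class-specific training sets can be sketched as follows. This is a simplified sketch, assuming each frame already carries a place-class label from the viewpoint information; `make_training_sets` is a hypothetical helper name, and the rule of dropping sub-sequences that span more than one place follows the experimental setup in Sec. IV.

```python
def make_training_sets(frame_classes, T=10):
    """Sample overlapping length-T sub-sequences from a map view-sequence
    and group them into place-class-specific training sets.

    frame_classes[i] is the place class of frame i. A sub-sequence is
    labeled by the common class of its frames and dropped if it spans
    more than one place class.
    """
    sets = {}
    for s in range(len(frame_classes) - T + 1):
        window = frame_classes[s:s + T]
        if len(set(window)) == 1:  # skip sequences spanning adjacent places
            sets.setdefault(window[0], []).append(list(range(s, s + T)))
    return sets
```

After training, these index lists (and the underlying images) can all be discarded, as only the GCN weights are kept.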

C. Scene Graph Descriptor
We build domain invariance by controlling the length and intervals of the map/query (i.e., training/testing) view-sequences (Fig. 3). Firstly, the same length T is assumed for all training/testing view-sequences, to build invariance across different domains. Moreover, these T frames are selected so that the travel distances between successive frames are approximately equal to a pre-set value, 2[m], to build invariance against the vehicle's ego-motion speed. We will experimentally show that this strategy contributes to good self-localization performance. It should be noted that the GCN theory is not restricted to such homogeneous graphs (with the same size and shape), and extending this approach to deal with heterogeneous graphs is a direction for future research.
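The equal-travel-distance frame selection described above can be sketched as follows, assuming per-frame 2D vehicle positions are available (e.g., from GPS/odometry). `resample_by_distance` is a hypothetical name for illustration.

```python
import math

def resample_by_distance(positions, interval=2.0, T=10):
    """Select up to T frame indices so that the travel distance between
    successive selected frames is approximately `interval` metres.

    positions: list of (x, y) vehicle coordinates, one per image frame.
    """
    picked = [0]
    travelled = 0.0
    for i in range(1, len(positions)):
        (x0, y0), (x1, y1) = positions[i - 1], positions[i]
        travelled += math.hypot(x1 - x0, y1 - y0)
        if travelled >= interval:  # next frame is ~interval metres on
            picked.append(i)
            travelled = 0.0
            if len(picked) == T:
                break
    return picked
```

Applying the same selection rule to both map and query sequences is what makes the resulting graphs comparable regardless of ego-motion speed.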
We build a collection of K different image feature extractors F_1, ..., F_K by combining several image processing techniques, such as NetVLAD [17], the Canny operator [18], depth regression [19], and semantic segmentation [20], as shown in Section III-D. Then, each graph node n = (t, k, f_k[t]) represents an attribute feature vector f_k[t] extracted by the k-th extractor from the t-th image frame. Two types of graph edges, time edges and attribute edges, are employed in our approach (Fig. 3). A time edge e = (t, t+1, k) connects two graph nodes with successive time indexes t, t+1 and the same attribute index k. An attribute edge e = (t, k_1, k_2) connects two graph nodes with different attribute indexes k_1, k_2 and the same time index t.
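The node and edge structure of such a scene graph can be enumerated directly from T and K. A minimal sketch (`build_scene_graph` is a hypothetical helper name):

```python
def build_scene_graph(T, K):
    """Enumerate the nodes and edges of the view-sequence scene graph.

    Nodes are (t, k) pairs: frame index t in [0, T), extractor index k in
    [0, K). Time edges join (t, k)-(t+1, k); attribute edges join
    (t, k1)-(t, k2) for k1 < k2, following Sec. III-C.
    """
    nodes = [(t, k) for t in range(T) for k in range(K)]
    time_edges = [((t, k), (t + 1, k))
                  for t in range(T - 1) for k in range(K)]
    attr_edges = [((t, k1), (t, k2))
                  for t in range(T)
                  for k1 in range(K) for k2 in range(k1 + 1, K)]
    return nodes, time_edges, attr_edges
```

For T = 10 and K = 4 this yields a fixed-size homogeneous graph of 40 nodes, which is what allows all map/query graphs to share one topology.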

D. Knowledge Transfer
We now discuss how a robot can translate input viewimages to graph features that are required by GCN training/testing.
A naive and intuitive approach is to use visual features that were originally designed for visual self-localization tasks directly as graph node features. Designing visual features has been a dominant topic in the recent self-localization literature [17], [21], [22], and many works have proposed compact yet discriminative visual features; recent examples include autoencoder-based [21], GAN-based [22], and CNN-based [17] methods. In particular, NetVLAD [17] is a recently developed and one of the most widely used visual features in computer vision and robotics; it is also used in our experiments as a comparison method.
A concern is that typical visual features are not optimized for graph convolutions. In theory, their good performance in the original applications does not guarantee good performance in GCN-based self-localization. In fact, our experimental results will show that the self-localization performance deteriorates when they are directly used as graph node features in a GCN-based self-localization system.
To address this, we propose to use the class-specific probability distribution vector (PDV) output by the teacher self-localization model as the training data. This strategy is similar in concept to the standard KT approach of knowledge distillation [7]. It should be noted that the PDV representation ensures versatile applicability to a broad range of teacher output signals, including the tf-idf score in bag-of-words image retrieval [23], RANSAC scores in a post-verification stage [24], and the mean average intersection-over-union in an object matching system [25].
We convert a node image I to a graph feature vector by using a teacher self-localization system Y and an image-to-feature translator M. The conversion procedure is as follows: (1) Input the node image I to a teacher self-localization system Y, and obtain the output PDV signal o = Y(I) from the teacher system. (2) Map the output PDV o to a graph feature vector f = M(o). This strategy is similar in concept to our previous study [26], where outputs from a teacher self-localization system are used as visual features for a student self-localization system. The two functions Y and M are detailed in the following.

We build four teacher systems Y_1, Y_2, Y_3, and Y_4 (Fig. 4) by combining four different image filters Z_i(I) (i ∈ [1, 4]) with a single nearest-neighbor (NN) [27] based VLAD matching engine Y_o, in the form:

Y_i(I) = Y_o(Z_i(I)).

An NN matching engine represents each place class by a collection of VLAD descriptors, each of which has been extracted from an image in the class-specific training set. It then computes the image-to-class distance (i.e., dissimilarity) as the distance from a given query VLAD vector to its nearest neighbor among the class-specific VLAD vectors.

The image filters are implemented as follows. Z_1 is a simple identity mapping (Fig. 4 "raw image"). Z_2 is the Canny image filter, which converts an input image to a gradient-emphasized image (Fig. 4 "canny"). Z_3 is a depth-image regressor, trained in an unsupervised manner, which predicts a depth image from a monocular image (Fig. 4 "depth"). Z_4 is a semantic segmentation filter [20] that converts an input image to a semantic label image, whose pixel colors represent the pixel-wise class labels defined in the original color palette of [20] (Fig. 4 "semantic").

We consider and investigate the following four types of mapping functions M_1(·), M_2(·), M_3(·), and M_4(·). M_1 is an identity mapping that can be used only when Z = Z_1 (Fig. 4 "NetVLAD vector").
M_2 is a class-specific distance-value vector, in which each c-th place class is assigned the L2 distance from the query feature to its nearest-neighbor feature in that class (Fig. 4 "match distance"). M_3 uses a ranking function: each c-th place class of a given PDV is assigned a rank value by sorting the PDV elements in ascending order of their probability scores, and the vector of rank values is used as the node feature vector (Fig. 4 "rank"). This ranking-based representation is inspired by recent findings in the field of MMIR [8], in which rank values are used for information fusion across different domains (e.g., individual users' MMIR domains). M_4 differs from M_3 only in that inverse-rank values are used in place of the rank values (Fig. 4 "inverse rank"). This strategy can be viewed as an application of inverse-rank fusion methods [28] to our graph feature fusion task.
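The mapping functions M_2 to M_4 can be sketched in a few lines. This is a minimal illustration with plain Python lists standing in for VLAD descriptors; the function names are hypothetical, and for M_4 the specific inverse-rank convention 1/(1 + rank) is an assumption, since the paper does not spell it out.

```python
import math

def match_distance(query, class_descriptors_per_class):
    """M2: for each place class, the L2 distance from the query feature to
    its nearest neighbour among that class's training descriptors."""
    def l2(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [min(l2(query, d) for d in descs)
            for descs in class_descriptors_per_class]

def to_rank(pdv):
    """M3: replace each class's score by its rank when the PDV is sorted
    in ascending order of probability score (rank 0 = smallest)."""
    order = sorted(range(len(pdv)), key=lambda c: pdv[c])
    rank = [0] * len(pdv)
    for r, c in enumerate(order):
        rank[c] = r
    return rank

def to_inverse_rank(pdv):
    """M4: inverse-rank values; 1/(1 + rank) is one common convention,
    assumed here for illustration."""
    return [1.0 / (1.0 + r) for r in to_rank(pdv)]
```

Because M_3 and M_4 depend only on the ordering of the teacher's scores, they apply unchanged to tf-idf scores, RANSAC scores, or IoU values, which is what gives the PDV representation its versatility.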

E. Self-localization
The training stage of the self-localization system basically follows the standard training procedure for GCN in [4].
A graph is represented as G = (V, E), where V is the set of nodes and E is the set of edges. Let v_i ∈ V denote a node and e_ij = (v_i, v_j) ∈ E denote an edge pointing from v_j to v_i. We define all graphs as undirected graphs, i.e., a special case of directed graphs in which two connected nodes are joined by a pair of edges with inverse directions. The neighborhood of a node v is defined as N(v) = {u ∈ V | (v, u) ∈ E}. Each node v has a feature vector h_v ∈ R^D. The representation of a node v is generated by aggregating its own features h_v and the features h_u (u ∈ N(v)) of the neighboring nodes connected to v via edges, in the following steps. First, each node receives features from its neighbors N(v), which are summarized via a SUM operation. These summarized features are then passed to a single-layer fully connected neural network, followed by a non-linear ReLU transformation, in the form:

h'_v = ReLU( W^T Σ_{u ∈ N(v) ∪ {v}} h_u ),

where W ∈ R^{D×D'} is a weight matrix that applies a linear transformation, and D and D' are the dimensions of the feature vector before and after this operation. The operation at the l-th GCN layer is generalized in the form:

h_v^{(l+1)} = ReLU( (W^{(l)})^T Σ_{u ∈ N(v) ∪ {v}} h_u^{(l)} ).

This operation is applied to all nodes, so that all node features are updated, and it is repeated L times, where L is the number of layers (set to L = 2 in this study). Finally, the features of all nodes are summarized by an averaging operation and passed to a fully-connected (FC) softmax operation:

y = softmax( FC( (1/|V|) Σ_{u ∈ V} h_u^{(L)} ) ),

where h_u^{(L)} is the feature of node u output by the final (L-th) GCN layer. Our implementation is based on the Deep Graph Library with PyTorch backend [4].
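The per-layer update and readout described above can be illustrated in pure Python (the actual implementation uses the Deep Graph Library; this sketch only mirrors the SUM-aggregate, linear-map, ReLU, mean-readout structure, with W given as a D'×D list of rows):

```python
import math

def gcn_layer(h, neighbors, W):
    """One GCN layer as described above: for every node v, SUM the feature
    vectors of N(v) ∪ {v}, apply the linear map W, then ReLU.

    h: list of per-node feature vectors; neighbors: adjacency lists;
    W: weight matrix as a list of rows (output dim x input dim).
    """
    out = []
    for v in range(len(h)):
        agg = list(h[v])                       # start from the node's own features
        for u in neighbors[v]:
            agg = [a + b for a, b in zip(agg, h[u])]
        z = [sum(w * a for w, a in zip(row, agg)) for row in W]
        out.append([max(0.0, x) for x in z])   # ReLU
    return out

def readout_softmax(h, W_fc):
    """Average all node features, apply an FC layer, then softmax."""
    n = len(h)
    mean = [sum(col) / n for col in zip(*h)]
    logits = [sum(w * m for w, m in zip(row, mean)) for row in W_fc]
    mx = max(logits)                           # numerically stable softmax
    exps = [math.exp(x - mx) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]
```

Stacking `gcn_layer` twice (L = 2) and finishing with `readout_softmax` reproduces the forward pass used for place-class prediction.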

IV. EVALUATION
We evaluated the proposed algorithm on the RobotCar dataset [29]. Table I shows the characteristics of the dataset.
For the grid-based place partitioning in III-A, we used a 14×17 grid with a resolution of 0.001 degree in latitude and longitude (approximately 110[m]×70[m] per cell). As a result, the number of training images per place class is 80-90 on average. A place class is eliminated from the training/testing set if the number of images belonging to the class is six or fewer. Every image is cropped to a 1080×800 region to eliminate the regions occluded by the vehicle itself (i.e., 100 pixels from each side and 180 pixels from the bottom). The length of a map/query view-sequence is set to T = 10. The intervals between successive frames are set to approximately 2[m] in travel distance. Sequences that span adjacent places are removed from the training/testing set.
Self-localization performance in terms of top-1 accuracy is shown in Tab. II. As a comparison method, we use NN matching of the NetVLAD descriptor [17], adapting the implementation in [30]. For the dissimilarity measure, the image-to-class distance defined in III-D is reused.
The number of GCN layers is two. The feature dimensionalities of the GCN layers are set to C, 256, 256, and C for a class set of size C. For the node summarization, SUM and ReLU operations are used. The number of epochs, batch size, and learning rate are set to 5, 32, and 0.001, respectively. The GCN runs on a CPU (Intel(R) Xeon(R) Gold 6130 @ 2.10GHz). The self-localization performance is measured by top-1 accuracy.

Table IV shows results for the proposed method with different choices of the image filter Z_i, as well as for the comparison method. From top to bottom, each line corresponds to the image filters Z_1, Z_2, Z_3, and Z_4. Comparing the different combinations of image filters, the combination of Z_1 and Z_4 yielded the best performance. It can be seen that the proposed method outperforms the comparison method for almost all settings considered here.
As an ablation study, we performed two experiments. In the first, we modified the scene graphs by removing all the edges, and then trained and tested the GCN. In the second, we modified the graph topology by removing one of the two types of edges, either the time or the attribute edges. Table III shows the results of the ablation study. It is apparent that graphs with both time and attribute edges work significantly better in almost all cases. Specifically, the use of time edges often contributed to improving robustness against partial occlusions and against illumination changes between the training and test domains. In consequence, the proposed GCN framework was shown to solve a variety of problems by integrating the available cues from the different image filters, as well as from the time and attribute edges.

V. CONCLUSIONS
In this paper, we explored the use of GCN for augmenting visual robot self-localization systems, improving self-localization performance without sacrificing efficiency in computation, storage, and communication. A novel, versatile KT framework was introduced for transferring knowledge from a teacher self-localization model, integrating the available cues from different image filters as well as temporal and spatial contextual information. Results on the RobotCar dataset show that the proposed method clearly outperforms the existing comparison methods as well as the teacher self-localization system. In the scope of this paper, we worked with the view-sequence-based scene graph representation, but other scene graph representations, such as attribute-grammar-based scene graphs [31], may also be used. Another direction is to consider more general heterogeneous scene graphs, to deal with map/query scene graphs with variable sizes and shapes.