[IROS 2020] Dynamic Attention-based Visual Odometry
Demonstration Video
Introduction
Dynamic Attention-based Visual Odometry (DAVO) is a learning-based framework for estimating the ego-motion of a monocular camera.
In the past few years, visual odometry (VO) has been an active research domain in computer vision. Visual odometry is the process of determining the position and orientation of an agent by analyzing the associated camera images. Its objective is to derive the ego-motion of a camera, and recent work pursues this with learning-based approaches such as deep convolutional neural networks (DCNNs).
A common observation is that each semantic category in a frame may contribute a different amount of information when it is used for estimating the trajectory of the camera under different motion scenarios (e.g., moving straight, making turns, etc.). For example, cars and pedestrians are usually considered dynamic objects that may harm the performance of ego-motion estimation. However, simply eliminating certain semantic categories may limit the performance of VO models. DAVO therefore dynamically adjusts the attention weights on different semantic categories for different motion scenarios based on optical flow maps.
Related Works
Related works can be summarized into two categories:
(1) Flow-based approaches
(2) Attention-based approaches
First, flow maps can be used as inputs to VO models, since the displacements of pixels (i.e., the movements of objects) between consecutive image frames can be better exploited by these models when estimating ego-motion. For example, P. Muller et al. adopt FlowNet in the VO module of Flowdometry [1]. For feature-based attention methods, attention models are incorporated to adjust the relative weights of feature channels in the pose-estimation DCNNs. In SRNN [2], a guidance attention model is separately applied to the feature channels of the translation and rotation estimation networks. In contrast, DAVO concentrates on generating attention maps for the RGB input frames and flow maps, rather than applying the attention model to the extracted feature embeddings.
Methodology
Overview
In Figure 1, the regions highlighted in red correspond to our Attention Module and PoseNN. The rest of DAVO consists of the segmentation module (SegNN) and the optical flow estimation module (FlowNN). FlowNN generates optical flow maps by predicting the optical flow between consecutive input frames. SegNN performs pixel-level classification, assigning each pixel to one of a predefined set of categories and representing the results as segmentation channels. These segmentation channels are dynamically weighted by our Attention Module to form an attention map. The attention map is then applied to the input RGB frame and the flow map to generate weighted versions of them. Finally, PoseNN takes the weighted RGB frame as well as the weighted flow map as its inputs, and predicts the translation and the rotation of the relative pose through two separate branches, named TransNN and RotNN, respectively.
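To make the data flow concrete, the following is a minimal PyTorch-style sketch of the forward pass described above. It is not the released implementation: the sub-networks (FlowNN, SegNN, the Attention Module, TransNN, and RotNN) are passed in as placeholders, and the choice of which frame is segmented and weighted, as well as the tensor shapes in the comments, are illustrative assumptions.

```python
# Minimal sketch of the DAVO forward pass (illustrative only, not the authors' code).
import torch
import torch.nn as nn

class DAVO(nn.Module):
    def __init__(self, flow_nn, seg_nn, attention_module, trans_nn, rot_nn):
        super().__init__()
        self.flow_nn = flow_nn            # optical flow estimation module (FlowNN)
        self.seg_nn = seg_nn              # semantic segmentation module (SegNN)
        self.attention = attention_module # dynamic attention module
        self.trans_nn = trans_nn          # translation branch of PoseNN
        self.rot_nn = rot_nn              # rotation branch of PoseNN

    def forward(self, frame_t, frame_t1):
        # 1. Optical flow between consecutive frames, e.g. shape (B, 2, H, W).
        flow = self.flow_nn(frame_t, frame_t1)
        # 2. Segmentation channels for the current frame, e.g. shape (B, K, H, W).
        seg = self.seg_nn(frame_t1)
        # 3. Single-channel attention map derived from flow + segmentation.
        attn_map = self.attention(flow, seg)          # (B, 1, H, W)
        # 4. Weight the RGB frame and the flow map with the attention map.
        weighted_rgb = frame_t1 * attn_map
        weighted_flow = flow * attn_map
        pose_input = torch.cat([weighted_rgb, weighted_flow], dim=1)
        # 5. Predict the relative translation and rotation with two branches.
        translation = self.trans_nn(pose_input)       # (B, 3)
        rotation = self.rot_nn(pose_input)            # (B, 3), e.g. Euler angles
        return translation, rotation
```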
Attention Module
This module takes the flow map and the segmentation results as its inputs, and employs an attention network called AttentionNN to generate attention weights for the segmentation channels. Next, the module generates the weighted segmentation map by multiplying the attention weights with the segmentation result in a channel-wise fashion. Lastly, the module produces the attention map by summing the weighted segmentation map over its channels. The attention map not only dynamically preserves the semantic categories, but also highlights the relative importance of different regions.
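A minimal sketch of this channel-wise weighting is shown below, assuming K segmentation categories and a small convolutional AttentionNN that regresses one weight per category; the layer sizes and the sigmoid normalization are illustrative assumptions rather than the paper's exact architecture.

```python
# Illustrative sketch of the Attention Module (not the paper's exact network).
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    def __init__(self, num_categories: int, flow_channels: int = 2):
        super().__init__()
        in_ch = flow_channels + num_categories
        # AttentionNN: predicts one attention weight per segmentation channel.
        self.attention_nn = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, num_categories),
            nn.Sigmoid(),  # keep per-category weights in [0, 1]
        )

    def forward(self, flow: torch.Tensor, seg: torch.Tensor) -> torch.Tensor:
        # flow: (B, 2, H, W); seg: (B, K, H, W) segmentation channels.
        weights = self.attention_nn(torch.cat([flow, seg], dim=1))   # (B, K)
        # Channel-wise multiplication of the weights with the segmentation channels.
        weighted_seg = seg * weights.unsqueeze(-1).unsqueeze(-1)     # (B, K, H, W)
        # Sum over the category channels to form a single-channel attention map.
        return weighted_seg.sum(dim=1, keepdim=True)                 # (B, 1, H, W)
```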
Results
We evaluated DAVO and the baselines on the widely used KITTI Visual Odometry and SLAM benchmark suite, which contains eleven annotated video sequences. The performance on the evaluated trajectories is measured and reported using the relative trajectory error (RTE) metric, whose translational and rotational components are denoted tᵣₑₗ and rᵣₑₗ, respectively. Table 1 compares the evaluation results of DAVO against the baselines. The averaged tᵣₑₗ of DAVO (the bottom row) is slightly higher (by 12.70%) than that of the previous method [3]. However, DAVO delivers a 19.86% lower averaged rᵣₑₗ than the previous method [3] without using any recurrent memory cell.
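For readers unfamiliar with the metric, the snippet below sketches a simplified per-frame relative pose error in NumPy. The official KITTI evaluation additionally averages errors over sub-trajectories of various lengths, so this is only an illustrative proxy for tᵣₑₗ and rᵣₑₗ, not the benchmark's evaluation code.

```python
# Simplified relative pose error between consecutive frames (illustrative proxy
# for the KITTI-style t_rel / r_rel metrics, not the official evaluation script).
import numpy as np

def relative_pose_error(T_gt, T_est):
    """T_gt, T_est: sequences of 4x4 camera-to-world pose matrices."""
    t_errs, r_errs = [], []
    for i in range(len(T_gt) - 1):
        # Relative motion between frame i and i+1 for ground truth and estimate.
        rel_gt = np.linalg.inv(T_gt[i]) @ T_gt[i + 1]
        rel_est = np.linalg.inv(T_est[i]) @ T_est[i + 1]
        err = np.linalg.inv(rel_gt) @ rel_est
        # Translational error: norm of the residual translation.
        t_errs.append(np.linalg.norm(err[:3, 3]))
        # Rotational error: angle of the residual rotation matrix, in degrees.
        cos_angle = np.clip((np.trace(err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
        r_errs.append(np.degrees(np.arccos(cos_angle)))
    return float(np.mean(t_errs)), float(np.mean(r_errs))
```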
As an example, Figure 2 illustrates that the relative importance of the semantic regions in the attention maps may vary when estimating the ego-motion of the camera. By leveraging flow maps, this mechanism enables the attention maps to be derived without human supervision.
In Figure 3, the feature maps extracted from TransNN demonstrate that DAVO correctly focuses on the road for both straight movements and turns. During turns, the feature maps of RotNN concentrate on the sides of the road, enabling DAVO to leverage the changes at the sides of the frames to infer the turning angle.
Conclusion
DAVO is a learning-based framework for estimating the ego-motion of a monocular camera. It is evaluated and compared against other contemporary VO approaches on the KITTI Visual Odometry and SLAM benchmark suite, and achieves state-of-the-art results both quantitatively and qualitatively. As the proposed mechanism, which leverages dynamic attention weights on different semantic categories, has been shown to be effective and beneficial in this work, DAVO offers a promising direction for future attention-based VO research.
Paper Download
Dynamic Attention-based Visual Odometry
Please cite our paper as follows
X.-Y. Kuo, C. Liu, K.-C. Lin, E. Luo, Y.-W. Chen, and C.-Y. Lee, “Dynamic attention-based visual odometry”, in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), Oct. 2020.
Github Link
References
[1] P. Muller and A. E. Savakis, “Flowdometry: An optical flow and deep learning based approach to visual odometry”, in Proc. IEEE Winter Conf. Applications of Computer Vision (WACV), Mar. 2017.
[2] F. Xue, Q. Wang, X. Wang, W. Dong, J. Wang, and H. Zha, “Guided feature selection for deep visual odometry”, in Proc. Asian Conf. Computer Vision (ACCV), Dec. 2018.
[3] F. Xue, X. Wang, S. Li, Q. Wang, J. Wang, and H. Zha, “Beyond tracking: Selecting memory and refining poses for deep visual odometry”, in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Jun. 2019.