[CVPR 2018] Dynamic Video Segmentation Network

Elsa Lab
11 min read · Aug 26, 2021

2018 CVPR Full Paper

Demonstration

Keywords

Semantic segmentation, optical flow, decision network, DVSNet, confidence score, adaptive key frame scheduling policy, real-time inference.

Introduction

In recent years, semantic image segmentation has achieved unprecedented performance through the use of deep convolutional neural networks (DCNNs). Accurate semantic segmentation enables a number of applications that demand pixel-level precision, such as autonomous vehicles and surveillance cameras. However, these applications typically have real-time requirements and thus demand high frame rates (fps). Unfortunately, contemporary state-of-the-art CNN models usually employ deep network architectures to extract high-level features from raw data, leading to exceptionally long inference times.

Figure 1: Comparison of frames at timesteps t and t + 10 in two video sequences.

In a video sequence, it is unnecessary to reprocess every single pixel of each frame with these deep semantic segmentation models. When comparing two consecutive frames, a large portion of them is typically similar. Fig. 1 illustrates that only a small portion of the frames is noticeably different (highlighted by red rectangles), implying that a large portion of the feature maps between these frames is invariant or varies only slightly. Performing complex semantic segmentation on the entire video frame is therefore potentially a waste of time. By keeping or slightly modifying the feature maps of the portions with minor frame differences, and performing semantic segmentation only on the rest, we may achieve better efficiency and lower latency in video semantic segmentation than per-frame approaches.

Figure 2: Using different CNNs for different video scenes.

Another perspective for accelerating semantic video segmentation is to leverage the temporal correlations between consecutive frames. Consecutive video frames that do not change rapidly have similar high-level semantic features. On the other hand, frames containing multiple moving objects exhibit disparate feature maps at different timesteps. Fig. 2 illustrates an example of such scenarios. Fig. 2 (a) shows semantic segmentation performed on a highway, which contains fewer objects and thus results in fewer changes between consecutive segmented images. Fig. 2 (b), on the contrary, corresponds to a video sequence taken from a local street, which contains dozens of moving objects. The former suggests reusing the extracted features and updating them with as few computations as possible (e.g., by a shallower CNN), while the latter requires performing highly accurate semantic segmentation on every single frame (e.g., by a deeper CNN).

Based on the above observations, we propose a new network architecture, called dynamic video segmentation network (DVSNet), to adaptively apply two different neural networks to different regions of the frames, exploiting spatial and temporal redundancies in feature maps as much as possible to accelerate the processing speed. One of the networks is called the segmentation network, which generates highly accurate semantic segmentations, but is deeper and slower. The other is called the flow network. The flow network is much shallower and faster than the segmentation network, but its output requires further processing to generate estimated semantic segmentations (which might be less accurate than the ones generated by the segmentation network).

To define a systematic policy for efficiently assigning frame regions to the two networks while maintaining flexibility and customizability, we further propose two techniques:

  1. Adaptive key frame scheduling policy:
    This technique determines whether an input frame region is processed by the segmentation network or not. An expected confidence score is evaluated for each frame region; it reflects the confidence that the flow network can generate a result similar to that of the segmentation network. The higher the expected confidence score, the more likely the segmentation generated by the flow network is to be similar to that of the segmentation network. If a region's expected confidence score is higher than a predefined threshold, the region is processed by the flow network. Otherwise, it is allocated to the segmentation network.
  2. Decision network (DN):
    The function of DN is to determine whether an input frame region has to traverse the segmentation network, by estimating the region's expected confidence score.
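The scheduling decision described above reduces to a simple threshold test. The sketch below is only an illustration; the function name and the example threshold value are assumptions, not taken from the released code:

```python
def route_region(expected_confidence, threshold=0.85):
    """Decide which DVSNet path a frame region takes.

    A high expected confidence means the fast flow network is likely
    to reproduce the segmentation network's output, so the region is
    sent down the spatial warping path; otherwise it is re-segmented
    and becomes the new key frame region.
    """
    if expected_confidence >= threshold:
        return "spatial_warping_path"  # fast flow network + warping
    return "segmentation_path"         # slow but accurate segmentation

# A static background region vs. a region with newly appearing objects
assert route_region(0.95) == "spatial_warping_path"
assert route_region(0.40) == "segmentation_path"
```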

Methodology

Overview of the Proposed Framework

Figure 3: DVSNet framework. Iᵢ represents the current frame, Iₖ represents the key frame.

The framework of DVSNet is illustrated in Fig. 3. The DVSNet framework consists of three major steps.

  1. Dividing the input frames into four frame regions.
  2. DN analyzes the frame region pairs between Iᵢ and Iₖ, and evaluates the expected confidence scores of the four regions separately. DN compares the expected confidence score of each region against a predetermined threshold. If the expected confidence score of a region is lower than the threshold, the region is sent to the segmentation path (i.e., the segmentation network). Otherwise, it is forwarded to the spatial warping path, which includes the flow network.
  3. Frame regions are forwarded to different paths to generate their regional semantic segmentations.
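The three steps above can be sketched as a single inference routine. This is a minimal NumPy illustration, not the authors' implementation: the networks are passed in as opaque callables, and only the 2×2 region split, the routing decision, and the reassembly are made concrete.

```python
import numpy as np

def split_into_regions(frame):
    """Step 1: divide a frame (H, W, ...) into four equal regions."""
    h, w = frame.shape[0] // 2, frame.shape[1] // 2
    return [frame[:h, :w], frame[:h, w:], frame[h:, :w], frame[h:, w:]]

def merge_regions(regions):
    """Step 3: reassemble the four regional segmentations."""
    top = np.concatenate(regions[:2], axis=1)
    bottom = np.concatenate(regions[2:], axis=1)
    return np.concatenate([top, bottom], axis=0)

def dvsnet_step(frame, key_frame, key_seg, expected_confidence,
                seg_net, warp_path, threshold):
    """Step 2: route each region pair by its expected confidence score."""
    outputs = []
    for cur, key, key_s in zip(split_into_regions(frame),
                               split_into_regions(key_frame),
                               split_into_regions(key_seg)):
        if expected_confidence(cur, key) >= threshold:
            outputs.append(warp_path(cur, key, key_s))  # spatial warping path
        else:
            outputs.append(seg_net(cur))                # segmentation path
    return merge_regions(outputs)
```

In the real framework, `warp_path` runs the flow network and warps the key frame's regional segmentation, and `seg_net` also updates the key frame region; both are elided here for brevity.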

Adaptive Key Frame Scheduling

Figure 4: Different key frame scheduling policies.

Fig. 4 illustrates the key frame scheduling policies used by DFF [1] and DVSNet. DFF adopts a fixed update period, as shown in Fig. 4 (a), which is predetermined and does not take quality and efficiency into consideration. It is more efficient to process a frame sequence of similar contents with a longer update period, as the spatial warping path by itself is sufficient to produce satisfactory outcomes. On the other hand, when the scene changes dramatically, using the segmentation network is more reasonable. As a result, DVSNet introduces an adaptive key frame scheduling policy based on DN and the expected confidence score. The adaptive key frame scheduling policy is illustrated in Fig. 4 (b), in which the update period is not fixed but determined according to the expected confidence score of each region. DN determines when to update the key frame region r by evaluating whether the output of the flow network can generate a satisfactory regional segmentation Oʳ. If Oʳ is expected to be close to the output of the segmentation network Sʳ, the estimated flow is forwarded to the spatial warping function to generate Oʳ. Otherwise, the current frame region is sent to the slower segmentation network, and the key frame region is updated.

We define a metric, called confidence score, to represent the ground truth agreement between Oʳ and Sʳ, while the expected confidence score is an estimate of it evaluated by DN. The mathematical form of the confidence score is defined as follows:

confidence score = (1/P) Σₚ C(Oʳ(p), Sʳ(p))

where P is the total number of pixels in frame region r, p is the index of a pixel, Oʳ(p) is the class label of p predicted by the spatial warping path, Sʳ(p) is the class label of p predicted by the segmentation path, and C(u, v) is a function which outputs 1 only when u equals v, and 0 otherwise.
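Because C(u, v) is just a pixel-wise equality check, the confidence score is simply the fraction of pixels on which the two paths agree. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def confidence_score(warped_labels, seg_labels):
    """Ground truth confidence score of a frame region.

    warped_labels: per-pixel class labels Oʳ(p) from the spatial
    warping path; seg_labels: per-pixel class labels Sʳ(p) from the
    segmentation path. Returns the fraction of agreeing pixels.
    """
    return np.mean(warped_labels == seg_labels)

# Two 2x2 label maps that disagree on one of four pixels
o = np.array([[1, 2], [3, 4]])
s = np.array([[1, 2], [3, 0]])
assert confidence_score(o, o) == 1.0
assert confidence_score(o, s) == 0.75
```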

Given a target threshold t, DN compares its expected confidence score against t. If it is higher than t, Oʳ is considered satisfactory and the region traverses the spatial warping path. Otherwise, the region is forwarded to the segmentation path. An advantage of the proposed adaptive policy is that the target threshold t is adjustable. A lower t leads to lower accuracy but higher fps, as more input frame regions traverse the shorter spatial warping path. On the other hand, a higher t results in higher accuracy, trading speed for quality.

Frame Region Based Execution

Figure 5: Confidence score versus time for the frame regions and the entire frame.

We provide an analytical example to justify the proposed frame region based execution scheme. Fig. 5 plots the confidence score versus time for different frame regions as well as for the entire frame, for a video sequence extracted from the Cityscapes dataset. The curves smoothed by averaging the data points over 15 timesteps are highlighted in solid colors, while the raw data points are plotted in light colors. It can be seen that the confidence score of the entire frame does not fluctuate noticeably over time. However, the confidence scores of the individual frame regions show significant variations most of the time. Some frame regions exhibit high confidence scores for long periods of time, indicating that those portions of the frame change slowly during the period. For such scenarios, it is not necessary to feed the entire frame to the segmentation network.

DN and its Training Methodology

Figure 6: The network model of DN and its training methodology.
Figure 7: Different feature maps for training DN.

Fig. 6 illustrates the network model of DN as well as its training methodology. DN takes as input the feature maps from one of the intermediate layers of the flow network; the feature maps fed into DN are allowed to come from any of the flow network's layers, as plotted in Fig. 7. DN is trained to perform regression: in the training phase, it learns to predict an expected confidence score for a frame region as close as possible to the ground truth confidence score defined in the previous section. In the testing phase, the ground truth confidence score is accessible to neither DN nor the flow network.
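The regression objective can be illustrated with a toy example. Here a linear head stands in for the real convolutional DN, and the feature dimensionality, learning rate, and step count are arbitrary choices of ours; only the training signal matches the paper, i.e., squared error against ground truth confidence scores:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(32, 64))   # stand-in flow-network feature vectors
targets = rng.uniform(size=32)         # ground truth confidence scores in [0, 1]
weights = np.zeros(64)

for _ in range(200):                   # plain gradient descent on MSE
    residual = features @ weights - targets
    weights -= 0.1 * features.T @ residual / len(targets)

mse = np.mean((features @ weights - targets) ** 2)
assert mse < np.mean(targets ** 2)     # regression loss decreased from zero init
```

At test time the trained predictor replaces the inaccessible ground truth score in the threshold comparison.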

Experimental Results and Analyses

We perform experiments on the Cityscapes dataset. In our experiments, we pre-trained three semantic segmentation models, DeepLab-Fast, PSPNet, and DeepLab-v2, as our baseline models for the segmentation network. DeepLab-Fast is a modified version of DeepLab-v2 [2], while PSPNet and DeepLab-v2 are reproduced from [3] and [2], respectively. The results of the baseline segmentation models on the Cityscapes dataset are summarized in Table 1. We further pre-trained FlowNet2-S and FlowNet2-s to serve as our baseline models for the flow network in DVSNet. These two models are reproduced from [4].

Table 1: Comparison of mIoU and fps for various models, where t represents the target threshold of DN.

Validation of DVSNet

Table 1 compares the speed (fps) and accuracy (mIoU) of (DeepLab-Fast, FlowNet2-s), (PSPNet, FlowNet2-s), and (DeepLab-v2, FlowNet2-s) under two different modes: a balanced mode and a high-speed mode. The balanced mode requires that the accuracy of a network be above 70% mIoU, while the high-speed mode requires a frame rate higher than 30 fps. It is observed that the DVSNet framework significantly improves the efficiency of the three baseline models. From Table 1, we conclude that decreasing t reduces the mIoU of the models but increases fps significantly.

Figure 8: Accuracy (mIoU) and frame rate (fps) of various DVSNet configurations under different threshold t.

Fig. 8 shows accuracy (mIoU) versus frame rate (fps) for various DVSNet configurations. We plot six curves in Fig. 8, corresponding to the six possible combinations of the three baseline segmentation network models and the two baseline flow network models. It can be observed that as t increases, the data points of all curves move toward the upper-left corner, yielding higher mIoU but lower fps for all DVSNet configurations. Conversely, when t decreases, the data points move toward the bottom-right corner, indicating that more frame regions pass through the shorter spatial warping path. By adjusting the value of t and selecting the baseline models, DVSNet can be configured and customized to meet a wide range of accuracy and frame rate requirements.

Validation of DVSNet’s Adaptive Key Frame Scheduling Policy

Figure 9: Accuracy (mIoU) versus frame rate (fps) under different key frame scheduling policies. t is the target confidence score threshold. l is the key frame update period in DFF [1]. d is the frame difference threshold. f is the flow magnitude threshold.

Fig. 9 plots a comparison of performance between the fixed key frame scheduling policy and the adaptive key frame scheduling policy. DVSNet, which adopts adaptive scheduling with the expected confidence score, and DFF, which adopts fixed scheduling, correspond to the red and blue curves, respectively. We include additional curves in Fig. 9 to compare the impact of two other decision metrics for the adaptive key frame scheduling policy: frame difference (green curve) and flow magnitude (orange curve).

Frame difference is represented as:

frame difference = (1/P) Σₚ |G(Iᵢ)(p) − G(Iₖ)(p)|

Flow magnitude is represented as:

flow magnitude = (1/P) Σₚ √(u(p)² + v(p)²)

where P is the total number of pixels in a frame or frame region, p represents the index of a pixel, G(∗) is a grayscale operator which converts an RGB image to a grayscale one, and u and v represent the horizontal and vertical components of the estimated optical flow, respectively.
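Both alternative metrics are cheap per-pixel averages. A minimal NumPy sketch (function names and the grayscale weights are our assumptions; ITU-R BT.601 luma coefficients are used for G(∗)):

```python
import numpy as np

def frame_difference(frame_i, frame_k):
    """Mean absolute grayscale difference between the current frame
    region and the key frame region (RGB inputs assumed)."""
    gray = lambda img: img.astype(np.float64) @ [0.299, 0.587, 0.114]
    return np.mean(np.abs(gray(frame_i) - gray(frame_k)))

def flow_magnitude(u, v):
    """Mean per-pixel magnitude of the optical flow field (u, v)."""
    return np.mean(np.sqrt(u ** 2 + v ** 2))

black = np.zeros((2, 2, 3))
white = np.full((2, 2, 3), 255.0)
assert frame_difference(black, black) == 0.0
assert abs(frame_difference(black, white) - 255.0) < 1e-9
assert flow_magnitude(np.full((2, 2), 3.0), np.full((2, 2), 4.0)) == 5.0
```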

Fig. 9 reveals that using frame difference as the decision metric for the adaptive key frame scheduling policy is inadequate. On the other hand, using the adaptive key frame scheduling policy with either expected confidence score or flow magnitude as the decision metric delivers higher mIoU accuracies than the fixed key frame scheduling policy employed by DFF [1], even at high fps. The curves indicate that DVSNet employing the adaptive scheduling policy with expected confidence score as the decision metric achieves the best performance.

Conclusion

We presented a DVSNet framework to strike a balance between quality and efficiency for semantic video segmentation. The DVSNet framework consists of two major parts: a segmentation path and a spatial warping path. The former is deeper and slower but highly accurate, while the latter is faster but less accurate. We proposed to divide video frames into frame regions, and perform semantic segmentation for different frame regions by different DVSNet paths. We explored the use of DN to determine which frame regions should be forwarded to which DVSNet paths based on a metric called expected confidence score. We further proposed an adaptive key frame scheduling policy to adaptively adjust the update period of key frames at runtime. Experimental results show that DVSNet outperforms contemporary state-of-the-art semantic segmentation models in terms of efficiency and flexibility.

Paper Download

[CVPR 2018]
[arXiv]

Github

[DVSNet]

Please cite this paper as follows:

Y.-S. Xu, T.-J. Fu, H.-K. Yang, and C.-Y. Lee, “Dynamic video segmentation network,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 6556–6565, Jun. 2018.

Reference

[1] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei, “Deep feature flow for video recognition,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 4141–4150, Jul. 2017.
[2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), Apr. 2017.
[3] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 6230–6239, Jul. 2017.
[4] E. Ilg et al., “FlowNet 2.0: Evolution of optical flow estimation with deep networks,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1647–1655, Jul. 2017.


Elsa Lab

ELSA Lab is a research laboratory focusing on Deep Reinforcement Learning, Intelligent Robotics, and Computer Vision.