[ICCD 2019] A Distributed Scheme for Accelerating Semantic Video Segmentation on An Embedded Cluster

Elsa Lab
Jul 12, 2021

2019 ICCD Full Paper

Demonstration Video

Note: The results generated by the proposed methodology are similar to those of DVSNet [1]. Therefore, we suggest that interested readers refer to DVSNet [1] for more details.

Keywords: DVSNet, edge computing, semantic segmentation, distributed computing, embedded system, embedded cluster, optical flow, hierarchical architecture, decision network, workload distribution.

Introduction

Recent advances in Deep Convolutional Neural Network (DCNN) based semantic video segmentation have substantially improved accuracy. However, these techniques are still not directly applicable to embedded systems because of their long execution latency and heavy computational workloads. Although several approaches for real-time semantic segmentation have been proposed, they usually suffer from accuracy degradation. In addition, these techniques still incur expensive computational workloads, as they are not specifically developed and tailored for embedded processing elements (ePEs). In order to address these issues, we propose a distributed methodology built on top of DVSNet [1] for allocating the heavy computational workloads of semantic video segmentation to a cluster containing multiple ePEs.

In this paper, we implement the proposed methodology on an embedded cluster using a distributed framework. The embedded cluster contains a master ePE and a scalable number of slave ePEs. The master ePE divides video frames into frame regions and dynamically distributes these frame regions to the available slave ePEs. In order to balance the workloads and make the most efficient use of DVSNet's two execution paths, we further propose a global and local key management scheme. The proposed methodology is compatible with contemporary embedded platforms.

Background Material

Semantic Segmentation

Semantic segmentation is one of the key research directions in computer vision (CV). It aims at performing pixel-level predictions (i.e., dense predictions) for an image, assigning a class label to every pixel. The accuracy of a semantic segmentation model is commonly measured by a metric called mean Intersection-over-Union (mIoU).

The mIoU metric averages the per-class Intersection-over-Union over all classes: mIoU = (1/C) · Σ_c TP_c / (TP_c + FP_c + FN_c), where TP_c, FP_c, and FN_c denote the numbers of true positive, false positive, and false negative pixels of class c, and C is the number of classes.

Optical Flow Estimation

Optical flow estimation is a technique for evaluating the motion of objects between a reference image and a target image. It is usually represented as either a sparse or dense vector field, where displacement vectors are assigned to certain pixel positions of the reference image.

Dynamic Video Segmentation Network (DVSNet) (CVPR 2018)

[Medium Link] [CVPR 2018]

DVSNet [1] is a framework that incorporates two distinct DCNNs to enhance the frame rates of semantic video segmentation while maintaining its accuracy. DVSNet achieves this by adaptively processing different frame regions with different DCNNs. The first DCNN is called the segmentation network, which generates highly accurate semantic segmentations but is deeper and slower. The second DCNN is called the flow network, which employs a warping function to generate approximated semantic segmentations and is much shallower and faster than the segmentation network.

DVSNet takes advantage of the fact that different regions in a video sequence experience different extents of change, and thus avoids re-processing every single pixel in consecutive frames. Frame regions with large pixel differences between consecutive frames, where the contents may have changed significantly, have to pass through the segmentation network; otherwise, they are processed by the flow network. In other words, different regions of the same frame may traverse networks of different depths when they are presented to DVSNet. In order to determine whether an input frame region has to traverse the segmentation network, DVSNet further employs a lightweight decision network (DN) to evaluate a confidence score for each frame region. A confidence score lower than a pre-defined decision threshold indicates that the corresponding frame region must be processed by the segmentation network. DVSNet allows the decision threshold for the confidence score to be customized.
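As a concrete illustration of the warping function mentioned above, the following is a minimal sketch (not the actual DVSNet operator) that warps key-frame segmentation logits toward the current frame with a dense flow field using plain bilinear sampling. It assumes the flow field stores per-pixel (dx, dy) displacements used to sample backward from the key frame:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_logits(key_logits, flow):
    """Warp key-frame logits (H x W x C) toward the current frame using a dense
    flow field (H x W x 2). Each output pixel samples the key-frame logits at its
    displaced location with bilinear interpolation (a simplified stand-in for the
    warping function used by the flow network)."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    # Sampling grid: for every target pixel, look up (y + dy, x + dx) in the key frame.
    coords = np.stack([ys + flow[..., 1], xs + flow[..., 0]])
    channels = [map_coordinates(key_logits[..., c], coords, order=1, mode='nearest')
                for c in range(key_logits.shape[-1])]
    return np.stack(channels, axis=-1)
```

Because the warp only resamples existing logits, its cost is negligible compared to running the full segmentation network, which is what makes the flow path attractive for regions that change little between frames.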

Methodology

Specification of the Notations

Table I: The notations used in this paper.

The Master-Slave Hierarchy

Figure 1: The master-slave hierarchy. Different slave ePEs can execute different execution paths.

Overview of the Proposed Framework

Figure 2: An overview of the proposed framework.

The main objective of the framework is to enhance the throughput (i.e., the frame rate) of semantic video segmentation by employing multiple ePEs, while maintaining the mIoU accuracy of the system by exploiting the benefits offered by DVSNet [1]. The framework consists of the following components.

Master ePE

As shown on the left-hand side of Figure 2, the master ePE divides each input frame into four frame regions and allocates each frame region to an available slave ePE via dynamic scheduling, so as to generate the semantic segmentation of that region. Unallocated frame regions are stored in a queue managed by a workload manager, which is responsible for selecting an appropriate slave ePE to process the region at the head of the queue. The master ePE is also responsible for gathering the semantic segmentations of the frame regions that belong to the same frame from the slave ePEs, and assembling them into the final semantic segmentation of the frame.
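The sketch below shows one possible organization of the master-side scheduling and reassembly logic, with a FIFO region queue and a pool of idle slave identifiers. The class and method names are our own placeholders and the transport layer (send_fn) is abstracted away; this is not the paper's implementation:

```python
from collections import deque
from queue import Queue

class WorkloadManager:
    """Master-side scheduler (sketch): frame regions wait in a FIFO queue and are
    dispatched to whichever slave ePE becomes idle first."""

    def __init__(self, slave_ids, num_regions=4):
        self.num_regions = num_regions
        self.region_queue = deque()        # unallocated (frame_id, region_id, region) tuples
        self.idle_slaves = Queue()         # slave ePEs ready to accept work
        for s in slave_ids:
            self.idle_slaves.put(s)
        self.partial_frames = {}           # frame_id -> {region_id: segmentation}

    def submit_frame(self, frame_id, regions):
        # The master divides each frame into four regions before queueing them.
        for region_id, region in enumerate(regions):
            self.region_queue.append((frame_id, region_id, region))

    def dispatch(self, send_fn):
        # Pop the head of the queue and hand it to the next idle slave ePE.
        while self.region_queue:
            slave = self.idle_slaves.get()                 # blocks until a slave is free
            frame_id, region_id, region = self.region_queue.popleft()
            send_fn(slave, frame_id, region_id, region)    # transport layer out of scope

    def collect(self, slave, frame_id, region_id, segmentation):
        # Gather results; once all regions of a frame have arrived, reassemble them.
        self.idle_slaves.put(slave)
        parts = self.partial_frames.setdefault(frame_id, {})
        parts[region_id] = segmentation
        if len(parts) == self.num_regions:
            return [parts[i] for i in range(self.num_regions)]  # ready for stitching
        return None
```

Here the queue is drained strictly in FIFO order for simplicity; the actual dynamic scheduling policy may differ, but the master/slave roles are the same.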

Slave ePEs

Figure 3: The architecture of the decision network (DN).

As shown in Figure 2 and Figure 3, each slave ePE contains three major components: a segmentation path, a flow path, and a DN. All slave ePEs share the same architecture, but different slave ePEs are allowed to take different execution paths.

  1. The Segmentation Path
    The primary function of the segmentation path is to directly generate a high-quality semantic segmentation from the current frame region r at timestep t. It requires a longer processing time, but delivers a higher mIoU accuracy.
  2. The Flow Path
    The flow path estimates the optical flow between the current frame and the key frame for a frame region r. The estimated optical flow, along with the segmentation logits of the key frame, is then processed by a warping function to generate the approximated semantic segmentation. The quality of the result may degrade as the interval between the current frame and the key frame increases.
  3. The Decision Network (DN)
    The decision network is a shallow regression model pre-trained to determine which path a frame region should go through (i.e., either the segmentation path or the flow path). As shown in Figure 3, the DN takes as input the feature map extracted from the front portion of the flow path and evaluates a confidence score, which serves as the reference for selecting the execution path. When the confidence score is greater than the customized threshold, the flow path is set as the execution path; otherwise, the segmentation path is selected to maintain mIoU accuracy. A sketch of this per-region path selection is given after this list.
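The following is a minimal sketch of the per-region execution on a slave ePE under our own naming assumptions; flow_net, decision_net, seg_net, and warp_fn are placeholders for the actual networks and warping function (e.g., the bilinear sampler sketched earlier), and the comparison direction follows the rule described above:

```python
def run_region_on_slave(region, key_region, key_logits,
                        flow_net, decision_net, seg_net, warp_fn, threshold):
    """Per-region execution on a slave ePE (sketch).
    Returns (segmentation_logits, is_new_key)."""
    # The flow network produces both the estimated flow and the intermediate
    # feature map consumed by the decision network (DN).
    flow, features = flow_net(key_region, region)
    confidence = decision_net(features)

    if confidence > threshold:
        # Flow path: approximate the segmentation by warping the key-frame logits.
        return warp_fn(key_logits, flow), False
    # Segmentation path: slower but accurate; its output becomes the new key,
    # so it must be sent back to the master to refresh the global key buffer.
    return seg_net(region), True
```

The is_new_key flag is our shorthand for the update step described in the key management scheme below: only segmentation-path outputs refresh the key buffers.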

The Global and Local Key Management Scheme

Figure 4: The global and local key management scheme.

The primary objective of the scheme is to maintain a key buffer in each ePE, which stores the segmentation logits and the key frames of the four frame regions, so that they can be used when the flow path is selected as the execution path. The key buffer in the master ePE is called the global key buffer, while those in the slave ePEs are called the local key buffers. The global key buffer always keeps the newest key-frame segmentation logits and the newest key frame for each of the four frame regions. The master ePE additionally maintains a key table for each slave ePE to monitor the key-frame timesteps of the four regions stored in that slave's local key buffer.

When a slave ePE is selected by the workload manager to process the current frame region, the master ePE first checks the timestep interval between the current frame and the key frame of that region held by the slave, as recorded in the corresponding key table. If the interval is larger than a given threshold, the master ePE forwards the newest key-frame segmentation logits and key frame of the region to the slave ePE to update the corresponding entry in its local key buffer. This ensures that the slave ePE has sufficiently fresh information to perform the flow path.

If a slave ePE performs the segmentation path for a frame region, the newly generated segmentation logits are transmitted back to the master ePE to update the global key buffer as well as the key table.
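The sketch below illustrates one possible bookkeeping structure for the scheme; all names and the exact staleness test are our assumptions rather than the paper's implementation:

```python
class GlobalKeyState:
    """Master-side bookkeeping for the global/local key management scheme (sketch).
    The global key buffer holds the newest key frame and key logits per region;
    one key table per slave records the key timesteps in that slave's local buffer."""

    def __init__(self, slave_ids, num_regions=4):
        self.global_keys = {r: None for r in range(num_regions)}   # r -> (timestep, key_frame, key_logits)
        self.key_tables = {s: {r: -1 for r in range(num_regions)}  # slave -> region -> key timestep
                           for s in slave_ids}

    def maybe_refresh_slave(self, slave, region_id, current_t, max_interval, send_key_fn):
        """Before dispatching region `region_id` at timestep `current_t` to `slave`,
        forward the newest key if the slave's local key is too stale."""
        local_t = self.key_tables[slave][region_id]
        if current_t - local_t > max_interval and self.global_keys[region_id] is not None:
            key_t, key_frame, key_logits = self.global_keys[region_id]
            send_key_fn(slave, region_id, key_frame, key_logits)   # transport out of scope
            self.key_tables[slave][region_id] = key_t

    def on_new_key(self, slave, region_id, timestep, key_frame, key_logits):
        """When a slave runs the segmentation path, its output becomes the newest key."""
        self.global_keys[region_id] = (timestep, key_frame, key_logits)
        self.key_tables[slave][region_id] = timestep
```

Keeping the key tables on the master means key updates are only transmitted when a slave's local copy is actually stale, which limits the communication overhead between ePEs.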

Experimental Results

We perform our experiments on the Cityscapes dataset, which is composed of urban street scenes from 50 different cities. Table II presents and compares the quantitative results of the proposed methodology with the baseline approaches for a number of system configurations. The baselines are DeepLabv3+ [2], ENet [3], ERFNet [4], and ESPNet [5]. The remaining results are directly measured on our embedded systems. The entry SegPath denotes the baseline case in which the segmentation path is used for processing every input frame region. The entry Single serves as the reference entry, corresponding to the case where the default DVSNet [1] is performed on a single ePE. Each of the remaining entries represents the configuration of one master ePE and the corresponding number of slave ePEs. For example, 1+3 means that one master ePE and three slave ePEs are used. The speedup ratio Speedup is the ratio of an entry's fps to that of Single.

Table II: Comparison of the quantitative results of the baselines and the proposed methodology. The first four rows correspond to the baseline methods, while the remaining seven rows correspond to our methodology. In the last six rows, each entry represents the configuration of one master ePE and the corresponding number of slave ePEs.

Comparison of the Qualitative Results

Figure 5: Performance comparison of mIoU, fps, and Speedup for different configurations.

In Figure 5, it can be seen that for most configurations, increasing the number of slave ePEs tends to deliver higher fps, with only a slight decrease in mIoU. However, configurations 1+1 and 1+6 do not follow this trend, for different reasons.

For configuration 1+1, the decreased fps is primarily caused by the data transmission overhead between the master ePE and the slave ePE. In other words, a single slave ePE does not provide sufficient parallelism for the proposed framework to outweigh the communication latency between the ePEs.

For configuration 1+6, the decrease in fps is due to the fact that more slave ePEs increase the chance that the frame regions allocated to a slave ePE deviate from the key frame regions. As a result, the slave ePEs tend to execute the segmentation path more often, leading to an overall drop in fps.

Figure 5 and Table II also reveal that the mIoU accuracy does not decrease significantly as the number of slave ePEs increases. The decreases in mIoU are due to the fact that more slave ePEs may increase the average timestep interval between the current frame regions and the key frame regions.

Latency Analysis for the Master and Slave ePEs

Figure 6: Latency analysis for the master ePE and the slave ePE.

The latency breakdowns of the master ePE and the slave ePE are presented on the left-hand and right-hand sides of Figure 6, respectively. On the left-hand side of Figure 6, the Idle time of the master ePE decreases drastically as the number of slave ePEs increases. It can be seen that under configuration 1+1, the master ePE spends most of its time waiting for the sole slave ePE to finish its tasks. These observations suggest that more slave ePEs tend to improve the efficiency of the master ePE, and partially validate the fps trend plotted in Figure 5.

For the slave ePEs, it is observed that as the number of slave ePEs increases, the total time each slave ePE spends on the execution paths decreases. This is because the number of frame regions allocated to each slave ePE decreases.

Conclusion

We presented a framework for performing semantic video segmentation on an embedded cluster. We embraced the advantages provided by DVSNet [1], and developed a distributed scheme for allocating different frame regions to different ePEs. The ePEs in the framework are coordinated in a master-slave hierarchy, and are regulated by a global and local key management scheme. Our experimental results demonstrated that the proposed methodology leads to enhanced performance in terms of fps and Speedup, with little degradation in mIoU.

Paper Download

[IEEE] [Download]

Please cite this paper as follows:

H.-K. Yang, T.-J. Fu, P.-H. Chiang, K.-W. Ho, and C.-Y. Lee, “A distributed scheme for accelerating semantic video segmentation on an embedded cluster,” in Proc. Int. Conf. on Computer Design (ICCD), pp. 73–81, Nov. 2019.

Reference

[1] Y.-S. Xu, T.-J. Fu, H.-K. Yang, and C.-Y. Lee, “Dynamic video segmentation network,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 6556–6565, Jun. 2018.
[2] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” arXiv:1802.02611, Mar. 2018.
[3] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation,” arXiv:1606.02147, Jun. 2016.
[4] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, “ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation,” IEEE Trans. Intelligent Transportation Systems, pp. 263–272, Jan. 2018.
[5] S. Mehta, M. Rastegari, A. Caspi, L. G. Shapiro, and H. Hajishirzi, “ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation,” in Proc. European Conference on Computer Vision (ECCV), pp. 561–580, Oct. 2018.


Elsa Lab

ELSA Lab is a research laboratory focusing on Deep Reinforcement Learning, Intelligent Robotics, and Computer Vision.