[ICRA 2021] Reducing the Deployment-Time Inference Control Costs of Deep Reinforcement Learning Agents via an Asymmetric Architecture

Elsa Lab
Jun 21, 2021

2021 ICRA Full Paper

Demonstration Video

[ICRA 2021] Reducing the Deployment-Time Inference Control Costs of Deep Reinforcement Learning Agents via an Asymmetric Architecture (demonstration video).

Introduction

Reinforcement learning (RL) trains an agent to make a sequence of decisions, taking actions in an environment so as to maximize the cumulative reward. Deep reinforcement learning (DRL) combines RL with deep neural networks (DNNs) and has achieved breakthroughs in domains such as games and robotic control.

However, the inference phase of a DNN model is computationally intensive, which is a major concern when DRL is deployed on mobile robots. Although the energy consumption of DNNs can be alleviated by reducing their sizes, smaller DNNs are usually unable to attain performance comparable to larger ones in complex scenarios. On the other hand, the performance of smaller DNNs may still be acceptable in some cases. For example, a small DNN unable to perform complex steering control is still sufficient for handling simple, straight roads. Motivated by this observation, we propose an asymmetric architecture that selects a small DNN to act when conditions allow, while employing a large one only when necessary.

Related Works

A number of knowledge-distillation-based methods have been proposed to reduce the inference costs of DRL agents at deployment time. These methods typically use a large teacher network to teach a small student network such that the latter is able to mimic the behaviors of the former.

In contrast, our asymmetric architecture is based on the concept of hierarchical reinforcement learning (HRL), a framework consisting of a policy over sub-policies and a number of sub-policies that execute temporally extended actions to solve sub-tasks. Previous HRL works have concentrated on using temporal abstraction to deal with difficult long-horizon problems. As opposed to those prior works, our proposed method focuses on employing HRL to reduce the inference costs of an RL agent.

Methodology

Specification of the Notations

Table 1: A summary of all notations used in the paper.

Background: Hierarchical Reinforcement Learning

HRL introduces the concept of ‘options’ into the RL framework, where options are temporally extended actions. Assume that there exists a set of options Ω. HRL allows a ‘policy over options’ π(Ω) to determine an option to execute for a certain amount of time. Each option ω ∈ Ω consists of three components (I, π(ω), β(ω)), in which I ⊆ S is an initiation set, π(ω) is the intra-option policy followed while option ω is active, and β(ω) : S → [0, 1] is a termination function. When the agent is in a state s ∈ I, option ω can be selected by π(Ω), and policy π(ω) is followed until β(ω) signals termination. In this paper, we refer to the ‘policy over options’ as the master policy, and an ‘option’ as a sub-policy. Please refer to Table 1 for the detailed notations used in the paper.
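The snippet below is a minimal sketch of the options abstraction described above, written in Python purely for illustration; the state type, the environment step function, and the sampled termination are placeholders of ours, not the paper's implementation:

```
from dataclasses import dataclass
from typing import Callable
import random

State = int  # placeholder state type, for illustration only

@dataclass
class Option:
    """An option omega = (I, pi_omega, beta_omega)."""
    initiation: Callable[[State], bool]     # whether the option may start in state s (i.e., s in I)
    policy: Callable[[State], int]          # intra-option policy pi(omega)
    termination: Callable[[State], float]   # termination function beta(omega): S -> [0, 1]

def run_option(option: Option, state: State, env_step: Callable[[State, int], State]) -> State:
    # Follow pi(omega) until beta(omega) signals termination.
    while True:
        action = option.policy(state)
        state = env_step(state, action)
        if random.random() < option.termination(state):
            return state
```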

Problem Formulation

The main objective of this research is to develop a cost-aware strategy such that an agent trained by our methodology is able to deliver satisfactory performance while reducing its overall inference costs.

The agent is expected to use the smaller sub-policy as often as possible to reduce its computational costs, unless complex control is required. In order to incorporate inference costs into our cost-aware strategy, we further assume that each sub-policy is cost-bounded. The cost of a sub-policy is denoted as c(ω), where ω represents the sub-policy used by the agent. The reward function is designed such that the agent is encouraged to select the lightweight sub-policy as frequently as possible to avoid being penalized.
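As a concrete illustration of what a cost c(ω) could look like, the sketch below estimates the forward-pass FLOPs of two MLP sub-policies of different widths; the layer sizes are made-up placeholders, not the configurations used in the paper:

```
def mlp_forward_flops(layer_sizes):
    """Approximate multiply-accumulate FLOPs of one forward pass through a fully-connected network."""
    return sum(2 * n_in * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Hypothetical small and large sub-policies for an 8-dimensional state and a 2-dimensional action.
c_small = mlp_forward_flops([8, 16, 16, 2])     # lightweight sub-policy
c_large = mlp_forward_flops([8, 256, 256, 2])   # full-capacity sub-policy
print(c_small, c_large)  # the large network costs far more per inference
```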

Overview of the Proposed Cost-Aware Framework

Figure 1: An illustration of the workflow of the proposed framework.

We employ an HRL framework consisting of a master policy and two sub-policies implemented with DNNs of different sizes. The framework is illustrated in Fig. 1. The master policy first takes in the current state from the environment to determine which sub-policy to use. The selected sub-policy then interacts with the environment for a fixed number of timesteps. The goal of the master policy is to maximize the cumulative reward collected during the execution of the sub-policy. To deal with the data imbalance between the two sub-policies during the training phase, as well as to improve data efficiency, our cost-aware framework trains the sub-policies with an off-policy RL algorithm so that the two sub-policies of different sizes can share a common experience replay buffer.
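The following sketch outlines the interaction loop in Fig. 1 under a few assumptions of ours: a Gym-style environment API, a fixed execution length K for the selected sub-policy, and a single replay buffer shared by the two sub-policies:

```
from collections import deque

K = 5                                   # assumed constant execution length per master decision
replay_buffer = deque(maxlen=100_000)   # shared by both sub-policies (off-policy training)

def rollout_episode(env, master_policy, sub_policies):
    # master_policy maps a state to an index into sub_policies (0 = small, 1 = large).
    state, done = env.reset(), False
    while not done:
        sub_policy = sub_policies[master_policy(state)]   # master selects a sub-policy
        for _ in range(K):                                # the selected sub-policy acts for K steps
            action = sub_policy(state)
            next_state, reward, done, info = env.step(action)
            replay_buffer.append((state, action, reward, next_state, done))  # shared experience
            state = next_state
            if done:
                break
```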

Cost-Aware Training

We apply a regularization term containing the policy cost that penalizes the master policy, encouraging it to choose the small, lower-cost sub-policy as frequently as possible. The regularization term is scaled by a hyper-parameter λ, the cost coefficient. The higher the value of λ, the more likely the master policy is to choose the small sub-policy.
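A hedged sketch of this regularization is given below: the master policy's reward for one decision is taken as the cumulative environment reward collected during the sub-policy's execution segment minus λ times that sub-policy's cost (the exact form and scaling used in the paper may differ):

```
def master_reward(segment_rewards, chosen_cost, lam=1e-3):
    # segment_rewards: environment rewards collected while the chosen sub-policy ran.
    # chosen_cost: c(omega) of the selected sub-policy; lam: the cost coefficient lambda.
    return sum(segment_rewards) - lam * chosen_cost
```

With a larger λ, the penalty on the expensive sub-policy grows, so the master policy is pushed toward the lightweight sub-policy whenever performance allows.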

Experimental Results

Experimental Setup

Table 2: The detailed settings of the proposed methodology.

We verify the proposed methodology on control tasks from the OpenAI Gym benchmark suite and the DeepMind Control Suite, simulated with the MuJoCo physics engine. The two sub-policies of different sizes are implemented as Soft Actor-Critic (SAC) [1] agents, while the master policy is implemented as a DQN agent. Both the master policy and the sub-policies are multilayer perceptrons (MLPs) with two hidden layers but different numbers of neurons. For the detailed settings, please refer to Table 2.
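The snippet below sketches only the network shapes involved, using PyTorch and assumed hidden widths rather than the values in Table 2: two SAC actors sharing the same two-hidden-layer MLP structure but with different widths, and a small DQN head acting as the master policy that chooses between them:

```
import torch.nn as nn

def mlp(in_dim, hidden, out_dim):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

obs_dim, act_dim = 17, 6                   # e.g., a MuJoCo locomotion task (illustrative dimensions)
small_actor = mlp(obs_dim, 16, act_dim)    # lightweight sub-policy
large_actor = mlp(obs_dim, 256, act_dim)   # full-capacity sub-policy
master_q = mlp(obs_dim, 32, 2)             # DQN head over the two choices {small, large}
```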

Baselines

  • Typical RL method:
    We train two policies of different sizes, where the small and large sizes correspond to those of our two sub-policies. These baselines are trained independently from scratch as typical RL methods, without the use of a master policy.
  • Distillation methods:
    Two policy distillation approaches are considered in our experiments: Behavior Cloning (BC) [2] and Generative Adversarial Imitation Learning (GAIL) [3]. For these baselines, the large policy serves as the teacher network that distills its policy into the small one, where the small and large sizes again correspond to those of our two sub-policies (a simplified BC distillation step is sketched after this list).
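For reference, the sketch below shows one behavior-cloning distillation step in the spirit of the BC baseline; it is our simplified reading rather than the exact training code, and assumes deterministic teacher actions and an MSE regression loss:

```
import torch
import torch.nn as nn

def bc_distill_step(student, teacher, states, optimizer):
    # One gradient step of behavior cloning: the small student regresses onto the
    # actions produced by the pre-trained large teacher on the given batch of states.
    with torch.no_grad():
        target_actions = teacher(states)
    loss = nn.functional.mse_loss(student(states), target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```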

Qualitative Analysis

Figure 2: A timeline illustrating the sub-policies used under different circumstances in Swimmer-v3, where the alternating white and yellow dots along the timeline correspond to the small and large sub-policies, respectively.

In Fig. 2, it is observed that the model trained by our methodology tends to use the large sub-policy while performing strokes and the small sub-policy to maintain its posture between two strokes. One reason is that a successful stroke requires many delicate adjustments of each joint, while holding a proper posture for drifting requires only a few joint changes.

Figure 3: A timeline illustrating the sub-policies used under different circumstances in (a) MountainCarContinuous-v0, (b) FetchPickAndPlace-v1, and (c) Walker-stand.

In Fig. 3 (a), the objective of the car is to reach the flag at the top of the hill on the right-hand side. In order to reach the goal, the car has to accelerate forward and backward and then stop accelerating at the top. The figure shows that the large sub-policy is used for adjusting the acceleration, while the small sub-policy is selected only when acceleration is not required.

In Fig. 3 (b), the goal of the robotic arm is to move the black object to a target position. It can be observed that the agent trained by our methodology learns to use the small sub-policy to approach the object, and then switches to the large sub-policy to fetch and move it to the target location. One rationale for this observation is that fetching and moving an object entails fine-grained control of the gripper.

In Fig. 3 (c), the goal of the walker is to stand up and maintain an upright torso. The figure shows that the large sub-policy is used when the applied forces change quickly, whereas the small sub-policy is used when they change slowly.

Performance and Cost Reduction

Table 3: A summary of the performances of the baselines based on the typical RL method and our method, along with the average percentage of an episode during which our method uses the large sub-policy, as well as the average percentage of reduction in FLOPs (floating-point operations).

In Table 3, it can be seen that the average performance of Ours is comparable to π(L−only) and significantly higher than π(S−only). It can also be observed that our method does switch between the small and large sub-policies to control the agent, and thus reduces the total cost required for solving the tasks.

Analysis of the Performance and the FLOPs per Inference

Table 4: An analysis of the performances and FLOPs per inference (denoted as FLOPs/Inf) for our method and the baselines. The Avg-FLOPs/Inf column includes the FLOPs contributed by both the master policy and the sub-policies.

We compare the proposed method against the BC and GAIL baselines and report the results in Table 4. As a reference, we additionally train a policy π(fit) using SAC from scratch with the same DNN size as the student networks of the distillation baselines. Both distillation baselines employ the pre-trained π(L−only) as their teacher network. The results in Table 4 suggest that our method is able to reduce inference costs while maintaining sufficient performance.
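To make the Avg-FLOPs/Inf figure easier to interpret, the sketch below combines the master policy's FLOPs with a usage-weighted mix of the two sub-policies, assuming the master policy runs once every K timesteps; the numbers are placeholders, not values from Table 4:

```
def avg_flops_per_inference(f_master, f_small, f_large, p_large, k=5):
    # p_large: fraction of timesteps driven by the large sub-policy;
    # the master policy is assumed to run once every k timesteps.
    return f_master / k + p_large * f_large + (1.0 - p_large) * f_small

print(avg_flops_per_inference(f_master=2_000, f_small=800, f_large=136_000, p_large=0.3))
```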

Table 5: Comparison of the proposed methodology with and without using the cost term.

Table 5 shows that when the cost term is removed, the main factor affecting the decisions of the master policy is its belief about how well each sub-policy can perform. Since the large sub-policy is able to obtain high scores on its own, the master policy prefers to select the large sub-policy.

Conclusion

We proposed a methodology for performing cost-aware control based on an asymmetric architecture. Our methodology uses a master policy to select between a large sub-policy network and a small sub-policy network. The master policy is trained to take inference costs into consideration, such that the two sub-policies are used alternately and cooperatively to complete the task. The proposed methodology is validated on a wide set of control environments, and the quantitative and qualitative results presented in this paper show that it delivers sufficient performance while reducing the required inference costs.

Paper Download: https://arxiv.org/abs/2105.14471

Github Link: Code

Presentation Link:

Reducing the Deployment-Time Inference Control Costs of Deep Reinforcement Learning Agents via an Asymmetric Architecture

Please cite this paper as follows:

C.-J. Chang, Y.-W. Chu, C.-H. Ting, H.-K. Liu, Z.-W. Hong, and C.-Y. Lee, “Reducing the deployment-time inference control costs of deep reinforcement learning agents via an asymmetric architecture”, in Proc. IEEE Int. Conf. on Robotics and Automation (ICRA), May 2021.

Reference

[1] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.

[2] F. Codevilla, E. Santana, A. Lopez, and A. Gaidon, “Exploring the limitations of behavior cloning for autonomous driving,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 9328–9337.

[3] J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Proc. Advances in Neural Information Processing Systems (NeurIPS), Dec. 2016, pp. 4565–4573.


Elsa Lab

ELSA Lab is a research laboratory focusing on Deep Reinforcement Learning, Intelligent Robotics, and Computer Vision.