[CoRL 2019] Adversarial Active Exploration for Inverse Dynamics Model Learning

Elsa Lab
Dec 24, 2021

CoRL 2019 Full Paper

Keywords

deep reinforcement learning, inverse dynamics model, intrinsic reward, adversarial learning, exploration

Introduction

Over the past decade, inverse dynamics models have shown considerable success in robotic control, even on humanoid robots. The main objective of an inverse dynamics model is to predict the control action that transitions the system between two states. With such a model, it becomes practically feasible to carry out complex control behaviors by inferring a series of control actions from a predefined trajectory of states.
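As a concrete illustration, below is a minimal PyTorch sketch of such a model; the class name, network sizes, and interface are illustrative assumptions rather than the design used in any particular paper.

```python
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    """Predicts the action a_t that drives the system from state s_t to s_{t+1}."""

    def __init__(self, state_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, s_t, s_next):
        # Concatenate the current and next states and regress the action between them.
        return self.net(torch.cat([s_t, s_next], dim=-1))
```

Given a predefined state trajectory s_0, …, s_T, applying such a model to each consecutive pair (s_t, s_{t+1}) yields the sequence of control actions to execute.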

In recent years, researchers have turned to developing inverse dynamics models based on deep neural networks (DNNs) in the hope of effectively coping with high-dimensional state spaces. However, this line of work demands a tremendous amount of data for training DNNs, posing a considerable challenge to data efficiency.

As data efficiency is crucial for robot learning in practical applications, an efficient data acquisition methodology is essential for inverse dynamics model learning. Nevertheless, previous approaches usually interact with environments in inefficient manners. For instance, [1] and [2] employ an agent that takes random actions in an environment to collect training data for their inverse dynamics models. Random actions, however, are unlikely to result in effective exploration behaviors, and may thus lead to a lack of comprehensiveness in the data samples collected by the agent. A more general exploration strategy, called curiosity-driven exploration, was later proposed in [3]. While this approach is effective in exploring novel states in an environment, its exploration behavior is driven solely by the prediction errors of a forward dynamics model. The approach is therefore not specifically tailored to inverse dynamics model learning.

To deal with the above issues, we propose a straightforward and efficient active data acquisition method called adversarial active exploration. We jointly train a deep reinforcement learning (DRL) agent and an inverse dynamics model that compete with each other. The former explores the environment to collect training data for the latter, and receives rewards from the latter if the collected samples are considered difficult. The latter is trained with the data collected by the former, and generates rewards only when it fails to predict the true actions performed by the former. To stabilize the learning curve of the inverse dynamics model, we further propose a reward structure that encourages the DRL agent to explore samples that are moderately hard for the inverse dynamics model, while refraining from ones that are too difficult for the latter to learn.

Methodology

Adversarial Active Exploration

Figure 1: The proposed adversarial active exploration framework.

Fig. 1 illustrates the proposed adversarial active exploration framework, which consists of a DRL agent P and an inverse dynamics model I. At each timestep t, P collects a 3-tuple training sample (s_t, a_t, s_{t+1}) for I, while I predicts an action and generates a reward for P. We directly use the loss function L_I of I as the reward for P, which can be expressed as:

r_t = β · L_I,

where β is a scaling factor.
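In code, this reward amounts to evaluating I's prediction loss on the sample just collected by P. A minimal sketch is shown below, assuming a mean-squared-error loss over continuous actions; the concrete loss function and tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def intrinsic_reward(inv_model, s_t, s_next, a_t, beta=1.0):
    """Reward for P at timestep t: beta times the prediction loss L_I of I."""
    with torch.no_grad():
        a_pred = inv_model(s_t, s_next)      # I's guess of the action P actually took
        loss_I = F.mse_loss(a_pred, a_t)     # L_I: large when I fails to predict a_t
    return beta * loss_I.item()              # r_t = beta * L_I
```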

Our method targets improving both the quality and efficiency of the data collection process performed by P, as well as the performance of I, by collecting difficult and non-trivial training samples. Therefore, the goal of the proposed framework is twofold. First, P has to learn an adversarial policy such that its accumulated discounted rewards are maximized. Second, I has to learn an optimal θ_I such that L_I is minimized. Minimizing L_I decreases the accumulated discounted rewards, forcing P to improve its policy and explore more difficult samples so as to increase them again. This implies that P is motivated to concentrate on discovering I's weak points in the state space, instead of randomly or even repeatedly collecting ineffective training samples for I.
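The resulting procedure can be summarized with the schematic loop below, which reuses `InverseDynamicsModel` and `intrinsic_reward` from the sketches above and assumes flat array observations. The `agent` object stands in for any off-the-shelf DRL learner; its `act`, `observe`, and `update` methods, as well as the optimizer and buffer handling, are assumed interfaces rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_exploration_step(env, obs, agent, inv_model, inv_optimizer, inv_buffer, beta=1.0):
    """One interaction step of the adversarial game between P (agent) and I (inv_model)."""
    a_t = agent.act(obs)                                   # P explores the environment
    next_obs, _, done, _ = env.step(a_t)                   # the environment reward is ignored

    s_t = torch.as_tensor(obs, dtype=torch.float32)
    s_next = torch.as_tensor(next_obs, dtype=torch.float32)
    a_tensor = torch.as_tensor(a_t, dtype=torch.float32)

    r_t = intrinsic_reward(inv_model, s_t, s_next, a_tensor, beta)   # r_t = beta * L_I
    agent.observe(obs, a_t, r_t, next_obs, done)           # P maximizes the discounted sum of r_t
    inv_buffer.append((s_t, a_tensor, s_next))             # (s_t, a_t, s_{t+1}) sample for I

    # I is trained to minimize L_I on the samples collected so far (simple full-batch step here).
    s_batch, a_batch, s_next_batch = map(torch.stack, zip(*inv_buffer))
    loss_I = F.mse_loss(inv_model(s_batch, s_next_batch), a_batch)
    inv_optimizer.zero_grad()
    loss_I.backward()
    inv_optimizer.step()

    agent.update()                                         # P adapts toward I's remaining weak points
    return env.reset() if done else next_obs
```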

Stabilization Technique

Although adversarial active exploration is effective in collecting hard samples, it requires additional adjustments if P becomes so strong that the collected samples are too difficult for I to learn. Overly difficult samples lead to large gradient magnitudes derived from L_I, which in turn cause a performance drop in I and instability in its learning process. To tackle this issue, we propose a training technique that reshapes the reward as follows:

where δ is a pre-defined threshold. This technique restricts the range of rewards, driving P to gather moderately hard samples instead of overly hard ones.
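The exact reshaped reward is given in the paper; the snippet below only illustrates the general idea with a simple clipped variant (this specific form is our assumption, not necessarily the paper's expression). Once L_I exceeds the threshold δ, P gains no additional reward from making samples any harder.

```python
def shaped_reward(loss_I, beta=1.0, delta=1.5):
    """Illustrative thresholded reward: proportional to L_I up to delta, flat beyond it,
    so P has no incentive to collect samples that are far too hard for I to learn from.
    (Assumed clipping; see the paper for the exact reshaped reward.)"""
    return beta * min(loss_I, delta)
```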

Experimental Results and Analysis

Environments

The primary objective of our experiments is to demonstrate the efficiency of the proposed adversarial active exploration in collecting training data (in a self-supervised manner) for inverse dynamics models. We compare our method against a number of data collection methods, referred to as the baselines: random [1], demo [2], curiosity [3], and noise [4].

We evaluate our method on a number of robotic arm and hand manipulation tasks from OpenAI Gym, simulated by the MuJoCo physics engine. We use the Fetch robot and the Shadow Dexterous Hand for the arm and hand manipulation tasks, respectively. The arm manipulation tasks include FetchReach, FetchPush, FetchPickAndPlace, and FetchSlide, while the hand manipulation task is HandReach.
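For reference, these tasks correspond to the standard Gym robotics environment IDs shown below; the version suffixes depend on the installed gym release and are given only as an example.

```python
import gym

# Robotic arm (Fetch) and hand (Shadow Dexterous Hand) manipulation tasks.
TASKS = [
    "FetchReach-v1", "FetchPush-v1", "FetchPickAndPlace-v1", "FetchSlide-v1",  # arm
    "HandReach-v0",                                                            # hand
]

env = gym.make("FetchPush-v1")
obs = env.reset()                    # dict with 'observation', 'achieved_goal', 'desired_goal'
action = env.action_space.sample()   # a random action, as used by the `random` baseline [1]
obs, reward, done, info = env.step(action)
```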

Performance Comparison in Robotic Arm Manipulation Tasks

Fig. 2 plots the learning curves for all of the methods. In all of the tasks, our method yields performance superior or comparable to the baselines, except for Demo, which is trained directly with expert demonstrations (i.e., human priors). Our method also learns drastically faster than all of the other baselines, which confirms that the proposed strategy does improve the performance and efficiency of inverse dynamics model learning.

Performance Comparison in Robotic Hand Manipulation Tasks

From the results shown in Fig. 2 (rightmost column), it can be seen that Demo easily stands out from the other methods as the best-performing model, surpassing them all by a considerable margin. Although our method is not as impressive as Demo, it significantly outperforms all of the other baseline methods, achieving a success rate of 0.4 while the others remain stuck at around 0.2.

For more details about the methodology and the experimental results, please refer to the paper [CoRL].

Conclusion

In this paper, we presented adversarial active exploration, which consists of a DRL agent and an inverse dynamics model competing with each other for efficient data collection. The former is encouraged to actively collect difficult training data for the latter, such that the training efficiency of the latter is significantly enhanced. Experimental results demonstrated that our method substantially improves data collection efficiency on multiple robotic arm and hand manipulation tasks, and boosts the performance of the inverse dynamics models.

Paper Download

[CoRL]
[arXiv]

Please cite this paper as follows

Z.-W. Hong, T.-J. Fu, T.-Y. Shann, Y.-H. Chang, and C.-Y. Lee, “Adversarial active exploration for inverse dynamics model learning,” in Proc. Conf. Robot Learning (CoRL), Oct.-Nov. 2019.

Reference

[1] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Proc. Advances in Neural Information Processing Systems (NeurIPS), pp. 5074–5082, Dec. 2016.
[2] A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. In Proc. Int. Conf. Robotics and Automation (ICRA), pp. 2146–2153, May-Jun. 2017.
[3] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In Proc. Int. Conf. Machine Learning (ICML), Aug. 2017.
[4] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz. Parameter space noise for exploration. In Proc. Int. Conf. Learning Representations (ICLR), Apr.-May 2018.


Elsa Lab

ELSA Lab is a research laboratory focusing on Deep Reinforcement Learning, Intelligent Robotics, and Computer Vision.