[NeurIPS 2018] Diversity-Driven Exploration Strategy for Deep Reinforcement Learning

Elsa Lab
Nov 25, 2021

NeurIPS 2018 Full Paper

Demonstration Video

Keywords

Deep reinforcement learning, Exploration

Introduction

Efficient exploration remains a challenging research problem in reinforcement learning (RL), especially when an environment features a large state space and deceptive or sparse rewards. In an environment with deceptive rewards, an agent can become trapped in a local optimum and never discover alternative strategies that yield larger payoffs. Environments with only sparse rewards, in turn, provide few training signals, making it hard for agents to discover feasible policies.

A common approach for exploration is to adopt simple heuristic methods, such as ϵ-greedy or entropy regularization. However, such strategies are unlikely to yield satisfactory results in tasks with deceptive or sparse rewards. A more sophisticated line of methods provides agents with bonus rewards whenever they visit a novel state. Nevertheless, these methods often require statistical or predictive models to evaluate the novelty of a state, and therefore increase the complexity of the training procedure. In order to deal with the issue of complexity, a few researchers attempt to embrace the idea of random perturbation from evolutionary algorithms. By adding random noise to the parameter space, their methods allow RL agents to perform exploration more consistently without introducing extra computational costs. Despite their simplicity, these methods are less efficient in large state spaces, as random noise changes the behavioral patterns of agents in an unpredictable fashion.

In this paper, we present a diversity-driven exploration strategy: a methodology that encourages a Deep Reinforcement Learning (DRL) agent to attempt policies different from its prior policies. We propose to modify the loss function with a distance measure in order to tackle the problems of large state spaces and of deceptive or sparse reward signals. The distance measure evaluates the novelty of the current policy with respect to a set of prior policies. We further propose an adaptive scaling strategy that dynamically adjusts the effect of the distance measure to enhance overall performance. Moreover, our methodology is complementary to, and easily applicable on top of, most off-policy and on-policy DRL algorithms.

Methodology

Diversity-Driven Exploration Strategy

The main objective of the proposed diversity-driven exploration strategy is to encourage a DRL agent to explore different behaviors during the training phase. Diversity-driven exploration is an effective way to motivate an agent to examine a richer set of states, as well as to provide it with a means of escaping from sub-optimal policies. It can be achieved by modifying the loss function to obtain L_D as follows:

L_D = L − α E_{π′ ∼ Π′}[ D(π, π′) ]

where L denotes the loss function of any arbitrary DRL algorithm, π is the current policy, π′ is a policy sampled from a limited set Π′ of the most recent policies, D is a distance measure between π and π′, and α is a scaling factor for D. The second term encourages the agent to update π with gradients pointing in directions that make π diverge from the samples in Π′: because the distance term enters with a negative sign, minimizing L_D maximizes the expected distance to the prior policies.

The above equation provides several favorable properties:

  1. It drives agents to proactively attempt new policies, increasing their opportunities to visit novel states even in the absence of reward signals from the environment. This property is especially useful in sparse reward settings, where the reward is zero for most of the states in the state space S.
  2. The distance measure D motivates exploration by modifying an agent’s current policy π, instead of altering its behavior randomly.
  3. It allows an agent to perform either greedy or stochastic policies while exploring effectively in the training phase.

The choice of D can be KL-divergence, L2-norm, or mean square error (MSE).
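
As a concrete illustration, the following PyTorch-style sketch augments an arbitrary task loss with the distance term. The function and variable names (diversity_augmented_loss, task_loss, prior_outs) are ours, and the exact networks, batching, and choice of D in the authors' implementation may differ; this is only a minimal sketch of the modification described above.

```python
import torch
import torch.nn.functional as F

def diversity_augmented_loss(task_loss, current_out, prior_outs, alpha, stochastic=True):
    """Sketch of L_D = L - alpha * E_{pi' ~ Pi'}[D(pi, pi')].

    task_loss:   scalar loss L of the underlying DRL algorithm (e.g. a TD loss).
    current_out: outputs of the current policy pi on a batch of states
                 (action probabilities if stochastic, Q-values otherwise).
    prior_outs:  outputs of the sampled prior policies pi' on the same states.
    alpha:       scaling factor for the distance term.
    """
    distances = []
    for prior_out in prior_outs:
        if stochastic:
            # KL divergence KL(pi' || pi) between prior and current action distributions.
            d = F.kl_div(current_out.log(), prior_out.detach(), reduction="batchmean")
        else:
            # Mean squared error between the value / action outputs.
            d = F.mse_loss(current_out, prior_out.detach())
        distances.append(d)
    expected_distance = torch.stack(distances).mean()
    # Subtracting the distance term pushes pi to diverge from the prior policies.
    return task_loss - alpha * expected_distance
```

Note that the prior policies' outputs are detached, so gradients flow only through the current policy π.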

Adaptive Scaling Strategy

To update α in a way that leads to better overall performance, we consider two adaptive scaling methods:

Distance-based

We relate α to the distance measure D: the value of α is adaptively increased whenever the measured distance falls below a certain threshold δ, and decreased whenever it exceeds δ. This simple update is applied at every training iteration; one way to implement it is sketched below.
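
A minimal Python sketch of the distance-based update follows. The multiplicative factors and clipping bounds (scale_up, scale_down, alpha_min, alpha_max) are illustrative choices of ours, not the constants used in the paper.

```python
def update_alpha_distance_based(alpha, expected_distance, delta,
                                scale_up=1.01, scale_down=0.99,
                                alpha_min=1e-4, alpha_max=1.0):
    """Distance-based scaling of alpha (illustrative sketch).

    Raises alpha when the measured distance falls below the threshold delta,
    and lowers it otherwise. The factors and clipping bounds are illustrative
    defaults, not the paper's values.
    """
    if expected_distance < delta:
        alpha *= scale_up    # policies too similar: push for more diversity
    else:
        alpha *= scale_down  # already diverse enough: relax the pressure
    return min(max(alpha, alpha_min), alpha_max)
```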

Performance-based

While the distance-based scaling method is straightforward and effective, it alone does not perform as well for on-policy algorithms. The reason is that we only use the five most recent policies (n = 5) to compute L_D, which often results in high variance and instability during the training phase. Off-policy algorithms do not suffer from this issue, as they can utilize experience replay to provide a sufficiently large set of past policies. Therefore, for on-policy algorithms we further adjust the value of α according to the performance of past policies to stabilize the learning process, using either a proactive or a reactive strategy.

Here, P(π′) denotes the average performance of π′ over five episodes, and P_min and P_max represent the minimum and maximum performance attained by the set of past policies Π′. The proactive strategy incentivizes the current policy π to converge toward the high-performing policies in Π′ while keeping away from the poor ones. The reactive strategy, in contrast, only motivates π to stay away from the underperforming policies.
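
To make the idea concrete, the sketch below scales α for each sampled prior policy π′ according to a linear normalization of its performance. The normalization and the name performance_scaled_alpha are our own illustrative assumptions; only the qualitative behavior (repel from poor policies, and additionally attract toward good ones in the proactive case) is taken from the description above.

```python
def performance_scaled_alpha(alpha, perf, perf_min, perf_max, strategy="proactive"):
    """Performance-based scaling of alpha for a sampled prior policy pi'.

    perf is P(pi'), the average return of pi' over a few evaluation episodes;
    perf_min / perf_max are the worst / best performances in Pi'. The linear
    normalization below is an illustrative assumption, not necessarily the
    exact form used in the paper.
    """
    if perf_max == perf_min:
        return 0.0
    # badness is 1 for the worst prior policy and 0 for the best one.
    badness = (perf_max - perf) / (perf_max - perf_min)
    if strategy == "proactive":
        # Positive alpha for poor policies (repel) and negative alpha for good
        # ones (attract), so pi is drawn toward high performers.
        return alpha * (2.0 * badness - 1.0)
    # "reactive": only repel from underperforming policies.
    return alpha * badness
```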

Experimental Results and Analysis

Environment

The baseline methods adopted for comparison vary across the different environments. For discrete control tasks, we select vanilla DQN [1] and vanilla A2C [3, 4], as well as their noisy net [5] and curiosity-driven [6] variants. For continuous control tasks, vanilla DDPG [2] and its parameter noise [7] variant are taken as the baselines for comparison.

Exploration in Huge Gridworld

Figure 1: Gridworlds.

Fig. 1 illustrates the deceptive and sparse gridworlds. In both settings, the agent starts from the top-left corner of the map, with the objective of reaching the bottom-right corner to obtain a reward of 1. At each timestep, the agent observes its absolute coordinate and chooses from four possible actions: move north, move south, move west, and move east. An episode terminates immediately after a reward is collected. In the deceptive reward setting illustrated in Fig. 1 (a), the central area of the map is scattered with small rewards of 0.001 that distract the agent from finding the highest reward in the bottom-right corner. In the sparse reward setting depicted in Fig. 1 (b), there is only a single reward, located at the bottom-right corner.
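
For readers who wish to reproduce a similar setting, here is a minimal, gym-style Python sketch of the two gridworlds. The grid size and the extent of the deceptive-reward region are illustrative choices of ours and may differ from the environments used in the paper.

```python
import numpy as np

class GridWorld:
    """Minimal sketch of the deceptive / sparse gridworlds described above."""

    # north, south, west, east as (row, column) offsets
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

    def __init__(self, size=20, deceptive=True):
        self.size = size          # illustrative grid size, not the paper's
        self.deceptive = deceptive
        self.goal = (size - 1, size - 1)
        self.reset()

    def reset(self):
        self.pos = (0, 0)  # the agent starts at the top-left corner
        return np.array(self.pos, dtype=np.float32)  # absolute coordinate

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        reward = 0.0
        if self.pos == self.goal:
            reward = 1.0  # the single large reward at the bottom-right corner
        elif (self.deceptive
              and self.size // 4 <= r < 3 * self.size // 4
              and self.size // 4 <= c < 3 * self.size // 4):
            reward = 0.001  # small deceptive rewards in the central area
        done = reward > 0.0  # an episode ends as soon as any reward is collected
        return np.array(self.pos, dtype=np.float32), reward, done, {}
```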

Table 1: Evaluation results of the gridworld experiments.

Table 1 reports the performance of each method in terms of average reward. As shown, Div-DQN outperforms both vanilla DQN and Noisy-DQN in both settings.

Figure 2: State-visitation counts of the gridworlds.

From Fig. 2 (a) and (b), it can be observed that the baseline methods are easily trapped in the area near the deceptive rewards and never visit the optimal reward in the bottom-right corner. On the other hand, it can be seen from Fig. 2 (c) that Div-DQN is able to escape from the area of deceptive rewards, explores all four sides of the gridworld, and successfully discovers the optimal reward of 1.

From Fig. 2 (d), it can be seen that DQN spends most of its time wandering around the same route; its search range therefore covers only a small proportion of the state space. Noisy-DQN explores a much broader area of the state space; however, the bright colors in Fig. 2 (e) indicate that it wastes a significant amount of time revisiting already-explored states. On the other hand, Div-DQN is the only method capable of exploring the gridworld uniformly and systematically, as illustrated in Fig. 2 (f).

Performance Comparison in Atari2600

We encourage interested readers to refer to the full paper [NeurIPS] for more details.

Conclusion

In this paper, we presented a diversity-driven exploration strategy that can be effectively combined with current RL algorithms. We proposed to promote exploration by encouraging an agent to engage in behaviors different from its previous ones, and showed that this can be achieved simply by adding a distance measure term to the loss function. We performed experiments in various benchmark environments and demonstrated that our method leads to superior performance in most of the settings.

Paper Download

[NeurIPS]

Please cite this paper as follows:

Z.-W. Hong, T.-Y. Shann, S.-Y. Su, Y.-H. Chang, and C.-Y. Lee, "Diversity-driven exploration strategy for deep reinforcement learning," in Proc. Thirty-Second Conference on Neural Information Processing Systems (NeurIPS), pp. 10510–10521, Dec. 2018.

References

[1] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[2] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," arXiv:1509.02971, Feb. 2016.
[3] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Machine Learning (ICML), pp. 1928–1937, June 2016.
[4] D. Silver et al., "Deterministic policy gradient algorithms," in Proc. Int. Conf. Machine Learning (ICML), pp. 387–395, June 2014.
[5] M. Fortunato et al., "Noisy networks for exploration," in Proc. Int. Conf. Learning Representations (ICLR), May 2018.
[6] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, "Curiosity-driven exploration by self-supervised prediction," in Proc. Int. Conf. Machine Learning (ICML), pp. 2778–2787, Aug. 2017.
[7] M. Plappert et al., "Parameter space noise for exploration," in Proc. Int. Conf. Learning Representations (ICLR), May 2018.


Elsa Lab

ELSA Lab is a research laboratory focusing on Deep Reinforcement Learning, Intelligent Robotics, and Computer Vision.