[ACM TELO 2021 / NeurIPS 2020 Workshop] Reusability and Transferability of Macro Actions for Reinforcement Learning

Elsa Lab
7 min read · Apr 20, 2022

ACM TELO 2021 Full Paper

Introduction

In conventional Reinforcement Learning (RL) methods, agents are restricted to making a decision at every single timestep. However, the rewards used in RL training are intrinsically biased toward short-term goals due to discounting. This situation is further exacerbated by the greedy nature of the agents, which simply follow their policy and/or value functions. Over the past years, researchers have therefore proposed a number of techniques for generating macro actions. A macro action (or simply “a macro”) is defined as an open-loop policy composed of a finite sequence of primitive actions.
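To make this definition concrete, the snippet below is a minimal sketch (not code from the paper) of how a macro can be executed open-loop: the primitive actions are applied back-to-back without consulting the policy in between. The stub environment and the discount handling are illustrative assumptions.

```python
import random

class _StubEnv:
    """Tiny stand-in environment so that the sketch runs end-to-end."""
    def step(self, action):
        return None, random.random(), False   # (observation, reward, done)

def execute_macro(env, macro, gamma=0.99):
    """Execute the primitive actions of `macro` back-to-back (open-loop),
    accumulating discounted reward without consulting the policy."""
    total_reward, discount, obs, done = 0.0, 1.0, None, False
    for action in macro:
        obs, reward, done = env.step(action)
        total_reward += discount * reward
        discount *= gamma
        if done:                              # stop early if the episode ends
            break
    return obs, total_reward, done

macro = (2, 2, 1)                             # three primitive action indices
print(execute_macro(_StubEnv(), macro))
```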

The promise and the goal of a macro action are to improve search guidance by shielding the intermediate states from the greediness of the RL agent while the macro is being executed. Such temporal abstraction is not achievable with a short-term reward or a single action alone, e.g., venturing into a dangerous zone to retrieve a valuable item. This is similar to the embedding effect and the evaluation effect discussed in [1].

Different macro actions have different impacts on an agent. A bad macro may lead the agent to undesirable states. A good macro, on the other hand, enables an RL agent to bypass multiple intermediate states and reach a target state more quickly and easily. We further assume that good macros also exhibit invariance across different RL methods and across similar environments. The former is called the reusability property, while the latter is called the transferability property.

Methodology

The Proposed Workflow

Figure 1: An illustration of the workflow adopted in this research.

In this section, we present the proposed workflow for investigating the reusability and transferability properties of macro actions, which is the main objective of this research. Fig. 1 illustrates the workflow adopted in this paper. It contains two stages: a macro action generation stage, and a macro action utilization stage for performing various evaluations. The former stage takes into account the RL method, the action space, and the environment, and generates a macro action m that is sufficiently good for an RL agent to use as a means of performing temporal abstraction in the environment. The latter stage then encapsulates the generated macro m and the action space into an augmented action space M, and utilizes this augmented action space in our evaluation experiments.
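The following is a hypothetical sketch of how such an augmented action space M could be realized, assuming a simple wrapper around the environment; it is not the authors' actual implementation. Selecting the extra action index triggers open-loop execution of the macro, while all other indices behave as ordinary primitive actions.

```python
class MacroAugmentedEnv:
    """Hypothetical wrapper: exposes the primitive actions plus one extra
    action index that triggers open-loop execution of the macro."""

    def __init__(self, env, n_primitive_actions, macro):
        self.env = env
        self.macro = macro
        self.n_actions = n_primitive_actions + 1      # |M| = |A| + 1

    def step(self, action):
        if action < self.n_actions - 1:               # ordinary primitive action
            return self.env.step(action)
        # The macro index was selected: run its primitives back-to-back.
        total_reward, obs, done = 0.0, None, False
        for a in self.macro:
            obs, reward, done = self.env.step(a)
            total_reward += reward                    # reward aggregation is a
            if done:                                  # simplification here
                break
        return obs, total_reward, done
```

From the RL method's point of view, the macro then looks like just one more discrete action, so the agent and the training loop need no modification.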

Algorithm

We formulate our macro action generation stage as Algorithm 1, where the macro action construction method is based on a genetic algorithm (GA) [2]. Algorithm 1 is built atop three modules: (1) the fitness function, (2) the append operator, and (3) the alteration operator. These three modules play essential roles in Algorithm 1 and are further formulated as Algorithms 2, 3, and 4, respectively. For more details, please refer to the paper.
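Since the algorithms themselves are not reproduced in this post, the sketch below only gives a rough, hypothetical picture of how such a GA loop over macros might be organized: the fitness function (stubbed here) would train and evaluate an agent with the candidate macro, the append operator grows a macro by one primitive action, and the alteration operator mutates one of its entries. All names and hyperparameters are illustrative.

```python
import random

PRIMITIVE_ACTIONS = list(range(4))      # illustrative primitive action set

def fitness(macro):
    """Placeholder: in the paper this corresponds to training an RL agent
    with the macro and returning its mean episode reward."""
    return random.random()

def append(macro):
    """Grow the macro by one randomly chosen primitive action."""
    return macro + (random.choice(PRIMITIVE_ACTIONS),)

def alter(macro):
    """Mutate one randomly chosen position of the macro."""
    i = random.randrange(len(macro))
    return macro[:i] + (random.choice(PRIMITIVE_ACTIONS),) + macro[i + 1:]

def generate_macro(pop_size=8, generations=10, keep=4):
    """Hypothetical GA loop: keep the fittest macros, then produce
    offspring via the append and alteration operators."""
    population = [(random.choice(PRIMITIVE_ACTIONS),) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:keep]
        children = [random.choice([append, alter])(random.choice(parents))
                    for _ in range(pop_size - keep)]
        population = parents + children
    return max(population, key=fitness)

print(generate_macro())
```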

Experimental Results and Analysis

Experimental Setup

(1) Setup for the reusability experiments: We evaluate the generated macro actions on the following eight Atari 2600 environments: Asteroids, Beamrider, Breakout, Kung-Fu Master, Ms. Pac-Man, Pong, Q*bert, and Seaquest. We also select advantage actor-critic (A2C) [3] and proximal policy optimization (PPO) [4] as our RL methods for training the agents.

(2) Setup for the transferability experiments: We employ ViZDoom as our environment for examining the transferability property. ViZDoom is a research platform featuring complex three-dimensional first-person-perspective environments. For ViZDoom, we evaluate our generated macro on the default task (denoted as “Dense”). We then use the “Sparse”, “Very Sparse”, and “Super Sparse” (developed by us) tasks to analyze the transferability property of the constructed macro. We also implement an intrinsic curiosity module (ICM) [5] along with A2C (together denoted as “Curiosity”) as our RL method for training the agents.

Validation of the Proposed Workflow

Figure 2: The learning curves w/ and w/o the derived macros.

Fig. 2 uses two different types of environments, Dense from ViZDoom and Enduro from Atari 2600, as example cases to demonstrate the average fitness and improvement of each generation produced by Algorithm 1. It is observed from the trends that the mean episode rewards obtained by the agents improve over generations, revealing that later generations do inherit the advantageous properties of their parents. Such advantageous properties are retained across generations, pushing the population of macro actions to evolve toward better fitness.

We first employ the ViZDoom environment “Dense” to discuss the benefits of our generated macros in terms of the embedding effect, by comparing the macros generated by Algorithm 1 with the action repeat macro, in which the same primitive action is repeatedly executed. We employ the proposed macro generation stage to construct the best macro m(D) for “Dense”. Then, in order to construct the action repeat macro m(Repeat), we evaluate all possible action repeat macros of the same length as m(D). In Fig. 2 (a), it can be observed that the curve of “Curiosity+m(Repeat)” is worse than those of “Curiosity” and “Curiosity+m(D)” in the early stage of the training phase. This observation provides two insights. First, although both m(D) and m(Repeat) allow the RL agents to bypass intermediate states by performing consecutive actions, m(Repeat) does not lead to immediate positive impacts when compared with the vanilla “Curiosity”. This suggests that not all macro generation methods are able to construct a macro that benefits equally from the embedding effect. Second, m(D) enables the agent to perform better than the vanilla “Curiosity”, indicating that the macro generated by our macro generation stage does provide a positive impact when the agent is allowed to bypass intermediate states during its training phase.
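For concreteness, the construction of m(Repeat) described above can be sketched as follows; the evaluation routine is a stand-in for training an agent with each candidate, and the primitive action names are illustrative rather than the exact ViZDoom action set.

```python
import random

def evaluate(macro):
    """Stand-in for training an agent with the candidate macro and
    measuring its mean episode reward."""
    return random.random()

def best_action_repeat_macro(primitive_actions, length):
    """Enumerate every 'repeat one action `length` times' macro and
    keep the candidate that evaluates best."""
    candidates = [(a,) * length for a in primitive_actions]
    return max(candidates, key=evaluate)

# Illustrative primitive action names; the actual action set may differ.
primitives = ["MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT"]
m_repeat = best_action_repeat_macro(primitives, length=3)
print(m_repeat)
```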

In addition, Fig. 2 (b) compares the learning curves of the A2C agent trained for 100M timesteps with and without the best macro m(A2C) constructed by our methodology. It is observed that A2C with m(A2C) outperforms the vanilla A2C, which is hardly able to learn an effective policy throughout the training process.

Reusability and Transferability Properties

Figure 3: The learning curves of the RL methods with and without the generated macros for evaluating reusability.

The reusability property is said to exist if a macro action constructed with one RL method can be used by another RL method for training. The results in Fig. 3 show that the A2C and PPO agents are able to benefit from the provided macros in most cases, which justifies the existence of the reusability property of the macros. These observations also suggest that the macros generated by our workflow could exhibit invariance when they are employed by different RL methods during the training phase of the agents.

Figure 4: The learning curves of the RL method with and without the generated macro for evaluating the transferability property.

The transferability property is said to exist if the constructed macros can be leveraged in similar environments with different reward settings. To confirm this property, we utilize the macro m(D) = (MOVE_FORWARD, MOVE_FORWARD, TURN_RIGHT), generated under the Dense reward setting, to validate its transferability in tasks with sparse reward settings, including Sparse, Very Sparse, and Super Sparse. Fig. 4 demonstrates that the agents with the macro learn faster than the agents without it. These results thus validate the transferability property of the generated macro action, and suggest that a macro generated in one environment can be utilized in a similar one with a different reward setting, even if the reward signal becomes sparser than the original one.
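Operationally, the transferability test boils down to reusing the very same macro when building the augmented action space of each sparse-reward variant, as in the minimal sketch below (the primitive action list is illustrative; the task names match the paper).

```python
# Minimal sketch: the same macro is appended to the action set of every
# sparse-reward variant. The primitive action list is an illustrative
# assumption and may differ from the actual ViZDoom setup.
PRIMITIVES = ["MOVE_FORWARD", "TURN_LEFT", "TURN_RIGHT"]
M_D = ("MOVE_FORWARD", "MOVE_FORWARD", "TURN_RIGHT")

augmented_action_spaces = {
    task: PRIMITIVES + [M_D]
    for task in ("Sparse", "Very Sparse", "Super Sparse")
}
print(augmented_action_spaces["Super Sparse"])
```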

Conclusion

In this paper, we have presented a methodology to examine the reusability and the transferability properties of the generated macros in the RL domain. We presented a workflow to generate macros, and utilized them with various evaluation configurations. We first validated the workflow, and showed that the generated macros exhibit the embedding and evaluation effects. We then examined the reusability property between RL methods and the transferability property among similar environments.

Paper Download

[ACM]

Please cite this paper as follows:

Y.-S. Chang, K.-Y. Chang, H. Kuo, and C.-Y. Lee, “Reusability and transferability of macro actions for reinforcement learning,” ACM Trans. Evolutionary Learning and Optimization (TELO), vol. 2, no. 1, pp. 1–16, Mar. 2022.

References

[1] A. Botea, M. Enzenberger, M. Müller, and J. Schaeffer. Macro-FF: Improving AI planning with automatically learned macro-operators. J. Artificial Intelligence Research (JAIR), 24:581–621, Oct. 2005.
[2] M. Mitchell. An Introduction to Genetic Algorithms. MIT Press, 1998.
[3] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proc. Int. Conf. Machine Learning (ICML), pages 1928–1937, Jun. 2016.
[4] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv:1707.06347, Aug. 2017.
[5] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In Proc. Int. Conf. Machine Learning (ICML), Aug. 2017.

