Training Agents To Play a Soccer Game?
A Deep Policy Inference Q-Network for Multi-Agent Systems.
Multi-Agent Reinforcement Learning
Recently, deep reinforcement learning (DRL), which combines RL and deep neural networks (DNNs), has shown great success in a wide variety of single-agent stationary settings, including Atari games, robot navigation, and Go. These advances have led researchers to extend DRL to the multi-agent domain for more practical use, such as investigating agents’ social behaviors and developing algorithms that improve training efficiency [1, 2, 3, 4, 5].
What’s the Challenge in Multi-Agent Training?
In a multi-agent system (MAS), agents share a common environment, in which each can act and interact independently to achieve its own objectives. However, the environment perceived by each agent changes over time due to the actions of the other agents, making each agent’s observations non-stationary. A non-stationary environment prevents an agent from assuming that the others follow fixed, stationary strategies, which increases the complexity and difficulty of modeling their behaviors.
Policy-Changing Agents and Modeling Environmental Dynamics
In collaborative or competitive scenarios, in which agents are required to cooperate or take actions against other agents, modeling environmental dynamics becomes even more challenging. Also, in order to act optimally under such scenarios, an agent needs to predict other agents’ policies and infer their intentions. Furthermore, for scenarios in which other agents’ policies dynamically change over time, an agent’s policy needs to change accordingly. Therefore, a robust methodology to model collaborators or opponents in a MAS is further necessitated.
Our Proposed Methodology: Deep Policy Inference Q-Network (DPIQN)
To deal with the above issues, we present a detailed design of the deep policy inference Q-network (DPIQN), which targets multi-agent systems composed of controllable agents, collaborators, and opponents that interact with each other. DPIQN aims at training and controlling a single agent to interact with the other agents in a MAS, using only high-dimensional raw observations (e.g., images). We also apply feature representation learning as an auxiliary task for better policy learning.
Recently, representation learning in the form of auxiliary tasks has been employed in several DRL methods, where the auxiliary tasks are combined with DRL by learning additional goals [6, 7, 8]. As auxiliary tasks provide DRL agents with much richer feature representations than those of traditional methods, they are potentially more suitable for modeling non-stationary collaborators and opponents in a MAS.
Deep Policy Inference Q-Network (DPIQN)
The main objective of DPIQN is to improve the quality of state feature representations of an agent in multi-agent settings. For better understanding, we first introduce the architecture of DPIQN.
DPIQN is built on top of the well-known deep Q-network (DQN) and consists of three major parts: a feature extraction module, a Q-value learning module, and an auxiliary policy feature learning module. The first two modules are responsible for learning the Q-values, while the latter module focuses on learning a hidden representation from the other agents’ policies. We call this learned hidden representation “policy features” and propose to incorporate it into the Q-value learning module to derive better Q-values.
As mentioned in the previous section, the environmental dynamics are affected by multiple agents. To let the agent exploit the other agents’ actions in a MAS, DPIQN enhances its hidden representations by learning the other agents’ “policy features” through auxiliary tasks. More specifically, DPIQN incorporates the learned policy features as a hidden vector into its own deep Q-network (DQN), such that it is able to predict better Q-values for the controllable agent than state-of-the-art deep reinforcement learning models.
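To make the idea concrete, here is a minimal numpy sketch of such a forward pass. The layer sizes, the random stand-in weights, and the `dpiqn_forward` helper are hypothetical illustrations, not the paper’s exact architecture: the point is only how the inferred-policy head runs alongside a policy-feature vector that is concatenated into the Q-value computation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical sizes; the paper's exact layer widths are not reproduced here.
FEAT_DIM, HID_DIM, N_ACTIONS = 64, 32, 5

# Random weights stand in for a trained feature extractor and heads.
W_pi = rng.normal(size=(FEAT_DIM, N_ACTIONS)) * 0.1            # policy-inference head
W_h  = rng.normal(size=(FEAT_DIM, HID_DIM)) * 0.1              # policy-feature layer
W_q  = rng.normal(size=(FEAT_DIM + HID_DIM, N_ACTIONS)) * 0.1  # Q-value head

def dpiqn_forward(features):
    """Infer the other agent's policy, then condition Q-values on policy features."""
    policy_feature = np.tanh(features @ W_h)      # hidden "policy features"
    inferred_policy = softmax(features @ W_pi)    # auxiliary prediction of the
                                                  # other agent's action distribution
    # The Q-value module sees both the state features and the policy features.
    q_values = np.concatenate([features, policy_feature]) @ W_q
    return q_values, inferred_policy

feats = rng.normal(size=FEAT_DIM)  # stand-in for CNN output on raw pixels
q, pi = dpiqn_forward(feats)
print(q.shape, pi.shape)  # (5,) (5,)
```

The auxiliary head is trained to match the other agent’s observed actions, while the Q head is trained as usual; the shared features are what tie the two objectives together.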
An Advanced Version of DPIQN: Deep Recurrent Policy Inference Q-Network (DRPIQN)
We further propose an enhanced version of DPIQN, called deep recurrent policy inference Q-network (DRPIQN), for handling partial observability. Both DPIQN and DRPIQN are trained by an adaptive training procedure, which adjusts the network’s attention to learn the policy features and its own Q-values at different phases of the training process.
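The adaptive procedure can be sketched as a weighted sum of the Q-learning loss and the policy-inference loss, with the weight shifting over the course of training. The specific schedule below (down-weighting the Q-loss while policy inference is still poor) is an illustrative assumption, not the paper’s exact formula:

```python
def adaptive_loss(l_q, l_pi):
    """Combine the Q-learning loss l_q and the policy-inference loss l_pi.

    Illustrative weighting (an assumption, not the paper's exact rule): early in
    training l_pi is large, so lam is small and the network focuses on inferring
    the other agents' policies; as l_pi drops, lam grows toward 1 and the
    emphasis shifts to learning the agent's own Q-values.
    """
    lam = 1.0 / (1.0 + l_pi)
    return lam * l_q + l_pi

# Early phase: policy inference is still poor, so the Q-loss is down-weighted.
early = adaptive_loss(l_q=2.0, l_pi=3.0)   # lam = 0.25
# Late phase: policy inference is accurate, so the Q-loss dominates.
late = adaptive_loss(l_q=2.0, l_pi=0.1)    # lam ~ 0.91
print(early, late)
```

Any monotone schedule with this qualitative behavior would serve the same purpose: learn “who the others are” first, then optimize one’s own return.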
DRPIQN focuses on reducing the noise in the hidden state representation caused by the strategy changes of the other agents in the environment. To this end, DRPIQN incorporates recurrent units into the baseline DPIQN model, as illustrated in Fig. 1 (b). DRPIQN takes a single observation as its input. Like DPIQN, it employs a convolutional neural network (CNN) to extract spatial features from the input, but it uses LSTM layers to encode their temporal correlations. Thanks to their capability of learning long-term dependencies, the LSTM layers capture a better policy feature representation. We show that DPIQN demonstrates better generalizability to unfamiliar agents than the baseline models.
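The recurrent part can be sketched with a standard LSTM cell that folds each frame’s CNN features into a running hidden state. The sizes and random stand-in weights below are hypothetical; only the gating equations are the standard ones:

```python
import numpy as np

rng = np.random.default_rng(1)
FEAT, HID = 16, 8  # hypothetical feature and hidden sizes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix per gate (input, forget, output, candidate), each acting on
# the concatenation of [previous hidden state, current CNN feature].
W = {g: rng.normal(size=(HID + FEAT, HID)) * 0.1 for g in "ifoc"}

def lstm_step(h, c, x):
    """Standard LSTM cell: fold one frame's features into the running state."""
    z = np.concatenate([h, x])
    i, f, o = (sigmoid(z @ W[g]) for g in "ifo")
    c_new = f * c + i * np.tanh(z @ W["c"])   # updated cell memory
    h_new = o * np.tanh(c_new)                # emitted hidden state
    return h_new, c_new

# Roll the cell over a short sequence of (stand-in) per-frame CNN features;
# the final h summarizes the temporal context a feed-forward DPIQN lacks.
h, c = np.zeros(HID), np.zeros(HID)
for _ in range(10):
    h, c = lstm_step(h, c, rng.normal(size=FEAT))
print(h.shape)  # (8,)
```

The final hidden state `h` plays the role of the temporally-aware policy feature that the Q-value module consumes.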
Our Training For DPIQN And DRPIQN
Both DPIQN and DRPIQN generalize to complex environments!
Consider a multi-agent environment in which both cooperative and competitive agents coexist (e.g., a soccer game). The policies of these agents are diverse in terms of their objectives and tactics. For instance, the aim of the collaborative agents (collaborators) is to work with the controllable agent to achieve a common goal, while that of the competitive agents (opponents) is to act against the controllable agent’s team. In a heterogeneous environment where some of the agents are more offensive and some are more defensive, conditioning the Q-function of the controllable agent on the actions of distinct agents would lead to an explosion of parameters.
How Do We Evaluate DPIQN and DRPIQN?
To evaluate the effectiveness of DPIQN and DRPIQN, we perform our experiments in a soccer game environment illustrated in Fig. 3. We train the agent in two representative scenarios, 1 vs. 1 and 2 vs. 2 games, to learn policy features more efficiently. The opponents and collaborators are rule-based agents that act as illustrated in Fig. 4.
Environment: The Soccer Game
As illustrated in Fig. 3, the soccer field is a grid world composed of multiple grids of 32×32 RGB pixels and is divided into two halves. The game starts with the controllable agent and the collaborator (Fig. 3 (1)) randomly located on the left half of the field, and the opponents (Fig. 3 (2)) randomly located on the right half of the field, except the goals and border zones (Fig. 3 (3)). The initial possession of the ball (highlighted by a blue square) and the modes of the agents (offensive or defensive) are randomly determined for each episode. In each episode, each team’s objective is to deliver the ball to the opposing team’s goal, and the episode terminates immediately once a team scores a goal. Finally, a reward of “+1” is awarded if the controllable agent’s team wins, while a penalty reward of “-1” is given if it loses the match.
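The reward and termination logic described above can be sketched as follows. The coordinate layout, field width, and the `step_outcome` helper are simplified assumptions for illustration, not the exact grid of Fig. 3:

```python
# Minimal sketch of the episode reward logic described above.
# Field geometry is a simplified assumption: goals at x == 0 and x == WIDTH - 1.
WIDTH = 21

def step_outcome(ball_x, our_team_has_ball):
    """Return (reward, done) for the controllable agent's team.

    +1 when our team carries the ball into the opposing goal, -1 when the
    opponents carry it into ours; the episode terminates immediately either way.
    """
    if ball_x == WIDTH - 1 and our_team_has_ball:
        return +1, True    # we scored
    if ball_x == 0 and not our_team_has_ball:
        return -1, True    # the opponents scored
    return 0, False        # play continues

print(step_outcome(WIDTH - 1, True))   # (1, True)
print(step_outcome(0, False))          # (-1, True)
print(step_outcome(10, True))          # (0, False)
```

The sparse terminal reward is what makes the auxiliary policy-inference signal valuable: it provides dense learning feedback long before any goal is scored.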
1 vs. 1 Scenario
In this scenario, the game is played by a controllable agent and an opponent (Fig. 3). The opponent is a two-mode rule-based agent playing according to the policy illustrated in Fig. 4. In the offensive mode, the opponent focuses on scoring a goal or stealing the ball from the controllable agent. In the defensive mode, the opponent concentrates on defending its own goal or moving away from the controllable agent when it possesses the ball.
2 vs. 2 Scenario
In this scenario, each of the two teams contains two agents. The two teams compete against each other on a grid world with larger areas of goals and border zones. We consider two tasks in this scenario: our controllable agent has to collaborate with (1) a rule-based collaborator or (2) a learning agent to compete with the rule-based opponents.
DPIQN and DRPIQN Agents Perform Better Even in Non-Stationary Environments!
Table 2 compares the controllable agent’s average rewards across three types of opponent modes in the testing phase for four models: DQN, DRQN, DPIQN, and DRPIQN. The results show that DPIQN and DRPIQN outperform DQN and DRQN in all cases under the same hyperparameter settings. No matter which mode the opponent adopts, DPIQN and DRPIQN agents are able to score a goal in around 99% of the episodes. Moreover, the results indicate that incorporating the opponent’s policy features into the Q-value learning module does help DPIQN and DRPIQN derive better Q-values than those of DQN and DRQN.
We have also observed that DPIQN and DRPIQN agents tend to play aggressively in most of the games, while DQN and DRQN agents are often confused by the opponent’s moves.
The Learning Curves Prove It!
As illustrated in Fig. 5, DPIQN and DRPIQN learn much faster than DQN and DRQN. DRPIQN’s curve rises more slowly than DPIQN’s due to the extra parameters introduced by the LSTM layers in DRPIQN’s model.
In Fig. 6, it can again be observed that the learning curves of DPIQN and DRPIQN grow much faster than those of DQN and DRQN. Even at the end of the training phase, the average rewards of our models are still increasing. From the results in Table 3 and Fig. 6, we conclude that DPIQN and DRPIQN are generalizable to complex environments with multiple agents.
Fig. 7 plots the learning curves of the three teams in the training phase. The learning curves of both DPIQN and FPDQN grow much faster and more steadily than DQN’s. Moreover, DPIQN consistently receives higher average rewards than the other models. The results in Fig. 7 validate the effectiveness of our model in a multi-agent setting with a learning collaborator.
In Multi-Agent Training, DPIQN May Be The New Solution!
We verified that DPIQN is capable of dealing with non-stationarity by conducting experiments where the controllable agent has to cooperate with a learning agent. Also, we showed that DPIQN is superior in collaboration to a recent multi-agent RL approach. We further validated the generalizability of our models in handling unfamiliar collaborators and opponents.
If you are interested in this work, please refer to our arXiv links for more details.
Please cite our paper as:
Zhang-Wei Hong∗, Shih-Yang Su∗, Tzu-Yun Shann∗, Yi-Hsiang Chang, and Chun-Yi Lee. 2018. A Deep Policy Inference Q-Network for Multi-Agent Systems. In Proc. International Conf. Autonomous Agents and Multiagent Systems (AAMAS), Stockholm, Sweden, pages 1388–1396, July 10–15, 2018.
About us: ELSA Lab, Department of Computer Science, National Tsing-Hua University
ELSA Lab focuses on Deep Reinforcement Learning (DRL), Intelligent Robotics, and Computer Vision for Robotics. We are the leading laboratory in Taiwan combining DRL and Intelligent Robotics.
We have won the “NVIDIA Jetson Developer Challenge World Champion,” the “NVIDIA Embedded Intelligent Robotics Challenge National Champion,” the “ECCV Person In Context (PIC) Challenge 2nd Place,” and many other awards. In addition, we have published several papers at top venues within one year, including NeurIPS, CVPR, IJCAI, AAMAS, ECCV Workshop, ICLR Workshop, GTC, ICCD, etc.
[1] S. V. Albrecht and P. Stone, “Autonomous agents modelling other agents: A comprehensive survey and open problems,” Artificial Intelligence, vol. 258, pp. 66–95, 2018.
[2] B. Collins, “Combining opponent modeling and model-based reinforcement learning in a two-player competitive game,” Master’s Thesis, School of Informatics, University of Edinburgh, 2007.
[3] H. He, J. Boyd-Graber, K. Kwok, and H. Daumé III, “Opponent modeling in deep reinforcement learning,” in Proc. International Conf. Machine Learning (ICML), pp. 1804–1813, Jun. 2016.
[4] M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Proc. International Conf. Machine Learning (ICML), pp. 157–163, Jul. 1994.
[5] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Proc. International Conf. Neural Information Processing Systems (NeurIPS), pp. 6382–6393, Dec. 2017.
[6] M. Jaderberg, V. Mnih, W. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” in Proc. Int. Conf. Learning Representations (ICLR), 2017.
[7] P. W. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, D. Kumaran, and R. Hadsell, “Learning to navigate in complex environments,” in Proc. Int. Conf. Learning Representations (ICLR), 2017.
[8] E. Shelhamer, P. Mahmoudieh, M. Argus, and T. Darrell, “Loss is its own reward: Self-supervision for reinforcement learning,” in Proc. Int. Conf. Learning Representations (ICLR) Workshop, 2017.