Training Agents to Play a Soccer Game

Multi-Agent Reinforcement Learning

Deep reinforcement learning (DRL), which combines RL with deep neural networks (DNNs), has recently shown great success in a wide variety of single-agent stationary settings, including Atari games, robot navigation, and Go. These advances have led researchers to extend DRL to the multi-agent domain for more practical uses, such as investigating the social behaviors of multiple agents and developing algorithms that improve training efficiency [1, 2, 3, 4, 5].

What’s the Challenge in Multi-Agent Training?


In a multi-agent system (MAS), agents share a common environment, in which each of them acts and interacts independently to achieve its own objectives. However, the environment perceived by each agent changes over time due to the actions taken by the other agents, causing non-stationarity in each agent’s observations. A non-stationary environment prevents an agent from assuming that the other agents follow fixed, stationary strategies, which makes their behaviors more complex and more difficult to model.

Policy-Changing Agents, Environmental Dynamics-Modeling

In collaborative or competitive scenarios, in which agents are required to cooperate with or act against other agents, modeling environmental dynamics becomes even more challenging. To act optimally in such scenarios, an agent needs to predict the other agents’ policies and infer their intentions. Furthermore, when the other agents’ policies change dynamically over time, the agent’s own policy needs to change accordingly. A robust methodology for modeling collaborators and opponents in a MAS is therefore needed.

Our Proposed Methodology: Deep Policy Inference Q-Network (DPIQN)

To deal with the above issues, we present the deep policy inference Q-network (DPIQN), which targets multi-agent systems composed of controllable agents, collaborators, and opponents that interact with each other. DPIQN aims to train and control a single agent that interacts with the other agents in a MAS, using only high-dimensional raw observations (e.g., images). We also apply feature representation learning in the form of auxiliary tasks to improve policy learning.

Auxiliary Tasks

Representation learning in the form of auxiliary tasks has recently been employed in several DRL methods. Auxiliary tasks are combined with DRL by learning additional goals [6, 7, 8]. Because auxiliary tasks provide DRL agents with much richer feature representations than traditional methods do, they are potentially better suited to modeling non-stationary collaborators and opponents in a MAS.

Deep Policy Inference Q-Network (DPIQN)

The main objective of DPIQN is to improve the quality of state feature representations of an agent in multi-agent settings. For better understanding, we first introduce the architecture of DPIQN.

Figure 1: Architectures of (a) DPIQN and (b) DRPIQN.
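Since the figure itself is not reproduced here, the following is a rough sketch of what a Q-network with an auxiliary policy-inference head could look like. It is not the exact architecture from Figure 1: the class name `TwoHeadQNet`, the layer sizes, the flat observation vector standing in for raw images, and the element-wise fusion of Q features with policy features are all assumptions made for illustration.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dense(x: Vector, w: Matrix, relu: bool = True) -> Vector:
    """Bias-free fully connected layer, optionally followed by ReLU."""
    out = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    return [max(0.0, v) for v in out] if relu else out


def softmax(z: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def init(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weight matrix of shape (rows, cols)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TwoHeadQNet:
    """Hypothetical two-head network: Q-values plus inferred policy of one other agent."""

    def __init__(self, obs_dim: int, hidden_dim: int, num_actions: int, seed: int = 0):
        rng = random.Random(seed)
        self.w_enc = init(hidden_dim, obs_dim, rng)         # shared encoder
        self.w_pi_feat = init(hidden_dim, hidden_dim, rng)  # policy features
        self.w_pi_out = init(num_actions, hidden_dim, rng)  # policy-inference head
        self.w_q_feat = init(hidden_dim, hidden_dim, rng)   # Q features
        self.w_q_out = init(num_actions, hidden_dim, rng)   # Q-value head

    def forward(self, obs: Vector) -> Tuple[Vector, Vector]:
        h = dense(obs, self.w_enc)
        h_pi = dense(h, self.w_pi_feat)
        h_q = dense(h, self.w_q_feat)
        # Assumption: Q features are modulated element-wise by the policy features.
        fused = [a * b for a, b in zip(h_q, h_pi)]
        q_values = dense(fused, self.w_q_out, relu=False)
        pi_other = softmax(dense(h_pi, self.w_pi_out, relu=False))
        return q_values, pi_other


if __name__ == "__main__":
    net = TwoHeadQNet(obs_dim=6, hidden_dim=8, num_actions=5)
    q_values, pi_other = net.forward([0.1] * 6)
    print("Q-values:", q_values)
    print("Inferred policy of the other agent:", pi_other)
```

The point of the sketch is only that one shared representation feeds two outputs: the agent's own Q-values and a probability distribution over another agent's actions.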

Advanced Version of DPIQN: Deep Recurrent Policy Inference Q-Network (DRPIQN)

We further propose an enhanced version of DPIQN, called deep recurrent policy inference Q-network (DRPIQN), for handling partial observability. Both DPIQN and DRPIQN are trained by an adaptive training procedure, which adjusts the network’s attention to learn the policy features and its own Q-values at different phases of the training process.
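To make the idea of shifting attention between the two objectives concrete, here is a minimal sketch of a combined loss with an adaptive weight. The weighting rule used below, a Q-loss weight of 1 / sqrt(policy-inference loss), is an assumption chosen for illustration, as are the function names `policy_inference_loss`, `q_weight`, and `adaptive_loss`; see the paper for the exact procedure.

```python
import math
from typing import Sequence


def policy_inference_loss(predicted: Sequence[float], observed_action: int) -> float:
    """Cross-entropy between the inferred policy and the other agent's observed action."""
    return -math.log(max(predicted[observed_action], 1e-12))


def q_weight(pi_loss: float, eps: float = 1e-8) -> float:
    """Assumed adaptive weight on the Q-learning loss.

    While the policy-inference loss is large, the weight is small and training
    concentrates on policy features; as that loss shrinks, the Q-loss gains weight.
    """
    return 1.0 / math.sqrt(pi_loss + eps)


def adaptive_loss(q_loss: float, pi_loss: float) -> float:
    """Total loss: weighted Q-learning loss plus policy-inference loss.

    In a real training loop the weight would be treated as a constant
    (no gradient flows through it).
    """
    return q_weight(pi_loss) * q_loss + pi_loss


if __name__ == "__main__":
    # Early training: poor policy inference, so the Q-loss is down-weighted.
    print(adaptive_loss(q_loss=1.0, pi_loss=4.0))
    # Later training: accurate policy inference, so the Q-loss dominates.
    print(adaptive_loss(q_loss=1.0, pi_loss=0.25))
```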

Our Training For DPIQN And DRPIQN

Algorithm 1 Training Procedure of DPIQN and DRPIQN

Both DPIQN and DRPIQN Are Generalizable to Complex Environments!

Consider a multi-agent environment in which both cooperative and competitive agents coexist (e.g., a soccer game). The policies of these agents are diverse in terms of their objectives and tactics. For instance, the aim of the collaborative agents (collaborators) is to work with the controllable agent to achieve a common goal, while that of the competitive agents (opponents) is to act against the controllable agent’s team. In a heterogeneous environment where some of the agents are more offensive and some are more defensive, conditioning the Q-function of the controllable agent on the actions of distinct agents would lead to an explosion of parameters.
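A quick back-of-the-envelope count shows why this blows up. The sketch below compares the number of output entries needed when Q-values are indexed by the joint action of all agents with the number needed when each other agent gets its own policy head. The action-set size of 5 and both function names are assumptions for illustration, and the count covers output entries only, not the full parameter count of any particular network.

```python
def joint_action_outputs(num_actions: int, num_other_agents: int) -> int:
    """Entries needed if Q is indexed by our action and every other agent's action."""
    return num_actions ** (1 + num_other_agents)


def factored_outputs(num_actions: int, num_other_agents: int) -> int:
    """Entries needed with one Q head for our action plus one policy head per other agent."""
    return num_actions + num_other_agents * num_actions


if __name__ == "__main__":
    num_actions = 5  # assumed size of each agent's action set
    for others in (1, 3, 5):
        print(
            f"{others} other agent(s): "
            f"joint = {joint_action_outputs(num_actions, others)}, "
            f"factored = {factored_outputs(num_actions, others)}"
        )
```

The joint-action count grows exponentially with the number of agents, while the factored count grows linearly.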

Figure 2: Generalized architecture of DPIQN and DRPIQN.

How To Evaluate The DPIQN And DRPIQN?

Figure 3: Illustration of the soccer game in the 1 vs. 1 scenario. (1) is our controllable agent, (2) is the rule-based opponent, (3) is the starting zone of the agents. The agent who possesses the ball is highlighted by a surrounding blue square.
Figure 4: The policy of the rule-based agent(s) in each episode.

Environment: The Soccer Game

As illustrated in Fig. 3, the soccer field is a grid world composed of multiple grids of 32×32 RGB pixels and is divided into two halves. The game starts with the controllable agent and the collaborator (Fig. 3 (1)) randomly located on the left half of the field, and the opponents (Fig. 3 (2)) randomly located on the right half of the field, excluding the goals and border zones (Fig. 3 (3)). The initial possession of the ball (highlighted by a blue square) and the modes of the agents (offensive or defensive) are randomly determined for each episode. In each episode, each team’s objective is to deliver the ball to the opposing team’s goal, and the episode terminates immediately once a team scores. A reward of “+1” is given if the controllable agent’s team wins, and a penalty of “-1” is given if it loses.
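The reward and termination rule described above can be sketched in a few lines. The function name `step_outcome`, the team labels, and the zero reward on non-terminal steps are assumptions for illustration; only the "+1" / "-1" outcomes and the immediate termination on a goal come from the description.

```python
from typing import Optional, Tuple


def step_outcome(scoring_team: Optional[str], our_team: str = "left") -> Tuple[float, bool]:
    """Return (reward, done) for the controllable agent after one environment step.

    scoring_team is None when nobody scored on this step.
    """
    if scoring_team is None:
        return 0.0, False   # assumption: no reward until a goal is scored
    if scoring_team == our_team:
        return 1.0, True    # our team scored: win, episode ends
    return -1.0, True       # the other team scored: loss, episode ends


if __name__ == "__main__":
    print(step_outcome(None))
    print(step_outcome("left"))
    print(step_outcome("right"))
```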

1 vs. 1 Scenario

In this scenario, the game is played by a controllable agent and an opponent (Fig. 3). The opponent is a two-mode rule-based agent playing according to the policy illustrated in Fig. 4. In the offensive mode, the opponent focuses on scoring a goal, or stealing the ball from the controllable agent. In the defensive mode, the opponent concentrates on defending its own goal, or moving away from the controllable agent when it possesses the ball.
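The two modes can be read as a small decision table. The sketch below maps the opponent's mode and ball possession to a high-level intent. The names `Mode` and `opponent_intent` and the intent labels are hypothetical, and it assumes that "when it possesses the ball" refers to the opponent holding the ball; the step-by-step movement rules of Fig. 4 are not reproduced.

```python
from enum import Enum


class Mode(Enum):
    OFFENSIVE = "offensive"
    DEFENSIVE = "defensive"


def opponent_intent(mode: Mode, opponent_has_ball: bool) -> str:
    """High-level intent of the two-mode rule-based opponent."""
    if mode is Mode.OFFENSIVE:
        # Offensive: score when holding the ball, otherwise try to steal it.
        return "score_goal" if opponent_has_ball else "steal_ball"
    # Defensive: keep away from the controllable agent when holding the ball
    # (assumed reading), otherwise defend its own goal.
    return "avoid_controllable_agent" if opponent_has_ball else "defend_own_goal"


if __name__ == "__main__":
    for mode in Mode:
        for has_ball in (True, False):
            print(mode.value, has_ball, "->", opponent_intent(mode, has_ball))
```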

2 vs. 2 Scenario

In this scenario, each of the two teams contains two agents. The two teams compete against each other on a grid world with larger areas of goals and border zones. We consider two tasks in this scenario: our controllable agent has to collaborate with (1) a rule-based collaborator or (2) a learning agent to compete with the rule-based opponents.

DPIQN And DRPIQN Agents Perform Better Even in the Non-Stationary Environment!

Table 2: Comparison of DPIQN and DRPIQN versus DQN and DRQN

Proved Better by Learning Curve!

Figure 5: Learning curve comparison in the 1 vs. 1 scenario.
Figure 6: Learning curve comparison in the 2 vs. 2 scenario.
Figure 7: Mean rewards of different collaboration teams.

In Multi-Agent Training, DPIQN May Be The New Solution!

We verified that DPIQN is capable of dealing with non-stationarity by conducting experiments where the controllable agent has to cooperate with a learning agent. Also, we showed that DPIQN is superior in collaboration to a recent multi-agent RL approach. We further validated the generalizability of our models in handling unfamiliar collaborators and opponents.

If you are interested in this work, please refer to our arXiv links for more details.



[1] S. V. Albrecht and P. Stone, “Autonomous agents modelling other agents: A comprehensive survey and open problems,” Artificial Intelligence, vol. 258, pages 66–95, Sep. 2017.



Elsa Lab

ELSA Lab is a research laboratory focusing on Deep Reinforcement Learning, Intelligent Robotics, and Computer Vision.