[ICML 2021 Spotlight] DFAC Framework: Factorizing the Value Function via Quantile Mixture for Multi-Agent Distributional Q-Learning

Elsa Lab
7 min read · Sep 26, 2021

ICML 2021 Full Paper

ICML Presentation Video

Demonstration Video

Keywords

DFAC, Multi-agent reinforcement learning, SMAC, distributional Q-learning, value function factorization, quantile mixture

Introduction

In multi-agent reinforcement learning (MARL), the environments are highly stochastic due to the partial observability of each agent and the continuously changing policies of the other agents. One popular research direction is to enhance the training procedure of fully cooperative and decentralized agents. In the past few years, a number of MARL researchers have turned their attention to centralized training with decentralized execution (CTDE).

Among these CTDE approaches, value function factorization methods are especially promising in terms of their superior performance and data efficiency. Value function factorization methods introduce the individual-global-max (IGM) assumption [1], which assumes that each agent’s optimal action results in the optimal joint action of the entire group. Based on IGM, the total return of a group of agents can be factorized into separate utility functions for each agent. The utilities allow the agents to independently derive their own optimal actions during execution. Unfortunately, current value function factorization methods concentrate only on estimating the expectations of the utilities, overlooking the additional information contained in the full return distributions.

On the other hand, distributional RL has been empirically shown to enhance value function estimation in various single-agent RL (SARL) domains. Instead of estimating a single scalar Q-value, it approximates the probability distribution of the return by either a categorical distribution [2] or a quantile function [3]. Even though the above methods may be beneficial to the MARL domain due to their ability to capture uncertainty, they are inherently incompatible with expected value function factorization methods (e.g., value decomposition network (VDN) [4] and QMIX [5]). The incompatibility arises from two aspects: (1) maintaining IGM in a distributional form, and (2) factorizing the probability distribution of the total return into individual utilities.

In this paper, we propose the Distributional Value Function Factorization (DFAC) framework to efficiently integrate value function factorization methods with distributional RL. DFAC resolves the incompatibility with two techniques: (1) Mean-Shape Decomposition and (2) Quantile Mixture. The former allows the generalization of expected value function factorization methods to their DFAC variants without violating IGM. The latter allows the total return distribution to be factorized into individual utility distributions in a computationally efficient manner.

Methodology

Background

Value-based Methods for Fully Cooperative MARL:
Independent Q-Learning (IQL) is the simplest value-based learning method for MARL, where each agent attempts to maximize the total rewards separately. This causes non-stationarity due to the changing policies of the other agents and may not converge. Thus, value function factorization methods are introduced to enable centralized training of factorizable tasks based on the IGM condition, where the optimal individual actions result in the optimal joint action of the group of agents:

Equation 1: The IGM condition. Q is the Q-function, h denotes the joint action-observation history, and u denotes the joint action (the subscript k refers to agent k).
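For concreteness, the IGM condition stated in [1] can be reconstructed in LaTeX as follows (notation follows the caption above; the original figure may differ slightly):

```latex
% IGM condition: the joint greedy action decomposes into per-agent greedy actions.
\arg\max_{\mathbf{u}} Q_{\mathrm{jt}}(\mathbf{h}, \mathbf{u}) =
\begin{pmatrix}
  \arg\max_{u_1} Q_1(h_1, u_1) \\
  \vdots \\
  \arg\max_{u_K} Q_K(h_K, u_K)
\end{pmatrix}
```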

The earlier methods VDN [4] and QMIX [5] assume additional premises, additivity and monotonicity respectively, to simplify the factorization:

Equation 2: Additivity equation.
Equation 3: Monotonicity equation.
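Written out explicitly (Q_jt denotes the joint Q-function over the K agents), the two premises can be reconstructed as:

```latex
% Additivity (VDN): the joint Q-value is the sum of the per-agent utilities.
Q_{\mathrm{jt}}(\mathbf{h}, \mathbf{u}) = \sum_{k=1}^{K} Q_k(h_k, u_k)

% Monotonicity (QMIX): the joint Q-value is monotonically non-decreasing
% in every per-agent utility.
\frac{\partial Q_{\mathrm{jt}}(\mathbf{h}, \mathbf{u})}{\partial Q_k(h_k, u_k)} \ge 0,
\quad \forall k \in \{1, \dots, K\}
```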

Distributional Reinforcement Learning:
Distributional RL methods have been empirically shown to outperform expected RL methods in various single-agent RL (SARL) domains. The distributional Bellman operator T^π is proved to be a contraction in the p-Wasserstein distance W_p, ∀p ∈ [1, ∞):

Equation 4: Distributional Bellman operator.
Equation 5: The p-Wasserstein distance W_p between the distributions of random variables X and Y, where the F functions denote the quantile functions (inverse CDFs) of X and Y.
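In standard notation (following [2] and [3]), the two definitions above can be reconstructed as:

```latex
% Distributional Bellman operator; the equality holds in distribution.
\mathcal{T}^{\pi} Z(h, u) \overset{D}{=} R(h, u) + \gamma Z(h', u'),
\quad h' \sim P(\cdot \mid h, u), \; u' \sim \pi(\cdot \mid h')

% p-Wasserstein distance between random variables X and Y,
% expressed through their quantile functions (inverse CDFs).
W_p(X, Y) = \left( \int_0^1 \big| F_X^{-1}(\omega) - F_Y^{-1}(\omega) \big|^p
\, d\omega \right)^{1/p}
```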

The Proposed DFAC Framework

The Factorization Network Ψ can be any differentiable factorization function (e.g., VDN, QMIX), while the Shape Network Φ is defined by a quantile mixture:

Figure 1: The DFAC framework.

Mean-Shape Decomposition

A naive distributional generalization of IGM does not satisfy IGM in general. Thus, we introduce the Mean-Shape Decomposition to separate the approximation of the mean and the shape of the return distribution:

Equation 6: Mean-Shape Decomposition.
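Based on the description above, the decomposition rewrites the joint return distribution as its expectation plus a zero-mean remainder (a reconstruction of the caption; the paper's exact notation may differ):

```latex
% Mean-Shape Decomposition: expected value plus a zero-mean shape term.
Z_{\mathrm{jt}}(\mathbf{h}, \mathbf{u})
  = \mathbb{E}\left[ Z_{\mathrm{jt}}(\mathbf{h}, \mathbf{u}) \right]
  + \left( Z_{\mathrm{jt}}(\mathbf{h}, \mathbf{u})
         - \mathbb{E}\left[ Z_{\mathrm{jt}}(\mathbf{h}, \mathbf{u}) \right] \right)
  = Z_{\mathrm{mean}}(\mathbf{h}, \mathbf{u}) + Z_{\mathrm{shape}}(\mathbf{h}, \mathbf{u}),
\qquad \mathbb{E}\left[ Z_{\mathrm{shape}} \right] = 0
```

Intuitively, greedy action selection (and hence IGM) depends only on the mean term, while the shape term captures the remaining randomness of the return without affecting the expectation.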

Practical Implementation with Quantile Mixture

The factorization network Ψ can be any expected value function factorization method, while the shape network Φ can be approximated by a quantile mixture:

Equation 7: Quantile Mixture.
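A quantile mixture expresses the quantile function of the joint return as a non-negative weighted combination of the per-agent quantile functions (a reconstruction of the caption; in the simplest case, uniform weights recover a VDN-style sum):

```latex
% Quantile mixture: the joint quantile function is a non-negative
% weighted sum of the per-agent quantile functions.
F_{Z_{\mathrm{jt}}}^{-1}(\omega \mid \mathbf{h}, \mathbf{u})
  = \sum_{k=1}^{K} w_k \, F_{Z_k}^{-1}(\omega \mid h_k, u_k),
\qquad w_k \ge 0, \quad \omega \in [0, 1]
```

To make the combination of the two parts concrete, the snippet below is a minimal NumPy sketch of a DDN-style forward pass. It assumes each agent outputs N quantile samples of its utility; the additive mean factorization (Ψ as a VDN-style sum) and all function and variable names are illustrative rather than the authors' implementation:

```python
import numpy as np

def dfac_forward(agent_quantiles, mixture_weights=None):
    """Illustrative DFAC forward pass (mean factorization + quantile mixture).

    agent_quantiles: (K, N) array; row k holds N samples of agent k's utility
        quantile function F_{Z_k}^{-1}(w) at shared quantile fractions w_1..w_N.
    mixture_weights: optional non-negative mixture weights (uniform by default).
    Returns: (N,) array of joint-return quantile estimates.
    """
    K, N = agent_quantiles.shape
    w = np.ones(K) if mixture_weights is None else np.asarray(mixture_weights)
    w = np.maximum(w, 0.0)  # quantile-mixture weights must be non-negative

    # Mean part: factorize the expected utilities (additive, VDN-style Psi).
    agent_means = agent_quantiles.mean(axis=1)   # E[Z_k] for each agent
    z_mean = agent_means.sum()                   # Psi(E[Z_1], ..., E[Z_K])

    # Shape part: quantile mixture of the zero-mean utility distributions (Phi).
    centered = agent_quantiles - agent_means[:, None]
    z_shape = (w[:, None] * centered).sum(axis=0)

    # Mean-Shape Decomposition: joint quantiles = mean term + shape term.
    return z_mean + z_shape

# Example: 3 agents, 8 quantile samples each (sorted so rows are valid quantiles).
rng = np.random.default_rng(0)
print(dfac_forward(np.sort(rng.normal(size=(3, 8)), axis=1)))
```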

Experimental Results and Analysis

Environment

We verify the DFAC framework on the StarCraft Multi-Agent Challenge (SMAC) benchmark environments [6], built on the popular real-time strategy game StarCraft II. Rather than targeting the full game, SMAC is developed for evaluating the effectiveness of MARL micro-management algorithms. Each environment in SMAC contains two teams. One team is controlled by a decentralized MARL algorithm, with the policies of the agents conditioned on their local observation histories. The other team consists of enemy units controlled by the built-in game AI. The overall objective is to maximize the win rate in each battle scenario.
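For readers unfamiliar with SMAC, the snippet below sketches its basic Python interaction loop (based on the publicly available smac package; exact signatures may vary across versions), with random actions standing in for a learned decentralized policy:

```python
import numpy as np
from smac.env import StarCraft2Env  # requires the smac package and StarCraft II

# One Super Hard scenario from the paper; any SMAC map name can be used here.
env = StarCraft2Env(map_name="3s5z_vs_3s6z")
n_agents = env.get_env_info()["n_agents"]

env.reset()
terminated, episode_return, info = False, 0.0, {}
while not terminated:
    obs = env.get_obs()      # per-agent local observations (decentralized execution)
    state = env.get_state()  # global state, used only during centralized training
    actions = []
    for agent_id in range(n_agents):
        avail = np.nonzero(env.get_avail_agent_actions(agent_id))[0]
        actions.append(np.random.choice(avail))  # stand-in for each agent's policy
    reward, terminated, info = env.step(actions)  # shared team reward
    episode_return += reward

print("episode return:", episode_return, "battle won:", info.get("battle_won", False))
env.close()
```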

The environments in SMAC are categorized into three levels of difficulty: Easy, Hard, and Super Hard. We focus on all Super Hard scenarios: (a) 6h_vs_8z, (b) 3s5z_vs_3s6z, (c) MMM2, (d) 27m_vs_30m, and (e) corridor. We select IQL [7], VDN [4], and QMIX [5] as our baseline methods, and compare them with their distributional DFAC variants (DIQL, DDN, and DMIX, respectively) in our experiments.

Figure 2: The win rate curves evaluated on the five Super Hard maps in SMAC for different CTDE methods.
Table 1: The median win rate % of five independent test runs. Maps (a)-(e) correspond to the maps in Fig. 2.
Table 2: The averaged scores of five independent test runs. Maps (a)-(e) correspond to the maps in Fig. 2.

In Fig. 2 and Table 1, it can be observed that the learning curves of DDN and DMIX grow faster and achieve higher final win rates than their corresponding baselines. On the most difficult map, 6h_vs_8z, most of the methods fail to learn an effective policy, except for DDN and DMIX. In addition to the win rates, Table 2 further presents the final averaged scores of each method given by the SMAC environment, providing deeper insights into the advantages of the DFAC framework.

Conclusion

In this paper, we provided a distributional perspective on value function factorization methods, and introduced a framework, called DFAC, for integrating distributional RL with MARL domains. In order to validate the effectiveness of DFAC, we presented experimental results performed on all Super Hard scenarios in SMAC for a number of MARL baseline methods as well as their DFAC variants. The results show that DDN and DMIX outperform VDN and QMIX. DFAC can be extended to more value function factorization methods and offers an interesting research direction for future endeavors.

Paper Download

[ICML]

Github

[DFAC]

Please cite this paper as follows:

W.-F. Sun, C.-K. Lee, and C.-Y. Lee, “DFAC framework: Factorizing the value function via quantile mixture for multi-agent distributional Q-learning”, in Proc. Int. Conf. on Machine Learning (ICML), Jul. 2021.

References

[1] Son, K., Kim, D., Kang, W. J., Hostallero, D. E., and Yi, Y. QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proc. Int. Conf. on Machine Learning (ICML), pp. 5887–5896, Jul. 2019.
[2] Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In Proc. Int. Conf. on Machine Learning (ICML), pp. 449–458, Jul. 2017.
[3] Dabney, W., Ostrovski, G., Silver, D., and Munos, R. Implicit quantile networks for distributional reinforcement learning. In Proc. Int. Conf. on Machine Learning (ICML), pp. 1096–1105, Jul. 2018.
[4] Sunehag, P. et al. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proc. Int. Conf. on Autonomous Agents and MultiAgent Systems (AAMAS), pp. 2085–2087, May 2018.
[5] Rashid, T. et al. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proc. Int. Conf. on Machine Learning (ICML), pp. 4295–4304, Jul. 2018.
[6] Samvelyan, M. et al. The StarCraft Multi-Agent Challenge. In Proc. Int. Conf. on Autonomous Agents and MultiAgent Systems (AAMAS), pp. 2186–2188, May 2019.
[7] Tan, M. Multi-agent reinforcement learning: Independent versus cooperative agents. In Proc. Int. Conf. on Machine Learning (ICML), pp. 330–337, Jun. 1993. ISBN 1558603077.


Elsa Lab

ELSA Lab is a research laboratory focusing on Deep Reinforcement Learning, Intelligent Robotics, and Computer Vision.