Recent Progresses in Multi-Agent RL Theory

Reinforcement learning (RL) has made substantial empirical progresses in solving hard AI challenges in the past few years. A big portion of these progresses—Go, Dota 2, Starcraft, economic simulation, social behavior learning, and so on—come from multi-agent RL, that is, sequential decision making involving more than one agents. While the theoretical study of (single-agent) RL has a long history and a vastly growing recent interest, multi-agent RL theory is arguably a newer and less developed field, with its own unique challenges and opportunities that we feel very excited about.

In this extended blog post, we present a brief overview of the basics of multi-agent RL theory, along with some recent theoretical developments in the past few years. We will focus on learning Markov games, and cover the basic formulations, learning goals, planning algorithms, as well as recent advances in sample-efficient learning algorithms under the interactive (exploration) setting.

The majority of this blog post will assume familiarity with the basics of RL theory in the single-agent setting (for example, materials in this fantastic book). We will focus on “what is different” when it comes to multi-agent, and discuss the various recent developments and opportunities therein.

$\def\cS{\mathcal S}$ $\def\cA{\mathcal A}$ $\def\cF{\mathcal F}$ $\def\mc{\mathcal}$ $\def\P{\mathbb P}$ $\def\E{\mathbb E}$ $\def\V{\mathbb V}$ $\def\R{\mathbb R}$ $\def\tO{\widetilde{\mathcal{O}}}$ $\def\eps{\varepsilon}$ $\def\epsilon{\varepsilon}$ $\def\setto{\leftarrow}$ $\def\up{\overline}$ $\def\low{\underline}$ $\def\what{\widehat}$ $\def\ba{\boldsymbol{a}}$

1. Formulation: Markov Games

There are various formulations for multi-agent RL. In this blog post, we will focus on Markov Games (MG; Shapley 1953, Littman 1994), a generalization of the widely-used Markov Decision Process (MDP) framework into the case of multiple agents. We remark that there exist various other frameworks for modeling multi-agent sequential decision making, such as extensive-form games, which can be considered as Markov games with special (tree-like) structures in transition dynamics; when combined with imperfect information, this formulation is more suitable for modeling games such as Poker.

Roughly speaking, a Markov Game has the same (state, action) -> (reward, next state) structure as in an MDP, except that now “action” is replaced by the joint actions of all players, and “reward” is replaced by a collection of reward functions so that each player has her own reward. In words:

Markov Games with m players: (state, m actions) -> (m rewards, next state).

Formally, we consider a multi-player general-sum Markov game with $m$ players, which can be described as a tuple $\text{MG}(H, \mathcal{S}, \{\mathcal{A}_{i}\}_{i=1}^{m}, \mathbb{P},\{r_{i}\}_{i=1}^{m})$, where

$H$ is the horizon length,
$\mathcal{S}$ is the state space with $\vert \mathcal{S}\vert=S$.
$\mathcal{A}_i$ is the action space for the $i$-th player with $\vert\mathcal{A}_i\vert=A_i$. All players act simultaneously; we use $\ba=(a_1,\dots,a_m)\in \prod_{i=1}^m \mathcal{A}_i$ to denote a joint action,
$\mathbb{P}=\{\mathbb{P}_h\}$ is the collection of transition probabilities, where $\mathbb{P}_h(s_{h+1} \vert s_h, \ba_h)$ is the probability of transiting to the next state $s_{h+1}$, given state-action pair $(s_h, \ba_h)$ at step $h$,
$r_i=\{r_{i,h}\}$ is the reward function for the $i$-th player. At step $h$, the $i$-th player will experience reward $r_{i,h}(s_h, \ba_h)\in[0, 1]$. For simplicity of presentation, we assume the reward is deterministic.

A trajectory of an MG looks like this: [ s_1, (a_{1, 1}, \dots, a_{m, 1}), (r_{1,1},\dots, r_{m, 1}), s_2, \dots, s_H, (a_{1, H}, \dots, a_{m, H}), (r_{1,H}, \dots, r_{m,H}). ] Finally, a (Markov) policy for the $i$-th player is denoted by $\pi_i=\{\pi_{i,h}\}_{h=1}^H$, where $\pi_{i,h}(a_{i,h}\vert s_h)$ is the probability of taking action $a_{i,h}$ at step $h$ and state $s_h$. We use $V_{i,h}^{\pi}(s_h)$ and $Q_{i,h}^{\pi}(s_h, \ba_h)$ to denote the value and Q-value functions for player $i$ at step $h$, when joint policy $\pi$ is used. (This can either be a product policy $\pi=(\pi_1,\dots,\pi_m)$ which can be executed independently, or a correlated policy which samples the joint action $\ba_h$ for all the players together in a potentially correlated fashion.) We assume the initial state $s_1$ is deterministic, so that $V_{i,1}^{\pi} := V_{i,1}^\pi(s_1)$ is the overall return for the $i$-th player.

1.1 Special case: Zero-sum games

An important special case of Markov games is zero-sum (competitive) MGs:

A zero-sum Markov Game is a Markov Game with $m=2$, and $r_1\equiv -r_2$.

As each player wishes to maximize her own reward, in a zero-sum MG, the two players are in a purely competitive relationship. We thus define the overall value in a zero-sum MG to be player 1’s value function: $V_h:=V_{1,h}$. We call player 1 the max player and player 2 the min player.

Zero-sum MGs model a number of real-world two-player strategic games, such as Go, Chess, Shogi, etc. From a mathematical perspective, zero-sum MGs is perhaps the most immediate extension of a single-agent MDP—solving it can be thought of as a certain min-max optimization, akin to how solving a single-agent MDP is a certain maximization problem.

2. Nash equilibrium

In an MDP, the learning goal for an agent is to maximize her own cumulative reward. In an MG, this is still the learning goal for each individual agent, but we have to additionally decide on some notion of equilibrium to combine these individual optimality conditions.

Throughout the majority of this blog post, we consider the celebrated notion of Nash equilibrium (Nash 1951), which states that for each player $i$, fixing all other player’s policies, her own policy is a best response (i.e. maximizes her own cumulative reward):

Nash equilibrium: Each player plays the best response to all other player’s policies.

It is clear that the Nash equilibrium strictly generalizes the notion of an optimal policy for MDPs—An MG reduces to an MDP if player 1 acts and all other players are dummy (in the sense their actions do not affect transition and rewards). In this case, the best response of player 1 reduces to maximizing her own reward with respect to the environment (MDP) only.

Formally, we define an $\eps$-approximate Nash equilibrium (henceforth also $\eps$-Nash) as follows. Below, $\pi_{-i}$ denotes the collection of all but the $i$-th player’s policy within $\pi$.

Definition: A product policy $\pi=(\pi_1,\dots, \pi_m)$ is an $\eps$-approximate Nash equilibrium, if we have [ \max_{i\in[m]} \big( \max_{\pi_i’} V_{i,1}^ {\pi_i’, \pi_{-i}}- V_{i,1}^{\pi_i, \pi_{-i}} \big) \le \eps. ] We say $\pi$ is an (exact) Nash equilibrium if the above holds with $\eps=0$.

The Nash equilibrium need not be unique for (general-sum) MGs. Also, note that by definition any Nash policy must be a product policy (i.e. all the players execute their own policy independently). There exist other proper notions of equilibria when players do play in a correlated fashion; we will dive deeper into this in Section 5.

2.1 Nash equilibrium in zero-sum games

The Nash equilibrium enjoys additional nice properties in zero-sum MGs. Recall that $V_1^{\pi_1, \pi_2}$ denotes player 1’s value for which $\pi_1$ seeks to maximize and $\pi_2$ seeks to minimize.

Proposition: For any zero-sum MG, we have

(Unique Nash value) There exists a $V_1^{\star}\in[0,H]$ such that all Nash equilibrium $(\pi_1^\star, \pi_2^\star)$ achieves this same value: $V_1^{\pi_1^\star, \pi_2^\star} = V_1^{\star},$ even though the Nash policy $(\pi_1^\star, \pi_2^\star)$ may not be unique.
(Strong duality) Any Nash equilibrium is also a min-max solution and a max-min solution. In other words, the following two inequalities (which holds for any pair of $(\pi_1, \pi_2)$) \[ \left\{ \begin{aligned} & \max_{\pi_1'} V_1^{\pi_1', \pi_2} \ge \min_{\pi_2'} \max_{\pi_1'} V_1^{\pi_1', \pi_2'} \ge \max_{\pi_1'} \min_{\pi_2'} V_1^{\pi_1', \pi_2'} \ge \min_{\pi_2'} V_1^{\pi_1, \pi_2'} \\ & \max_{\pi_1'} V_1^{\pi_1', \pi_2} \ge V_1^{\pi_1, \pi_2} \ge \min_{\pi_2'} V_1^{\pi_1, \pi_2'} \end{aligned} \right. \] become equalities ($=V_1^{\star}$) if $(\pi_1, \pi_2)$ is a Nash equilibrium.

These nice properties make zero-sum MGs an appealing setting for learning Nash equilibria, which will be the focus of our next two sections.

3. Planning algorithms in zero-sum MGs

Given the above setup, a natural first question is: How do we compute the Nash equilibrium in zero-sum MGs assuming known transitions and rewards? In other words, what is a good planning algorithm for learning Nash equilibria? It turns out that a direct multi-agent adaptation of Value Iteration, dating back to (Shapley 1953), gives an algorithm for computing the Nash equilibrium for a known game. We term this algorithm as Nash-VI (Nash Value Iteration).

Similar to how Value Iteration (VI) computes the optimal values and policies for an MDP using backward dynamical programming, Nash-VI performs similar backward updates on an MG except for the policy selection mechanism: the maximization over the vector $Q^\star(s, \cdot)$ in VI is now replaced by a $\text{MatrixNash}$ subroutine over the matrix $Q^{\star}(s, \cdot, \cdot)$. This makes sense given that the learning goal is now learning a Nash equilibrium policy over the two players jointly, instead of the reward maximizing policy for a single player. In words,

Nash-VI = Value Iteration with $\text{Max}(\cdot)$ replaced by $\text{MatrixNash}(\cdot)$.

Algorithm (Nash-VI):

Initialize $V_{H+1}(s)\equiv 0$ for all $s\in\cS$.
For $h=H,\dots,1$, compute the following over $s\in\cS$ and all $(a_1, a_2)\in\cA_1\times \cA_2$:

\[ \begin{aligned} & Q_h^{\star}(s, a_1, a_2) = r_h(s, a_1, a_2) + (\P_h V_{h+1}^\star)(s, a_1, a_2), \\\ & (\pi_{1,h}^\star(\cdot\vert s), \pi_{2,h}^\star(\cdot\vert s)) = \text{MatrixNash}(Q_h^{\star}(s,\cdot,\cdot)), \\\ & V^{\star}_h(s) = \pi_{1,h}^\star(\cdot \vert s)^\top Q_h(s,\cdot,\cdot) \pi_{2,h}^\star(\cdot \vert s). \end{aligned} \]

Return policies $(\pi_1^\star, \pi_2^\star)$ where $\pi_i^\star=\{\pi_{i,h}^\star(\cdot\vert s)\}_{(h,s)\in[H]\times \cS}$.

Above, the $\text{MatrixNash}$ subroutine for any matrix $M\in \R^{A_1\times A_2}$ is defined as \[ \begin{align}\label{equation:matrixnash} \text{MatrixNash}(M) := \arg \big( \max_{\pi_1 \in \Delta_{\cA_1}} \min_{\pi_2 \in \Delta_{\cA_2}} \pi_1^\top M \pi_2 \big), \end{align} \] which can be implemented efficiently using linear programming. Here $\Delta_{\cA_1}$ and $\Delta_{\cA_2}$ are the probability simplex over action sets $\cA_1$ and $\cA_2$ respectively.

Using backward induction over $h$, it is straightforward to show that the policies $(\pi_1^\star, \pi_2^\star)$ returned by Nash-VI is indeed a Nash equilibrium of the desired MG. The $V_h^{\star}(s_h)$ and $Q_h^{\star}(s,a_1,a_2)$ computed above are uniquely defined; we call them Nash values of the game at state $s$ and state-action tuple $(s,a_1,a_2)$ respectively.

Value Iteration is not the only algorithm for computing Nash. As an alternative, the Nash Q-Learning algorithm (Hu & Wellman 2003) performs incremental updates on the Q values, and uses the same matrix Nash subroutine to compute (Nash) V values from the Q values.

Further, both Nash-VI and Nash Q-learning can be adapted when the game transitions and rewards are not known and have to be estimated from samples. For example, using random samples from a simulator (can query any $(s_h, a_{h,1}, a_{h,2})$ and obtain a sample of $(r_h, s_{h+1})$), an $\eps$-Nash can be learned with $\tO(H^3SA_1A_2/\eps^2)$ samples with high probability, using variants of either Nash-VI (Zhang et al. 2020) or Nash Q-learning (Sidford et al. 2019).

Centralization. Finally, we remark that both Nash-VI and Nash Q-learning are centralized training algorithms—that is, the computation for the two players are coupled (due to the joint computation in the matrix Nash subroutine). In order to execute these algorithms, the two players could not act in isolation, and have to coordinate or even live in a same centralized machine (self-play). In Section 4.2, we will present another decentralized algorithm, which is further capable of improving the dependency of sample complexity on the number of actions from $\tO(A_1A_2)$ to $\tO(\max\{A_1, A_2\})$.

4. Sample-efficient learning of Nash equilibria in zero-sum MGs

We now move on to the more challenging setting of learning Nash equilibria from interactive learning environments. Rather than assuming full knowledge or simulators of the game, we now assume that the algorithm can only learn about the game by playing repeated episodes of the game in a trial-and-error fashion (also known as the interactive/exploration setting). This requires the algorithms to further address the challenge of balancing exploration and exploitation. Our goal is to design algorithms that are sample-efficient, i.e. can learn an $\eps$-Nash with as few episodes of play as possible.

4.1 Optimistic Nash-VI

Our first algorithm is an optimistic variant of the Nash-VI algorithm we have seen in Section 3. This algorithm first appeared in (Liu et al. 2020), and builds on earlier sample-efficient Nash-VI type algorithms within (Bai & Jin 2020, Xie et al. 2020).

Algorithm (Optimistic Nash-VI, sketch):

Initialize $\up{Q}_h(s, a_1, a_2)\setto H$ and $\low{Q}_h(s, a_1, a_2)\setto 0$ for all $(s, a_1, a_2, h)$.
For episode $k=1,\dots,K$:
- For step $h=H,\dots,1$:
  - For all $(s, a_1, a_2)\in\cS\times \cA_1\times \cA_2$:
    - Set $t\setto N_h(s, a_1, a_2)$ (visitation count).
    - If $t>0$ then
      - Compute bonus $\beta\setto \text{BernsteinBonus}(t, \what{\V}_h[(\up{V}_{h+1} + \low{V}_{h+1})/2] (s, a_1, a_2)$.
      - Compute additional bonus $\gamma\setto (c/H) \what{\P}_h(\up{V}_{h+1} - \low{V}_{h+1})(s, a_1, a_2)$.
      - $\up{Q}_h(s, a_1, a_2)\setto \min\{ (r_h + \what{P}_h\up{V}_{h+1})(s, a_1, a_2) + \beta + \gamma, H \}$.
      - $\low{Q}_h(s, a_1, a_2)\setto \max\{ (r_h + \what{P}_h\low{V}_{h+1})(s, a_1, a_2) - \beta - \gamma, 0 \}$.
  - For all $s\in \cS$:
    - $\pi_h(\cdot, \cdot \vert s) \setto \text{MatrixCCE}(\up{Q}_h(s, \cdot, \cdot), \low{Q}_h(s, \cdot, \cdot))$.
    - $\up{V}_h(s) \setto \langle \pi_h(\cdot, \cdot \vert s), \up{Q}_h(s, \cdot, \cdot) \rangle$.
    - $\low{V}_h(s) \setto \langle \pi_h(\cdot, \cdot \vert s), \low{Q}_h(s, \cdot, \cdot) \rangle$.
- Play an episode using (correlated) policy $\pi$, and update the empirical models $\what{\P}_h$, $N_h$.
- Set $(\pi_1^k, \pi_2^k)$ to be the marginal policy of the current policy $\pi$.

Main techniques. Optimistic Nash-VI enhances the original Nash-VI in two main aspects in order to perform sample-efficient exploration in an interactive learning setting:

The algorithm maintains two optimistic Q estimates: An upper estimate $\up{Q}_h$, and a lower estimate $\low{Q}_h$. The amount of optimisticity is governed by the two bonus functions $\beta$ and $\gamma$, where $\beta$ is similar to the usual Bernstein bonus for exploration in single-agent settings (Azar et al. 2017, Jin et al. 2018), and $\gamma$ is a specifically designed model-based bonus that allows a tighter analysis, akin to the one used in (Dann et al. 2018). We remark that different from the single-agent setting, the introduction of $\gamma$ is key in achieving $\mathcal{O}(S)$ sample complexity instead of $\mathcal{O}(S^2)$.
The policy is now computed using a matrix CCE subroutine. The use of matrix CCE in the context of learning Markov game is first introduced by (Xie et al. 2020). For any pair of matrices $\up{M}, \low{M}\in\R^{A_1\times A_2}$, $ \text{MatrixCCE}(\up{M}, \low{M})$ is defined as any correlated policy $\pi \in \Delta_{\cA_1\times \cA_2}$ such that the following holds: \[ \begin{aligned} & \E_{(a_1, a_2)\sim \pi} [\up{M}(a_1, a_2)] \ge \max_{a_1'\in \cA_1} \E_{(a_1, a_2)\sim \pi}[\up{M}(a_1', a_2)], \\ & \E_{(a_1, a_2)\sim \pi} [\low{M}(a_1, a_2)] \le \min_{a_2'\in \cA_2} \E_{(a_1, a_2)\sim \pi}[\low{M}(a_1, a_2')]. \end{aligned} \] In other words, if the two players jointly play $\pi$, the max player has no gain in deviating for the payoff matrix $\up{M}$, and the min player has no gain in deviating for the payoff matrix $\low{M}$.

The purpose of this matrix CCE subroutine is to find a good policy with respect to the upper / lower estimates in a computationally efficient fashion (this subroutine can be implemented using linear programming). Indeed, here a general-sum matrix Nash subroutine with respect to $(\up{Q}_h(s,\cdot,\cdot), \low{Q}_h(s,\cdot,\cdot))$ may as well be used (and leads to the same sample complexity results), yet would suffer from computational difficulties for large $A_1,A_2$, as finding Nash in a two-player general-sum matrix game is PPAD hard (Daskalakis et al. 2013).

These techniques enable the optimistic Nash-VI algorithm to achieve the following sample complexity guarantee for learning Nash equilibria in zero-sum MGs.

Theorem (Liu et al. 2020): With high probability, optimistic Nash-VI outputs an $\eps$-Nash within $K=\tO(H^3SA_1A_2/\eps^2)$ episodes of play.

Compared with the minimax sample complexity $H^3SA/\eps^2$ for single-agent MDPs achieved by the UCBVI algorithm (Azar et al. 2017, with matching lower bound in (Jin et al. 2018, Domingues et al. 2021)), the sample complexity achieved by Nash-VI for zero-sum MGs only pays an additional $A_1A_2$, i.e. product of the action space. This may seem natural at first sight as the joint action space in a zero-sum MG does have size $A_1A_2$. However, as we will see in the next subsection, it is possible to design a decentralized learning algorithm that achieves a $\max\{A_1,A_2\}$ dependence on the action spaces.

Centralized training, decentralized execution. Note that the $\pi$ returned by the matrix CCE subroutines and used in the exploration step is a correlated policy. Therefore, the entire algorithm has to be executed in a centralized fashion. However, the final output is the marginals of this correlated policy and thus is an (approximately Nash) product policy that can be deployed in a decentralized way. This is reminiscent of the “centralized training, decentralized execution” paradigm which is also commonly used in empirical multi-agent RL.

4.2 V-Learning

Our second algorithm is a recently proposed algorithm, V-Learning, which first appeared in (Bai et al. 2020) in the name of Nash V-Learning. Here, we present the simplified version of the algorithm in its journal version (Jin et al. 2021c). V-Learning itself can be viewed as a single-agent RL algorithm. To run V-learning in the multiagent setting, we simply let each player run the single-agent V-learning entirely on her own as long as she observes the states, rewards, and next states (as if it were a single-agent MDP). The opponent’s action does not need to be observed. Therefore, V-learning is naturally a decentralized algorithm. We present the algorithm as follows.

Algorithm (V-Learning):

For all $(h, s, a)\in [H]\times \cS\times \cA$, initialize $V_h(s)\setto H-h+1$, $\pi_{h}(a \vert s)\setto 1/A$, and $N_h(s)\setto 0$.
For episode $k=1,\dots,K$:
- Receive $s_1$.
- For $h=1,\dots,H$:
  - Take action $a_{h}\sim \pi_{h}(\cdot \vert s_h)$, observe reward $r_h$ and next state $s_{h+1}$.
    (No need to observe the opponent’s action.)
  - $t=N_h(s_h)\setto N_h(s_h)+1$.
  - $V_{h}(s_h)\setto (1-\alpha_t) V_h(s_h) + \alpha_t (r_h + \min\{V_{h+1}(s_{h+1}), H-h\} + \beta_t)$.
  - $\pi_{h}(\cdot \vert s_h) \setto \text{Adv_Bandit_Update}(a_{h}, \frac{H-r_h-\min\{V_{h+1}(s_{h+1}), H-h\} }{H})$ on the $(s_h, h)$-th adversarial bandit.

Algorithm (Executing output policy of V-Learning):

Sample $k\setto\text{Unif}([K])$.
For step $h=1,\dots,H$:
- Observe $s_h$, and set $t\setto N_h^k(s_h)$.
- Set $k\setto k_h^i(s_h)$, where $k_h^1(s_h)<\dots<k_h^t(s_h)$ are the indices of all past visitations to $(h, s_h)$ prior to episode $k$, and $i\in [t]$ is sampled randomly with probability $\alpha_t^i$.
- Take action $a_{h}\sim \pi_{h}^k(\cdot\vert s_h)$.

The structure of the V-Learning algorithm is rather distinct from either Nash-VI or Nash Q-learning (perhaps its closest relative among classical RL algorithms). We briefly explain its three main components:

Incremental update of $V$: This part is similar to Q-learning, except that we do not model the Q functions, but rather directly model the V functions (hence the name V-learning). Consequently, this update avoids constructing Q functions in the multiagent setting (which is of the size $\cA_1 \cA_2$). It also “ignores” the actions taken by the opponent.
The incremental update uses a step-size sequence $\{\alpha_t\}_{t\ge 0}$. We choose $\alpha_t=(H+1)/(H+t)$ following the analysis of Q-learning in (Jin et al. 2018).
Adversarial bandit subroutine: Recall that in Q-learning (or its optimistic version), the greedy (argmax) rule over the estimated Q function is used at each $(s,h)$ to determine the policy. In V-learning, the policy at each $(s_h, h)$ is rather maintained by an adversarial bandit subroutine (there are $SH$ such adversarial bandits in total).
Concretely, after $(s_h, h)$ is encountered and $(r_h, s_{h+1})$ are received, the corresponding adversarial bandit performs an update step of the form “Action $a_{h}\in\cA$ suffered loss $\frac{H-r_h-\min\{V_{h+1}(s_{h+1}), H-h\} }{H}\in [0,1]$.” The adversarial bandit subroutine is then responsible for determining how to actually compute the new policy given this update.
In V-Learning, we instantiate the adversarial bandit as a weighted follow-the-regularized-leader algorithm, in order to achieve a per-state weighted regret guarantee of a certain form, at each $(s, h)$.
Output policy: The output policy of V-learning is a nested mixture policy: A random integer $k$ is updated by random sampling throughout the execution of this policy, and at each $h$ this $k$ determines a particular policy for drawing the action $a_{h}$. This makes the policy non-Markov in general, which is drastically different from Nash-VI as well as most classical algorithms in single-agent RL which outputs Markov policies.
Technically, such an output policy is critically required in order to extract from the per-state “regret” guarantees at each $(s,h)$ a policy that enjoys an overall near-Nash guarantee on the overall game value. (Hence its original name “certified policy” in (Bai et al. 2020) as this policy “certifies” the value functions appearing in the per-state regret bounds.) Here, the sampling probabilities $\alpha_t^i$ are defined as [ \alpha_t^0 = \prod_{j=1}^t (1-\alpha_j),~~~\alpha_t^i = \alpha_i \prod_{j=i+1}^t (1-\alpha_j). ]

The above techniques lead to the following guarantee on V-Learning for zero-sum MGs.

Theorem (Bai et al. 2020, Jin et al. 2021c): Suppose both players run the V-Learning algorithm with the $\text{Adv_Bandit_Update}$ instantiated as a suitable weighted Follow-The-Regularized-Leader algorithm for each $(s,h)$. Then, running this for $K$ episodes, (the product of) their certified output policies $(\what{\pi}_1, \what{\pi}_2)$ is an $\epsilon$-Nash as long as $K\ge \tO(H^5S\max\{A_1, A_2\}/\eps^2)$.

Compared with optimistic Nash-VI, Nash V-Learning achieves milder dependence on the action space ($\max\{A_1,A_2\}$ vs. $A_1A_2$) thanks to its decentralized nature, at the cost of worse $H$ factors. We remark that the original sample complexity presented in (Bai et al. 2020) is $\tO(H^6S\max\{A_1,A_2\}/\eps^2)$, and one $H$ factor can be improved by adopting the sharper analysis in (Tian et al. 2020).

4.3 Summary of algorithms

The table below summarizes the algorithms we have introduced so far for learning $\eps$-Nash in zero-sum MGs. Note that here Nash Q-Learning (and its optimistic version) was not presented in this blog post; the actual algorithm and theoretical guarantee can be found in Section 3 of (Bai et al. 2020).

Algorithm	Training	Update	Main estimand	Sample complexity
Nash-VI	centralized	model-based	$\P_h(s'\vert s,a_1,a_2)$	$\tO(H^3SA_1A_2/\eps^2)$
Nash Q-Learning	centralized	model-free	$Q_h^{\star}(s,a_1,a_2)$	$\tO(H^5SA_1A_2/\eps^2)$
V-Learning	decentralized	model-free	$V_h^{\star}(s)$	$\tO(H^5S\max\{A_1, A_2\}/\eps^2)$

5. General-sum MGs with many players

We now shift attention to the more general case of general-sum MGs with $m$ players for any $m\ge 2$. For general-sum MGs, we are not only interested in learning near-Nash product policies, but also the broader class of correlated equilibria for correlated policies, which we define as follows.

5.1 Correlated Equilibria (CE) and Coarse Correlated Equilibria (CCE)

Definition (Coarse Correlated Equilibrium): A correlated policy $\pi$ is an $\eps$-Coarse Correlated Equilibrium (CCE), if any player could not improve her own reward by more than $\eps$ by deviating from $\pi$ and playing some other policy on her own: [ \max_{i\in[m]} \max_{\pi_i’} \Big( V_{i,1}^{\pi_i’, \pi_{-i}} - V_{i,1}^{\pi} \Big) \le \eps. ]

Definition (Correlated Equilibrium; Informal): A correlated policy $\pi$ is an $\eps$-Correlated Equilibrium (CE), if any player could not improve her own reward by more than $\eps$ by first observing her own action sampled from the correlated policy at each state, then deviating to some other action: [ \max_{i\in[m]} \max_{\phi\in \Phi_i} \Big( V_{i,1}^{\phi\diamond \pi} - V_{i,1}^{\pi} \Big) \le \eps. ] Above, $\Phi_i$ is the set of all possible strategy modification functions for player $i$ (for a formal definition see (Liu et al. 2020)).

In general, we have {Nash}$\subset${CE}$\subset${CCE}. Therefore, CE, CCE can be thought of as relaxations of Nash and make sense as learning goals on their own right as general-sum Nash suffers from PPAD hardness in the computational efficiency anyway (Daskalakis et al. 2013).

We remark that similar as Nash, the notions of CE and CCE are standard and widely studied in the game theory literature, with many real-world implications (see e.g. this book & Wikipedia). The above definitions are direct extensions of the original definitions (for strategic games) into Markov games.

5.2 Value Iteration algorithms for Nash, CE, and CCE

The model-based optimistic Nash-VI algorithm can be adapted straightforwardly to the case of multiple players, by simply replacing the MatrixCCE subroutine in Optimistic Nash-VI algorithm with (multi-dimensional) matrix {Nash,CE,CCE} subroutines. This gives the following result for learning equilibria in general-sum MGs.

Theorem (Liu et al. 2020): A multi-player adaptation of the optimistic Nash-VI algorithm coupled with (multi-dimensional) matrix {Nash,CE,CCE} subroutines can learn an $\eps$-{Nash, CE, CCE} of a general-sum MG within $\tO(H^4S^2\prod_{i\le m} A_i/\eps^2)$ episodes of play.

The conceptual advantage of this algorithm is that it estimates the full transition model $\P_h(s_{h+1} \vert s_h, \ba_h)$ directly; it can then learn all three kinds of equilibria by plugging in the corresponding (multi-dimensional) matrix subroutines. However, this algorithm suffers from the curse of multiagents: the sample complexity scales as $\prod_{i\le m} A_i$, which is not surprising given that it estimates the full transition model. Also note that, despite the sample efficiency, the multi-player matrix Nash subroutine involved in the above algorithm does not likely admit a poly time implementation due to the PPAD hardness in computing general-sum Nash.

5.3 V-Learning for CE and CCE

Recall that V-Learning is a decentralized algorithm that can be deployed by each player independently, and for zero-sum MGs one could extract from its execution history a suitable output policy that is near Nash. It turns out that this paradigm generalizes nicely to the problems of learning CE and CCE in general-sum MGs. Specifically, recent concurrent works show that V-Learning algorithms instantiated with suitable adversarial bandit subroutines could learn CE/CCE sample-efficiently, without suffering from the curse of multiagents:

Theorem (Song et al. 2021 & Jin et al. 2021c) For $m$-player general-sum MGs, suppose each player run the V-Learning algorithm independently with suitably chosen adversarial bandit subroutines, and we extract a correlated output policy in a suitable fashion. Then:

Instantiating the adversarial bandit subroutine as a regret minimizer (CCE-V-Learning), the correlated output policy is an $\eps$-CCE within $\tO(H^5S\max_{i\le m} A_i/\eps^2)$ episodes of play;

Instantiating the adversarial bandit subroutine as a swap regret minimizer (CE-V-Learning), the correlated output policy is an $\eps$-CE within $\tO(H^5S(\max_{i\le m} A_i)^2/\eps^2)$ episodes of play.

Compared with the VI algorithm from the last section, V-Learning breaks the curse of multiagents as the sample complexity scales polynomially in $\max_{i\le m}A_i$, as opposed to $\prod_{i\le m} A_i$ which is exponential in $m$. This is thanks to the decentralized learning structure of the V-Learning algorithm.

(Bib note: We remark that Song et al. 2021 actually achieves $\tO(H^6S(\max_{i\le m}A_i)^2/\eps^2)$ for learning CE, which is one $H$ factor worse than Jin et al. 2021c due to using a slightly different swap regret minimizer. The concurrent work of Mao & Basar 2021 also achieves a $\tO(H^6S\max_{i\le m}A_i/\eps^2)$ result for learning $\eps$-CCE.)

6. Learning MGs with function approximation

In single-agent MDPs, there is a rich body of work on the theory of function approximation—that is, sample-efficient learning in large state/action spaces equipped with a restricted function class. Here we briefly review some recent advances on function approximation in Markov Games. Throughout this section, we shift back to considering two-player zero-sum MGs.

6.1 Linear function approximation

Similar as a linear MDP, a (zero-sum) linear MG is a Markov Game whose transitions and rewards satisfy the following linear structure:

\[ \begin{aligned} & \P_h(s' | s, a_1, a_2) = \langle \phi(s, a_1, a_2), \mu_h(s') \rangle, \\ & r_h(s, a_1, a_2) = \langle \phi(s, a_1, a_2), \theta_h \rangle, \end{aligned} \]

where $\phi:\cS\times \cA_1\times \cA_2\to \R^d$ is a known feature map and $\mu_h$, $\theta_h$ are (unknown) parameters for the transitions and rewards.

The following result shows that linear MGs can be learned sample-efficiently with mild polynomial dependencies on $d,H$, similar as in linear MDPs. Their algorithm is similar to the optimistic Nash-VI algorithm, and further builds upon the single-agent optimistic least-squares value iteration algorithm (Jin et al. 2020) to utilize the linear structure in the transitions and rewards, by using a bonus function similar to linear bandits. To adapt this algorithm into the multi-agent setting, their algorithm again replaces the maximization over actions by matrix CCE subroutine, and uses both the upper bounds and lower bounds.

Theorem (Xie et al. 2020): For linear zero-sum MG with horizon length $H$ and feature dimension $d$, there exists an algorithm that learns an $\eps$-Nash within $\tO(d^3H^4/\eps^2)$ episodes of play.

We remark that Chen et al. 2021 consider the related model of linear mixture MGs and design an algorithm with episode complexity $\tO(d^2H^3/\eps^2)$.

6.2 General function approximation; The power of exploiter

The above sample complexity results are based on a direct adaptation of Nash VI into the linear setting. Such an approach critically depends on a restrictive assumption called optimistic closure, which is true for tabular MGs and linear MGs, but rarely holds in general function approximation.

To remove such an assumption, in the single-agent setting, we can modify the standard optimistic VI algorithm which computes an upper bound of value functions at each step, to a globally optimistic version which only computes the upper bound of value functions at the first step. The latter approach considers confidence sets at all steps simultaneously, thus providing a much sharper optimistic value, and allowing the algorithm to learn a near-optimal policy sample-efficiently (Zanette et al. 2020, Jin et al. 2021a).

To adapt such an idea to the multi-agent setting, it turns out that we need one more technique—the use of an exploiter. In contrast to the Nash-VI algorithm where both agents compute Nash policies simultaneously, the new type of algorithm treats two agents asymmetrically. It picks one agent as the main (learning) agent, while letting the other agent be the exploiter. During training, the main agent tries to learn the Nash policy, while the exploiter keeps playing the best response to the current policies of the main agents. That is, the exploiter facilitates the learning of the main player by deliberately exploiting her weakness.

The resulting algorithm after combining VI with global optimism and exploiter is the algorithm Golf_with_Exploiter (Jin et al. 2021b). The following result shows that this algorithm is capable of sample-efficient learning if the function class has a low multiagent Bellman Eluder (BE) dimension—a complexity measure adapted from its single-agent version (Jin et al. 2021a).

Theorem (Jin et al. 2021b): For any zero-sum MG with horizon length $H$ equipped with a Q-function class $\cF$ whose multi-agent Bellman Eluder dimension is $\tO(d)$, Golf_with_Exploiter algorithm learns an $\eps$-Nash within $\tO(H^2d\log(\vert \cF\vert)/\eps^2)$ episodes of play.

The concurrent work of Huang et al. 2021 also establish a $\tO(1/\eps^2)$ sample complexity guarantee for zero-sum MG with general function approximation, under the slightly different complexity measure of minimax Eluder dimension.

7. Other research frontiers; End note

In this blog post, we presented a brief overview of recent advances in multi-agent RL theory. We presented several learning goals (Nash, CE, CCE), planning and sample-efficient learning algorithms for these learning goals in two tabular settings (two-player zero-sum MGs, multi-player general-sum MGs), and touched on the related topic of function approximation.

Apart from the aforementioned questions, there are many other active research topics in MARL theory we have not covered in this blog post:

Further design and analysis of decentralized algorithms.
Policy optimization algorithms for Markov Games.
Other notions of equilibria (e.g. Stackelberg equilibria).
Markov potential games.
Imperfect-information games.

Similar to the topics covered in our earlier sections, the above topics also admit unique theoretical challenges brought by multiple agents that are not present in single-agent settings. We believe that tackling these challenges will provide many exciting opportunities for theory research down the road.

(Last edited: May 20, 2022)