*ChatGPT: “A person who regrets losing a lot of money at poker”
Introduction
Commercially available poker solvers (none of which I have ever purchased) appear to compute optimal mixed strategies with an algorithm called Counterfactual Regret Minimization (CFR). The purpose of this article is to introduce the basic ideas behind this algorithm.
Counterfactual Regret Minimization
In the article on the multi-armed bandit problem, we saw that it is possible to find good decisions in games with incomplete information by assuming that players can only use the information obtained by actually choosing actions, and by making decisions that minimize the regret of each action. In reality, however, there are cases where a player can know the payoff an action would have produced even without actually choosing it. Typical examples are games such as rock-paper-scissors or poker, where you eventually learn the actions your opponents chose and the results of those actions.
For example, in Texas Hold’em, if you call your opponent’s bet with a particular hand on the river, you get to see your opponent’s hand, so you also know what the payoff would have been had you folded. In other words, in addition to the payoff of the action you actually chose (factual), you can know the payoffs of the actions you did not choose (counterfactual). By computing the payoffs of all the actions you could have taken instead of the actual one, and defining a mixed strategy that shifts probability toward the options with the largest positive regret, we can hope that repeating this trial will converge to the optimal strategy, i.e., the equilibrium strategy. This is the basic idea behind the algorithm called counterfactual regret minimization.
In the multi-armed bandit problem, the regret was defined as the payoff of the ex post optimal action minus the payoff of the action actually selected. In counterfactual regret minimization, the regret of each option is defined as its (counterfactual) payoff, had it always been chosen, minus the (factual) payoff of the action actually selected from the mixed strategy, and it is computed for every option. Formally, denoting our own action by \(\small a\) and our opponent’s action by \(\small \dot{a}\), the counterfactual regret of option \(\small a_i\) is defined as:
\[ \small R_i(T) = \frac{1}{T}\sum_{t=1}^T \left[ u(a_i,\dot{a}_{\text{mixed}}) - u(a_{\text{mixed}},\dot{a}_{\text{mixed}}) \right]. \]
Both players use mixed strategies, but the regret compares the payoff one player would get by deviating to a single fixed action, while the opponent keeps their mixed strategy, with the payoff of actually following the mixed strategy. In the Texas Hold’em example above, we can calculate the counterfactual regrets of each option as:
\[ \small \begin{align*} &R_{\text{Call}}(T) = \frac{1}{T}\sum_{t=1}^T \left[ u(a_{\text{Call}},\dot{a}_{\text{mixed}}) - u(a_{\text{mixed}},\dot{a}_{\text{mixed}}) \right] \\ &R_{\text{Fold}}(T) = \frac{1}{T}\sum_{t=1}^T \left[ u(a_{\text{Fold}},\dot{a}_{\text{mixed}}) - u(a_{\text{mixed}},\dot{a}_{\text{mixed}}) \right]. \end{align*} \]
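For illustration with made-up numbers: suppose that on one trial calling would have won a 10 bb pot while folding forfeits 1 bb, and the mixed strategy happened to fold. That trial contributes \(\small u(a_{\text{Call}},\dot{a}_{\text{mixed}}) - u(a_{\text{mixed}},\dot{a}_{\text{mixed}}) = 10 - (-1) = 11\) to the call regret and \(\small u(a_{\text{Fold}},\dot{a}_{\text{mixed}}) - u(a_{\text{mixed}},\dot{a}_{\text{mixed}}) = -1 - (-1) = 0\) to the fold regret.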
Using the counterfactual regrets calculated from the trials so far, we set the probability with which each player chooses each action in their mixed strategy to
\[ \small \begin{align*} &p_{\text{Call}}(T+1) = \frac{\max\{R_{\text{Call}}(T), 0\}}{\max\{R_{\text{Call}}(T), 0\}+\max\{R_{\text{Fold}}(T), 0\}} \\
&p_{\text{Fold}}(T+1) = \frac{\max\{R_{\text{Fold}}(T), 0\}}{\max\{R_{\text{Call}}(T), 0\}+\max\{R_{\text{Fold}}(T), 0\}}, \end{align*} \]
and update these probabilities after every trial; they eventually converge to constant values. If the denominator is 0, the two actions are chosen with probability 1/2 each. By updating the strategies of both yourself and your opponent simultaneously using these counterfactual regrets, you obtain a mixed strategy that minimizes the expected counterfactual regret for both players. For two-player (zero-sum) games it has been mathematically proven that the strategy converged to in this way corresponds to a Nash equilibrium strategy. The above assumed two action options, but with more options a Nash equilibrium strategy can be found in the same way. In that case, the probability of choosing each option is calculated as:
\[ \small p_i(T+1) = \frac{\max\{R_i(T), 0\}}{\sum_{j=1}^m\max\{R_j(T), 0\}}. \]
When the denominator is 0, writing the number of options as \(\small |A|=m\), the probability is taken to be uniform:
\[ \small p_i(T+1) = \frac{1}{|A|}. \]
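As a minimal sketch of this regret-matching update (the function name and data layout are my own choices, not taken from any particular solver), the rule can be written in Python as follows:
def regret_matching(cumulative_regret):
    #cumulative_regret: one cumulative counterfactual regret per option
    positive = [max(r, 0) for r in cumulative_regret]
    total = sum(positive)
    if total > 0:
        #Probability proportional to positive regret
        return [r / total for r in positive]
    #No option has positive regret: fall back to the uniform strategy 1/|A|
    return [1 / len(cumulative_regret)] * len(cumulative_regret)

#Example: regrets [4, 2, -1] give the mixed strategy [2/3, 1/3, 0]
print(regret_matching([4, 2, -1]))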
When There Are 3 or More Players
In the previous section, we stated that it has been mathematically proven that in the case of two players the strategy obtained by minimizing counterfactual regret equals a Nash equilibrium strategy. Since Texas Hold’em is almost always a multiplayer game, you might be wondering what happens when there are more than two players. Counterfactual regret minimization itself can be applied as an algorithm when there are more than two players, and in practice it will often converge to some strategy. So what is the problem? Generally, in games of imperfect information involving three or more players there are infinitely many Nash equilibrium strategies, and it is impossible to single out a unique one. In other words, the solution obtained by counterfactual regret minimization is likely to be a Nash equilibrium strategy, but only one of infinitely many possible Nash equilibrium strategies.
For example, with two players in rock-paper-scissors, the only Nash equilibrium strategy is to play Rock, Paper, and Scissors with probability 1/3 each, and any player who deviates from this strategy can be exploited. However, when there are three players, there are infinitely many Nash equilibrium strategies, including asymmetric ones in which the players use different strategies. For example, suppose players 1, 2, and 3 adopt mixed strategies that play Rock, Paper, and Scissors with the probabilities:
\[ \small \begin{align*} &[p_1^{\text{Rock}},p_1^{\text{Paper}},p_1^{\text{Scissors}}] = [0.4, 0.3, 0.3] \\ &[p_2^{\text{Rock}},p_2^{\text{Paper}},p_2^{\text{Scissors}}] = [0.3, 0.4, 0.3] \\ &[p_3^{\text{Rock}},p_3^{\text{Paper}},p_3^{\text{Scissors}}] = [0.3, 0.3, 0.4]. \end{align*} \]
These strategies are also Nash equilibrium strategies, meaning that no player can increase their expected payoff by changing their strategy unless other players also change their strategies. Of course, a strategy in which all players play Rock, Paper, or Scissors with a 1/3 probability each would also be a Nash equilibrium strategy, but this would be just one of the countless Nash equilibrium strategies that exist.
From the above, it can be seen that in games of incomplete information with three or more players there is no “uniquely strongest strategy,” and it becomes necessary to read your opponents’ strategies and adopt a strategy that exploits them. People often criticize the GTO strategy in Texas Hold’em as overrated; the gist of that criticism is that unless you are playing heads-up, the GTO strategy is not necessarily the optimal strategy, and advertising that suggests you do not need to take your opponents’ strategies into account overstates what GTO can do. In fact, many commercially available poker solvers can only calculate strategies for two players, or, when there are three or more players, they rely on additional assumptions (fixing the opponents’ strategies and finding a solution that is not strictly a Nash equilibrium).
Counterfactual Regret Minimization Example
I came up with a simple game as an example of counterfactual regret minimization, so I’ll show it here. Cards numbered from 0 to 100 (you can think of them as the probability of winning in Texas Hold’em) are dealt randomly and face down, one to each of the \(\small n\) players, with no duplicates. The player holding the highest card wins. Each player can see the value of their own card without the other players knowing it, and chooses either bet or fold. All players announce their choice simultaneously. The payoffs in each situation are given in the table below.
| Situation | Bet and Win | Bet and Lose | Fold |
|---|---|---|---|
| All players fold | — | — | 0 bb |
| Only one player bets, others fold | +(n-1) bb | — | -1 bb |
| m players bet, n-m players fold | +\(\small \alpha\)×(m-1)+(n-m) bb | -\(\small \alpha\) bb | -1 bb |
While you can always fold and accept a loss of 1 bb, betting means risking \(\small \alpha > 1\) chips. Let us call \(\small \alpha\) the odds. The optimal strategy (bet or fold) then depends on the hand you are dealt. The maximum number of players is 10. The only state variables are your card \(\small h\) and the number of players \(\small n\), so we need to determine a strategy function of the form
\[ \small \pi(s,a), \quad a\in \{\text{bet}, \text{fold}\}, \quad s = (h,n), \; h \in \{0, 1, \ldots, 100\}, \; n \in \{2, 3, \ldots, 10\}. \]
The problem is to find the optimal strategy function \(\small \pi(s,a)\) for this game.
Let us assume that all players are symmetric, since they declare their decisions at the same time, and that the optimal strategy (Nash equilibrium strategy) is common to all players (this is an assumption; in reality asymmetric equilibria are possible for \(\small n>2\), but we will ignore this here). Under this assumption there is no need to consider a separate strategy for each player; all players decide whether to bet or fold using the same strategy function \(\small \pi(s, a)\). The strategy function is obtained by accumulating the average counterfactual regret observed for each decision and setting
\[ \small \begin{align*} &\text{Pr}[\pi(s) = \text{bet}] = \frac{\max\{R_{\text{bet}}(s),0\}}{\max\{R_{\text{bet}}(s),0\}+\max\{R_{\text{fold}}(s),0\}} \\
&\text{Pr}[\pi(s) = \text{fold}] = \frac{\max\{R_{\text{fold}}(s),0\}}{\max\{R_{\text{bet}}(s),0\}+\max\{R_{\text{fold}}(s),0\}}. \end{align*} \]
The specific algorithm is described as follows.
- Randomly deal cards to \(\small n\) players.
- Each player’s action (bet or fold) is determined according to the strategy function table \(\small \pi_{\text{bet}}[h][n]\).
- Calculate the payoff \(\small r_i\) from the cards and the determined action.
- The regret of each player is calculated using the following formula and added to the aggregate value (there is no need to distinguish between players due to the assumption, and the regrets of all players are aggregated to update the common strategy function).
\[ \small \begin{align*} &R_{\text{bet}}(h) = \frac{1}{T} \sum_{t=1}^T \left[ r_{i,\text{bet}}(h, t)-r_{i}(h, t) \right] \\ &R_{\text{fold}}(h) = \frac{1}{T} \sum_{t=1}^T \left[ r_{i,\text{fold}}(h, t)-r_{i}(h, t) \right] \end{align*} \]
Here \(\small r_{i,\text{bet}},r_{i,\text{fold}}\) are the payoffs player \(\small i\) would have received by choosing bet or fold, regardless of the decision actually made by the strategy function. For the action that matches the one actually taken, the regret increment is zero; for the other action, the increment is nonzero whenever its counterfactual payoff differs from the actual payoff.
- Steps 1 to 4 are repeated for the specified number of simulations to determine the optimal strategy function.
In this game, bluffing is meaningless because all players declare their actions at the same time. The optimal strategy is therefore to bet if the value of your hand is above a certain threshold and to fold if it is below it. Let’s calculate the optimal boundary values for \(\small n\) from 2 to 10 players and odds \(\small \alpha\) of 2, 3, and 4. The program implemented in Python is as follows:
import random
import copy

class Card:
    def __init__(self, rank):
        self.rank = rank

class Deck:
    #Build a 101-card deck (ranks 0 to 100)
    def __init__(self):
        self.cards = []
        for i in range(0, 101):
            self.cards.append(Card(i))
    #Draw a random card from the cards remaining in the deck
    def draw(self):
        idx = random.randrange(0, len(self.cards))
        return self.cards.pop(idx)

class Game:
    def __init__(self, nPlayers, odds, blind):
        self.deck = Deck()
        self.nPlayers = nPlayers
        self.odds = odds
        self.blind = blind
        self.hand = []
        self.winner = -1
    #Deal one card to each player and record the holder of the best hand
    def play(self):
        self.hand = []
        for i in range(0, self.nPlayers):
            self.hand.append(self.deck.draw())
        max_hand = self.hand[0].rank
        self.winner = 0
        for i in range(1, self.nPlayers):
            if self.hand[i].rank > max_hand:
                self.winner = i
                max_hand = self.hand[i].rank
    #Decide on the winner among the players who bet (action == 1); -1 if everyone folds
    def getWinnerAfterAction(self, action):
        idx = -1
        for i in range(0, self.nPlayers):
            if action[i] == 1:
                idx = i
                break
        if idx == -1:
            return -1
        max_hand = self.hand[idx].rank
        winner = idx
        for i in range(idx + 1, self.nPlayers):
            if action[i] == 1 and self.hand[i].rank > max_hand:
                winner = i
                max_hand = self.hand[i].rank
        return winner
    #Payoff of each player: bettors risk the odds, folders pay the blind,
    #and the winning bettor takes the whole pot
    def getRewardAfterAction(self, action):
        pot = 0
        reward = [0] * self.nPlayers
        winner = self.getWinnerAfterAction(action)
        if winner > -1:
            for k in range(0, self.nPlayers):
                if action[k] == 1:
                    pot += self.odds
                    reward[k] = -self.odds
                else:
                    pot += self.blind
                    reward[k] = -self.blind
            #The winner necessarily bet, so they collect the pot
            reward[winner] += pot
        return reward

class RewardStatistics:
    nAction = 2  #0 = fold, 1 = bet
    nHand = 101
    def __init__(self):
        self.trial = 0
        self.regret = [[0 for j in range(RewardStatistics.nAction)]
                       for i in range(RewardStatistics.nHand)]
    #Add the counterfactual payoff of always choosing the given action
    def addRewardPure(self, hand, action, reward):
        self.regret[hand][action] += reward
    #Subtract the factual payoff of the mixed strategy from both regrets
    def addRewardMixed(self, hand, action, reward):
        self.regret[hand][0] -= reward
        self.regret[hand][1] -= reward
    ##action by mixed strategy (regret matching)
    def actionMS(self, hand):
        regretFold = max(self.regret[hand][0], 0)
        regretBet = max(self.regret[hand][1], 0)
        if regretFold + regretBet > 0:
            threshold = regretFold / (regretFold + regretBet)
        else:
            threshold = 0.5
        if random.random() < threshold:
            return 0
        return 1
    #Current bet probability for every hand value
    def getStrategy(self):
        ret = []
        for i in range(0, RewardStatistics.nHand):
            regretFold = max(self.regret[i][0], 0)
            regretBet = max(self.regret[i][1], 0)
            if regretFold + regretBet > 0:
                ret.append([i, regretBet / (regretFold + regretBet)])
            else:
                ret.append([i, 0.5])
        return ret

if __name__ == "__main__":
    nPlayers = 5
    odds = 3
    blind = 1
    rs = RewardStatistics()
    for j in range(0, 100000):
        game = Game(nPlayers, odds, blind)
        game.play()
        ##action by mixed strategy
        action = []
        for k in range(0, nPlayers):
            action.append(rs.actionMS(game.hand[k].rank))
        ##factual reward
        reward = game.getRewardAfterAction(action)
        for k in range(0, nPlayers):
            rs.addRewardMixed(game.hand[k].rank, action[k], reward[k])
        ##counterfactual regret (fold)
        for k in range(0, nPlayers):
            actionFold = copy.deepcopy(action)
            actionFold[k] = 0
            reward = game.getRewardAfterAction(actionFold)
            rs.addRewardPure(game.hand[k].rank, 0, reward[k])
        ##counterfactual regret (bet)
        for k in range(0, nPlayers):
            actionBet = copy.deepcopy(action)
            actionBet[k] = 1
            reward = game.getRewardAfterAction(actionBet)
            rs.addRewardPure(game.hand[k].rank, 1, reward[k])
    strats = rs.getStrategy()
    for strat in strats:
        print(str(strat[0]) + ',' + str(strat[1]))
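To read the bet/fold boundary off the printed strategy, the output can be post-processed, for example, as follows (a small sketch of my own; the function name and the 0.5 cutoff are arbitrary choices, and it assumes the bet probability is roughly increasing in the hand value):
def bet_boundary(strats, cutoff=0.5):
    #strats: list of [hand, bet_probability] pairs as returned by getStrategy()
    for hand, p_bet in strats:
        if p_bet >= cutoff:
            return hand
    return None

#Example: print(bet_boundary(rs.getStrategy()))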
The boundaries calculated by the program are approximately the values below: the optimal strategy is to bet when your hand is equal to or greater than the value in the table. You can see that both higher odds and a larger number of players require a stronger hand to bet.
| n | α=2 | α=3 | α=4 |
|---|---|---|---|
| 2 | 49 | 65 | 72 |
| 3 | 57 | 70 | 77 |
| 4 | 64 | 74 | 79 |
| 5 | 68 | 76 | 81 |
| 6 | 71 | 78 | 83 |
| 7 | 73 | 80 | 84 |
| 8 | 75 | 82 | 85 |
| 9 | 77 | 83 | 86 |
| 10 | 79 | 84 | 87 |
Summary
I hope this article has given you a rough understanding of how poker solvers work. Developing a solver for general situations would not be easy, and I have no intention of doing so, but I feel it may be possible to write a program that finds the optimal strategy for specific, restricted situations in Texas Hold’em. I will try that next time.