Libratus – The AI that Defeated Humans in the Game of Poker

Summary: This blog describes the four key concepts behind Libratus' algorithm, which comprehensively defeated four professional poker players.

Libratus and the "Unpopular" NIPS 2017 Best Paper

Libratus is an artificial intelligence computer program designed to play poker, specifically no-limit Texas hold'em. Its creators, Professor Tuomas Sandholm and his student Noam Brown of CMU, intend for its methods to generalize beyond poker to other applications.

In January 2017, Libratus took on four professional poker players, beating all of them handily and winning nearly all the chips in play. Then, in November of 2017, the paper co-authored by Noam Brown and Tuomas Sandholm, "Safe and Nested Subgame Solving for Imperfect-Information Games", received the NIPS 2017 Best Paper Award.

However, the paper did not spark particularly passionate debate among Chinese AI experts. There are three possible reasons for the lack of enthusiasm:

  • Heavy Mathematical Notation: An entire section of the paper is occupied by definitions of mathematical symbols, and once those definitions end, the rest of the paper is a lengthy proof of a theorem. On a first read, the paper feels like an endless fog of mathematics.
  • Absence of a Popular Subject: Currently, the two hot topics in the field of machine learning are deep learning and reinforcement learning. The main topics of this paper, however, are Game Theory and operations research.
  • Perceived Lack of Versatility: Some readers are under the impression that Libratus' algorithm applies only to Texas hold'em, or at best can be extended to other board games, and therefore has limited application prospects.

On November 8, 2017, I had the opportunity to meet Professor Tuomas Sandholm at AI World 2017, hosted by AI Era, where he delivered a keynote speech. Conversing over dinner, Professor Sandholm explained to me that, from an application perspective, Libratus' algorithm is not restricted to gambling. It can be used for negotiations, auction pricing, investment decisions, and much more.


Image 1. On November 8th, Tuomas Sandholm (middle) arrived in Beijing to attend AI World 2017. He is pictured with the founder and CEO of AI Era, Yang Jing (left), and the author of this article, Dr. Deng Kan.

Understanding Libratus Algorithm Methodology

In terms of methodology, the Libratus algorithm is a combination of Game Theory and operations research. Markov decision processes and dynamic programming, by contrast, form the theoretical foundation of reinforcement learning. Even though the two lines of work have different origins, they will eventually converge.

I believe that the collision of the Libratus algorithm with reinforcement learning and deep learning will be an amazing step forward for AI.
John Von Neumann once said, "Life is not like Go, where you can see all of the pieces on the board. Real life is full of bluffing, deception, and speculation about what is going on in the minds of others. This is the Game Theory that I study."

Four Key Concepts of the Libratus Paper

Understanding the Libratus paper requires you to understand four key concepts.

  • Nash equilibrium strategy
  • Counterfactual best response
  • Abstraction
  • Endgame solving

The following sections describe these four components of the Libratus paper.

Nash Equilibrium Strategy: Do Not Give Your Opponent Any Easy Wins

Libratus is currently only capable of playing two-person poker. However, future improvements to the algorithm and the addition of more GPUs could allow it to play against multiple opponents at the same time.
When playing Texas hold'em, each player is first dealt a hand of two cards, and then the poker dealer reveals five more cards in succession. There are four rounds of betting, after which the players compare their hands to see who wins.

2

Image 2. A game of Texas hold'em poker

It can be very difficult to guess the two cards in your opponent's hand. Public information includes the five community cards as well as both parties' betting history. However, your opponents will not necessarily bet strictly according to their hand. Instead, they will mix in tactics meant to deceive you.

How Does Libratus Determine Its Opponent's Betting Strategy?

Actually, Libratus does not speculate on its opponent's betting strategy.
To be a bit more specific, Libratus' only goal is to reach the Nash Equilibrium Strategy. Nash was a famous mathematician and the subject of the film "A Beautiful Mind." The goal of Nash Equilibrium Strategy is to avoid giving your opponent any easy wins.

Nash Equilibrium strategy is a passive defense strategy rather than an active attack strategy; it does not attempt to speculate on the opponent's betting strategy.
For example, suppose the opponent's strategy is to blindly and mindlessly call your bet no matter what. An exploitative player would discover this game plan after a few hands and start raising whenever its hand is good in order to take advantage of it. A Nash Equilibrium Strategy forgoes such opportunistic adjustments and simply keeps playing the equilibrium.

But what's stopping your opponent from finding vulnerabilities in the Nash Equilibrium Strategy? In theory, the Nash Equilibrium Strategy guarantees that, over enough hands to average out luck, there will be absolutely no vulnerabilities for the opponent to exploit.

The question is whether Libratus adheres strictly to the Nash Equilibrium Strategy. The purpose of the mathematical proofs occupying such a large section of the paper is to show that, at least in one-on-one (heads-up) Texas hold'em, the Libratus algorithm is a close approximation of the Nash Equilibrium Strategy.
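To make "no easy wins" precise: in game-theoretic terms (my notation, not the paper's), a pair of strategies $(\sigma_1, \sigma_2)$ is an $\epsilon$-Nash equilibrium if neither player can gain more than $\epsilon$ by switching strategies unilaterally:

$$u_1(\sigma_1, \sigma_2) \ge \max_{\sigma_1'} u_1(\sigma_1', \sigma_2) - \epsilon, \qquad u_2(\sigma_1, \sigma_2) \ge \max_{\sigma_2'} u_2(\sigma_1, \sigma_2') - \epsilon,$$

where $u_i$ is player $i$'s expected profit. At $\epsilon = 0$, there is literally nothing the opponent can change in order to win more; "close approximation" means the paper bounds how small Libratus' $\epsilon$ is.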

How Can You Find the Optimal Betting Strategy?

Since the Nash Equilibrium Strategy does not attempt to determine the opponent's strategy, we can, in principle, recursively analyze the game tree, enumerating every possible hand combination to find the best possible betting strategy.

To determine the optimal betting strategy, begin by creating a betting table with seven columns and a very large number of rows.

Column 1: The two cards in our hand.
Column 2: The community cards already on the table, of which there are five at most.
Column 3: Each party's betting history.
Column 4: Our remaining chips. This value can be derived from the betting history in column three; the fourth column is included simply for convenience.
Column 5: The two cards in our opponent's hand.
Column 6: The chips we are planning to bet in the next round.
Column 7: Expected profit from the strategy in column six.

The first four columns are inputs, called the "situation" in the paper (Information Set, or InfoSet). The last three columns are outputs: an estimate of the hidden information, a strategy, and a predicted profit.
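As a concrete mental model, here is a minimal sketch of one row of that table in Python (the field names are my own, not the paper's):

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class BettingTableRow:
    # Inputs: the "situation" (InfoSet), columns 1-4
    hole_cards: Tuple[str, str]        # column 1: our two cards
    community_cards: Tuple[str, ...]   # column 2: zero to five board cards
    betting_history: Tuple[str, ...]   # column 3: both players' bets so far
    remaining_chips: int               # column 4: derivable from column 3
    # Outputs: columns 5-7
    opponent_hand_estimate: Dict[Tuple[str, str], float]  # column 5: hidden information
    planned_bet: str                   # column 6: strategy for the next round
    expected_profit: float             # column 7: estimated value of column 6
```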
With enough computing power, we would be able to calculate every possible situation (InfoSet).

  • Column 1: our hand (pre-flop) has (52 × 51) / (2 × 1) = 1,326 possible combinations.
  • Column 2: the first three community cards (the flop) have (50 × 49 × 48) / (3 × 2 × 1) = 19,600 possible combinations. Then come the next two community cards (the turn and the river). Added together, there are a total of 19,600 + 19,600 × 47 + 19,600 × 47 × 46 = 43,316,000 possible combinations.

If we multiply the possible combinations in column one and column two together, we get a total of 1,326 × 43,316,000 = 57,437,016,000 combinations.
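These counts are easy to verify in a few lines of Python:

```python
import math

holes = math.comb(52, 2)                        # column 1: 1,326 hole-card combos
flops = math.comb(50, 3)                        # first three community cards: 19,600
boards = flops + flops * 47 + flops * 47 * 46   # flop, +turn, +turn-and-river: 43,316,000

print(holes, boards, holes * boards)            # 1326 43316000 57437016000
```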
There are even more possible combinations of values in column three, the betting history. This makes the table extraordinarily large: 10^165 rows, according to the paper. This data volume is simply too vast; even with the most powerful computers currently available, running all of the numbers would take decades.

However, if we had enough computing power to traverse the table, we would be able to calculate the average profit of every betting strategy for every card combination. For example, after the fifth community card (the river) has been revealed, the opponent's hand could be any one of (45 × 44) / (2 × 1) = 990 combinations of cards. Furthermore, the first three rounds of betting (pre-flop, flop, turn) could take any of N different forms. At this point, during our round of betting, each betting strategy, for example going all in, could yield one of 990 × N possible profits. The average of these 990 × N profit values is the average value of the betting strategy.
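In code, that averaging step might look like the sketch below; `payoff` stands in for a full hand evaluator and bet-settlement routine, which I leave hypothetical:

```python
from itertools import combinations

def average_profit(my_hand, board, history, action, payoff, deck):
    """Average profit of `action`, treating every opponent hand consistent
    with the visible cards as equally likely (990 combos on the river).
    `payoff(my_hand, opp_hand, board, history, action)` is a hypothetical
    function returning our profit for one completed hand."""
    seen = set(my_hand) | set(board)
    remaining = [c for c in deck if c not in seen]
    opponent_hands = list(combinations(remaining, 2))  # C(45, 2) = 990 on the river
    total = sum(payoff(my_hand, opp, board, history, action)
                for opp in opponent_hands)
    return total / len(opponent_hands)
```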

Faced with a certain combination of cards, we can then calculate the average profit of each possible betting strategy and select the strategy with the highest average. You may be wondering: isn't the betting strategy with the highest average profit the Nash Equilibrium Strategy? No, it's not.

Why? You see, if our opponent's hand is weak, for example a 3 of hearts and a 6 of clubs, then the chances of him betting a large amount are quite low. The simple average above treats all 990 possible opponent hands as equally likely, but the opponent's betting is correlated with his hand strength.
How, then, are we supposed to calculate the probability of an opponent's betting strategy?

This problem is complicated further because the opponent determines his betting strategy not only from his own cards and the community cards, but also from our betting history and his estimation of our cards.

Counterfactual Best Response: A Popular Algorithm for Finding the Nash Equilibrium Strategy

Counterfactual Best Response (CBR) is an algorithm for finding the Nash Equilibrium Strategy that has recently enjoyed quite a bit of popularity; the iterative, regret-driven procedure described below is essentially the idea behind the Counterfactual Regret Minimization (CFR) family of algorithms that Libratus builds on.
There are three main elements to Counterfactual Best Response:

  • Simulation
  • Regret
  • Iteration

Faced with a game situation (InfoSet), that is, the data in the first four columns of our betting table, we start by assuming that we choose our betting strategy uniformly at random. Playing this way, we go through a series of dealing and betting cycles, calculating our wins or losses after each round.

After a few million rounds, we will have already covered a wide range of game situations.

Subsequently, we enter the regret phase. "Regretting" refers to going back to a game situation and, instead of betting randomly, choosing one specific strategy, for example calling the opponent's bet.

This phase likewise fixes a specific game situation, that is, a set value for each of the first four columns in our betting table. After the regret point, the community cards still to be revealed stay the same; what changes is the betting that follows.

We then re-simulate the previous rounds. During the re-simulation, both players' hands, the cards on the table, the dealing order, and all betting prior to the regret point remain unchanged. Only the betting from the regret point onward changes: at that decision, we substitute the regret strategy, while all other betting strategies are left as they were.

Once this simulation is complete, we can calculate the average profit from the regret strategy.

Moving on, we switch to another regret strategy under the same game situation. For example, if the prior regret phase used the call strategy, then for the next simulation we might switch to a raise strategy and re-simulate the same game situations. This way we can calculate the average profit for the second regret strategy.

After running enough of these simulations, we can calculate the average profit for each regret strategy in different game situations. We then repeat the above process to confirm the outcomes of different strategies.

Finally, we use this data to update the betting table (Action Mapping). To do so, first clear the fifth column of the table, since in real play we cannot observe the two cards in the opponent's hand. Then, in the seventh column, enter the average profit for each betting strategy.

This was the first round of simulation and regretting. Next, we initiate the second round of simulation.

During the second round of simulation, we first assume that the opponent still chooses betting strategies uniformly at random, while we select our betting strategy according to the updated betting table.

When we encounter a new game situation, or different values in the first four columns, we use the following method to select a betting strategy:

  • We begin by finding all the relevant rows in our updated betting table according to the values in the first four columns that represent the present situation.
  • In our updated table, the sixth column holds the betting strategies, while the seventh holds the average profit for each strategy. We select a betting strategy according to average profit: the higher a strategy's average profit, the higher the probability that we choose it (see the sketch below).
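A minimal sketch of that probability-proportional selection (regret matching uses the positive parts, so strategies that have performed badly get probability zero):

```python
import random

def pick_strategy(rows):
    """rows: list of (betting_strategy, average_profit) pairs for the
    current situation, i.e. columns six and seven of the table."""
    weights = [max(profit, 0.0) for _, profit in rows]
    total = sum(weights)
    if total == 0:                      # nothing has positive profit yet:
        return random.choice(rows)[0]   # fall back to a uniform choice
    r = random.uniform(0, total)
    for (strategy, _), w in zip(rows, weights):
        r -= w
        if r <= 0:
            return strategy
    return rows[-1][0]
```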

Now that we have determined the hands, dealing order, and initial betting history of the players for the second round of simulation, we can begin the regretting phase. The result is another betting table recording the betting strategies and their average profits under different game situations.

In the third round of simulation, we use the second version of the betting table, while the opponent uses the first version. After several rounds of simulation, we can generate our third betting table.

The last phase is to iterate this process over and over. After enough iterations, the betting table we are left with will be a close approximation of the Nash Equilibrium Strategy.
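To make the simulate-regret-iterate loop concrete, here is a self-contained toy version of counterfactual regret minimization for Kuhn poker, a three-card simplification of hold'em. This is a classic textbook sketch, not Libratus itself, which used far larger abstractions and a Monte Carlo variant of this idea:

```python
import random

ACTIONS = ["p", "b"]   # pass/check and bet/call
nodes = {}             # InfoSet key -> accumulated regrets and strategy sums

class Node:
    def __init__(self):
        self.regret_sum = [0.0, 0.0]
        self.strategy_sum = [0.0, 0.0]

    def strategy(self, weight):
        # Regret matching: play in proportion to positive accumulated regret.
        pos = [max(r, 0.0) for r in self.regret_sum]
        total = sum(pos)
        strat = [p / total for p in pos] if total > 0 else [0.5, 0.5]
        for i in range(2):
            self.strategy_sum[i] += weight * strat[i]
        return strat

    def average_strategy(self):
        total = sum(self.strategy_sum)
        return [s / total for s in self.strategy_sum] if total > 0 else [0.5, 0.5]

def cfr(cards, history, p0, p1):
    """Returns the expected utility for the player about to act."""
    player = len(history) % 2
    opponent = 1 - player
    if len(history) > 1:                       # terminal states
        if history[-1] == "p":
            if history == "pp":                # check-check: showdown for 1 chip
                return 1 if cards[player] > cards[opponent] else -1
            return 1                           # a bet was folded to: bettor wins 1
        if history[-2:] == "bb":               # bet-call: showdown for 2 chips
            return 2 if cards[player] > cards[opponent] else -2
    key = str(cards[player]) + history
    node = nodes.setdefault(key, Node())
    strat = node.strategy(p0 if player == 0 else p1)
    utils, node_util = [0.0, 0.0], 0.0
    for i, a in enumerate(ACTIONS):
        # The recursive value is for the next player to act, so negate it.
        if player == 0:
            utils[i] = -cfr(cards, history + a, p0 * strat[i], p1)
        else:
            utils[i] = -cfr(cards, history + a, p0, p1 * strat[i])
        node_util += strat[i] * utils[i]
    reach = p1 if player == 0 else p0          # opponent's reach probability
    for i in range(2):
        # Counterfactual regret: how much better action i would have done.
        node.regret_sum[i] += reach * (utils[i] - node_util)
    return node_util

random.seed(0)
deck = [1, 2, 3]
for _ in range(100_000):                       # iterate: simulate, regret, repeat
    random.shuffle(deck)
    cfr(deck, "", 1.0, 1.0)
for key in sorted(nodes):
    print(key, [round(p, 3) for p in nodes[key].average_strategy()])
```

The printed average strategies, not the final round's strategies, are what converge toward the Nash Equilibrium Strategy, mirroring how the final betting table above is the product of all the iterations.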

Simplify the Game Situation to Minimize the Required Computing Power

As we mentioned in the previous section, the first two columns alone in the betting table contain a maximum of 57,437,016,000 different combinations of values. Adding every possible combination of betting history results in 10^165 possible value combinations.

Furthermore, we must go through several billion simulations when we use Counterfactual Best Response to look for the Nash Equilibrium Strategy.

The massive betting table, combined with the billions of simulations required by the Counterfactual Best Response method, means that the entire process requires an amount of computing power that would be, at best, incredibly difficult to come by.

One obvious solution would be to produce a smaller betting table by simplifying the game situation via abstraction.

For example, imagine two possible hands: one is an 8 of hearts and an 8 of spades, the other an 8 of diamonds and an 8 of clubs. For the purposes of poker, these are equal in value. Therefore, we can take every combination of two 8s, (4 × 3) / (2 × 1) = 6 combinations in total, and condense them into one row of the betting table.
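Pre-flop, this kind of suit-symmetric grouping is lossless and collapses the 1,326 hole-card combinations into the familiar 169 classes (13 pairs, 78 suited hands, 78 offsuit hands). A minimal sketch:

```python
RANKS = "23456789TJQKA"

def preflop_bucket(card1, card2):
    """Cards are (rank, suit) pairs, e.g. ("8", "h") for the 8 of hearts.
    All six combinations of two 8s map to the single bucket "88"."""
    (r1, s1), (r2, s2) = card1, card2
    hi, lo = sorted([r1, r2], key=RANKS.index, reverse=True)
    if hi == lo:
        return hi + lo                               # a pair, e.g. "88"
    return hi + lo + ("s" if s1 == s2 else "o")      # suited vs. offsuit

print(preflop_bucket(("8", "h"), ("8", "s")))        # 88
print(preflop_bucket(("8", "d"), ("8", "c")))        # 88: same row in the table
```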

However, calculations of card value must remain distinct where it matters. For instance, is a pair of 8s more valuable than an 8 and a 9? Suppose the five community cards are revealed to be a 7, a 10, a Jack, and two irrelevant cards. Then the 8 and 9 combine with the community cards to make a straight from 7 to Jack, while the pair of 8s is still just a pair of 8s.


Image 3. The value of a hand depends on how it combines with the community cards.

Therefore, when we calculate the value of a pair of cards, we cannot simply look at their intrinsic value; we must also take into consideration their value in combination with the community cards once those cards are revealed.

We not only have to abstract card values; we need to abstract bets as well. We can do this by reducing bets to four types: "fold", "call", "raise", and "all in". After this abstraction, the number of data combinations in our table is greatly reduced, and the Counterfactual Best Response calculation becomes far more tractable.
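Bet abstraction works the same way in spirit: every raw raise amount is snapped to the nearest member of a small menu of bet sizes. A rough sketch (the menu here is illustrative; Libratus used a richer, round-dependent set of sizes):

```python
def abstract_raise(raise_to, pot, stack):
    """Map an arbitrary raise amount onto a small menu of abstract sizes."""
    menu = {
        "half_pot": 0.5 * pot,
        "pot": 1.0 * pot,
        "two_pot": 2.0 * pot,
        "all_in": stack,
    }
    # Snap to the closest menu entry.
    return min(menu, key=lambda name: abs(menu[name] - raise_to))

print(abstract_raise(raise_to=130, pot=100, stack=1000))  # "pot"
```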

Endgame, Off-tree Problems, and Re-solving

Simplifying the game situation can also lead to miscalculations. For example, we could mistakenly equate a pair of 7s with a pair of 8s. Yet if both parties keep betting and neither folds, the cards will inevitably be shown, at which point the pair of 7s loses to the pair of 8s, even though the difference between them is small. This is what we call an off-tree problem.

When we encounter an off-tree problem, we can solve it by replacing abstraction with re-solving.

The core idea of re-solving is to work backward from the players' betting histories to estimate the opponent's hand and its probability.

For instance, let h1 represent our hand and h2 our opponent's hand, and suppose both players bet using the Counterfactual Best Response strategy. We can then read off the probabilities of their various bets from the corresponding rows of our betting table.

If our pre-flop hand is a pair of 8s while our opponent's hand is a pair of Aces, then the probability of us raising in the first round of betting is rather high, while our probability of winning is quite low and the probability of our opponent winning is much higher.

Now, given a specific betting history, we can use the betting table to calculate the probability of each betting strategy at the next step. A betting history is just a series of bets, so the probability of a betting history is the product of a series of probabilities.

By calculating betting-history probabilities, we can not only estimate the cards in our opponent's hand but also measure how far our opponent's actual betting deviates from his optimal betting strategy; in other words, we can estimate the probability that our opponent is bluffing.
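As a sketch, that inference is just Bayes' rule: the likelihood of each candidate opponent hand is the product, over the opponent's past moves, of the probability the betting table assigns to that move when holding that hand. The `table_prob` lookup below is hypothetical:

```python
def opponent_hand_posterior(moves, candidate_hands, table_prob):
    """moves: the opponent's (situation, action) pairs so far.
    table_prob(hand, situation, action): probability the betting table
    assigns to `action` in `situation` when the opponent holds `hand`."""
    scores = {}
    for hand in candidate_hands:
        likelihood = 1.0
        for situation, action in moves:
            likelihood *= table_prob(hand, situation, action)
        scores[hand] = likelihood          # uniform prior over hands
    total = sum(scores.values())
    if total == 0:                         # history inconsistent with the table
        return {h: 1.0 / len(candidate_hands) for h in candidate_hands}
    return {h: s / total for h, s in scores.items()}
```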

Once we have estimated our opponent's cards and the probability that he's bluffing, endgame re-solving becomes quite simple: compare our hand to our opponent's likely hands, calculate our chance of winning, and use that data to determine how much to bet.
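With an estimated win probability in hand, even a naive bet-sizing rule can be written down. The toy expected value below assumes the opponent always calls, which is exactly the kind of simplification real re-solving must avoid, since under it the EV grows with the bet size whenever the win probability exceeds one half:

```python
def naive_endgame_bet(win_prob, pot, candidate_bets):
    """EV(bet) = win_prob * (pot + bet) - (1 - win_prob) * bet,
    under the (unrealistic) assumption that the opponent always calls."""
    ev = lambda bet: win_prob * (pot + bet) - (1.0 - win_prob) * bet
    return max(candidate_bets, key=ev)

print(naive_endgame_bet(0.7, pot=100, candidate_bets=[0, 50, 100, 200]))  # 200
```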

Conclusion

The introduction of an AI system like Libratus is one of many ways in which AI is changing the way we understand human intelligence. I hope this article helped you understand how Libratus functions and how it could beat its competitors. I am sure that, with time, we will witness similar programs for other games as well.
