Second Round (flop)
[Figure: Player, Opponent, and Community cards]
Slide 4
Third Round (turn)
[Figure: Player, Opponent, and Community cards]
Slide 5
Final Round (river)
[Figure: Player, Opponent, and Community cards]
Slide 6
End (We Win)
[Figure: Player, Opponent, and Community cards]
Slide 7
End Round (Note how initially low hands can win later when more community cards are added)
[Figure: Player, Opponent, and Community cards]
Slide 8
The Problem: State Space Is Too Big
Over 2 million states would be needed just to represent the possible combinations for a single 5-card poker hand.
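As a rough check of that figure (not part of the original slides), the number of distinct 5-card hands from a 52-card deck can be computed directly:

```python
# Number of distinct 5-card hands from a standard 52-card deck.
import math

print(math.comb(52, 5))  # 2598960 -- about 2.6 million, consistent with "over 2 million"
```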
Slide 9
Our Solution
It doesn't matter exactly what your cards are, only how they relate to your opponent's. The most important piece of information the player needs is how many possible two-card combinations could make a better hand.
Slide 10
Our State Representation
[Round] [Opponent's Last Bet] [# Possible Better Hands] [Best Obtainable Hand]
Example: [4] [3] [10] [3]
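A minimal sketch of how such a state could be encoded (the field names are ours, not taken from the original implementation):

```python
from collections import namedtuple

# One state of the game as seen by the Q-lambda player:
#   round           -- betting round (1 = pre-flop ... 4 = river)
#   opponent_bet    -- opponent's last betting action
#   n_better_hands  -- how many two-card holdings would beat us
#   best_hand       -- best hand category we can still obtain
State = namedtuple("State", ["round", "opponent_bet", "n_better_hands", "best_hand"])

example = State(round=4, opponent_bet=3, n_better_hands=10, best_hand=3)
```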
Slide 11
To Calculate the # Better Hands
[Figure: our cards and the community cards on the left; all other possible two-card holdings on the right]
Enumerate every other possible two-card holding, evaluate it against the community cards, and count how many of them make a better hand than ours.
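The counting procedure could look roughly like the sketch below; `hand_rank` stands in for whatever 5-card hand evaluator the project uses (a hypothetical helper here):

```python
from itertools import combinations

def count_better_hands(our_hole, community, deck, hand_rank):
    """Count two-card opponent holdings that beat ours, given the community cards.

    hand_rank(cards) is assumed to return a comparable strength value
    for the best 5-card hand within `cards`.
    """
    unseen = [c for c in deck if c not in our_hole and c not in community]
    our_strength = hand_rank(our_hole + community)
    better = 0
    for opp_hole in combinations(unseen, 2):   # all other possible two-card holdings
        if hand_rank(list(opp_hole) + community) > our_strength:
            better += 1
    return better
```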
Slide 12
Q-lambda Implementation (I)
The current state of the game is stored in a variable. Each time the community cards are updated, or the opponent places a bet, we update our current state. For all states, the Q-values of each betting action are stored in an array.
Some state: Fold = -0.9954, Check = 2.014, Call = 1.745, Raise = -3.457
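In code this could be as simple as a dictionary from states to per-action value arrays (a sketch under our own naming, not the original source):

```python
ACTIONS = ["Fold", "Check", "Call", "Raise"]

# Q-values indexed by state, one entry per betting action.
Q = {}  # state -> [q_fold, q_check, q_call, q_raise]

def q_values(state):
    # Unseen states start with all-zero values.
    return Q.setdefault(state, [0.0] * len(ACTIONS))

# e.g. q_values(some_state) might end up as [-0.9954, 2.014, 1.745, -3.457]
```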
Slide 13
Q-lambda Implementation (II)
Eligibility Trace: we keep a vector of the state-actions that are responsible for us being in the current state (in state s1 did action a1, in state s2 did action a2, ..., now we are in the current state). Each time we make a betting decision, we add the current state and the chosen action to the eligibility trace.
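A sketch of that bookkeeping, assuming the trace simply records state-action pairs in the order the decisions were made:

```python
eligibility_trace = []  # ordered list of (state, action) pairs for the current game

def record_decision(state, action):
    # Called each time we make a betting decision.
    eligibility_trace.append((state, action))
```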
Slide 14
Q-lambda Implementation (III)
At the end of each game, we use the money won or lost to reward or punish the state-actions in the eligibility trace (in state s1 did action a1, in state s2 did action a2, in state s3 did action a3, got reward R, update Q[sn, an]).
Slide 15
Testing Our Q-lambda Player
Play against the Random Player
Play against the Bluffer
Play against itself
Slide 16
Play Against the Random Player
Q-lambda learns very quickly how to beat the random player. Why does it learn so fast?
Slide 17
Play Against the Random Player (II) Same graph, with up to 9000
games
Slide 18
Play Against the Bluffer
The bluffer always raises; if raising is not possible, it calls. Defeating the bluffer is not trivial, because you need to fold on weaker hands and keep raising on better hands. Our Q-lambda player does very poorly against the bluffer!
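For reference, the bluffer's fixed policy amounts to something like this sketch:

```python
def bluffer_action(legal_actions):
    # Always raise; if raising is not allowed, call instead.
    return "Raise" if "Raise" in legal_actions else "Call"
```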
Slide 19
Play Against the Bluffer (II)
In our many trials with different alpha and lambda values, the Q-lambda player always lost, its cumulative losses growing at a roughly linear rate.
Slide 20
Play Against the Bluffer (III)
Why is Q-lambda losing to the bluffer? To answer this, we looked at the Q-value tables. With good hands, Q-lambda has learned to Raise or Call.
[Table: Q-values from Round = 3, OpptBet = Raise]
Slide 21
Play Against the Bluffer (IV)
The problem is that even with a very poor hand in the second round, it still does not learn to fold, and continues to raise, call, or check. The same problem exists with poor hands in other rounds.
[Table: Q-values from Round = 1, OpptBet = Not_Yet, BestHandPossible = ok]
Slide 22
Play Against Itself
We played the Q-lambda player against itself, hoping that it would eventually converge on some strategy.
Slide 23
Play Against Itself (II)
We also graphed the Q-values of a few particular states over time, to see whether they converge to meaningful values. The results were mixed: for some states the Q-values converge completely, while for other states they remain almost random.
Slide 24
Play Against Itself (III)
With a good hand in the last round, the Q-values have converged: Calling, and after that Raising, are good, while Folding is very bad.
Slide 25
Play Against Itself (IV)
With a medium hand in the last round, the Q-values do not clearly converge. Folding still looks very bad, but there is no clear preference between calling and raising.
Slide 26
Play Against Itself (V)
With a very bad set of hands in the last round, the Q-values do not converge at all. This is clearly wrong, since under an optimal policy folding would have a higher value.
Slide 27
Why Do the Q-values Not Converge?
Possible explanations: poker cannot be represented with our state representation (our states are too broad or are missing some critical aspects of the game); the alpha and lambda factors are incorrect; or we have not run the game for long enough.
Slide 28
Conclusion
Our state representation and Q-lambda implementation are able to learn some aspects of poker (for instance, the player learns to Raise or Call when it has a good hand in the last round). However, in our tests it does not converge to an optimal policy. More experimentation with the alpha and lambda parameters, and with the state representation, may result in better convergence.