Learning Blackjack with Monte Carlo Methods
What will I learn?
This article will take you through the logic behind one of the foundational pillars of reinforcement learning: Monte Carlo (MC) methods.
This classic approach to the problem of reinforcement learning will be demonstrated by finding the optimal policy for a simplified version of blackjack.
By the end of this article I hope that you will be able to describe and implement the following topics.
The full code can be found on my GitHub.
Monte Carlo (MC) is a very simple example of model-free learning that uses only past experience to learn.
It does this by calculating the average reward of taking a specific action A while in a specific state S over many games.
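That averaging idea can be sketched in a few lines. This is a minimal illustration, not the article's notebook code, and the state/action names are hypothetical:

```python
from collections import defaultdict

# Running sums and counts of observed returns per (state, action) pair.
returns_sum = defaultdict(float)
returns_count = defaultdict(int)

def record_return(state, action, G):
    """Accumulate the return G observed after taking `action` in `state`."""
    key = (state, action)
    returns_sum[key] += G
    returns_count[key] += 1

def q_value(state, action):
    """Average return for this state-action pair (0 if never seen)."""
    key = (state, action)
    if returns_count[key] == 0:
        return 0.0
    return returns_sum[key] / returns_count[key]

# e.g. three games where hitting a 14 against a dealer 6 paid -1, 1, 1:
for g in (-1, 1, 1):
    record_return((14, 6), "hit", g)
```

Over thousands of games these averages converge toward the true value of each state-action pair.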
If you are not familiar with the basics of reinforcement learning I would encourage you to quickly read up on the basics such as the agent life cycle.
My previous article goes through these concepts and can be found here.
Also, if you are unfamiliar with the game of blackjack, check out the rules first.
This is simply a table containing each possible combination of states in blackjack (the sum of your cards and the value of the card being shown by the dealer) along with the best action to take (hit, stick, double or split) according to probability and statistics.
This is an example of a policy.
A simple version of the basic strategy policy developed by Edward O. Thorp.
In our example game we will make it a bit simpler and only have the option to hit or stick.
As well as this, we will divide our state logic into two types: a hand with a usable ace and a hand without a usable ace.
In blackjack an ace can either have the value of 1 or 11.
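This dual value of the ace is easy to get wrong, so here is a small helper that shows the idea. It is my own sketch (the article's notebook may do this differently); aces are passed in as 1 and face cards as 10:

```python
def hand_value(cards):
    """Return (total, usable_ace) for a blackjack hand.

    Aces are given as 1; one ace counts as 11 instead if that
    does not bust the hand. Face cards are passed in as 10.
    """
    total = sum(cards)
    usable_ace = 1 in cards and total + 10 <= 21
    if usable_ace:
        total += 10
    return total, usable_ace
```

For example, an ace and a 6 is a "soft 17" with a usable ace, but drawing a 9 on top forces the ace back to 1, giving a hard 16.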
Let's say that we have been given a very simple strategy, even simpler than the basic strategy above.
Now let's say that we want to know the value of holding a hand of 14 while the dealer is showing a 6.
This is an example of the prediction problem.
To solve this, we are going to use First Visit Monte Carlo.
This method has our agent play through thousands of games using our current policy.
Each time the agent carries out action A in state S for the first time in that game it will calculate the reward of the game from that point onwards.
By doing this, we can determine how valuable it is to be in our current state.
The alternative, Every Visit MC, averages the return over every occurrence of a state in a game rather than just the first; both of these methods provide similar results.
The steps to implement First Visit Monte Carlo can be seen here.
Let's go through the steps to implement this algorithm.
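The steps above can be sketched as follows. This is a hedged outline rather than the article's notebook code: I assume each episode is a list of (state, reward) pairs generated by playing the fixed policy, and all names are my own:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate V(s) under a fixed policy from complete episodes.

    `episodes` is an iterable of episodes; each episode is a list of
    (state, reward) pairs in the order they occurred. Returns a dict
    mapping state -> average first-visit return.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Walk backwards through the episode, accumulating the
        # discounted return G from each step onwards.
        G = 0.0
        first_visit = {}  # state -> return at its FIRST occurrence
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = gamma * G + reward
            # Earlier visits overwrite later ones, so after the loop
            # each state maps to the return from its first visit.
            first_visit[state] = G
        for state, g in first_visit.items():
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

In blackjack the reward is typically 0 until the terminal step, where it is +1, 0 or -1, so each state's value is simply the average game outcome observed after first reaching it.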
One last thing that I want to quickly cover before we get into the code is the idea of discounted rewards and Q values.
Discounted Rewards
The idea of discounted rewards is to prioritise immediate reward over potential future rewards.
This is why when calculating action values we take the cumulative discounted reward the sum of all rewards after the action as opposed to just the immediate reward.
The discount factor is simply a constant number that we multiply our reward by at each time step.
After each time step we increase the power to which we multiply our discount factor.
This gives more priority to the immediate actions and less priority as we get further away from the action taken.
Choosing the value of our discount factor depends on the task at hand, but it must always be between 0 and 1.
The larger the discount factor the higher importance of future rewards and vice versa for a lower discount factor.
In general, a discount factor close to 1 is the most common choice.
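The cumulative discounted reward described above can be written as G = r0 + γ·r1 + γ²·r2 + …, which in code is a short loop (a sketch with an illustrative gamma, not a value from the article):

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted reward: G = r0 + gamma*r1 + gamma^2*r2 + ...

    The discount factor's power grows with each time step, so rewards
    further from the action contribute less to G.
    """
    G = 0.0
    for t, r in enumerate(rewards):
        G += (gamma ** t) * r
    return G
```

With gamma = 0.9, three rewards of 1 give G = 1 + 0.9 + 0.81 = 2.71, showing how the later rewards are weighted down.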
Q Values
Q values refer to the value of taking action A while in state S.
We store these values in a table or dictionary and update them as we learn.
Once we have completed our Q table we will always know what action to take based on the current state we are in.
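In Python such a table is naturally a dictionary keyed by state. The sketch below is illustrative (the state tuple and values are made up, not taken from the article's notebook):

```python
from collections import defaultdict

# Q table as a nested dict: Q[state][action] -> estimated value.
# A state here is (player sum, dealer card, usable ace).
Q = defaultdict(lambda: {"hit": 0.0, "stick": 0.0})

def greedy_action(state):
    """Once learned, acting is just a lookup: pick the highest-valued action."""
    return max(Q[state], key=Q[state].get)

# Hypothetical learned values for holding 14 against a dealer 6:
Q[(14, 6, False)]["stick"] = 0.12
Q[(14, 6, False)]["hit"] = -0.30
```

Using a `defaultdict` means unseen states start with zero-valued actions instead of raising a `KeyError`.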
Implementation
Below is a Jupyter notebook with the code to implement MC prediction.
Each section is commented and gives more detail about what is going on line by line.
As you can see, there is not much to implementing the prediction algorithm. Based on the plots shown at the end of the notebook, we can see that the algorithm has successfully predicted the values of our very simple blackjack policy.
Next up is control.
This is the more interesting of the two problems because now we are going to use MC to learn the optimal strategy of the game as opposed to just validating a previous policy.
Once again we are going to use the First Visit approach to MC.
This algorithm looks a bit more complicated than the previous prediction algorithm, but at its core it is still very simple.
Because this is a bit more complicated I am going to split the problem up into sections and explain each.
Our agent learns the same way.
In order to learn the best policy we want a good mix of carrying out the good moves we have learned and exploring new moves.
This is the epsilon greedy strategy that we discussed previously.
As we go through, we record the state, action and reward of each episode to pass to our update function.
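Putting those two ideas together, one episode of play might look like the sketch below. The `env_reset`/`env_step` interface is an assumption of mine for illustration, not a real library API:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Mostly exploit the best known action, sometimes explore at random."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

def play_episode(env_reset, env_step, Q, actions, epsilon):
    """Play one game, recording (state, action, reward) at each step.

    Assumed interface: `env_reset()` returns an initial state, and
    `env_step(state, action)` returns (next_state, reward, done).
    """
    episode = []
    state = env_reset()
    done = False
    while not done:
        action = epsilon_greedy(Q, state, actions, epsilon)
        next_state, reward, done = env_step(state, action)
        episode.append((state, action, reward))
        state = next_state
    return episode
```

The recorded `(state, action, reward)` triples are exactly what the update function needs to compute the return G from each step onwards.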
Here we implement the logic for how our agent learns.
The function looks like this.
The update is made up of the cumulative reward of the episode, G, minus the old Q value.
This is then all multiplied by alpha.
In this case alpha acts as our learning rate.
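In code, that update is a single line; the sketch below uses my own names and an illustrative alpha, not values from the article:

```python
def mc_update(Q, state, action, G, alpha=0.02):
    """Constant-alpha MC update: nudge Q(s,a) toward the observed return G.

    The error term (G - old Q value) is scaled by the learning rate
    alpha, so each episode moves the estimate only part of the way.
    """
    Q[state][action] += alpha * (G - Q[state][action])
```

With alpha = 0.1 and a return of 1.0 against a current estimate of 0.0, the new estimate becomes 0.1, so repeated episodes gradually pull Q(s,a) toward the average return.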
A large learning rate will mean that we make improvements quickly, but it runs the risk of making changes that are too big.
Although it will initially make progress quickly it may not be able to figure out the more subtle aspects of the task it is learning.
On the other hand if the learning rate is too small, the agent will learn the task, but it could take a ridiculously long time.
As with most things in machine learning, these are important hyperparameters that you will have to fine-tune depending on the needs of your project.
Implementation
Now that we have gone through the theory of our control algorithm, we can get stuck in with the code.
We have now successfully generated our own optimal policy for playing blackjack.
You will notice that the plots of the original hard-coded policy and our new optimal policy are different, and that our new policy closely reflects Thorp's basic strategy.
Conclusion
We now know how to use MC to find an optimal strategy for blackjack.
Unfortunately, you won't be winning much money with just this strategy any time soon.
The real complexity of the game is knowing when and how to bet.
An interesting project would be to combine the policy used here with a second policy on how to bet correctly.
I hope you enjoyed the article and found something useful.
Any feedback or comments are always appreciated.
The full code can be found on my GitHub.
References: Sutton, R. S., & Barto, A. G., Reinforcement Learning: An Introduction.