Monte Carlo Prediction: Reinforcement Learning with Python (MCP Tutorial)

In this tutorial, we’ll explore Monte Carlo Prediction (MCP) — a fundamental method in Reinforcement Learning used to estimate the value of states using experience.

We’ll apply MCP to the Blackjack-v1 environment from the gymnasium library and walk through the core logic with clear Python code.

Table of Contents

  1. What is Monte Carlo Prediction?
  2. Setup
  3. Blackjack Environment
  4. Episode Generator and Policy
  5. Monte Carlo Prediction Loop
  6. Sample Output
  7. Summary

1. What is Monte Carlo Prediction?

Monte Carlo Prediction estimates the value of a state as the average return (total reward) received after visiting that state across multiple episodes.

Key ideas:

  • Run full episodes from start to finish
  • Track returns from each state
  • Compute average return as value estimate

Mathematically, the state value is:

\[ V(s) = \mathbb{E}[G_t \mid S_t = s] \]

Where:

  • \( V(s) \) is the estimated value of state \( s \)
  • \( G_t \) is the total return from time \( t \) onward
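
The expectation is approximated by a sample average over observed returns. The code later in this tutorial keeps a running sum and count per state, which is equivalent to the standard incremental-mean update:

\[ V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} \left( G_t - V(S_t) \right) \]

where \( N(S_t) \) is the number of returns recorded for state \( S_t \) so far.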

2. Setup

Install the gymnasium package and ensure Python 3.8+ is available.

pip install gymnasium

Then import the libraries:

import gymnasium as gym
import numpy as np
from collections import defaultdict

3. Blackjack Environment

Initialize the Blackjack environment with sab=True, which makes the game follow the rules described in Sutton & Barto's textbook (the convention this example assumes).

env = gym.make('Blackjack-v1', sab=True)
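
Before writing the policy, it helps to see what an observation looks like. In Blackjack-v1 a state is a tuple of the player's current sum, the dealer's showing card, and a flag for a usable ace. The quick check below is purely illustrative; the exact numbers will differ each run:

state, info = env.reset()
print(state)  # e.g. (14, 10, False): player sum 14, dealer shows a 10, no usable ace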

4. Episode Generator and Policy

Define a simple policy:

  • Stick (0) if player sum ≥ 20
  • Hit (1) otherwise

def simple_policy(state):
    # state = (player_sum, dealer_showing_card, usable_ace)
    # Stick (0) once the player sum reaches 20, otherwise hit (1)
    return 0 if state[0] >= 20 else 1

def generate_episode(env, policy):
    """Play one full episode and return a list of (state, action, reward) tuples."""
    episode = []
    state = env.reset()[0]  # reset() returns (observation, info); keep the observation
    while True:
        action = policy(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if terminated or truncated:
            break
    return episode
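
As a quick check, you can generate a single episode and inspect the (state, action, reward) tuples it records. The output is random and will vary between runs:

sample = generate_episode(env, simple_policy)
for state, action, reward in sample:
    print(state, "hit" if action == 1 else "stick", reward)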

5. Monte Carlo Prediction Loop

Now we collect episodes and calculate state values using first-visit Monte Carlo prediction.

returns_sum = defaultdict(float)    # sum of first-visit returns for each state
returns_count = defaultdict(int)    # number of first visits to each state
V = defaultdict(float)              # running value estimate for each state

num_episodes = 500000

for i in range(num_episodes):
    episode = generate_episode(env, simple_policy)
    states_in_episode = [s for s, _, _ in episode]
    G = 0
    # Walk backwards through the episode, accumulating the return
    # (Blackjack is undiscounted, so no discount factor is applied)
    for t in reversed(range(len(episode))):
        state, _, reward = episode[t]
        G = reward + G
        # First-visit check: only record G if the state does not appear
        # earlier in the episode
        if state not in states_in_episode[:t]:
            returns_sum[state] += G
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
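
Before looking at individual values, a simple sanity check (not part of the algorithm itself) is to count how many states received an estimate and inspect the highest-valued ones; under the stick-at-20 policy these should be states with a player sum of 20 or 21:

print(f"Estimated values for {len(V)} states")
top_states = sorted(V.items(), key=lambda kv: kv[1], reverse=True)[:5]
for state, value in top_states:
    print(state, round(value, 2))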

6. Sample Output

Print a few estimated state values:

for state in list(V.keys())[:10]:
    print(f"State: {state}, Estimated Value: {V[state]:.2f}")

Example output:

State: (13, 6, False), Estimated Value: -0.48
State: (20, 10, True), Estimated Value: 0.53
State: (18, 5, False), Estimated Value: 0.18
...

7. Summary

  • Monte Carlo Prediction is simple yet powerful for estimating value functions.
  • It works by averaging returns from real episodes, making it suitable when a model of the environment is not available.
  • The approach requires many episodes to converge, but can be effective for small to mid-sized state spaces.

Conclusion

We implemented Monte Carlo Prediction using the Blackjack-v1 environment. This method provides a foundational tool in reinforcement learning and can be adapted to many tasks.

In future posts, we’ll compare this to Temporal Difference Learning (TD) and explore policy evaluation and improvement techniques.

