Q-Learning and CartPole: Your First Reinforcement Learning Agent

If you’ve dipped your toes into reinforcement learning, chances are you’ve encountered Q-Learning — a classic, foundational algorithm that’s simple to understand yet powerful enough to teach you how AI agents can learn from rewards.

In this post, you’ll learn:

  • What Q-Learning is and how it works
  • Why it’s great for beginners
  • How to apply it to a real environment: CartPole from OpenAI Gym
  • A complete, working Python example

Let’s get started!

Table of Contents

  1. What Is Q-Learning?
  2. Why Use Q-Learning?
  3. About the CartPole Environment
  4. Building a Q-Learning Agent for CartPole
  5. Output and Evaluation
  6. What Comes After Q-Learning?
  7. Summary

1. What Is Q-Learning?

Q-Learning is a model-free reinforcement learning algorithm. That means the agent doesn’t need to know how the environment works — it learns purely from experience.

The core idea is to build a Q-table, which stores the expected reward (or “quality”) for taking an action in a given state.

The Q-Learning Update Rule:

Q(s, a) ← Q(s, a) + α * (r + γ * max(Q(s', a')) - Q(s, a))

Where:

  • s: current state
  • a: action taken
  • r: immediate reward received
  • s': new state after the action
  • α: learning rate
  • γ: discount factor (importance of future rewards)

Over time, the Q-values converge toward optimal choices — letting the agent figure out which actions are best in each state.
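
To make the update rule concrete, here’s a minimal sketch of a single update on a tiny, made-up 3-state, 2-action table (the transition, reward, and hyperparameters are hypothetical, purely for illustration):

import numpy as np

# A tiny Q-table: 3 states x 2 actions, initialized to zero
q_table = np.zeros((3, 2))
alpha, gamma = 0.1, 0.99            # learning rate and discount factor

s, a, r, s_next = 0, 1, 1.0, 2      # one hypothetical transition
td_target = r + gamma * np.max(q_table[s_next])
q_table[s, a] += alpha * (td_target - q_table[s, a])
print(q_table[s, a])                # 0.1 after this first update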


2. Why Use Q-Learning?

  • Simple & intuitive: easy for beginners to understand
  • No model of the environment: doesn’t need transition functions
  • Table-based: transparent and easy to debug

Limitations?

Q-Learning doesn’t scale well to continuous or high-dimensional state spaces. For those cases, we use Deep Q-Networks (DQN) — but Q-Learning remains the best place to start.


3. About the CartPole Environment

In CartPole, a pole is attached to a cart on a track. The agent’s goal is to move the cart left or right to keep the pole balanced.

Why it’s great for RL practice:

  • Fast feedback (short episodes)
  • Easy to visualize
  • Small state/action space

State space (continuous):

  • Cart position: the cart’s location on the track
  • Cart velocity: speed of the cart
  • Pole angle: tilt of the pole from vertical
  • Pole angular velocity: the rate at which the pole angle changes

We’ll discretize these values to build a tabular Q-learning solution.
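
If you want to see these values for yourself, you can inspect the environment’s spaces directly (a quick sketch; exact bounds may vary slightly between gym versions):

import gym

env = gym.make("CartPole-v1")
print(env.observation_space)        # Box with 4 continuous values
print(env.observation_space.low)    # the velocity bounds are effectively unbounded
print(env.observation_space.high)
print(env.action_space)             # Discrete(2): push the cart left or right
env.close()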


4. Building a Q-Learning Agent for CartPole

Step 1: Discretize the State

CartPole’s state is continuous. To use Q-tables, we need to map continuous values to discrete bins.

buckets = (1, 1, 6, 12)  # number of discrete bins for each variable

Step 2: Setup Environment and Q-Table

import gym
import numpy as np
import math

env = gym.make("CartPole-v1")

# Discretization setup: a handful of bins per state variable
buckets = (1, 1, 6, 12)
q_table = np.zeros(buckets + (env.action_space.n,))

# CartPole reports effectively unbounded velocities, so clamp those two
# dimensions to reasonable ranges before binning.
min_bounds = env.observation_space.low.copy()
max_bounds = env.observation_space.high.copy()
min_bounds[1], max_bounds[1] = -0.5, 0.5
min_bounds[3], max_bounds[3] = -math.radians(50), math.radians(50)
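
With these buckets, the table stays small, which is what makes the tabular approach workable here:

print(q_table.shape)   # (1, 1, 6, 12, 2)
print(q_table.size)    # 1 * 1 * 6 * 12 * 2 = 144 entries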

Step 3: Discretization Function

def discretize(obs):
    # Map each continuous observation value to a bin index in [0, buckets[i] - 1]
    ratios = [(obs[i] - min_bounds[i]) / (max_bounds[i] - min_bounds[i]) for i in range(len(obs))]
    new_obs = [int(round((buckets[i] - 1) * min(max(ratios[i], 0), 1))) for i in range(len(obs))]
    return tuple(new_obs)
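
As a quick sanity check, you can discretize a raw observation before training (this assumes the gym >= 0.26 API, where reset() returns an (observation, info) pair):

obs_raw, _ = env.reset()
print(obs_raw)              # four continuous values
print(discretize(obs_raw))  # a tuple of bin indices, e.g. (0, 0, 3, 6)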

Step 4: Training Loop

alpha = 0.1            # learning rate
gamma = 0.99           # discount factor
epsilon = 1.0          # initial exploration rate
epsilon_min = 0.01     # floor for exploration
epsilon_decay = 0.995  # multiplicative decay per episode
episodes = 1000

for ep in range(episodes):
    obs_raw, _ = env.reset()   # gym >= 0.26 reset() returns (observation, info)
    obs = discretize(obs_raw)
    done = False
    total_reward = 0

    while not done:
        # Epsilon-greedy action selection: explore with probability epsilon
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[obs])

        # gym >= 0.26 step() returns (obs, reward, terminated, truncated, info)
        next_obs_raw, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_obs = discretize(next_obs_raw)

        # Q-Learning update rule
        q_old = q_table[obs + (action,)]
        q_max = np.max(q_table[next_obs])
        q_table[obs + (action,)] = q_old + alpha * (reward + gamma * q_max - q_old)

        obs = next_obs
        total_reward += reward

    if epsilon > epsilon_min:
        epsilon *= epsilon_decay

    if (ep + 1) % 100 == 0:
        print(f"Episode {ep + 1}, Total Reward: {total_reward}")

env.close()

5. Output and Evaluation

After a few hundred episodes, the agent starts to get better at balancing the pole. You’ll notice:

  • Rewards increase steadily
  • Agent chooses better actions
  • Episodes last longer

You can tune buckets, alpha, gamma, or epsilon to get even better performance.
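
One simple way to check what the agent has learned is a purely greedy rollout (epsilon = 0) after training, reusing q_table and discretize from above. A minimal sketch, again assuming the gym >= 0.26 step API:

eval_env = gym.make("CartPole-v1")
for _ in range(5):
    obs_raw, _ = eval_env.reset()
    obs = discretize(obs_raw)
    done, total = False, 0
    while not done:
        action = np.argmax(q_table[obs])   # always take the best-known action
        obs_raw, reward, terminated, truncated, _ = eval_env.step(action)
        obs = discretize(obs_raw)
        done = terminated or truncated
        total += reward
    print("Evaluation reward:", total)
eval_env.close()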


6. What Comes After Q-Learning?

Q-Learning works well for small or discretized problems. But what about more complex environments?

Enter DQN (Deep Q-Networks) — which replace the Q-table with a neural network.

We’ll explore that in the next post.


7. Summary

  • Q-Learning: learns the value of state-action pairs using a table
  • CartPole: classic OpenAI Gym environment for learning RL
  • Discretization: converts continuous states into discrete bins
  • Epsilon-greedy: balances exploration and exploitation

Q-Learning gives you the essential building blocks for understanding how RL agents learn from rewards. It’s an ideal first step into the world of AI agents.