Q-Learning and CartPole: Your First Reinforcement Learning Agent

If you’ve dipped your toes into reinforcement learning, chances are you’ve encountered Q-Learning — a classic, foundational algorithm that’s simple to understand yet powerful enough to teach you how AI agents can learn from rewards.

In this post, you’ll learn:

  • What Q-Learning is and how it works
  • Why it’s great for beginners
  • How to apply it to a real environment: CartPole from OpenAI Gym
  • A complete, working Python example

Let’s get started!

Table of Contents

  1. What Is Q-Learning?
  2. Why Use Q-Learning?
  3. About the CartPole Environment
  4. Building a Q-Learning Agent for CartPole
  5. Output and Evaluation
  6. What Comes After Q-Learning?
  7. Summary

1. What Is Q-Learning?

Q-Learning is a model-free reinforcement learning algorithm. That means the agent doesn’t need to know how the environment works — it learns purely from experience.

The core idea is to build a Q-table, which stores the expected reward (or “quality”) for taking an action in a given state.

The Q-Learning Update Rule:

Q(s, a) ← Q(s, a) + α * (r + γ * max(Q(s', a')) - Q(s, a))

Where:

  • s: current state
  • a: action taken
  • r: immediate reward received
  • s': new state after the action
  • α: learning rate
  • γ: discount factor (importance of future rewards)

Over time, the Q-values converge toward optimal choices — letting the agent figure out which actions are best in each state.
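
To make the update rule concrete, here’s a minimal sketch of a single update on a tiny, made-up 3-state, 2-action table (the transition, reward, and hyperparameters are hypothetical, purely for illustration):

import numpy as np

# A tiny Q-table: 3 states x 2 actions, initialized to zero
q_table = np.zeros((3, 2))
alpha, gamma = 0.1, 0.99            # learning rate and discount factor

s, a, r, s_next = 0, 1, 1.0, 2      # one hypothetical transition
td_target = r + gamma * np.max(q_table[s_next])
q_table[s, a] += alpha * (td_target - q_table[s, a])
print(q_table[s, a])                # 0.1 after this first update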


2. Why Use Q-Learning?

  • Simple & intuitive: easy for beginners to understand
  • No model of the environment: doesn’t need transition functions
  • Table-based: transparent and easy to debug

Limitations?

Q-Learning doesn’t scale well to continuous or high-dimensional state spaces. For those cases, we use Deep Q-Networks (DQN) — but Q-Learning remains the best place to start.


3. About the CartPole Environment

In CartPole, a pole is attached to a cart on a track. The agent’s goal is to move the cart left or right to keep the pole balanced.

Why it’s great for RL practice:

  • Fast feedback (short episodes)
  • Easy to visualize
  • Small state/action space

State space (continuous):

  • Cart position: the cart’s location on the track
  • Cart velocity: speed of the cart
  • Pole angle: tilt of the pole from vertical
  • Pole angular velocity: the rate at which the pole angle changes

We’ll discretize these values to build a tabular Q-learning solution.
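
If you want to see these values for yourself, you can inspect the environment’s spaces directly (a quick sketch; exact bounds may vary slightly between gym versions):

import gym

env = gym.make("CartPole-v1")
print(env.observation_space)        # Box with 4 continuous values
print(env.observation_space.low)    # the velocity bounds are effectively unbounded
print(env.observation_space.high)
print(env.action_space)             # Discrete(2): push the cart left or right
env.close()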


4. Building a Q-Learning Agent for CartPole

Step 1: Discretize the State

CartPole’s state is continuous. To use Q-tables, we need to map continuous values to discrete bins.

buckets = (1, 1, 6, 12)  # number of discrete bins for each variable

Step 2: Setup Environment and Q-Table

import gym
import numpy as np
import math

env = gym.make("CartPole-v1")

# Discretization setup: a handful of bins per state variable
buckets = (1, 1, 6, 12)
q_table = np.zeros(buckets + (env.action_space.n,))

# CartPole reports effectively unbounded velocities, so clamp those two
# dimensions to reasonable ranges before binning.
min_bounds = env.observation_space.low.copy()
max_bounds = env.observation_space.high.copy()
min_bounds[1], max_bounds[1] = -0.5, 0.5
min_bounds[3], max_bounds[3] = -math.radians(50), math.radians(50)
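
With these buckets, the table stays small, which is what makes the tabular approach workable here:

print(q_table.shape)   # (1, 1, 6, 12, 2)
print(q_table.size)    # 1 * 1 * 6 * 12 * 2 = 144 entries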

Step 3: Discretization Function

def discretize(obs):
    # Map each continuous observation value to a bin index in [0, buckets[i] - 1]
    ratios = [(obs[i] - min_bounds[i]) / (max_bounds[i] - min_bounds[i]) for i in range(len(obs))]
    new_obs = [int(round((buckets[i] - 1) * min(max(ratios[i], 0), 1))) for i in range(len(obs))]
    return tuple(new_obs)
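
As a quick sanity check, you can discretize a raw observation before training (this assumes the gym >= 0.26 API, where reset() returns an (observation, info) pair):

obs_raw, _ = env.reset()
print(obs_raw)              # four continuous values
print(discretize(obs_raw))  # a tuple of bin indices, e.g. (0, 0, 3, 6)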

Step 4: Training Loop

alpha = 0.1            # learning rate
gamma = 0.99           # discount factor
epsilon = 1.0          # initial exploration rate
epsilon_min = 0.01     # floor for exploration
epsilon_decay = 0.995  # multiplicative decay per episode
episodes = 1000

for ep in range(episodes):
    obs_raw, _ = env.reset()   # gym >= 0.26 reset() returns (observation, info)
    obs = discretize(obs_raw)
    done = False
    total_reward = 0

    while not done:
        # Epsilon-greedy action selection: explore with probability epsilon
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[obs])

        # gym >= 0.26 step() returns (obs, reward, terminated, truncated, info)
        next_obs_raw, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_obs = discretize(next_obs_raw)

        # Q-Learning update rule
        q_old = q_table[obs + (action,)]
        q_max = np.max(q_table[next_obs])
        q_table[obs + (action,)] = q_old + alpha * (reward + gamma * q_max - q_old)

        obs = next_obs
        total_reward += reward

    if epsilon > epsilon_min:
        epsilon *= epsilon_decay

    if (ep + 1) % 100 == 0:
        print(f"Episode {ep + 1}, Total Reward: {total_reward}")

env.close()

5. Output and Evaluation

After a few hundred episodes, the agent starts to get better at balancing the pole. You’ll notice:

  • Rewards increase steadily
  • Agent chooses better actions
  • Episodes last longer

You can tune buckets, alpha, gamma, or epsilon to get even better performance.
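
One simple way to check what the agent has learned is a purely greedy rollout (epsilon = 0) after training, reusing q_table and discretize from above. A minimal sketch, again assuming the gym >= 0.26 step API:

eval_env = gym.make("CartPole-v1")
for _ in range(5):
    obs_raw, _ = eval_env.reset()
    obs = discretize(obs_raw)
    done, total = False, 0
    while not done:
        action = np.argmax(q_table[obs])   # always take the best-known action
        obs_raw, reward, terminated, truncated, _ = eval_env.step(action)
        obs = discretize(obs_raw)
        done = terminated or truncated
        total += reward
    print("Evaluation reward:", total)
eval_env.close()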


6. What Comes After Q-Learning?

Q-Learning works well for small or discretized problems. But what about more complex environments?

Enter DQN (Deep Q-Networks) — which replace the Q-table with a neural network.

We’ll explore that in the next post.


7. Summary

  • Q-Learning: learns the value of state-action pairs using a table
  • CartPole: classic OpenAI Gym environment for learning RL
  • Discretization: converts continuous states into discrete bins
  • Epsilon-greedy: balances exploration and exploitation

Q-Learning gives you the essential building blocks for understanding how RL agents learn from rewards. It’s an ideal first step into the world of AI agents.