Mastering CartPole with DQN: Deep Reinforcement Learning for Beginners
If you’ve played with reinforcement learning (RL) before, you’ve probably seen the classic CartPole balancing problem. And if you’ve tried solving it with traditional Q-learning, you might have run into some limitations.
That’s where DQN — Deep Q-Network — comes in.
In this guide, we’ll explain what DQN is, why it was a breakthrough in RL, and how to implement it step-by-step to solve the CartPole-v1 environment using OpenAI Gym and PyTorch. Whether you’re new to RL or ready to level up from Q-tables, this tutorial is for you.
Table of Contents
1. What is DQN?
2. Recap: The CartPole Problem
3. Why Q-Learning Isn't Enough
4. DQN Components Explained
5. Training DQN on CartPole
6. Observations and Performance
7. DQN Limitations and Next Steps
8. Conclusion
1. What is DQN?
Q-Learning works well for problems with small, discrete state spaces. But in the real world — or even a simple simulation like CartPole — the state is continuous, and creating a Q-table for every possible state is infeasible.
DQN solves this by using a neural network to approximate the Q-function. Instead of a table, the network learns to predict the expected future reward (the Q-value) for each action, given a state.
DQN = Q-Learning + Deep Learning
| Component | Purpose |
|---|---|
| Neural Network | Predicts Q-values for each action |
| Replay Buffer | Stores past experiences |
| Target Network | Improves training stability |
| ε-greedy Policy | Balances exploration vs. exploitation |
This combination enables DQN to scale to more complex environments — including Atari games, robotics, and beyond.
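Concretely, all of this machinery exists to minimize one loss. The network with weights θ is trained on the squared temporal-difference error:

L(θ) = E[ ( r + γ · max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ) )² ]

Here r is the reward, γ is the discount factor, and θ⁻ are the weights of the target network, which is only refreshed periodically. We'll build each piece of this in code below.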
2. Recap: The CartPole Problem
In CartPole, your agent controls a cart with a pole attached to it. The goal? Keep the pole from falling over by moving the cart left or right.
Environment Details:
- State Space: 4 floating point values (cart position, cart velocity, pole angle, pole angular velocity)
- Action Space: 0 (left), 1 (right)
- Reward: +1 for every time step the pole remains upright
- Done: When the pole tips past a threshold angle, the cart moves too far from center, or the episode reaches the 500-step limit of CartPole-v1
It’s a great starting point for reinforcement learning.
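Before building the agent, it helps to see the raw environment API. The sketch below assumes Gym ≥ 0.26 (or Gymnasium), where reset() returns a (state, info) pair and step() returns separate terminated and truncated flags; older Gym versions return a 4-tuple instead:

```python
import gym

env = gym.make("CartPole-v1")
state, _ = env.reset(seed=0)        # 4-dimensional observation
print(env.observation_space.shape)  # (4,)
print(env.action_space.n)           # 2

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random policy for now
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print("Random policy reward:", total_reward)  # usually only a couple dozen steps
env.close()
```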
3. Why Q-Learning Isn’t Enough
Traditional Q-Learning relies on a Q-table that maps state-action pairs to expected rewards. That works for games like FrozenLake or GridWorld, but fails when:
- States are continuous
- The environment has high dimensionality
- We want to generalize across unseen states
DQN overcomes these by using a function approximator — a neural net — to estimate Q-values, enabling RL to move beyond toy problems.
4. DQN Components Explained
Let’s break down what you’ll need to build a working DQN agent.
1. Neural Network
The core of DQN is a network that takes in a state and outputs Q-values for all possible actions.
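A small fully connected network is plenty for CartPole's 4-dimensional state. The sketch below is one reasonable choice; the hidden size of 128 is an assumption, not something the environment dictates:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim: int = 4, n_actions: int = 2, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, n_actions)
```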
2. Replay Buffer
Stores past experiences and samples them randomly to break temporal correlation.
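A minimal sketch, assuming a deque-backed buffer with uniform random sampling (the 50,000-transition capacity and 64-sample batch are illustrative defaults):

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-size buffer that stores transitions and samples them uniformly."""
    def __init__(self, capacity: int = 50_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 64):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```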
3. ε-greedy Policy
The agent explores randomly at first, then gradually exploits what it has learned.
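One simple way to implement this is ε-greedy action selection with an exponentially decaying ε; the decay schedule at the bottom is an assumed choice you would tune:

```python
import random

import torch

def select_action(q_net, state, epsilon: float, n_actions: int = 2) -> int:
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        return int(q_net(state_t).argmax(dim=1).item())

# Example schedule: decay epsilon from 1.0 toward 0.05 after each episode.
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
epsilon = max(eps_min, epsilon * eps_decay)
```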
4. Target Network
A periodically refreshed copy of the Q-network that is used to compute the learning targets, which keeps training stable.
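In PyTorch, creating and refreshing the target network is just a matter of copying the state dict; the 500-step sync interval in the comment is an assumption:

```python
# Reuse the QNetwork class defined above.
policy_net = QNetwork()                               # the network we train
target_net = QNetwork()                               # its periodically refreshed copy
target_net.load_state_dict(policy_net.state_dict())   # start from identical weights
target_net.eval()                                     # this copy gets no gradient updates

# Later, e.g. every 500 environment steps (an assumed interval), sync again:
target_net.load_state_dict(policy_net.state_dict())
```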
5. Training DQN on CartPole
Step-by-Step Loop:
1. Reset the environment and observe the initial state.
2. Pick an action with the ε-greedy policy.
3. Step the environment and store the transition (state, action, reward, next state, done) in the replay buffer.
4. Once the buffer holds enough transitions, sample a random minibatch.
5. Compute the target r + γ · max_a′ Q_target(s′, a′) (zero bootstrap at terminal states) and minimize the squared error against Q(s, a).
6. Every few hundred steps, copy the online weights into the target network.
7. Decay ε and repeat until the agent performs well.
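Here is a condensed sketch of the full loop, reusing the QNetwork, ReplayBuffer, and select_action pieces defined above. All hyperparameters (learning rate, γ, batch size, sync interval, ε schedule, episode count) are reasonable defaults rather than tuned values, and the environment API again assumes Gym ≥ 0.26:

```python
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
policy_net, target_net = QNetwork(), QNetwork()
target_net.load_state_dict(policy_net.state_dict())
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
buffer = ReplayBuffer()
gamma, batch_size, sync_every = 0.99, 64, 500
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
step_count = 0

for episode in range(500):
    state, _ = env.reset()
    done, episode_reward = False, 0.0
    while not done:
        # 1-3: act, observe, and store the transition.
        action = select_action(policy_net, state, epsilon)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        buffer.push(state, action, reward, next_state, done)
        state = next_state
        episode_reward += reward
        step_count += 1

        # 4-5: sample a minibatch and take a gradient step on the TD error.
        if len(buffer) >= batch_size:
            s, a, r, s2, d = buffer.sample(batch_size)
            s = torch.as_tensor(s, dtype=torch.float32)
            a = torch.as_tensor(a, dtype=torch.int64).unsqueeze(1)
            r = torch.as_tensor(r, dtype=torch.float32)
            s2 = torch.as_tensor(s2, dtype=torch.float32)
            d = torch.as_tensor(d, dtype=torch.float32)

            q_sa = policy_net(s).gather(1, a).squeeze(1)       # Q(s, a)
            with torch.no_grad():
                max_next_q = target_net(s2).max(dim=1).values  # max_a' Q_target(s', a')
                target = r + gamma * max_next_q * (1.0 - d)    # zero bootstrap at terminal states
            loss = nn.functional.mse_loss(q_sa, target)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # 6: periodically refresh the target network.
        if step_count % sync_every == 0:
            target_net.load_state_dict(policy_net.state_dict())

    # 7: decay exploration.
    epsilon = max(eps_min, epsilon * eps_decay)
    print(f"Episode {episode}: reward {episode_reward:.0f}, epsilon {epsilon:.2f}")
```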
6. Observations and Performance
As training progresses:
- The total reward per episode increases
- The agent starts keeping the pole upright for longer durations
- Eventually, it consistently reaches the maximum score of 500 (the episode cap for CartPole-v1)
This is a solid indicator that the DQN is learning to solve the task effectively.
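A common way to make "solving the task" concrete is to track the average reward over the last 100 episodes; Gym's conventional threshold for CartPole-v1 is an average of 475. A small helper, assuming you append each episode's total reward as the loop runs:

```python
from collections import deque

def is_solved(recent_rewards: deque) -> bool:
    """CartPole-v1 is conventionally considered solved when the average
    reward over the last 100 episodes reaches 475."""
    return (len(recent_rewards) == recent_rewards.maxlen
            and sum(recent_rewards) / len(recent_rewards) >= 475.0)

# Inside the training loop: keep the last 100 episode rewards around.
recent_rewards = deque(maxlen=100)
recent_rewards.append(500.0)  # append episode_reward after every episode
if is_solved(recent_rewards):
    print("Solved!")
```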
7. DQN Limitations and Next Steps
While DQN is powerful, it’s not perfect:
- It can be unstable or divergent without tricks
- It struggles with continuous action spaces
- It treats all experiences equally in the replay buffer
Enhancements (aka “Better DQN”):
- Double DQN: Reduces overestimation bias
- Dueling DQN: Separates state value and action advantage
- Prioritized Experience Replay: Focus on important transitions
- Rainbow DQN: Combines all of the above, along with several other improvements
We’ll explore these in future posts.
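In the meantime, here is a quick taste of the Double DQN idea: the online network chooses the next action while the target network scores it, which curbs the overestimation that a plain max introduces. This sketch reuses the minibatch tensors (s2, r, d), gamma, and the two networks from the training loop above:

```python
import torch

# Reusing r, d, gamma, s2, policy_net, and target_net from the training loop above.
with torch.no_grad():
    # Plain DQN target: the target network both picks and scores the next action.
    dqn_target = r + gamma * target_net(s2).max(dim=1).values * (1.0 - d)

    # Double DQN target: the online network picks the action,
    # the target network evaluates it.
    next_actions = policy_net(s2).argmax(dim=1, keepdim=True)
    double_q = target_net(s2).gather(1, next_actions).squeeze(1)
    double_dqn_target = r + gamma * double_q * (1.0 - d)
```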
8. Conclusion
You’ve just implemented your first Deep Q-Network from scratch and trained it to solve CartPole. This is a big step toward mastering reinforcement learning with deep learning.
By understanding both the code and the concepts, you’re ready to explore more complex environments and powerful variants of DQN.