Q-Learning and CartPole: Your First Reinforcement Learning Agent
If you’ve dipped your toes into reinforcement learning, chances are you’ve encountered Q-Learning — a classic, foundational algorithm that’s simple to understand yet powerful enough to teach you how AI agents can learn from rewards.
In this post, you’ll learn:
- What Q-Learning is and how it works
- Why it’s great for beginners
- How to apply it to a real environment: CartPole from OpenAI Gym
- A complete, working Python example
Let’s get started!
Table of Contents
1. What Is Q-Learning?
2. Why Use Q-Learning?
3. About the CartPole Environment
4. Building a Q-Learning Agent for CartPole
5. Output and Evaluation
6. What Comes After Q-Learning?
7. Summary
1. What Is Q-Learning?
Q-Learning is a model-free reinforcement learning algorithm. That means the agent doesn’t need to know how the environment works — it learns purely from experience.
The core idea is to build a Q-table, which stores the expected reward (or “quality”) for taking an action in a given state.
The Q-Learning Update Rule:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a′ Q(s′, a′) - Q(s, a) ]

Where:
- s: current state
- a: action taken
- r: immediate reward received
- s′: new state after the action
- α: learning rate
- γ: discount factor (importance of future rewards)
Over time, the Q-values converge toward optimal choices — letting the agent figure out which actions are best in each state.
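To make the update rule concrete, here is a tiny, hypothetical single update in Python (the numbers are invented purely for illustration):

```python
alpha, gamma = 0.1, 0.99

q_sa = 0.5         # current Q(s, a)
reward = 1.0       # immediate reward r
best_next_q = 2.0  # max over a' of Q(s', a')

# One application of the Q-learning update rule
q_sa += alpha * (reward + gamma * best_next_q - q_sa)
print(q_sa)  # 0.5 + 0.1 * (1.0 + 0.99 * 2.0 - 0.5) ≈ 0.748
```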
2. Why Use Q-Learning?
| Feature | Benefit |
|---|---|
| Simple & intuitive | Easy to understand for beginners |
| No model of environment | Doesn’t need transition functions |
| Table-based | Transparent and easy to debug |
Limitations?
Q-Learning doesn’t scale well to continuous or high-dimensional state spaces. For those cases, we use Deep Q-Networks (DQN) — but Q-Learning remains the best place to start.
3. About the CartPole Environment
In CartPole, a pole is attached to a cart on a track. The agent’s goal is to move the cart left or right to keep the pole balanced.
Why it’s great for RL practice:
- Fast feedback (short episodes)
- Easy to visualize
- Small state/action space
State space (continuous):
| Variable | Description |
|---|---|
| Cart position | Cart’s location on the track |
| Cart velocity | Speed of the cart |
| Pole angle | Tilt of the pole |
| Pole angular velocity | Angular velocity of the pole |
We’ll discretize these values to build a tabular Q-learning solution.
4. Building a Q-Learning Agent for CartPole
Step 1: Discretize the State
CartPole’s state is continuous. To use Q-tables, we need to map continuous values to discrete bins.
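A minimal sketch of one way to choose the bins; the bucket counts and clipping ranges below are illustrative starting points, not canonical values:

```python
import numpy as np

# Number of bins per state variable:
# (cart position, cart velocity, pole angle, pole angular velocity).
# Coarse bins for the cart and finer bins for the pole tend to work well.
buckets = (1, 1, 6, 12)

# The velocities are effectively unbounded, so clip them to a reasonable
# range; position and angle use the environment's documented limits.
state_bounds = [
    (-4.8, 4.8),      # cart position
    (-3.0, 3.0),      # cart velocity (clipped)
    (-0.418, 0.418),  # pole angle in radians
    (-3.5, 3.5),      # pole angular velocity (clipped)
]
```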
Step 2: Setup Environment and Q-Table
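Here is a sketch of the setup, assuming a recent Gym (0.26+) or Gymnasium install; the hyperparameter values are common defaults rather than tuned ones:

```python
import gym  # or: import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v1")
n_actions = env.action_space.n  # 2: push cart left or right

# Hyperparameters (illustrative starting values)
alpha = 0.1            # learning rate
gamma = 0.99           # discount factor
epsilon = 1.0          # initial exploration rate
epsilon_min = 0.05     # floor for exploration
epsilon_decay = 0.995  # multiplicative decay per episode
episodes = 1000

# One Q-value per (discretized state, action) pair
q_table = np.zeros(buckets + (n_actions,))
```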
Step 3: Discretization Function
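One possible discretization function, reusing the `buckets` and `state_bounds` defined above:

```python
def discretize(obs):
    """Map a continuous observation to a tuple of bin indices."""
    indices = []
    for i, value in enumerate(obs):
        low, high = state_bounds[i]
        # Clip to the chosen bounds, then scale into [0, 1]
        ratio = (np.clip(value, low, high) - low) / (high - low)
        # Convert the ratio to a bin index in [0, buckets[i] - 1]
        indices.append(int(round(ratio * (buckets[i] - 1))))
    return tuple(indices)
```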
Step 4: Training Loop
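Here is one way the training loop might look, combining the pieces above with epsilon-greedy exploration. The reset/step signatures assume the Gym 0.26+ API (five return values from `step`); adjust if you are on an older version:

```python
for episode in range(episodes):
    obs, _ = env.reset()
    state = discretize(obs)
    done = False
    total_reward = 0.0

    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q_table[state]))

        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = discretize(obs)

        # Q-learning update
        best_next = np.max(q_table[next_state])
        td_target = reward + gamma * best_next
        q_table[state + (action,)] += alpha * (td_target - q_table[state + (action,)])

        state = next_state
        total_reward += reward

    # Decay exploration as the agent learns
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    if (episode + 1) % 100 == 0:
        print(f"Episode {episode + 1}: reward = {total_reward}")
```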
5. Output and Evaluation
After a few hundred episodes, the agent starts to get better at balancing the pole. You’ll notice:
- Rewards increase steadily
- Agent chooses better actions
- Episodes last longer
You can tune `buckets`, `alpha`, `gamma`, or `epsilon` to get even better performance.
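As a quick sanity check, you can run the greedy policy (no exploration) for a handful of episodes and look at the average reward. A minimal sketch, reusing `env`, `q_table`, and `discretize` from above:

```python
def evaluate(env, q_table, n_episodes=10):
    """Run the greedy policy (no exploration) and return the average reward."""
    totals = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        state = discretize(obs)
        done = False
        total = 0.0
        while not done:
            # Always pick the best-known action
            action = int(np.argmax(q_table[state]))
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            state = discretize(obs)
            total += reward
        totals.append(total)
    return sum(totals) / len(totals)

print("Average greedy reward:", evaluate(env, q_table))
```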
6. What Comes After Q-Learning?
Q-Learning works well for small or discretized problems. But what about more complex environments?
Enter DQN (Deep Q-Networks) — which replace the Q-table with a neural network.
We’ll explore that in the next post.
7. Summary
| Concept | Recap |
|---|---|
| Q-Learning | Learns the value of state-action pairs using a table |
| CartPole | Classic OpenAI Gym environment for learning RL |
| Discretization | Converts continuous state to discrete bins |
| Epsilon-greedy | Balances exploration and exploitation |
Q-Learning gives you the essential building blocks for understanding how RL agents learn from rewards. It’s an ideal first step into the world of AI agents.