Reinforcement Learning Explained Simply (with Diagrams and Real Code)

Reinforcement Learning: the essentials in one article — real code, diagrams and concrete steps, excerpts from a 45-lesson course.


A guide that gets straight to the point: Reinforcement Learning broken down with diagrams, concrete examples, and tested commands. Everything comes from a structured 16-chapter course — here are the best parts.

tl;dr
  • Introduction and Installation
  • Fundamentals of Reinforcement Learning
  • Markov Decision Process
  • Classical Q-Learning
  • Dynamic Programming
Course outline: Reinforcement Learning (15 chapters)

01. Introduction and Installation
  • Course presentation and why RL?
  • Install Python, Gymnasium and PyTorch
  • + 1 more lesson
02. Fundamentals of Reinforcement Learning
  • RL vs supervised vs unsupervised learning
  • Agent, environment, state, action, reward
  • + 2 more lessons
03. Markov Decision Process
  • Markov Decision Process (MDP) explained
  • The Bellman equation, intuitively
  • + 2 more lessons
04. Classical Q-Learning
  • Q-Learning: intuition and formula
  • Implement a Q-table in Python
  • + 2 more lessons
05. Dynamic Programming
06. SARSA and Variants
  • SARSA: the on-policy cousin of Q-Learning
  • Q-Learning vs SARSA: when to use which
  • + 2 more lessons
07. Deep Q-Networks
  • Why a neural network?
  • DQN architecture with PyTorch
  • + 2 more lessons
08. Monte Carlo Methods
  • Monte Carlo prediction
  • Monte Carlo control
  • + 1 more lesson
Final project (+ 7 more chapters along the way)
  • You leave with a concrete, demonstrable project

First Gymnasium Environment (FrozenLake)

NOTE (Objective): Get hands-on with your first Gymnasium environment. Understand the universal reset/step interface and the concept of an episode, then run a random agent on FrozenLake.

Learning Objectives

TIP: By the end of this module, you will be able to:
  • Create an environment with gym.make
  • Understand the reset and step methods
  • Identify the observation space and action space
  • Write a complete episode loop
  • Run a random agent and observe its score

FrozenLake: The Playground

FrozenLake is a 4x4 grid representing a frozen lake. The agent starts on the start cell (S) and must reach the goal (G) by walking on frozen ice (F) while avoiding holes (H): falling into a hole ends the episode with no reward. Simple, visual, and perfect for beginners.

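NOTE: By default, FrozenLake is "slippery" (is_slippery=True): an agent that tries to go right may slip and end up elsewhere. Concretely, it moves in the intended direction only one time in three; otherwise it slides in one of the two perpendicular directions. This randomness makes the problem more interesting. For a first contact, you can turn it off with is_slippery=False.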

The Universal Interface: reset and step

All Gymnasium environments share the same interface, which is what makes writing agents so convenient. Two methods are enough to interact with any of them.

reset()

Resets the environment to its initial state and returns the first observation together with an info dictionary. Call it at the beginning of each episode (one complete game).

step(action)

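Executes an action and returns five values: the new state, the reward, whether the episode is terminated (the agent fell into a hole or reached the goal), whether it is truncated (a time limit cut it short; FrozenLake-v1 stops after 100 steps), and an info dictionary.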

WARNING: Always remember to call env.close() at the end, especially with graphical rendering. Otherwise, ghost windows may remain open and consume memory.

Visualizing the Game

To watch the agent play on screen, pass render_mode="human" to gym.make.

Solving FrozenLake with Value Iteration

NOTE (Objective): Apply Value Iteration to FrozenLake end to end: compute V*, extract the optimal policy, run the agent, and measure its win rate. This is your first agent that actually wins.

Learning Objectives

TIP: By the end of this module, you will be able to:
  • Write a complete Value Iteration for FrozenLake
  • Extract the optimal policy from V*
  • Evaluate the agent over many episodes
  • Interpret the obtained values and policy
  • Understand the effect of slippery ice

Step 1: Access the FrozenLake Model

FrozenLake exposes its full model through env.unwrapped.P: a dictionary that gives, for each state and each action, the list of possible transitions, each as a tuple (probability, next_state, reward, terminated).

DQN Architecture with PyTorch

NOTE (Objective): Build the DQN neural network with PyTorch. Understand the architecture (input, hidden layers, output), the role of activation functions, and how the network predicts Q-values.

Learning Objectives

TIP: By the end of this module, you will be able to:
  • Define a network with nn.Module
  • Choose input and output sizes according to the environment
  • Understand the role of hidden layers and ReLU
  • Make a prediction (forward pass)
  • Understand that the output gives one Q-value per action

The Anatomy of a DQN

A DQN for CartPole is a very simple network: it takes the 4 numbers that describe the state (cart position, cart velocity, pole angle, pole angular velocity), passes them through two hidden layers, and outputs 2 Q-values, one per action (push left, push right).

Going further

This article covers the most useful excerpts. The full Reinforcement Learning course (16 chapters, 45 lessons, exercises with solutions, and a final project) takes you all the way.


FAQ

How long does it take to learn Reinforcement Learning?
With a structured progression (16 chapters, 45 short and practical lessons), you reach an operational level in a few weeks at 30 to 60 minutes per day. The key is to practice each concept immediately.
Are there any prerequisites?
It is best to be comfortable with the fundamentals of the field: this content goes in depth, with real-world cases.
Where to start concretely?
Reproduce the commands from this article, then follow the full Reinforcement Learning course: it chains the 45 lessons in order, with exercises and a final project.
