Get Started with Reinforcement Learning: Your First Concrete Step Today

Getting Started with Reinforcement Learning: The Essentials in One Article — Real Code, Diagrams and Concrete Steps, Excerpts from a 42-Lesson Course.

REHOUMA Haythem

12 Jun 2026 • 15 min read

The best way to learn reinforcement learning is by doing. This article gives you a head start with practical excerpts from a 42-lesson course, enough to get your first result today.

tl;dr

Course Resources and Install the Environment
Introduction to Reinforcement Learning
Fundamental Concepts of RL
MDP Part 1 - States Actions Rewards
MDP Part 2 - Policy and Bellman Equations

~$ cat ./parcours.md # Starting Reinforcement Learning — 11 chapters

Course Resources and Install the Environment

→ Course Resources & Introduction → Install Python, Anaconda and TensorFlow (+ 3 more lessons)

Introduction to Reinforcement Learning

→ What is Reinforcement Learning? → Comparison Supervised Learning, Unsupervised and RL (+ 2 more lessons)

Fundamental Concepts of RL

→ Agent-Environment Interaction → Exploration vs Exploitation: The Big Dilemma (+ 2 more lessons)

MDP Part 1 - States Actions Rewards

→ Definition of MDPs and Markov Property → States, Actions and Rewards in an MDP (+ 2 more lessons)

MDP Part 2 - Policy and Bellman Equations

→ Optimal vs Sub-optimal Policy → Value Functions V(s) and Q(s,a) (+ 2 more lessons)

Q-Learning

→ Fundamentals of Q-Learning → Updating Q-values and Convergence (+ 1 more lesson)

Monte Carlo Methods

→ Introduction to Monte Carlo Methods → Monte Carlo vs TD Learning (+ 1 more lesson)

Dynamic Programming and TD Learning

→ Introduction to Dynamic Programming → TD(0) and TD(λ) Methods (+ 1 more lesson)

🏁

Final project (+ 3 chapters along the way)

→ You leave with a concrete and demonstrable project

Chapter 02 – Complete Terminology Cheat Sheet

NOTE: Purpose of this sheet. Have all the fundamental RL definitions in one place, from the notion of state to the algorithms. This sheet serves as a reference throughout the course.

Reinforcement learning (RL): a machine learning paradigm in which an agent learns to behave in an environment by taking actions and receiving rewards or penalties in return.

TIP: Mechanism

The agent observes the state of the environment
It chooses an action to perform
It observes the outcome: new state + reward (positive or negative)
By repeating this cycle, it learns which actions maximize long-term reward
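The four-step cycle above can be sketched with a toy environment. The `Corridor` class, its reward values and the random action choice are assumptions made purely for illustration; they are not part of the course material or of any library.

```python
import random

class Corridor:
    """Hypothetical toy environment: cells 0..4, the goal is the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: -1 (move left) or +1 (move right); walls clamp the position
        self.state = max(0, min(self.length - 1, self.state + action))
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1  # small penalty per step, bonus at the goal
        return self.state, reward, done

def run_episode(env, max_steps=200, seed=0):
    rng = random.Random(seed)
    state = env.reset()                          # 1. observe the state
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = rng.choice([-1, 1])             # 2. choose an action (random here)
        state, reward, done = env.step(action)   # 3. observe new state + reward
        total_reward += reward
        steps += 1
        if done:
            break
    # 4. a learning agent would use the rewards seen here to improve its choices
    return total_reward, steps

total, steps = run_episode(Corridor())
print(f"Episode finished after {steps} steps, total reward {total:.1f}")
```

This agent never learns: it only shows where the state, the action and the reward appear in the loop.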

State s_t (core component)

Complete description of the current situation of the environment at time t. This is the information the agent uses to decide which action to take.

State type	Description	Example
Starting state	Initial position at the beginning of an episode	Cell (1,1) in a maze
Intermediate state	Any state during an episode (non-terminal)	Cells traversed in the maze
Terminal state	End-of-episode state (win or loss)	Maze exit or hole

NOTE: Discrete vs continuous. A state can be discrete (e.g. numbered cells in a maze, which are countable) or continuous (e.g. a GPS position in meters or an angle in degrees, which can take infinitely many numeric values).

Action a_t (core component)

What the agent decides to do in a given state. The set of all possible actions forms the action space.

TIP: Discrete actions
Finite number of enumerable actions.
Examples: Up, Down, Left, Right (maze); 0 or 1 (CartPole)

NOTE: Continuous actions
Numeric values within an interval.
Examples: Force from 0 to 10 N; steering angle from −30° to +30°
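The difference between the two kinds of action space shows up as soon as you sample from them. The sketch below uses only the standard library; the names `MAZE_ACTIONS` and `STEERING_RANGE` are hypothetical and simply reuse the examples above.

```python
import random

# Discrete action space: a finite list of choices (maze example)
MAZE_ACTIONS = ["Up", "Down", "Left", "Right"]

# Continuous action space: any value inside an interval (steering example)
STEERING_RANGE = (-30.0, 30.0)  # degrees

def sample_discrete(rng):
    """Pick one of the enumerable actions."""
    return rng.choice(MAZE_ACTIONS)

def sample_continuous(rng):
    """Pick any real value inside the interval."""
    low, high = STEERING_RANGE
    return rng.uniform(low, high)

rng = random.Random(0)
print("Discrete action  :", sample_discrete(rng))
print(f"Continuous action: {sample_continuous(rng):.2f} degrees")
```

A discrete space can be enumerated in full, which is what makes table-based methods possible; a continuous space cannot.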

Reward r_t (learning signal)

Numeric signal received by the agent after each action. It is the only way for the environment to tell the agent whether its action was good or bad.

NOTE: Immediate reward (3.6)
Reward received directly after an action at step t. Provides instant feedback on the quality of a single action.
Notation: r_{t+1}

NOTE: Long-term reward (3.7)
Sum of rewards accumulated over the entire episode (or to infinity). This is what the agent actually seeks to maximize.
Notation: G_t = r_{t+1} + r_{t+2} + r_{t+3} + …

Discount factor γ, gamma (key parameter)

Value between 0 and 1 that weights future rewards. The farther away a reward is in time, the more it is "discounted": a reward that arrives k steps after the immediate reward is weighted by γ^k. It acts as a penalty applied to future rewards.

NOTE: γ close to 0
The agent is "myopic": it only thinks about the immediate reward. Short-term behavior.

TIP: γ close to 1
The agent is "far-sighted": it considers future rewards with almost equal importance. Long-term behavior.

NOTE: Typical values are γ = 0.9 or 0.99 in most practical applications.

Utility / Return G_t (final objective)

The total sum of discounted rewards obtained from time step t. This is THE value the agent seeks to maximize. Also called return or cumulative return.

TIP: In plain English: utility = the sum of all future rewards, with a weighting that decreases over time (thanks to the discount γ). A good action now AND good actions in the future = high utility.

NOTE: Synonyms used in the course: long-term reward, return, cumulative return, G_t.
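To make the weighting concrete, here is a minimal sketch that computes the discounted return G_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … for a list of rewards. The function name `discounted_return` and the reward sequence are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Return G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    g = 0.0
    # Walk backwards so each step applies G = r + gamma * G_next
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical episode: the only reward arrives at the fourth step
rewards = [0.0, 0.0, 0.0, 1.0]

for gamma in (0.0, 0.5, 0.9, 0.99):
    print(f"gamma={gamma:<4} -> G_0 = {discounted_return(rewards, gamma):.4f}")
```

With γ = 0 the final reward contributes nothing to G_0 (the myopic agent ignores it), while with γ = 0.9 it contributes 0.9³ = 0.729.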

Policy π (heart of RL)

The agent's decision strategy: for each state, it defines which action to take. It is the agent's "brain", what it seeks to learn.
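In its simplest form a policy is just a lookup from state to action. The sketch below contrasts a deterministic policy (one action per state) with a stochastic one (a probability distribution per state). The state names, actions and probabilities are hypothetical labels for a tiny maze, not taken from the course.

```python
import random

# Deterministic policy: exactly one action per state (lookup table)
deterministic_policy = {
    "start": "Right",
    "corridor": "Down",
    "near_exit": "Right",
}

# Stochastic policy: a probability distribution over actions for each state
stochastic_policy = {
    "start": {"Right": 0.8, "Down": 0.2},
    "corridor": {"Down": 0.9, "Left": 0.1},
    "near_exit": {"Right": 1.0},
}

def act_deterministic(state):
    """Always return the single action stored for this state."""
    return deterministic_policy[state]

def act_stochastic(state, rng):
    """Draw an action according to the probabilities stored for this state."""
    actions = list(stochastic_policy[state])
    weights = list(stochastic_policy[state].values())
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(0)
print("Deterministic:", act_deterministic("start"))
print("Stochastic   :", act_stochastic("start", rng))
```

Learning a policy amounts to filling in such a table (or a function that replaces it) so that the chosen actions lead to a high return.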

Python Virtual Environments

NOTE: Objective. Understand why virtual environments are used, how to install multiple Python versions in parallel, and how to create an isolated environment for each project, on Windows and on Linux (Ubuntu).

Why virtual environments?

The problem without venv

The solution with venv

Environment	Python	Libraries	Project
`myenv39`	3.9	TensorFlow 2.10, Gym 0.26	RL Course (old)
`myenv310`	3.10	Streamlit 1.28, pandas 2.0	Dashboard
`myenv311`	3.11	TensorFlow 2.15, Gymnasium	RL Course (current)
`rl-env`	3.12	Gymnasium, NumPy, Matplotlib	This course
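Once one of the environments in the table is activated, a short check from inside Python confirms which interpreter is actually running. This is a minimal sketch using only the standard library; the helper name `in_virtualenv` is an assumption, and the check applies to environments created with venv.

```python
import sys

def in_virtualenv():
    # Inside a venv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix

print("Python version :", sys.version.split()[0])
print("Interpreter    :", sys.executable)
print("Virtual env    :", "active" if in_virtualenv() else "not active")
```

If the interpreter path does not point inside the environment's folder, the environment is probably not activated in the current terminal.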

🪟 Windows: Install multiple Python versions

NOTE: Before you start (PowerShell as administrator)
Some commands require administrator rights. If activating a venv fails with an error saying that running scripts is disabled on this system, first run in PowerShell:

powershell

# Open PowerShell as administrator, then:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
# Answer Y (Yes) to the confirmation

Download and install each version from python.org/downloads. Check "Add to PATH" only for the first version. For subsequent versions, uncheck "Add to PATH" to avoid conflicts.

TIP: When installing Python on Windows, the py launcher is installed automatically. It lets you call any installed version with py -3.10, py -3.11, etc.

powershell

python --version
# Python 3.x.x (the default version in PATH)

py -3.9 --version
# Python 3.9.x

py -3.10 --version
# Python 3.10.x

py -3.11 --version
# Python 3.11.x

py -3.12 --version
# Python 3.12.x

py -3.13 --version
# Python 3.13.x

powershell

# Check the version
py -3.9 --version

# Create the virtual environment
py -3.9 -m venv myenv39

# Activate the environment (PowerShell)
.\myenv39\Scripts\activate

# The prompt changes: (myenv39) indicates the env is active
(myenv39) python --version
# Python 3.9.x

(myenv39) pip install streamlit
(myenv39) pip show streamlit
# Name: streamlit
# Version: 1.x.x
# ...

# Deactivate the environment
(myenv39) deactivate

powershell

py -3.10 --version
py -3.10 -m venv myenv310

.\myenv310\Scripts\activate

(myenv310) python --version
# Python 3.10.x

(myenv310) pip install streamlit
(myenv310) pip show streamlit
(myenv310) deactivate

powershell

# Python 3.11
py -3.11 -m venv myenv311
.\myenv311\Scripts\activate
(myenv311) python --version
(myenv311) pip install streamlit
(myenv311) deactivate

# Python 3.12
py -3.12 -m venv myenv312
.\myenv312\Scripts\activate
(myenv312) python --version
(myenv312) pip install streamlit
(myenv312) deactivate

# Python 3.13
py -3.13 -m venv myenv313
.\myenv313\Scripts\activate
(myenv313) python --version
(myenv313) pip install streamlit
(myenv313) deactivate

🔗 SSH: Connect to a Linux VM from Windows

NOTE: Context. If you are working with a Linux virtual machine (VirtualBox, VMware...) or a remote server, you can connect to it from Windows via SSH without leaving your Windows terminal.

Step 1 — Install SSH on the Linux VM

In the Linux VM terminal, as root:

bash

su                         # or: sudo -s
apt install openssh-server -y
systemctl status ssh       # check status
systemctl start ssh        # start if needed
systemctl enable ssh       # start automatically at boot

# Find the VM's IP address
ip -br addr
# lo               UNKNOWN        127.0.0.1/8
# eth0             UP             192.168.2.139/24   ← copy this IP

First Gym Script – Test the Installation

TIP: 😌 Don't panic, we are only testing!
The goal of this lesson is only to verify that your environment works. You will copy-paste scripts and observe the results. You do not need to understand everything yet: every line of code will be explained in detail in the following chapters. Breathe, this is just a startup test!

NOTE: Objective. Run your first scripts with Gymnasium and confirm that your installation works. Two options: locally (Jupyter) or directly in Google Colab with nothing to install.

Option A – Test locally (Jupyter)

If you followed lessons 01 and 02 (Anaconda installation + Jupyter setup), activate your environment and launch Jupyter:

bash

conda activate rl-env
jupyter notebook

Then copy the scripts below into a cell and run with Shift + Enter.

Option B – Test in Google Colab (nothing to install!)

TIP: Have not installed Python locally yet? No problem.
Google Colab runs directly in your browser. No installation is required, which makes it perfect for testing now and installing locally later.

Open a new notebook at colab.research.google.com, then install Gymnasium in the first cell:

bash

# Cell 1 — Installation (Colab only, already done if you are local)
!pip install "gymnasium[toy-text]" matplotlib --quiet

Then copy the scripts below into the following cells. That's it!

NOTE: Colab or local, the scripts are identical. The only difference is the installation cell above, which you do not need to run locally.
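Before running the scripts, it can help to confirm that the packages they import are visible from the current interpreter. The sketch below relies only on the standard library's `importlib.metadata`; the helper name `installed_version` is an assumption.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Packages imported by the three scripts in this lesson
for name in ("gymnasium", "numpy", "matplotlib"):
    version = installed_version(name)
    print(f"{name:<12}: {version if version else 'NOT INSTALLED'}")
```

If a package shows up as missing locally, the usual cause is that the notebook is running in a different environment from the one where the package was installed.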

What is Gymnasium (formerly OpenAI Gym)?

Gymnasium is the reference library for testing RL algorithms. It provides standardized environments with a uniform interface.

Classic environments

Universal interface

Script 1 – Explore the CartPole environment

TIP: 📋 Simply copy-paste this code; do not try to understand everything now. You will see what information CartPole produces. We will detail each term (observation_space, action_space, etc.) in Chapter 02.

python

import gymnasium as gym
import numpy as np

# Create the environment
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)

print("=== CartPole-v1 ===")
print(f"State space  : {env.observation_space}")
print(f"  Number of dimensions : {env.observation_space.shape[0]}")
print(f"  Min values : {env.observation_space.low}")
print(f"  Max values : {env.observation_space.high}")
print(f"\nAction space : {env.action_space}")
print(f"  Number of actions : {env.action_space.n}")
print("  Actions : 0 (push cart left), 1 (push cart right)")
print(f"\nInitial observation : {obs}")
print("  [cart position, cart velocity, pole angle, pole angular velocity]")

env.close()

Script 2 – Random agent on CartPole

TIP: 🎲 An agent that plays randomly: this is our starting point. This script does nothing intelligent, it simply chooses random actions. It is our baseline. Everything you learn in this course will enable an agent to do much better than this. Run it and observe the rewards, that's all!

python

import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt

def run_random_agent(env_name, n_episodes=100):
    """Agent that takes random actions - our baseline."""
    env = gym.make(env_name)
    total_rewards = []

    for episode in range(n_episodes):
        obs, info = env.reset()
        episode_reward = 0
        done = False

        while not done:
            # Random action (no learning)
            action = env.action_space.sample()
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            done = terminated or truncated

        total_rewards.append(episode_reward)

    env.close()
    return total_rewards

# Run the random agent
rewards = run_random_agent("CartPole-v1", n_episodes=100)

print(f"Average reward : {np.mean(rewards):.2f}")
print(f"Maximum reward : {np.max(rewards):.0f}")
print(f"Minimum reward : {np.min(rewards):.0f}")

# Visualize
plt.figure(figsize=(10, 4))
plt.plot(rewards, alpha=0.6, label="Reward per episode")
plt.axhline(np.mean(rewards), color='red', linestyle='--', label=f"Mean: {np.mean(rewards):.1f}")
plt.xlabel("Episode")
plt.ylabel("Total reward")
plt.title("Random agent - CartPole-v1")
plt.legend()
plt.tight_layout()
plt.savefig("random_agent_cartpole.png")
plt.show()

Script 3 – Explore FrozenLake (discrete environment)

TIP: 🗺️ FrozenLake will be your main playground in this course. Do not worry about the loop or manual actions for now; the goal is simply to see that the environment launches and responds. Chapter 03 will explain everything in detail.

python

import gymnasium as gym

# FrozenLake: 4x4 grid with discrete states
# render_mode="ansi" makes env.render() return the grid as text
env = gym.make("FrozenLake-v1", is_slippery=False, render_mode="ansi")
obs, info = env.reset()

print("=== FrozenLake-v1 ===")
print(f"State space  : {env.observation_space.n} states (4x4 grid)")
print(f"Action space : {env.action_space.n} actions")
print("  0=left, 1=down, 2=right, 3=up")
print(f"\nInitial state : {obs}")
print("\nVisualize the grid :")
print(env.render())

# Manual loop
# right, right, down, down, down, right: path to the exit on the default map
manual_actions = [2, 2, 1, 1, 1, 2]
total_reward = 0

print("\n--- Manual simulation ---")
obs, _ = env.reset()
for i, action in enumerate(manual_actions):
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    direction = ["←", "↓", "→", "↑"][action]
    print(f"Step {i+1}: {direction}  State={obs}  Reward={reward}")
    if terminated or truncated:
        break

print(f"\nTotal reward : {total_reward}")
env.close()

TIP: Why FrozenLake? This environment is well suited to beginners because it has a discrete state space (16 cells), so you can visualize the entire Q-table. CartPole has continuous states, which require more advanced techniques.
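To see why 16 discrete states make the Q-table easy to inspect, here is a minimal sketch of such a table as a plain list of lists: one row per state, one column per action. The helper names and the hand-written value are hypothetical; no learning happens here.

```python
N_STATES, N_ACTIONS, GRID_WIDTH = 16, 4, 4
ACTION_NAMES = ["left", "down", "right", "up"]  # same order as in Script 3

# One row per state, one column per action, all values start at zero
q_table = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def state_to_cell(state):
    """Convert a state index (0..15) into (row, column) on the 4x4 grid."""
    return divmod(state, GRID_WIDTH)

def greedy_action(state):
    """Index of the action with the highest Q-value in this state."""
    row = q_table[state]
    return row.index(max(row))

# Hypothetical value, written by hand only to show how a lookup works
q_table[14][2] = 1.0  # in state 14, moving right looks best

print("Table size:", len(q_table), "states x", len(q_table[0]), "actions")
print("State 14 is cell", state_to_cell(14))
print("Greedy action in state 14:", ACTION_NAMES[greedy_action(14)])
```

The whole table holds only 16 × 4 = 64 numbers, which is why it can be printed and read in full. A continuous state space like CartPole's has no finite list of rows to fill.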

go-further

This article covers the most useful excerpts — the complete Getting Started with Reinforcement Learning course (12 chapters, 42 lessons, corrected exercises and final project) takes you all the way.


FAQ

How long does it take to get started with reinforcement learning?

With a structured progression (12 chapters, 42 short and practical lessons), you reach an operational level in a few weeks at 30 to 60 minutes per day. The important thing is to practice each concept immediately.

Are there any prerequisites?

No prerequisites: the course starts from zero; every concept is introduced before being used.

Where to start concretely?

Reproduce the commands in this article, then follow the complete Getting Started with Reinforcement Learning course: it takes you through the 42 lessons in order, with exercises and a final project.
