Get Started with Reinforcement Learning: Your First Concrete Step Today
Getting Started with Reinforcement Learning: The Essentials in One Article — Real Code, Diagrams and Concrete Steps, Excerpts from a 42-Lesson Course.
The best way to get started with reinforcement learning is by doing. This article gives you a head start with practical excerpts from a 42-lesson course, enough to get your first result today.
- Course Resources and Install the Environment
- Introduction to Reinforcement Learning
- Fundamental Concepts of RL
- MDP Part 1 - States Actions Rewards
- MDP Part 2 - Policy and Bellman Equations
Chapter 02 – Complete Terminology Cheat Sheet
Reinforcement learning (RL) is a machine learning paradigm in which an agent learns to behave in an environment by taking actions and receiving rewards or penalties in return.
- The agent observes the state of the environment
- It chooses an action to perform
- It observes the outcome: new state + reward (positive or negative)
- By repeating this cycle, it learns which actions maximize long-term reward
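The cycle above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the corridor environment, the `step` function and the reward values are hypothetical and not part of the course material, and the agent picks actions at random, so no learning happens yet.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells (0..4).
# The agent starts in cell 0; the episode ends when it reaches cell 4.
GOAL = 4

def step(state, action):
    """Apply an action (-1 = left, +1 = right); return (new_state, reward, done)."""
    new_state = min(max(state + action, 0), GOAL)  # stay inside the corridor
    done = new_state == GOAL
    reward = 1.0 if done else -0.1  # assumed values: small step penalty, goal bonus
    return new_state, reward, done

def run_episode(rng, max_steps=200):
    """Run one episode with random actions; return (final_state, total_reward)."""
    state = 0                                      # the agent observes the state
    total_reward = 0.0
    for _ in range(max_steps):
        action = rng.choice([-1, 1])               # it chooses an action
        state, reward, done = step(state, action)  # it observes new state + reward
        total_reward += reward
        if done:
            break
    return state, total_reward

final_state, total = run_episode(random.Random(0))
print(f"Final state: {final_state}, total reward: {total:.2f}")
```

A learning agent would replace the random choice with a rule that improves as rewards come in.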
State s_t (core component)
Complete description of the current situation of the environment at time t. This is the information the agent has to decide which action to take.
| State type | Description | Example |
|---|---|---|
| Starting state | Initial position at the beginning of an episode | Cell (1,1) in a maze |
| Intermediate state | Any state during an episode (non-terminal) | Cells traversed in the maze |
| Terminal state | End-of-episode state (win or loss) | Maze exit or hole |
Action a_t (core component)
What the agent decides to do in a given state. The set of all possible actions forms the action space.
Discrete action space: a finite number of enumerable actions.
Examples: Up, Down, Left, Right (maze); 0 or 1 (CartPole)
Continuous action space: numeric values within an interval.
Examples: Force from 0 to 10 N; steering angle from −30° to +30°
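As a rough illustration of the difference, the sketch below samples one action from each kind of space using only the standard library. The variable names and the `clip` helper are assumptions made for this example.

```python
import random

rng = random.Random(0)

# Discrete action space: a finite set, so an action is one element of a list.
MAZE_ACTIONS = ["Up", "Down", "Left", "Right"]
discrete_action = rng.choice(MAZE_ACTIONS)

# Continuous action space: any real value inside an interval,
# here a steering angle in degrees.
STEERING_LOW, STEERING_HIGH = -30.0, 30.0
continuous_action = rng.uniform(STEERING_LOW, STEERING_HIGH)

def clip(value, low, high):
    """Force a proposed continuous action back into the allowed interval."""
    return max(low, min(high, value))

print(discrete_action)
print(round(continuous_action, 2))
print(clip(45.0, STEERING_LOW, STEERING_HIGH))  # 45 degrees is out of range
```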
Reward r_t (learning signal)
Numeric signal received by the agent after each action. It is the only way for the environment to tell the agent whether its action was good or bad.
Immediate reward: the reward received directly after an action at step t. It provides instant feedback on the quality of a single action.
Notation: r_{t+1}
Cumulative reward: the sum of rewards accumulated over the entire episode (or to infinity). This is what the agent actually seeks to maximize.
Notation: G_t = r_{t+1} + r_{t+2} + r_{t+3} + …
Discount factor γ, gamma (key parameter)
A value between 0 and 1 that weights future rewards: the farther away in time a reward is, the more it is "discounted". It acts as a penalty applied to future rewards.
γ close to 0: the agent is "myopic" and only thinks about the immediate reward. Short-term behavior.
γ close to 1: the agent is "far-sighted" and considers future rewards with almost equal importance. Long-term behavior.
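To make the effect of γ concrete, here is a minimal sketch that assumes the usual convention where a reward received k steps later is multiplied by γ^k. The function name and the reward sequence are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], computed from the last reward backwards."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g

# Hypothetical episode: three small rewards, then a large one at the end.
rewards = [1.0, 1.0, 1.0, 10.0]

for gamma in (0.0, 0.5, 1.0):
    print(f"gamma={gamma}: return={discounted_return(rewards, gamma)}")
# gamma=0.0: return=1.0   (only the immediate reward counts)
# gamma=0.5: return=3.0   (the final 10 is worth 10 * 0.5**3 = 1.25)
# gamma=1.0: return=13.0  (all rewards count equally)
```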
Return G_t, also called utility (final objective)
The total sum of discounted rewards obtained from time step t. This is THE value the agent seeks to maximize. Also called return or cumulative return.
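Written out with the notation used above, and assuming the standard definition of the discounted return, this sum takes the following form. With γ = 1 it reduces to the plain cumulative reward.

```latex
G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^{2} r_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},
\qquad 0 \le \gamma \le 1

% Equivalent recursive form:
G_t = r_{t+1} + \gamma\, G_{t+1}
```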
Policy π (heart of RL)
The agent's decision strategy: for each state, it defines which action to take. It is the agent's "brain", what it seeks to learn.
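One simple way to picture a policy is as a lookup table from states to actions. The sketch below assumes a hypothetical 2x2 maze with states given as (row, column). It also shows a stochastic variant, where a state maps to action probabilities; that variant goes beyond the definition above and is included only as a common extension.

```python
import random

# Deterministic policy: exactly one action per state (hypothetical 2x2 maze).
deterministic_policy = {
    (0, 0): "Right",
    (0, 1): "Down",
    (1, 0): "Right",
}

def act(state):
    """Return the action the deterministic policy prescribes for this state."""
    return deterministic_policy[state]

# Stochastic policy: a probability distribution over actions for each state.
stochastic_policy = {
    (0, 0): {"Right": 0.8, "Down": 0.2},
}

def sample_action(state, rng):
    """Draw one action according to the probabilities stored for this state."""
    actions, probabilities = zip(*stochastic_policy[state].items())
    return rng.choices(actions, weights=probabilities, k=1)[0]

print(act((0, 0)))
print(sample_action((0, 0), random.Random(0)))
```

Learning a policy then amounts to filling in or adjusting such a table, or the parameters of a function that plays the same role.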
Python Virtual Environments
Why virtual environments?
The problem without venv
The solution with venv
| Environment | Python | Libraries | Project |
|---|---|---|---|
| myenv39 | 3.9 | TensorFlow 2.10, Gym 0.26 | RL Course (old) |
| myenv310 | 3.10 | Streamlit 1.28, pandas 2.0 | Dashboard |
| myenv311 | 3.11 | TensorFlow 2.15, Gymnasium | RL Course (current) |
| rl-env | 3.12 | Gymnasium, NumPy, Matplotlib | This course |
🪟 Windows Install multiple Python versions
Some commands require administrator rights. If you get a permission error when activating a venv, first run in PowerShell:
# Open PowerShell as administrator, then:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
# Answer Y (Yes) to the confirmation
Download and install each version from python.org/downloads. Check "Add to PATH" only for the first version. For subsequent versions, uncheck "Add to PATH" to avoid conflicts.
The py launcher is installed automatically. It lets you call any installed version with py -3.10, py -3.11, etc.
python --version      # Python 3.x.x (the default version in PATH)
py -3.9 --version     # Python 3.9.x
py -3.10 --version    # Python 3.10.x
py -3.11 --version    # Python 3.11.x
py -3.12 --version    # Python 3.12.x
py -3.13 --version    # Python 3.13.x
# Check the version
py -3.9 --version
# Create the virtual environment
py -3.9 -m venv myenv39
# Activate the environment (PowerShell)
.\myenv39\Scripts\activate
# The prompt changes: (myenv39) indicates the env is active
(myenv39) python --version
# Python 3.9.x
(myenv39) pip install streamlit
(myenv39) pip show streamlit
# Name: streamlit
# Version: 1.x.x
# ...
# Deactivate the environment
(myenv39) deactivate
py -3.10 --version
py -3.10 -m venv myenv310
.\myenv310\Scripts\activate
(myenv310) python --version
# Python 3.10.x
(myenv310) pip install streamlit
(myenv310) pip show streamlit
(myenv310) deactivate
# Python 3.11
py -3.11 -m venv myenv311
.\myenv311\Scripts\activate
(myenv311) python --version
(myenv311) pip install streamlit
(myenv311) deactivate
# Python 3.12
py -3.12 -m venv myenv312
.\myenv312\Scripts\activate
(myenv312) python --version
(myenv312) pip install streamlit
(myenv312) deactivate
# Python 3.13
py -3.13 -m venv myenv313
.\myenv313\Scripts\activate
(myenv313) python --version
(myenv313) pip install streamlit
(myenv313) deactivate
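After activating an environment, a short script can confirm which interpreter is actually running. The sketch below uses only the standard library; the helper name `in_virtual_env` is an assumption for this example. It detects environments created with `python -m venv`. A conda environment is not a venv, so the check is expected to report False there.

```python
import sys

def in_virtual_env():
    """True when running inside a venv: sys.prefix then differs from sys.base_prefix."""
    return sys.prefix != sys.base_prefix

print(f"Python version : {sys.version_info.major}.{sys.version_info.minor}")
print(f"Interpreter    : {sys.executable}")
print(f"Inside a venv  : {in_virtual_env()}")
```

Run with the environment active, the interpreter path should point inside the environment folder (for example under `myenv39\Scripts`).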
🔗 SSH Connect to a Linux VM from Windows
Step 1 — Install SSH on the Linux VM
In the Linux VM terminal, as root:
su    # or: sudo -s
apt install openssh-server -y
systemctl status ssh    # check status
systemctl start ssh     # start if needed
systemctl enable ssh    # start automatically at boot
# Find the VM's IP address
ip -br addr
# lo      UNKNOWN  127.0.0.1/8
# eth0    UP       192.168.2.139/24   ← copy this IP
First Gym Script – Test the Installation
The goal of this lesson is only to verify that your environment works. You will copy-paste scripts and observe the results. You do not need to understand everything yet — every line of code will be explained in detail in the following chapters. Breathe, this is just a startup test!
Option A – Test locally (Jupyter)
If you followed lessons 01 and 02 (Anaconda installation + Jupyter setup), activate your environment and launch Jupyter:
conda activate rl-env
jupyter notebook
Then copy the scripts below into a cell and run with Shift + Enter.
Option B – Test in Google Colab (nothing to install!)
Google Colab runs directly in your browser. No installation required — perfect for testing now and installing locally later.
Open a new notebook at colab.research.google.com, then install Gymnasium in the first cell:
# Cell 1: Installation (Colab only, already done if you are local)
!pip install gymnasium[toy-text] matplotlib --quiet
Then copy the scripts below into the following cells. That's it!
What is Gymnasium (formerly OpenAI Gym)?
Gymnasium is the reference library for testing RL algorithms. It provides standardized environments with a uniform interface.
Classic environments
Universal interface
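The scripts in this lesson all rely on the same two calls: `reset()` returns an observation and an info dictionary, and `step(action)` returns a five-value tuple. The sketch below mimics that call shape with a hypothetical stand-in class, so it runs without Gymnasium installed. `CountdownEnv` and `run_episode` are invented for illustration and are not part of the library.

```python
class CountdownEnv:
    """Hypothetical stand-in, NOT a Gymnasium class: it only mimics the call shapes."""

    def __init__(self, start=3):
        self.start = start
        self.remaining = start

    def reset(self, seed=None):
        """Start a new episode; return (observation, info)."""
        self.remaining = self.start
        return self.remaining, {}

    def step(self, action):
        """Advance one step; return (observation, reward, terminated, truncated, info)."""
        self.remaining -= 1
        terminated = self.remaining == 0  # the episode ends when the counter hits 0
        return self.remaining, 1.0, terminated, False, {}

def run_episode(env, choose_action):
    """Generic loop: works with any object exposing reset() and step() as above."""
    obs, info = env.reset()
    total_reward, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, info = env.step(choose_action(obs))
        total_reward += reward
        done = terminated or truncated
    return total_reward

print(run_episode(CountdownEnv(start=3), lambda obs: 0))  # prints 3.0
```

Because the loop only depends on the two method signatures, the same function can drive very different environments, which is the point of a uniform interface.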
Script 1 – Explore the CartPole environment
import gymnasium as gym
import numpy as np
# Create the environment
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=42)
print("=== CartPole-v1 ===")
print(f"State space : {env.observation_space}")
print(f" Number of dimensions : {env.observation_space.shape[0]}")
print(f" Min values : {env.observation_space.low}")
print(f" Max values : {env.observation_space.high}")
print(f"\nAction space : {env.action_space}")
print(f" Number of actions : {env.action_space.n}")
print(f" Actions : 0 (left), 1 (right)")
print(f"\nInitial observation : {obs}")
print(" [cart position, cart velocity, pole angle, pole angular velocity]")
env.close()
Script 2 – Random agent on CartPole
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
def run_random_agent(env_name, n_episodes=100):
    """Agent that takes random actions - our baseline."""
    env = gym.make(env_name)
    total_rewards = []
    for episode in range(n_episodes):
        obs, info = env.reset()
        episode_reward = 0
        done = False
        while not done:
            # Random action (no learning)
            action = env.action_space.sample()
            obs, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            done = terminated or truncated
        total_rewards.append(episode_reward)
    env.close()
    return total_rewards
# Run the random agent
rewards = run_random_agent("CartPole-v1", n_episodes=100)
print(f"Average reward : {np.mean(rewards):.2f}")
print(f"Maximum reward : {np.max(rewards):.0f}")
print(f"Minimum reward : {np.min(rewards):.0f}")
# Visualize
plt.figure(figsize=(10, 4))
plt.plot(rewards, alpha=0.6, label="Reward per episode")
plt.axhline(np.mean(rewards), color='red', linestyle='--', label=f"Mean: {np.mean(rewards):.1f}")
plt.xlabel("Episode")
plt.ylabel("Total reward")
plt.title("Random agent - CartPole-v1")
plt.legend()
plt.tight_layout()
plt.savefig("random_agent_cartpole.png")
plt.show()
Script 3 – Explore FrozenLake (discrete environment)
import gymnasium as gym
# FrozenLake: 4x4 grid with discrete states
# render_mode="ansi" makes env.render() return the grid as text
env = gym.make("FrozenLake-v1", is_slippery=False, render_mode="ansi")
obs, info = env.reset()
print("=== FrozenLake-v1 ===")
print(f"State space : {env.observation_space.n} states (4x4 grid)")
print(f"Action space : {env.action_space.n} actions")
print(" 0=left, 1=down, 2=right, 3=up")
print(f"\nInitial state : {obs}")
print("\nVisualize the grid :")
print(env.render())
# Manual loop
manual_actions = [2, 2, 1, 1, 1, 2]  # right, right, down, down, down, right: reaches the exit (state 15)
total_reward = 0
print("\n--- Manual simulation ---")
obs, _ = env.reset()
for i, action in enumerate(manual_actions):
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    direction = ["←", "↓", "→", "↑"][action]
    print(f"Step {i+1}: {direction} State={obs} Reward={reward}")
    if terminated or truncated:
        break
print(f"\nTotal reward : {total_reward}")
env.close()
This article covers the most useful excerpts: the complete Getting Started with Reinforcement Learning course (12 chapters, 42 lessons, corrected exercises and final project) takes you all the way.
FAQ
How long does it take to get started with reinforcement learning?
Are there any prerequisites?
Where to start concretely?