
Python Data Science: 9 Key Steps to Go from Zero to Operational

Python Data Science: The Essentials in One Article — Real Code, Diagrams and Concrete Steps, Extracts from a 36-Lesson Course.

REHOUMA Haythem

12 Jun 2026 • 14 min read

Anyone can learn Python Data Science, provided they follow the steps in the right order. We have condensed a complete 36-lesson course into a clear learning path, with the most useful code snippets.

tl;dr

Introduction and Installation
Essential NumPy
Pandas Series and DataFrames
Reading and Writing Data
Data Cleaning

~$ cat ./parcours.md # Python Data Science — 10 chapters

Introduction and Installation
→ Course presentation
→ Install the Data Science environment
+ 1 more lesson

Essential NumPy
→ NumPy arrays vs Python lists
→ Vectorized operations
+ 1 more lesson

Pandas Series and DataFrames
→ Create a Series
→ Create a DataFrame
+ 2 more lessons

Reading and Writing Data
→ CSV with read_csv and to_csv
→ Excel, JSON and Parquet
+ 1 more lesson

Data Cleaning
→ Missing values (NaN)
→ Duplicates and inconsistencies
+ 1 more lesson

Aggregations and groupby
→ Basic groupby
→ agg on multiple columns
+ 1 more lesson

Joins and Merges
→ merge like in SQL
→ concat vs merge
+ 1 more lesson

Quick Visualizations
→ Direct Pandas plot
→ Matplotlib for DS
+ 1 more lesson

🏁 Final project (+ 2 chapters along the way)
→ You leave with a concrete and demonstrable project

Duplicates and Inconsistencies

NOTE (Objective): Learn to detect and eliminate exact or approximate duplicates, and to standardize character strings (case, spaces, accents) that silently poison your analyses.

Learning Objectives

TIP (At the end of this module): You will be able to detect duplicate rows, decide which copy to keep, normalize text columns, and correct classic inconsistencies such as « Paris », « paris », « PARIS ».

Why Duplicates Are Dangerous

A duplicate means the same reality counted twice. Consequences: inflated revenue, biased averages, overfitted models. A duplicate can come from a repeated import, a poorly made join, a form submitted twice…

WARNING (True story): A company calculated its revenue from an Excel file concatenated each month, without dropping duplicates. Its revenue had been overestimated by 12% for two years.
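To make the effect concrete, here is a minimal sketch with made-up order data (the `order_id` and `amount` columns are illustrative and not taken from the course). The same monthly file is concatenated twice, and the revenue total stays inflated until the duplicates are dropped.

```python
import pandas as pd

# Hypothetical monthly export: three orders
january = pd.DataFrame({
    "order_id": [101, 102, 103],
    "amount": [250.0, 100.0, 150.0],
})

# The same file is concatenated twice by mistake
sales = pd.concat([january, january], ignore_index=True)

inflated_revenue = sales["amount"].sum()                # every order counted twice
true_revenue = sales.drop_duplicates()["amount"].sum()  # every order counted once

print(inflated_revenue, true_revenue)
```

Nothing in the inflated total looks wrong on its own, which is why a duplicate check belongs before any aggregation.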

Detecting Duplicates

output

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 2, 4, 1],
    "nom": ["Alice", "Bob", "Chloe", "Bob", "David", "Alice"],
    "ville": ["Paris", "Lyon", "Nice", "Lyon", "Lille", "Paris"]
})

# Boolean mask: True if the row is a duplicate
df.duplicated()

# How many duplicates in total?
df.duplicated().sum()

# Display only the duplicates
df[df.duplicated()]

# Duplicates based only on certain columns
df.duplicated(subset=["id"])
df.duplicated(subset=["nom", "ville"])

NOTE (By default): duplicated() marks rows True starting from the 2nd occurrence. The first instance is considered the original.
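duplicated() also accepts a keep argument. As a small sketch on the same sample data, keep=False flags every member of a duplicate group, which is handy when you want to inspect originals and copies side by side before deciding what to delete.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 2, 4, 1],
    "nom": ["Alice", "Bob", "Chloe", "Bob", "David", "Alice"],
    "ville": ["Paris", "Lyon", "Nice", "Lyon", "Lille", "Paris"],
})

default_mask = df.duplicated()               # the first occurrence is NOT flagged
all_copies_mask = df.duplicated(keep=False)  # every row of a duplicate group is flagged

# Show originals and copies together, grouped by id
duplicate_groups = df[all_copies_mask].sort_values("id")
print(duplicate_groups)
```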

Removing Duplicates

output

# Keep the first occurrence (default)
df.drop_duplicates()

# Keep the last occurrence
df.drop_duplicates(keep="last")

# Remove all duplicate occurrences
df.drop_duplicates(keep=False)

# On certain columns only (useful if a column is a timestamp)
df.drop_duplicates(subset=["id"])

# Inplace to modify the DataFrame directly
df.drop_duplicates(inplace=True)

TIP (Choosing keep): "first" to keep the oldest record, "last" for the most recent (often the right choice if the data has been updated), False to delete every copy and investigate.
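One caveat: "first" and "last" refer to row order, not to time. They only mean "oldest" and "most recent" if the rows are sorted chronologically. The sketch below assumes a hypothetical `updated_at` column and sorts on it before deduplicating.

```python
import pandas as pd

records = pd.DataFrame({
    "id": [1, 2, 1],
    "email": ["alice@new.com", "bob@mail.com", "alice@old.com"],
    "updated_at": pd.to_datetime(["2025-03-01", "2025-01-15", "2025-01-10"]),
})

latest = (
    records.sort_values("updated_at")                     # oldest rows first
           .drop_duplicates(subset=["id"], keep="last")   # keep the newest row per id
           .sort_values("id")
           .reset_index(drop=True)
)
print(latest)
```

Without the sort, keep="last" would have kept alice@old.com simply because it appears last in the file.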

The Approximate Duplicates Trap

« Paris » and « paris » are different for Pandas, even if they are the same city to you. Before looking for duplicates, normalize.

output

df = pd.DataFrame({
    "email": ["Alice@MAIL.com", " alice@mail.com ", "bob@mail.com"],
    "ville": ["Paris", "PARIS", "Lyon"]
})

# Normalize emails (spaces + lowercase)
df["email"] = df["email"].str.strip().str.lower()
df["ville"] = df["ville"].str.strip().str.title()

# NOW we can detect the real duplicates
df.duplicated(subset=["email"])
print(df)

`.str` Toolbox

The .str accessor gives access to vectorized versions of most Python string methods:

output

s = pd.Series(["  Alice DUPONT  ", "bob martin", "CHLOE.LEROY"])

s.str.strip()                  # remove spaces
s.str.lower()                  # lowercase
s.str.upper()                  # UPPERCASE
s.str.title()                  # Title Case
s.str.replace(".", " ", regex=False)  # literal replacement (regex=False: "." is not a wildcard)
s.str.contains("alice", case=False)  # filter
s.str.startswith("A")
s.str.len()                    # number of characters
s.str.split(" ", expand=True)  # split into columns

NOTE (Accents tip): To normalize accents (é → e), use the standard-library unicodedata module or the third-party unidecode library: df["ville"].apply(unidecode), after from unidecode import unidecode.
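For readers who prefer to stay with the standard library, here is one possible sketch based on unicodedata. The helper name `strip_accents` is ours, not a Pandas or course function. It decomposes each character and drops the combining marks, so it handles accents but not ligatures such as « œ ».

```python
import unicodedata

import pandas as pd


def strip_accents(text: str) -> str:
    """Remove diacritics: decompose characters (NFKD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


villes = pd.Series(["Orléans", "Nîmes", "Besançon", "Paris"])
villes_clean = villes.map(strip_accents)
print(villes_clean.tolist())
```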

Categorical Inconsistencies

Often the same category is entered in different ways: « H », « Homme », « M », « Masculin »… Pandas sees 4 distinct categories.

output

df = pd.DataFrame({
    "sexe": ["H", "Homme", "M", "Masculin", "F", "Femme", "f"]
})

# Before: see unique values
print(df["sexe"].value_counts())

# Strategy: mapping dictionary
mapping = {
    "H": "Homme", "M": "Homme", "Masculin": "Homme", "Homme": "Homme",
    "F": "Femme", "f": "Femme", "Femme": "Femme"
}
df["sexe"] = df["sexe"].map(mapping)

print(df["sexe"].value_counts())

TIP (Golden reflex): Before any categorical analysis, run df["col"].value_counts(). Inconsistencies jump out immediately.
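A point worth checking with the dictionary strategy: map() turns every value that is absent from the dictionary into NaN, without any warning. The sketch below (the value "X" and the `sexe_clean` column are illustrative) normalizes case and spaces first, so fewer keys are needed, and then lists whatever was left unmapped.

```python
import pandas as pd

df = pd.DataFrame({"sexe": ["H", "Homme", " f ", "Masculin", "X"]})

mapping = {
    "h": "Homme", "m": "Homme", "homme": "Homme", "masculin": "Homme",
    "f": "Femme", "femme": "Femme",
}

# Normalize first so that one key covers "F", "f" and " f "
normalized = df["sexe"].str.strip().str.lower()
df["sexe_clean"] = normalized.map(mapping)

# Values missing from the dictionary became NaN: list them instead of losing them silently
unmapped = df.loc[df["sexe_clean"].isna(), "sexe"].unique().tolist()
print(unmapped)
```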

Mini-project: Clean an Employee Database

output

import pandas as pd

df = pd.DataFrame({
    "id":    [1, 2, 3, 2, 4, 5],
    "nom":   ["Alice", " Bob ", "Chloe", "BOB", "David", "emma"],
    "poste": ["Dev", "DEV", "PM", "dev", "Pm", "Designer"],
    "ville": ["Paris", "paris", "Lyon", "PARIS", "Lille", "Lyon"]
})

# 1. Normalize strings
for col in ["nom", "poste", "ville"]:
    df[col] = df[col].str.strip().str.title()

# 2. Detect duplicates on id
print("Duplicates:", df.duplicated(subset=["id"]).sum())

# 3. Keep the last record (most up-to-date)
df = df.drop_duplicates(subset=["id"], keep="last")

# 4. Verify
print(df)
print(df["poste"].value_counts())

Missing Values (NaN)

NOTE (Objective): Understand what a missing value is in Pandas, how to detect it, how to analyze it, and how to choose the right strategy: delete or impute (fill). No clean analysis is possible without this step.

Learning Objectives

TIP (At the end of this module): You will know how to count NaNs per column, delete incomplete rows, or fill them intelligently (mean, median, fixed value, forward/backward fill). You will also know when each strategy is appropriate.

What Is a NaN?

NaN = « Not a Number ». This is how Pandas (inherited from NumPy) represents a missing value in a numeric column. For dates it is NaT; for generic objects, None. Pandas treats all three uniformly via isnull().

output

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "nom": ["Alice", "Bob", None, "David"],
    "age": [30, np.nan, 25, 35],
    "date": pd.to_datetime(["2025-01-01", None, "2025-01-03", "2025-01-04"])
})
print(df)

WARNING (Classic pitfall): NaN != NaN in pure Python (by design). Therefore x == np.nan is always false. Always use isnull() or isna() to test.
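The pitfall is easy to reproduce. In this short sketch, the equality test misses the missing value entirely, while isna() finds it.

```python
import numpy as np
import pandas as pd

x = np.nan

equal_to_itself = (x == x)      # False: NaN never equals anything, itself included
detected_by_isna = pd.isna(x)   # True: the reliable test

ages = pd.Series([30, np.nan, 25])
wrong_mask = ages == np.nan     # all False, the NaN is missed
right_mask = ages.isna()        # True only where the value is missing

print(wrong_mask.tolist(), right_mask.tolist())
```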

Detecting Missing Values

output

# Boolean mask cell by cell
df.isnull()                # True if NaN
df.notna()                 # inverse

# Count per column
df.isnull().sum()

# Percentage of NaN per column
(df.isnull().sum() / len(df) * 100).round(2)

# Total in the entire DataFrame
df.isnull().sum().sum()

# Rows with at least one NaN
df[df.isnull().any(axis=1)]

TIP (Golden reflex): As soon as a dataset arrives, run df.isnull().sum(). In 2 seconds you have your data quality dashboard.

Strategy 1 — Delete (dropna)

When NaNs are rare or the analysis requires complete data:

output

# Delete any row containing at least 1 NaN
df.dropna()

# Delete only if ALL values are NaN
df.dropna(how="all")

# Delete if NaN in a specific column
df.dropna(subset=["age"])

# Keep a row if it has at least 3 non-NaN values
df.dropna(thresh=3)

# Delete an entire column that is too incomplete
df.dropna(axis=1, thresh=100)   # drop columns with fewer than 100 non-NaN values

WARNING (When to avoid dropna?): On a 100,000-row dataset where each column has a few % of NaNs, dropna can delete most of the rows, because a single NaN anywhere in a row is enough to drop it. Check first: df.dropna().shape vs df.shape.
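The arithmetic behind this warning: with 20 columns that each miss about 5% of their values independently, a row is complete with probability 0.95^20, roughly 36%. The sketch below checks this on synthetic random data (the sizes and the seed are arbitrary choices for the illustration).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Synthetic data: 1,000 rows x 20 columns, about 5% of cells missing at random
values = rng.normal(size=(1000, 20))
values[rng.random(size=values.shape) < 0.05] = np.nan
df = pd.DataFrame(values)

worst_column_nan_share = df.isnull().mean().max()  # each column stays close to 5%
rows_kept = len(df.dropna()) / len(df)             # share of fully complete rows

print(f"worst column: {worst_column_nan_share:.1%} NaN, rows kept by dropna(): {rows_kept:.1%}")
```

Every column looks healthy on its own, yet a plain dropna() throws away well over half of the rows.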

Strategy 2 — Fill (fillna)

With a fixed value

output

df["age"].fillna(0)
df["ville"].fillna("Unknown")

# On the entire DataFrame with a dict per column
df.fillna({"age": 0, "ville": "Unknown", "salary": 2000})

With a column statistic

output

df["age"].fillna(df["age"].mean())     # mean
df["age"].fillna(df["age"].median())   # median (robust)
df["ville"].fillna(df["ville"].mode()[0])  # most frequent value
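Why call the median "robust"? A single extreme value shifts the mean a lot and the median very little. Here is a small sketch with made-up ages, one of which is a data-entry error (350 instead of 35).

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one data-entry error (350 instead of 35)
age = pd.Series([30, np.nan, 25, 350, 28])

mean_value = age.mean()      # (30 + 25 + 350 + 28) / 4 = 108.25, dragged up by the outlier
median_value = age.median()  # middle of 25, 28, 30, 350 = 29.0

filled_with_mean = age.fillna(mean_value)
filled_with_median = age.fillna(median_value)
print(filled_with_mean[1], filled_with_median[1])
```

Imputing with the mean here would invent a 108-year-old. The median gives a plausible 29.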

By propagation (useful for time series)

output

# Fill with the previous value (forward fill)
df["price"].ffill()

# With the next value (backward fill)
df["price"].bfill()

# Limit the number of consecutive fills
df["price"].ffill(limit=3)
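To see the three variants side by side, here is a sketch on a small daily price series (the dates and values are invented for the illustration).

```python
import numpy as np
import pandas as pd

price = pd.Series(
    [10.0, np.nan, np.nan, np.nan, 12.0, np.nan],
    index=pd.date_range("2025-01-01", periods=6, freq="D"),
)

forward = price.ffill()         # every gap takes the last known price
backward = price.bfill()        # every gap takes the next known price
limited = price.ffill(limit=2)  # at most 2 consecutive fills, the 3rd stays NaN

print(pd.DataFrame({"raw": price, "ffill": forward, "bfill": backward, "limit=2": limited}))
```

Note that bfill leaves the final value missing, because there is no later observation to copy from.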

Data Types and Conversion

NOTE (Objective): Understand Pandas types (int, float, object, datetime, category, bool), why they matter enormously, and how to convert cleanly without crashing on dirty values.

Learning Objectives

TIP (At the end of this module): You will know how to inspect dtypes, convert to number/date/category, use errors="coerce" for dirty values, and reduce a DataFrame's memory footprint by switching object → category.

Why Type Matters

A « 42 » stored as text (object) cannot be summed or compared numerically. A date stored as text does not allow .dt.year. Wrong type = silent bugs or loss of functionality.

output

import pandas as pd

df = pd.DataFrame({
    "price": ["12.50", "8.99", "15.00"],
    "date": ["2025-01-01", "2025-02-15", "2025-03-10"]
})

print(df.dtypes)
# price    object  <-- text!
# date    object  <-- text!

df["price"].sum()   # concatenates "12.508.9915.00"!

WARNING (Classic pitfall): object = « anything goes », most often text. If you see object on a column that should be numeric, there is a problem to solve before any analysis.
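Sorting and comparisons are another place where text-typed numbers go wrong without raising any error. In this sketch the same three values are handled once as text and once as integers (the object dtype is forced explicitly so the example behaves the same across Pandas versions).

```python
import pandas as pd

as_text = pd.Series(["9", "10", "100"], dtype=object)
as_number = as_text.astype(int)

text_max = as_text.max()      # "9": strings are compared character by character
number_max = as_number.max()  # 100

text_sorted = as_text.sort_values().tolist()      # ["10", "100", "9"]
number_sorted = as_number.sort_values().tolist()  # [9, 10, 100]

print(text_max, number_max)
print(text_sorted, number_sorted)
```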

Pandas Types at a Glance

Dtype	Usage	Example
`int64`	Integer	Ages, quantities
`float64`	Float	Prices, temperatures
`bool`	True/False	Active, paid
`object`	Generic (often str)	Names, emails
`datetime64[ns]`	Date/time	Order, birth
`timedelta64`	Duration	Difference between 2 dates
`category`	Repeated categories	Gender, city, status

Conversion with `astype`

output

# Simple conversion: raises an error if a value does not fit
df["age"] = df["age"].astype(int)
df["price"] = df["price"].astype(float)
df["active"] = df["active"].astype(bool)   # caution: never fails, any non-empty string becomes True
df["gender"] = df["gender"].astype("category")

# On multiple columns at once
df = df.astype({"age": int, "price": float, "city": "category"})

WARNING (Limit): astype(int) raises an error as soon as a value does not convert (e.g. "12 EUR"). To handle these cases, use pd.to_numeric.
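The two failure modes of astype are worth seeing once. In this sketch (sample values invented, object dtype forced for consistency across versions), astype(int) fails loudly on "12 EUR", while astype(bool) fails silently: the text "False" is a non-empty string, so it becomes True.

```python
import pandas as pd

raw = pd.Series(["12", "7", "12 EUR"], dtype=object)

try:
    raw.astype(int)
    conversion_failed = False
except ValueError as exc:
    conversion_failed = True
    print("astype(int) failed:", exc)

# astype(bool) does not fail, but every non-empty string is truthy
flags = pd.Series(["True", "False", ""], dtype=object).astype(bool)
print(flags.tolist())
```

For Yes/No or True/False text, an explicit map() with a dictionary is the safer route.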

`pd.to_numeric` — the tolerant one

output

s = pd.Series(["12.5", "8.99", "oops", "15"])

# Strict mode: crash if a value is not numeric
pd.to_numeric(s)   # ValueError

# Tolerant mode: dirty values -> NaN
pd.to_numeric(s, errors="coerce")
# 0    12.50
# 1     8.99
# 2      NaN  <-- "oops" becomes NaN
# 3    15.00

# Choose the subtype to save memory
pd.to_numeric(s, errors="coerce", downcast="float")
pd.to_numeric(s, errors="coerce", downcast="integer")

TIP (Clean workflow): errors="coerce" → count generated NaNs → decide what to do (impute, delete, flag).
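The "count generated NaNs" step can be sketched as follows. The idea is to compare missingness before and after the conversion, so that values that were already missing are not confused with values that coerce just destroyed (the sample data is invented).

```python
import pandas as pd

raw = pd.Series(["12.5", "8.99", "oops", None, "15"], dtype=object)
converted = pd.to_numeric(raw, errors="coerce")

# NaNs created by the conversion = missing after, but not missing before
newly_missing = converted.isna() & raw.notna()

print("NaNs generated by coerce:", int(newly_missing.sum()))
print("Offending raw values:", raw[newly_missing].tolist())
```

Looking at the offending raw values usually tells you whether to fix them upstream, impute them, or drop them.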

`pd.to_datetime` — all dates

output

# Parses common formats; several formats in one call need format="mixed" (pandas 2.0+)
pd.to_datetime(["2025-01-15", "15/01/2025", "Jan 15, 2025"], format="mixed")

# Explicit format (faster, safer)
pd.to_datetime(df["date"], format="%d/%m/%Y")

# Tolerant to dirty values
pd.to_datetime(df["date"], errors="coerce")

# Once datetime, access components
df["date"] = pd.to_datetime(df["date"])
df["year"]    = df["date"].dt.year
df["month"]     = df["date"].dt.month
df["day_name"] = df["date"].dt.day_name()

NOTE (Explicit format): On large volumes, specifying format= can speed up parsing considerably, because Pandas no longer has to guess the format.
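Speed aside, an explicit format also removes ambiguity. In this small sketch, the same text is read as 3 January when Pandas guesses (month first by default) and as 1 March when the dd/mm/yyyy format is stated.

```python
import pandas as pd

text = "01/03/2025"  # meant as 1 March 2025 in a dd/mm/yyyy export

guessed = pd.to_datetime(text)                      # month-first by default
explicit = pd.to_datetime(text, format="%d/%m/%Y")  # day-first, as intended

print(guessed.date(), explicit.date())
```

No error is raised in either case, so the wrong reading can go unnoticed until the monthly totals look odd.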

`category` Type: Memory Magic

For a « city » column repeated 1 million times, storing each string wastes space. category stores each unique value once and references an integer index.

output

import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris"] * 1_000_000
})

print(df.memory_usage(deep=True).sum() / 1e6, "MB")
# About 200 MB

df["city"] = df["city"].astype("category")
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
# About 3 MB (one 1-byte code per row), a gain of roughly 60x; exact figures vary with the pandas version

TIP (When to use category?): When the number of unique values is small compared to the total number of rows (typically < 5%). Otherwise, little gain.
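The rule of thumb can be turned into a quick check. The sketch below uses a hypothetical helper, `unique_ratio`, and invented sample columns. It converts only the columns whose share of distinct values falls under the 5% threshold quoted above.

```python
import pandas as pd


def unique_ratio(column: pd.Series) -> float:
    """Share of distinct values in a column (between 0.0 and 1.0)."""
    return column.nunique() / len(column)


df = pd.DataFrame({
    "city": pd.Series(["Paris", "Lyon", "Nice", "Lille"] * 250, dtype=object),
    "email": pd.Series([f"user{i}@mail.com" for i in range(1000)], dtype=object),
})

ratios = {name: unique_ratio(df[name]) for name in df.columns}

# Convert only the columns under the 5% rule of thumb
for name, ratio in ratios.items():
    if ratio < 0.05:
        df[name] = df[name].astype("category")

print(ratios)
print(df.dtypes)
```

Here city (4 distinct values out of 1,000 rows) is converted, while email (all values distinct) is left alone, since a category would store every string anyway.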

Mini-project: Clean an Accounting Export

output

import pandas as pd

df = pd.DataFrame({
    "amount":    ["125,50", "99,00", "N/A", "42,75"],
    "date":       ["01/03/2025", "02/03/2025", "oops", "04/03/2025"],
    "category":  ["Purchase", "Sale", "Purchase", "Sale"],
    "paid":       ["Yes", "No", "Yes", "Yes"]
})

# 1. Amount: comma -> dot, then numeric
df["amount"] = df["amount"].str.replace(",", ".")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# 2. Date in dd/mm/yyyy format
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y", errors="coerce")

# 3. Category -> category type
df["category"] = df["category"].astype("category")

# 4. Yes/No -> bool
df["paid"] = df["paid"].map({"Yes": True, "No": False})

print(df.dtypes)
print(df)

go-further

This article covers the most useful snippets — the complete Python Data Science course (11 chapters, 36 lessons, corrected exercises and final project) takes you all the way.


FAQ

How long does it take to learn Python Data Science?

With a structured progression (11 chapters, 36 short and practical lessons), you reach an operational level in a few weeks at 30 to 60 minutes per day. The key is to practice each concept immediately.

Are there any prerequisites?

Basic computer knowledge is enough. If you can use a terminal and read simple code, you are ready.

Where to start concretely?

Reproduce the commands in this article, then follow the complete Python Data Science course: it chains the 36 lessons in order, with exercises and a final project.
