Python Data Science: 9 Key Steps to Go from Zero to Operational

Python Data Science: The Essentials in One Article — Real Code, Diagrams and Concrete Steps, Extracts from a 36-Lesson Course.

Everyone can learn Python Data Science — provided they follow the steps in the right order. We have condensed a complete 36-lesson course into a clear learning path, with the most useful code snippets.

tl;dr
  • Introduction and Installation
  • Essential NumPy
  • Pandas Series and DataFrames
  • Reading and Writing Data
  • Data Cleaning
~$ cat ./parcours.md # Python Data Science: 10 chapters
01 Introduction and Installation
  → Course presentation
  → Install the Data Science environment
  + 1 more lesson
02 Essential NumPy
  → NumPy arrays vs Python lists
  → Vectorized operations
  + 1 more lesson
03 Pandas Series and DataFrames
  → Create a Series
  → Create a DataFrame
  + 2 more lessons
04 Reading and Writing Data
  → CSV with read_csv and to_csv
  → Excel, JSON and Parquet
  + 1 more lesson
05 Data Cleaning
  → Missing values (NaN)
  → Duplicates and inconsistencies
  + 1 more lesson
06 Aggregations and groupby
  → Basic groupby
  → agg on multiple columns
  + 1 more lesson
07 Joins and Merges
  → merge like in SQL
  → concat vs merge
  + 1 more lesson
08 Quick Visualizations
  → Direct Pandas plot
  → Matplotlib for DS
  + 1 more lesson
🏁 Final project (+ 2 chapters along the way)
  → You leave with a concrete and demonstrable project

Duplicates and Inconsistencies

NOTE (Objective): Learn to detect and eliminate exact or approximate duplicates, and to standardize character strings (case, spaces, accents) that silently poison your analyses.

Learning Objectives

TIP (At the end of this module): You will be able to detect duplicate rows, decide which copy to keep, normalize text columns, and correct classic inconsistencies such as « Paris », « paris », « PARIS ».

Why Duplicates Are Dangerous

A duplicate means the same reality counted twice. Consequences: inflated revenue, biased averages, overfitted models. A duplicate can come from a repeated import, a poorly made join, a form submitted twice…

WARNING (True story): A company calculated its revenue from an Excel file concatenated each month, without dropping duplicates. Its revenue had been overestimated by 12% for two years.

Detecting Duplicates

import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 2, 4, 1],
    "nom": ["Alice", "Bob", "Chloe", "Bob", "David", "Alice"],
    "ville": ["Paris", "Lyon", "Nice", "Lyon", "Lille", "Paris"]
})

# Boolean mask: True if the row is a duplicate
df.duplicated()

# How many duplicates in total?
df.duplicated().sum()

# Display only the duplicates
df[df.duplicated()]

# Duplicates based only on certain columns
df.duplicated(subset=["id"])
df.duplicated(subset=["nom", "ville"])

NOTE (By default): duplicated() marks True starting from the 2nd occurrence. The first instance is considered the original.
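
Because only the later copies are flagged by default, the first occurrence never appears when you filter on the mask. The following is a minimal sketch, reusing the sample data above, of how keep=False flags every member of a duplicate group so that all copies can be compared side by side.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 2, 4, 1],
    "nom": ["Alice", "Bob", "Chloe", "Bob", "David", "Alice"],
    "ville": ["Paris", "Lyon", "Nice", "Lyon", "Lille", "Paris"],
})

# Default (keep="first"): only the 2nd and later copies are flagged
later_copies = df[df.duplicated()]

# keep=False: every row that belongs to a duplicate group is flagged
all_copies = df[df.duplicated(keep=False)].sort_values("id")

print(len(later_copies))  # 2
print(all_copies)         # 4 rows: both Alice rows and both Bob rows
```

Seeing the full group is usually the safer starting point before deciding which copy to drop.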

Removing Duplicates

# Keep the first occurrence (default)
df.drop_duplicates()

# Keep the last occurrence
df.drop_duplicates(keep="last")

# Remove all duplicate occurrences
df.drop_duplicates(keep=False)

# On certain columns only (useful if a column is a timestamp)
df.drop_duplicates(subset=["id"])

# Inplace to modify the DataFrame directly
df.drop_duplicates(inplace=True)

TIP (Choosing keep): "first" keeps the oldest record and "last" the most recent (often the right choice if the data has been updated), assuming rows are in chronological order; False removes every copy so you can investigate.
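
Note that "first" and "last" refer to row order, not to time. Here is a small sketch of one way to make "last" really mean "most recent": sort on a timestamp before dropping. The updated_at column and the data are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: "updated_at" is an assumed column name
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "ville": ["Paris", "Lyon", "Marseille", "Nice"],
    "updated_at": pd.to_datetime(
        ["2025-01-10", "2025-03-01", "2025-01-05", "2025-02-20"]
    ),
})

# Sort chronologically so that keep="last" keeps the newest record per id
latest = (
    df.sort_values("updated_at")
      .drop_duplicates(subset=["id"], keep="last")
      .sort_values("id")
      .reset_index(drop=True)
)
print(latest)
```

Without the sort, keep="last" would retain the Marseille row for id 2, even though the Lyon row is more recent.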

The Approximate Duplicates Trap

« Paris » and « paris » are different for Pandas, even if they are the same city to you. Before looking for duplicates, normalize.

df = pd.DataFrame({
    "email": ["Alice@MAIL.com", " alice@mail.com ", "bob@mail.com"],
    "ville": ["Paris", "PARIS", "Lyon"]
})

# Normalize emails (strip spaces + lowercase)
df["email"] = df["email"].str.strip().str.lower()
# Normalize city names (strip spaces + Title Case)
df["ville"] = df["ville"].str.strip().str.title()

# NOW we can detect the real duplicates
print(df.duplicated(subset=["email"]))
print(df)

.str Toolbox

The .str accessor exposes vectorized versions of most Python string methods:

s = pd.Series(["  Alice DUPONT  ", "bob martin", "CHLOE.LEROY"])

s.str.strip()                  # remove spaces
s.str.lower()                  # lowercase
s.str.upper()                  # UPPERCASE
s.str.title()                  # Title Case
s.str.replace(".", " ", regex=False)  # literal replacement (no regex)
s.str.contains("alice", case=False)  # filter
s.str.startswith("A")
s.str.len()                    # number of characters
s.str.split(" ", expand=True)  # split into columns

NOTE (Accents tip): To normalize accents (é → e), use the standard unicodedata module or the third-party unidecode library: df["ville"].apply(unidecode).
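
For the standard-library route, one possible approach is to decompose each character and drop the combining marks. The sketch below assumes the column contains only strings (no missing values); strip_accents is a helper name made up for this example.

```python
import unicodedata

import pandas as pd


def strip_accents(text: str) -> str:
    """Remove diacritics: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


villes = pd.Series(["Orléans", "Besançon", "Nîmes", "Lyon"])
cleaned = villes.apply(strip_accents)
print(cleaned.tolist())  # ['Orleans', 'Besancon', 'Nimes', 'Lyon']
```

If the column may contain NaN, fill or filter those values first, since the helper expects a str.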

Categorical Inconsistencies

Often the same category is entered in different ways: « H », « Homme », « M », « Masculin »… Pandas sees 4 distinct categories.

df = pd.DataFrame({
    "sexe": ["H", "Homme", "M", "Masculin", "F", "Femme", "f"]
})

# Before: see unique values
print(df["sexe"].value_counts())

# Strategy: mapping dictionary
mapping = {
    "H": "Homme", "M": "Homme", "Masculin": "Homme",
    "F": "Femme", "f": "Femme", "Femme": "Femme"
}
df["sexe"] = df["sexe"].map(mapping)

print(df["sexe"].value_counts())

TIP (Golden reflex): Before any categorical analysis, run df["col"].value_counts(). Inconsistencies jump out immediately.
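
One caveat with the dictionary strategy: map() returns NaN for any value that is not a key of the mapping. The following sketch, with sample values invented for the example, normalizes case and spaces first so the dictionary stays short, then lists the values that were not recognized.

```python
import pandas as pd

df = pd.DataFrame({"sexe": ["H", "Homme", "M", "F", "femme ", "?"]})

mapping = {
    "h": "Homme", "m": "Homme", "homme": "Homme", "masculin": "Homme",
    "f": "Femme", "femme": "Femme",
}

# Normalize first so the dictionary only needs lowercase keys
normalized = df["sexe"].str.strip().str.lower()
df["sexe_clean"] = normalized.map(mapping)

# Values absent from the mapping come back as NaN: list them for review
unmapped = df.loc[df["sexe_clean"].isna(), "sexe"].unique().tolist()
print(unmapped)  # ['?']
```

Writing to a new column (here sexe_clean) keeps the raw values available until the mapping is complete.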

Mini-project: Clean an Employee Database

import pandas as pd

df = pd.DataFrame({
    "id":    [1, 2, 3, 2, 4, 5],
    "nom":   ["Alice", " Bob ", "Chloe", "BOB", "David", "emma"],
    "poste": ["Dev", "DEV", "PM", "dev", "Pm", "Designer"],
    "ville": ["Paris", "paris", "Lyon", "PARIS", "Lille", "Lyon"]
})

# 1. Normalize strings
for col in ["nom", "poste", "ville"]:
    df[col] = df[col].str.strip().str.title()

# 2. Detect duplicates on id
print("Duplicates:", df.duplicated(subset=["id"]).sum())

# 3. Keep the last record (most up-to-date)
df = df.drop_duplicates(subset=["id"], keep="last")

# 4. Verify
print(df)
print(df["poste"].value_counts())

Missing Values (NaN)

NOTE (Objective): Understand what a missing value is in Pandas, how to detect it, how to analyze it, and choose the right strategy: delete or impute (fill). No clean analysis is possible without this step.

Learning Objectives

TIP (At the end of this module): You will know how to count NaNs per column, delete incomplete rows, or fill them intelligently (mean, median, fixed value, forward/backward fill). You will also know when each strategy is appropriate.

What Is a NaN?

NaN = « Not a Number ». This is how Pandas (inherited from NumPy) represents a missing value in a numeric column. For dates it is NaT; for generic objects, None. Pandas treats all three uniformly via isnull().

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "nom": ["Alice", "Bob", None, "David"],
    "age": [30, np.nan, 25, 35],
    "date": pd.to_datetime(["2025-01-01", None, "2025-01-03", "2025-01-04"])
})
print(df)

WARNING (Classic pitfall): NaN != NaN, by design of floating-point arithmetic. Therefore x == np.nan is always False. Always use isnull() or isna() to test.
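
A short demonstration of the pitfall, as a sketch: the equality test finds nothing, while isna() finds the missing value.

```python
import numpy as np
import pandas as pd

x = np.nan

print(x == np.nan)   # False: NaN never compares equal, even to itself
print(x != x)        # True
print(pd.isna(x))    # True: the reliable test

ages = pd.Series([30, np.nan, 25])
wrong = (ages == np.nan).sum()   # 0: the comparison matches no row
right = ages.isna().sum()        # 1
print(wrong, right)
```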

Detecting Missing Values

# Boolean mask cell by cell
df.isnull()                # True if NaN
df.notna()                 # inverse

# Count per column
df.isnull().sum()

# Percentage of NaN per column
(df.isnull().sum() / len(df) * 100).round(2)

# Total in the entire DataFrame
df.isnull().sum().sum()

# Rows with at least one NaN
df[df.isnull().any(axis=1)]

TIP (Golden reflex): As soon as a dataset arrives, run df.isnull().sum(). In 2 seconds you have your data quality dashboard.
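
The count and the percentage can be combined into a single summary table. This is only a sketch on the sample DataFrame from this section; the column names missing and percent are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "nom": ["Alice", "Bob", None, "David"],
    "age": [30, np.nan, 25, 35],
    "date": pd.to_datetime(["2025-01-01", None, "2025-01-03", "2025-01-04"]),
})

# isna().mean() is the share of missing values per column
report = pd.DataFrame({
    "missing": df.isna().sum(),
    "percent": (df.isna().mean() * 100).round(2),
}).sort_values("missing", ascending=False)
print(report)
```

Each of the three columns has exactly one missing value out of four rows, so the report shows 1 and 25.0 on every line.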

Strategy 1 — Delete (dropna)

When NaNs are rare or the analysis requires complete data:

# Delete any row containing at least 1 NaN
df.dropna()

# Delete only if ALL values are NaN
df.dropna(how="all")

# Delete if NaN in a specific column
df.dropna(subset=["age"])

# Keep a row if it has at least 3 non-NaN values
df.dropna(thresh=3)

# Delete an entire column that is too incomplete
df.dropna(axis=1, thresh=100)   # drops columns with fewer than 100 non-NaN values

WARNING (When to avoid dropna?): On a 100,000-row dataset with many columns that each have a few percent of NaNs, dropna can delete most of the rows! Check first: df.dropna().shape vs df.shape.
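
To see why the loss adds up so quickly, here is a sketch on synthetic data (the column names c0 to c19 and the 5% rate are assumptions for the example). With 20 independent columns at 5% NaN each, a row survives with probability 0.95^20, roughly 36%.

```python
import numpy as np
import pandas as pd

# Synthetic data: 20 columns, each with about 5% of NaN placed at random
rng = np.random.default_rng(0)
values = rng.normal(size=(10_000, 20))
mask = rng.random(values.shape) < 0.05
values[mask] = np.nan
df = pd.DataFrame(values, columns=[f"c{i}" for i in range(20)])

# Measure the impact BEFORE committing to dropna
kept = len(df.dropna())
print(f"rows kept: {kept} / {len(df)} ({kept / len(df):.0%})")
```

When the share of surviving rows is this low, dropna(subset=[...]) on the columns the analysis really needs, or imputation, is usually a better fit.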

Strategy 2 — Fill (fillna)

With a fixed value

df["age"].fillna(0)
df["ville"].fillna("Unknown")

# On the entire DataFrame with a dict per column
df.fillna({"age": 0, "ville": "Unknown", "salary": 2000})

With a column statistic

df["age"].fillna(df["age"].mean())     # mean
df["age"].fillna(df["age"].median())   # median (robust)
df["ville"].fillna(df["ville"].mode()[0])  # most frequent value

By propagation (useful for time series)

# Fill with the previous value (forward fill)
df["price"].ffill()

# With the next value (backward fill)
df["price"].bfill()

# Limit the number of consecutive fills
df["price"].ffill(limit=3)
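
The effect of limit is easiest to see on a small series. In this sketch (the dates and prices are made up), three consecutive values are missing: an unrestricted forward fill repeats the last known price across the whole gap, while limit=2 leaves the third gap empty.

```python
import numpy as np
import pandas as pd

price = pd.Series(
    [10.0, np.nan, np.nan, np.nan, 12.0],
    index=pd.date_range("2025-01-01", periods=5, freq="D"),
)

filled_all = price.ffill()
filled_limited = price.ffill(limit=2)

print(filled_all.tolist())          # [10.0, 10.0, 10.0, 10.0, 12.0]
print(filled_limited.isna().sum())  # 1: the 3rd consecutive gap stays NaN
```

Like fillna, ffill and bfill return a new object: assign the result back to the column to keep it.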

Data Types and Conversion

NOTE (Objective): Understand Pandas types (int, float, object, datetime, category, bool), why they matter enormously, and how to convert cleanly without crashing on dirty values.

Learning Objectives

TIP (At the end of this module): You will know how to inspect dtypes, convert to number/date/category, use errors="coerce" for dirty values, and reduce a DataFrame's memory footprint by switching object → category.

Why Type Matters

A « 42 » stored as text (object) cannot be summed or compared numerically. A date stored as text does not allow .dt.year. Wrong type = silent bugs or loss of functionality.

import pandas as pd

df = pd.DataFrame({
    "price": ["12.50", "8.99", "15.00"],
    "date": ["2025-01-01", "2025-02-15", "2025-03-10"]
})

print(df.dtypes)
# price    object  <-- text!
# date    object  <-- text!

df["price"].sum()   # concatenates "12.508.9915.00"!

WARNING (Classic pitfall): object = « anything goes », most often text. If you see object on a column that should be numeric, there is a problem to solve before any analysis.

Pandas Types at a Glance

Dtype            Usage                 Example
int64            Integer               Ages, quantities
float64          Float                 Prices, temperatures
bool             True/False            Active, paid
object           Generic (often str)   Names, emails
datetime64[ns]   Date/time             Order, birth
timedelta64      Duration              Difference between 2 dates
category         Repeated categories   Gender, city, status

Conversion with astype

# Simple conversion, fails if a value does not fit
df["age"] = df["age"].astype(int)
df["price"] = df["price"].astype(float)
df["active"] = df["active"].astype(bool)
df["gender"] = df["gender"].astype("category")

# On multiple columns at once
df = df.astype({"age": int, "price": float, "city": "category"})

WARNING (Limit): astype(int) raises an error as soon as a value does not convert (e.g. "12 EUR"). To handle these cases, use pd.to_numeric.
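
A sketch of the failure and of one possible workaround: coerce the unparseable values to NaN, then cast to the nullable "Int64" dtype (capital I), which can hold integers and missing values together. The sample values are invented for the example.

```python
import pandas as pd

s = pd.Series(["12", "7", "12 EUR", None])

try:
    s.astype(int)
except (ValueError, TypeError) as exc:
    print("astype(int) failed:", type(exc).__name__)

# Coerce unparseable values to NaN, then use the nullable integer dtype
clean = pd.to_numeric(s, errors="coerce").astype("Int64")
print(clean.tolist())
print(clean.dtype)  # Int64
```

The two values that could not be converted end up as missing values instead of stopping the script.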

pd.to_numeric — the tolerant one

s = pd.Series(["12.5", "8.99", "oops", "15"])

# Strict mode: crash if a value is not numeric
pd.to_numeric(s)   # ValueError

# Tolerant mode: dirty values -> NaN
pd.to_numeric(s, errors="coerce")
# 0    12.50
# 1     8.99
# 2      NaN  <-- "oops" becomes NaN
# 3    15.00

# Choose the subtype to save memory
pd.to_numeric(s, errors="coerce", downcast="float")
pd.to_numeric(s, errors="coerce", downcast="integer")

TIP (Clean workflow): errors="coerce" → count generated NaNs → decide what to do (impute, delete, flag).
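
The "count generated NaNs" step can be sketched as follows: a value was damaged by the conversion if it is NaN afterwards but was not missing before. The sample series is invented for the example.

```python
import pandas as pd

s = pd.Series(["12.5", "8.99", "oops", None, "15"])

converted = pd.to_numeric(s, errors="coerce")

# NaNs created by the conversion = NaN after, but not missing before
newly_nan = converted.isna() & s.notna()
print(newly_nan.sum())        # 1
print(s[newly_nan].tolist())  # ['oops']
```

Inspecting the offending raw values is often what tells you whether to impute, delete or flag them.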

pd.to_datetime — all dates

# Mixed formats are parsed one by one with format="mixed" (pandas >= 2.0; slower)
pd.to_datetime(["2025-01-15", "15/01/2025", "Jan 15, 2025"], format="mixed")

# Explicit format (faster, safer)
pd.to_datetime(df["date"], format="%d/%m/%Y")

# Tolerant to dirty values
pd.to_datetime(df["date"], errors="coerce")

# Once datetime, access components
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_name"] = df["date"].dt.day_name()

NOTE (Explicit format): On large volumes, specifying format= can speed up parsing considerably (up to about 100x in favorable cases), because Pandas no longer has to guess.
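
Speed is not the only benefit: an explicit format also removes the day/month ambiguity. A minimal sketch with a single date written the European way:

```python
import pandas as pd

raw = "01/03/2025"  # meant as 1 March 2025 (dd/mm/yyyy)

guessed = pd.to_datetime(raw)                      # parsed month-first
explicit = pd.to_datetime(raw, format="%d/%m/%Y")  # parsed day-first

print(guessed.date())   # 2025-01-03
print(explicit.date())  # 2025-03-01
```

No error is raised in the first case, which is exactly what makes this kind of bug hard to spot.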

category Type: Memory Magic

For a « city » column repeated 1 million times, storing each string wastes space. category stores each unique value once and references an integer index.

import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris"] * 1_000_000
})

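print(df.memory_usage(deep=True).sum() / 1e6, "MB")
# On the order of 180 MB as object dtype (varies with Python/pandas versions)

df["city"] = df["city"].astype("category")
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
# On the order of 3 MB: one small integer code per row, labels stored once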

TIP (When to use category?): When the number of unique values is small compared to the total number of rows (typically < 5%). Otherwise, little gain.
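
That rule of thumb can be checked programmatically. The sketch below uses invented data and takes the 5% figure from the tip above as its threshold; it computes the share of unique values per column and converts only the columns under the threshold.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Nice", "Lille"] * 250,     # 4 unique values
    "email": [f"user{i}@mail.com" for i in range(1000)],  # all unique
})

# Share of unique values per column (near 0 = very repetitive, 1 = all distinct)
ratio = df.nunique() / len(df)
candidates = ratio[ratio < 0.05].index.tolist()
print(candidates)  # ['city']

# Convert only the repetitive columns
df = df.astype({col: "category" for col in candidates})
print(df.dtypes)
```

On a real dataset, restrict the check to text columns, since numeric columns with few distinct values would also pass the threshold.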

Mini-project: Clean an Accounting Export

import pandas as pd

df = pd.DataFrame({
    "amount":    ["125,50", "99,00", "N/A", "42,75"],
    "date":       ["01/03/2025", "02/03/2025", "oops", "04/03/2025"],
    "category":  ["Purchase", "Sale", "Purchase", "Sale"],
    "paid":       ["Yes", "No", "Yes", "Yes"]
})

# 1. Amount: comma -> dot, then numeric ("N/A" becomes NaN)
df["amount"] = df["amount"].str.replace(",", ".", regex=False)
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")

# 2. Date in dd/mm/yyyy format ("oops" becomes NaT)
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y", errors="coerce")

# 3. Category -> category type
df["category"] = df["category"].astype("category")

# 4. Yes/No -> bool
df["paid"] = df["paid"].map({"Yes": True, "No": False})

print(df.dtypes)
print(df)

This article covers the most useful snippets — the complete Python Data Science course (11 chapters, 36 lessons, corrected exercises and final project) takes you all the way.


FAQ

How long does it take to learn Python Data Science?
With a structured progression (11 chapters, 36 short and practical lessons), you reach an operational level in a few weeks at 30 to 60 minutes per day. The key is to practice each concept immediately.
Are there any prerequisites?
Basic computer knowledge is enough. If you can use a terminal and read simple code, you are ready.
Where to start concretely?
Reproduce the commands in this article, then follow the complete Python Data Science course: it chains the 36 lessons in order, with exercises and a final project.
