Python Data Science: 9 Key Steps to Go from Zero to Operational
Python Data Science: The Essentials in One Article — Real Code, Diagrams and Concrete Steps, Extracts from a 36-Lesson Course.
Everyone can learn Python Data Science — provided they follow the steps in the right order. We have condensed a complete 36-lesson course into a clear learning path, with the most useful code snippets.
- Introduction and Installation
- Essential NumPy
- Pandas Series and DataFrames
- Reading and Writing Data
- Data Cleaning
Duplicates and Inconsistencies
Learning Objectives
Why Duplicates Are Dangerous
A duplicate means the same reality counted twice. Consequences: inflated revenue, biased averages, overfitted models. A duplicate can come from a repeated import, a poorly made join, a form submitted twice…
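To make the "inflated revenue" consequence concrete, here is a minimal sketch. The `orders` table and its column names are hypothetical, invented for illustration only.

```python
import pandas as pd

# Hypothetical orders table: order 102 was imported twice.
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "amount": [50.0, 200.0, 200.0, 30.0],
})

# The raw sum counts order 102 twice.
revenue_raw = orders["amount"].sum()

# After removing exact duplicate rows, each order is counted once.
revenue_clean = orders.drop_duplicates()["amount"].sum()

print(revenue_raw, revenue_clean)
```

The gap between the two totals is exactly the amount of the repeated order, which is why duplicates should be handled before any aggregation.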
Detecting Duplicates
import pandas as pd
df = pd.DataFrame({
    "id": [1, 2, 3, 2, 4, 1],
    "nom": ["Alice", "Bob", "Chloe", "Bob", "David", "Alice"],
    "ville": ["Paris", "Lyon", "Nice", "Lyon", "Lille", "Paris"]
})
# Boolean mask: True if the row is a duplicate
df.duplicated()
# How many duplicates in total?
df.duplicated().sum()
# Display only the duplicates
df[df.duplicated()]
# Duplicates based only on certain columns
df.duplicated(subset=["id"])
df.duplicated(subset=["nom", "ville"])
duplicated() marks True starting from the 2nd occurrence. The first instance is considered the original.
Removing Duplicates
# Keep the first occurrence (default)
df.drop_duplicates()
# Keep the last occurrence
df.drop_duplicates(keep="last")
# Remove all duplicate occurrences
df.drop_duplicates(keep=False)
# On certain columns only (useful if a column is a timestamp)
df.drop_duplicates(subset=["id"])
# Inplace to modify the DataFrame directly
df.drop_duplicates(inplace=True)
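"First" and "last" refer to row order, not to time. A minimal sketch of keeping the most recent record per id, assuming a hypothetical `updated_at` timestamp column that is not part of the original example:

```python
import pandas as pd

# Hypothetical customer records; "updated_at" is an assumed timestamp column.
records = pd.DataFrame({
    "id": [1, 2, 1],
    "ville": ["Paris", "Lyon", "Nice"],
    "updated_at": pd.to_datetime(["2025-01-01", "2025-01-05", "2025-02-01"]),
})

# Sort chronologically first, so that keep="last" really means "most recent".
latest = (
    records.sort_values("updated_at")
           .drop_duplicates(subset=["id"], keep="last")
           .sort_values("id")
           .reset_index(drop=True)
)
print(latest)
```

Without the initial sort, the row that survives depends only on the order in which the data happened to be loaded.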
keep: "first" keeps the earliest row in the current order (the oldest record if the data is sorted chronologically), "last" keeps the latest one (often the right choice if the data has been updated), False drops every copy so you can investigate.
The Approximate Duplicates Trap
"Paris" and "paris" are different values for Pandas, even if they are the same city to you. Normalize before looking for duplicates.
df = pd.DataFrame({
"email": ["Alice@MAIL.com", " alice@mail.com ", "bob@mail.com"],
"ville": ["Paris", "PARIS", "Lyon"]
})
# Normalize emails (spaces + lowercase)
df["email"] = df["email"].str.strip().str.lower()
df["ville"] = df["ville"].str.strip().str.title()
# NOW we can detect the real duplicates
df.duplicated(subset=["email"])
print(df)
.str Toolbox
The .str accessor gives access to all Python string methods, vectorized:
s = pd.Series([" Alice DUPONT ", "bob martin", "CHLOE.LEROY"])
s.str.strip()                        # remove leading/trailing spaces
s.str.lower()                        # lowercase
s.str.upper()                        # UPPERCASE
s.str.title()                        # Title Case
s.str.replace(".", " ", regex=False) # literal replacement (regex=False: "." is not a wildcard)
s.str.contains("alice", case=False)  # filter, case-insensitive
s.str.startswith("A")                # True if the string starts with "A"
s.str.len()                          # number of characters
s.str.split(" ", expand=True)        # split into columns
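Accented characters are another source of near-duplicates that strip() and lower() do not fix. The sketch below uses only the standard-library unicodedata module; `strip_accents` is a hypothetical helper name, and the city values are made up for illustration.

```python
import unicodedata

import pandas as pd


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


villes = pd.Series(["Orléans", "Orleans", "Nîmes"])
cleaned = villes.apply(strip_accents)
print(cleaned.tolist())
```

After this step the first two values compare equal, so duplicated() can catch them.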
To remove accents, use unicodedata or the unidecode library: df["ville"].apply(unidecode).
Categorical Inconsistencies
Often the same category is entered in different ways: "H", "Homme", "M", "Masculin"… Pandas sees 4 distinct categories.
df = pd.DataFrame({
"sexe": ["H", "Homme", "M", "Masculin", "F", "Femme", "f"]
})
# Before: see unique values
print(df["sexe"].value_counts())
# Strategy: mapping dictionary
mapping = {
"H": "Homme", "M": "Homme", "Masculin": "Homme",
"F": "Femme", "f": "Femme", "Femme": "Femme"
}
df["sexe"] = df["sexe"].map(mapping)
print(df["sexe"].value_counts())
Tip: run df["col"].value_counts() on each categorical column. Inconsistencies jump out immediately.
Mini-project: Clean an Employee Database
import pandas as pd
df = pd.DataFrame({
    "id": [1, 2, 3, 2, 4, 5],
    "nom": ["Alice", " Bob ", "Chloe", "BOB", "David", "emma"],
    "poste": ["Dev", "DEV", "PM", "dev", "Pm", "Designer"],
    "ville": ["Paris", "paris", "Lyon", "PARIS", "Lille", "Lyon"]
})
# 1. Normalize strings
for col in ["nom", "poste", "ville"]:
    df[col] = df[col].str.strip().str.title()
# 2. Detect duplicates on id
print("Duplicates:", df.duplicated(subset=["id"]).sum())
# 3. Keep the last record (most up-to-date)
df = df.drop_duplicates(subset=["id"], keep="last")
# 4. Verify
print(df)
print(df["poste"].value_counts())
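One side effect of str.title() in the mini-project: acronyms such as "PM" come out as "Pm". A possible follow-up, sketched here with an assumed `canonical` dictionary, is a small mapping that falls back to the original value when no entry exists.

```python
import pandas as pd

# Job titles as they look after .str.title() in the mini-project.
poste = pd.Series(["Dev", "Pm", "Designer", "Pm"])

# Assumed canonical labels; extend the dictionary to match your own data.
canonical = {"Pm": "PM"}

# map() returns NaN for values missing from the dictionary,
# so fall back to the original value with fillna().
poste_fixed = poste.map(canonical).fillna(poste)
print(poste_fixed.tolist())
```

The fillna() fallback matters: a bare map() would silently turn every unmapped value ("Dev", "Designer") into NaN.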
Missing Values (NaN)
Learning Objectives
What Is a NaN?
NaN stands for "Not a Number". This is how Pandas (inherited from NumPy) represents a missing value in a numeric column. For dates it is NaT; for generic objects, None. Pandas treats all three uniformly via isnull().
import pandas as pd
import numpy as np
df = pd.DataFrame({
    "nom": ["Alice", "Bob", None, "David"],
    "age": [30, np.nan, 25, 35],
    "date": pd.to_datetime(["2025-01-01", None, "2025-01-03", "2025-01-04"])
})
print(df)
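A short sketch of the two behaviours described in this section: equality tests never find NaN, while isna() flags None, NaN and NaT alike. The `values` Series is illustrative.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, including itself.
nan_equals_itself = np.nan == np.nan
print(nan_equals_itself)

# isna() treats None, NaN and NaT the same way.
values = pd.Series([None, np.nan, pd.NaT, "Alice"], dtype="object")
mask = values.isna()
print(mask.tolist())
```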
NaN != NaN in Python by design (it follows the IEEE 754 floating-point rules). Therefore x == np.nan is always False. Always use isnull() or isna() to test.
Detecting Missing Values
# Boolean mask cell by cell
df.isnull()  # True if NaN
df.notna()   # inverse
# Count per column
df.isnull().sum()
# Percentage of NaN per column
(df.isnull().sum() / len(df) * 100).round(2)
# Total in the entire DataFrame
df.isnull().sum().sum()
# Rows with at least one NaN
df[df.isnull().any(axis=1)]
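The count and percentage above can be combined into a single summary table. This is one possible sketch; `missing_report` is a hypothetical helper name, not a Pandas function.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize missing values per column (hypothetical helper)."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Worst columns first.
    return report.sort_values("missing", ascending=False)


df = pd.DataFrame({
    "nom": ["Alice", "Bob", None, "David"],
    "age": [30, np.nan, np.nan, 35],
})
report = missing_report(df)
print(report)
```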
Start every analysis with df.isnull().sum(). In 2 seconds you have your data quality dashboard.
Strategy 1: Delete (dropna)
When NaNs are rare or the analysis requires complete data:
# Delete any row containing at least 1 NaN
df.dropna()
# Delete only if ALL values are NaN
df.dropna(how="all")
# Delete if NaN in a specific column
df.dropna(subset=["age"])
# Keep a row if it has at least 3 non-NaN values
df.dropna(thresh=3)
# Delete an entire column that is too incomplete
df.dropna(axis=1, thresh=100)  # drops columns with fewer than 100 non-NaN values
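Before committing to a deletion, it helps to measure how many rows it would cost. A minimal sketch on illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "nom": ["Alice", "Bob", None, "David"],
    "age": [30, np.nan, 25, 35],
})

rows_before = len(df)
rows_after = len(df.dropna())  # dropna() returns a copy; df itself is unchanged
lost_pct = (rows_before - rows_after) / rows_before * 100

print(f"dropna() would remove {rows_before - rows_after} of {rows_before} rows ({lost_pct:.0f}%)")
```

If the loss is large, restricting the check with subset= or filling the gaps is usually a better option than a blanket dropna().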
Before deleting, compare df.dropna().shape vs df.shape.
Strategy 2: Fill (fillna)
With a fixed value
# fillna() returns a new object: assign the result back to keep it
df["age"].fillna(0)
df["ville"].fillna("Unknown")
# On the entire DataFrame with a dict per column
df.fillna({"age": 0, "ville": "Unknown", "salary": 2000})
With a column statistic
df["age"].fillna(df["age"].mean())          # mean
df["age"].fillna(df["age"].median())        # median (robust)
df["ville"].fillna(df["ville"].mode()[0])   # most frequent value
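Why the median is called "robust": a single extreme value shifts the mean but barely affects the median. The ages below are hypothetical and include one deliberate data-entry error.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one data-entry error (300) and one missing value.
age = pd.Series([30, 32, 34, 300, np.nan])

filled_mean = age.fillna(age.mean())      # mean is pulled up by the outlier
filled_median = age.fillna(age.median())  # median is not affected by the outlier's size

print(filled_mean.iloc[-1], filled_median.iloc[-1])
```

Here the mean-based fill produces an age of 99, while the median-based fill gives 33, which is much closer to the typical values.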
By propagation (useful for time series)
# Fill with the previous value (forward fill)
df["price"].ffill()
# With the next value (backward fill)
df["price"].bfill()
# Limit the number of consecutive fills
df["price"].ffill(limit=3)
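A small sketch of what limit= does on a daily series. The prices and dates are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices with a gap of three days.
price = pd.Series(
    [10.0, np.nan, np.nan, np.nan, 12.0],
    index=pd.date_range("2025-01-01", periods=5, freq="D"),
)

# Fill at most 2 consecutive gaps; the third missing day stays NaN.
filled = price.ffill(limit=2)
print(filled)
```

Capping the propagation avoids carrying a stale value across a long outage.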
Data Types and Conversion
Learning Objectives
Inspect dtypes, convert to number/date/category, use errors="coerce" for dirty values, and reduce a DataFrame's memory footprint by switching object → category.
Why Type Matters
A "42" stored as text (object) cannot be summed or compared numerically. A date stored as text does not support .dt.year. Wrong type = silent bugs or loss of functionality.
import pandas as pd
df = pd.DataFrame({
    "price": ["12.50", "8.99", "15.00"],
    "date": ["2025-01-01", "2025-02-15", "2025-03-10"]
})
print(df.dtypes)
# price    object  <-- text!
# date     object  <-- text!
df["price"].sum()  # concatenates "12.508.9915.00"!
object = "anything goes", most often text. If you see object on a column that should be numeric, there is a problem to solve before any analysis.
Pandas Types at a Glance
| Dtype | Usage | Example |
|---|---|---|
| int64 | Integer | Ages, quantities |
| float64 | Float | Prices, temperatures |
| bool | True/False | Active, paid |
| object | Generic (often str) | Names, emails |
| datetime64[ns] | Date/time | Order, birth |
| timedelta64 | Duration | Difference between 2 dates |
| category | Repeated categories | Gender, city, status |
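To see several of these dtypes side by side, the sketch below builds a tiny DataFrame with hypothetical column names and prints df.dtypes. The exact datetime and timedelta unit shown (for example [ns]) can vary with the pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [30, 25],                                             # integer
    "price": [12.5, 8.99],                                       # float
    "paid": [True, False],                                       # bool
    "order_date": pd.to_datetime(["2025-01-01", "2025-02-15"]),  # datetime
    "status": pd.Series(["new", "done"], dtype="category"),      # category
})

# Subtracting a timestamp from a datetime column yields a timedelta column.
df["delay"] = df["order_date"] - pd.Timestamp("2024-12-31")

print(df.dtypes)
```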
Conversion with astype
# Simple conversion, fails if a value does not fit
df["age"] = df["age"].astype(int)
df["price"] = df["price"].astype(float)
df["active"] = df["active"].astype(bool)  # careful: on text, any non-empty string (even "False") becomes True
df["gender"] = df["gender"].astype("category")
# On multiple columns at once
df = df.astype({"age": int, "price": float, "city": "category"})
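A minimal sketch of the failure mode of astype(int) on dirty text, using a made-up Series. The error is caught only to show that the whole conversion is rejected, not just the bad value.

```python
import pandas as pd

s = pd.Series(["12", "7", "12 EUR"])

try:
    s.astype(int)
    converted = True
except ValueError as exc:
    # A single dirty value ("12 EUR") makes the whole conversion fail.
    converted = False
    print("astype(int) failed:", exc)
```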
astype(int) crashes as soon as a value does not convert (e.g.: "12 EUR"). To handle these cases, use pd.to_numeric.
pd.to_numeric: the tolerant one
s = pd.Series(["12.5", "8.99", "oops", "15"])
# Strict mode: crash if a value is not numeric
pd.to_numeric(s)  # ValueError
# Tolerant mode: dirty values -> NaN
pd.to_numeric(s, errors="coerce")
# 0    12.50
# 1     8.99
# 2      NaN  <-- "oops" becomes NaN
# 3    15.00
# Choose the subtype to save memory
pd.to_numeric(s, errors="coerce", downcast="float")
pd.to_numeric(s, errors="coerce", downcast="integer")  # no effect here: the values are not whole numbers
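After a tolerant conversion, it is worth checking which raw values were turned into NaN. One way to do it, sketched on illustrative data:

```python
import pandas as pd

raw = pd.Series(["12.5", "8.99", "oops", None, "15"])
numeric = pd.to_numeric(raw, errors="coerce")

# Values that were present in the raw data but became NaN during coercion.
failed = raw[numeric.isna() & raw.notna()]

print("NaN after coercion:", numeric.isna().sum())
print("Rejected values:", failed.tolist())
```

Separating "was already missing" from "could not be parsed" tells you whether the problem is absent data or a formatting issue you can still repair.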
Recommended workflow: errors="coerce" → count generated NaNs → decide what to do (impute, delete, flag).
pd.to_datetime: all dates
# Recognizes most formats automatically
# (a single list mixing several formats needs format="mixed" in pandas >= 2.0)
pd.to_datetime(["2025-01-15", "15/01/2025", "Jan 15, 2025"], format="mixed")
# Explicit format (faster, safer)
pd.to_datetime(df["date"], format="%d/%m/%Y")
# Tolerant to dirty values
pd.to_datetime(df["date"], errors="coerce")
# Once datetime, access components
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_name"] = df["date"].dt.day_name()
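The "safer" part of an explicit format is easiest to see on an ambiguous date. In this sketch the same string is read two ways depending on the format passed:

```python
import pandas as pd

raw = "01/03/2025"

# Day first: 1 March 2025
as_day_first = pd.to_datetime(raw, format="%d/%m/%Y")

# Month first: 3 January 2025
as_month_first = pd.to_datetime(raw, format="%m/%d/%Y")

print(as_day_first.date(), as_month_first.date())
```

Both calls succeed without any warning, so only an explicit format= guarantees that the interpretation matches the source system.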
Passing format= can make parsing much faster (up to 100x in some cases): Pandas no longer has to guess.
category Type: Memory Magic
For a "city" column repeated 1 million times, storing each string wastes space. category stores each unique value once and references an integer index.
import pandas as pd
df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris"] * 1_000_000
})
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
# Roughly 160-190 MB with object-dtype strings (varies with Python and pandas versions)
df["city"] = df["city"].astype("category")
print(df.memory_usage(deep=True).sum() / 1e6, "MB")
# Roughly 3 MB: one small integer code per row plus the two unique strings
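Deciding column by column can be automated with a ratio of unique values to rows. The sketch below is one possible approach: `to_category_if_repetitive` and its `max_ratio` parameter are hypothetical names, and the 5 % default is only a rule of thumb.

```python
import pandas as pd


def to_category_if_repetitive(df: pd.DataFrame, max_ratio: float = 0.05) -> pd.DataFrame:
    """Convert text columns with few unique values to category (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        # Only consider text columns.
        if not pd.api.types.is_string_dtype(out[col]):
            continue
        ratio = out[col].nunique() / len(out)
        if ratio <= max_ratio:
            out[col] = out[col].astype("category")
    return out


df = pd.DataFrame({
    "city": ["Paris", "Lyon"] * 50,                       # 2 unique values in 100 rows
    "email": [f"user{i}@mail.com" for i in range(100)],   # all values unique
})
result = to_category_if_repetitive(df)
print(result.dtypes)
```

The city column is converted, while the email column is left alone because every value is unique and a category would bring no gain.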
When to use category? When the number of unique values is small compared to the total number of rows (typically < 5 %). Otherwise, little gain.
Mini-project: Clean an Accounting Export
import pandas as pd
df = pd.DataFrame({
    "amount": ["125,50", "99,00", "N/A", "42,75"],
    "date": ["01/03/2025", "02/03/2025", "oops", "04/03/2025"],
    "category": ["Purchase", "Sale", "Purchase", "Sale"],
    "paid": ["Yes", "No", "Yes", "Yes"]
})
# 1. Amount: comma -> dot, then numeric
df["amount"] = df["amount"].str.replace(",", ".")
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
# 2. Date in dd/mm/yyyy format
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y", errors="coerce")
# 3. Category -> category type
df["category"] = df["category"].astype("category")
# 4. Yes/No -> bool
df["paid"] = df["paid"].map({"Yes": True, "No": False})
print(df.dtypes)
print(df)
This article covers the most useful snippets — the complete Python Data Science course (11 chapters, 36 lessons, corrected exercises and final project) takes you all the way.
FAQ
How long does it take to learn Python Data Science?
Are there any prerequisites?
Where to start concretely?