Get Started with Python Scikit-Learn: Your First Concrete Step Today

Python scikit-learn: the essentials in one article — real code, diagrams and concrete steps, excerpts from a 33-lesson course.

REHOUMA Haythem

12 Jun 2026 • 11 min read

The best way to learn Python scikit-learn is by doing. This article gives you a head start with practical excerpts from a 33-lesson course, enough to get your first result today.

tl;dr

Introduction and installation
Machine Learning basics
Data preprocessing
Supervised regression
Supervised classification

~$ cat ./parcours.md # Python scikit-learn, 10 chapters

Introduction and installation

→ What is machine learning?
→ Install scikit-learn and the ecosystem
+ 1 more lesson

Basics of Machine Learning

→ ML vocabulary: features, labels, overfitting
→ The uniform scikit-learn API
+ 1 more lesson

Data Preprocessing

→ Missing values and imputation
→ Encoding categorical variables
+ 2 more lessons

Supervised Regression

→ Linear regression and regularization
→ Trees and RandomForest for regression
+ 1 more lesson

Supervised Classification

→ Logistic regression
→ k-NN and SVM
+ 2 more lessons

Unsupervised Learning

→ Clustering: K-Means, DBSCAN, Agglomerative
→ Choosing the optimal K
+ 1 more lesson

Dimensionality Reduction

→ PCA: principal components
→ t-SNE and UMAP: non-linear visualization
+ 1 more lesson

Validation and hyperparameters

→ Cross-validation
→ GridSearchCV and RandomizedSearchCV
+ 1 more lesson

🏁

Final project (+ 2 chapters along the way)

→ You leave with a concrete, demonstrable project

First model in 10 lines (Iris)

NOTE (Objective): Build your first end-to-end machine learning model in about 10 lines of code, on the classic Iris dataset.

The Iris dataset

The dataset contains 150 iris flowers split evenly across 3 species (setosa, versicolor, virginica), each described by 4 measured features: sepal length, sepal width, petal length and petal width.

from sklearn.datasets import load_iris

iris = load_iris()
print("Features :", iris.feature_names)
print("Classes :", iris.target_names)
print("Shape X :", iris.data.shape)   # (150, 4)
print("Shape y :", iris.target.shape) # (150,)
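Before training anything, it can help to check how the classes are distributed and what scale the features live on. The following is a minimal sketch using only `numpy` and `load_iris`; the variable names `counts` and `means` are illustrative and not part of the course code.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count how many samples belong to each class (labels are 0, 1, 2).
counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, counts):
    print(f"{name:<12} {count}")

# Per-feature mean, to get a feel for the scale of each measurement.
means = iris.data.mean(axis=0)
for feature, mean in zip(iris.feature_names, means):
    print(f"{feature:<20} {mean:.2f}")
```

Each species appears 50 times, which is why plain accuracy is a reasonable metric on this dataset.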

The complete code (10 lines)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"Accuracy : {score:.2%}")  # expect a score of 95% or higher

That's it. You have just built a classification model that distinguishes the 3 iris species, typically with 95% accuracy or better on this dataset.

Line-by-line breakdown

Line	Role
`load_iris(return_X_y=True)`	Loads X (features) and y (labels) directly
`train_test_split(...)`	Splits 80% train / 20% test
`random_state=42`	Reproducibility (always the same split)
`KNeighborsClassifier(...)`	Algorithm choice: k-NN with k=5
`model.fit(...)`	Training: memorizes the examples
`model.score(...)`	Evaluates accuracy on the test set
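To see what the split produces, the sketch below prints the shapes of both subsets. It also passes `stratify=y`, which is an addition not used in the article's code: it asks `train_test_split` to keep the class proportions identical in the train and test sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y (not in the original snippet) preserves class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train:", X_train.shape, "Test:", X_test.shape)  # (120, 4) and (30, 4)
print("Test class counts:", np.bincount(y_test))       # [10 10 10]
```

Without `stratify`, the test set can end up with slightly uneven class counts, which matters more on small or imbalanced datasets than on Iris.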

Make a prediction on a new flower

import numpy as np

# `iris` is the object returned by load_iris() in the first snippet
# A flower: sepal 5.1 x 3.5, petal 1.4 x 0.2
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = model.predict(new_flower)

print("Predicted :", iris.target_names[prediction[0]])
# Predicted : setosa

# Probabilities for each class
proba = model.predict_proba(new_flower)
print("Probabilities :", proba)
# Probabilities : [[1. 0. 0.]]  -> 100% setosa
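With the default uniform weights, the probabilities of a k-NN classifier are simply the share of votes among the k nearest training points. A minimal sketch, assuming the same split and model as above, uses the `kneighbors` method to show which neighbors voted; the variable `neighbor_labels` is introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

# kneighbors returns the distances to, and the training-set indices of,
# the 5 closest training samples.
distances, indices = model.kneighbors(new_flower)
neighbor_labels = y_train[indices[0]]

for dist, label in zip(distances[0], neighbor_labels):
    print(f"distance={dist:.3f}  class={iris.target_names[label]}")
```

All five neighbors are setosa, which is why the predicted probability for that class is 1.0.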

Visualize the result

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# True labels
axes[0].scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap="viridis")
axes[0].set_title("True classes")
axes[0].set_xlabel("Sepal length")
axes[0].set_ylabel("Sepal width")

# Predictions
predictions = model.predict(X_test)
axes[1].scatter(X_test[:, 0], X_test[:, 1], c=predictions, cmap="viridis")
axes[1].set_title("Model predictions")
axes[1].set_xlabel("Sepal length")

plt.tight_layout()
plt.show()

The fit / predict / score pattern

TIP (Uniform API): This structure (fit → predict → score) is the same for all scikit-learn supervised estimators. Once you know it, you can try 30 algorithms by changing only the import and the constructor call.

# Same code, different model:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

# Same code, yet another one:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
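Because every estimator exposes the same methods, the comparison can be written as a loop. This is a small sketch under the same split as the 10-line example; the `models` and `scores` dictionaries are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Three different algorithms, one identical training/evaluation loop.
models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name:<15} {scores[name]:.2%}")
```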

Good practices from the start

WARNING (Pitfall): On Iris, almost every algorithm scores very well (around 97%). It is a toy dataset. On real data you will struggle more, and that is normal!
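One habit worth adopting early is to avoid trusting a single train/test split. The sketch below uses `cross_val_score` with 5 folds, a technique the course covers in its validation chapter; applying it here to the k-NN model is an illustration, not the author's code.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# Train and evaluate 5 times, each time holding out a different fifth of the data.
scores = cross_val_score(model, X, y, cv=5)

print("Fold accuracies:", scores.round(3))
print(f"Mean: {scores.mean():.3f}  Std: {scores.std():.3f}")
```

The spread between folds gives a rough idea of how much a single score can vary by chance.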

Practical challenge

TIP (Exercise): Reuse the 10-line code and:

Try 3 different algorithms (KNN, DecisionTree, RandomForest)
Vary random_state and observe whether the score changes
Try k=1 then k=10 in KNeighborsClassifier

Save a model for production

NOTE (Objective): Persist a trained model and reload it later (API, batch, monitoring).

joblib: the recommended method

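import joblib

# `pipe` is any fitted estimator or Pipeline; `X_new` holds new samples
# with the same columns as the training data.

# Save
joblib.dump(pipe, "model.joblib")

# Load
loaded_pipe = joblib.load("model.joblib")
preds = loaded_pipe.predict(X_new)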
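A quick way to convince yourself that persistence works is a round trip: save, reload, and compare predictions. The sketch below builds a hypothetical `pipe` (a scaler followed by logistic regression, chosen only for illustration) and writes to a temporary directory so nothing is left on disk.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hypothetical pipeline standing in for the `pipe` used in the article.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model.joblib"
    joblib.dump(pipe, path)          # save the fitted pipeline
    loaded_pipe = joblib.load(path)  # reload it as a new object

# The reloaded pipeline should predict exactly like the original one.
same = np.array_equal(pipe.predict(X), loaded_pipe.predict(X))
print("Identical predictions after reload:", same)
```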

With compression

# compress: 0 (no compression) to 9 (max compression, slowest)
joblib.dump(pipe, "model.joblib.gz", compress=3)
loaded = joblib.load("model.joblib.gz")

pickle (alternative)

import pickle

with open("model.pkl", "wb") as f:
    pickle.dump(pipe, f)

with open("model.pkl", "rb") as f:
    pipe = pickle.load(f)

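WARNING (joblib vs pickle): joblib is more efficient for objects that hold large NumPy arrays, which is typical of fitted scikit-learn models. pickle ships with the Python standard library, so it needs no extra dependency. Both formats can execute arbitrary code when loaded, so only load files you trust.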

Production: complete packaging

import joblib
import sklearn
import numpy as np

# Save model + metadata
# (`pipe`, `feature_names` and `target_names` come from your training script)
artifact = {
    "model": pipe,
    "feature_names": feature_names,
    "target_names": target_names,
    "sklearn_version": sklearn.__version__,
    "numpy_version": np.__version__,
    "training_date": "2026-05-15",
    "metrics": {"f1": 0.94, "auc": 0.97}
}
joblib.dump(artifact, "model_v1.joblib")
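Storing the library version only pays off if it is checked at load time. The following sketch shows one possible way to do that; `load_artifact` is a hypothetical helper, and the artifact here is a reduced version of the dictionary above, built with a k-NN model for the sake of a runnable example.

```python
import tempfile
import warnings
from pathlib import Path

import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier


def load_artifact(path):
    """Hypothetical helper: load an artifact dict and warn on a version mismatch."""
    artifact = joblib.load(path)
    saved = artifact["sklearn_version"]
    if saved != sklearn.__version__:
        warnings.warn(
            f"Model trained with scikit-learn {saved}, "
            f"but {sklearn.__version__} is installed"
        )
    return artifact


iris = load_iris()
model = KNeighborsClassifier(n_neighbors=5).fit(iris.data, iris.target)
artifact = {
    "model": model,
    "feature_names": iris.feature_names,
    "target_names": list(iris.target_names),
    "sklearn_version": sklearn.__version__,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model_v1.joblib"
    joblib.dump(artifact, path)
    restored = load_artifact(path)

print(restored["feature_names"])
```

A warning is a lenient choice; a stricter service might refuse to start when the versions differ.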

Minimal FastAPI

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
artifact = joblib.load("model_v1.joblib")
model = artifact["model"]

class Input(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(data: Input):
    pred = model.predict([data.features])
    proba = model.predict_proba([data.features])[0].tolist()
    return {"prediction": int(pred[0]), "probabilities": proba}
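The endpoint above accepts a list of any length, so a malformed payload only fails once it reaches the model. One possible guard, sketched below without FastAPI so it runs on its own, compares the payload length with the fitted model's `n_features_in_` attribute. `validate_features` is a hypothetical helper name; in a real endpoint you would likely turn the `ValueError` into an HTTP error response.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)


def validate_features(features, model):
    """Hypothetical helper: reject payloads with the wrong number of features."""
    expected = model.n_features_in_  # set by scikit-learn during fit
    if len(features) != expected:
        raise ValueError(f"Expected {expected} features, got {len(features)}")
    return [float(value) for value in features]


# A well-formed payload passes through and can be sent to the model.
row = validate_features([5.1, 3.5, 1.4, 0.2], model)
print(model.predict([row]))

# A payload with missing values is rejected before reaching the model.
try:
    validate_features([5.1, 3.5], model)
except ValueError as error:
    print("Rejected:", error)
```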

Version with MLflow

import mlflow
import mlflow.sklearn

with mlflow.start_run():
    pipe.fit(X_tr, y_tr)
    
    # Log params, metrics, model
    mlflow.log_params({"max_depth": 10, "n_estimators": 200})
    mlflow.log_metric("f1", 0.94)
    mlflow.sklearn.log_model(pipe, "model", registered_model_name="my_model")

What is machine learning?

NOTE (Objective): Understand what machine learning is, distinguish the 3 main families (supervised / unsupervised / reinforcement), and know when to use ML rather than classical logic.

Pragmatic definition

Machine learning consists of building a program that learns to perform a task from examples rather than from hand-written rules.

TIP (Analogy): Teaching a child to recognize a cat: you do not give them a rule ("a cat has 4 legs, whiskers..."), you show them photos of cats until they can generalize.

When to use ML?

Problem	Approach
Convert Celsius to Fahrenheit	Classical formula (no ML)
Predict the price of a house	Supervised ML (regression)
Detect spam	Supervised ML (classification)
Group similar customers	Unsupervised ML (clustering)
Recognize a face	Deep learning (a subfield of ML)
Beat a human at chess	Reinforcement learning
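The first row of the table can be turned into a small experiment that contrasts rules with examples. In the sketch below, a linear regression is given only a few Celsius/Fahrenheit pairs and recovers the coefficients of the known formula. This is purely illustrative: when the formula is known, writing it by hand remains the right choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# A handful of example temperatures; the model never sees the formula itself.
celsius = np.array([[-40.0], [0.0], [20.0], [37.0], [100.0]])
fahrenheit = celsius.ravel() * 9 / 5 + 32

model = LinearRegression().fit(celsius, fahrenheit)
slope = model.coef_[0]
intercept = model.intercept_

print(f"Learned: F = {slope:.2f} * C + {intercept:.2f}")
# Learned: F = 1.80 * C + 32.00
```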

The 3 ML families

1. Supervised

We have data with labels (X, y).

Goal: predict y from X.

2. Unsupervised

We have data without labels (X only).

Goal: discover hidden structures.

3. Reinforcement

An agent learns by trial and error in an environment.

Goal: maximize a reward.

Concrete examples by family

Supervised — Regression

# X: living area, number of rooms, neighborhood...
# y: sale price (number)
# Goal: predict the price of a new house
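As a minimal sketch of this regression setup, the code below fits a `LinearRegression` on synthetic data. The pricing rule (3000 per square metre plus 10000 per room, with random noise) is an assumption invented only to generate the data; it is not taken from any real market.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic houses: living area in m2 and number of rooms.
surface = rng.uniform(30, 150, size=200)
rooms = rng.integers(1, 6, size=200)

# Assumed pricing rule, used only to generate illustrative data.
price = 3000 * surface + 10000 * rooms + rng.normal(0, 5000, size=200)

X = np.column_stack([surface, rooms])  # features
y = price                              # numeric target

model = LinearRegression().fit(X, y)

# Predict the price of a new house: 80 m2, 3 rooms.
predicted = model.predict([[80, 3]])[0]
print(f"Predicted price for 80 m2, 3 rooms: {predicted:,.0f}")
```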

Supervised — Classification

# X: email content
# y: "spam" or "not-spam" (category)
# Goal: classify a new email
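The same idea can be sketched for text. The example below turns emails into word counts with `CountVectorizer` and trains a `MultinomialNB` classifier. Both the choice of Naive Bayes and the eight toy emails are assumptions made for illustration; a real spam filter needs far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hand-written corpus, for illustration only.
emails = [
    "win free money now",
    "free prize claim now",
    "cheap pills free offer",
    "claim your free money",
    "meeting at noon tomorrow",
    "project report attached",
    "lunch with the team tomorrow",
    "please review the report",
]
labels = ["spam"] * 4 + ["not-spam"] * 4

# Word counts as features (X), then a Naive Bayes classifier on top.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(emails, labels)

new_emails = ["free money prize", "team meeting report"]
predicted = classifier.predict(new_emails)
for text, label in zip(new_emails, predicted):
    print(f"{text!r} -> {label}")
```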

Unsupervised — Clustering

# X: customer purchases (amount, frequency...)
# No y
# Goal: group into 3-5 customer segments
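A minimal clustering sketch follows, using `KMeans` on synthetic customers. The three groups (small, frequent and big spenders) and their parameters are invented for this example. The features are standardized first because amounts and frequencies live on very different scales.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic customers described by (average amount, purchases per month).
low = rng.normal([20, 2], [5, 0.5], size=(50, 2))
mid = rng.normal([100, 10], [10, 1], size=(50, 2))
high = rng.normal([400, 4], [30, 1], size=(50, 2))
X = np.vstack([low, mid, high])  # no labels are given to the model

# Put both features on a comparable scale before measuring distances.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
segments = kmeans.fit_predict(X_scaled)

print("Customers per segment:", np.bincount(segments))
```

The segment numbers themselves are arbitrary; interpreting them (for example by looking at each cluster's average amount) is left to the analyst.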

The typical workflow of an ML project

WARNING (Common pitfall): In practice, around 80% of the time on an ML project goes into steps 2-4 (data), not into training. That is the reality behind the Hollywood image.

Why scikit-learn?

go-further

This article covers the most useful excerpts. The complete Python scikit-learn course (10 chapters, 33 lessons, corrected exercises and a final project) takes you all the way.

FAQ

How long does it take to learn Python scikit-learn?

With a structured progression (10 chapters, 33 short and practical lessons), you reach an operational level in a few weeks at 30 to 60 minutes per day. The important thing is to practice each concept immediately.

Are there any prerequisites?

Basic computer knowledge is sufficient. If you know how to use a terminal and read simple code, you are ready.

Where to start concretely?

Reproduce the code from this article, then follow the complete Python scikit-learn course: it chains the 33 lessons in order, with exercises and a final project.

./further-reading

→ Get started in Machine Learning for Beginners: your first concrete step today
→ Machine Learning Simplified in practice: the code and commands that really matter
→ Python Automatic Learning: the 9 key steps to go from zero to operational
