Get Started with Python Scikit-Learn: Your First Concrete Step Today

Python scikit-learn: the essentials in one article — real code, diagrams and concrete steps, excerpts from a 33-lesson course.

The best way to learn scikit-learn is by doing. This article gives you a head start with practical excerpts from a 33-lesson course: enough to get your first result today.

tl;dr
  • Introduction and installation
  • Machine Learning basics
  • Data preprocessing
  • Supervised regression
  • Supervised classification
Course outline: Python scikit-learn in 10 chapters

01 Introduction and installation
  → What is machine learning?
  → Install scikit-learn and the ecosystem
  + 1 more lesson
02 Basics of Machine Learning
  → ML vocabulary: features, labels, overfitting
  → The uniform scikit-learn API
  + 1 more lesson
03 Data Preprocessing
  → Missing values and imputation
  → Encoding categorical variables
  + 2 more lessons
04 Supervised Regression
  → Linear regression and regularization
  → Trees and RandomForest for regression
  + 1 more lesson
05 Supervised Classification
  → Logistic regression
  → k-NN and SVM
  + 2 more lessons
06 Unsupervised Learning
  → Clustering: K-Means, DBSCAN, Agglomerative
  → Choosing the optimal K
  + 1 more lesson
07 Dimensionality Reduction
  → PCA: principal components
  → t-SNE and UMAP: non-linear visualization
  + 1 more lesson
08 Validation and hyperparameters
  → Cross-validation
  → GridSearchCV and RandomizedSearchCV
  + 1 more lesson
🏁 Final project (+ 2 chapters along the way)
  → You leave with a concrete and demonstrable project

First model in 10 lines (Iris)

NOTE (Objective): Build your first end-to-end machine learning model in fewer than 10 lines of code, on the classic Iris dataset.

The Iris dataset

150 iris flowers, 50 from each of 3 species (setosa, versicolor, virginica). Each flower has 4 measured features: sepal length, sepal width, petal length and petal width, all in centimeters.

from sklearn.datasets import load_iris

iris = load_iris()
print("Features :", iris.feature_names)
print("Classes :", iris.target_names)
print("Shape X :", iris.data.shape)   # (150, 4)
print("Shape y :", iris.target.shape) # (150,)
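
To look at the data as a table rather than as raw arrays, one option is the `as_frame=True` argument of `load_iris`. The sketch below assumes pandas is installed; the names `iris_frame` and `df` and the extra `species` column are illustrative choices, not part of the original snippet.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
# (requires pandas to be installed).
iris_frame = load_iris(as_frame=True)
df = iris_frame.frame  # 4 feature columns + a "target" column

# Map the integer labels (0, 1, 2) to species names for readability.
df["species"] = df["target"].map(dict(enumerate(iris_frame.target_names)))

print(df.head())
print(df["species"].value_counts())  # 50 flowers per species
```

A balanced dataset like this one (same number of examples per class) is what makes plain accuracy a reasonable first metric.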

The complete code (10 lines)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(f"Accuracy : {score:.2%}")  # expect roughly 97-100% (only 30 test flowers)

That's it. You have just trained a classification model that distinguishes the 3 iris species with roughly 97% accuracy or better on the test set.

Line-by-line breakdown

Line                          Role
load_iris(return_X_y=True)    Loads X (features) and y (labels) directly
train_test_split(...)         Splits 80% train / 20% test
random_state=42               Reproducibility (always the same split)
KNeighborsClassifier(...)     Algorithm choice: k-NN with k=5
model.fit(...)                Training: memorizes the examples
model.score(...)              Evaluates accuracy on the test set
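
The split and the role of `random_state` can be checked directly. The following is a small sketch rather than part of the 10-line script; the `stratify=y` variant at the end is an optional extra that the original code does not use.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split as in the article: 20% of 150 samples -> 30 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)

# Re-running with the same random_state gives exactly the same split.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print((X_test == X_test_again).all())  # True

# Optional: stratify=y keeps the class proportions identical in both subsets.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(Counter(y_test_strat.tolist()))  # 10 test flowers per species
```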

Make a prediction on a new flower

# Assumes `model` (trained above) and `iris = load_iris()` already exist
import numpy as np

# A flower: sepal 5.1 x 3.5, petal 1.4 x 0.2
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = model.predict(new_flower)

print("Predicted :", iris.target_names[prediction[0]])
# > 'setosa'

# Probabilities for each class
proba = model.predict_proba(new_flower)
print("Probabilities :", proba)
# > [[1.0, 0.0, 0.0]]  -> 100% setosa

Visualize the result

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# True labels
axes[0].scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap="viridis")
axes[0].set_title("True classes")
axes[0].set_xlabel("Sepal length")
axes[0].set_ylabel("Sepal width")

# Predictions
predictions = model.predict(X_test)
axes[1].scatter(X_test[:, 0], X_test[:, 1], c=predictions, cmap="viridis")
axes[1].set_title("Model predictions")
axes[1].set_xlabel("Sepal length")
axes[1].set_ylabel("Sepal width")

plt.tight_layout()
plt.show()

The fit / predict / score pattern

TIP (Uniform API): This structure (fit / predict / score) is the same for all scikit-learn supervised models. Once learned, you can try dozens of algorithms by changing only the import and the constructor.
# Same code, different model:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

# Same code, yet another one:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
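
Because the interface is shared, the comparison can also be written as a loop. This is a sketch under the same split as the 10-line script; the `models` and `scores` dictionaries are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Only the constructor differs; fit and score are called identically.
models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name:<14} accuracy: {scores[name]:.2%}")
```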

Good practices from the start

WARNING (Pitfall): On Iris, almost everything works very well (around 97%). It is a toy dataset. In real life, you will struggle more, and that is normal!
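
One habit that helps keep a single lucky score in perspective is cross-validation, which the course covers later. A minimal sketch, assuming 5 folds is an acceptable choice for 150 samples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5)

# 5-fold cross-validation: train on 4 folds, score on the 5th, rotate.
cv_scores = cross_val_score(model, X, y, cv=5)

print("Fold accuracies:", cv_scores.round(3))
print(f"Mean: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The spread between folds gives a rough idea of how much a single train/test score can move with the split.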

Practical challenge

TIP (Exercise): Reuse the 10-line code and:
  1. Try 3 different algorithms (KNN, DecisionTree, RandomForest)
  2. Vary random_state and observe whether the score changes
  3. Try k=1 then k=10 in KNeighborsClassifier

Save a model for production

NOTE (Objective): Persist a trained model and reload it later (API, batch, monitoring).

joblib: the recommended method

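import joblib

# `pipe` is a fitted estimator or Pipeline from an earlier step;
# `X_new` is new data with the same features as the training data.

# Save
joblib.dump(pipe, "model.joblib")

# Load
loaded_pipe = joblib.load("model.joblib")
preds = loaded_pipe.predict(X_new)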
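
The excerpt above relies on a `pipe` object defined elsewhere in the course. Here is a self-contained sketch of the same round trip, assuming a simple pipeline (scaling followed by k-NN) as a stand-in and a temporary directory instead of a real deployment path.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hypothetical `pipe`: scaling followed by k-NN, trained on all the data.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.joblib"
    joblib.dump(pipe, model_path)
    loaded_pipe = joblib.load(model_path)

# The reloaded pipeline should reproduce the original predictions exactly.
same_predictions = np.array_equal(pipe.predict(X), loaded_pipe.predict(X))
print("Identical predictions:", same_predictions)
```

Saving the whole pipeline, rather than the classifier alone, keeps the preprocessing and the model together in one file.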

With compression

# compress: 0 (no compression) to 9 (maximum compression, slowest);
# 3 is a common compromise between file size and speed
joblib.dump(pipe, "model.joblib.gz", compress=3)
loaded = joblib.load("model.joblib.gz")

pickle (alternative)

import pickle

with open("model.pkl", "wb") as f:
    pickle.dump(pipe, f)

with open("model.pkl", "rb") as f:
    pipe = pickle.load(f)
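WARNING (joblib vs pickle): joblib is more efficient for objects that hold large NumPy arrays, which is typical of fitted scikit-learn models. pickle ships with the Python standard library, so it needs no extra dependency. Both rely on the pickle format: only load files you trust, since unpickling can execute arbitrary code, and reload a model with the same scikit-learn version that saved it.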

Production: complete packaging

import joblib
import sklearn
import numpy as np

# Save model + metadata
artifact = {
    "model": pipe,
    "feature_names": feature_names,
    "target_names": target_names,
    "sklearn_version": sklearn.__version__,
    "numpy_version": np.__version__,
    "training_date": "2026-05-15",
    "metrics": {"f1": 0.94, "auc": 0.97}
}
joblib.dump(artifact, "model_v1.joblib")
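
The stored metadata becomes useful at load time. The sketch below shows one way to use it: a hypothetical `load_artifact` helper (not a joblib or scikit-learn function) that warns when the running scikit-learn version differs from the one recorded in the artifact. The demo artifact is a reduced stand-in for the one above.

```python
import tempfile
import warnings
from pathlib import Path

import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier


def load_artifact(path):
    """Load an artifact dict and warn if scikit-learn versions differ."""
    artifact = joblib.load(path)
    saved = artifact["sklearn_version"]
    if saved != sklearn.__version__:
        warnings.warn(
            f"Model saved with scikit-learn {saved}, "
            f"running {sklearn.__version__}; predictions may differ."
        )
    return artifact


# Demo with a throwaway artifact (stand-in for model_v1.joblib).
iris = load_iris()
demo_artifact = {
    "model": KNeighborsClassifier(n_neighbors=5).fit(iris.data, iris.target),
    "feature_names": iris.feature_names,
    "sklearn_version": sklearn.__version__,
}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "model_v1.joblib"
    joblib.dump(demo_artifact, path)
    artifact = load_artifact(path)

print(artifact["feature_names"])
```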

Minimal FastAPI

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
artifact = joblib.load("model_v1.joblib")
model = artifact["model"]

class Input(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(data: Input):
    pred = model.predict([data.features])
    proba = model.predict_proba([data.features])[0].tolist()
    return {"prediction": int(pred[0]), "probabilities": proba}
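
The endpoint above passes `data.features` straight to the model, so a request with the wrong number of values only fails once it reaches scikit-learn. One possible guard is sketched here as a plain function; `validate_features` and `expected_count` are hypothetical names, not part of FastAPI or scikit-learn.

```python
import numpy as np


def validate_features(features, expected_count):
    """Return a (1, n) float array, or raise ValueError on bad input."""
    row = np.asarray(features, dtype=float)
    if row.ndim != 1 or row.shape[0] != expected_count:
        raise ValueError(
            f"Expected {expected_count} features, got shape {row.shape}"
        )
    if not np.isfinite(row).all():
        raise ValueError("Features must be finite numbers (no NaN or inf)")
    return row.reshape(1, -1)


# Usage: 4 features for an Iris-style model.
print(validate_features([5.1, 3.5, 1.4, 0.2], expected_count=4).shape)  # (1, 4)

try:
    validate_features([5.1, 3.5], expected_count=4)
except ValueError as exc:
    print("Rejected:", exc)
```

Inside the endpoint, one option is to call this helper before `model.predict` and convert a `ValueError` into an HTTP 422 response with FastAPI's `HTTPException`. The expected count could come from `len(artifact["feature_names"])`.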

Version with MLflow

import mlflow
import mlflow.sklearn

with mlflow.start_run():
    pipe.fit(X_tr, y_tr)
    
    # Log params, metrics, model
    mlflow.log_params({"max_depth": 10, "n_estimators": 200})
    mlflow.log_metric("f1", 0.94)
    mlflow.sklearn.log_model(pipe, "model", registered_model_name="my_model")

What is machine learning?

NOTE (Objective): Understand what machine learning is, distinguish the 3 main families (supervised / unsupervised / reinforcement), and know when to use ML versus classical logic.

Pragmatic definition

Machine learning consists of building a program that learns to perform a task from examples rather than from rules written by hand.

TIP (Analogy): Teaching a child to recognize a cat: you do not give them a rule ("the cat has 4 legs, whiskers..."), you show them photos of cats until they can generalize.

When to use ML?

Problem                          Approach
Convert Celsius to Fahrenheit    Classical formula (no ML)
Predict the price of a house     Supervised ML (regression)
Detect spam                      Supervised ML (classification)
Group similar customers          Unsupervised ML (clustering)
Recognize a face                 Deep learning (ML variant)
Beat a human at chess            Reinforcement learning

The 3 ML families

1. Supervised

We have data with labels (X, y).

Goal: predict y from X.

2. Unsupervised

We have data without labels (X only).

Goal: discover hidden structures.

3. Reinforcement

An agent learns by trial and error in an environment.

Goal: maximize a reward.

Concrete examples by family

Supervised — Regression

# X: housing surface, number of rooms, neighborhood...
# y: sale price (number)
# Goal: predict the price of a new house
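
As a runnable illustration of this setup, here is a minimal sketch with scikit-learn's `LinearRegression`. The six houses and the pricing rule are made up for the example, so the numbers carry no real-world meaning.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: [surface in m2, number of rooms] -> price.
# Prices follow price = 3000 * surface + 10000 * rooms exactly,
# so a linear model can recover the rule.
X = np.array([[50, 2], [70, 3], [90, 4], [120, 5], [65, 2], [100, 3]])
y = 3000 * X[:, 0] + 10000 * X[:, 1]

model = LinearRegression().fit(X, y)

new_house = np.array([[80, 3]])
predicted_price = model.predict(new_house)[0]
print(f"Predicted price: {predicted_price:,.0f}")  # 270,000
```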

Supervised — Classification

# X: email content
# y: "spam" or "not-spam" (category)
# Goal: classify a new email
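
A tiny spam filter along these lines can be sketched with a bag-of-words representation and Naive Bayes. The six emails are invented for illustration, and this particular pairing (`CountVectorizer` with `MultinomialNB`) is one common choice among many, not necessarily the one used in the course.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up emails, for illustration only.
emails = [
    "win a free prize now",
    "free money click now",
    "claim your free prize",
    "meeting agenda for monday",
    "project report attached",
    "lunch on monday with the team",
]
labels = ["spam", "spam", "spam", "not-spam", "not-spam", "not-spam"]

# CountVectorizer turns text into word counts; Naive Bayes classifies them.
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

predictions = spam_filter.predict(["free prize inside", "monday project meeting"])
print(predictions)  # ['spam' 'not-spam']
```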

Unsupervised — Clustering

# X: customer purchases (amount, frequency...)
# No y
# Goal: group into 3-5 customer segments
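
The same idea in runnable form, as a sketch with K-Means on nine made-up customers. The three groups are deliberately well separated; real purchase data would usually need feature scaling and a more careful choice of the number of segments.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up customers: [average basket amount, purchases per month].
# Three visibly separated groups, chosen for illustration.
X = np.array([
    [20, 1], [25, 2], [22, 1],       # small, occasional buyers
    [60, 8], [65, 9], [58, 10],      # frequent buyers
    [200, 2], [210, 1], [190, 3],    # rare but large baskets
])

# No y here: K-Means only sees X and assigns a segment id to each customer.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
segments = kmeans.fit_predict(X)

print("Segment per customer:", segments)
print("Segment centers:\n", kmeans.cluster_centers_.round(1))
```

The segment ids (0, 1, 2) are arbitrary: what matters is which customers share the same id.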

The typical workflow of an ML project

WARNING (Common pitfall): A common rule of thumb is that about 80% of the time in a real ML project goes to steps 2-4 (the data steps), not to training. This is reality versus the Hollywood image.

Why scikit-learn?


This article covers the most useful excerpts. The complete scikit-learn course (10 chapters, 33 lessons, exercises with solutions and a final project) takes you all the way.


FAQ

How long does it take to learn scikit-learn?
With a structured progression (10 chapters, 33 short and practical lessons), you reach an operational level in a few weeks at 30 to 60 minutes per day. The important thing is to practice each concept immediately.
Are there any prerequisites?
Basic computer knowledge is sufficient. If you know how to use a terminal and read simple code, you are ready.
Where to start concretely?
Run the code from this article, then follow the complete scikit-learn course, which takes you through the 33 lessons in order, with exercises and a final project.
