Get Started with Python Scikit-Learn: Your First Concrete Step Today
Python scikit-learn: the essentials in one article — real code, diagrams and concrete steps, excerpts from a 33-lesson course.
The best way to learn scikit-learn is by doing. This article gives you a head start with practical excerpts from a 33-lesson course, enough to get your first result today.
- Introduction and installation
- Machine Learning basics
- Data preprocessing
- Supervised regression
- Supervised classification
First model in 10 lines (Iris)
The Iris dataset
150 iris flowers divided into 3 species (setosa, versicolor, virginica). 4 measured features: sepal and petal length and width.

    from sklearn.datasets import load_iris

    iris = load_iris()
    print("Features :", iris.feature_names)
    print("Classes :", iris.target_names)
    print("Shape X :", iris.data.shape)    # (150, 4)
    print("Shape y :", iris.target.shape)  # (150,)

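Before training anything, it can help to check how the samples are distributed. The sketch below is an illustrative addition: it counts the samples per species and prints the mean of each feature, using only `load_iris` and NumPy.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count how many samples belong to each class (labels are 0, 1, 2)
counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, counts):
    print(f"{name:<12}{count}")

# Per-feature mean, to get a feel for the scale of each measurement (in cm)
means = iris.data.mean(axis=0)
for feature, mean in zip(iris.feature_names, means):
    print(f"{feature:<20}{mean:.2f}")
```

The dataset is balanced (50 flowers per species), which is one reason plain accuracy is a reasonable metric here.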
The complete code (10 lines)

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train, y_train)

    score = model.score(X_test, y_test)
    print(f"Accuracy : {score:.2%}")  # roughly 97-100%, depending on the split

That's it. You have just created a classification model that distinguishes 3 iris species with roughly 97% accuracy or better.
Line-by-line breakdown
| Line | Role |
|---|---|
| `load_iris(return_X_y=True)` | Loads X (features) and y (labels) directly |
| `train_test_split(...)` | Splits 80% train / 20% test |
| `random_state=42` | Reproducibility (always the same split) |
| `KNeighborsClassifier(...)` | Algorithm choice: k-NN with k=5 |
| `model.fit(...)` | Training: memorizes the examples |
| `model.score(...)` | Evaluates accuracy on the test set |
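
To make the split concrete, the sketch below prints the sizes of the two sets. The `stratify=y` argument is an optional addition that the code above does not use: it asks `train_test_split` to keep the class proportions identical in both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y is an optional extra: it preserves the class balance in each set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(np.bincount(y_test))          # [10 10 10]
```

Without `stratify`, the test set can contain slightly more flowers of one species than another, which makes scores on a 30-sample test set a little noisier.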
Make a prediction on a new flower

    import numpy as np

    # A flower: sepal 5.1 x 3.5, petal 1.4 x 0.2
    new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])
    prediction = model.predict(new_flower)
    print("Predicted :", iris.target_names[prediction[0]])
    # > 'setosa'

    # Probabilities for each class
    proba = model.predict_proba(new_flower)
    print("Probabilities :", proba)
    # > [[1.0, 0.0, 0.0]] -> 100% setosa

Visualize the result

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # True labels
    axes[0].scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap="viridis")
    axes[0].set_title("True classes")
    axes[0].set_xlabel("Sepal length")
    axes[0].set_ylabel("Sepal width")

    # Predictions
    predictions = model.predict(X_test)
    axes[1].scatter(X_test[:, 0], X_test[:, 1], c=predictions, cmap="viridis")
    axes[1].set_title("Model predictions")
    axes[1].set_xlabel("Sepal length")

    plt.tight_layout()
    plt.show()

The fit / predict / score pattern
The pattern (fit → predict → score) is the same for all scikit-learn supervised models. Once you have learned it, you can try many algorithms by changing only the import and the line that creates the model.

    # Same code, different model:
    from sklearn.tree import DecisionTreeClassifier

    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))

    # Same code, yet another one:
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))

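Because every classifier exposes the same methods, the comparison can also be written as a loop. This is a minimal sketch; the `models` dictionary and the `scores` name are illustrative choices, not part of the course code.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical selection of models; any scikit-learn classifier fits this loop
models = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # same call for every model
    scores[name] = model.score(X_test, y_test)  # accuracy on the test set
    print(f"{name:<15}{scores[name]:.2%}")
```

On a test set of only 30 flowers, a difference of one misclassified sample moves the score by about 3 points, so small gaps between models should not be over-interpreted.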
Good practices from the start
Practical challenge
- Try 3 different algorithms (KNN, DecisionTree, RandomForest)
- Vary `random_state` and observe whether the score changes
- Try k=1 then k=10 (the `n_neighbors` parameter) in `KNeighborsClassifier`
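One possible starting point for the challenge is sketched below. It is only a suggestion: the seeds and the values of k are arbitrary, and the `results` dictionary is a hypothetical name used to collect the scores.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

results = {}
for k in (1, 5, 10):           # number of neighbors to try
    for seed in (0, 1, 2):     # a different train/test split for each seed
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        results[(k, seed)] = model.score(X_test, y_test)
        print(f"k={k:<3} seed={seed}  accuracy={results[(k, seed)]:.2%}")
```

If the score moves noticeably from one seed to the next, that variation comes from the split rather than from the model, which is worth keeping in mind when comparing algorithms.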
Save a model for production
joblib: the recommended method
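
    import joblib

    # `pipe` is assumed to be an already fitted estimator or Pipeline
    # Save
    joblib.dump(pipe, "model.joblib")

    # Load
    loaded_pipe = joblib.load("model.joblib")
    # `X_new`: new samples with the same features as the training data
    preds = loaded_pipe.predict(X_new)
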
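The snippet above assumes that `pipe` and `X_new` already exist. A self-contained round trip might look like the sketch below, where the pipeline (a scaler followed by k-NN) and the temporary directory are illustrative choices.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hypothetical pipeline standing in for `pipe`
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
pipe.fit(X, y)

# Save to a temporary directory, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(pipe, path)
    loaded_pipe = joblib.load(path)

# The reloaded pipeline should predict exactly like the original one
X_new = X[:5]
same = np.array_equal(pipe.predict(X_new), loaded_pipe.predict(X_new))
print("Identical predictions:", same)
```

Saving the whole pipeline, rather than the classifier alone, keeps the preprocessing and the model together in one file.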
With compression

    # Level 0 (fast) to 9 (max compression)
    joblib.dump(pipe, "model.joblib.gz", compress=3)
    loaded = joblib.load("model.joblib.gz")

pickle (alternative)

    import pickle

    with open("model.pkl", "wb") as f:
        pickle.dump(pipe, f)

    with open("model.pkl", "rb") as f:
        pipe = pickle.load(f)

Production: complete packaging

    import joblib
    import sklearn
    import numpy as np

    # Save model + metadata
    artifact = {
        "model": pipe,
        "feature_names": feature_names,
        "target_names": target_names,
        "sklearn_version": sklearn.__version__,
        "numpy_version": np.__version__,
        "training_date": "2026-05-15",
        "metrics": {"f1": 0.94, "auc": 0.97},
    }
    joblib.dump(artifact, "model_v1.joblib")

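Storing the library version is only useful if it is checked at load time. The sketch below shows one way to do that; `load_artifact` is a hypothetical helper, and the small k-NN artifact is built here only so the example runs on its own.

```python
import os
import tempfile
import warnings

import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier


def load_artifact(path):
    """Load an artifact dict and warn if it was saved with another scikit-learn version."""
    artifact = joblib.load(path)
    saved_version = artifact.get("sklearn_version")
    if saved_version != sklearn.__version__:
        warnings.warn(
            f"Model saved with scikit-learn {saved_version}, "
            f"now running {sklearn.__version__}; predictions may differ."
        )
    return artifact


# Build and save a small artifact so the example is self-contained
iris = load_iris()
model = KNeighborsClassifier(n_neighbors=5).fit(iris.data, iris.target)
artifact = {
    "model": model,
    "feature_names": iris.feature_names,
    "target_names": iris.target_names.tolist(),
    "sklearn_version": sklearn.__version__,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_v1.joblib")
    joblib.dump(artifact, path)
    loaded = load_artifact(path)

# Use the stored metadata to turn the numeric label into a readable name
label_index = loaded["model"].predict([[5.1, 3.5, 1.4, 0.2]])[0]
predicted_name = loaded["target_names"][label_index]
print(predicted_name)
```

Whether a version mismatch should only warn or should stop the service is a design choice; a warning is used here to keep the sketch short.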
Minimal FastAPI

    from fastapi import FastAPI
    from pydantic import BaseModel
    import joblib

    app = FastAPI()
    artifact = joblib.load("model_v1.joblib")
    model = artifact["model"]

    class Input(BaseModel):
        features: list[float]

    @app.post("/predict")
    def predict(data: Input):
        pred = model.predict([data.features])
        proba = model.predict_proba([data.features])[0].tolist()
        return {"prediction": int(pred[0]), "probabilities": proba}

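The endpoint above accepts a list of any length, so a request with the wrong number of features only fails inside `model.predict`. A possible guard is sketched below; `validate_features` is a hypothetical helper that relies on the `n_features_in_` attribute scikit-learn sets on fitted estimators, and the k-NN model is trained here only to make the example runnable without FastAPI.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier


def validate_features(features, model):
    """Return a one-row 2D list if the feature count matches the model, else raise."""
    expected = model.n_features_in_
    if len(features) != expected:
        raise ValueError(f"expected {expected} features, got {len(features)}")
    return [list(features)]


X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# A valid request: 4 features, as in the training data
row = validate_features([5.1, 3.5, 1.4, 0.2], model)
print(model.predict(row))

# An invalid request: only 2 features
try:
    validate_features([5.1, 3.5], model)
except ValueError as exc:
    error_message = str(exc)
    print("Rejected:", error_message)
```

Inside the FastAPI handler, the `ValueError` could be turned into an HTTP error response so that the client receives a clear message instead of a server error.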
Version with MLflow

    import mlflow
    import mlflow.sklearn

    with mlflow.start_run():
        pipe.fit(X_tr, y_tr)

        # Log params, metrics, model
        mlflow.log_params({"max_depth": 10, "n_estimators": 200})
        mlflow.log_metric("f1", 0.94)
        mlflow.sklearn.log_model(pipe, "model", registered_model_name="my_model")

What is machine learning?
Pragmatic definition
Machine learning consists of building a program that learns to perform a task from examples rather than from hand-written rules.
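The difference between a hand-written rule and a learned one can be shown on a deliberately trivial task. In the sketch below, the Celsius-to-Fahrenheit formula is written by hand, then a `LinearRegression` recovers the same relationship from a handful of examples. The temperature values are arbitrary, and a known formula like this one would never need ML in practice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def fahrenheit_by_rule(celsius):
    """Hand-written rule: the formula is known in advance."""
    return celsius * 9 / 5 + 32


# Examples: inputs X (Celsius) and expected outputs y (Fahrenheit)
celsius = np.array([[-40.0], [0.0], [10.0], [25.0], [37.0], [100.0]])
fahrenheit = fahrenheit_by_rule(celsius).ravel()

# Learned rule: the model estimates slope and intercept from the examples
model = LinearRegression().fit(celsius, fahrenheit)
print(f"Learned slope     : {model.coef_[0]:.2f}")    # 1.80
print(f"Learned intercept : {model.intercept_:.2f}")  # 32.00

learned_68 = model.predict([[20.0]])[0]
print(f"20 °C -> {learned_68:.1f} °F")                # 68.0
```

The model was never given the formula, only input/output pairs, which is the idea behind learning from examples.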
When to use ML?
| Problem | Approach |
|---|---|
| Convert Celsius to Fahrenheit | Classical formula (no ML) |
| Predict the price of a house | Supervised ML (regression) |
| Detect spam | Supervised ML (classification) |
| Group similar customers | Unsupervised ML (clustering) |
| Recognize a face | Deep learning (a subfield of ML) |
| Beat a human at chess | Reinforcement learning |
The 3 ML families
1. Supervised
We have data with labels (X, y).
Goal: predict y from X.
2. Unsupervised
We have data without labels (X only).
Goal: discover hidden structures.
3. Reinforcement
An agent learns by trial and error in an environment.
Goal: maximize a reward.
Concrete examples by family
Supervised — Regression

    # X: housing surface, number of rooms, neighborhood...
    # y: sale price (number)
    # Goal: predict the price of a new house

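As a rough illustration of this regression setup, the sketch below trains a `LinearRegression` on synthetic housing data. Everything about the data is an assumption made for the example: the prices are generated from an invented formula plus noise, not taken from a real market.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): price driven by surface and number of rooms
rng = np.random.default_rng(0)
surface = rng.uniform(30, 200, size=200)   # square meters
rooms = rng.integers(1, 7, size=200)       # 1 to 6 rooms
price = 3000 * surface + 10000 * rooms + rng.normal(0, 20000, size=200)

X = np.column_stack([surface, rooms])      # features
y = price                                  # numeric target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

r2 = model.score(X_test, y_test)           # for regressors, score() returns R²
print(f"R² on the test set: {r2:.3f}")

# Predict the price of a new house: 80 m², 3 rooms
predicted_price = model.predict(np.array([[80, 3]]))[0]
print(f"Predicted price: {predicted_price:,.0f}")
```

Note that `score` means R² for regressors, whereas for classifiers it means accuracy.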
Supervised — Classification

    # X: email content
    # y: "spam" or "not-spam" (category)
    # Goal: classify a new email

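A toy version of this spam classifier is sketched below. The six emails are invented for the example, and the choice of `CountVectorizer` with `MultinomialNB` is one common baseline for text, not necessarily the approach taken in the course.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented dataset: X is the email text, y is the category
emails = [
    "Win a free prize now",
    "Cheap loans, click here now",
    "Free money, claim your prize",
    "Meeting moved to 3pm tomorrow",
    "Please review the attached report",
    "Lunch on Friday with the team?",
]
labels = ["spam", "spam", "spam", "not-spam", "not-spam", "not-spam"]

# CountVectorizer turns text into word counts; MultinomialNB classifies them
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)

prediction = model.predict(["Claim your free prize today"])[0]
print(prediction)  # spam
```

Six emails are far too few for a real filter; the point is only that text classification follows the same fit and predict calls as the Iris example.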
Unsupervised — Clustering

    # X: customer purchases (amount, frequency...)
    # No y
    # Goal: group into 3-5 customer segments

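The clustering case can be sketched with `KMeans` on synthetic customers. The three groups below (average basket and monthly purchase frequency) are invented for the example, and the number of clusters is fixed at 3 by assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic customers (hypothetical): (mean amount, mean purchases per month)
rng = np.random.default_rng(0)
groups = [(20, 1), (60, 4), (150, 10)]
X = np.vstack([
    np.column_stack([rng.normal(amount, 5, 50), rng.normal(freq, 0.5, 50)])
    for amount, freq in groups
])

# Scale the features so that amount does not dominate the distance
X_scaled = StandardScaler().fit_transform(X)

# No y here: the algorithm only receives X
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
segments = kmeans.fit_predict(X_scaled)

print("Customers per segment:", np.bincount(segments))
```

The segment numbers returned by `KMeans` are arbitrary identifiers; interpreting them (for example "occasional" versus "frequent" buyers) is left to the analyst.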
The typical workflow of an ML project
Why scikit-learn?
This article covers the most useful excerpts. The complete scikit-learn course (10 chapters, 33 lessons, exercises with solutions and a final project) takes you all the way.
FAQ
How long does it take to learn scikit-learn?
Are there any prerequisites?
Where to start concretely?