ML Model Monitoring Explained Simply (with Diagrams and Real Code)

ML Model Monitoring: The Essentials in One Article — Real Code, Diagrams and Concrete Steps, Excerpts from a 24-Lesson Course.

REHOUMA Haythem

12 Jun 2026 • 13 min read

A guide that gets straight to the point: ML Model Monitoring broken down with diagrams, concrete examples and tested commands. Everything comes from a structured 7-chapter course — here is the best of it.

tl;dr

Introduction ML Monitoring
Data Drift
Model Drift and performance
Market Tools
Monitoring in production

~$ cat ./parcours.md # ML Model Monitoring — 6 chapters

Introduction ML Monitoring

→ The rotting-model syndrome
→ The 3 pillars of ML monitoring
+ 1 more lesson

Data Drift

→ Concepts and causes of data drift
→ Statistical tests: KS, Chi², PSI, Wasserstein
+ 1 more lesson

Model Drift and performance

→ Concept drift vs performance drift
→ Metrics by ML problem type
+ 1 more lesson

Market Tools

→ Overview of ML monitoring tools
→ Complete open-source stack: Evidently + Prometheus + Grafana
+ 1 more lesson

Monitoring in production

→ SLO, SLA and error budgets for ML
→ Continuous Training (CT) with Airflow
+ 2 more lessons

Final project

→ Final project: specifications and architecture
→ FraudGuard implementation step by step
+ 1 more lesson

🏁

Final project

→ You leave with a concrete, demonstrable project

Configure Prometheus, Slack and PagerDuty alerts

NOTE Objective: turn your ML monitoring metrics into actionable alerts. Write Prometheus rules on drift and degradation, route notifications via Alertmanager to Slack and PagerDuty, and avoid alert fatigue.

Learning objectives

TIP By the end of this module

Write Prometheus alerting rules (alerting_rules.yml) on ML KPIs
Configure Alertmanager to route to Slack and PagerDuty
Distinguish warning alerts from critical alerts by severity
Implement grouping (group_by) and inhibition
Reduce alert fatigue with for, thresholds and silences

From metric to alert: the complete chain

A metric exposed on /metrics is useless if nobody looks at it at 3 a.m. The alerting chain links a numeric value to a human action. It has four links.

1. Export the metric

Your ML service publishes Gauge and Counter metrics: drift score, p95 latency, error rate, rolling accuracy.
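To make the export step concrete, here is a minimal sketch that renders two gauges in the Prometheus text exposition format using only the standard library. The metric names (`ml_drift_score`, `ml_latency_p95_seconds`) and the `model="fraud_v1"` label are illustrative assumptions, not names taken from the course. In a real service, the official `prometheus_client` package would typically do this work for you.

```python
def render_gauge(name: str, help_text: str, samples: dict) -> str:
    """Render one gauge in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the metric value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)


# Hypothetical metric names and label values, for illustration only.
drift = render_gauge(
    "ml_drift_score",
    "Drift score of the input features (0 = no drift).",
    {(("model", "fraud_v1"),): 0.27},
)
latency = render_gauge(
    "ml_latency_p95_seconds",
    "95th percentile prediction latency in seconds.",
    {(("model", "fraud_v1"),): 0.183},
)

# This is the kind of payload a /metrics endpoint would return.
metrics_payload = drift + "\n" + latency + "\n"
print(metrics_payload)
```

Each sample line has the shape `name{label="value"} number`, which is what Prometheus scrapes and stores as a time series.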

2. Evaluate the rule

Prometheus periodically evaluates PromQL expressions. When the expression remains true for the duration of for, the alert moves from pending to firing.
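The pending to firing transition can be sketched as a small state machine. This is a simplified model of the behavior described above, not Prometheus code: it assumes one evaluation per time step, a condition of the form `value > threshold`, and a hypothetical drift threshold of 0.2.

```python
def alert_states(values, threshold, for_steps):
    """Return the alert state after each evaluation.

    The alert is 'pending' while the condition has held for fewer than
    `for_steps` evaluation intervals, then 'firing'. Any evaluation where
    the condition is false resets it to 'inactive'.
    """
    states = []
    first_true = None  # index of the evaluation where the condition became true
    for i, value in enumerate(values):
        if value > threshold:
            if first_true is None:
                first_true = i
            elapsed = i - first_true
            states.append("firing" if elapsed >= for_steps else "pending")
        else:
            first_true = None
            states.append("inactive")
    return states


# Hypothetical drift scores, one per evaluation interval.
drift_scores = [0.1, 0.3, 0.1, 0.3, 0.3, 0.3, 0.3]
states = alert_states(drift_scores, threshold=0.2, for_steps=3)
print(states)
```

The isolated spike at the second evaluation only reaches pending and is then reset. Only the sustained run at the end reaches firing, which is exactly what the `for` clause is meant to achieve.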

3. Route the alert

Alertmanager receives firing alerts, groups them, deduplicates them and selects the receiver according to the labels.
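The routing step can be illustrated with a few lines of Python. This is a sketch of the idea (deduplicate, pick a receiver from the `severity` label, group by a set of labels), not Alertmanager's actual implementation. The receiver names, alert names and labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical receiver names; in Alertmanager they would be declared under `receivers`.
ROUTES = {"critical": "pagerduty-oncall", "warning": "slack-ml-alerts"}
DEFAULT_RECEIVER = "slack-ml-alerts"


def route(alerts, group_by=("alertname", "model")):
    """Deduplicate alerts, choose a receiver by severity, and group them."""
    seen = set()
    groups = defaultdict(list)
    for alert in alerts:
        fingerprint = tuple(sorted(alert.items()))  # identical label sets are duplicates
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        receiver = ROUTES.get(alert.get("severity"), DEFAULT_RECEIVER)
        group_key = tuple(alert.get(label, "") for label in group_by)
        groups[(receiver, group_key)].append(alert)
    return dict(groups)


firing = [
    {"alertname": "DataDriftHigh", "model": "fraud_v1", "feature": "age", "severity": "warning"},
    {"alertname": "DataDriftHigh", "model": "fraud_v1", "feature": "income", "severity": "warning"},
    {"alertname": "DataDriftHigh", "model": "fraud_v1", "feature": "age", "severity": "warning"},  # duplicate
    {"alertname": "ModelApiDown", "model": "fraud_v1", "severity": "critical"},
]
notifications = route(firing)
for (receiver, key), grouped in notifications.items():
    print(receiver, key, f"{len(grouped)} alert(s)")
```

Four incoming alerts become two notifications: one Slack message covering both drifting features, and one PagerDuty page for the outage.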

4. Notify the human

The receiver sends a Slack message (warning) or pages the on-call engineer through PagerDuty (critical).

NOTE Golden rule: an alert that fires must always require an action. If nobody acts when it rings, delete it or turn it into a simple dashboard panel.

Write ML alerting rules in Prometheus

The rules live in a YAML file loaded by Prometheus via rule_files. For a model in production, we monitor three families: data drift, performance degradation, and infra health (latency, errors).
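As an illustration of what such a file can look like, the sketch below generates a rule file with one rule per family. The `groups` / `rules` / `alert` / `expr` / `for` / `labels` / `annotations` layout is the standard Prometheus rule format. The metric names, alert names, thresholds and durations are assumptions made for the example and should be replaced by values calibrated on your own history.

```python
RULE_TEMPLATE = """\
  - alert: {name}
    expr: {expr}
    for: {hold}
    labels:
      severity: {severity}
    annotations:
      summary: "{summary}"
"""


def render_rule_file(group_name, rules):
    """Build the YAML text of a Prometheus rule file from a list of rule dicts."""
    body = "".join(RULE_TEMPLATE.format(**rule) for rule in rules)
    return f"groups:\n- name: {group_name}\n  rules:\n{body}"


# Hypothetical metrics and thresholds, one rule per family.
rules = [
    {"name": "DataDriftHigh", "expr": "ml_drift_score > 0.2", "hold": "15m",
     "severity": "warning", "summary": "Sustained data drift on model inputs"},
    {"name": "ModelAccuracyLow", "expr": "ml_rolling_accuracy < 0.90", "hold": "30m",
     "severity": "critical", "summary": "Rolling accuracy below target"},
    {"name": "PredictionLatencyHigh", "expr": "ml_latency_p95_seconds > 0.5", "hold": "10m",
     "severity": "warning", "summary": "p95 prediction latency above 500 ms"},
]

rule_file = render_rule_file("ml_model_alerts", rules)
print(rule_file)  # write this text to alerting_rules.yml
```

Before reloading Prometheus, a generated file like this can be validated with `promtool check rules alerting_rules.yml`.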

Lever	Effect
`for: 15m`	Eliminates transient spikes, only alerts on sustained drifts
Inhibition	If the API is down, we do not also alert on drift (single root cause)
Silences	During a scheduled retraining, we temporarily cut drift alerts
Calibrated thresholds	Thresholds derived from history, not arbitrary values copied from a tutorial
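The inhibition lever in the table can be sketched as follows. The logic mirrors the idea behind Alertmanager's inhibition rules (a source alert mutes target alerts that share some labels), but it is a simplified illustration. The alert names and the `model` label are hypothetical.

```python
def apply_inhibition(alerts, source_name, target_names, equal=("model",)):
    """Drop target alerts when a source alert fires with the same `equal` labels."""
    sources = [a for a in alerts if a["alertname"] == source_name]
    kept = []
    for alert in alerts:
        inhibited = alert["alertname"] in target_names and any(
            all(alert.get(label) == source.get(label) for label in equal)
            for source in sources
        )
        if not inhibited:
            kept.append(alert)
    return kept


firing = [
    {"alertname": "ModelApiDown", "model": "fraud_v1", "severity": "critical"},
    {"alertname": "DataDriftHigh", "model": "fraud_v1", "severity": "warning"},
    {"alertname": "DataDriftHigh", "model": "churn_v2", "severity": "warning"},
]
remaining = apply_inhibition(
    firing, source_name="ModelApiDown", target_names=("DataDriftHigh",)
)
print([(a["alertname"], a["model"]) for a in remaining])
```

The drift alert on `fraud_v1` is muted because its API is already down, while the drift alert on `churn_v2` still goes through since it has a different root cause.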

The 3 pillars of ML monitoring

Chapter 00 • Lesson 02 • Duration: 45 min

NOTE🎯 Objectives

Identify the 3 categories of metrics to monitor
Understand which metrics to measure according to model type
Establish a progressive instrumentation roadmap
Know ML logging best practices

1. Overview: the 3 pillars

output

┌─────────────────────────────────────────────────────────────┐
│                  ML MONITORING IN PRODUCTION                  │
└─────────────────────────────────────────────────────────────┘

   ┌────────────────┐  ┌────────────────┐  ┌────────────────┐
   │  PILLAR 1       │  │  PILLAR 2       │  │  PILLAR 3       │
   │  System         │  │  Data           │  │  Model          │
   │  (infra)        │  │  (data + drift) │  │  (performance)  │
   └────────────────┘  └────────────────┘  └────────────────┘
   - Latency            - Distribution X    - Accuracy
   - QPS                - Missing values    - F1, AUC
   - CPU/Memory         - Outliers          - Calibration
   - 5xx errors         - Drift KS/PSI      - Business KPI

2. Pillar 1 — System monitoring (infra)

This is classic application monitoring, identical to any REST API.

Metric	Tools	Typical threshold
Latency p50/p95/p99	Prometheus, Datadog, CloudWatch	p99 < 500 ms
QPS (Queries Per Second)	Prometheus, ALB metrics	Track trend
HTTP error rate (5xx)	Prometheus, CloudWatch	< 0.1 %
CPU / Memory	cAdvisor, Datadog	< 80 %
GPU usage (if applicable)	NVIDIA DCGM, Prometheus	< 90 %
Disk I/O	node_exporter	—
Queue saturation (Kafka, SQS)	Kafka exporter	< 10k messages

TIP

ML specificity: Latency can explode on deep learning models because of poorly shared GPUs. Always measure p99 latency, not just the average.
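A short simulation shows why the average is misleading. The numbers below are made up for the example: 98 % of requests are fast and 2 % are slow, as could happen under GPU contention.

```python
import random
import statistics

random.seed(0)

# Simulated latencies in ms: 9,800 fast requests and 200 slow ones.
latencies = [random.gauss(80, 10) for _ in range(9800)]
latencies += [random.gauss(900, 100) for _ in range(200)]

mean_ms = statistics.fmean(latencies)
# quantiles(n=100) returns 99 cut points: index 49 is p50, 94 is p95, 98 is p99.
cuts = statistics.quantiles(latencies, n=100)
p50_ms, p95_ms, p99_ms = cuts[49], cuts[94], cuts[98]

print(f"mean={mean_ms:.0f} ms  p50={p50_ms:.0f} ms  p95={p95_ms:.0f} ms  p99={p99_ms:.0f} ms")
```

The mean and even p95 stay close to the fast requests, while p99 lands in the slow cluster and would breach a `p99 < 500 ms` threshold.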

3. Pillar 2 — Data monitoring (inputs)

3.1 Per-feature metrics

Feature type	Metrics
Numeric	Min, max, mean, std, median, quantiles, missing values
Categorical	Class distribution, new classes, NULL
Text	Average length, vocabulary, detected languages
Image	Size, ratio, RGB histograms
Timestamp	Date ranges, frequency by hour of day

3.2 Drift detection

Statistical test	Feature type	When to use
Kolmogorov-Smirnov (KS)	Numeric	Continuous feature compared with a reference distribution
Chi-squared (χ²)	Categorical	Frequency comparison
Population Stability Index (PSI)	Discretized numeric	Finance and banking standard
Wasserstein distance	Numeric	Measures how far the distribution has moved; often preferred over KS on large samples, where KS flags even negligible shifts
Jensen-Shannon divergence	Categorical	Probabilistic comparison

Details and code in Chapter 01 Lesson 02.
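As a quick illustration of one of these tests, here is a minimal PSI sketch using only the standard library. It assumes 10 bins whose edges are the quantiles of the reference sample, and a small epsilon to avoid dividing by zero on empty bins. The simulated ages are made up for the example.

```python
import bisect
import math
import random
import statistics


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index with bin edges taken from reference quantiles."""
    edges = statistics.quantiles(reference, n=n_bins)  # n_bins - 1 inner edges

    def proportions(values):
        counts = [0] * n_bins
        for value in values:
            counts[bisect.bisect_right(edges, value)] += 1
        return [max(count / len(values), eps) for count in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


random.seed(42)
reference = [random.gauss(35, 10) for _ in range(5000)]
same_dist = [random.gauss(35, 10) for _ in range(5000)]
shifted = [random.gauss(50, 12) for _ in range(5000)]

psi_same = psi(reference, same_dist)
psi_shifted = psi(reference, shifted)
print(f"PSI same distribution: {psi_same:.3f}")
print(f"PSI shifted distribution: {psi_shifted:.3f}")
```

A commonly quoted rule of thumb reads PSI below 0.1 as stable, 0.1 to 0.25 as a moderate shift, and above 0.25 as a significant shift. Treat these cut-offs as a starting point rather than a calibrated threshold.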

3.3 Data quality

4. Pillar 3 — Model monitoring

4.1 Prediction metrics (no labels)

Metric	Description
Prediction distribution	% of each class for classification
Average confidence score	Indicator of model uncertainty
Output entropy	The higher, the more uncertain the model
OOD rate (Out-of-Distribution)	% of inputs outside the training distribution
Fallback / unknown rate	% of predictions where the model does not know
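Three of these label-free signals can be computed directly from the model outputs. The sketch below uses hypothetical softmax outputs for four requests on a 3-class problem.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in bits) of one predicted probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical model outputs: one probability vector per request.
predictions = [
    [0.98, 0.01, 0.01],
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],
    [0.05, 0.90, 0.05],
]

mean_confidence = sum(max(p) for p in predictions) / len(predictions)
mean_entropy = sum(entropy(p) for p in predictions) / len(predictions)
class_share = Counter(p.index(max(p)) for p in predictions)

print(f"average confidence: {mean_confidence:.2f}")
print(f"average entropy: {mean_entropy:.2f} bits")
print(f"predicted class counts: {dict(class_share)}")
```

Tracked over time, a falling average confidence or a rising average entropy suggests the model is seeing inputs it is less sure about, well before any label arrives.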

4.2 Performance (with labels)

Problem	Main metrics
Binary classification	Accuracy, Precision, Recall, F1, AUC-ROC, AUC-PR
Multi-class classification	Accuracy, F1-macro, F1-weighted, confusion matrix
Regression	RMSE, MAE, MAPE, R²
Ranking / Recommendation	NDCG, MAP, Recall@k, Precision@k
Anomaly detection	Precision/Recall on the rare class
Generation (LLM)	BLEU, ROUGE, perplexity, human eval

4.3 Calibration

A calibrated model is one whose predicted probabilities match observed frequencies. For example, out of 100 predictions made with 80 % confidence, about 80 should be correct.

output

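from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Synthetic data and a simple model, only so the example runs end to end
X, y = make_classification(n_samples=2000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probability of the positive class
y_prob = model.predict_proba(X_test)[:, 1]
# Observed frequency vs mean predicted probability, per bin
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)

plt.plot([0, 1], [0, 1], 'k--')  # diagonal = perfect calibration
plt.plot(prob_pred, prob_true, 'o-')
plt.xlabel('Predicted probability')
plt.ylabel('Observed frequency')
plt.title('Calibration plot')
plt.show()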
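A plot is hard to alert on, so a single number is often tracked alongside it. The sketch below computes the expected calibration error (ECE), a metric not named in the course but commonly used for this purpose: the gap between observed frequency and mean predicted probability in each bin, weighted by bin size. The two data sets reuse the 80 % example from the definition above.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between observed frequency and predicted probability."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        index = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes in the last bin
        bins[index].append((label, prob))

    total = len(y_true)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        observed = sum(label for label, _ in bucket) / len(bucket)
        predicted = sum(prob for _, prob in bucket) / len(bucket)
        ece += len(bucket) / total * abs(observed - predicted)
    return ece


# 100 predictions at 80 % confidence, 80 of them correct: well calibrated.
ece_calibrated = expected_calibration_error([1] * 80 + [0] * 20, [0.8] * 100)
# Same confidence, but only 60 correct: the model is overconfident.
ece_overconfident = expected_calibration_error([1] * 60 + [0] * 40, [0.8] * 100)

print(f"calibrated: {ece_calibrated:.2f}, overconfident: {ece_overconfident:.2f}")
```

Like every labeled metric, this one can only be computed once the ground truth is available.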

5. The "delayed feedback" problem

Often, the ground truth (the label) only becomes known days, weeks or even months after the prediction.

Use case	Feedback delay
Spam detection	A few minutes (user reports)
Recommendation	Hours (click) or days (purchase)
Bank fraud detection	Days to weeks (chargeback)
Credit scoring	1-3 years (default)
Churn prediction	30 to 90 days
Medical diagnosis	Weeks to years
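One practical consequence is that labeled metrics should only be computed on predictions old enough for their label to be final. The sketch below shows one possible way to do it. The record layout, the 30-day delay and the dates are hypothetical.

```python
from datetime import date, timedelta


def matured_accuracy(predictions, labels, today, feedback_delay):
    """Accuracy on predictions whose feedback window has fully elapsed.

    Returns (accuracy, coverage); accuracy is None if nothing has matured yet.
    """
    cutoff = today - feedback_delay
    matured = [p for p in predictions if p["date"] <= cutoff and p["id"] in labels]
    if not matured:
        return None, 0.0
    correct = sum(p["pred"] == labels[p["id"]] for p in matured)
    return correct / len(matured), len(matured) / len(predictions)


# Hypothetical fraud predictions (1 = fraud) and the labels received so far.
predictions = [
    {"id": 1, "date": date(2026, 5, 1), "pred": 1},
    {"id": 2, "date": date(2026, 5, 5), "pred": 1},
    {"id": 3, "date": date(2026, 6, 1), "pred": 1},
    {"id": 4, "date": date(2026, 6, 10), "pred": 0},
]
labels = {1: 1, 2: 0, 3: 1}

accuracy, coverage = matured_accuracy(
    predictions, labels, today=date(2026, 6, 12), feedback_delay=timedelta(days=30)
)
print(f"accuracy on matured predictions: {accuracy:.2f} (coverage {coverage:.0%})")
```

Prediction 3 already has a label but is excluded on purpose: early labels tend to be the positive cases that were reported quickly, so including them would bias the metric. In the meantime, the label-free metrics from section 4.1 act as proxies.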

Concepts and causes of data drift

Chapter 01 • Lesson 01 • Duration: 45 min

NOTE🎯 Objectives

Precisely define what data drift is (and what it is not)
Distinguish covariate shift, label shift, and concept drift
Recognize common causes: seasonality, events, upstream bugs
Establish a reference dataset strategy

1. Formal definition

Data drift occurs when the statistical distribution of production data differs from the one used to train the model.

output

P_train(X) ≠ P_prod(X)

or, more generally:

P_train(X, Y) ≠ P_prod(X, Y)

Mathematically, we speak of covariate shift when only the distribution of features X changes, but the relationship P(Y|X) remains identical.

2. The 3 types of shift

Type	Definition	Example
Covariate shift	P(X) changes, P(Y|X) constant	Health model trained on adults, but production traffic is mostly seniors. The symptoms → disease relationship stays the same.
Label shift	P(Y) changes, P(X|Y) constant	Spam detection: 5 % spam in train, 50 % in prod (massive phishing campaign). Spam messages look the same, but their frequency explodes.
Concept drift	P(Y|X) changes	Fraud detection: fraudsters adapt, so the same transaction profile now leads to a different outcome. The rule itself changes.

3. Visual diagram

output

Training distribution            Production distribution
       
       *  *                              
      ***                                       *
     *****                                     ***
    *******                                   *****
   *********              vs                  ********
  ***********                                **********
 *************                              *************
───────────────────                ──────────────────────────
        No drift                        DRIFT DETECTED
                                  (the distribution has shifted)

4. Main causes of data drift

4.1 Seasonality

Domain	Seasonal example
E-commerce	Black Friday (volumes ×10), Christmas period
Weather	Temperature features very different winter/summer
Traffic	Weekend vs weekdays, rush hours
Tourism	Summer in the southern hemisphere vs northern
Energy	Winter vs summer consumption (air conditioning/heating)

4.2 External events

4.3 Business evolution

4.4 Bugs and technical changes (often overlooked)

WARNING⚠️ The most insidious causes

Upstream ETL pipeline that changes an encoding (UTF-8 vs Latin-1)
Source API update that returns more/fewer fields
Bug fix that modifies upstream values (e.g. changed rounding)
Library version change (e.g. numpy 1.x vs 2.x behaving differently)
Database migration that alters types
New optional field: NULL in prod, never NULL in train

5. Concrete Python example — simulate drift

output

import numpy as np
import matplotlib.pyplot as plt

# Fixed seed so the simulated data is reproducible
np.random.seed(42)
n_samples = 5000

# Training data: ages centered on 35, clipped to a plausible range
train_age = np.random.normal(35, 10, n_samples)
train_age = np.clip(train_age, 18, 80)

# Income follows the same rule in train and in production
train_income = train_age * 1000 + np.random.normal(0, 5000, n_samples)

# Production data: the population is older (mean 50), so P(X) has shifted
prod_age = np.random.normal(50, 12, n_samples)
prod_age = np.clip(prod_age, 18, 80)

prod_income = prod_age * 1000 + np.random.normal(0, 5000, n_samples)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: the drifting feature, with the mean of each sample
axes[0].hist(train_age, bins=30, alpha=0.5, label='Train', color='blue')
axes[0].hist(prod_age, bins=30, alpha=0.5, label='Production', color='red')
axes[0].axvline(train_age.mean(), color='blue', linestyle='--', label=f'Train mean={train_age.mean():.1f}')
axes[0].axvline(prod_age.mean(), color='red', linestyle='--', label=f'Prod mean={prod_age.mean():.1f}')
axes[0].set_title('Distribution of "age": covariate shift visible')
axes[0].set_xlabel('Age')
axes[0].legend()

# Right panel: income shifts too, because it is derived from age
axes[1].hist(train_income, bins=30, alpha=0.5, label='Train', color='blue')
axes[1].hist(prod_income, bins=30, alpha=0.5, label='Production', color='red')
axes[1].set_title('Distribution of "income"')
axes[1].set_xlabel('Income')
axes[1].legend()

plt.tight_layout()
plt.savefig('drift_visualization.png')
plt.show()
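The histograms make the shift visible to a human; a monitoring job needs a number. As a sketch, the two-sample KS statistic (the largest gap between the two empirical distribution functions) can be computed with the standard library on data simulated the same way. The 5 % critical value uses the usual large-sample approximation. In practice `scipy.stats.ks_2samp` is the more common choice.

```python
import bisect
import math
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def simulate_ages(mean, std, n):
    """Normally distributed ages, clipped to the 18-80 range."""
    return [min(max(random.gauss(mean, std), 18), 80) for _ in range(n)]


random.seed(42)
n = 5000
train_age = simulate_ages(35, 10, n)
control_age = simulate_ages(35, 10, n)  # fresh sample, same distribution
prod_age = simulate_ages(50, 12, n)     # shifted population

ks_control = ks_statistic(train_age, control_age)
ks_prod = ks_statistic(train_age, prod_age)
critical_5pct = 1.36 * math.sqrt((n + n) / (n * n))  # approximate 5 % critical value

print(f"KS train vs control: {ks_control:.3f}")
print(f"KS train vs prod:    {ks_prod:.3f}")
print(f"critical value:      {critical_5pct:.3f}")
```

The shifted sample sits far above the critical value, while the control sample drawn from the training distribution stays close to it or below.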

6. The "reference dataset" concept

To detect drift, you must compare against something known. This is the reference dataset.

Option	Advantages	Disadvantages
Training dataset	Sample of what the model has seen	Can be obsolete (1 year+)
Validation dataset	Independent of train	Small, sometimes biased
Sliding prod window (past week)	Always fresh	Progressive drift not detected
Previous month	Good balance	Seasonality ignored
Same month of the previous year (N-12)	Seasonality taken into account	Long-term upheavals missed

TIP

Recommendation: keep two references: (1) the initial validation dataset for "structural" drift, and (2) a 30-day sliding window for "behavioral" drift.

7. Detection granularity

Granularity	When to use
Per feature	Precisely identify which variable is drifting
Multivariate (entire dataset)	Detect subtle drifts (correlations)
Per segment (cohort, geography)	Detect localized drifts
Per time window	Temporal trend (day, week, month)

8. Analysis frequency

This article covers the most useful excerpts — the full ML Model Monitoring course (7 chapters, 24 lessons, corrected exercises and final project) takes you all the way.

FAQ

How long does it take to learn ML Model Monitoring?

With a structured progression (7 chapters, 24 short and practical lessons), you reach an operational level in a few weeks at 30 to 60 minutes per day. The important thing is to practice each concept immediately.

Are prerequisites required?

Basic computer science knowledge is sufficient. If you know how to use a terminal and read simple code, you are ready.

Where to start concretely?

Reproduce the commands in this article, then follow the full ML Model Monitoring course: it chains the 24 lessons in order, with exercises and a final project.
