ML Model Monitoring Explained Simply (with Diagrams and Real Code)
ML Model Monitoring: The Essentials in One Article — Real Code, Diagrams and Concrete Steps, Excerpts from a 24-Lesson Course.
A guide that gets straight to the point: ML Model Monitoring broken down with diagrams, concrete examples and tested commands. Everything comes from a structured 7-chapter course — here is the best of it.
- Introduction ML Monitoring
- Data Drift
- Model Drift and performance
- Market Tools
- Monitoring in production
Configure Prometheus, Slack and PagerDuty alerts
Learning objectives
- Write Prometheus alerting rules (alerting_rules.yml) on ML KPIs
- Configure Alertmanager to route to Slack and PagerDuty
- Distinguish warning alerts from critical alerts by severity
- Implement grouping (group_by) and inhibition
- Reduce alert fatigue with for, thresholds and silences
From metric to alert: the complete chain
A metric exposed on /metrics is useless if nobody looks at it at 3 a.m. The alerting chain links a numeric value to a human action. It has four links.
1. Export the metric
Your ML service publishes Gauge and Counter metrics: drift score, p95 latency, error rate, rolling accuracy.
2. Evaluate the rule
Prometheus periodically evaluates PromQL expressions. When the expression remains true for the duration of for, the alert moves from pending to firing.
3. Route the alert
Alertmanager receives firing alerts, groups them, deduplicates them and selects the receiver according to the labels.
4. Notify the human
The receiver sends a Slack message (warning) or triggers PagerDuty with on-call (critical).
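The pending-to-firing transition from step 2 can be sketched in a few lines. This is a toy model of the behavior described above, not Prometheus code: the function name `alert_states`, the sample series and the idea of counting the `for` duration in evaluation cycles are illustrative assumptions.

```python
from typing import List


def alert_states(expr_true: List[bool], for_cycles: int) -> List[str]:
    """Return the alert state after each evaluation cycle.

    expr_true[i] tells whether the rule expression was true at cycle i.
    for_cycles is the `for` duration expressed in evaluation cycles
    (an assumption of this sketch: one cycle = one evaluation interval).
    """
    states = []
    active_since = None  # cycle at which the expression first became true
    for cycle, is_true in enumerate(expr_true):
        if not is_true:
            # Any false evaluation resets the alert
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = cycle
        held = cycle - active_since
        states.append("firing" if held >= for_cycles else "pending")
    return states


# A short spike never fires; a sustained breach does.
spike = [False, True, True, False, False, False]
sustained = [False, True, True, True, True, True]
print(alert_states(spike, for_cycles=3))
print(alert_states(sustained, for_cycles=3))
```

The spike stays in pending and then resets, which is exactly why a `for` clause filters out transient noise.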
Write ML alerting rules in Prometheus
The rules live in a YAML file loaded by Prometheus via rule_files. For a model in production, we monitor three families: data drift, performance degradation, and infra health (latency, errors).
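As a rough illustration of the three families, the sketch below builds the structure of a rule file as a Python dictionary (groups, then rules with alert, expr, for, labels and annotations). The metric names (`ml_drift_score`, `ml_rolling_accuracy`, `ml_request_latency_seconds_bucket`), thresholds and durations are hypothetical placeholders, not values from the course. It prints JSON for inspection; writing the actual YAML file would need a YAML library, which this sketch does not assume.

```python
import json


def make_rule(alert, expr, for_duration, severity, summary):
    """Build one alerting rule in the shape used inside a Prometheus rule group."""
    return {
        "alert": alert,
        "expr": expr,
        "for": for_duration,
        "labels": {"severity": severity},
        "annotations": {"summary": summary},
    }


# Hypothetical metric names and thresholds: calibrate them on your own history.
rule_file = {
    "groups": [
        {
            "name": "ml_model_alerts",
            "rules": [
                # Family 1: data drift
                make_rule("DataDriftHigh", "ml_drift_score > 0.2", "15m",
                          "warning", "Sustained data drift on model inputs"),
                # Family 2: performance degradation
                make_rule("AccuracyDegraded", "ml_rolling_accuracy < 0.85", "30m",
                          "critical", "Rolling accuracy below threshold"),
                # Family 3: infra health (p95 latency from a histogram)
                make_rule(
                    "LatencyP95High",
                    "histogram_quantile(0.95, sum(rate("
                    "ml_request_latency_seconds_bucket[5m])) by (le)) > 0.5",
                    "10m", "warning", "p95 latency above 500 ms"),
            ],
        }
    ]
}

print(json.dumps(rule_file, indent=2))
```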
| Lever | Effect |
|---|---|
| for: 15m | Eliminates transient spikes, only alerts on sustained drifts |
| Inhibition | If the API is down, we do not also alert on drift (single root cause) |
| Silences | During a scheduled retraining, we temporarily cut drift alerts |
| Calibrated thresholds | Thresholds derived from history, not arbitrary values copied from a tutorial |
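To make the inhibition and silence rows concrete, here is a minimal sketch of the filtering logic in plain Python. It only imitates the idea (Alertmanager does this through its own configuration); the alert names `ApiDown` and `DataDriftHigh`, the `service` label and the function `notify_list` are assumptions made for the example.

```python
def notify_list(firing, silences, inhibit_source="ApiDown", inhibit_target="DataDriftHigh"):
    """Return the alerts that still notify after silences and inhibition.

    firing: list of label dictionaries, one per firing alert.
    silences: list of label matchers; an alert is silenced when every
    label of a matcher equals the alert's label.
    """
    # Services for which the root-cause alert is currently firing
    down_services = {a["service"] for a in firing if a["alertname"] == inhibit_source}
    kept = []
    for alert in firing:
        silenced = any(
            all(alert.get(key) == value for key, value in matcher.items())
            for matcher in silences
        )
        inhibited = (
            alert["alertname"] == inhibit_target and alert["service"] in down_services
        )
        if not silenced and not inhibited:
            kept.append(alert)
    return kept


firing = [
    {"alertname": "ApiDown", "service": "scoring", "severity": "critical"},
    {"alertname": "DataDriftHigh", "service": "scoring", "severity": "warning"},
    {"alertname": "DataDriftHigh", "service": "ranking", "severity": "warning"},
]
# Scheduled retraining of the (hypothetical) ranking model: mute its drift alert
silences = [{"alertname": "DataDriftHigh", "service": "ranking"}]

for alert in notify_list(firing, silences):
    print(alert["alertname"], alert["service"])
```

Only the root-cause alert remains: the drift alert on the same service is inhibited, and the other one is silenced.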
The 3 pillars of ML monitoring
Chapter 00 • Lesson 02 • Duration: 45 min
- Identify the 3 categories of metrics to monitor
- Understand which metrics to measure according to model type
- Establish a progressive instrumentation roadmap
- Know ML logging best practices
1. Overview: the 3 pillars
┌──────────────────────────────────────────────────────────┐
│               ML MONITORING IN PRODUCTION                │
└──────────────────────────────────────────────────────────┘
┌────────────────┐   ┌────────────────┐   ┌────────────────┐
│ PILLAR 1       │   │ PILLAR 2       │   │ PILLAR 3       │
│ System         │   │ Data           │   │ Model          │
│ (infra)        │   │ (data + drift) │   │ (performance)  │
└────────────────┘   └────────────────┘   └────────────────┘
 - Latency            - X distribution     - Accuracy
 - QPS                - Missing values     - F1, AUC
 - CPU/Memory         - Outliers           - Calibration
 - 5xx errors         - KS/PSI drift       - Business KPIs
2. Pillar 1 — System monitoring (infra)
This is classic application monitoring, identical to any REST API.
| Metric | Tools | Typical threshold |
|---|---|---|
| Latency p50/p95/p99 | Prometheus, Datadog, CloudWatch | p99 < 500 ms |
| QPS (Queries Per Second) | Prometheus, ALB metrics | Track trend |
| HTTP error rate (5xx) | Prometheus, CloudWatch | < 0.1 % |
| CPU / Memory | cAdvisor, Datadog | < 80 % |
| GPU usage (if applicable) | NVIDIA DCGM, Prometheus | < 90 % |
| Disk I/O | node_exporter | — |
| Queue saturation (Kafka, SQS) | Kafka exporter | < 10k messages |
ML specificity: Latency can explode on deep learning models because of poorly shared GPUs. Always measure p99 latency, not just the average.
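A small numeric sketch shows why the average is misleading. The latency values are invented for the illustration (98 % fast requests, 2 % stuck behind a contended GPU); only the standard library is used.

```python
import statistics

# Hypothetical sample: 980 fast requests at 40 ms, 20 slow ones at 2000 ms
latencies_ms = [40.0] * 980 + [2000.0] * 20

mean_ms = statistics.mean(latencies_ms)

# quantiles(n=100) returns the 99 cut points p1..p99
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50_ms, p95_ms, p99_ms = cuts[49], cuts[94], cuts[98]

print(f"mean={mean_ms:.1f} ms  p50={p50_ms:.0f} ms  p95={p95_ms:.0f} ms  p99={p99_ms:.0f} ms")
```

With this sample the mean stays under 100 ms while p99 sits at the slow-request latency, so a "p99 < 500 ms" threshold catches a problem that the average hides.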
3. Pillar 2 — Data monitoring (inputs)
3.1 Per-feature metrics
| Feature type | Metrics |
|---|---|
| Numeric | Min, max, mean, std, median, quantiles, missing values |
| Categorical | Class distribution, new classes, NULL |
| Text | Average length, vocabulary, detected languages |
| Image | Size, ratio, RGB histograms |
| Timestamp | Date ranges, frequency by hour of day |
3.2 Drift detection
| Statistical test | Feature type | When to use |
|---|---|---|
| Kolmogorov-Smirnov (KS) | Numeric | Continuous test vs reference distribution |
| Chi-squared (χ²) | Categorical | Frequency comparison |
| Population Stability Index (PSI) | Discretized numeric | Finance and banking standard |
| Wasserstein distance | Numeric | More sensitive than KS for large distributions |
| Jensen-Shannon divergence | Categorical | Probabilistic comparison |
Details and code in Chapter 01 Lesson 02.
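Ahead of that lesson, here is a minimal PSI sketch using only the standard library. It assumes equal-width bins built on the reference range (many implementations use quantile bins instead) and a small floor `eps` to avoid dividing by zero on empty bins; both choices, and the function name `psi`, are assumptions of this example rather than the course's implementation.

```python
import math
import random


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the reference range; values outside that
    range fall into the first or last bin. Assumes the reference sample
    is not constant.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins

    def proportions(sample):
        counts = [0] * n_bins
        for x in sample:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        # Floor each proportion at eps so the logarithm is always defined
        return [max(c / len(sample), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


rng = random.Random(0)
reference = [rng.gauss(35, 10) for _ in range(5000)]
same_dist = [rng.gauss(35, 10) for _ in range(5000)]
shifted = [rng.gauss(50, 12) for _ in range(5000)]

print(f"PSI, same distribution:    {psi(reference, same_dist):.4f}")
print(f"PSI, shifted distribution: {psi(reference, shifted):.4f}")
```

A commonly quoted rule of thumb reads PSI below 0.1 as stable and above 0.25 as a significant shift, but these cutoffs are conventions to validate against your own history.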
3.3 Data quality
4. Pillar 3 — Model monitoring
4.1 Prediction metrics (no labels)
| Metric | Description |
|---|---|
| Prediction distribution | % of each class for classification |
| Average confidence score | Indicator of model uncertainty |
| Output entropy | The higher, the more uncertain the model |
| OOD rate (Out-of-Distribution) | % of inputs outside the training distribution |
| Fallback / unknown rate | % of predictions where the model does not know |
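Two of these label-free metrics, average confidence and output entropy, can be computed directly from predicted probabilities. The sketch below is a minimal illustration with made-up probability vectors; dividing by log(k) to get a 0-to-1 scale is a convenience choice for the example.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one predicted probability vector."""
    return sum(-p * math.log(p) for p in probs if p > 0)


def normalized_entropy(probs):
    """Entropy divided by its maximum log(k): 0 = certain, 1 = uniform."""
    return entropy(probs) / math.log(len(probs))


# Hypothetical softmax outputs for a 3-class model
batch = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # nearly uniform: the model is unsure
]

mean_confidence = sum(max(p) for p in batch) / len(batch)
mean_entropy = sum(normalized_entropy(p) for p in batch) / len(batch)

print(f"average confidence: {mean_confidence:.2f}")
print(f"average normalized entropy: {mean_entropy:.2f}")
```

Tracking these averages per time window gives an early signal before any label arrives.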
4.2 Performance (with labels)
| Problem | Main metrics |
|---|---|
| Binary classification | Accuracy, Precision, Recall, F1, AUC-ROC, AUC-PR |
| Multi-class classification | Accuracy, F1-macro, F1-weighted, confusion matrix |
| Regression | RMSE, MAE, MAPE, R² |
| Ranking / Recommendation | NDCG, MAP, Recall@k, Precision@k |
| Anomaly detection | Precision/Recall on the rare class |
| Generation (LLM) | BLEU, ROUGE, perplexity, human eval |
4.3 Calibration
A calibrated model is one whose predicted probabilities match observed frequencies. For example, out of 100 predictions made with 80 % confidence, about 80 should be correct.
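from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Assumes a fitted binary classifier `model` and a labeled hold-out set (X_test, y_test)
y_prob = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

# Per bin: observed positive rate (prob_true) vs mean predicted probability (prob_pred)
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)

plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
plt.plot(prob_pred, prob_true, 'o-', label='Model')
plt.xlabel('Predicted probability')
plt.ylabel('Observed frequency')
plt.title('Calibration plot')
plt.legend()
plt.show()

5. The "delayed feedback" problem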
Often, we only know the ground truth (the label) days/weeks/months after the prediction.
| Use case | Feedback delay |
|---|---|
| Spam detection | A few minutes (user reports) |
| Recommendation | Hours (click) or days (purchase) |
| Bank fraud detection | Days to weeks (chargeback) |
| Credit scoring | 1-3 years (default) |
| Churn prediction | 30 to 90 days |
| Medical diagnosis | Weeks to years |
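One practical consequence is that performance can only be computed on the predictions whose label has already arrived. The sketch below joins a prediction log with late labels by identifier and reports both accuracy and label coverage; the identifiers, values and the function name `delayed_accuracy` are hypothetical.

```python
def delayed_accuracy(predictions, labels):
    """Accuracy on predictions that already have a label, plus label coverage.

    predictions: {prediction_id: predicted_class}
    labels: {prediction_id: true_class}, filled in as feedback arrives
    """
    matched = [
        (pred, labels[pid]) for pid, pred in predictions.items() if pid in labels
    ]
    coverage = len(matched) / len(predictions) if predictions else 0.0
    if not matched:
        return None, coverage  # no ground truth yet: accuracy is unknown
    accuracy = sum(pred == truth for pred, truth in matched) / len(matched)
    return accuracy, coverage


# Five predictions were served; only three labels have come back so far
predictions = {"p1": 1, "p2": 0, "p3": 1, "p4": 0, "p5": 1}
labels = {"p1": 1, "p2": 0, "p3": 0}

accuracy, coverage = delayed_accuracy(predictions, labels)
print(f"accuracy on labeled subset: {accuracy:.2f} (coverage {coverage:.0%})")
```

Reporting coverage next to accuracy matters: a score computed on a small, early-labeled subset may not represent the whole traffic.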
Concepts and causes of data drift
Chapter 01 • Lesson 01 • Duration: 45 min
- Precisely define what data drift is (and what it is not)
- Distinguish covariate shift, label shift, and concept drift
- Recognize common causes: seasonality, events, upstream bugs
- Establish a reference dataset strategy
1. Formal definition
Data drift occurs when the statistical distribution of production data differs from the one used to train the model.
P_train(X) ≠ P_prod(X), or more generally: P_train(X, Y) ≠ P_prod(X, Y)
Mathematically, we speak of covariate shift when only the distribution of features X changes, but the relationship P(Y|X) remains identical.
2. The 3 types of shift
| Type | Definition | Example |
|---|---|---|
| Covariate shift | P(X) changes, P(Y|X) constant | Health model trained on adults, in prod we have seniors. But the symptoms → disease relationship remains the same. |
| Label shift | P(Y) changes, P(X|Y) constant | Spam detection: 5 % spam in train, 50 % in prod (massive phishing campaign). Spam messages look the same, but their frequency explodes. |
| Concept drift | P(Y|X) changes | Fraud detection: fraudsters adapt, so the same profile now corresponds to a different behavior. The rule itself changes. |
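The label shift row can be worked through numerically. Because P(X|Y) is constant, a fixed classifier keeps the same true-positive and false-positive rates; only the class prior moves. The rates used below (90 % and 5 %) are assumed values chosen for the illustration, while the 5 % and 50 % spam priors come from the table.

```python
def fixed_classifier_view(prior, tpr=0.90, fpr=0.05):
    """Share of messages flagged as spam, and precision, at a given spam prior.

    tpr and fpr are held constant: under label shift P(X|Y) does not change,
    so the classifier behaves the same within each class.
    """
    true_pos = tpr * prior
    false_pos = fpr * (1 - prior)
    flagged = true_pos + false_pos
    return flagged, true_pos / flagged


for prior in (0.05, 0.50):
    flagged, precision = fixed_classifier_view(prior)
    print(f"spam prior={prior:.2f}  flagged={flagged:.4f}  precision={precision:.4f}")
```

With these assumed rates, the share of flagged messages goes from roughly 9 % to roughly 48 %, which is why the prediction distribution is a useful label-free signal for this kind of shift.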
3. Visual diagram
Training distribution          Production distribution
        *                                   *
       ***                                  *
      *****                                ***
     *******                              *****
    *********           vs              ********
   ***********                         **********
  *************                      *************
───────────────────        ──────────────────────────
     No drift                     DRIFT DETECTED
                           (the distribution has shifted)

4. Main causes of data drift
4.1 Seasonality
| Domain | Seasonal example |
|---|---|
| E-commerce | Black Friday (volumes ×10), Christmas period |
| Weather | Temperature features very different winter/summer |
| Traffic | Weekend vs weekdays, rush hours |
| Tourism | Summer in the southern hemisphere vs northern |
| Energy | Winter vs summer consumption (air conditioning/heating) |
4.2 External events
4.3 Business evolution
4.4 Bugs and technical changes (often overlooked)
- Upstream ETL pipeline that changes an encoding (UTF-8 vs Latin-1)
- Source API update that returns more/fewer fields
- Bug fix that modifies upstream values (e.g. changed rounding)
- Library version change (e.g. numpy 1.x vs 2.x behaving differently)
- Database migration that alters types
- New optional field: NULL in prod, never NULL in train
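Several of these causes (type changes, new fields, fields that start returning NULL) can be caught with a simple schema comparison before any statistical test. The sketch below is one possible approach under stated assumptions: records are plain dictionaries, `None` stands for NULL, and the field names, sample rows and 5-point tolerance on the null rate are invented for the example.

```python
def profile(rows):
    """Per-field null rate and set of observed Python type names."""
    fields = {key for row in rows for key in row}
    result = {}
    for field in fields:
        values = [row.get(field) for row in rows]  # absent key counts as NULL
        non_null = [v for v in values if v is not None]
        result[field] = {
            "null_rate": 1 - len(non_null) / len(rows),
            "types": {type(v).__name__ for v in non_null},
        }
    return result


def schema_issues(reference_rows, prod_rows, max_null_increase=0.05):
    """List differences between the reference schema and production."""
    ref, prod = profile(reference_rows), profile(prod_rows)
    issues = []
    for field in sorted(set(ref) | set(prod)):
        if field not in ref:
            issues.append(f"{field}: new field in production")
        elif field not in prod:
            issues.append(f"{field}: field missing in production")
        else:
            if prod[field]["null_rate"] - ref[field]["null_rate"] > max_null_increase:
                issues.append(f"{field}: null rate increased")
            new_types = prod[field]["types"] - ref[field]["types"]
            if new_types:
                issues.append(f"{field}: new types {sorted(new_types)}")
    return issues


train_rows = [{"age": 34, "city": "Lyon"}, {"age": 51, "city": "Paris"}]
prod_rows = [
    {"age": "34", "city": None, "channel": "app"},  # age now arrives as text
    {"age": 47, "city": "Nice", "channel": "web"},
]

for issue in schema_issues(train_rows, prod_rows):
    print(issue)
```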
5. Concrete Python example — simulate drift
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
n_samples = 5000

# Training data: age centered on 35, income roughly proportional to age
train_age = np.clip(np.random.normal(35, 10, n_samples), 18, 80)
train_income = train_age * 1000 + np.random.normal(0, 5000, n_samples)

# Production data: older population (covariate shift), same age -> income relationship
prod_age = np.clip(np.random.normal(50, 12, n_samples), 18, 80)
prod_income = prod_age * 1000 + np.random.normal(0, 5000, n_samples)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: the shifted feature, with the mean of each sample
axes[0].hist(train_age, bins=30, alpha=0.5, label='Train', color='blue')
axes[0].hist(prod_age, bins=30, alpha=0.5, label='Production', color='red')
axes[0].axvline(train_age.mean(), color='blue', linestyle='--', label=f'Train mean={train_age.mean():.1f}')
axes[0].axvline(prod_age.mean(), color='red', linestyle='--', label=f'Prod mean={prod_age.mean():.1f}')
axes[0].set_title('Distribution of "age": visible covariate shift')
axes[0].set_xlabel('Age')
axes[0].legend()

# Right panel: income shifts too, because it is derived from age
axes[1].hist(train_income, bins=30, alpha=0.5, label='Train', color='blue')
axes[1].hist(prod_income, bins=30, alpha=0.5, label='Production', color='red')
axes[1].set_title('Distribution of "income"')
axes[1].set_xlabel('Income')
axes[1].legend()

plt.tight_layout()
plt.savefig('drift_visualization.png')
plt.show()

6. The "reference dataset" concept
To detect drift, you must compare against something known. This is the reference dataset.
| Option | Advantages | Disadvantages |
|---|---|---|
| Training dataset | Sample of what the model has seen | Can be obsolete (1 year+) |
| Validation dataset | Independent of train | Small, sometimes biased |
| Sliding prod window (past week) | Always fresh | Progressive drift not detected |
| Previous month | Good balance | Seasonality ignored |
| Same month of the previous year (N-12) | Seasonality taken into account | Long-term upheavals missed |
Recommendation: keep two references: (1) the initial validation dataset for "structural" drift and (2) a 30-day sliding window for "behavioral" drift.
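The reason for keeping both references shows up on a slow drift. In the sketch below, a feature's weekly mean creeps up by one unit per week (invented numbers, with weekly windows to keep the example short). Comparing each week with the previous one never trips the threshold, while comparing with a fixed reference does.

```python
# Hypothetical weekly means of one feature: slow, steady increase
weekly_means = [35.0 + week * 1.0 for week in range(21)]
threshold = 5.0  # assumed alerting threshold on the difference of means

fixed_reference = weekly_means[0]  # e.g. the mean on the validation dataset

# Weeks flagged when comparing against the fixed reference
fixed_alerts = [
    week for week in range(1, len(weekly_means))
    if abs(weekly_means[week] - fixed_reference) > threshold
]

# Weeks flagged when comparing against the previous week (sliding reference)
sliding_alerts = [
    week for week in range(1, len(weekly_means))
    if abs(weekly_means[week] - weekly_means[week - 1]) > threshold
]

print("fixed reference, first alert at week:", fixed_alerts[0])
print("sliding reference, alerts:", sliding_alerts)
```

The sliding reference absorbs the drift week after week, which is the "progressive drift not detected" drawback listed in the table.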
7. Detection granularity
| Granularity | When to use |
|---|---|
| Per feature | Precisely identify which variable is drifting |
| Multivariate (entire dataset) | Detect subtle drifts (correlations) |
| Per segment (cohort, geography) | Detect localized drifts |
| Per time window | Temporal trend (day, week, month) |
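Per-segment analysis matters because opposite shifts can cancel out in the global statistic. The sketch below uses two invented segments ("FR" and "DE") whose means move in opposite directions; the global mean does not move at all.

```python
import statistics

# Hypothetical values of one feature, grouped by country segment
reference = {"FR": [30.0] * 4, "DE": [50.0] * 4}
current = {"FR": [40.0] * 4, "DE": [40.0] * 4}


def global_mean(by_segment):
    """Mean over all rows, ignoring the segment."""
    return statistics.mean(v for values in by_segment.values() for v in values)


# Shift of the mean, per segment and globally
segment_shift = {
    seg: statistics.mean(current[seg]) - statistics.mean(reference[seg])
    for seg in reference
}
global_shift = global_mean(current) - global_mean(reference)

print("global shift:", global_shift)
print("per-segment shift:", segment_shift)
```

Here the global check reports no change while each segment has moved by 10 units, so a per-cohort or per-geography breakdown is the only view that reveals the drift.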
8. Analysis frequency
This article covers the most useful excerpts — the full ML Model Monitoring course (7 chapters, 24 lessons, corrected exercises and final project) takes you all the way.