ML Model Monitoring Explained Simply (with Diagrams and Real Code)

ML Model Monitoring: The Essentials in One Article — Real Code, Diagrams and Concrete Steps, Excerpts from a 24-Lesson Course.


A guide that gets straight to the point: ML Model Monitoring broken down with diagrams, concrete examples and tested commands. Everything comes from a structured 7-chapter course — here is the best of it.

tl;dr
  • Introduction to ML Monitoring
  • Data Drift
  • Model Drift and performance
  • Market Tools
  • Monitoring in production
~$ cat ./parcours.md # ML Model Monitoring: 6 chapters
01
Introduction to ML Monitoring
→ The rotting-model syndrome
→ The 3 pillars of ML monitoring
+ 1 more lesson
02
Data Drift
→ Concepts and causes of data drift
→ Statistical tests: KS, Chi², PSI, Wasserstein
+ 1 more lesson
03
Model Drift and performance
→ Concept drift vs performance drift
→ Metrics by ML problem type
+ 1 more lesson
04
Market Tools
→ Overview of ML monitoring tools
→ Complete open-source stack: Evidently + Prometheus + Grafana
+ 1 more lesson
05
Monitoring in production
→ SLOs, SLAs and error budgets for ML
→ Continuous Training (CT) with Airflow
+ 2 more lessons
06
Final project
→ Final project: specifications and architecture
→ Implementing FraudGuard step by step
+ 1 more lesson
🏁
Final project
→ You leave with a concrete, demonstrable project

Configure Prometheus, Slack and PagerDuty alerts

NOTE: Objective. Turn your ML monitoring metrics into actionable alerts: write Prometheus rules on drift and degradation, route notifications via Alertmanager to Slack and PagerDuty, and avoid alert fatigue.

Learning objectives

TIP: By the end of this module, you will be able to:
  • Write Prometheus alerting rules (alerting_rules.yml) on ML KPIs
  • Configure Alertmanager to route to Slack and PagerDuty
  • Distinguish warning alerts from critical alerts by severity
  • Implement grouping (group_by) and inhibition
  • Reduce alert fatigue with for, thresholds and silences

From metric to alert: the complete chain

A metric exposed on /metrics is useless if nobody looks at it at 3 a.m. The alerting chain links a numeric value to a human action. It has four links.

1. Export the metric

Your ML service publishes Gauge and Counter metrics: drift score, p95 latency, error rate, rolling accuracy.

2. Evaluate the rule

Prometheus periodically evaluates PromQL expressions. When an expression stays true for the whole duration set in the rule's for field, the alert moves from pending to firing.
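
The pending-to-firing transition can be illustrated with a small simulation. This is only a sketch of the semantics, not Prometheus code: the AlertState class, the 0.2 threshold, the 180-second duration and the 60-second evaluation interval are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Hypothetical model of one alerting rule with a `for` duration."""
    threshold: float
    for_seconds: int
    active_since: Optional[int] = None  # time the expression first became true

    def evaluate(self, now: int, value: float) -> str:
        if value <= self.threshold:
            self.active_since = None      # expression false: reset the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now       # expression just became true
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# Made-up drift scores sampled every 60 s: one short spike, then a sustained drift.
samples = [0.05, 0.35, 0.06, 0.30, 0.31, 0.33, 0.34]
rule = AlertState(threshold=0.2, for_seconds=180)
states = [rule.evaluate(now=i * 60, value=v) for i, v in enumerate(samples)]
print(states)
```

The isolated spike only reaches pending and is reset as soon as the value drops; the alert fires only once the score has stayed above the threshold for three minutes.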

3. Route the alert

Alertmanager receives firing alerts, groups them, deduplicates them and selects the receiver according to the labels.

4. Notify the human

The receiver sends a Slack message (warning) or triggers PagerDuty with on-call (critical).
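
Steps 3 and 4 can be sketched in a few lines of Python. This illustrates the routing idea only and is not how Alertmanager is implemented; the receiver names slack-ml-warnings and pagerduty-oncall, the label values and the route function are hypothetical.

```python
from collections import defaultdict

# Hypothetical receivers, selected from the `severity` label.
RECEIVERS = {"warning": "slack-ml-warnings", "critical": "pagerduty-oncall"}


def route(alerts, group_by=("alertname", "model")):
    """Deduplicate alerts, pick a receiver per alert, then count alerts per group."""
    unique = {tuple(sorted(a.items())) for a in alerts}   # identical label sets collapse
    groups = defaultdict(int)
    for labels in unique:
        alert = dict(labels)
        receiver = RECEIVERS[alert["severity"]]
        key = (receiver,) + tuple(alert.get(name, "") for name in group_by)
        groups[key] += 1                                  # one notification per key
    return dict(groups)


firing = [
    {"alertname": "DataDrift", "model": "fraud", "feature": "amount", "severity": "warning"},
    {"alertname": "DataDrift", "model": "fraud", "feature": "amount", "severity": "warning"},  # duplicate
    {"alertname": "DataDrift", "model": "fraud", "feature": "country", "severity": "warning"},
    {"alertname": "HighErrorRate", "model": "fraud", "severity": "critical"},
]
result = route(firing)
print(result)
```

Four incoming alerts become two notifications: one Slack message covering both drifting features, and one PagerDuty page for the error rate.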

NOTE: Golden rule: an alert that fires must always require an action. If nobody does anything when it rings, delete it or turn it into a simple dashboard panel.

Write ML alerting rules in Prometheus

The rules live in a YAML file loaded by Prometheus via rule_files. For a model in production, we monitor three families: data drift, performance degradation, and infra health (latency, errors).

Lever | Effect
for: 15m | Eliminates transient spikes; only alerts on sustained drift
Inhibition | If the API is down, we do not also alert on drift (single root cause)
Silences | During a scheduled retraining, drift alerts are temporarily muted
Calibrated thresholds | Thresholds derived from history, not arbitrary values copied from a tutorial
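
One possible way to derive thresholds from history is to take high percentiles of the scores observed while the model was healthy. The sketch below uses made-up drift scores; picking the 95th percentile for warning and the 99th for critical is an assumption for the example, not a rule.

```python
import random
import statistics

random.seed(0)
# Hypothetical history: 90 daily drift scores recorded while the model was healthy.
history = [abs(random.gauss(0.05, 0.02)) for _ in range(90)]

# 99 cut points: cuts[k - 1] is the k-th percentile of the history.
cuts = statistics.quantiles(history, n=100, method="inclusive")
warning_threshold = cuts[94]    # 95th percentile
critical_threshold = cuts[98]   # 99th percentile

print(f"warning  > {warning_threshold:.3f}")
print(f"critical > {critical_threshold:.3f}")
```

With thresholds set this way, only a handful of the healthy days would have triggered a warning, which gives a rough idea of the expected false-alert rate.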

The 3 pillars of ML monitoring

Chapter 00 • Lesson 02 • Duration: 45 min

NOTE: 🎯 Objectives
  • Identify the 3 categories of metrics to monitor
  • Understand which metrics to measure according to model type
  • Establish a progressive instrumentation roadmap
  • Know ML logging best practices

1. Overview: the 3 pillars

output
┌─────────────────────────────────────────────────────────────┐
│                  ML MONITORING IN PRODUCTION                  │
└─────────────────────────────────────────────────────────────┘

   ┌────────────────┐  ┌────────────────┐  ┌────────────────┐
   │  PILLAR 1       │  │  PILLAR 2       │  │  PILLAR 3       │
   │  System         │  │  Data           │  │  Model          │
   │  (infra)        │  │  (data + drift) │  │  (performance)  │
   └────────────────┘  └────────────────┘  └────────────────┘
   - Latency            - X distribution    - Accuracy
   - QPS                - Missing values    - F1, AUC
   - CPU/Memory         - Outliers          - Calibration
   - 5xx errors         - Drift KS/PSI      - Business KPI

2. Pillar 1 — System monitoring (infra)

This is classic application monitoring, identical to any REST API.

Metric | Tools | Typical threshold
Latency p50/p95/p99 | Prometheus, Datadog, CloudWatch | p99 < 500 ms
QPS (Queries Per Second) | Prometheus, ALB metrics | Track trend
HTTP error rate (5xx) | Prometheus, CloudWatch | < 0.1 %
CPU / Memory | cAdvisor, Datadog | < 80 %
GPU usage (if applicable) | NVIDIA DCGM, Prometheus | < 90 %
Disk I/O | node_exporter |
Queue saturation (Kafka, SQS) | Kafka exporter | < 10k messages
TIP: ML specificity: latency can explode on deep learning models when GPUs are poorly shared. Always measure p99 latency, not just the average.
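
A quick simulation shows why the average hides the problem. The latencies below are made up (98 % fast requests, 2 % stuck behind a busy GPU); only the comparison between the mean and the percentiles matters.

```python
import random
import statistics

random.seed(1)
# Hypothetical latencies in ms: 980 fast requests and 20 very slow ones.
latencies = [random.uniform(20, 60) for _ in range(980)] + \
            [random.uniform(800, 1500) for _ in range(20)]

cuts = statistics.quantiles(latencies, n=100, method="inclusive")
mean_ms = statistics.fmean(latencies)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={mean_ms:.0f} ms  p50={p50:.0f} ms  p95={p95:.0f} ms  p99={p99:.0f} ms")
```

The mean and even p95 stay well under 100 ms, while p99 lands in the slow group and breaks a 500 ms target.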

3. Pillar 2 — Data monitoring (inputs)

3.1 Per-feature metrics

Feature type | Metrics
Numeric | Min, max, mean, std, median, quantiles, missing values
Categorical | Class distribution, new classes, NULL
Text | Average length, vocabulary, detected languages
Image | Size, ratio, RGB histograms
Timestamp | Date ranges, frequency by hour of day
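
The numeric and categorical rows of this table can be computed with the standard library alone. A minimal sketch, assuming missing values are represented by None; the function names profile_numeric and profile_categorical and the sample values are invented for the example.

```python
import statistics
from collections import Counter


def profile_numeric(values):
    """Summary statistics for a numeric feature; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.fmean(present),
        "std": statistics.pstdev(present),
        "median": statistics.median(present),
        "missing_rate": 1 - len(present) / len(values),
    }


def profile_categorical(values, known_classes):
    """Class shares, classes never seen in training, and NULL rate."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "distribution": {k: c / len(present) for k, c in counts.items()},
        "new_classes": sorted(set(counts) - set(known_classes)),
        "null_rate": 1 - len(present) / len(values),
    }


amount = profile_numeric([12.0, 40.0, None, 8.0, 100.0])
country = profile_categorical(["FR", "FR", "DE", None, "BR"], known_classes={"FR", "DE"})
print(amount)
print(country)
```

Logging these profiles per batch gives a history to compare against, and the new_classes field catches categories the model has never been trained on.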

3.2 Drift detection

Statistical test | Feature type | When to use
Kolmogorov-Smirnov (KS) | Numeric | Compare a continuous feature against the reference distribution
Chi-squared (χ²) | Categorical | Frequency comparison
Population Stability Index (PSI) | Discretized numeric | Finance and banking standard
Wasserstein distance | Numeric | Measures the size of the shift in the feature's units; useful on large samples, where KS flags even tiny differences
Jensen-Shannon divergence | Categorical | Probabilistic comparison

Details and code in Chapter 01 Lesson 02.
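
As a quick preview, here is a dependency-free sketch of the PSI. It assumes equal-width bins computed on the reference sample and a small epsilon to avoid taking the log of zero; both are choices made for this example, and the simulated ages are invented.

```python
import math
import random


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index with equal-width bins set on the reference."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins

    def shares(sample):
        counts = [0] * n_bins
        for x in sample:
            # Values outside the reference range fall into the first or last bin.
            idx = min(max(int((x - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        return [max(c / len(sample), eps) for c in counts]   # eps avoids log(0)

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


random.seed(42)
train = [random.gauss(35, 10) for _ in range(5000)]
same = [random.gauss(35, 10) for _ in range(5000)]       # same population
shifted = [random.gauss(50, 12) for _ in range(5000)]    # older population

psi_same, psi_shifted = psi(train, same), psi(train, shifted)
print(f"PSI without drift: {psi_same:.3f}")
print(f"PSI with drift:    {psi_shifted:.3f}")
```

A commonly quoted rule of thumb reads a PSI below 0.1 as stable and above 0.25 as a significant shift, but these cut-offs are conventions and should be checked against your own history.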

3.3 Data quality

4. Pillar 3 — Model monitoring

4.1 Prediction metrics (no labels)

Metric | Description
Prediction distribution | % of each class for classification
Average confidence score | Indicator of model uncertainty
Output entropy | The higher, the more uncertain the model
OOD rate (Out-of-Distribution) | % of inputs outside the training distribution
Fallback / unknown rate | % of predictions where the model does not know
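
The first three rows need nothing but the model's output probabilities. A minimal sketch on a hypothetical batch of four softmax outputs from a 3-class model (the values are invented):

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy (in bits) of one predicted probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical softmax outputs for 4 requests.
batch = [
    [0.98, 0.01, 0.01],
    [0.90, 0.05, 0.05],
    [0.25, 0.40, 0.35],
    [1 / 3, 1 / 3, 1 / 3],
]

predicted = [max(range(len(p)), key=p.__getitem__) for p in batch]   # argmax per row
class_share = {k: c / len(batch) for k, c in Counter(predicted).items()}
avg_confidence = sum(max(p) for p in batch) / len(batch)
avg_entropy = sum(entropy(p) for p in batch) / len(batch)

print(class_share, round(avg_confidence, 3), round(avg_entropy, 3))
```

Tracked over time, a falling average confidence or a rising average entropy is an early warning that can be raised long before any label arrives.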

4.2 Performance (with labels)

Problem | Main metrics
Binary classification | Accuracy, Precision, Recall, F1, AUC-ROC, AUC-PR
Multi-class classification | Accuracy, F1-macro, F1-weighted, confusion matrix
Regression | RMSE, MAE, MAPE, R²
Ranking / Recommendation | NDCG, MAP, Recall@k, Precision@k
Anomaly detection | Precision/Recall on the rare class
Generation (LLM) | BLEU, ROUGE, perplexity, human eval

4.3 Calibration

A calibrated model is one whose predicted probabilities match observed frequencies. For example, out of 100 predictions made with 80 % confidence, about 80 should be correct.

output
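from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Assumes `model` is a fitted binary classifier and (X_test, y_test) a labeled hold-out set.
y_prob = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

# Bin the predictions and compare the mean predicted probability with the observed frequency.
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)

plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
plt.plot(prob_pred, prob_true, 'o-', label='Model')
plt.xlabel('Predicted probability')
plt.ylabel('Observed frequency')
plt.title('Calibration plot')
plt.legend()
plt.show()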
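
A plot is hard to alert on, so it can help to reduce calibration to a single number. One common summary is the expected calibration error (ECE); the sketch below is a plain-Python version with equal-width bins, written for illustration rather than taken from a library.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between mean predicted probability and observed frequency."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        idx = min(int(prob * n_bins), n_bins - 1)   # prob == 1.0 goes in the last bin
        bins[idx].append((label, prob))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(label for label, _ in members) / len(members)
        predicted = sum(prob for _, prob in members) / len(members)
        ece += len(members) / len(y_true) * abs(observed - predicted)
    return ece


# 100 predictions at 80 % confidence: 80 correct, then only 60 correct.
well_calibrated = expected_calibration_error([1] * 80 + [0] * 20, [0.8] * 100)
overconfident = expected_calibration_error([1] * 60 + [0] * 40, [0.8] * 100)
print(round(well_calibrated, 3), round(overconfident, 3))
```

The calibrated case scores close to zero, while the overconfident one scores about 0.2, which is exactly the gap between 80 % claimed and 60 % observed.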

5. The "delayed feedback" problem

Often, we only know the ground truth (the label) days/weeks/months after the prediction.

Use case | Feedback delay
Spam detection | A few minutes (user reports)
Recommendation | Hours (click) or days (purchase)
Bank fraud detection | Days to weeks (chargeback)
Credit scoring | 1-3 years (default)
Churn prediction | 30 to 90 days
Medical diagnosis | Weeks to years
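
In practice this means performance can only be computed on predictions old enough for their label to have arrived. A minimal sketch, assuming a prediction log and a dictionary of labels received so far; the records, the 14-day delay and the matured_accuracy function are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical prediction log and late-arriving labels (e.g. chargebacks).
predictions = [
    {"id": 1, "day": date(2024, 3, 1), "pred": 1},
    {"id": 2, "day": date(2024, 3, 2), "pred": 0},
    {"id": 3, "day": date(2024, 3, 3), "pred": 0},
    {"id": 4, "day": date(2024, 3, 20), "pred": 1},   # too recent to be scored
]
labels = {1: 1, 2: 0, 3: 1}   # id -> ground truth received so far


def matured_accuracy(predictions, labels, today, max_delay_days):
    """Accuracy over predictions old enough for their label to have arrived."""
    cutoff = today - timedelta(days=max_delay_days)
    matured = [p for p in predictions if p["day"] <= cutoff and p["id"] in labels]
    if not matured:
        return None, 0
    correct = sum(p["pred"] == labels[p["id"]] for p in matured)
    return correct / len(matured), len(matured)


accuracy, n_scored = matured_accuracy(
    predictions, labels, today=date(2024, 3, 25), max_delay_days=14
)
print(accuracy, n_scored)
```

The metric therefore always describes the model as it behaved at least max_delay_days ago, which is why the label-free signals from section 4.1 remain necessary.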

Concepts and causes of data drift

Chapter 01 • Lesson 01 • Duration: 45 min

NOTE: 🎯 Objectives
  • Precisely define what data drift is (and what it is not)
  • Distinguish covariate shift, label shift, and concept drift
  • Recognize common causes: seasonality, events, upstream bugs
  • Establish a reference dataset strategy

1. Formal definition

Data drift occurs when the statistical distribution of production data differs from the one used to train the model.

output
P_train(X) ≠ P_prod(X)

or, more generally:

P_train(X, Y) ≠ P_prod(X, Y)

Mathematically, we speak of covariate shift when only the distribution of features X changes, but the relationship P(Y|X) remains identical.

2. The 3 types of shift

Type | Definition | Example
Covariate shift | P(X) changes, P(Y|X) constant | Health model trained on adults, but production serves seniors. The symptoms → disease relationship remains the same.
Label shift | P(Y) changes, P(X|Y) constant | Spam detection: 5 % spam in training, 50 % in production (massive phishing campaign). A spam message looks the same as before, but spam is far more frequent.
Concept drift | P(Y|X) changes | Fraud detection: fraudsters adapt, so the same transaction profile no longer leads to the same label. The rule itself changes.
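
Concept drift is the hardest of the three to spot from input statistics, because P(X) may not move at all. A toy simulation under made-up assumptions (a single feature, a frozen rule at 500, and fraudsters moving to amounts above 300) shows accuracy dropping on exactly the same inputs:

```python
import random

random.seed(7)


def model(amount):
    """Frozen rule learned at training time: flag large amounts as fraud."""
    return int(amount > 500)


def accuracy(amounts, label_fn):
    """Share of inputs where the frozen model agrees with the current truth."""
    return sum(model(a) == label_fn(a) for a in amounts) / len(amounts)


# The input distribution P(X) is identical before and after.
amounts = [random.uniform(0, 1000) for _ in range(10_000)]

before = accuracy(amounts, lambda a: int(a > 500))   # P(Y|X) as seen in training
after = accuracy(amounts, lambda a: int(a > 300))    # fraudsters now use smaller amounts

print(f"accuracy before={before:.2f}, after={after:.2f}")
```

No test on the input features would raise an alarm here, since the inputs are unchanged; only labels or output-side metrics reveal the problem.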

3. Visual diagram

output
Training distribution            Production distribution
       
       *  *                              
      ***                                       *
     *****                                     ***
    *******                                   *****
   *********              vs                  ********
  ***********                                **********
 *************                              *************
───────────────────                ──────────────────────────
        No drift                        DRIFT DETECTED
                                  (the distribution has shifted)

4. Main causes of data drift

4.1 Seasonality

Domain | Seasonal example
E-commerce | Black Friday (volumes ×10), Christmas period
Weather | Temperature features very different in winter and summer
Traffic | Weekend vs weekdays, rush hours
Tourism | Summer in the southern vs northern hemisphere
Energy | Winter vs summer consumption (heating/air conditioning)

4.2 External events

4.3 Business evolution

4.4 Bugs and technical changes (often overlooked)

WARNING: ⚠️ The most insidious causes
  • Upstream ETL pipeline that changes an encoding (UTF-8 vs Latin-1)
  • Source API update that returns more/fewer fields
  • Bug fix that modifies upstream values (e.g. changed rounding)
  • Library version change (e.g. numpy 1.x vs 2.x behaving differently)
  • Database migration that alters types
  • New optional field: NULL in prod, never NULL in train
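
Several of these causes can be caught before any statistical test runs, simply by comparing the shape of a production batch with a training batch. A minimal sketch, assuming rows arrive as Python dicts; the schema_issues function and the sample fields (amount, country, channel) are invented for the example.

```python
def schema_issues(train_rows, prod_rows):
    """Compare field presence, value types and NULLs between two batches of dict rows."""
    issues = []
    train_fields = set().union(*(r.keys() for r in train_rows))
    prod_fields = set().union(*(r.keys() for r in prod_rows))
    for field in sorted(prod_fields - train_fields):
        issues.append(f"new field in prod: {field}")
    for field in sorted(train_fields - prod_fields):
        issues.append(f"field missing in prod: {field}")
    for field in sorted(train_fields & prod_fields):
        train_types = {type(r[field]).__name__ for r in train_rows if r.get(field) is not None}
        prod_types = {type(r[field]).__name__ for r in prod_rows if r.get(field) is not None}
        if prod_types - train_types:
            issues.append(f"type change on {field}: {sorted(train_types)} -> {sorted(prod_types)}")
        train_nulls = sum(r.get(field) is None for r in train_rows)
        prod_nulls = sum(r.get(field) is None for r in prod_rows)
        if train_nulls == 0 and prod_nulls > 0:
            issues.append(f"NULL appears in prod for {field} ({prod_nulls}/{len(prod_rows)})")
    return issues


train = [{"amount": 12.5, "country": "FR"}, {"amount": 40.0, "country": "DE"}]
prod = [
    {"amount": "12,5", "country": "FR", "channel": "app"},   # number now sent as text
    {"amount": None, "country": "DE", "channel": "web"},     # NULL never seen in train
]
issues = schema_issues(train, prod)
for issue in issues:
    print(issue)
```

Checks like these are cheap to run on every batch and point directly at the upstream change, whereas a drift score only says that something moved.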

5. Concrete Python example — simulate drift

output
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)  # fixed seed so the figure is reproducible
n_samples = 5000

# Training population: mean age 35, clipped to a plausible range.
train_age = np.random.normal(35, 10, n_samples)
train_age = np.clip(train_age, 18, 80)

# Income depends on age, so a shift in age propagates to income.
train_income = train_age * 1000 + np.random.normal(0, 5000, n_samples)

# Production population: older on average (covariate shift on "age").
prod_age = np.random.normal(50, 12, n_samples)
prod_age = np.clip(prod_age, 18, 80)

prod_income = prod_age * 1000 + np.random.normal(0, 5000, n_samples)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left panel: the drifting feature, with the mean of each population.
axes[0].hist(train_age, bins=30, alpha=0.5, label='Train', color='blue')
axes[0].hist(prod_age, bins=30, alpha=0.5, label='Production', color='red')
axes[0].axvline(train_age.mean(), color='blue', linestyle='--', label=f'Train mean={train_age.mean():.1f}')
axes[0].axvline(prod_age.mean(), color='red', linestyle='--', label=f'Prod mean={prod_age.mean():.1f}')
axes[0].set_title('Distribution of "age": covariate shift visible')
axes[0].set_xlabel('Age')
axes[0].legend()

# Right panel: the correlated feature drifts as a side effect.
axes[1].hist(train_income, bins=30, alpha=0.5, label='Train', color='blue')
axes[1].hist(prod_income, bins=30, alpha=0.5, label='Production', color='red')
axes[1].set_title('Distribution of "income"')
axes[1].set_xlabel('Income')
axes[1].legend()

plt.tight_layout()
plt.savefig('drift_visualization.png')
plt.show()
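
A histogram shows the shift; a two-sample Kolmogorov-Smirnov statistic puts a number on it. The sketch below is a dependency-free version that regenerates similar ages with the random module (unclipped, for brevity). In practice scipy.stats.ks_2samp does the same job and also returns a p-value.

```python
import bisect
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:   # the largest gap is always reached at an observed value
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


random.seed(42)
train_age = [random.gauss(35, 10) for _ in range(5000)]
holdout_age = [random.gauss(35, 10) for _ in range(5000)]   # same population
prod_age = [random.gauss(50, 12) for _ in range(5000)]      # older population

ks_same = ks_statistic(train_age, holdout_age)
ks_shift = ks_statistic(train_age, prod_age)
print(f"KS train vs holdout:    {ks_same:.3f}")
print(f"KS train vs production: {ks_shift:.3f}")
```

The statistic lies between 0 (identical empirical distributions) and 1 (no overlap), so it can be exported as a gauge and compared against a threshold.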

6. The "reference dataset" concept

To detect drift, you must compare against something known. This is the reference dataset.

Option | Advantages | Disadvantages
Training dataset | Sample of what the model has seen | Can be obsolete (1 year+)
Validation dataset | Independent of the training set | Small, sometimes biased
Sliding production window (past week) | Always fresh | Progressive drift not detected
Previous month | Good balance | Seasonality ignored
Same month of the previous year (N-12) | Seasonality taken into account | Long-term upheavals missed
TIP: Recommendation: keep 2 references: (1) the initial validation dataset for "structural" drift, and (2) a 30-day sliding window for "behavioral" drift.
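
The value of keeping both references shows up on a slow drift. In the sketch below, the drift score is a deliberately simple one (shift of the mean, in reference standard deviations), and the simulated ages, the 0.1-year-per-day drift and the mean_shift function are assumptions made for the example.

```python
import random
import statistics
from collections import deque


def mean_shift(reference, current):
    """Simple drift score: shift of the mean, in reference standard deviations."""
    gap = abs(statistics.fmean(current) - statistics.fmean(reference))
    return gap / statistics.pstdev(reference)


random.seed(3)
validation_ref = [random.gauss(35, 10) for _ in range(2000)]   # fixed reference
window = deque(maxlen=30)                                       # last 30 daily batches

structural, behavioral = [], []
for day in range(120):
    # Hypothetical slow drift: the mean age rises by 0.1 year per day.
    batch = [random.gauss(35 + 0.1 * day, 10) for _ in range(200)]
    if len(window) == window.maxlen:
        recent = [x for daily in window for x in daily]
        structural.append(mean_shift(validation_ref, batch))
        behavioral.append(mean_shift(recent, batch))
    window.append(batch)

print(f"last day vs validation set: {structural[-1]:.2f} std")
print(f"last day vs 30-day window:  {behavioral[-1]:.2f} std")
```

The sliding window follows the drift and keeps reporting a small score, while the fixed reference shows that the population has moved by more than one standard deviation. The window, in turn, is the one that reacts to a sudden break from recent behavior.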

7. Detection granularity

Granularity | When to use
Per feature | Precisely identify which variable is drifting
Multivariate (entire dataset) | Detect subtle drifts (correlations)
Per segment (cohort, geography) | Detect localized drifts
Per time window | Temporal trend (day, week, month)

8. Analysis frequency

Go further

This article covers the most useful excerpts — the full ML Model Monitoring course (7 chapters, 24 lessons, corrected exercises and final project) takes you all the way.


FAQ

How long does it take to learn ML Model Monitoring?
With a structured progression (7 chapters, 24 short and practical lessons), you reach an operational level in a few weeks at 30 to 60 minutes per day. The important thing is to practice each concept immediately.
Are prerequisites required?
Basic computer science knowledge is sufficient. If you know how to use a terminal and read simple code, you are ready.
Where to start concretely?
Reproduce the commands in this article, then follow the full ML Model Monitoring course: it chains the 24 lessons in order, with exercises and a final project.
