Data & Big Data

Python SciPy statsmodels in Practice: The Code and Commands That Really Matter

Python SciPy statsmodels: The Essentials in One Article — Real Code, Diagrams, and Concrete Steps, Excerpts from a 37-Lesson Course.

REHOUMA Haythem

12 Jun 2026 • 16 min read

No endless theory here: open the terminal and practice. Here’s the essentials of Python SciPy statsmodels, extracted directly from a complete 37-lesson course — with real code you can copy-paste right now.

tl;dr

Introduction and Installation
NumPy Recap for SciPy
Discover SciPy
Descriptive Statistics with scipy.stats
Statistical Tests with SciPy

~$ cat ./parcours.md # Python SciPy statsmodels — 10 chapters

Introduction and Installation

→ Course Presentation→ Install Python and the libraries+ 1 more lessons

NumPy Recap for SciPy

→ NumPy Arrays, the Fundamental Building Block→ What You Can Do with a NumPy Array+ 1 more lessons

Discover SciPy

→ What Is SciPy and What Is It Used For?→ Organization of scipy.stats Modules+ 1 more lessons

Descriptive Statistics with scipy.stats

→ Mean, Median and Mode→ Variance, Standard Deviation and Dispersion+ 1 more lessons

Statistical Tests with SciPy

→ Introduction to Statistical Tests and the p-value→ Student’s t-Test — Comparing Two Means+ 2 more lessons

Introduction to Statsmodels

→ What Is Statsmodels and How Does It Differ from SciPy?→ Load Data and Explore Built-in Datasets+ 1 more lessons

Linear Regression with Statsmodels

→ Theory of Linear Regression Explained Simply→ Simple Linear Regression with Statsmodels+ 2 more lessons

Logistic Regression with Statsmodels

→ Theory of Logistic Regression→ Logistic Regression in Practice with Statsmodels+ 1 more lessons

🏁

Final project (+ 2 chapters along the way)

→ You leave with a concrete and demonstrable project

Forecasting the Future with ARIMA

NOTEWhat you will learn — ARIMA is a model that learns from past trends to predict the future. We will see how it works, how to configure it, and how to use it to generate a forecast.

What we will cover in this lesson

What is ARIMA, in simple terms?

ARIMA is a forecasting algorithm. It looks at past values and tries to infer what will happen in the future.

TIPAnalogy — Imagine a meteorologist. To forecast tomorrow’s temperature, they look at the temperatures of the last few days. If the last 3 days were 18, 20, 22 degrees (upward trend), they will probably predict 24 tomorrow. ARIMA does the same, but using precise mathematics.

The full name is Auto-Regressive Integrated Moving Average. It’s a complicated name, but the concept is simple: the model relies on past values and past forecast errors to estimate the next value.

The 3 ARIMA parameters: p, d, q

We write ARIMA(p, d, q). Each letter has a precise role.

Parameter	Meaning	Simple example
p	How many past values we use to predict	p=2: we look at the last 2 months
d	How many times we transform the series to stabilize it	d=1: most of the time, d=1 is enough
q	How many past errors we use to correct the forecast	q=1: we take the last error into account

TIPBeginner tip — Always start with ARIMA(1, 1, 1). This is the default setting that works well in most cases. You can test other values later if needed.

NOTEWhat is d for? — Most real series have a trend (sales increase every year, for example). ARIMA needs the series to be “stationary” (no trend) to work. The d parameter transforms the series to make it stationary. d=1 means we compute the differences between consecutive values.

Building the ARIMA model step by step

We reuse our monthly sales series over 4 years. We will use the first 38 months to train the model and the last 10 months to check whether the forecasts are good.

output

import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# --- Recreate the monthly sales series ---
np.random.seed(42)
dates = pd.date_range(start="2020-01", periods=48, freq="ME")
tendance = np.arange(48) * 1.0
mois = dates.month
saisonnalite = 10 * np.sin(2 * np.pi * (mois - 3) / 12)
bruit = np.random.normal(0, 3, 48)
ventes = 100 + tendance + saisonnalite + bruit
serie = pd.Series(ventes, index=dates, name="Ventes")

# --- Split: 38 months for training, 10 months for testing ---
train = serie.iloc[:38]   # training data
test  = serie.iloc[38:]   # test data (known future)

print(f"Training : {len(train)} months")
print(f"Test     : {len(test)} months")

output

# --- Create and fit the model ---
# order=(p, d, q) = (1, 1, 1) : default setting
modele = ARIMA(train, order=(1, 1, 1))
resultat = modele.fit()

# Display a short summary
print(f"Model AIC : {resultat.aic:.2f}")
print("Model fitted successfully!")

TIPAIC = Akaike Information Criterion. It is a score that measures model quality. The lower it is, the better. We use it to compare several ARIMA models against each other.

Generating a forecast

Now that we have trained the model, we ask it to forecast the next 10 months.

output

# Generate a forecast for 10 months
prevision = resultat.forecast(steps=10)

# Compare forecast vs actual values
comparaison = pd.DataFrame({
    "Reel"    : test.values.round(1),
    "Prevision": prevision.values.round(1),
    "Erreur"  : (test.values - prevision.values).round(1)
}, index=test.index)

print(comparaison)

NOTEHow to read this table?

Reel : the true sales for those months (what we hid from the model during training)
Prevision : what the ARIMA model predicted
Erreur : the difference. An error close to 0 is good. A large error means the forecast was bad.

Measuring forecast quality

To know whether our model forecasts well, we compute the mean absolute error (MAE). It is simply the average of the errors, ignoring the sign (positive or negative).

output

mae = np.abs(test.values - prevision.values).mean()
print(f"Mean absolute error (MAE) : {mae:.1f} units")
print(f"Mean of actual sales      : {test.mean():.1f} units")
print(f"Relative error            : {mae / test.mean() * 100:.1f}%")

TIPInterpretation — The model is wrong by 9.1 units on average, on sales averaging 141. That represents a 6.4% error. For a basic ARIMA(1,1,1) model, this is acceptable. We could improve it by testing other p and q values.

Forecasting the future beyond available data

In real life, you have no test data to compare against. You just want to forecast the next months.

First SciPy script

NOTEObjective — Write and run your very first script using SciPy. You will compute simple statistics on a dummy dataset, see how SciPy is used, and make sure everything works before going further.

Learning objectives

TIPAt the end of this module — You will have run your first SciPy script, you will understand the structure of an analysis script, and you will be comfortable moving on to the following chapters.

The structure of a typical analysis script

Every data analysis script in Python follows a similar pattern. Learning this pattern now will help you in all subsequent chapters.

General pattern

Scenario: analyzing course grades

You are a teacher. You have the grades of 20 students out of 100. You want to understand:

Step 1: the imports

The first thing to do in any script is to import the libraries. Here is why we import them with aliases:

output

import numpy as np                   # np is shorter to write than numpy
from scipy import stats              # we import only the stats module from SciPy
import matplotlib.pyplot as plt      # plt is the standard alias for plotting graphs

NOTEWhy from scipy import stats and not import scipy? — SciPy is divided into sub-modules. Importing only stats is more efficient because we load only what we need. We then write stats.mean() instead of scipy.stats.mean().

Step 2: create the data

output

notes = np.array([
    72, 85, 91, 63, 78, 55, 88, 94, 70, 82,
    66, 79, 87, 61, 90, 75, 83, 68, 77, 95
])

print("Number of students :", len(notes))
print("Grades               :", notes)

TIPnp.array() — This function creates a NumPy array from a Python list. A NumPy array looks like a Python list, but it is much more powerful for mathematical computations. SciPy always works with NumPy arrays.

Step 3: compute statistics with scipy.stats

output

moyenne = np.mean(notes)
mediane = np.median(notes)
ecart_type = np.std(notes)
minimum = np.min(notes)
maximum = np.max(notes)

print(f"Mean       : {moyenne:.2f}")
print(f"Median     : {mediane:.2f}")
print(f"Std dev    : {ecart_type:.2f}")
print(f"Minimum    : {minimum}")
print(f"Maximum    : {maximum}")

Step 4: use scipy.stats to go further

NumPy computes basic statistics. SciPy goes further by adding measures such as skewness and kurtosis, which describe the shape of the distribution.

output

asymetrie = stats.skew(notes)
aplatissement = stats.kurtosis(notes)
description = stats.describe(notes)

print(f"Skewness   : {asymetrie:.4f}")
print(f"Kurtosis   : {aplatissement:.4f}")
print()
print("Full description from scipy.stats.describe() :")
print(description)

Step 5: interpret the results

Statistic	Value	What it means
Mean	77.95	On average, students scored almost 78/100
Median	78.50	Half the students scored below 78.5 and the other half above
Std dev	11.32	Grades vary by about 11 points around the mean
Skewness (-0.25)	Slight	The distribution is almost symmetric (slight concentration toward higher grades)
Kurtosis (-0.88)	Negative	Distribution slightly flatter than normal (platykurtic)

NOTEMean vs median — The mean (77.95) and median (78.50) are very close here. This indicates there are no extreme values pulling the mean up or down. When the mean and median differ greatly, it often signals outliers.

Step 6: visualize (bonus)

output

plt.figure(figsize=(8, 4))
plt.hist(notes, bins=8, color='steelblue', edgecolor='white', alpha=0.8)
plt.axvline(moyenne, color='red', linestyle='--', label=f'Mean ({moyenne:.1f})')
plt.axvline(mediane, color='orange', linestyle='-', label=f'Median ({mediane:.1f})')
plt.xlabel("Grade out of 100")
plt.ylabel("Number of students")
plt.title("Grade distribution of the class")
plt.legend()
plt.tight_layout()
plt.show()

TIPplt.axvline() — This function draws a vertical line on the plot. It is useful for marking the mean or median on a histogram.

What we can do with a NumPy array

NOTEWhat you will learn — How to read a precise value in an array, how to keep only certain values, and how to perform calculations on an entire array in one go.

What we will cover in this lesson

Reading a precise value in an array

Imagine your array is a row of numbered boxes. The first box is numbered 0, not 1. This is a Python convention. The second box is 1, the third is 2, and so on.

TIPAnalogy — It is like the floors of a building in Europe: the ground floor is floor 0. The first floor is 1. In Python, we count the same way.

We can also count from the end with negative numbers. -1 gives the last element, -2 the second-to-last, etc.

output

import numpy as np

# An array of 7 weekly temperatures
temperatures = np.array([18.5, 22.1, 19.8, 25.3, 21.0, 17.2, 23.4])
#                          [0]   [1]   [2]   [3]   [4]   [5]   [6]

print( temperatures[0] )   # 18.5  -> Monday (first value)
print( temperatures[2] )   # 19.8  -> Wednesday (third value)
print( temperatures[-1] )  # 23.4  -> Sunday (last value)
print( temperatures[-2] )  # 17.2  -> Saturday (second-to-last)

Selecting several consecutive values

If you want to take a slice of the array, use the start:end notation. Note: the end value is exclusive. If you write [1:4], you get boxes 1, 2 and 3, but not box 4.

TIPAnalogy — It is like cutting a baguette. If you say “give me from piece 1 to piece 4”, the baker gives you pieces 1, 2, 3. Piece 4 stays in the shop.

output

temperatures = np.array([18.5, 22.1, 19.8, 25.3, 21.0, 17.2, 23.4])

# The first 3 days (Monday, Tuesday, Wednesday)
print( temperatures[:3] )     # omitted start = from the beginning

# From Tuesday to Friday (index 1 to 4, 4 excluded)
print( temperatures[1:4] )

# The last 2 days
print( temperatures[-2:] )    # omitted end = until the end

# Every other day (Monday, Wednesday, Friday, Sunday)
print( temperatures[::2] )

Keeping only values that satisfy a condition

This is one of the most powerful features of NumPy. You can say “give me only the temperatures above 21 degrees” and NumPy performs the selection automatically.

Here is how it works in 2 steps:

output

temperatures = np.array([18.5, 22.1, 19.8, 25.3, 21.0, 17.2, 23.4])

# Step 1: which days are above 21 degrees?
masque = temperatures > 21
print("Mask :", masque)
# True where value > 21, False otherwise

# Step 2: keep only those values
print("Days > 21 degrees :", temperatures[masque])

# We can also do both steps in a single line:
print("Days < 20 degrees :", temperatures[temperatures < 20])

NOTEHow to read the mask? — The mask has the same size as the array. For each value in the array, it says True (yes, this value satisfies the condition) or False (no). NumPy then keeps only the positions marked True.

Performing mathematical calculations on the entire array

NumPy provides mathematical functions that apply to every element at the same time. No need for a loop. You write a single line and NumPy performs the calculation on all values.

output

data = np.array([4.0, 9.0, 16.0, 25.0])

# np.sqrt = square root of each value
print("Square root :", np.sqrt(data))

# np.log = natural logarithm of each value
print("Logarithm   :", np.round(np.log(data), 2))

# np.abs = absolute value (removes the negative sign)
nombres = np.array([-3, 5, -7, 2])
print("Absolute value:", np.abs(nombres))

Computing statistics on the entire array

These functions summarize an array into a single number:

output

ventes = np.array([1200, 850, 1500, 970, 1100, 1350, 780])

print("Weekly total   :", np.sum(ventes))
print("Daily average  :", np.mean(ventes).round(1))
print("Highest day    :", np.max(ventes))
print("Lowest day     :", np.min(ventes))

# cumsum : cumulative total day by day
print("Cumulative day by day :", np.cumsum(ventes))

TIPWhat is cumsum for? — cumsum means “cumulative sum”. Each displayed value is the running total from the beginning. Here: Monday = 1200, Monday+Tuesday = 2050, Monday+Tuesday+Wednesday = 3550, etc. Useful for tracking accumulating revenue.

Applying a calculation to all values at once

In plain Python, if you want to add 10% to every price in a list, you write a loop. With NumPy, you do not need one. You write the operation once and it applies to all values. This is called broadcasting.

TIPAnalogy — Imagine a year of price increases. All prices in your list are raised by 20%. With NumPy, you multiply the entire array by 1.20 in a single line. Without NumPy, you would have had to write a loop that iterates over each price one by one.

go-further

This article covers the most useful excerpts — the complete Python SciPy statsmodels course (11 chapters, 37 lessons, corrected exercises and final project) takes you all the way.

./access-the-full-course free course: Mastering Claude Code

FAQ

How long does it take to learn Python SciPy statsmodels?

With a structured progression (11 chapters, 37 short practical lessons), you reach an operational level in a few weeks at 30–60 minutes per day. The key is to practice each concept immediately.

Are there any prerequisites?

Basic computer literacy is enough. If you can use a terminal and read simple code, you are ready.

Where to start concretely?

Reproduce the commands in this article, then follow the full Python SciPy statsmodels course: it chains the 37 lessons in order, with exercises and a final project.

./further-reading

→ AWS Data Engineering Bootcamp explained simply (with diagrams and real code)→ Get started with AWS Real-Time Data: your first concrete step today → Python Data Science: the 9 key steps to go from zero to operational

📬 Want to receive this type of guide every week? Subscribe for free — real code, zero fluff.

Forecasting the Future with ARIMA

What we will cover in this lesson

What is ARIMA, in simple terms?

The 3 ARIMA parameters: p, d, q

Building the ARIMA model step by step

Generating a forecast

Measuring forecast quality

Forecasting the future beyond available data

First SciPy script

Learning objectives

The structure of a typical analysis script

General pattern

Scenario: analyzing course grades

Step 1: the imports

Step 2: create the data

Step 3: compute statistics with scipy.stats

Step 4: use scipy.stats to go further

Step 5: interpret the results

Step 6: visualize (bonus)

What we can do with a NumPy array

What we will cover in this lesson

Reading a precise value in an array

Selecting several consecutive values

Keeping only values that satisfy a condition

Performing mathematical calculations on the entire array

Computing statistics on the entire array

Applying a calculation to all values at once

FAQ

Stay up to date