Python SciPy statsmodels in Practice: The Code and Commands That Really Matter
Python SciPy statsmodels: The Essentials in One Article — Real Code, Diagrams, and Concrete Steps, Excerpts from a 37-Lesson Course.
No endless theory here: open the terminal and practice. Here’s the essentials of Python SciPy statsmodels, extracted directly from a complete 37-lesson course — with real code you can copy-paste right now.
- Introduction and Installation
- NumPy Recap for SciPy
- Discover SciPy
- Descriptive Statistics with scipy.stats
- Statistical Tests with SciPy
Forecasting the Future with ARIMA
What we will cover in this lesson
What is ARIMA, in simple terms?
ARIMA is a forecasting algorithm. It looks at past values and tries to infer what will happen in the future.
The full name is Auto-Regressive Integrated Moving Average. It’s a complicated name, but the concept is simple: the model relies on past values and past forecast errors to estimate the next value.
The 3 ARIMA parameters: p, d, q
We write ARIMA(p, d, q). Each letter has a precise role.
| Parameter | Meaning | Simple example |
|---|---|---|
| p | How many past values we use to predict | p=2: we look at the last 2 months |
| d | How many times we transform the series to stabilize it | d=1: most of the time, d=1 is enough |
| q | How many past errors we use to correct the forecast | q=1: we take the last error into account |
Building the ARIMA model step by step
We reuse our monthly sales series over 4 years. We will use the first 38 months to train the model and the last 10 months to check whether the forecasts are good.
import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
# --- Recreate the monthly sales series ---
np.random.seed(42)
dates = pd.date_range(start="2020-01", periods=48, freq="ME")
tendance = np.arange(48) * 1.0
mois = dates.month
saisonnalite = 10 * np.sin(2 * np.pi * (mois - 3) / 12)
bruit = np.random.normal(0, 3, 48)
ventes = 100 + tendance + saisonnalite + bruit
serie = pd.Series(ventes, index=dates, name="Ventes")
# --- Split: 38 months for training, 10 months for testing ---
train = serie.iloc[:38] # training data
test = serie.iloc[38:] # test data (known future)
print(f"Training : {len(train)} months")
print(f"Test : {len(test)} months")# --- Create and fit the model ---
# order=(p, d, q) = (1, 1, 1) : default setting
modele = ARIMA(train, order=(1, 1, 1))
resultat = modele.fit()
# Display a short summary
print(f"Model AIC : {resultat.aic:.2f}")
print("Model fitted successfully!")Generating a forecast
Now that we have trained the model, we ask it to forecast the next 10 months.
# Generate a forecast for 10 months
prevision = resultat.forecast(steps=10)
# Compare forecast vs actual values
comparaison = pd.DataFrame({
"Reel" : test.values.round(1),
"Prevision": prevision.values.round(1),
"Erreur" : (test.values - prevision.values).round(1)
}, index=test.index)
print(comparaison)- Reel : the true sales for those months (what we hid from the model during training)
- Prevision : what the ARIMA model predicted
- Erreur : the difference. An error close to 0 is good. A large error means the forecast was bad.
Measuring forecast quality
To know whether our model forecasts well, we compute the mean absolute error (MAE). It is simply the average of the errors, ignoring the sign (positive or negative).
mae = np.abs(test.values - prevision.values).mean()
print(f"Mean absolute error (MAE) : {mae:.1f} units")
print(f"Mean of actual sales : {test.mean():.1f} units")
print(f"Relative error : {mae / test.mean() * 100:.1f}%")Forecasting the future beyond available data
In real life, you have no test data to compare against. You just want to forecast the next months.
First SciPy script
Learning objectives
The structure of a typical analysis script
Every data analysis script in Python follows a similar pattern. Learning this pattern now will help you in all subsequent chapters.
General pattern
Scenario: analyzing course grades
You are a teacher. You have the grades of 20 students out of 100. You want to understand:
Step 1: the imports
The first thing to do in any script is to import the libraries. Here is why we import them with aliases:
import numpy as np # np is shorter to write than numpy from scipy import stats # we import only the stats module from SciPy import matplotlib.pyplot as plt # plt is the standard alias for plotting graphs
from scipy import stats and not import scipy? — SciPy is divided into sub-modules. Importing only stats is more efficient because we load only what we need. We then write stats.mean() instead of scipy.stats.mean().Step 2: create the data
notes = np.array([
72, 85, 91, 63, 78, 55, 88, 94, 70, 82,
66, 79, 87, 61, 90, 75, 83, 68, 77, 95
])
print("Number of students :", len(notes))
print("Grades :", notes)Step 3: compute statistics with scipy.stats
moyenne = np.mean(notes)
mediane = np.median(notes)
ecart_type = np.std(notes)
minimum = np.min(notes)
maximum = np.max(notes)
print(f"Mean : {moyenne:.2f}")
print(f"Median : {mediane:.2f}")
print(f"Std dev : {ecart_type:.2f}")
print(f"Minimum : {minimum}")
print(f"Maximum : {maximum}")Step 4: use scipy.stats to go further
NumPy computes basic statistics. SciPy goes further by adding measures such as skewness and kurtosis, which describe the shape of the distribution.
asymetrie = stats.skew(notes)
aplatissement = stats.kurtosis(notes)
description = stats.describe(notes)
print(f"Skewness : {asymetrie:.4f}")
print(f"Kurtosis : {aplatissement:.4f}")
print()
print("Full description from scipy.stats.describe() :")
print(description)Step 5: interpret the results
| Statistic | Value | What it means |
|---|---|---|
| Mean | 77.95 | On average, students scored almost 78/100 |
| Median | 78.50 | Half the students scored below 78.5 and the other half above |
| Std dev | 11.32 | Grades vary by about 11 points around the mean |
| Skewness (-0.25) | Slight | The distribution is almost symmetric (slight concentration toward higher grades) |
| Kurtosis (-0.88) | Negative | Distribution slightly flatter than normal (platykurtic) |
Step 6: visualize (bonus)
plt.figure(figsize=(8, 4))
plt.hist(notes, bins=8, color='steelblue', edgecolor='white', alpha=0.8)
plt.axvline(moyenne, color='red', linestyle='--', label=f'Mean ({moyenne:.1f})')
plt.axvline(mediane, color='orange', linestyle='-', label=f'Median ({mediane:.1f})')
plt.xlabel("Grade out of 100")
plt.ylabel("Number of students")
plt.title("Grade distribution of the class")
plt.legend()
plt.tight_layout()
plt.show()What we can do with a NumPy array
What we will cover in this lesson
Reading a precise value in an array
Imagine your array is a row of numbered boxes. The first box is numbered 0, not 1. This is a Python convention. The second box is 1, the third is 2, and so on.
We can also count from the end with negative numbers. -1 gives the last element, -2 the second-to-last, etc.
import numpy as np # An array of 7 weekly temperatures temperatures = np.array([18.5, 22.1, 19.8, 25.3, 21.0, 17.2, 23.4]) # [0] [1] [2] [3] [4] [5] [6] print( temperatures[0] ) # 18.5 -> Monday (first value) print( temperatures[2] ) # 19.8 -> Wednesday (third value) print( temperatures[-1] ) # 23.4 -> Sunday (last value) print( temperatures[-2] ) # 17.2 -> Saturday (second-to-last)
Selecting several consecutive values
If you want to take a slice of the array, use the start:end notation. Note: the end value is exclusive. If you write [1:4], you get boxes 1, 2 and 3, but not box 4.
temperatures = np.array([18.5, 22.1, 19.8, 25.3, 21.0, 17.2, 23.4]) # The first 3 days (Monday, Tuesday, Wednesday) print( temperatures[:3] ) # omitted start = from the beginning # From Tuesday to Friday (index 1 to 4, 4 excluded) print( temperatures[1:4] ) # The last 2 days print( temperatures[-2:] ) # omitted end = until the end # Every other day (Monday, Wednesday, Friday, Sunday) print( temperatures[::2] )
Keeping only values that satisfy a condition
This is one of the most powerful features of NumPy. You can say “give me only the temperatures above 21 degrees” and NumPy performs the selection automatically.
Here is how it works in 2 steps:
temperatures = np.array([18.5, 22.1, 19.8, 25.3, 21.0, 17.2, 23.4])
# Step 1: which days are above 21 degrees?
masque = temperatures > 21
print("Mask :", masque)
# True where value > 21, False otherwise
# Step 2: keep only those values
print("Days > 21 degrees :", temperatures[masque])
# We can also do both steps in a single line:
print("Days < 20 degrees :", temperatures[temperatures < 20])Performing mathematical calculations on the entire array
NumPy provides mathematical functions that apply to every element at the same time. No need for a loop. You write a single line and NumPy performs the calculation on all values.
data = np.array([4.0, 9.0, 16.0, 25.0])
# np.sqrt = square root of each value
print("Square root :", np.sqrt(data))
# np.log = natural logarithm of each value
print("Logarithm :", np.round(np.log(data), 2))
# np.abs = absolute value (removes the negative sign)
nombres = np.array([-3, 5, -7, 2])
print("Absolute value:", np.abs(nombres))Computing statistics on the entire array
These functions summarize an array into a single number:
ventes = np.array([1200, 850, 1500, 970, 1100, 1350, 780])
print("Weekly total :", np.sum(ventes))
print("Daily average :", np.mean(ventes).round(1))
print("Highest day :", np.max(ventes))
print("Lowest day :", np.min(ventes))
# cumsum : cumulative total day by day
print("Cumulative day by day :", np.cumsum(ventes))Applying a calculation to all values at once
In plain Python, if you want to add 10% to every price in a list, you write a loop. With NumPy, you do not need one. You write the operation once and it applies to all values. This is called broadcasting.
This article covers the most useful excerpts — the complete Python SciPy statsmodels course (11 chapters, 37 lessons, corrected exercises and final project) takes you all the way.
./access-the-full-course free course: Mastering Claude CodeFAQ
How long does it take to learn Python SciPy statsmodels?
Are there any prerequisites?
Where to start concretely?
📬 Want to receive this type of guide every week? Subscribe for free — real code, zero fluff.