Advanced Python Performance Explained Simply (with Diagrams and Real Code)

Advanced Python Performance: The Essentials in One Article — Real Code, Diagrams and Concrete Steps, Excerpts from a 35-Lesson Course.

A guide that gets straight to the point: Advanced Python Performance dissected with diagrams, concrete examples and tested commands. Everything comes from a structured 11-chapter course — here are the highlights.

tl;dr
  • Introduction and Installation
  • Code Profiling
  • Comprehensions, iterators, generators
  • Multithreading vs Multiprocessing
  • asyncio and coroutines
~$ cat ./parcours.md # Python Advanced Performance: 10 chapters
01 Introduction and Installation
   → Course presentation
   → Install the profiling tools
   + 1 more lesson
02 Code profiling
   → cProfile and pstats
   → timeit for precise measurements
   + 1 more lesson
03 Comprehensions, iterators, generators
   → List, set and dict comprehensions
   → The iterator protocol
   + 1 more lesson
04 Multithreading vs Multiprocessing
   → Threading, the GIL and its limits
   → Multiprocessing, true parallelism
   + 1 more lesson
05 asyncio and coroutines
   → Introduction to asyncio
   → async / await in practice
   + 1 more lesson
06 Cython, Numba and vectorization
   → Vectorization with NumPy
   → Numba JIT
   + 1 more lesson
07 Memory management
   → The Python garbage collector
   → Weak references with weakref
   + 1 more lesson
08 Caching and memoization
   → functools.lru_cache
   → Disk cache with joblib and diskcache
   + 1 more lesson
🏁 Final project (+ 2 chapters along the way)
   → You leave with a concrete and demonstrable project

Course Overview

NOTE (Objective): Understand what it means to “optimize Python code”, when to do it (and especially when not to), and get a clear picture of the journey ahead in this course.

Learning Objectives

TIP (By the end of this module): You will be able to explain why Python has a reputation for being slow (and why that is partly a myth), you will know the 3 main optimization axes, and you will understand the golden rule: measure before optimizing.

The Concrete Problem

Have you ever experienced this? You launch your Python script at 5 pm to compute a report, you go grab a coffee, and when you come back 30 minutes later the script is still running. Worse: your colleague who does the same thing in R or Julia finished in 3 minutes.

Typical Symptoms of a Slow Python Program

What You Will Learn to Do

NOTE (Is Python really slow?): Pure Python (CPython) can be 50 to 100× slower than C for numerical loops. However, most scientific libraries (NumPy, Pandas, scikit-learn) implement their hot paths in C, Cython or Fortran. When the heavy lifting stays inside those libraries, Python code can reach roughly 80 to 95 % of C performance. The problem is rarely “Python is slow”; far more often it is “my Python code is poorly written”.

The 3 Main Optimization Axes

Axis                         | Question asked                                | Typical gain
1. Algorithm                 | Is my code O(n²) when it could be O(n log n)? | 10× to 10,000×
2. Data Structure            | Am I using a list when a set would do?        | 10× to 1,000×
3. Concurrency / Parallelism | Can I run these 8 tasks at the same time?     | 2× to 16× (depending on CPU/IO)
WARNING (Order of attack): Always in this order: algorithm, then data structure, then parallelism. Parallelizing a bad algorithm is like putting 8 people to dig a tunnel with teaspoons when an excavator would have sufficed.
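To make the "data structure" axis concrete, here is a minimal sketch that times a membership test on a list and on a set holding the same values. The size `N`, the helper `time_lookup` and the repeat count are illustrative choices, not taken from the course, and the absolute timings will vary from machine to machine.

```python
import timeit

N = 100_000
as_list = list(range(N))
as_set = set(as_list)
missing = -1  # worst case for the list: every element gets compared


def time_lookup(container, repeats=200):
    """Return the total time of `repeats` membership tests, in seconds."""
    return timeit.timeit(lambda: missing in container, number=repeats)


t_list = time_lookup(as_list)   # O(n) scan on each test
t_set = time_lookup(as_set)     # O(1) hash lookup on average
print(f"list: {t_list:.4f} s, set: {t_set:.6f} s")
```

The list scan grows linearly with `N` while the set lookup stays roughly flat, which is why this change alone can dwarf what parallelism would bring.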

The 80/20 Rule (Pareto)

In most programs, roughly 80 % of execution time is spent in 20 % of the code. Often the split is even 90/10 or 95/5.

TIP (Consequence): No need to optimize all your code. Find the 5 lines that cost 95 % of the time, optimize them aggressively, and leave the rest alone.

Donald Knuth, the computing legend, summarized it in 1974:

NOTE: “Premature optimization is the root of all evil.”

Practical translation: first write clear and correct code. If it turns out to be too slow, measure, then optimize only the hot spots. Otherwise you spend time making code unreadable for a speedup nobody will notice.

A Telling Before/After

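Here is a concrete example: summing the squares of the integers from 0 up to (but not including) 10 million.

output
# Naive version: Python loop (exact, since Python ints have arbitrary precision)
total = 0
for i in range(10_000_000):
    total += i * i
# Time: ~1.2 seconds on a modern laptop

# Vectorized version with NumPy
import numpy as np
arr = np.arange(10_000_000, dtype=np.int64)
total = (arr * arr).sum()
# Time: ~0.04 seconds -> 30× faster
# Caveat: the exact sum (~3.3e20) exceeds the int64 range, so this wraps silently

# Compiled version with Numba @jit
from numba import jit
@jit(nopython=True)
def somme_carres(n):
    total = 0
    for i in range(n):
        total += i * i
    return total
total = somme_carres(10_000_000)   # same int64 wraparound as NumPy
# Time: ~0.01 seconds -> 120× faster (after the first call, which also compiles)

Same problem, same language, but about 120 times faster. One caveat: at this size the fixed-width int64 arithmetic used by NumPy and Numba overflows, so only the pure-Python total is mathematically exact; pass dtype=np.float64 (approximate) or use a smaller n when the results must be comparable. This kind of speedup is exactly what you will learn to obtain, systematically, on your own code.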
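When comparing implementations like these, it helps to know the expected value independently of any of them. The sketch below uses the standard closed form for 0² + 1² + ... + (n-1)² with exact Python integers; the helper name `somme_carres_exacte` is hypothetical. It also shows that for n = 10 million the sum is larger than the biggest signed 64-bit integer, which is worth keeping in mind with fixed-width integer types such as those used by NumPy and Numba.

```python
INT64_MAX = 2**63 - 1


def somme_carres_exacte(n):
    """Closed form for 0² + 1² + ... + (n-1)², using exact Python integers."""
    return (n - 1) * n * (2 * n - 1) // 6


# Check the formula against a brute-force loop on a small input
assert somme_carres_exacte(1_000) == sum(i * i for i in range(1_000))

exacte = somme_carres_exacte(10_000_000)
print(exacte)                # 333333283333335000000
print(exacte > INT64_MAX)    # True: the sum does not fit in a signed 64-bit integer
```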

What You Will Build

Phase 1: Measure (ch. 0-1)

Install the tools, run your first profile, learn to read a cProfile report. You will be able to identify the bottleneck in under 5 minutes.

Phase 2: Optimize (ch. 2-7)

Generators, threading, multiprocessing, asyncio, NumPy, Numba, Cython, caching. The entire modern Python developer toolkit.

Generators with yield

NOTE (Objective): Discover yield, the keyword that turns a function into a generator, and learn to build lazy data-processing pipelines capable of handling multi-gigabyte files with only a few megabytes of RAM.

Learning Objectives

TIP (By the end of this module): You will know how to write a generator, use it in a processing pipeline, and choose deliberately between a lazy generator and a materialized collection such as a list.

A First Generator

output
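def compter(limite):    # "limite" rather than "max", to avoid shadowing the built-in
    n = 0
    while n < limite:
        yield n     # suspend the function and hand n to the caller
        n += 1

# Calling it runs none of the body: we just get a generator object back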
g = compter(5)
print(type(g))   # <class 'generator'>

# Consumption
for n in g:
    print(n)     # 0 1 2 3 4
NOTE (Magic): As soon as a function contains a yield, Python turns it into a generator factory. Calling it does not run the body: it returns a generator object. The body then runs piece by piece, advancing to the next yield on each call to next() (which is what a for loop does behind the scenes).
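The lazy start can be observed directly by driving a generator by hand with next(). This is a small sketch: the print calls are added purely to show when the body actually runs, and the variable names are illustrative.

```python
def compter(limite):
    print("body started")
    n = 0
    while n < limite:
        yield n
        n += 1
    print("body finished")


g = compter(2)          # nothing is printed yet: the body has not started
premier = next(g)       # prints "body started", runs up to the first yield
second = next(g)        # resumes after the yield, runs up to the second one
try:
    next(g)             # prints "body finished", then raises StopIteration
    epuise = False
except StopIteration:
    epuise = True

print(premier, second, epuise)   # 0 1 True
```

A for loop performs exactly these next() calls and stops quietly when StopIteration is raised.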

yield vs return

return                        | yield
Terminates the function       | Suspends the function
Local state is lost           | Local state is preserved between resumptions
Returns a single value, once  | Can produce many values, one per resumption
Returns everything at once    | Returns one element at a time

Real-World Use Case: Reading a Large Log File

output
def lire_log(chemin):
    """Generator that yields each line without loading everything."""
    with open(chemin, encoding="utf-8") as f:
        for ligne in f:
            yield ligne.rstrip("\n")

def filtrer_erreurs(lignes):
    """Generator that keeps only ERROR lines."""
    for ligne in lignes:
        if "ERROR" in ligne:
            yield ligne

def extraire_codes(lignes):
    """Generator that yields the HTTP code of each line."""
    for ligne in lignes:
        try:
            code = int(ligne.split()[-1])
            yield code
        except (ValueError, IndexError):
            continue

# Pipeline: nothing is read yet, and no step ever holds the whole file in memory, even for 50 GB
lignes = lire_log("acces.log")
erreurs = filtrer_erreurs(lignes)
codes = extraire_codes(erreurs)

# Final consumption
from collections import Counter
print(Counter(codes).most_common(5))
# [(500, 1284), (502, 412), (503, 309), ...]
TIP (The art of the pipeline): Each generator does one thing and passes the result to the next. This is the Unix tools model: cat file | grep ERROR | awk '{print $NF}' | sort | uniq -c. Readable, modular, constant memory.
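The "constant memory" claim can be checked with the standard tracemalloc module. The following is a sketch under stated assumptions: `lignes_simulees` is a hypothetical stand-in for a real log file, and `pic_memoire`, `version_liste` and `version_generateur` are illustrative helpers. Exact byte counts depend on the Python version.

```python
import tracemalloc


def lignes_simulees(n):
    """Hypothetical stand-in for a log file: yields n fake log lines."""
    for i in range(n):
        niveau = "ERROR" if i % 10 == 0 else "INFO"
        yield f"2024-01-01 {niveau} request {i} 500"


def pic_memoire(fonction):
    """Run fonction() and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    resultat = fonction()
    _, pic = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return resultat, pic


def version_liste(n=100_000):
    lignes = list(lignes_simulees(n))                # every line held in memory
    erreurs = [l for l in lignes if "ERROR" in l]    # plus a second full list
    return len(erreurs)


def version_generateur(n=100_000):
    erreurs = (l for l in lignes_simulees(n) if "ERROR" in l)
    return sum(1 for _ in erreurs)                   # one line alive at a time


nb_liste, pic_liste = pic_memoire(version_liste)
nb_gen, pic_gen = pic_memoire(version_generateur)
print(nb_liste, nb_gen)
print(f"list peak: {pic_liste / 1e6:.1f} MB, generator peak: {pic_gen / 1e3:.1f} kB")
```

Both versions count the same lines, but the peak of the generator version stays small and does not grow with `n`.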

yield from: delegate to another generator

output
def sous_compter(a, b):
    for i in range(a, b):
        yield i

def compter_tout():
    yield from sous_compter(0, 3)     # 0,1,2
    yield from sous_compter(10, 13)   # 10,11,12
    yield 99

print(list(compter_tout()))
# [0, 1, 2, 10, 11, 12, 99]

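yield from replaces the explicit for x in autre: yield x loop. It also forwards send() and throw() calls to the sub-generator, and the yield from expression itself evaluates to the sub-generator's return value.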
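One feature a plain for loop cannot reproduce is capturing the sub-generator's return value. Here is a minimal sketch; the function names `moyenne_partielle` and `rapport` are hypothetical.

```python
def moyenne_partielle(valeurs):
    """Yield each value, then return their mean once exhausted."""
    total = 0
    for v in valeurs:
        total += v
        yield v
    return total / len(valeurs)   # becomes the value of the `yield from` expression


def rapport(valeurs):
    moyenne = yield from moyenne_partielle(valeurs)
    yield f"mean={moyenne}"


resultat = list(rapport([2, 4, 6]))
print(resultat)   # [2, 4, 6, 'mean=4.0']
```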

send(): bidirectional generators

You can send values into a generator (rarely used but powerful).

output
def echo():
    while True:
        recu = yield
        print("Received:", recu)

g = echo()
next(g)        # start the generator
g.send("hello")
g.send("world")
# Prints: Received: hello / Received: world
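A slightly more useful illustration of send() is a generator that keeps state between calls, for example a running mean. This is a sketch with a hypothetical name, `moyenne_glissante`; each send() pushes a value in and receives the updated mean back.

```python
def moyenne_glissante():
    """Receive numbers through send() and yield the mean of all values so far."""
    total, n, moyenne = 0.0, 0, None
    while True:
        valeur = yield moyenne   # hand back the current mean, wait for the next value
        total += valeur
        n += 1
        moyenne = total / n


g = moyenne_glissante()
next(g)              # prime the generator: run up to the first yield
m1 = g.send(10)      # 10.0
m2 = g.send(20)      # 15.0
m3 = g.send(30)      # 20.0
g.close()            # stop the infinite loop cleanly
print(m1, m2, m3)
```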

Before Python 3.5 introduced async/await, asyncio coroutines were built on this mechanism (generators driven with yield from). Today we prefer async/await.

Pitfall #1: a generator can only be iterated once

output
g = (i*i for i in range(5))
print(list(g))   # [0, 1, 4, 9, 16]
print(list(g))   # []  -- WARNING, g is exhausted

Solution: recreate the generator, or materialize it into a list if you need it multiple times:

output
data = [i*i for i in range(5)]   # list, reusable
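If the data is too large to materialize, another option is to wrap the generator in a small class whose __iter__ builds a fresh generator each time. This is a sketch, and the class name `Carres` is hypothetical.

```python
class Carres:
    """Re-iterable view: every call to __iter__ creates a new generator."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return (i * i for i in range(self.n))


carres = Carres(5)
premiere = list(carres)   # [0, 1, 4, 9, 16]
seconde = list(carres)    # [0, 1, 4, 9, 16] again: nothing was exhausted
print(premiere, seconde)
```

The object stays lazy (no list is stored) while still supporting several passes, at the cost of recomputing the values on each pass.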

Profile and Find the Bottlenecks

NOTE (Objective): Apply the method from chapter 1 to our ETL pipeline: use cProfile to see where time is spent, snakeviz to visualize, and line_profiler to zoom in on the critical function.

Learning Objectives

TIP (By the end of this module): You will know how to profile a complete production script, read the report, identify the 3-4 lines that consume 90 % of the time, and write a clear “diagnostic report” for your team.

1. Global cProfile

output
python -m cProfile -o pipeline.prof pipeline_v0.py

To explore interactively with pstats:

output
python -m pstats pipeline.prof
% sort cumulative
% stats 15
output
42847310 function calls in 1083.42 seconds

 ncalls   tottime   cumtime  filename:lineno(function)
      1     0.000   1083.42  pipeline_v0.py:1(<module>)
      1     0.005   1083.41  pipeline_v0.py:42(main)
      1    654.21    750.18  pipeline_v0.py:11(traiter_transactions)
5000001     34.20     34.20  <built-in method strip>
5000001     28.45     28.45  <built-in method upper>
3750000     21.89     45.30  pipeline_v0.py:18(traiter_transactions/dict.get)
5000001     18.95     18.95  <built-in method float>
3750000    180.45    180.45  list.append (resultats)
      1    268.32    268.32  pipeline_v0.py:32(agreger)
      1     65.10     65.10  pipeline_v0.py:39(sauver)
NOTE (Reading): traiter_transactions = about 70 % of the time (750 s cumulative out of 1083 s). Inside: strip/upper (about 63 s), append (180 s), float() (19 s). agreger = about 25 %. sauver = 6 %. So priority #1 is traiter_transactions.
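The same report can also be produced from inside Python with the standard cProfile and pstats modules, which is convenient in notebooks or tests. The sketch below profiles a small hypothetical workload (`traiter` and `main` are stand-ins, not the course's pipeline) and captures the report as a string.

```python
import cProfile
import io
import pstats


def traiter(n):
    """Hypothetical workload standing in for the real pipeline."""
    return [str(i).strip().upper() for i in range(n)]


def main():
    traiter(50_000)


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Equivalent of "sort cumulative" followed by "stats 15" in the pstats browser
flux = io.StringIO()
stats = pstats.Stats(profiler, stream=flux)
stats.sort_stats("cumulative").print_stats(15)
rapport = flux.getvalue()
print(rapport)
```

Sorting by cumulative time surfaces the high-level functions that dominate the run; sorting by "tottime" instead highlights where the time is spent in a function's own body.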

2. Visualize with snakeviz

output
snakeviz pipeline.prof

A browser opens. “Sunburst” view: a central circle representing the entire program, divided into sectors proportional to time. Click to zoom.

On our profile we immediately see:

3. Zoom with line_profiler

Decorate the critical function:

output
@profile
def traiter_transactions(produits):
    ...
output
kernprof -l -v pipeline_v0.py
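One practical detail: kernprof -l injects the profile decorator into builtins at run time, so a script decorated with @profile raises a NameError when launched with plain python. A common workaround, sketched here with a hypothetical function body, is to fall back to a no-op decorator when the name is absent.

```python
import builtins

# `kernprof -l` adds `profile` to builtins; otherwise define a do-nothing stand-in
if not hasattr(builtins, "profile"):
    def profile(func):
        return func


@profile
def traiter_transactions(produits):
    """Hypothetical body: normalize each product label."""
    resultats = []
    for p in produits:
        resultats.append(p.strip().upper())
    return resultats


sortie = traiter_transactions(["  abc ", "def\n"])
print(sortie)   # ['ABC', 'DEF']
```

With this guard the same file runs unchanged under python and under kernprof -l -v, where the line-by-line timings are printed at exit.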

This article covers the most useful excerpts — the complete Advanced Python Performance course (11 chapters, 35 lessons, corrected exercises and final project) takes you all the way.

FAQ

How long does it take to learn Advanced Python Performance?
With a structured progression (11 chapters, 35 short practical lessons), you reach an operational level in a few weeks at 30 to 60 minutes per day. The key is to practice each concept immediately.
Are there any prerequisites?
It is best to be comfortable with Python fundamentals (functions, loops, the standard data structures): this content goes into depth, with real-world cases.
Where to start concretely?
Reproduce the commands in this article, then follow the complete Advanced Python Performance course: it chains the 35 lessons in order, with exercises and a final project.
