Advanced Python Performance Explained Simply (with Diagrams and Real Code)
Advanced Python Performance: The Essentials in One Article — Real Code, Diagrams and Concrete Steps, Excerpts from a 35-Lesson Course.
A guide that gets straight to the point: Advanced Python Performance dissected with diagrams, concrete examples and tested commands. Everything comes from a structured 11-chapter course — here are the highlights.
- Introduction and Installation
- Code Profiling
- Comprehensions, iterators, generators
- Multithreading vs Multiprocessing
- asyncio and coroutines
Course Overview
Learning Objectives
The Concrete Problem
Have you ever experienced this? You launch your Python script at 5 pm to compute a report, you go grab a coffee, and when you come back 30 minutes later the script is still running. Worse: your colleague who does the same thing in R or Julia finished in 3 minutes.
Typical Symptoms of a Slow Python Program
What You Will Learn to Do
The 3 Main Optimization Axes
| Axis | Question asked | Typical gain |
|---|---|---|
| 1. Algorithm | Is my code O(n²) when it could be O(n log n)? | 10× to 10,000× |
| 2. Data Structure | Am I using a list when a set would do? | 10× to 1,000× |
| 3. Concurrency / Parallelism | Can I run these 8 tasks at the same time? | 2× to 16× (depending on CPU/IO) |
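To make the data-structure axis concrete, here is a minimal sketch comparing membership tests on a list and on a set. The sizes and variable names are illustrative assumptions, and timings depend on the machine, so no figures are quoted.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1  # worst case for the list: the element sits at the very end

# `in` on a list scans elements one by one (O(n));
# `in` on a set is a hash lookup (O(1) on average).
t_list = timeit.timeit(lambda: target in as_list, number=200)
t_set = timeit.timeit(lambda: target in as_set, number=200)

print(f"list: {t_list:.4f} s, set: {t_set:.6f} s")
```

The gap widens as n grows, because the list lookup scales with the number of elements while the set lookup does not.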
The 80/20 Rule (Pareto)
In most programs, roughly 80% of execution time is spent in 20% of the code. Often the split is even 90/10 or 95/5.
Donald Knuth, the computing legend, summarized it in 1974:
"Premature optimization is the root of all evil."
Practical translation: first write clear and correct code, then measure it. If it is too slow, optimize only the hot spots. Otherwise you waste time making code unreadable for a speedup nobody will notice.
A Telling Before/After
Here is a real example: summing the squares of numbers from 0 to 10 million.
    # Naive version: Python loop
    total = 0
    for i in range(10_000_000):
        total += i * i
    # Time: ~1.2 seconds on a modern laptop

    # Vectorized version with NumPy
    import numpy as np
    # float64 avoids a silent integer overflow: the sum (~3.3e20) exceeds 2**63
    arr = np.arange(10_000_000, dtype=np.float64)
    total = (arr * arr).sum()
    # Time: ~0.04 seconds -> 30× faster

    # Compiled version with Numba @jit
    from numba import jit

    @jit(nopython=True)
    def somme_carres(n):
        total = 0.0  # float accumulator, for the same overflow reason
        for i in range(n):
            total += i * i
        return total
    # Time: ~0.01 seconds -> 120× faster (the first call also pays for compilation)

Same problem, same language, same mathematical result up to floating-point rounding, but 120 times faster. This is exactly what you will learn to do, systematically, on your own code.
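For this specific problem, the algorithm axis offers an even bigger win: a closed-form formula replaces the loop entirely. The sketch below is an illustration rather than part of the course material, and the function name somme_carres_formule is made up here. It also gives an exact reference value for checking the other versions, which matters because the exact sum is larger than a signed 64-bit integer can hold.

```python
def somme_carres_formule(n):
    # Closed form for 0² + 1² + ... + (n-1)², computed in O(1).
    return (n - 1) * n * (2 * n - 1) // 6

# Cross-check against the naive loop on a small input.
assert somme_carres_formule(1_000) == sum(i * i for i in range(1_000))

exact = somme_carres_formule(10_000_000)
print(exact)
print(exact > 2**63 - 1)  # True: the exact sum does not fit in a signed 64-bit integer
```

Python integers have arbitrary precision, so the pure-Python loop and this formula stay exact. Fixed-width integer types in NumPy or Numba do not.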
What You Will Build
Phase 1: Measure (ch. 0-1)
Install the tools, run your first profile, learn to read a cProfile report. You will be able to identify the bottleneck in under 5 minutes.
Phase 2: Optimize (ch. 2-7)
Generators, threading, multiprocessing, asyncio, NumPy, Numba, Cython, caching. The entire modern Python developer toolkit.
Generators with yield
This lesson introduces yield, the keyword that turns a function into a generator, and shows how to build lazy data-processing pipelines capable of handling multi-gigabyte files with only a few megabytes of RAM.
Learning Objectives
A First Generator
    def compter(max):
        n = 0
        while n < max:
            yield n  # suspend the function and return n
            n += 1

    # Call: does NOTHING, we get a generator back
    g = compter(5)
    print(type(g))  # <class 'generator'>

    # Consumption
    for n in g:
        print(n)  # 0 1 2 3 4
As soon as a function body contains yield, Python turns it into a generator factory. Calling it does not run the code: it returns a generator object. The body only executes, one step at a time, on each call to next().
yield vs return
| return | yield |
|---|---|
| Terminates the function | Suspends the function |
| State is lost | State is preserved |
| Returns a value (once) | Can yield values multiple times |
| Returns everything at once | Returns one element at a time |
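The suspend-and-resume behaviour in the table can be observed directly. This is a small tracing sketch; the generator etapes and the journal list are hypothetical names used only to record the order of execution.

```python
def etapes():
    # Hypothetical generator, used only to trace execution order.
    journal.append("start")
    yield 1
    journal.append("resumed after first yield")
    yield 2
    journal.append("end")

journal = []
g = etapes()
assert journal == []   # calling the function ran none of its body

premier = next(g)      # runs up to the first yield
second = next(g)       # resumes right after it, local state intact
reste = list(g)        # runs to the end; nothing more is yielded

print(premier, second, reste)  # 1 2 []
print(journal)  # ['start', 'resumed after first yield', 'end']
```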
Real-World Use Case: Reading a Large Log File
    from collections import Counter

    def lire_log(chemin):
        """Generator that yields each line without loading everything."""
        with open(chemin, encoding="utf-8") as f:
            for ligne in f:
                yield ligne.rstrip("\n")

    def filtrer_erreurs(lignes):
        """Generator that keeps only ERROR lines."""
        for ligne in lignes:
            if "ERROR" in ligne:
                yield ligne

    def extraire_codes(lignes):
        """Generator that yields the HTTP code of each line."""
        for ligne in lignes:
            try:
                code = int(ligne.split()[-1])
                yield code
            except (ValueError, IndexError):
                continue

    # Pipeline: no step loads the whole file into memory, even at 50 GB
    lignes = lire_log("acces.log")
    erreurs = filtrer_erreurs(lignes)
    codes = extraire_codes(erreurs)

    # Final consumption
    print(Counter(codes).most_common(5))
    # [(500, 1284), (502, 412), (503, 309), ...]
This is the Python equivalent of the shell pipeline cat file | grep ERROR | awk '{print $NF}' | sort | uniq -c. Readable, modular, constant memory.
yield from: delegate to another generator
    def sous_compter(a, b):
        for i in range(a, b):
            yield i

    def compter_tout():
        yield from sous_compter(0, 3)    # 0,1,2
        yield from sous_compter(10, 13)  # 10,11,12
        yield 99

    print(list(compter_tout()))
    # [0, 1, 2, 10, 11, 12, 99]
yield from avoids the for x in autre: yield x loop and also correctly handles exceptions and sent values.
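One more difference from the manual loop: the yield from expression evaluates to the sub-generator's return value, which a plain for loop discards. The sketch below illustrates this with hypothetical helpers (lire_bloc, lire_tout) that are not part of the course code.

```python
def lire_bloc(valeurs):
    # Yields each value, then returns how many were produced.
    count = 0
    for v in valeurs:
        yield v
        count += 1
    return count

def lire_tout(blocs, compteurs):
    for bloc in blocs:
        # yield from re-yields every item AND captures the return value.
        n = yield from lire_bloc(bloc)
        compteurs.append(n)

compteurs = []
elements = list(lire_tout([[1, 2], [3]], compteurs))
print(elements)   # [1, 2, 3]
print(compteurs)  # [2, 1]
```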
send(): bidirectional generators
You can send values into a generator (rarely used but powerful).
    def echo():
        while True:
            recu = yield
            print("Received:", recu)

    g = echo()
    next(g)  # start the generator
    g.send("hello")
    g.send("world")
    # Prints: Received: hello / Received: world
This mechanism was the foundation of asyncio coroutines before Python 3.5 introduced async/await, which is what we use today.
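For a slightly more practical use of send() than an echo, here is a sketch of a generator that keeps a running mean of the values it receives. The name moyenne_cumulee is an illustrative assumption, not something from the course.

```python
def moyenne_cumulee():
    total, n = 0.0, 0
    moyenne = None
    while True:
        # Hand back the current mean, then wait for the next value.
        valeur = yield moyenne
        total += valeur
        n += 1
        moyenne = total / n

g = moyenne_cumulee()
next(g)  # advance to the first yield before sending anything
moyennes = [g.send(v) for v in (10, 20, 60)]
print(moyennes)  # [10.0, 15.0, 30.0]
```

Each send() both delivers a value to the paused generator and returns whatever the generator yields next, which is why the initial next(g) is needed.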
Pitfall #1: a generator can only be iterated once
    g = (i*i for i in range(5))
    print(list(g))  # [0, 1, 4, 9, 16]
    print(list(g))  # [] -- WARNING, g is exhausted
Solution: recreate the generator, or materialize it into a list if you need it multiple times:
data = [i*i for i in range(5)] # list, reusable
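When the data is too large to materialize, one way to recreate the generator on demand is to wrap it in a small factory function. This is a minimal sketch, and the name carres is hypothetical.

```python
def carres(n):
    # Each call builds a fresh, independent generator.
    return (i * i for i in range(n))

premiere_passe = sum(carres(5))   # consumes one generator
seconde_passe = max(carres(5))    # a new generator, not the exhausted one

print(premiere_passe, seconde_passe)  # 30 16
```

The trade-off is that the values are recomputed on every pass, whereas a list pays the memory cost once.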
Profile and Find the Bottlenecks
Learning Objectives
1. Global cProfile
python -m cProfile -o pipeline.prof pipeline_v0.py
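The same kind of report can also be produced from inside Python, which is convenient when only part of a program should be profiled. The sketch below uses the standard cProfile and pstats modules; the travail function is a hypothetical stand-in for the real pipeline.

```python
import cProfile
import io
import pstats

def travail():
    # Hypothetical workload standing in for the pipeline's main().
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
resultat = travail()
profiler.disable()

# Sort by cumulative time and print the 15 most expensive entries.
flux = io.StringIO()
stats = pstats.Stats(profiler, stream=flux)
stats.sort_stats("cumulative").print_stats(15)
rapport = flux.getvalue()
print(rapport)
```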
To explore interactively with pstats:
    python -m pstats pipeline.prof
    % sort cumulative
    % stats 15
    42_847_310 function calls in 1083.42 seconds

    ncalls    tottime   cumtime   filename:lineno(function)
    1         0.000     1083.42   pipeline_v0.py:1(<module>)
    1         0.005     1083.41   pipeline_v0.py:42(main)
    1         654.21    750.18    pipeline_v0.py:11(traiter_transactions)
    5000001   34.20     34.20     <built-in method strip>
    5000001   28.45     28.45     <built-in method upper>
    3750000   21.89     45.30     pipeline_v0.py:18(traiter_transactions/dict.get)
    5000001   18.95     18.95     <built-in method float>
    3750000   180.45    180.45    list.append (resultats)
    1         268.32    268.32    pipeline_v0.py:32(agreger)
    1         65.10     65.10     pipeline_v0.py:39(sauver)

traiter_transactions accounts for about 70% of the time (750 s cumulative out of 1083 s). Inside it: strip/upper (about 63 s), append (180 s), float() (19 s). agreger takes 25% and sauver 6%. So priority #1 is traiter_transactions.
2. Visualize with snakeviz
snakeviz pipeline.prof
A browser opens. “Sunburst” view: a central circle representing the entire program, divided into sectors proportional to time. Click to zoom.
On our profile we immediately see:
3. Zoom with line_profiler
Decorate the critical function:
    @profile
    def traiter_transactions(produits):
        ...

    kernprof -l -v pipeline_v0.py
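kernprof makes the profile decorator available as a builtin while it runs the script, so the same file fails with a NameError under plain python. A common workaround, sketched here on the assumption that you want to run the script both ways, is a no-op fallback. The function body below is a placeholder, not the course's implementation.

```python
import builtins

try:
    # Present only when the script is launched through kernprof.
    profile = builtins.profile
except AttributeError:
    def profile(func):
        # No-op fallback: return the function unchanged.
        return func

@profile
def traiter_transactions(produits):
    # Placeholder body for illustration only.
    return [p.strip().upper() for p in produits]

print(traiter_transactions([" abc ", "def\n"]))  # ['ABC', 'DEF']
```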
This article covers the most useful excerpts — the complete Advanced Python Performance course (11 chapters, 35 lessons, corrected exercises and final project) takes you all the way.