EDA with pandas, NumPy, Matplotlib & Seaborn: The 9 Key Steps from Zero to Operational

EDA with pandas, NumPy, Matplotlib and Seaborn: the essentials in one article, with real code, diagrams and concrete steps excerpted from a 44-lesson course.

Anyone can learn exploratory data analysis (EDA) with pandas, NumPy, Matplotlib and Seaborn, provided they follow the steps in the right order. We have condensed a complete 44-lesson course into a clear path, with the most useful code snippets.

tl;dr
  • Introduction to Data Analysis
  • Introduction and installation
  • Getting started with Pandas DataFrames
  • Cleaning and Preparing Data
  • Descriptive Statistics and Aggregation
~$ cat ./parcours.md # EDA pandas NumPy Matplotlib Seaborn — 9 chapters
01
Introduction to Data Analysis
  → Data Analysis: The Job of the Century
  → Chapter 00: Course data sources
02
Introduction and installation
  → Why EDA and these four libraries?
  → Install your working environment
  + 2 more lessons
03
Getting started with Pandas DataFrames
  → Create and load a DataFrame (CSV, Excel, JSON)
  → Explore a DataFrame: head, info, describe, shape
  + 1 more lesson
04
Clean and Prepare Data
  → Detect and handle missing values
  → Remove duplicates and correct data types
  + 2 more lessons
05
Descriptive Statistics and Aggregation
  → Central tendency and dispersion: mean, median, standard deviation
  → Correlation and covariance between variables
  + 1 more lesson
06
Visualization with Matplotlib
  → Introduction to Matplotlib: Figure, Axes and subplots
  → Essential charts: bars, lines, scatter
  + 1 more lesson
07
Advanced Visualization with Seaborn
  → Introduction to Seaborn: histplot, boxplot, violinplot
  → Visualize relationships: scatterplot and correlation heatmap
  + 2 more lessons
08
Complete Exploratory Analysis
  → EDA Methodology: the 5 steps of a good analysis
  → Detect outliers and anomalies in the data
  + 1 more lesson
🏁
Final project (+ 1 chapter along the way)
  → You leave with a concrete and demonstrable project

Set up your working environment

NOTE (What you will learn): Choose between Google Colab (zero installation, in the browser) and Anaconda + Jupyter (local installation), then install NumPy, Pandas, Matplotlib and Seaborn, and verify that everything works with a test script.

0. Google Colab — The zero-installation option

Google Colaboratory (Colab) is a free Jupyter environment that runs directly in your browser, with nothing to install. It runs on Google’s servers and comes with NumPy, Pandas, Matplotlib and Seaborn pre-installed.

TIP (Analogy): Google Colab is like working in a fully equipped office that Google lends you for free. You bring nothing: the desk, tools and libraries are already there. You open your browser and start immediately.

How to get started with Google Colab

Check the pre-installed versions in Colab

In the first cell of your Colab notebook, copy and run this code:

output
import numpy as np
import pandas as pd
import matplotlib
import seaborn as sns

print("NumPy     :", np.__version__)
print("Pandas    :", pd.__version__)
print("Matplotlib:", matplotlib.__version__)
print("Seaborn   :", sns.__version__)
print("\nEverything is ready. Happy analysis!")

To load your own data into Colab, three methods are available:

output
# Method 1: Upload a file from your computer
from google.colab import files
uploaded = files.upload()   # a file-selection dialog opens

import pandas as pd
import io
# 'mon_fichier.csv' is a placeholder: the key must match the name of the file you uploaded
df = pd.read_csv(io.BytesIO(uploaded['mon_fichier.csv']))

# Method 2: Read from Google Drive
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/mon_fichier.csv')

# Method 3: Read directly from a public URL
df = pd.read_csv('https://raw.githubusercontent.com/exemple/repo/main/data.csv')
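
To see why Method 1 wraps the upload in io.BytesIO, here is a minimal sketch that runs outside Colab. It assumes that files.upload() returns a dictionary mapping each file name to its raw bytes, as the snippet above suggests. The `uploaded` dictionary below and its contents are a hypothetical stand-in, not real data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the result of files.upload():
# a dict mapping each uploaded file name to its raw bytes.
uploaded = {"mon_fichier.csv": b"city,temperature\nParis,21\nLyon,24\n"}

# Take the first (here: only) file name instead of hard-coding it.
filename = next(iter(uploaded))

# read_csv expects a path or a file-like object, so the bytes are wrapped in BytesIO.
df = pd.read_csv(io.BytesIO(uploaded[filename]))

print(filename)
print(df)
```

Reading the key from the dictionary avoids a KeyError when the uploaded file has a different name than the one written in the code.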

Python only

Anaconda (recommended)

TIP (Analogy): Choosing between plain Python and Anaconda is like choosing between buying IKEA furniture piece by piece or buying a fully furnished apartment. Both work, but Anaconda saves you a lot of time at the start.

2. Step 1 — Download and install Anaconda

Download

Installation on Windows

WARNING (Windows only): If you do not check “Add Anaconda to PATH”, always use the Anaconda Prompt (not the regular Windows terminal) to run your conda and jupyter commands.

Verify the installation

Open Anaconda Prompt (Windows) or Terminal (macOS/Linux) and type:

output
conda --version

Next, create and activate a dedicated environment for the course:

output
# Create an environment named "eda-cours" with Python 3.11
conda create -n eda-cours python=3.11

# Activate the environment
conda activate eda-cours

# Verify that the environment is active (the name appears in parentheses)
# (eda-cours) C:\Users\votre_nom>
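
Besides the prompt prefix, the active environment can also be checked from Python itself. The sketch below is one possible way to do it; it relies on the CONDA_DEFAULT_ENV variable that conda normally sets on activation, and the helper name `describe_interpreter` is made up for this example.

```python
import os
import sys


def describe_interpreter():
    """Return the interpreter path, the active conda environment and the Python version."""
    return {
        "executable": sys.executable,
        # Assumption: conda sets CONDA_DEFAULT_ENV when an environment is activated.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "(no conda environment active)"),
        "python_version": f"{sys.version_info.major}.{sys.version_info.minor}",
    }


for key, value in describe_interpreter().items():
    print(f"{key:15}: {value}")
```

If the environment from the commands above is active, `conda_env` should read eda-cours and the executable path should point inside that environment.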

Option A — With conda (recommended)

output
# Install all libraries in one command
conda install numpy pandas matplotlib seaborn jupyter -y
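
After the installation, a short test script can confirm that the four libraries are visible to the interpreter. This is a minimal sketch using only the standard library; the helper name `missing_packages` is an illustrative choice, not part of the course material.

```python
import importlib.util

REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn"]


def missing_packages(names):
    """Return the names that cannot be found by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Unlike a plain import, find_spec only locates each package without loading it, so the check stays fast and does not stop at the first missing library.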

Launch from the terminal

output
# Make sure your environment is active
conda activate eda-cours

# Launch Jupyter Notebook
jupyter notebook

Chapter 08 – Introduction to data-science libraries

NOTE: Module objectives
  • Understand what a Python library is
  • Import a library (import)
  • Import a specific module from a library (from ... import)
  • Use aliases (import numpy as np)
  • Use the math library as a first example
  • Install, update and verify a library’s configuration with pip

1. What is a library?

Libraries are collections of ready-made modules that let you perform complex operations in just a few lines. There are many of them:

💻 CPU libraries Standard

🌞 GPU libraries NVIDIA RAPIDS

2. Importing a library — the math library

The math library is the perfect example for understanding imports. It is built into Python; no installation is required.

Official documentation: docs.python.org/3/library/math.html

2.1 Full import

output
import math

# Round up
print(math.ceil(0.1))    # round up
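
The two other import forms named in the module objectives (from ... import and aliases) might look like this with the same math library. It is a small sketch; the alias `m` is an arbitrary choice made for the example.

```python
# Selective import: bring one name into scope and use it without a prefix
from math import sqrt

# Alias: import the whole module under a shorter name
import math as m

print(sqrt(16))        # 4.0
print(m.floor(2.7))    # 2
print(m.pi)
```

The alias form is the one used throughout data science code, for example import numpy as np and import pandas as pd.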

NOTE (Rule): %command applies to a single line. %%command applies to the entire cell. The %% command must be on the first line of the cell.

6.1 Measuring execution time

Command   | Description                                 | Example
%time     | Measures the time of a single line          | %time sum(range(1_000_000))
%%time    | Measures the time of the entire cell        | Place on the first line of the cell
%timeit   | Runs the line N times, returns the average  | %timeit sum(range(1_000_000))
%%timeit  | Runs the cell N times, returns the average  | Place on the first line of the cell
output
%%time
# %%time — measures the TOTAL time of the cell (single execution)
import numpy as np
a = np.random.randn(1_000_000)
result = np.sort(a)
output
%timeit np.random.randn(1_000_000)
# %timeit — runs the line multiple times for an accurate measurement
output
%%timeit
# %%timeit — precise measurement of the entire cell (multiple executions)
import numpy as np
a = np.random.randn(10_000)
np.sort(a)
TIP: When to use what?
%%time → to quickly measure a cell (1 execution)
%%timeit → for a reliable benchmark (multiple executions, average)
%timeit → to compare two expressions on a single line
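
The magic commands above only work inside IPython or Jupyter. In a plain Python script, the standard-library timeit module gives a comparable measurement. The sketch below times the same expression as the %time example; the function name `total` and the repeat counts are arbitrary choices.

```python
import timeit


def total():
    """Same workload as the %time example: sum the first million integers."""
    return sum(range(1_000_000))


# Call the function 5 times per measurement, and take 3 measurements.
runs = timeit.repeat(total, number=5, repeat=3)

# Each entry in `runs` is the total time for 5 calls, so divide to get the per-call time.
per_call = [t / 5 for t in runs]

print(f"best of 3: {min(per_call) * 1000:.2f} ms per call")
print(f"mean     : {sum(per_call) / len(per_call) * 1000:.2f} ms per call")
```

Taking several measurements and looking at the best one reduces the influence of other processes running on the machine.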

6.2 Profiling — detailed performance analysis

output
%prun sum(range(1_000_000))
# Displays the time spent in each called function
output
%%prun
# Profiling of the entire cell
import numpy as np
a = np.random.randn(100_000)
b = np.sort(a)
c = np.cumsum(b)
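
Outside a notebook, the standard-library cProfile and pstats modules offer a comparable per-function breakdown. This is a minimal sketch; the function `work` is a made-up workload used only to have something to profile.

```python
import cProfile
import io
import pstats


def work():
    """Made-up workload: square, sort and sum 100,000 integers."""
    data = [x * x for x in range(100_000)]
    return sum(sorted(data))


profiler = cProfile.Profile()
profiler.enable()
result = work()
profiler.disable()

# Collect the report in a string buffer, sorted by cumulative time, top 5 entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
print(buffer.getvalue())
```

Sorting by cumulative time puts the functions that dominate the total runtime at the top of the report, which is usually where optimisation should start.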

Chapter 08 – Practice 2: Pandas — DataFrame manipulation (CPU)

NOTE: Pandas
  • Extremely popular in data science
  • Allows manipulation of very large data tables (a kind of Excel on steroids)
  • Enormous number of features (filters, transformations, analyses…)
  • Bridges to other libraries (ML, data viz…)

1. Create a DataFrame

1.1 From a dictionary

output
import pandas as pd

produitsDict = {
    'smartphone': {'prix': 1000, 'enStock': True},
    'chaussures':  {'prix': 100,  'enStock': False},
    'console':     {'prix': 400,  'enStock': True}
}
print(produitsDict)

df = pd.DataFrame(produitsDict)
df
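
With a dictionary of dictionaries, the outer keys become columns and the inner keys become the row index. If one row per product is preferred, DataFrame.from_dict with orient="index" is one way to get it. The sketch below reuses the dictionary from the example above; the name `df_rows` is chosen for illustration.

```python
import pandas as pd

produitsDict = {
    'smartphone': {'prix': 1000, 'enStock': True},
    'chaussures': {'prix': 100, 'enStock': False},
    'console': {'prix': 400, 'enStock': True},
}

# Default constructor: outer keys become columns, inner keys become the row index.
df = pd.DataFrame(produitsDict)
print(df.shape)        # (2, 3)

# orient="index": outer keys become the row index instead (one row per product).
df_rows = pd.DataFrame.from_dict(produitsDict, orient="index")
print(df_rows.shape)   # (3, 2)
print(df_rows)
```

The second layout is usually easier to filter and aggregate, because each column then holds a single kind of value (a price or a stock flag).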

1.2 From a list of lists

output
pays = [
    [70, 55, 85],           # Population in millions
    [0.901, 0.922, 0.936],  # HDI
    [2091, 2077, 3045]      # GDP
]
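# Each inner list is one row (one indicator); label the rows so the table is readable
df = pd.DataFrame(pays, index=['Population', 'HDI', 'GDP'], columns=['France', 'England', 'Germany'])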
df

1.3 Import a CSV file

output
import pandas as pd

data = pd.read_csv('metal-bands.csv', encoding='latin-1', sep=';')
data.head()
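
The sep and encoding arguments matter more than they look. The sketch below shows what happens with and without sep=';' on a tiny in-memory file. The two rows are made-up sample data, not taken from metal-bands.csv; only the column names band_name and fans come from the examples in this article.

```python
import io

import pandas as pd

# Made-up sample rows, encoded in latin-1 and separated by semicolons.
raw = "band_name;fans\nÉclair Noir;120\nForêt Rouge;45\n".encode("latin-1")

# Correct settings: two columns are detected and accents are decoded properly.
good = pd.read_csv(io.BytesIO(raw), encoding="latin-1", sep=";")
print(good.shape)            # (2, 2)

# Default separator (comma): every line ends up in a single column.
bad = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(bad.shape)             # (2, 1)
print(list(bad.columns))     # ['band_name;fans']
```

A DataFrame with a single column whose name contains semicolons is the typical sign that the separator was not specified.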

2. First look at the data

output
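# In a notebook, only the last expression of a cell is displayed automatically:
# run these lines in separate cells, or wrap them in print()
data.head(3)          # first 3 rows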
data.info()           # types, non-null values, memory
data.dtypes           # type of each column
data.fans.dtypes      # type of a specific column
data.shape            # (rows, columns)
len(data)             # number of rows
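
Two other calls are commonly used for a first look: describe() for summary statistics and isna().sum() for counting missing values. The sketch below uses a small toy DataFrame built for illustration only (it is not the metal-bands dataset), so it can run without any file.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only, with one missing value in "fans".
toy = pd.DataFrame({
    "band_name": ["A", "B", "C", "D"],
    "fans": [120, 45, np.nan, 300],
})

print(toy.shape)                # (4, 2)
print(toy.isna().sum())         # number of missing values per column
print(toy["fans"].describe())   # count, mean, std, min, quartiles, max
```

Note that describe() ignores missing values: here the count is 3, not 4, which is itself a quick hint that the column is incomplete.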

3. Navigating a DataFrame — iloc and loc

NOTE (Rule): iloc = numeric index (position). loc = label index (row/column name).

3.1 Select one or more columns

output
data['band_name'].head(10)             # 1 column
data[['band_name', 'fans']].head(15)   # multiple columns

3.2 iloc — by numeric position

output
data.iloc[0, 0]        # row 0, column 0
data.iloc[0:5, 0]      # rows 0-4, column 0
data.iloc[0, 0:5]      # row 0, columns 0-4
data.iloc[0:3, 0:5]    # 3-row × 5-column block
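
For comparison, here is how loc selects the same kind of data by label. This is a sketch on a toy DataFrame created for the example (the band names, fan counts and the "formed" column are invented), with the names used as row labels.

```python
import pandas as pd

# Toy data for illustration only; the band names serve as row labels.
toy = pd.DataFrame(
    {"fans": [120, 45, 300], "formed": [1999, 2005, 1987]},
    index=["Alpha", "Beta", "Gamma"],
)

# iloc: integer positions, end of the slice excluded (rows 0 and 1).
by_position = toy.iloc[0:2, 0]

# loc: labels, end of the slice included (rows "Alpha" through "Beta").
by_label = toy.loc["Alpha":"Beta", "fans"]

print(by_position.equals(by_label))   # True
print(toy.loc["Gamma", "formed"])     # 1987
```

The main trap is the slice boundary: iloc follows the usual Python convention and excludes the end position, while loc includes the end label.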

Go further

This article covers the most useful snippets. The complete course on EDA with pandas, NumPy, Matplotlib and Seaborn (12 chapters, 44 lessons, exercises with solutions and a final project) takes you all the way.


FAQ

How long does it take to learn EDA with pandas, NumPy, Matplotlib and Seaborn?
With a structured progression (12 chapters, 44 short practical lessons), you reach an operational level in a few weeks at 30–60 minutes per day. The key is to practice each concept immediately.
Are there any prerequisites?
Basic computer knowledge is enough. If you can use a terminal and read simple code, you are ready.
Where should you start in practice?
Reproduce the commands in this article, then follow the complete course: it covers the 44 lessons in order, with exercises and a final project.
