EDA with pandas, NumPy, Matplotlib & Seaborn: The 9 Key Steps from Zero to Operational
Exploratory data analysis (EDA) with pandas, NumPy, Matplotlib and Seaborn: the essentials in one article, with real code, diagrams and concrete steps excerpted from a 44-lesson course.
Anyone can learn EDA with pandas, NumPy, Matplotlib and Seaborn, provided they follow the steps in the right order. We have condensed a complete 44-lesson course into a clear path, with the most useful code snippets.
- Introduction to Data Analysis
- Introduction and installation
- Getting started with Pandas DataFrames
- Cleaning and Preparing Data
- Descriptive Statistics and Aggregation
Set up your working environment
0. Google Colab — The zero-installation option
Google Colaboratory (Colab) is a free Jupyter environment that runs directly in your browser, with nothing to install. It runs on Google’s servers and comes with NumPy, pandas, Matplotlib and Seaborn pre-installed.
How to get started with Google Colab
Check the pre-installed versions in Colab
In the first cell of your Colab notebook, copy and run this code:
import numpy as np
import pandas as pd
import matplotlib
import seaborn as sns
print("NumPy :", np.__version__)
print("Pandas :", pd.__version__)
print("Matplotlib:", matplotlib.__version__)
print("Seaborn :", sns.__version__)
print("\nEverything is ready. Happy analysis!")
# Method 1: Upload a file from your computer
from google.colab import files
uploaded = files.upload() # a file-selection dialog opens
import pandas as pd
import io
df = pd.read_csv(io.BytesIO(uploaded['mon_fichier.csv']))
# Method 2: Read from Google Drive
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/mon_fichier.csv')
# Method 3: Read directly from a public URL
df = pd.read_csv('https://raw.githubusercontent.com/exemple/repo/main/data.csv')
Python only
Anaconda (recommended)
2. Step 1 — Download and install Anaconda
Download
Installation on Windows
Verify the installation
Open Anaconda Prompt (Windows) or Terminal (macOS/Linux) and type:
conda --version
# Create an environment named "eda-cours" with Python 3.11
conda create -n eda-cours python=3.11
# Activate the environment
conda activate eda-cours
# Verify that the environment is active (the name appears in parentheses)
# (eda-cours) C:\Users\votre_nom>
Option A — With conda (recommended)
# Install all libraries in one command
conda install numpy pandas matplotlib seaborn jupyter -y
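Once the install command has finished, it can be worth confirming from Python that each library is visible in the active environment. The following is a minimal sketch, assuming the distribution names match the package names passed to conda; the helper `installed_versions` is a hypothetical name introduced here for illustration.

```python
from importlib import metadata

# Packages requested in the install command above (jupyter left out for brevity).
LIBRARIES = ["numpy", "pandas", "matplotlib", "seaborn"]


def installed_versions(names):
    """Map each name to its installed version, or None when it is not found."""
    report = {}
    for name in names:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


report = installed_versions(LIBRARIES)
for name, version in report.items():
    print(f"{name:<12}{version or 'NOT INSTALLED'}")
```

A library reported as NOT INSTALLED usually means the environment was not active when the install command ran.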
Launch from the terminal
# Make sure your environment is active
conda activate eda-cours
# Launch Jupyter Notebook
jupyter notebook
Chapter 08 – Introduction to data-science libraries
- Understand what a Python library is
- Import a library (import)
- Import a specific module from a library (from ... import)
- Use aliases (import numpy as np)
- Use the math library as a first example
- Install, update and verify a library’s configuration with PIP
1. What is a library?
Libraries are collections of ready-made modules that let you perform complex operations in just a few lines. There are many of them:
💻 CPU libraries Standard
🌞 GPU libraries NVIDIA RAPIDS
2. Importing a library — the math library
The math library is the perfect example for understanding imports. It is built into Python; no installation is required.
Official documentation: docs.python.org/3/library/math.html
2.1 Full import
import math
# Round up
print(math.ceil(0.1))  # prints 1
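The chapter objectives also list the targeted import and the alias. A minimal sketch of all three forms side by side, using only the math library; the variable names are illustrative, not taken from the course.

```python
import math                    # full import: names are reached through the module
from math import floor, sqrt   # targeted import: the names are used directly
import math as m               # alias: a shorter handle on the same module

rounded_up = math.ceil(0.1)    # smallest integer >= 0.1
rounded_down = floor(2.7)      # largest integer <= 2.7
root = sqrt(16)                # square root, returned as a float
same_module = m is math        # the alias points to the very same module object

print(rounded_up, rounded_down, root, same_module)
```

The alias form is the one used for NumPy and pandas throughout the rest of the article (np, pd).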
%command applies to a single line. %%command applies to the entire cell. The %% command must be on the first line of the cell.
6.1 Measuring execution time
| Command | Description | Example |
|---|---|---|
| %time | Measures the time of a single line | %time sum(range(1_000_000)) |
| %%time | Measures the time of the entire cell | Place on the first line of the cell |
| %timeit | Runs the line N times, returns the average | %timeit sum(range(1_000_000)) |
| %%timeit | Runs the cell N times, returns the average | Place on the first line of the cell |
%%time
# %%time: measures the TOTAL time of the cell (single execution)
import numpy as np
a = np.random.randn(1_000_000)
result = np.sort(a)
# %timeit: runs the line multiple times for an accurate measurement
%timeit np.random.randn(1_000_000)
%%timeit
# %%timeit: precise measurement of the entire cell (multiple executions)
import numpy as np
a = np.random.randn(10_000)
np.sort(a)
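The magic commands only exist inside IPython and Jupyter. In a plain Python script, a comparable measurement can be sketched with the standard-library timeit module. The repeat and number values below are arbitrary choices for illustration, not values used by %timeit.

```python
import timeit

# Same expression as in the %timeit row of the table above.
statement = "sum(range(1_000_000))"

# 3 independent runs of 5 executions each (both values are arbitrary).
runs = timeit.repeat(stmt=statement, repeat=3, number=5)

# Each entry is the total time for 5 executions, so divide to get one call.
best_per_call = min(runs) / 5
print(f"best of 3 runs: {best_per_call * 1000:.2f} ms per call")
```

Taking the minimum rather than the mean is a common convention with timeit, since slower runs mostly reflect interference from other processes.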
- %%time → to quickly measure a cell (1 execution)
- %%timeit → for a reliable benchmark (multiple executions, average)
- %timeit → to compare two expressions on a single line
6.2 Profiling: detailed performance analysis
# Displays the time spent in each called function
%prun sum(range(1_000_000))
%%prun
# Profiling of the entire cell
import numpy as np
a = np.random.randn(100_000)
b = np.sort(a)
c = np.cumsum(b)
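Outside a notebook, a similar per-function breakdown can be obtained with the standard-library cProfile and pstats modules. This is a sketch under assumptions: `workload` is a hypothetical stand-in for the code being profiled, and the choice of five report lines is arbitrary.

```python
import cProfile
import io
import pstats


def workload():
    """Hypothetical stand-in for the cell being profiled."""
    squares = [n * n for n in range(100_000)]
    return sum(sorted(squares, reverse=True))


profiler = cProfile.Profile()
profiler.enable()
total = workload()
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that dominate the total run at the top, which is usually the first thing to look at.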
Chapter 08 – Practice 2: Pandas — DataFrame manipulation (CPU)
- Extremely popular in data science
- Allows manipulation of very large data tables (a kind of Excel on steroids)
- Enormous number of features (filters, transformations, analyses…)
- Bridges to other libraries (ML, data viz…)
1. Create a DataFrame
1.1 From a dictionary
import pandas as pd
produitsDict = {
'smartphone': {'prix': 1000, 'enStock': True},
'chaussures': {'prix': 100, 'enStock': False},
'console': {'prix': 400, 'enStock': True}
}
print(produitsDict)
df = pd.DataFrame(produitsDict)
df
1.2 From a list of lists
pays = [
[70, 55, 85], # Population in millions
[0.901, 0.922, 0.936], # HDI
[2091, 2077, 3045] # GDP
]
df = pd.DataFrame(pays, columns=['France', 'England', 'Germany'])
df
1.3 Import a CSV file
import pandas as pd
data = pd.read_csv('metal-bands.csv', encoding='latin-1', sep=';')
data.head()
2. First look at the data
data.head(3)      # first 3 rows
data.info()       # types, non-null values, memory
data.dtypes       # type of each column
data.fans.dtypes  # type of a specific column
data.shape        # (rows, columns)
len(data)         # number of rows
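Without the metal-bands.csv file at hand, the same calls can be tried on a tiny in-memory CSV. In this sketch the column names band_name and fans come from the article, while the rows are made up for illustration and are not taken from the real dataset.

```python
import io

import pandas as pd

# Made-up rows for illustration only; they are not taken from metal-bands.csv.
raw = "band_name;fans\nAlpha;120\nBeta;45\nGamma;300\nDelta;80\n".encode("latin-1")

# Same options as the article's call: latin-1 encoding and ';' as separator.
data = pd.read_csv(io.BytesIO(raw), encoding="latin-1", sep=";")

print(data.head(3))   # first 3 rows
print(data.shape)     # (rows, columns)
print(data.dtypes)    # type of each column
print(len(data))      # number of rows
data.info()           # prints types, non-null counts and memory usage
```

In a notebook only the last expression of a cell is displayed automatically, which is why the calls are wrapped in print() here.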
3. Navigating a DataFrame — iloc and loc
iloc = numeric index (position). loc = label index (row/column name).
3.1 Select one or more columns
data['band_name'].head(10)            # 1 column
data[['band_name', 'fans']].head(15)  # multiple columns
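The two selections above do not return the same kind of object, which matters for the methods available afterwards. A minimal sketch on a hypothetical three-row table; only the column names band_name and fans come from the article, the origin column and all values are invented.

```python
import pandas as pd

# Hypothetical miniature of the dataset, invented for illustration.
data = pd.DataFrame({
    "band_name": ["Alpha", "Beta", "Gamma"],
    "fans": [120, 45, 300],
    "origin": ["A", "B", "C"],
})

one_column = data["band_name"]              # single brackets give a Series
two_columns = data[["band_name", "fans"]]   # a list of names gives a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__, two_columns.shape)
```

Passing a one-element list, as in data[["band_name"]], keeps the result a DataFrame with a single column.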
3.2 iloc — by numeric position
data.iloc[0, 0]      # row 0, column 0
data.iloc[0:5, 0]    # rows 0-4, column 0
data.iloc[0, 0:5]    # row 0, columns 0-4
data.iloc[0:3, 0:5]  # 3-row × 5-column block
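The definition above also mentions loc, which selects by label instead of position. A minimal sketch contrasting the two on a hypothetical table with string row labels; the labels and values are invented for illustration.

```python
import pandas as pd

# Hypothetical table with explicit row labels a, b, c, d.
data = pd.DataFrame(
    {"band_name": ["Alpha", "Beta", "Gamma", "Delta"], "fans": [120, 45, 300, 80]},
    index=["a", "b", "c", "d"],
)

by_position = data.iloc[0:2, 0]             # positions 0 and 1, end excluded
by_label = data.loc["a":"b", "band_name"]   # labels a through b, end included
cell = data.loc["c", "fans"]                # one value, addressed by row and column label

print(by_position.equals(by_label))
print(cell)
print(len(data.iloc[0:2]), len(data.loc["a":"c"]))
```

The slice endpoints behave differently: iloc excludes the end position as Python slices do, while loc includes the end label.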
This article covers the most useful snippets. The complete course on EDA with pandas, NumPy, Matplotlib and Seaborn (12 chapters, 44 lessons, corrected exercises and final project) takes you all the way.
FAQ
How long does it take to learn EDA pandas NumPy Matplotlib Seaborn?
Are there any prerequisites?
Where to start concretely?