AWS Data Engineering Bootcamp Explained Simply (with Diagrams and Real Code)

AWS Data Engineering Bootcamp: The Essentials in One Article — Real Code, Diagrams, and Concrete Steps, Excerpts from a 17-Lesson Course.

AWS Data Engineering Bootcamp Explained Simply (with Diagrams and Real Code)

A no-nonsense guide: AWS Data Engineering Bootcamp dissected with diagrams, concrete examples and tested commands. Everything comes from a structured 5-chapter course — here are the best parts.

tl;dr
  • Introduction to AWS Data Engineering
  • Partitioning and Bucketing
  • Athena and Glue in Action
  • Cardinality and Advanced Optimization
  • Athena Guided Lab
~$ cat ./parcours.md # AWS Data Engineering Bootcamp — 5 chapters
01
Introduction to AWS Data Engineering
→ Chapter 00 – Lesson 1 : Key Concepts in Data Engineering→ Chapter 00 – Lesson 2 : Data Warehouses and their Cloud Equivalents+ 1 more lessons
02
Partitioning and Bucketing
→ Chapter 01 – Lesson 1 : Partitioning and Bucketing in a Data Lake→ Chapter 01 – Lesson 2 : Bucketing in Data Warehouses+ 1 more lessons
03
Athena and Glue in Action
→ Chapter 02 – Lesson 1 : Amazon Athena — Key Concepts and Architecture→ Chapter 02 – Lesson 2 : Configure Athena and Create a Glue Database+ 2 more lessons
04
Cardinality and Advanced Optimization
→ Chapter 03 – Lesson 1 : SQL Views, Named Queries and AWS CloudFormation→ Chapter 03 – Lesson 2 : IAM Permissions for Athena and Glue — Best Practices+ 2 more lessons
05
Guided Athena Lab
→ Chapter 04 (OPTIONAL) – Lab 1 : Create an AWS Glue Database and Table in Athena→ Chapter 04 (OPTIONAL) – Lab 2 : Bucketing, Partitioning and Parquet Format+ 1 more lessons
🏁
Final project
→ You leave with a concrete and demonstrable project

Chapter 02 – Lesson 3: Optimization with Bucketing and Partitioning in Athena

NOTEObjective — Apply bucketing and partitioning on real data in Athena and measure performance gains and cost reduction.

Optimization with Bucketing in Athena

Why Bucketing?

Bucketing distributes data across multiple files according to a hash key. It dramatically improves performance when working with high-cardinality columns.

Reduce the scan

Athena targets only the required buckets during a query.

Improve joins

When two tables are bucketed on the same key, Athena performs very fast "bucket joins".

High cardinality

Ideal for user_id, transaction_id, device_id.

Practical Example: Bucketed E-Commerce Transactions

output
-- Step 1: Configure bucketing when creating the table
-- Replace user_id with your high-cardinality column
CLUSTERED BY (user_id) INTO 8 BUCKETS

-- Complete example with CTAS (Create Table As Select)
CREATE TABLE taxidata.transactions_bucketized
WITH (
    format = 'PARQUET',
    external_location = 's3://mon-bucket/transactions-bucketized/',
    bucketed_by = ARRAY['user_id'],
    bucket_count = 8
) AS
SELECT *
FROM taxidata.transactions_source;
TIPVerification in S3 — After creation, you should see 8 files in S3:
s3://mon-bucket/transactions-bucketized/
├── 000000_0.parquet
├── 000001_0.parquet
├── ...
└── 000007_0.parquet

Steps to Configure Bucketing

Optimization with Partitioning in Athena

Partition Structure in S3

Partitioning organizes data into separate directories in S3 according to the values of a low-cardinality column:

bash
# Partition structure in S3 (Hive format)
s3://taxi-datalake/yellow/
├── year=2015/month=01/
│   └── data.parquet
├── year=2015/month=02/
│   └── data.parquet
├── year=2016/month=01/
│   └── data.parquet
└── year=2017/month=01/
    └── data.parquet

Create a Partitioned Table via CTAS

output
-- CTAS (Create Table As Select) with partitioning
-- CSV → Parquet conversion WITH partitioning
CREATE TABLE taxidata.yellow_partitioned
WITH (
    format = 'PARQUET',
    external_location = 's3://taxi-datalake/yellow-partitioned/',
    partitioned_by = ARRAY['year', 'month']
) AS
SELECT
    vendor, pickup, dropoff, count,
    distance, ratecode, storeflag, pulocid, dolocid,
    paytype, fare, extra, mta_tax, tip, tolls, surcharge, total,
    YEAR(pickup) AS year,
    MONTH(pickup) AS month
FROM taxidata.yellow;
NOTEImportant — Partition columns (year, month) must be listed last in the SELECT and in the partitioned_by array.

Discover Partitions

output
-- After adding data, discover the new partitions
MSCK REPAIR TABLE taxidata.yellow_partitioned;

-- Manually add a specific partition
ALTER TABLE taxidata.yellow_partitioned
ADD PARTITION (year='2017', month='02')
LOCATION 's3://taxi-datalake/yellow-partitioned/year=2017/month=02/';

Optimized Queries with Partitioning

output
-- Query WITHOUT partitioning: scan the ENTIRE table
SELECT COUNT(*), SUM(fare) AS total_fares
FROM taxidata.yellow
WHERE YEAR(pickup) = 2017 AND MONTH(pickup) = 1;
-- → Scan: 100% of the data (e.g. 1 TB)

-- Query WITH partitioning: scan only the relevant partition
SELECT COUNT(*), SUM(fare) AS total_fares
FROM taxidata.yellow_partitioned
WHERE year = 2017 AND month = 1;
-- → Scan: ~1/36 of the data (if 3 years × 12 months)
-- → ~97% reduction in scanned data!

Optimal Combination: Partitioning + Bucketing

Chapter 02 – Lesson 2: Configure Athena and Create a Glue Database

NOTEObjective — Configure the Athena environment, create an AWS Glue database, define an external table on S3 data and run the first SQL queries.

Step 1 — Configure the Athena Environment

WARNINGMandatory prerequisite — Before any Athena query, configure the S3 bucket for results. Without it, Athena will refuse to run your queries.
bash
# Alternative: configure via AWS CLI
aws athena update-work-group \
  --work-group primary \
  --configuration-updates "ResultConfigurationUpdates={OutputLocation=s3://mon-bucket-athena-results/}"

Step 2 — Create a Glue Database

In the Athena editor, run:

output
-- Create the taxidata database in AWS Glue Data Catalog
CREATE DATABASE taxidata;
TIPResult — This command creates a taxidata database in the AWS Glue Data Catalog. You can see it in the Glue console under "Databases".
bash
# Verify creation via AWS CLI
aws glue get-database --name taxidata

# List all databases
aws glue get-databases

Step 3 — Define the Table Schema (NYC Taxi Dataset)

Let's create the external yellow table on New York taxi data stored in S3:

output
-- Create the external yellow table on S3 data
CREATE EXTERNAL TABLE IF NOT EXISTS taxidata.yellow (
   `vendor`    string,
   `pickup`    timestamp,
   `dropoff`   timestamp,
   `count`     int,
   `distance`  int,
   `ratecode`  string,
   `storeflag` string,
   `pulocid`   string,
   `dolocid`   string,
   `paytype`   string,
   `fare`      decimal,
   `extra`     decimal,
   `mta_tax`   decimal,
   `tip`       decimal,
   `tolls`     decimal,
   `surcharge` decimal,
   `total`     decimal
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
   'serialization.format' = ',',
   'field.delim' = ','
) LOCATION 's3://aws-tc-largeobjects/CUR-TF-200-ACDSCI-1/Lab2/yellow/'
TBLPROPERTIES ('has_encrypted_data'='false');
NOTEExplanation of key parameters:
  • CREATE EXTERNAL TABLE — the table points to S3, Athena does not copy the data
  • LazySimpleSerDe — serializer/deserializer for delimited CSV files
  • field.delim = ',' — fields are comma-separated
  • LOCATION — S3 path containing the data files
  • has_encrypted_data='false' — data is not encrypted

Step 4 — Preview the Data

output
-- Preview the first 10 rows
SELECT * FROM taxidata.yellow LIMIT 10;

-- Count the total number of trips
SELECT COUNT(*) AS total_trips FROM taxidata.yellow;

-- Check available payment types
SELECT DISTINCT paytype, COUNT(*) AS nb_trips
FROM taxidata.yellow
GROUP BY paytype
ORDER BY nb_trips DESC;

Step 5 — Create an Optimized Table (January 2017)

output
-- Create a table for January 2017 data only
-- Faster for testing and validation
CREATE EXTERNAL TABLE IF NOT EXISTS taxidata.jan (
   `vendor`    string,
   `pickup`    timestamp,
   `dropoff`   timestamp,
   `count`     int,
   `distance`  int,
   `ratecode`  string,
   `storeflag` string,
   `pulocid`   string,
   `dolocid`   string,
   `paytype`   string,
   `fare`      decimal,
   `extra`     decimal,
   `mta_tax`   decimal,
   `tip`       decimal,
   `tolls`     decimal,
   `surcharge` decimal,
   `total`     decimal
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
   'serialization.format' = ',',
   'field.delim' = ','
) LOCATION 's3://aws-tc-largeobjects/CUR-TF-200-ACDSCI-1/Lab2/January2017/'
TBLPROPERTIES ('has_encrypted_data'='false');

Chapter 03 – Lesson 1: SQL Views, Named Queries and AWS CloudFormation

NOTEObjective — Create SQL views in Athena to simplify analyses, define reusable named queries, and automate their deployment with AWS CloudFormation.

SQL Views in Athena

A SQL view is a saved query that you can use like a table. It does not store data — it executes the underlying query on every call.

TIPBenefit of views — Simplify complex queries, create abstraction layers and share standardized analyses with your team.

Example 1: Credit Card Trips View

output
-- Create a view for credit card payments
-- paytype = '1' means credit card payment
CREATE VIEW taxidata.cctrips AS
SELECT
    vendor,
    SUM(fare)  AS total_fares_cc,
    COUNT(*)   AS nb_trips_cc,
    AVG(fare)  AS avg_fare_cc
FROM taxidata.yellow
WHERE paytype = '1'
GROUP BY vendor;

-- Use the view
SELECT * FROM taxidata.cctrips;

Example 2: Cash Trips View

output
-- Create a view for cash payments
-- paytype = '2' means cash payment
CREATE VIEW taxidata.cashtrips AS
SELECT
    vendor,
    SUM(fare)  AS total_fares_cash,
    COUNT(*)   AS nb_trips_cash,
    AVG(fare)  AS avg_fare_cash
FROM taxidata.yellow
WHERE paytype = '2'
GROUP BY vendor;

-- Use the view
SELECT * FROM taxidata.cashtrips;

Example 3: Comparison View with CTE (WITH)

output
-- Comparison view combining credit card and cash
-- Using CTE (Common Table Expressions) with WITH
CREATE VIEW taxidata.comparepay AS
WITH
    cc AS (
        SELECT
            vendor,
            SUM(fare) AS cctotal,
            COUNT(*)  AS nb_cc
        FROM taxidata.yellow
        WHERE paytype = '1'
        GROUP BY vendor
    ),
    cash AS (
        SELECT
            vendor,
            SUM(fare) AS cashtotal,
            COUNT(*)  AS nb_cash
        FROM taxidata.yellow
        WHERE paytype = '2'
        GROUP BY vendor
    )
SELECT
    cc.vendor,
    cc.cctotal,
    cc.nb_cc,
    cash.cashtotal,
    cash.nb_cash,
    ROUND(cc.cctotal / cash.cashtotal, 2) AS ratio_cc_cash
FROM cc
JOIN cash ON cc.vendor = cash.vendor;

-- Query the view
SELECT * FROM taxidata.comparepay ORDER BY ratio_cc_cash DESC;

Managing Views

output
-- List available views
SHOW VIEWS IN taxidata;

-- View the definition of a view
SHOW CREATE VIEW taxidata.comparepay;

-- Replace an existing view
CREATE OR REPLACE VIEW taxidata.cctrips AS
SELECT
    vendor,
    SUM(fare) AS total_fares_cc,
    COUNT(*) AS nb_trips_cc
FROM taxidata.yellow
WHERE paytype = '1'
GROUP BY vendor;

-- Drop a view
DROP VIEW taxidata.cashtrips;

Named Queries in Athena

Named queries allow you to save frequently used SQL queries directly in Athena. They are shared within your AWS account.

Benefits

Limitation

Automate with AWS CloudFormation

Instead of manually creating named queries via the console, you can deploy them as code with AWS CloudFormation.

CloudFormation Template for a Named Query

Create a file named athenaquery.cf.yml:

output
AWSTemplateFormatVersion: 2010-09-09
Description: "Athena named query for NYC taxi analysis"

Resources:
  AthenaNamedQuery:
    Type: AWS::Athena::NamedQuery
    Properties:
      Database: "taxidata"
      Description: "Select trips whose amount exceeds $100"
      Name: "FaresOver100DollarsUS"
      QueryString: >
        SELECT
            distance,
            paytype,
            fare,
            tip,
            tolls,
            surcharge,
            total
        FROM yellow
        WHERE total >= 100.0
        ORDER BY total DESC

Outputs:
  AthenaNamedQuery:
    Description: "ID of the created Athena named query"
    Value: !Ref AthenaNamedQuery

Deploy the CloudFormation Template via AWS CLI

bash
# Step 1: Validate the template before deployment
aws cloudformation validate-template \
    --template-body file://athenaquery.cf.yml

# Step 2: Create the CloudFormation stack
aws cloudformation create-stack \
    --stack-name mon-athena-stack \
    --template-body file://athenaquery.cf.yml \
    --capabilities CAPABILITY_IAM

# Step 3: Check creation status
aws cloudformation describe-stacks \
    --stack-name mon-athena-stack \
    --query 'Stacks[0].StackStatus'

# Step 4: List created named queries
aws athena list-named-queries \
    --work-group primary

# Step 5: Retrieve details of a named query
aws athena get-named-query \
    --named-query-id "ID_RETURNED_BY_LIST"
go-further

This article covers the most useful excerpts — the full AWS Data Engineering Bootcamp course (5 chapters, 17 lessons, corrected exercises and final project) takes you all the way.

./access-the-full-course free course: Mastering Claude Code

FAQ

How long does it take to learn AWS Data Engineering Bootcamp?
With a structured progression (5 chapters, 17 short and practical lessons), you reach an operational level in a few weeks at 30–60 minutes per day. The key is to practice each concept immediately.
Are there any prerequisites?
Basic computer knowledge is enough. If you can use a terminal and read simple code, you are ready.
Where to start concretely?
Reproduce the commands in this article, then follow the full AWS Data Engineering Bootcamp course: it chains the 17 lessons in order, with exercises and a final project.

📬 Want to receive this type of guide every week? Subscribe for free — real code, zero fluff.