AWS Data Engineering Bootcamp Explained Simply (with Diagrams and Real Code)
AWS Data Engineering Bootcamp: The Essentials in One Article — Real Code, Diagrams, and Concrete Steps, Excerpts from a 17-Lesson Course.
A no-nonsense guide: AWS Data Engineering Bootcamp dissected with diagrams, concrete examples and tested commands. Everything comes from a structured 5-chapter course — here are the best parts.
- Introduction to AWS Data Engineering
- Partitioning and Bucketing
- Athena and Glue in Action
- Cardinality and Advanced Optimization
- Athena Guided Lab
Chapter 02 – Lesson 3: Optimization with Bucketing and Partitioning in Athena
Optimization with Bucketing in Athena
Why Bucketing?
Bucketing distributes data across multiple files according to a hash key. It dramatically improves performance when working with high-cardinality columns.
Reduce the scan
Athena targets only the required buckets during a query.
Improve joins
When two tables are bucketed on the same key, Athena performs very fast "bucket joins".
High cardinality
Ideal for user_id, transaction_id, device_id.
Practical Example: Bucketed E-Commerce Transactions
-- Step 1: Configure bucketing when creating the table
-- Replace user_id with your high-cardinality column
CLUSTERED BY (user_id) INTO 8 BUCKETS
-- Complete example with CTAS (Create Table As Select)
CREATE TABLE taxidata.transactions_bucketized
WITH (
format = 'PARQUET',
external_location = 's3://mon-bucket/transactions-bucketized/',
bucketed_by = ARRAY['user_id'],
bucket_count = 8
) AS
SELECT *
FROM taxidata.transactions_source;s3://mon-bucket/transactions-bucketized/ ├── 000000_0.parquet ├── 000001_0.parquet ├── ... └── 000007_0.parquet
Steps to Configure Bucketing
Optimization with Partitioning in Athena
Partition Structure in S3
Partitioning organizes data into separate directories in S3 according to the values of a low-cardinality column:
# Partition structure in S3 (Hive format)
s3://taxi-datalake/yellow/
├── year=2015/month=01/
│ └── data.parquet
├── year=2015/month=02/
│ └── data.parquet
├── year=2016/month=01/
│ └── data.parquet
└── year=2017/month=01/
└── data.parquetCreate a Partitioned Table via CTAS
-- CTAS (Create Table As Select) with partitioning
-- CSV → Parquet conversion WITH partitioning
CREATE TABLE taxidata.yellow_partitioned
WITH (
format = 'PARQUET',
external_location = 's3://taxi-datalake/yellow-partitioned/',
partitioned_by = ARRAY['year', 'month']
) AS
SELECT
vendor, pickup, dropoff, count,
distance, ratecode, storeflag, pulocid, dolocid,
paytype, fare, extra, mta_tax, tip, tolls, surcharge, total,
YEAR(pickup) AS year,
MONTH(pickup) AS month
FROM taxidata.yellow;year, month) must be listed last in the SELECT and in the partitioned_by array.Discover Partitions
-- After adding data, discover the new partitions MSCK REPAIR TABLE taxidata.yellow_partitioned; -- Manually add a specific partition ALTER TABLE taxidata.yellow_partitioned ADD PARTITION (year='2017', month='02') LOCATION 's3://taxi-datalake/yellow-partitioned/year=2017/month=02/';
Optimized Queries with Partitioning
-- Query WITHOUT partitioning: scan the ENTIRE table SELECT COUNT(*), SUM(fare) AS total_fares FROM taxidata.yellow WHERE YEAR(pickup) = 2017 AND MONTH(pickup) = 1; -- → Scan: 100% of the data (e.g. 1 TB) -- Query WITH partitioning: scan only the relevant partition SELECT COUNT(*), SUM(fare) AS total_fares FROM taxidata.yellow_partitioned WHERE year = 2017 AND month = 1; -- → Scan: ~1/36 of the data (if 3 years × 12 months) -- → ~97% reduction in scanned data!
Optimal Combination: Partitioning + Bucketing
Chapter 02 – Lesson 2: Configure Athena and Create a Glue Database
Step 1 — Configure the Athena Environment
# Alternative: configure via AWS CLI
aws athena update-work-group \
--work-group primary \
--configuration-updates "ResultConfigurationUpdates={OutputLocation=s3://mon-bucket-athena-results/}"Step 2 — Create a Glue Database
In the Athena editor, run:
-- Create the taxidata database in AWS Glue Data Catalog CREATE DATABASE taxidata;
taxidata database in the AWS Glue Data Catalog. You can see it in the Glue console under "Databases".# Verify creation via AWS CLI aws glue get-database --name taxidata # List all databases aws glue get-databases
Step 3 — Define the Table Schema (NYC Taxi Dataset)
Let's create the external yellow table on New York taxi data stored in S3:
-- Create the external yellow table on S3 data
CREATE EXTERNAL TABLE IF NOT EXISTS taxidata.yellow (
`vendor` string,
`pickup` timestamp,
`dropoff` timestamp,
`count` int,
`distance` int,
`ratecode` string,
`storeflag` string,
`pulocid` string,
`dolocid` string,
`paytype` string,
`fare` decimal,
`extra` decimal,
`mta_tax` decimal,
`tip` decimal,
`tolls` decimal,
`surcharge` decimal,
`total` decimal
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://aws-tc-largeobjects/CUR-TF-200-ACDSCI-1/Lab2/yellow/'
TBLPROPERTIES ('has_encrypted_data'='false');CREATE EXTERNAL TABLE— the table points to S3, Athena does not copy the dataLazySimpleSerDe— serializer/deserializer for delimited CSV filesfield.delim = ','— fields are comma-separatedLOCATION— S3 path containing the data fileshas_encrypted_data='false'— data is not encrypted
Step 4 — Preview the Data
-- Preview the first 10 rows SELECT * FROM taxidata.yellow LIMIT 10; -- Count the total number of trips SELECT COUNT(*) AS total_trips FROM taxidata.yellow; -- Check available payment types SELECT DISTINCT paytype, COUNT(*) AS nb_trips FROM taxidata.yellow GROUP BY paytype ORDER BY nb_trips DESC;
Step 5 — Create an Optimized Table (January 2017)
-- Create a table for January 2017 data only
-- Faster for testing and validation
CREATE EXTERNAL TABLE IF NOT EXISTS taxidata.jan (
`vendor` string,
`pickup` timestamp,
`dropoff` timestamp,
`count` int,
`distance` int,
`ratecode` string,
`storeflag` string,
`pulocid` string,
`dolocid` string,
`paytype` string,
`fare` decimal,
`extra` decimal,
`mta_tax` decimal,
`tip` decimal,
`tolls` decimal,
`surcharge` decimal,
`total` decimal
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://aws-tc-largeobjects/CUR-TF-200-ACDSCI-1/Lab2/January2017/'
TBLPROPERTIES ('has_encrypted_data'='false');Chapter 03 – Lesson 1: SQL Views, Named Queries and AWS CloudFormation
SQL Views in Athena
A SQL view is a saved query that you can use like a table. It does not store data — it executes the underlying query on every call.
Example 1: Credit Card Trips View
-- Create a view for credit card payments
-- paytype = '1' means credit card payment
CREATE VIEW taxidata.cctrips AS
SELECT
vendor,
SUM(fare) AS total_fares_cc,
COUNT(*) AS nb_trips_cc,
AVG(fare) AS avg_fare_cc
FROM taxidata.yellow
WHERE paytype = '1'
GROUP BY vendor;
-- Use the view
SELECT * FROM taxidata.cctrips;Example 2: Cash Trips View
-- Create a view for cash payments
-- paytype = '2' means cash payment
CREATE VIEW taxidata.cashtrips AS
SELECT
vendor,
SUM(fare) AS total_fares_cash,
COUNT(*) AS nb_trips_cash,
AVG(fare) AS avg_fare_cash
FROM taxidata.yellow
WHERE paytype = '2'
GROUP BY vendor;
-- Use the view
SELECT * FROM taxidata.cashtrips;Example 3: Comparison View with CTE (WITH)
-- Comparison view combining credit card and cash
-- Using CTE (Common Table Expressions) with WITH
CREATE VIEW taxidata.comparepay AS
WITH
cc AS (
SELECT
vendor,
SUM(fare) AS cctotal,
COUNT(*) AS nb_cc
FROM taxidata.yellow
WHERE paytype = '1'
GROUP BY vendor
),
cash AS (
SELECT
vendor,
SUM(fare) AS cashtotal,
COUNT(*) AS nb_cash
FROM taxidata.yellow
WHERE paytype = '2'
GROUP BY vendor
)
SELECT
cc.vendor,
cc.cctotal,
cc.nb_cc,
cash.cashtotal,
cash.nb_cash,
ROUND(cc.cctotal / cash.cashtotal, 2) AS ratio_cc_cash
FROM cc
JOIN cash ON cc.vendor = cash.vendor;
-- Query the view
SELECT * FROM taxidata.comparepay ORDER BY ratio_cc_cash DESC;Managing Views
-- List available views
SHOW VIEWS IN taxidata;
-- View the definition of a view
SHOW CREATE VIEW taxidata.comparepay;
-- Replace an existing view
CREATE OR REPLACE VIEW taxidata.cctrips AS
SELECT
vendor,
SUM(fare) AS total_fares_cc,
COUNT(*) AS nb_trips_cc
FROM taxidata.yellow
WHERE paytype = '1'
GROUP BY vendor;
-- Drop a view
DROP VIEW taxidata.cashtrips;Named Queries in Athena
Named queries allow you to save frequently used SQL queries directly in Athena. They are shared within your AWS account.
Benefits
Limitation
Automate with AWS CloudFormation
Instead of manually creating named queries via the console, you can deploy them as code with AWS CloudFormation.
CloudFormation Template for a Named Query
Create a file named athenaquery.cf.yml:
AWSTemplateFormatVersion: 2010-09-09
Description: "Athena named query for NYC taxi analysis"
Resources:
AthenaNamedQuery:
Type: AWS::Athena::NamedQuery
Properties:
Database: "taxidata"
Description: "Select trips whose amount exceeds $100"
Name: "FaresOver100DollarsUS"
QueryString: >
SELECT
distance,
paytype,
fare,
tip,
tolls,
surcharge,
total
FROM yellow
WHERE total >= 100.0
ORDER BY total DESC
Outputs:
AthenaNamedQuery:
Description: "ID of the created Athena named query"
Value: !Ref AthenaNamedQueryDeploy the CloudFormation Template via AWS CLI
# Step 1: Validate the template before deployment
aws cloudformation validate-template \
--template-body file://athenaquery.cf.yml
# Step 2: Create the CloudFormation stack
aws cloudformation create-stack \
--stack-name mon-athena-stack \
--template-body file://athenaquery.cf.yml \
--capabilities CAPABILITY_IAM
# Step 3: Check creation status
aws cloudformation describe-stacks \
--stack-name mon-athena-stack \
--query 'Stacks[0].StackStatus'
# Step 4: List created named queries
aws athena list-named-queries \
--work-group primary
# Step 5: Retrieve details of a named query
aws athena get-named-query \
--named-query-id "ID_RETURNED_BY_LIST"This article covers the most useful excerpts — the full AWS Data Engineering Bootcamp course (5 chapters, 17 lessons, corrected exercises and final project) takes you all the way.
./access-the-full-course free course: Mastering Claude CodeFAQ
How long does it take to learn AWS Data Engineering Bootcamp?
Are there any prerequisites?
Where to start concretely?
📬 Want to receive this type of guide every week? Subscribe for free — real code, zero fluff.