
Big Data Fundamentals Architecture Explained Simply (with Diagrams and Real Code)

Big Data Fundamentals Architecture: the essentials in one article — real code, diagrams and concrete steps, excerpts from a 43-lesson course.

REHOUMA Haythem

12 Jun 2026 • 12 min read

A straight-to-the-point guide: Big Data Fundamentals Architecture dissected with diagrams, concrete examples, and tested commands. Everything comes from a structured 11-chapter course — here’s the best of it.

tl;dr

Introduction to Big Data
Distributed Architectures
Hadoop Ecosystem
Apache Spark
Streaming and Real Time

~$ cat ./parcours.md # Big Data Fundamentals Architecture: 11 chapters

Introduction to Big Data

→ Course presentation and Big Data definition
→ The 5 Vs: Volume, Velocity, Variety, Veracity, Value
+ 1 more lesson

Distributed Architectures

→ Horizontal vs vertical scalability
→ CAP theorem: consistency, availability, partition tolerance
+ 2 more lessons

Hadoop Ecosystem

→ HDFS, massive distributed storage
→ YARN, cluster resource management
+ 2 more lessons

Apache Spark

→ Spark architecture: driver, executors, cluster manager
→ RDD, DataFrame, Dataset: which to choose?
+ 2 more lessons

Streaming and Real Time

→ Apache Kafka: topics, partitions, producers, consumers
→ Spark Structured Streaming
+ 2 more lessons

Data Lake and Lakehouse Storage

→ Data Warehouse vs Data Lake vs Lakehouse
→ Columnar formats: Parquet, ORC
+ 2 more lessons

Architecture Patterns

→ Lambda architecture (batch + speed layer)
→ Kappa architecture (pure streaming)
+ 2 more lessons

Cloud and Modern Solutions

→ AWS Big Data: EMR, Glue, Athena, Redshift
→ Databricks and the unified Lakehouse

🏁

Final project (+ 2 chapters along the way)

→ You leave with a concrete and demonstrable project

Quality tests: Great Expectations, dbt tests

NOTE (Objective): Learn how to automatically validate data quality in a Big Data pipeline. You will know how to define expectations with Great Expectations, write dbt tests, and understand the six dimensions of data quality.

Learning objectives

TIP: At the end of this module, you will be able to:

List the 6 dimensions of data quality
Write an expectation suite with Great Expectations
Define dbt tests (generic and custom)
Choose between blocking validation and non-blocking alert
Integrate quality tests into an orchestrated pipeline

The 6 dimensions of quality

Before testing, you need to know what to test. Data quality is measured according to six classic dimensions. A robust pipeline covers all six, not just “values are not null”.

Dimension	Question asked	Example test
Completeness	Are values missing?	No NULL `email`
Uniqueness	Are there duplicates?	`id_commande` unique
Validity	Is the format correct?	`pays` in an ISO list
Accuracy	Is the value plausible?	`montant` between 0 and 100000
Consistency	Do the tables agree?	`client_id` exists in `clients`
Freshness	Is the data up to date?	Last ingestion < 24 h
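
To make the table concrete, the six dimensions can be sketched as plain Python checks over a list of rows. This is an illustration only, not the Great Expectations or dbt API. Column names follow the table; the sample rows, the `ISO_COUNTRIES` and `KNOWN_CLIENTS` sets, the `ingested_at` field and the `check_dimensions` helper are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical reference data, for illustration only.
ISO_COUNTRIES = {"FR", "DE", "TN", "US"}
KNOWN_CLIENTS = {1, 2, 3}  # stands in for the `clients` table


def check_dimensions(rows, now):
    """Return {dimension: passed} for a list of order dicts."""
    ids = [r["id_commande"] for r in rows]
    return {
        "completeness": all(r["email"] is not None for r in rows),
        "uniqueness": len(ids) == len(set(ids)),
        "validity": all(r["pays"] in ISO_COUNTRIES for r in rows),
        "accuracy": all(0 <= r["montant"] <= 100000 for r in rows),
        "consistency": all(r["client_id"] in KNOWN_CLIENTS for r in rows),
        "freshness": now - max(r["ingested_at"] for r in rows) < timedelta(hours=24),
    }


now = datetime(2026, 6, 12, 12, 0)
orders = [
    {"id_commande": 10, "email": "a@example.com", "pays": "FR",
     "montant": 120.0, "client_id": 1, "ingested_at": now - timedelta(hours=2)},
    {"id_commande": 11, "email": None, "pays": "XX",
     "montant": 80.0, "client_id": 2, "ingested_at": now - timedelta(hours=3)},
]
report = check_dimensions(orders, now)
print(report)
```

On this sample, completeness and validity fail (a NULL `email`, a country code outside the list) while the four other dimensions pass, which is exactly the kind of problem a "not null only" test suite would half miss.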

Great Expectations: declaring expectations

Great Expectations (GX) lets you express quality as readable expectations, almost in natural language. An expectation suite becomes an executable contract, stored in the catalog from the previous lesson.

Blocking validation (error)

Non-blocking alert (warn)

WARNING: It is better to fail early (at the bronze or silver stage) than to propagate corrupted data all the way to the gold dashboards. Golden rule: blocking tests must run before publishing the gold layer that decision-makers consult.
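
The difference between a blocking validation and a non-blocking alert can be sketched in plain Python. The severity names echo the `error` and `warn` labels above; `QualityError`, `run_checks` and the sample checks are hypothetical and are not part of Great Expectations or dbt.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("quality")


class QualityError(Exception):
    """Raised when a blocking (severity='error') check fails."""


def run_checks(checks):
    """checks: list of (name, passed, severity) tuples.

    'warn' failures are logged and collected; 'error' failures stop the run.
    """
    warnings = []
    for name, passed, severity in checks:
        if passed:
            continue
        if severity == "error":
            raise QualityError(f"blocking check failed: {name}")
        log.warning("non-blocking check failed: %s", name)
        warnings.append(name)
    return warnings


# A failed 'warn' check lets the pipeline continue.
alerts = run_checks([
    ("id_commande unique", True, "error"),
    ("last ingestion < 24 h", False, "warn"),
])

# A failed 'error' check stops it before the gold layer is published.
try:
    run_checks([("email not NULL", False, "error")])
    blocked = False
except QualityError:
    blocked = True
```

A reasonable default is to start a new check as `warn`, observe how often it fires, then promote it to `error` once you trust it.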

Integrate into orchestration

Tests make the most sense when automated in the orchestrator (Airflow, Dagster, Databricks Workflows): validation runs as its own step, after the transformation and before the gold layer is published.
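
The ordering of steps in such a pipeline can be sketched in plain Python. This is not a real Airflow or Dagster DAG: the step names, the sample row and the `run_pipeline` helper are hypothetical. The only point is that a failed validation step prevents the publication step from running.

```python
def run_pipeline(steps):
    """Run (name, callable) steps in order; stop at the first one that returns False."""
    executed = []
    for name, step in steps:
        executed.append(name)
        if not step():
            break  # later steps, including publication, are skipped
    return executed


silver_rows = [{"id_commande": 1, "email": None}]  # hypothetical bad row


def validate_silver():
    # Blocking test: no NULL email is allowed to reach gold.
    return all(r["email"] is not None for r in silver_rows)


executed = run_pipeline([
    ("ingest_bronze", lambda: True),
    ("transform_silver", lambda: True),
    ("validate_silver", validate_silver),
    ("publish_gold", lambda: True),
])
print(executed)
```

In a real orchestrator the same idea is usually expressed as a task dependency: the publication task depends on the validation task and does not start if it fails.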

Lineage, security, GDPR and rights

NOTE (Objective): Complete governance with its three remaining pillars: lineage (end-to-end traceability), security (encryption, access control) and GDPR compliance (right to be forgotten, PII masking). You will know what every Big Data project must plan from the start.

Learning objectives

TIP: At the end of this module, you will be able to:

Explain what data lineage is and what it is used for
Implement role-based access control (RBAC)
Distinguish encryption at rest and in transit
Identify GDPR obligations applicable to Big Data
Mask or anonymize personal data (PII)

Data lineage: tracking data end to end

Lineage answers two critical questions: “where does this column come from?” (upstream lineage) and “what breaks if I modify this table?” (downstream lineage). In a medallion architecture (bronze → silver → gold), lineage traces every transformation from one layer to the next.

Downstream lineage

Used for impact analysis: before changing a schema, you know exactly which dashboards and ML models will be affected.

NOTE: Modern tools (Unity Catalog, DataHub, OpenLineage) capture lineage automatically by analyzing executed SQL queries. No need to document it manually: the engine knows that `gold.ca_par_pays` reads `silver.commandes_propres`.
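
Upstream and downstream lineage are two traversals of the same dependency graph. Here is a minimal sketch, assuming the graph is already available as a dictionary. Only the `gold.ca_par_pays` → `silver.commandes_propres` edge comes from the text; the other tables and the `upstream` / `downstream` helpers are hypothetical, and real tools build this graph for you.

```python
from collections import deque

# Each table maps to the tables it reads from (its direct upstream).
# Only the first edge comes from the article; the others are hypothetical.
READS_FROM = {
    "gold.ca_par_pays": ["silver.commandes_propres"],
    "silver.commandes_propres": ["bronze.commandes_raw"],
    "gold.top_clients": ["silver.commandes_propres"],
}


def upstream(table):
    """All tables that `table` depends on, directly or not."""
    seen, queue = set(), deque([table])
    while queue:
        for parent in READS_FROM.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def downstream(table):
    """All tables affected if `table` changes (impact analysis)."""
    return {t for t in READS_FROM if table in upstream(t)}


print(upstream("gold.ca_par_pays"))
print(downstream("silver.commandes_propres"))
```

Calling `downstream` before a schema change gives the list of tables to re-test, which is the impact analysis described above.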

Security: encryption and access control

The security of a Big Data platform rests on two complementary layers: protecting the data itself (encryption) and controlling who can read it (access).

Measure	Role	Example
Encryption at rest	Data encrypted on disk	S3 SSE-KMS, encrypted disks
Encryption in transit	Data encrypted over the network	TLS between services
RBAC	Role-based access	`analysts` group reads gold
ABAC	Attribute-based access	Mask if `tag = PII`
Audit log	Trace every access	Who read what, when

Example: RBAC and column masking
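
A minimal sketch of the idea in plain Python. A real platform expresses this with its own grants and masking policies, whose syntax varies by engine and is not shown here. The `analysts` group reading gold comes from the table above; the `data_engineers` role, the policy dictionaries and the `read_row` helper are hypothetical.

```python
# Hypothetical policy: which roles may read which layer, and which columns are PII.
ROLE_CAN_READ = {"analysts": {"gold"}, "data_engineers": {"bronze", "silver", "gold"}}
PII_COLUMNS = {"email"}
ROLES_SEEING_PII = {"data_engineers"}


def read_row(role, layer, row):
    """Return the row as `role` may see it, or raise PermissionError."""
    # RBAC step: the role must be allowed to read this layer at all.
    if layer not in ROLE_CAN_READ.get(role, set()):
        raise PermissionError(f"{role} cannot read {layer}")
    if role in ROLES_SEEING_PII:
        return dict(row)
    # ABAC-style step: mask every column tagged as PII.
    return {k: ("***" if k in PII_COLUMNS else v) for k, v in row.items()}


row = {"pays": "FR", "email": "a@example.com", "montant": 120.0}
masked = read_row("analysts", "gold", row)
print(masked)
```

The two layers are independent on purpose: RBAC decides whether the table can be read, masking decides what the reader sees inside it.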

Minimization

Collect only the data that is necessary. The reflex “we keep everything just in case” is exactly what GDPR prohibits.

Right to be forgotten

A user can request deletion of their data. You must be able to erase a specific person — hence the value of Delta/Iceberg formats that support DELETE.

Traceability

Prove who accessed which personal data and when. This is where the audit log and lineage become mandatory.

WARNING: In a raw Parquet Data Lake (immutable), deleting a single person is very expensive: entire files must be rewritten. This is one of the strong arguments in favor of the Lakehouse (Delta Lake, Iceberg, Hudi) seen in chapter 05: these formats handle row-by-row DELETE and UPDATE, making the right to be forgotten realistic.

Example: erasing a person (right to be forgotten)
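
A minimal sketch of a row-level erase. An in-memory SQLite database stands in here for a table format that supports DELETE (Delta Lake, Iceberg, Hudi); the `commandes` table layout, the sample rows and the `forget_client` helper are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for a table format with row-level DELETE.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commandes (id_commande INTEGER, client_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO commandes VALUES (?, ?, ?)",
    [(1, 42, "a@example.com"), (2, 42, "a@example.com"), (3, 7, "b@example.com")],
)


def forget_client(conn, client_id):
    """Delete every row of one person and return how many rows were erased."""
    cur = conn.execute("DELETE FROM commandes WHERE client_id = ?", (client_id,))
    conn.commit()
    return cur.rowcount


erased = forget_client(conn, 42)
remaining = conn.execute("SELECT COUNT(*) FROM commandes").fetchone()[0]
print(erased, remaining)
```

In a real platform the same request has to be replayed on every table that holds the person's data, which is one more reason to have lineage in place.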

Technique	Reversible?	GDPR status
Anonymization	No, irreversible	Out of GDPR scope
Pseudonymization	Yes, via a key	Remains subject to GDPR

NOTE: Replacing a name with the identifier CLI-90421 is pseudonymization, not anonymization: if the mapping table is kept, the data remains personal in the eyes of the law. True anonymization (aggregation, definitive removal of the link) is the only thing that takes the data out of GDPR.
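
The contrast between the two techniques can be sketched as follows. The keyed hash is one possible way to produce a stable pseudonym, chosen here for illustration; the key value, the sample rows and both helper functions are hypothetical.

```python
import hashlib
import hmac
from collections import Counter

SECRET_KEY = b"hypothetical-key-kept-in-a-vault"


def pseudonymize(name):
    """Keyed hash: stable identifier, re-linkable by whoever holds the key."""
    digest = hmac.new(SECRET_KEY, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return "CLI-" + digest[:8]


def anonymize_by_country(rows):
    """Aggregate to counts per country: no per-person record remains."""
    return dict(Counter(r["pays"] for r in rows))


rows = [{"name": "Alice", "pays": "FR"}, {"name": "Bob", "pays": "FR"},
        {"name": "Chloe", "pays": "DE"}]
pseudo = [{"id": pseudonymize(r["name"]), "pays": r["pays"]} for r in rows]
stats = anonymize_by_country(rows)
print(pseudo)
print(stats)
```

The pseudonymized list still has one record per person, so it stays personal data. The aggregated counts do not, although very small groups can still allow re-identification and usually need a minimum threshold.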

Cost estimation and scaling plan

NOTE (Objective): Calculate the monthly cost of your architecture and plan its scaling. You will learn how to estimate expense items, write ADRs (Architecture Decision Records), and plan scalability without over-provisioning from day one.

Learning objectives

TIP: At the end of this module, you will be able to:

Identify the main cost items of a Big Data platform
Estimate a monthly budget order of magnitude
Write a clear and reusable ADR
Distinguish vertical and horizontal scaling
Apply FinOps principles to control the bill

Big Data cost items

The cloud bill of a Big Data platform is spread across a few major items. Knowing them lets you target optimizations where they matter.

Item	Example service	Cost-saving lever
Storage	S3, ADLS, GCS	Tiering (hot/cold), compression
Compute	EMR, Databricks, Dataproc	Spot instances, auto-scaling
Streaming	Kafka, Kinesis	Partition sizing
Queries	Athena, BigQuery	Partitioning, columnar formats
Network transfer	Inter-region egress	Stay in a single region

WARNING: The #1 cost trap is compute running for nothing. A Spark cluster left running overnight, or a poorly optimized job that scans 10 TB instead of 100 GB (100 times more data), can multiply the bill many times over. Storage is rarely the problem; compute almost always is.

Estimation: e-commerce example

Back to the e-commerce case (2 TB/month, 5,000 events/s at peak). The goal of the estimate is not dollar precision, but the right order of magnitude.
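
One way to structure such an estimate is a small function that multiplies volumes by unit prices, item by item. Everything numeric below except the 2 TB/month is an assumption: the unit prices are placeholders, not real cloud prices, and the compute hours and scanned volume are invented for the example. Replace them with your provider's price list and your own measurements.

```python
def monthly_cost(storage_tb, compute_hours, scanned_tb, prices):
    """Order-of-magnitude monthly cost, by item, in the currency of `prices`."""
    items = {
        "storage": storage_tb * prices["storage_per_tb_month"],
        "compute": compute_hours * prices["compute_per_hour"],
        "queries": scanned_tb * prices["query_per_tb_scanned"],
    }
    items["total"] = sum(items.values())
    return items


# PLACEHOLDER unit prices, not real cloud prices.
prices = {"storage_per_tb_month": 25.0, "compute_per_hour": 2.0, "query_per_tb_scanned": 5.0}

# 2 TB/month comes from the e-commerce case: 24 TB stored after one year
# if nothing is deleted. Compute hours and scanned volume are assumptions.
after_one_year = monthly_cost(storage_tb=24, compute_hours=300, scanned_tb=10, prices=prices)
print(after_one_year)
```

Keeping the result split by item matters more than the total: it shows immediately which line to optimize first.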

Vertical (scale up)

More powerful machines. Simple, but limited and expensive. Reserved for components that do not distribute well.

Horizontal (scale out)

More machines. This is the native mode of Big Data: Kafka adds partitions, Spark adds executors, and S3 offers virtually unlimited storage.
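
Horizontal scaling usually comes down to a division: peak load over what one unit can absorb. Here is a minimal sketch for Kafka partitions. The 5,000 events/s peak comes from the e-commerce case; the throughput per partition, the headroom factor and the `partitions_needed` helper are assumptions to replace with values measured on your own cluster.

```python
import math


def partitions_needed(peak_events_per_s, events_per_s_per_partition, headroom=1.5):
    """Partitions required to absorb the peak, with a safety margin."""
    return math.ceil(peak_events_per_s * headroom / events_per_s_per_partition)


# 1000 events/s per partition and the 1.5x headroom are assumptions.
n = partitions_needed(5000, 1000)
print(n)
```

The same reasoning applies to Spark executors: size for the measured peak plus a margin, then let auto-scaling handle the rest instead of over-provisioning from day one.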

go-further

This article covers the most useful excerpts — the full Big Data Fundamentals Architecture course (11 chapters, 43 lessons, corrected exercises and final project) takes you all the way.


FAQ

How long does it take to learn Big Data Fundamentals Architecture?

With a structured progression (11 chapters, 43 short and practical lessons), you reach an operational level in a few weeks at 30 to 60 minutes per day. The key is to practice each concept immediately.

Are there any prerequisites?

It is best to be comfortable with the fundamentals of the domain: this content goes in depth, with real-world cases.

Where to start concretely?

Reproduce the commands in this article, then follow the full Big Data Fundamentals Architecture course: it chains the 43 lessons in order, with exercises and a final project.

