Big Data Fundamentals Architecture Explained Simply (with Diagrams and Real Code)
Big Data Fundamentals Architecture: the essentials in one article — real code, diagrams and concrete steps, excerpts from a 43-lesson course.
A straight-to-the-point guide: Big Data Fundamentals Architecture dissected with diagrams, concrete examples, and tested commands. Everything comes from a structured 11-chapter course — here’s the best of it.
- Introduction to Big Data
- Distributed Architectures
- Hadoop Ecosystem
- Apache Spark
- Streaming and Real Time
Quality tests: Great Expectations, dbt tests
Learning objectives
- List the 6 dimensions of data quality
- Write an expectations suite with Great Expectations
- Define dbt tests (generic and custom)
- Choose between blocking validation and non-blocking alert
- Integrate quality tests into an orchestrated pipeline
The 6 dimensions of quality
Before testing, you need to know what to test. Data quality is measured according to six classic dimensions. A robust pipeline covers all six, not just “values are not null”.
| Dimension | Question asked | Example test |
|---|---|---|
| Completeness | Are values missing? | No NULL in email |
| Uniqueness | Are there duplicates? | id_commande unique |
| Validity | Is the format correct? | pays in an ISO list |
| Accuracy | Is the value plausible? | montant between 0 and 100000 |
| Consistency | Do the tables agree? | client_id exists in clients |
| Freshness | Is the data up to date? | Last ingestion < 24 h |
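As a rough illustration, the six example tests from the table can be written as plain Python checks. This is a minimal sketch: the sample rows, the shortened `ISO_COUNTRIES` set and the `check_quality` helper are assumptions made for the example, not part of any library.

```python
from datetime import datetime, timedelta

# Hypothetical sample data; column names follow the table above.
clients = [{"client_id": 1}, {"client_id": 2}]
commandes = [
    {"id_commande": 10, "client_id": 1, "email": "a@example.com", "pays": "FR", "montant": 120.0},
    {"id_commande": 11, "client_id": 2, "email": None, "pays": "DE", "montant": 80.0},
    {"id_commande": 11, "client_id": 3, "email": "c@example.com", "pays": "XX", "montant": 250000.0},
]
ISO_COUNTRIES = {"FR", "DE", "ES", "IT"}  # assumption: a shortened ISO list


def check_quality(rows, known_clients, last_ingestion, now):
    """Return one boolean per quality dimension (True means the check passed)."""
    ids = [r["id_commande"] for r in rows]
    client_ids = {c["client_id"] for c in known_clients}
    return {
        "completeness": all(r["email"] is not None for r in rows),
        "uniqueness": len(ids) == len(set(ids)),
        "validity": all(r["pays"] in ISO_COUNTRIES for r in rows),
        "accuracy": all(0 <= r["montant"] <= 100000 for r in rows),
        "consistency": all(r["client_id"] in client_ids for r in rows),
        "freshness": now - last_ingestion < timedelta(hours=24),
    }


now = datetime(2024, 1, 2, 12, 0)
report = check_quality(commandes, clients, last_ingestion=now - timedelta(hours=3), now=now)
for dimension, passed in report.items():
    print(f"{dimension}: {'OK' if passed else 'FAILED'}")
```

The sample deliberately breaks five of the six dimensions, which shows why a single "not null" test is not enough.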
Great Expectations: declaring expectations
Great Expectations (GX) lets you express quality as readable expectations, almost in natural language. An expectations suite becomes an executable contract, stored in the catalog from the previous lesson.
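To make the idea of an "executable contract" concrete, here is a minimal sketch in plain Python. It is not the Great Expectations API: the function names only borrow the naming style of GX expectations and are reimplemented here as stand-ins, and `SUITE` and `validate` are hypothetical helpers.

```python
# Stand-in for an expectations suite: plain Python, NOT the Great Expectations API.

def expect_column_values_to_not_be_null(rows, column):
    return all(r[column] is not None for r in rows)


def expect_column_values_to_be_unique(rows, column):
    values = [r[column] for r in rows]
    return len(values) == len(set(values))


def expect_column_values_to_be_between(rows, column, min_value, max_value):
    return all(min_value <= r[column] <= max_value for r in rows)


# The suite is declared as data: it can be versioned, reviewed and re-run.
SUITE = [
    (expect_column_values_to_not_be_null, {"column": "email"}),
    (expect_column_values_to_be_unique, {"column": "id_commande"}),
    (expect_column_values_to_be_between, {"column": "montant", "min_value": 0, "max_value": 100000}),
]


def validate(rows, suite):
    """Run every expectation and return a per-expectation result plus a global status."""
    results = [
        {"expectation": fn.__name__, "kwargs": kwargs, "success": fn(rows, **kwargs)}
        for fn, kwargs in suite
    ]
    return {"success": all(r["success"] for r in results), "results": results}


rows = [
    {"id_commande": 1, "email": "a@example.com", "montant": 50.0},
    {"id_commande": 2, "email": "b@example.com", "montant": 999999.0},
]
outcome = validate(rows, SUITE)
for result in outcome["results"]:
    print(result["expectation"], result["kwargs"], "->", result["success"])
```

Separating the declaration (`SUITE`) from the execution (`validate`) is the design choice that matters here: the same contract can be applied to every new batch.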
Blocking validation (error)
Non-blocking alert (warn)
Integrate into orchestration
Tests make the most sense when automated in the orchestrator (Airflow, Dagster, Databricks Workflows), typically as a validation step placed between ingestion and publication.
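In code form, such a flow could be sketched as follows. This is an orchestrator-agnostic illustration, assuming a hypothetical task chain (ingest, quality gate, transform, publish); `BlockingQualityError`, `run_quality_gate` and `run_pipeline` are names invented for the example. A failed test with severity `error` stops the chain, while a failed `warn` test only raises an alert.

```python
class BlockingQualityError(Exception):
    """Raised when a test with severity 'error' fails."""


def run_quality_gate(rows, tests):
    """Raise on a failed 'error' test; collect the names of failed 'warn' tests."""
    alerts = []
    for name, severity, predicate in tests:
        if predicate(rows):
            continue
        if severity == "error":
            raise BlockingQualityError(name)
        alerts.append(name)
    return alerts


def run_pipeline(rows, tests):
    """Hypothetical task chain: ingest -> quality gate -> transform -> publish."""
    executed = ["ingest"]
    alerts = run_quality_gate(rows, tests)  # an exception here stops the chain
    executed += ["quality_gate", "transform", "publish"]
    return executed, alerts


def ids_unique(rows):
    ids = [r["id_commande"] for r in rows]
    return len(ids) == len(set(ids))


def emails_present(rows):
    return all(r["email"] is not None for r in rows)


TESTS = [
    ("id_commande_unique", "error", ids_unique),  # blocking validation
    ("email_not_null", "warn", emails_present),   # non-blocking alert
]

missing_email = [{"id_commande": 1, "email": None}, {"id_commande": 2, "email": "b@example.com"}]
duplicated = [{"id_commande": 1, "email": "a@example.com"}, {"id_commande": 1, "email": "b@example.com"}]

executed, alerts = run_pipeline(missing_email, TESTS)
print(executed, alerts)

try:
    run_pipeline(duplicated, TESTS)
    blocked = False
except BlockingQualityError as exc:
    blocked = True
    print(f"pipeline stopped by blocking test: {exc}")
```

In a real orchestrator, the exception would mark the task as failed and prevent downstream tasks from running; the alert list would typically be sent to a notification channel.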
Lineage, security, GDPR and rights
Learning objectives
- Explain what data lineage is and what it is used for
- Implement role-based access control (RBAC)
- Distinguish encryption at rest and in transit
- Identify GDPR obligations applicable to Big Data
- Mask or anonymize personal data (PII)
Data lineage: tracking data end to end
Lineage answers two critical questions: “where does this column come from?” (upstream lineage) and “what breaks if I modify this table?” (downstream lineage). In a medallion bronze → silver → gold architecture, lineage traces every transformation.
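Both questions boil down to a graph traversal. The sketch below assumes lineage is available as a list of (source, derived) table pairs; `bronze.commandes_brutes` and `gold.panier_moyen` are hypothetical tables added for the illustration.

```python
from collections import defaultdict, deque

# Lineage edges as (source table, derived table).
# bronze.commandes_brutes and gold.panier_moyen are hypothetical names.
EDGES = [
    ("bronze.commandes_brutes", "silver.commandes_propres"),
    ("silver.commandes_propres", "gold.ca_par_pays"),
    ("silver.commandes_propres", "gold.panier_moyen"),
]


def build_graph(edges, reverse=False):
    """Adjacency map; reverse=True walks from derived tables back to their sources."""
    graph = defaultdict(set)
    for source, derived in edges:
        if reverse:
            graph[derived].add(source)
        else:
            graph[source].add(derived)
    return graph


def reachable(graph, start):
    """Breadth-first traversal returning every table reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


# "What breaks if I modify this table?"
downstream = reachable(build_graph(EDGES), "silver.commandes_propres")
# "Where does this table come from?"
upstream = reachable(build_graph(EDGES, reverse=True), "gold.ca_par_pays")
print(sorted(downstream))
print(sorted(upstream))
```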
Downstream lineage
Used for impact analysis: before changing a schema, you know exactly which dashboards and ML models will be affected.
gold.ca_par_pays reads silver.commandes_propres.
Security: encryption and access control
The security of a Big Data platform rests on two complementary layers: protecting the data itself (encryption) and controlling who can read it (access).
| Measure | Role | Example |
|---|---|---|
| Encryption at rest | Data encrypted on disk | S3 SSE-KMS, encrypted disks |
| Encryption in transit | Data encrypted over the network | TLS between services |
| RBAC | Role-based access | analysts group reads gold |
| ABAC | Attribute-based access | Mask if tag = PII |
| Audit log | Trace every access | Who read what, when |
Example: RBAC and column masking
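A minimal sketch of the three ideas (role-based access, masking of PII columns, audit log) in plain Python. The roles, the `PII_COLUMNS` tag set and the `read_table` helper are assumptions for the illustration; a real platform would usually express the same rules through grants and masking policies in its catalog.

```python
from datetime import datetime, timezone

# Hypothetical configuration for the illustration.
ROLE_GRANTS = {"analysts": {"gold"}, "data_engineers": {"bronze", "silver", "gold"}}
PII_COLUMNS = {"email"}              # columns tagged as personal data
UNMASKED_ROLES = {"data_engineers"}  # roles allowed to see PII in clear
AUDIT_LOG = []                       # who tried to read what, and when


def read_table(role, layer, rows):
    """Check the grant, log the access, and mask PII columns when required."""
    allowed = layer in ROLE_GRANTS.get(role, set())
    AUDIT_LOG.append({
        "role": role,
        "layer": layer,
        "allowed": allowed,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    if not allowed:
        raise PermissionError(f"role {role!r} cannot read layer {layer!r}")
    if role in UNMASKED_ROLES:
        return [dict(r) for r in rows]
    return [
        {col: ("***" if col in PII_COLUMNS else val) for col, val in r.items()}
        for r in rows
    ]


gold_rows = [{"pays": "FR", "email": "a@example.com", "montant": 120.0}]

masked = read_table("analysts", "gold", gold_rows)
print(masked)

try:
    read_table("analysts", "silver", gold_rows)
    denied = False
except PermissionError as exc:
    denied = True
    print(exc)
```

Note that the denied attempt is logged too: an audit log that only records successful reads would miss the accesses that matter most.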
Minimization
Collect only the data that is necessary. The reflex “we keep everything just in case” is exactly what GDPR prohibits.
Right to be forgotten
A user can request deletion of their data. You must be able to erase a specific person — hence the value of Delta/Iceberg formats that support DELETE.
Traceability
Prove who accessed which personal data and when. This is where the audit log and lineage become mandatory.
Formats such as Delta and Iceberg support DELETE and UPDATE, making the right to be forgotten realistic.
Example: erasing a person (right to be forgotten)
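The sketch below shows the shape of an erasure routine. It uses an in-memory SQLite database as a stand-in for a Delta or Iceberg table, purely so the example runs anywhere; the table layout and the `erase_person` helper are assumptions.

```python
import sqlite3

# In-memory SQLite stands in for a lakehouse table that supports DELETE.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (client_id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE commandes (id_commande INTEGER PRIMARY KEY, client_id INTEGER, montant REAL);
    INSERT INTO clients VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO commandes VALUES (10, 1, 120.0), (11, 2, 80.0), (12, 1, 45.5);
""")


def erase_person(conn, client_id):
    """Delete every row tied to one client, in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        orders = conn.execute(
            "DELETE FROM commandes WHERE client_id = ?", (client_id,)
        ).rowcount
        persons = conn.execute(
            "DELETE FROM clients WHERE client_id = ?", (client_id,)
        ).rowcount
    return {"commandes": orders, "clients": persons}


deleted = erase_person(conn, 1)
remaining = conn.execute("SELECT COUNT(*) FROM commandes").fetchone()[0]
print(deleted, "remaining orders:", remaining)
```

On a lakehouse table, a logical DELETE is usually not the end of the story: older file versions still contain the rows until they are cleaned up (for example with VACUUM in Delta Lake), so the erasure procedure should include that step.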
| Technique | Reversible? | GDPR status |
|---|---|---|
| Anonymization | No, irreversible | Out of GDPR scope |
| Pseudonymization | Yes, via a key | Remains subject to GDPR |
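The difference between the two rows can be sketched in a few lines. The `Pseudonymizer` class and `anonymize_by_country` function are hypothetical helpers; the aggregation only counts as anonymization under the assumption that each group is large enough to prevent singling out a person.

```python
import itertools
from collections import defaultdict


class Pseudonymizer:
    """Replaces an email with a stable code and keeps the mapping table."""

    def __init__(self, first_code=90421):
        self._codes = itertools.count(first_code)
        self.mapping = {}  # email -> code: whoever holds this can re-identify

    def pseudonymize(self, email):
        if email not in self.mapping:
            self.mapping[email] = f"CLI-{next(self._codes)}"
        return self.mapping[email]

    def reidentify(self, code):
        """Reverse lookup: possible only because the mapping table is kept."""
        for email, known_code in self.mapping.items():
            if known_code == code:
                return email
        return None


def anonymize_by_country(rows):
    """Aggregate amounts per country; no per-person link survives."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["pays"]] += r["montant"]
    return dict(totals)


rows = [
    {"email": "a@example.com", "pays": "FR", "montant": 120.0},
    {"email": "b@example.com", "pays": "FR", "montant": 80.0},
    {"email": "a@example.com", "pays": "DE", "montant": 45.0},
]

pseudonymizer = Pseudonymizer()
pseudonymized = [
    {"client": pseudonymizer.pseudonymize(r["email"]), "pays": r["pays"], "montant": r["montant"]}
    for r in rows
]
aggregated = anonymize_by_country(rows)
print(pseudonymized)
print(aggregated)
```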
Replacing a name with an identifier such as CLI-90421 is pseudonymization, not anonymization: as long as the mapping table is kept, the data remains personal in the eyes of the law. Only true anonymization (aggregation, permanent removal of the link) takes the data out of GDPR scope.
Cost estimation and scaling plan
Learning objectives
- Identify the main cost items of a Big Data platform
- Estimate a monthly budget order of magnitude
- Write a clear and reusable ADR
- Distinguish vertical and horizontal scaling
- Apply FinOps principles to control the bill
Big Data cost items
The cloud bill of a Big Data platform is spread across a few major items. Knowing them lets you target optimizations where they matter.
| Item | Example service | Cost-saving lever |
|---|---|---|
| Storage | S3, ADLS, GCS | Tiering (hot/cold), compression |
| Compute | EMR, Databricks, Dataproc | Spot instances, auto-scaling |
| Streaming | Kafka, Kinesis | Partition sizing |
| Queries | Athena, BigQuery | Partitioning, columnar formats |
| Network transfer | Inter-region egress | Stay in a single region |
Estimation: e-commerce example
Back to the e-commerce case (2 TB/month, 5,000 events/s at peak): the goal is an order-of-magnitude budget, not dollar precision.
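The arithmetic behind such an estimate can be sketched as below. Only the 2 TB/month figure comes from the case; every unit price, the retention period, the cluster size and the partition count are placeholder assumptions to be replaced with figures from your provider's pricing page.

```python
# All unit prices are placeholder assumptions, not quotes from any provider.
ASSUMED_PRICES = {
    "storage_per_tb_month": 23.0,          # assumed price per TB stored per month
    "compute_per_node_hour": 0.50,         # assumed price per cluster node per hour
    "streaming_per_partition_hour": 0.02,  # assumed price per partition per hour
}


def estimate_monthly_cost(tb_per_month, months_retained, nodes, hours_per_day,
                          partitions, prices, days=30):
    """Order-of-magnitude monthly bill, split by cost item."""
    stored_tb = tb_per_month * months_retained
    costs = {
        "storage": stored_tb * prices["storage_per_tb_month"],
        "compute": nodes * hours_per_day * days * prices["compute_per_node_hour"],
        "streaming": partitions * 24 * days * prices["streaming_per_partition_hour"],
    }
    costs["total"] = sum(costs.values())
    return costs


estimate = estimate_monthly_cost(
    tb_per_month=2,      # from the e-commerce case
    months_retained=12,  # assumption: one year of history kept
    nodes=4,             # assumption: small batch cluster
    hours_per_day=6,     # assumption: cluster runs 6 h per day
    partitions=6,        # assumption
    prices=ASSUMED_PRICES,
)
for item, cost in estimate.items():
    print(f"{item}: {cost:.0f}")
```

Keeping the result split by item is what makes the estimate useful: it shows immediately which lever from the table above deserves attention first.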
Vertical (scale up)
More powerful machines. Simple, but limited and expensive. Reserved for components that do not distribute well.
Horizontal (scale out)
More machines. This is the native mode of Big Data: Kafka adds partitions, Spark adds executors, and S3 scales storage without capacity you have to provision.
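Scaling out turns capacity planning into a division: how many units are needed to absorb the peak. A minimal sketch for the 5,000 events/s peak of the e-commerce case, where the per-partition throughput and the headroom factor are assumptions to be measured on your own workload:

```python
import math


def units_needed(peak_load, capacity_per_unit, headroom=2.0):
    """Number of identical units (partitions, executors) needed for a peak load.

    headroom multiplies the peak to keep a safety margin; 2.0 is an assumption.
    """
    if capacity_per_unit <= 0:
        raise ValueError("capacity_per_unit must be positive")
    return math.ceil(peak_load * headroom / capacity_per_unit)


# Assumption: one partition sustains about 1,000 events/s in this workload.
partitions = units_needed(peak_load=5000, capacity_per_unit=1000)
print("partitions:", partitions)
```

The same function applies to Spark executors or any other component that distributes; a component that cannot be split this way is the case where vertical scaling remains the only option.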
This article covers the most useful excerpts — the full Big Data Fundamentals Architecture course (11 chapters, 43 lessons, corrected exercises and final project) takes you all the way.