
Scala PySpark Big Data in Practice: The Code and Commands That Really Matter

Scala PySpark Big Data: The Essentials in One Article — Real Code, Diagrams and Concrete Steps, Excerpts from a 47-Lesson Course.

REHOUMA Haythem

12 Jun 2026 • 16 min read

No endless theory here: open the terminal and practice. Here are the essentials of Scala PySpark Big Data, extracted directly from a complete 47-lesson course, with real code you can copy and paste right now.

tl;dr

Introduction to Programming Paradigms
Install the Big Data Environment
Discover Big Data and Spark
Scala Fundamentals
Advanced Scala for Spark

~$ cat ./parcours.md # Scala PySpark Big Data — 12 chapters

Introduction to Programming Paradigms

→ The imperative and procedural paradigm
→ The object-oriented paradigm
+ 2 more lessons

Install the Big Data environment

→ Install Java JDK and Scala
→ Install Apache Spark and PySpark
+ 2 more lessons

Discover Big Data and Spark

→ History and challenges of Big Data
→ Apache Spark Ecosystem
+ 1 more lesson

Scala Fundamentals

→ Basic syntax and data types
→ Functions, conditions and loops
+ 2 more lessons

Advanced Scala for Spark

→ Classes, objects and case classes
→ Pattern matching and Options
+ 4 more lessons

Theoretical foundations of Spark: RDD, DataFrame, Dataset

→ Internal architecture of Spark
→ RDD vs DataFrame vs Dataset
+ 1 more lesson

RDD: Resilient Distributed Datasets

→ What is an RDD?
→ Transformations and actions on RDDs
+ 1 more lesson

DataFrames and Spark SQL

→ What is a DataFrame?
→ Operations and transformations on DataFrames
+ 2 more lessons

🏁

Final project (+ 4 chapters along the way)

→ You leave with a concrete and demonstrable project

Complete Spark Hands-on Labs

NOTE (Objective): Put into practice all the concepts covered in the course through concrete exercises in spark-shell (Scala) and PySpark. This module is a guided practical lab: you type the commands one by one and observe the results.

WARNING (Rest assured): This lab is designed to be followed step by step. Every command is explained. You do not need to understand everything immediately: the goal is to run the commands and see the results. Understanding will come with practice.

Learning Objectives

TIP (At the end of this module): You will be able to master these essential skills.

NOTE (Prerequisites): To follow this lab, you must have:

Spark installed (see Chapter 01)
The spark-shell accessible from your terminal
A working directory (example: C:/Users/user01/Desktop/SPARK/)

PART 0 – Mastering the spark-shell (REPL)

NOTE (What is the spark-shell?): The spark-shell is a REPL (Read-Eval-Print Loop): an interactive terminal where you type a Scala command, Spark executes it immediately and displays the result. It is the ideal tool for learning and testing your Spark commands quickly, without having to create a full project.

TIP (Variables created automatically): When you launch spark-shell, Spark automatically creates two variables for you:

spark: a SparkSession object (Spark's main entry point)
sc: a SparkContext object (used to create RDDs)

You do not need to create them yourself; they are ready to use. In addition, spark.implicits._ is already imported automatically, which allows you to use .toDF().
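Outside the spark-shell, none of this is prepared for you. The following is a minimal sketch of a standalone program that builds the same entry points by hand. The object name StandaloneLab, the application name and the local[*] master are illustrative assumptions, and the Spark SQL library is assumed to be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object StandaloneLab {
  def main(args: Array[String]): Unit = {
    // Outside spark-shell nothing is pre-created: build the session yourself
    val spark = SparkSession.builder()
      .appName("standalone-lab")  // hypothetical application name
      .master("local[*]")         // assumption: local mode on all available cores
      .getOrCreate()

    val sc = spark.sparkContext   // plays the role of the shell's `sc`
    import spark.implicits._      // must be imported by hand here; enables .toDF()

    val df = Seq(("Java", "20000"), ("Scala", "3000")).toDF("language", "offers")
    df.show()
    println(sc.parallelize(Seq(1, 2, 3)).count())  // counts the 3 elements of the RDD

    spark.stop()                  // release resources before exiting
  }
}
```

This is meant for a compiled project, not for pasting into a running spark-shell, where calling spark.stop() would shut down the session the shell created for you.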

0.1 – Launch the spark-shell

bash

# Windows (PowerShell):
spark-shell

# macOS / Linux:
./bin/spark-shell
# Or if Spark is in your PATH:
spark-shell

NOTE (Spark Web UI): When the spark-shell starts, Spark automatically starts a web interface accessible at http://localhost:4040. If port 4040 is busy, Spark tries 4041, 4042, etc. This interface lets you view the current jobs, stages and tasks.

0.2 – First test: create a DataFrame

scala

// Type these commands one by one in the spark-shell:

scala> import spark.implicits._
// already imported automatically, but useful if you create your own SparkSession

scala> val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
// data: Seq[(String, String)] = List((Java,20000), (Python,100000), (Scala,3000))

scala> val df = data.toDF()
// df: org.apache.spark.sql.DataFrame = [_1: string, _2: string]

scala> df.show()
// +------+------+
// |    _1|    _2|
// +------+------+
// |  Java| 20000|
// |Python|100000|
// | Scala|  3000|
// +------+------+

TIP (Name the columns): The columns are called _1 and _2 by default. To give them real names, use .toDF("language", "offers"):

scala

scala> val df = data.toDF("language", "offers")
scala> df.show()
// +--------+------+
// |language|offers|
// +--------+------+
// |    Java| 20000|
// |  Python|100000|
// |   Scala|  3000|
// +--------+------+
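Note that both columns are strings, so sorting on offers would be alphabetical rather than numeric. A possible next step, sketched here under the assumption that you are inside spark-shell (where spark already exists), is to cast the column before ordering. The variable name typed is only illustrative.

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._   // already active in spark-shell; harmless to repeat

val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
val df = data.toDF("language", "offers")

// Cast the string column to an integer so that ordering is numeric
val typed = df.withColumn("offers", col("offers").cast("int"))

typed.printSchema()                        // offers is now an integer column
typed.orderBy(col("offers").desc).show()   // Python first, then Java, then Scala
```

Paste the block with :paste so that it is interpreted as a whole.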

0.3 – Special spark-shell commands

WARNING (Very important): The spark-shell has special commands that start with : (colon). These commands are not Scala; they are specific to the REPL. Learn them: they will save you a lot of time!

scala

// Display the full help (list of all special commands)
scala> :help

Command	Description	Example
`:help`	Displays the list of all available commands	`:help` or `:he`
`:paste`	“Paste” mode: allows you to paste multiple lines of code at once. End with Ctrl+D.	`:paste`
`:load <file>`	Loads and executes a `.scala` file line by line	`:load hello.scala`
`:load -v <file>`	Loads a file in verbose mode (shows each executed line)	`:load -v hello.scala`
`:quit`	Exit the spark-shell cleanly	`:quit` or `:q`
`:history`	Shows the history of typed commands	`:history` or `:history 20`
`:h? <word>`	Search for a word in the history	`:h? toDF`
`:require <jar>`	Add a JAR file to the classpath during the session	`:require /path/to/my.jar`
`:type <expr>`	Displays the type of an expression without executing it	`:type 1 + 2` → `Int`
`:imports`	Displays all active imports in the session	`:imports`
`:implicits`	Displays the available implicits	`:implicits -v`
`:reset`	Resets the REPL (clears all variables)	`:reset`
`:replay`	Resets and replays all previous commands	`:replay`
`:save <file>`	Saves the session to a `.scala` file	`:save my_session.scala`
`:sh <cmd>`	Executes a shell command (Unix/macOS only)	`:sh ls -la`
`:silent`	Enables/disables automatic result display	`:silent`
`:javap <class>`	Disassembles a Java / Scala class	`:javap scala.Int`

NOTE (Why :paste?): In the spark-shell, if you paste multi-line code directly, the REPL tries to interpret each line separately, which can cause errors. :paste mode lets you paste an entire block of code and execute it as a whole.

scala

// Step 1: type :paste and press Enter
scala> :paste
// Entering paste mode (ctrl-D to finish)

// Step 2: paste your multi-line code
val names = Seq("Alice", "Bob", "Charlie")
val rdd = sc.parallelize(names)
val upper = rdd.map(_.toUpperCase)
upper.collect()

// Step 3: press Ctrl+D to execute
// Exiting paste mode, now interpreting.

// names: Seq[String] = List(Alice, Bob, Charlie)
// rdd: org.apache.spark.rdd.RDD[String] = ...
// upper: org.apache.spark.rdd.RDD[String] = ...
// res0: Array[String] = Array(ALICE, BOB, CHARLIE)

WARNING (Ctrl+D, not Ctrl+C!): To exit :paste mode and run the code, press Ctrl+D. Pressing Ctrl+C cancels the pasted code.

NOTE (Use case): You have a .scala file containing functions or processing you want to run in the spark-shell. Instead of retyping everything, use :load.

Step 1 – Create the Scala file:

bash

# Windows (PowerShell):
@"
println("Hello from the Scala file!")
val animals = Seq("cat", "dog", "bird")
val rdd = sc.parallelize(animals)
println("Number of animals: " + rdd.count())
rdd.collect().foreach(println)
"@ | Out-File -Encoding utf8 "C:\Users\user01\Desktop\SPARK\hello.scala"

# macOS / Linux:
cat > ~/Desktop/SPARK/hello.scala <<'EOF'
println("Hello from the Scala file!")
val animals = Seq("cat", "dog", "bird")
val rdd = sc.parallelize(animals)
println("Number of animals: " + rdd.count())
rdd.collect().foreach(println)
EOF

Spark SQL Practice — AAPL & Income

NOTE (Objective): Apply Spark SQL to two real datasets, Apple stock data (AAPL.csv) and income data (Income.csv), using case classes, RDDs, DataFrames and Spark SQL aggregation functions.

TIP (Prerequisites): Have completed Part 5 (Maven IntelliJ project), and have Spark installed or access to a Spark environment (Databricks, IntelliJ with Spark, or spark-shell).

0. Data Presentation & Download

0.1 – AAPL.csv file (Apple stock data)

NOTE: The file AAPL.csv contains the historical prices of Apple stock (NASDAQ: AAPL). Each line represents a trading day.

Column	Type	Description	Example
`dt`	String	Transaction date	`1984-09-07`
`openprice`	Double	Opening price of the day	`25.50`
`highprice`	Double	Highest price of the day	`26.10`
`lowprice`	Double	Lowest price of the day	`24.80`
`closeprice`	Double	Closing price of the day	`25.90`
`volume`	Double	Number of shares traded	`1234567.0`
`adjcloseprice`	Double	Adjusted closing price (dividends, splits)	`25.85`
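The column table above could map to a case class along these lines. This is a sketch in plain Scala, not the course's own code: the names Stock and parseStock are illustrative, and it assumes a comma-separated file whose columns come in the order of the table. The sample line is assembled from the Example column, and the header line shown is hypothetical.

```scala
import scala.util.Try

// Field names and order follow the column table above
case class Stock(
  dt: String,
  openprice: Double,
  highprice: Double,
  lowprice: Double,
  closeprice: Double,
  volume: Double,
  adjcloseprice: Double
)

// Hypothetical helper: returns None for a header line or a malformed row
def parseStock(line: String): Option[Stock] = {
  val f = line.split(",").map(_.trim)
  if (f.length != 7) None
  else Try(
    Stock(f(0), f(1).toDouble, f(2).toDouble, f(3).toDouble,
          f(4).toDouble, f(5).toDouble, f(6).toDouble)
  ).toOption
}

// Sample line assembled from the Example column of the table
val sample = "1984-09-07,25.50,26.10,24.80,25.90,1234567.0,25.85"
val parsed = parseStock(sample)
println(parsed)

// A header line (hypothetical wording) fails the numeric conversion and is skipped
val header = parseStock("dt,openprice,highprice,lowprice,closeprice,volume,adjcloseprice")
println(header)
```

In spark-shell, the same helper could be applied to the file with something like sc.textFile("./AAPL.csv").flatMap(line => parseStock(line)), which drops the rows that fail to parse.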

NOTE (How to download AAPL.csv): 3 methods

Method 1 — wget (recommended, Windows/macOS/Linux)
Download the raw file directly from GitHub:

bash

wget https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/AAPL.csv

NOTE: In Windows PowerShell, use the Invoke-WebRequest cmdlet if wget is unavailable:

bash

# PowerShell — download AAPL.csv in the current folder
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/AAPL.csv" -OutFile "AAPL.csv"

NOTE: Method 2: git clone (clone the entire repository)
Clone the full repository and retrieve all data files:

bash

# Clone the inskillflow/data repository
git clone https://github.com/inskillflow/data.git

# The AAPL.csv file will be in:
#   data/AAPL.csv

TIP: Method 3: GitHub interface (no terminal)

Open https://github.com/inskillflow/data/blob/main/AAPL.csv
Click the “Raw” button at the top right of the file
Press Ctrl+S (or Cmd+S on Mac) to save the page as AAPL.csv

Direct link to the raw file:
https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/AAPL.csv

NOTE (Placing the file after download): Place AAPL.csv in an accessible folder, then adjust the path in your code:

Windows: C:\Users\YourName\Desktop\Spark\AAPL.csv
Linux / macOS: /home/user/data/AAPL.csv
Spark Shell (relative path): ./AAPL.csv

0.2 – Income.csv file (income data)

Column	Type	Description
`id`	Double	Unique identifier
`workclass`	String	Employment category (Private, Self-emp, Gov...)
`education`	String	Education level
`maritalstatus`	String	Marital status
`occupation`	String	Occupation
`relationship`	String	Family relationship
`race`	String	Ethnic origin
`gender`	String	Gender
`nativecountry`	String	Country of origin
`income`	String	Income bracket (`<=50K` or `>50K`)
`age`	Double	Age of the individual
`fnlwgt`	Double	Statistical weighting factor
`educationalnum`	Double	Number of years of education
`capitalgain`	Double	Capital gains
`capitalloss`	Double	Capital losses
`hoursperweek`	Double	Hours worked per week
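To get a feel for the kind of aggregation this dataset invites, here is a small plain-Scala sketch of a "group by income bracket" computation. It uses only a subset of the columns above, the Person class is illustrative, and the four rows are invented for the example, not taken from Income.csv.

```scala
// Subset of the Income.csv columns, for illustration only
case class Person(workclass: String, income: String, age: Double, hoursperweek: Double)

// Invented rows, not taken from the real file
val people = Seq(
  Person("Private",  "<=50K", 25.0, 40.0),
  Person("Private",  ">50K",  45.0, 50.0),
  Person("Self-emp", ">50K",  52.0, 60.0),
  Person("Gov",      "<=50K", 38.0, 35.0)
)

// Same idea as: SELECT income, COUNT(*), AVG(hoursperweek) ... GROUP BY income
val byIncome = people.groupBy(_.income).map { case (bracket, rows) =>
  bracket -> (rows.size, rows.map(_.hoursperweek).sum / rows.size)
}

byIncome.toSeq.sortBy(_._1).foreach { case (bracket, (n, avgHours)) =>
  println(s"$bracket: $n people, $avgHours hours/week on average")
}
```

On the real file the grouping would run on a DataFrame or through Spark SQL rather than on a local Seq, but the shape of the result (one row per income bracket) is the same.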

NOTE (How to download Income.csv): 3 methods

Method 1 — wget

bash

wget https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/Income.csv

NOTE: In Windows PowerShell:

bash

Invoke-WebRequest -Uri "https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/Income.csv" -OutFile "Income.csv"

NOTE: Method 2: git clone (clone both files at once)

bash

# If you have already cloned the repository for AAPL.csv, Income.csv is already present
# Otherwise:
git clone https://github.com/inskillflow/data.git
# The Income.csv file will be in:  data/Income.csv

Pattern Matching and Options

NOTE (Objective): Master pattern matching, Scala’s most powerful tool for comparing, sorting and deconstructing data. Discover the Option[T] type, which helps you avoid NullPointerException, and go further with Either and for-comprehensions.

WARNING (Rest assured): Pattern matching is the equivalent of a switch/case in Java or a match in Python 3.10+, but much more powerful. If you already know if/elif/else in Python, you will understand without difficulty. Scala adds the ability to check types, extract data from objects and combine type + value + condition in a single case.

Learning Objectives

TIP (At the end of this module): You will be able to master these essential skills.

1. The postal sorting analogy

Imagine a postal sorting center. Each letter arrives on a conveyor belt. An employee looks at the address and sends the letter to the correct box: Paris to the left, Lyon to the right, Marseille straight ahead. If the address matches nothing known, the letter goes into an “Other” box.

Pattern matching works exactly like this postal sorting: we examine a value, compare it to several patterns, and execute the code associated with the first matching pattern.

TIP (Why it is better than an if/else): Pattern matching is more readable, safer (with a sealed trait, the compiler warns you when a case is missing) and far more powerful, because it can deconstruct complex objects, extract values, and combine type + value + condition in a single case.
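To make the postal analogy concrete, here is a small sketch with a sealed trait. The types Letter, Domestic, International and Undeliverable and the function route are invented for this illustration; they are not part of the course code.

```scala
// A closed family of letter kinds: the compiler knows every possible subtype
sealed trait Letter
case class Domestic(city: String) extends Letter
case class International(country: String, priority: Boolean) extends Letter
case object Undeliverable extends Letter

def route(letter: Letter): String = letter match {
  case Domestic("Paris")              => "left box"                  // type + value
  case Domestic("Lyon")               => "right box"
  case Domestic(city) if city.isEmpty => "Other box"                 // type + condition
  case Domestic(city)                 => s"regional box for $city"   // extracts the city
  case International(country, true)   => s"air mail to $country"
  case International(country, false)  => s"surface mail to $country"
  case Undeliverable                  => "Other box"
}

println(route(Domestic("Paris")))
println(route(Domestic("Marseille")))
println(route(International("Canada", true)))
```

If you delete the Undeliverable case, the compiler reports that the match may not be exhaustive, which is the safety net that a chain of if/else does not give you.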

2. The `match/case` syntax

The match expression takes a value and compares it to a series of case clauses. The first matching clause is executed. Unlike Java’s switch, there is no break to write — Scala automatically stops at the first matching case.

NOTE (Key concept): In Scala, match is an expression, not a statement. This means it always returns a value. You can therefore write val x = value match { ... }.

Example 1: days of the week

scala

val day = "Wednesday"

val dayType = day match {
  case "Monday" | "Tuesday" | "Wednesday" | "Thursday" | "Friday" =>
    "Weekday"
  case "Saturday" | "Sunday" =>
    "Weekend"
  case _ =>
    "Unknown day"  // _ = the “Other” box of postal sorting
}

println(dayType)  // "Weekday"

NOTE (The wildcard _): The symbol _ (underscore) matches “everything else”. Place it last, because it catches everything and any case written after it is unreachable. Without it, if no case matches, Scala throws a MatchError at runtime.
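The runtime failure is easy to observe. In this sketch, the function boxFor is a made-up example that deliberately omits the wildcard, and Try is used only to capture the exception instead of crashing the session.

```scala
import scala.util.Try

// Deliberately incomplete: there is no `case _`
def boxFor(city: String): String = city match {
  case "Paris" => "left"
  case "Lyon"  => "right"
}

val known   = Try(boxFor("Paris"))      // a matching case exists
val unknown = Try(boxFor("Marseille"))  // no case matches

println(known)                                // Success(left)
println(unknown.isFailure)                    // true
println(unknown.failed.get.getClass.getName)  // scala.MatchError
```

Adding a final case _ => "other" turns the failure into an ordinary result.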

Example 2: convert a grade into a mention (honors level)

scala

val grade = 15

val mention = grade match {
  case 20                 => "Perfect"
  case 16 | 17 | 18 | 19  => "Very good"
  case 14 | 15            => "Good"
  case 12 | 13            => "Fairly good"
  case 10 | 11            => "Pass"
  case _                  => "Fail"
}

println(mention)  // "Good"

In Java before version 14, or in Python, a switch or match is a statement: it does something but does not directly return a value. In Scala, match is an expression: it always returns a value.

What this changes concretely:

scala

// Scala: match is an expression, it can be assigned directly
val age = 30  // sample value so the snippets below run as written
val category = age match {
  case a if a < 18  => "Minor"
  case a if a < 65  => "Adult"
  case _            => "Senior"
}

// It can also be used as a function argument
println(age match {
  case a if a < 18 => "Minor"
  case _           => "Adult"
})

// Or inside a string interpolation
val message = s"Status: ${age match {
  case a if a < 18 => "Minor"
  case _           => "Adult"
}}"

java

// Java: the switch (before Java 14) does not return a value
// You had to write:
String category;
switch (age) {
  case ...: category = "Minor"; break;
  default:  category = "Adult";
}
// Java 14+ added switch expressions to fill this gap

TIP (Practical rule): Because match returns a value, the compiler infers a single result type from all branches. If one case returns a String and another an Int, the code still compiles, but Scala infers their common supertype (Any). This is often a sign of a design error.
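One way to catch such a mistake early, sketched below with invented names (label, mixed), is to write the expected result type explicitly. The supertype mentioned in the comments is what Scala 2, the version used by spark-shell, infers.

```scala
// With an explicit result type, every branch must produce a String.
// Adding a branch such as `case 42 => 42` would be rejected at compile time.
def label(n: Int): String = n match {
  case 0          => "zero"
  case x if x < 0 => "negative"
  case _          => "positive"
}

// Without an annotation, mixed branches compile but the type is widened
val n = 1
val mixed = n match {
  case 1 => "one"   // String
  case _ => 2       // Int
}
// `mixed` is typed as a common supertype, so String methods such as
// mixed.toUpperCase are no longer available without a cast

println(label(-3))
println(mixed)
```

Annotating the type of a val or the return type of a function is a cheap habit that keeps this kind of widening from going unnoticed.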

python

# Python 3.10+ structural pattern matching
day = "Wednesday"
match day:
    case "Monday" | "Tuesday" | "Wednesday" | "Thursday" | "Friday":
        day_type = "Weekday"
    case "Saturday" | "Sunday":
        day_type = "Weekend"
    case _:
        day_type = "Unknown day"

# Difference: in Python, match is a statement (no direct assignment)
# In Scala: val x = value match { ... }  <-- expression that returns a value
# In Python: you must assign in each branch manually

python

# Python < 3.10: no match/case, we use if/elif/else
day = "Wednesday"

if day in ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"):
    day_type = "Weekday"
elif day in ("Saturday", "Sunday"):
    day_type = "Weekend"
else:
    day_type = "Unknown day"

go-further

This article covers the most useful excerpts — the complete Scala PySpark Big Data course (13 chapters, 47 lessons, corrected exercises and final project) takes you all the way.


FAQ

How long does it take to learn Scala PySpark Big Data?

With a structured progression (13 chapters, 47 short and practical lessons), you reach an operational level in a few weeks at 30 to 60 minutes per day. The important thing is to practice each concept immediately.

Are prerequisites required?

Basic computer knowledge is sufficient. If you know how to use a terminal and read simple code, you are ready.

Where should you start, concretely?

Reproduce the commands in this article, then follow the complete Scala PySpark Big Data course: it chains the 47 lessons in order, with exercises and a final project.
