Scala PySpark Big Data in Practice: The Code and Commands That Really Matter

Scala PySpark Big Data: The Essentials in One Article — Real Code, Diagrams and Concrete Steps, Excerpts from a 47-Lesson Course.

No endless theory here: open the terminal and practice. Here are the essentials of Scala PySpark Big Data, extracted directly from a complete 47-lesson course, with real code you can copy and paste right now.

tl;dr
  • Introduction to Programming Paradigms
  • Install the Big Data Environment
  • Discover Big Data and Spark
  • Scala Fundamentals
  • Advanced Scala for Spark
~$ cat ./parcours.md # Scala PySpark Big Data — 12 chapters
01 Introduction to Programming Paradigms
  → The imperative and procedural paradigm
  → The object-oriented paradigm
  + 2 more lessons
02 Install the Big Data Environment
  → Install Java JDK and Scala
  → Install Apache Spark and PySpark
  + 2 more lessons
03 Discover Big Data and Spark
  → History and challenges of Big Data
  → Apache Spark Ecosystem
  + 1 more lesson
04 Scala Fundamentals
  → Basic syntax and data types
  → Functions, conditions and loops
  + 2 more lessons
05 Advanced Scala for Spark
  → Classes, objects and case classes
  → Pattern matching and Options
  + 4 more lessons
06 Theoretical foundations of Spark RDD DataFrame Dataset
  → Internal architecture of Spark
  → RDD vs DataFrame vs Dataset
  + 1 more lesson
07 RDD Resilient Distributed Datasets
  → What is an RDD?
  → Transformations and actions on RDDs
  + 1 more lesson
08 DataFrames and Spark SQL
  → What is a DataFrame?
  → Operations and transformations on DataFrames
  + 2 more lessons
🏁 Final project (+ 4 chapters along the way)
  → You leave with a concrete and demonstrable project

Complete Spark Hands-on Labs

NOTE: Objective. Put into practice all the concepts covered in the course through concrete exercises in spark-shell (Scala) and PySpark. This module is a guided practical lab: you type the commands one by one and observe the results.
WARNING: Rest assured. This lab is designed to be followed step by step. Every command is explained. You do not need to understand everything immediately: the goal is to manipulate and see the results. Understanding will come with practice.

Learning Objectives

TIP: At the end of this module, you will be able to master these essential skills.
NOTE: Prerequisites. To follow this lab, you must have:
  • Spark installed (see the chapter "Install the Big Data Environment")
  • The spark-shell accessible from your terminal
  • A working directory (example: C:/Users/user01/Desktop/SPARK/)

PART 0 – Mastering the spark-shell (REPL)

NOTE: What is the spark-shell? The spark-shell is a REPL (Read-Eval-Print Loop): an interactive terminal where you type a Scala command, and Spark executes it immediately and displays the result. It is the ideal tool for learning and testing your Spark commands quickly, without having to create a full project.
TIP: Variables created automatically. When you launch spark-shell, Spark automatically creates two variables for you:
  • spark: a SparkSession object (Spark's main entry point)
  • sc: a SparkContext object (used to create RDDs)
You do not need to create them yourself. They are ready to use. In addition, spark.implicits._ is already imported automatically (which allows you to use .toDF()).

0.1 – Launch the spark-shell

bash
# Windows (PowerShell):
spark-shell

# macOS / Linux (from the Spark installation directory):
./bin/spark-shell
# Or if Spark is in your PATH:
spark-shell
NOTE: Spark Web UI. When the spark-shell starts, Spark automatically creates a web interface accessible at http://localhost:4040. If port 4040 is busy, Spark tries 4041, 4042, etc. This interface lets you view the current jobs, stages and tasks.

0.2 – First test: create a DataFrame

output
// Type these commands one by one in the spark-shell:

scala> import spark.implicits._
// already imported automatically, but useful if you create your own SparkSession

scala> val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
// data: Seq[(String, String)] = List((Java,20000), (Python,100000), (Scala,3000))

scala> val df = data.toDF()
// df: org.apache.spark.sql.DataFrame = [_1: string, _2: string]

scala> df.show()
// +------+------+
// |    _1|    _2|
// +------+------+
// |  Java| 20000|
// |Python|100000|
// | Scala|  3000|
// +------+------+
TIP: Name the columns. The columns are called _1 and _2 by default. To give them real names, use .toDF("language", "offers"):
output
scala> val df = data.toDF("language", "offers")
scala> df.show()
// +--------+------+
// |language|offers|
// +--------+------+
// |    Java| 20000|
// |  Python|100000|
// |   Scala|  3000|
// +--------+------+

0.3 – Special spark-shell commands

WARNING: Very important. The spark-shell has special commands that start with : (colon). These commands are not Scala; they are specific to the REPL. Learn them: they will save you a lot of time!
output
// Display the full help (list of all special commands)
scala> :help
Command | Description | Example
:help | Displays the list of all available commands | :help or :he
:paste | “Paste” mode: allows you to paste multiple lines of code at once. End with Ctrl+D. | :paste
:load <file> | Loads and executes a .scala file line by line | :load hello.scala
:load -v <file> | Loads a file in verbose mode (shows each executed line) | :load -v hello.scala
:quit | Exits the spark-shell cleanly | :quit or :q
:history | Shows the history of typed commands | :history or :history 20
:h? <word> | Searches for a word in the history | :h? toDF
:require <jar> | Adds a JAR file to the classpath during the session | :require /path/to/my.jar
:type <expr> | Displays the type of an expression without executing it | :type 1 + 2 (prints Int)
:imports | Displays all active imports in the session | :imports
:implicits | Displays the available implicits | :implicits -v
:reset | Resets the REPL (clears all variables) | :reset
:replay | Resets and replays all previous commands | :replay
:save <file> | Saves the session to a .scala file | :save my_session.scala
:sh <cmd> | Executes a shell command (Unix/macOS only) | :sh ls -la
:silent | Enables/disables automatic result display | :silent
:javap <class> | Disassembles a Java / Scala class | :javap scala.Int
NOTE: Why :paste? In the spark-shell, if you paste multi-line code directly, the REPL tries to execute each line separately, which causes errors. :paste mode lets you paste an entire block of code and execute it as a whole.
output
// Step 1: type :paste and press Enter
scala> :paste
// Entering paste mode (ctrl-D to finish)

// Step 2: paste your multi-line code
val names = Seq("Alice", "Bob", "Charlie")
val rdd = sc.parallelize(names)
val upper = rdd.map(_.toUpperCase)
upper.collect()

// Step 3: press Ctrl+D to execute
// Exiting paste mode, now interpreting.

// names: Seq[String] = List(Alice, Bob, Charlie)
// rdd: org.apache.spark.rdd.RDD[String] = ...
// upper: org.apache.spark.rdd.RDD[String] = ...
// res0: Array[String] = Array(ALICE, BOB, CHARLIE)
WARNING: Ctrl+D, not Ctrl+C! To exit :paste mode, press Ctrl+D. Pressing Ctrl+C cancels the pasted code.
NOTE: Use case. You have a .scala file containing functions or processing you want to run in the spark-shell. Instead of retyping everything, use :load.

Step 1 – Create the Scala file:

bash
# Windows (PowerShell):
@"
println("Hello from the Scala file!")
val animals = Seq("cat", "dog", "bird")
val rdd = sc.parallelize(animals)
println("Number of animals: " + rdd.count())
rdd.collect().foreach(println)
"@ | Out-File -Encoding utf8 "C:\Users\user01\Desktop\SPARK\hello.scala"

# macOS / Linux:
cat > ~/Desktop/SPARK/hello.scala <<'EOF'
println("Hello from the Scala file!")
val animals = Seq("cat", "dog", "bird")
val rdd = sc.parallelize(animals)
println("Number of animals: " + rdd.count())
rdd.collect().foreach(println)
EOF

Spark SQL Practice — AAPL & Income

NOTE: Objective. Apply Spark SQL on two real datasets, Apple stock data (AAPL.csv) and income data (Income.csv), using case class, RDD, DataFrame and Spark SQL aggregation functions.
TIP: Prerequisites. You have completed Part 5 (Maven IntelliJ project) and have Spark installed or access to a Spark environment (Databricks, IntelliJ with Spark, or spark-shell).

0. Data Presentation & Download

0.1 – AAPL.csv file (Apple stock data)

NOTE: The file AAPL.csv contains the historical prices of Apple stock (NASDAQ: AAPL). Each line represents a trading day.
Column | Type | Description | Example
dt | String | Transaction date | 1984-09-07
openprice | Double | Opening price of the day | 25.50
highprice | Double | Highest price of the day | 26.10
lowprice | Double | Lowest price of the day | 24.80
closeprice | Double | Closing price of the day | 25.90
volume | Double | Number of shares traded | 1234567.0
adjcloseprice | Double | Adjusted closing price (dividends, splits) | 25.85
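
As a minimal sketch of how the schema above could map to a Scala case class, the snippet below parses one CSV line into a typed record. The names Stock and parseStock, the comma delimiter, the column order and the presence of a header row are assumptions derived from the table, not something confirmed by the file itself. It uses only the standard library, so it can be pasted into the spark-shell with :paste or into any Scala REPL.

```scala
import scala.util.Try

// Hypothetical case class mirroring the columns described in the table
case class Stock(
  dt: String,
  openprice: Double,
  highprice: Double,
  lowprice: Double,
  closeprice: Double,
  volume: Double,
  adjcloseprice: Double
)

// Returns None for a header row or any malformed line instead of throwing
def parseStock(line: String): Option[Stock] = {
  val fields = line.split(",").map(_.trim)
  if (fields.length != 7) None
  else Try(
    Stock(
      fields(0),
      fields(1).toDouble,
      fields(2).toDouble,
      fields(3).toDouble,
      fields(4).toDouble,
      fields(5).toDouble,
      fields(6).toDouble
    )
  ).toOption
}

// Sample line built from the example values in the table (not real file content)
val sample = parseStock("1984-09-07,25.50,26.10,24.80,25.90,1234567.0,25.85")

// An assumed header row: the text cannot be converted to Double, so it is rejected
val header = parseStock("Date,Open,High,Low,Close,Volume,Adj Close")
```

In the spark-shell, the same function could be applied to the file with something like sc.textFile("./AAPL.csv").flatMap(line => parseStock(line)), which would drop the header and any malformed rows.
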
NOTE: How to download AAPL.csv (3 methods)
Method 1, wget (recommended, Windows/macOS/Linux)
Download the raw file directly from GitHub:
bash
wget https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/AAPL.csv
NOTE: Under Windows PowerShell, where GNU wget is usually not installed, use Invoke-WebRequest instead:
bash
# PowerShell — download AAPL.csv in the current folder
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/AAPL.csv" -OutFile "AAPL.csv"
NOTE: Method 2, git clone (clone the entire repository)
Clone the full repository and retrieve all data files:
bash
# Clone the inskillflow/data repository
git clone https://github.com/inskillflow/data.git

# The AAPL.csv file will be in:
#   data/AAPL.csv
TIP: Method 3, GitHub interface (no terminal)
  1. Open https://github.com/inskillflow/data/blob/main/AAPL.csv
  2. Click the “Raw” button at the top right of the file
  3. Press Ctrl+S (or Cmd+S on Mac) to save the page as AAPL.csv
Direct link to the raw file:
https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/AAPL.csv
NOTE: Placing the file after download. Place AAPL.csv in an accessible folder, then adjust the path in your code:
  • Windows: C:\Users\YourName\Desktop\Spark\AAPL.csv
  • Linux / macOS: /home/user/data/AAPL.csv
  • Spark Shell (relative path): ./AAPL.csv

0.2 – Income.csv file (income data)

Column | Type | Description
id | Double | Unique identifier
workclass | String | Employment category (Private, Self-emp, Gov...)
education | String | Education level
maritalstatus | String | Marital status
occupation | String | Occupation
relationship | String | Family relationship
race | String | Ethnic origin
gender | String | Gender
nativecountry | String | Country of origin
income | String | Income bracket (<=50K or >50K)
age | Double | Age of the individual
fnlwgt | Double | Statistical weighting factor
educationalnum | Double | Number of years of education
capitalgain | Double | Capital gains
capitalloss | Double | Capital losses
hoursperweek | Double | Hours worked per week
NOTE: How to download Income.csv (3 methods)
Method 1, wget
bash
wget https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/Income.csv
NOTE: Under Windows PowerShell:
bash
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/Income.csv" -OutFile "Income.csv"
NOTE: Method 2, git clone (clone both files at once)
bash
# If you have already cloned the repository for AAPL.csv, Income.csv is already present
# Otherwise:
git clone https://github.com/inskillflow/data.git
# The Income.csv file will be in:  data/Income.csv

Pattern Matching and Options

NOTE: Objective. Master pattern matching, Scala’s most powerful tool for comparing, sorting and deconstructing data. Discover the Option[T] type, which helps you avoid NullPointerException, and go further with Either and for-comprehensions.
WARNING: Rest assured. Pattern matching is simply the equivalent of a switch/case in Java or a match in Python 3.10+, but much more powerful. If you already know if/elif/else in Python, you will follow without difficulty. Scala adds the ability to check types, extract data from objects and combine type + value + condition in a single case.
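
The objective above mentions Option[T], Either and for-comprehensions. As a rough sketch of how they fit together, the snippet below uses a small lookup table; the names capitals, findCapital, describe and parseAge are illustrative and do not come from the course. It relies only on the standard library.

```scala
import scala.util.Try

// Hypothetical lookup table used only for this illustration
val capitals = Map("France" -> "Paris", "Italy" -> "Rome")

// Option[T]: Some(value) when the key exists, None otherwise (never null)
def findCapital(country: String): Option[String] = capitals.get(country)

// Pattern matching deconstructs the Option and handles both cases explicitly
def describe(country: String): String = findCapital(country) match {
  case Some(city) => s"The capital of $country is $city"
  case None       => s"No capital known for $country"
}

// Either: Left carries an error message, Right carries the valid result
def parseAge(raw: String): Either[String, Int] =
  Try(raw.trim.toInt).toOption match {
    case Some(n) if n >= 0 => Right(n)
    case Some(n)           => Left(s"Negative age: $n")
    case None              => Left(s"Not a number: $raw")
  }

// for-comprehension: yields Some only if every step returns Some
val bothCapitals: Option[String] = for {
  a <- findCapital("France")
  b <- findCapital("Italy")
} yield s"$a and $b"

// A single None (here "Spain") makes the whole expression None
val missing: Option[String] = for {
  a <- findCapital("France")
  b <- findCapital("Spain")
} yield s"$a and $b"
```

The design choice here is that absence and failure are ordinary values in the return type, so the caller has to deal with them through match or a for-comprehension rather than discovering a null at runtime.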

Learning Objectives

TIP: At the end of this module, you will be able to master these essential skills.

1. The postal sorting analogy

Imagine a postal sorting center. Each letter arrives on a conveyor belt. An employee looks at the address and sends the letter to the correct box: Paris to the left, Lyon to the right, Marseille straight ahead. If the address matches nothing known, the letter goes into an “Other” box.

Pattern matching works exactly like this postal sorting: we examine a value, compare it to several patterns, and execute the code associated with the first matching pattern.

TIP: Why it is better than an if/else. Pattern matching is more readable, safer (when you match on a sealed trait, the compiler checks that all cases are covered) and far more powerful, because it can deconstruct complex objects, extract values, and combine type + value + condition in a single case.
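
To illustrate the two claims above (coverage checking on a sealed trait, and deconstruction with conditions), here is a minimal sketch. Shape, Circle, Rectangle, area and describe are hypothetical names invented for this example. In the spark-shell, paste the whole block with :paste so that the trait and its subtypes are compiled together.

```scala
// sealed: every subtype must be declared in the same file (or :paste block),
// so the compiler knows the complete list of possible cases
sealed trait Shape
case class Circle(radius: Double) extends Shape
case class Rectangle(width: Double, height: Double) extends Shape

// Deconstruction: each case extracts the fields of the matched object.
// If the Rectangle case were removed, the compiler would warn that
// the match may not be exhaustive.
def area(shape: Shape): Double = shape match {
  case Circle(r)       => math.Pi * r * r
  case Rectangle(w, h) => w * h
}

// Type + extracted values + condition combined in a single case
def describe(shape: Shape): String = shape match {
  case Circle(r) if r > 100      => "Large circle"
  case Circle(_)                 => "Circle"
  case Rectangle(w, h) if w == h => "Square"
  case Rectangle(_, _)           => "Rectangle"
}
```

Note that the order of the cases matters: the guarded cases come first, because the first matching case wins.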

2. The match/case syntax

The match expression takes a value and compares it to a series of case clauses. The first matching clause is executed. Unlike Java’s switch, there is no break to write — Scala automatically stops at the first matching case.

NOTE: Key concept. In Scala, match is an expression, not a statement. This means it always returns a value. You can therefore write val x = value match { ... }.

Example 1: days of the week

output
val day = "Wednesday"

val dayType = day match {
  case "Monday" | "Tuesday" | "Wednesday" | "Thursday" | "Friday" =>
    "Weekday"
  case "Saturday" | "Sunday" =>
    "Weekend"
  case _ =>
    "Unknown day"  // _ = the “Other” box of postal sorting
}

println(dayType)  // "Weekday"
NOTE: The wildcard _ (underscore). The symbol _ matches “everything else”. It must always be placed last, because it catches everything. Without it, if no case matches, Scala throws a MatchError at runtime.
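
The MatchError described above can be observed safely by wrapping the call in scala.util.Try. This is only a sketch: toNumber and toNumberSafe are hypothetical helper names, and returning -1 for unknown input is an arbitrary choice for the illustration.

```scala
import scala.util.{Failure, Success, Try}

// No wildcard: only two values are handled
def toNumber(word: String): Int = word match {
  case "one" => 1
  case "two" => 2
}

// Try captures the exception so the session keeps running
val ok     = Try(toNumber("two"))
val broken = Try(toNumber("three"))

// Inspect the failure: the unmatched input produced a MatchError
val report = broken match {
  case Success(n)             => s"Parsed $n"
  case Failure(_: MatchError) => "MatchError: no case matched"
  case Failure(other)         => s"Unexpected error: $other"
}

// Same function with a wildcard placed last: every input now has a result
def toNumberSafe(word: String): Int = word match {
  case "one" => 1
  case "two" => 2
  case _     => -1
}
```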

Example 2: convert a grade into an honours mention

output
val grade = 15

val mention = grade match {
  case 20                 => "Perfect"
  case 16 | 17 | 18 | 19 => "Very good"
  case 14 | 15            => "Good"
  case 12 | 13            => "Fairly good"
  case 10 | 11            => "Pass"
  case _                  => "Fail"
}

println(mention)  // "Good"

In Python, and in Java before version 14, a match or switch is a statement: it does something but does not directly return a value. In Scala, match is an expression: it always returns a value.

What this changes concretely:

output
// Scala: match is an expression, it can be assigned directly
val age = 30  // sample value so that the snippet runs as-is
val category = age match {
  case a if a < 18  => "Minor"
  case a if a < 65  => "Adult"
  case _            => "Senior"
}

// It can also be used as a function argument
println(age match {
  case a if a < 18 => "Minor"
  case _           => "Adult"
})

// Or inside a string interpolation
val message = s"Status: ${age match {
  case a if a < 18 => "Minor"
  case _           => "Adult"
}}"
output
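// Java: the switch (before Java 14) is a statement and does not return a value
// You had to declare the variable first, then assign it in each branch:
String day = "Wednesday";
String dayType;
switch (day) {
  case "Saturday":
  case "Sunday":
    dayType = "Weekend";
    break;
  default:
    dayType = "Weekday";
}
// Java 14+ added switch expressions to fill this gap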
TIP: Practical rule. Because match returns a value, the compiler infers a single result type from all branches. If one case returns a String and another an Int, Scala infers their common supertype (Any). This is often a sign of a design error.
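
A small sketch of the rule above, with illustrative names (code, loose, status, result). The type annotations are written explicitly so that the snippet shows the types involved; with Scala 2, the version used by the spark-shell, the first match would be inferred as Any even without the annotation.

```scala
val code = 404

// Branches return an Int and a String: the only common type is Any,
// so the caller can no longer treat the result as a number or as text
val loose: Any = code match {
  case 200 => 200
  case _   => "error"
}

// One possible fix: make every branch return the same type
val status: String = code match {
  case 200 => "ok"
  case 404 => "not found"
  case _   => "error"
}

// Another: model the two outcomes explicitly with Either
val result: Either[String, Int] = code match {
  case 200 => Right(200)
  case _   => Left("error")
}
```
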
output
# Python 3.10+ structural pattern matching
day = "Wednesday"
match day:
    case "Monday" | "Tuesday" | "Wednesday" | "Thursday" | "Friday":
        day_type = "Weekday"
    case "Saturday" | "Sunday":
        day_type = "Weekend"
    case _:
        day_type = "Unknown day"

# Difference: in Python, match is a statement (no direct assignment)
# In Scala: val x = value match { ... }  <-- expression that returns a value
# In Python: you must assign in each branch manually
output
# Python < 3.10: no match/case, we use if/elif/else
day = "Wednesday"

if day in ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"):
    day_type = "Weekday"
elif day in ("Saturday", "Sunday"):
    day_type = "Weekend"
else:
    day_type = "Unknown day"

This article covers the most useful excerpts — the complete Scala PySpark Big Data course (13 chapters, 47 lessons, corrected exercises and final project) takes you all the way.


FAQ

How long does it take to learn Scala PySpark Big Data?
With a structured progression (13 chapters, 47 short and practical lessons), you reach an operational level in a few weeks at 30 to 60 minutes per day. The important thing is to practice each concept immediately.
Are prerequisites required?
Basic computer knowledge is sufficient. If you know how to use a terminal and read simple code, you are ready.
Where to start concretely?
Reproduce the commands in this article, then follow the complete Scala PySpark Big Data course: it chains the 47 lessons in order, with exercises and a final project.
