Scala PySpark Big Data in Practice: The Code and Commands That Really Matter
Scala PySpark Big Data: The Essentials in One Article — Real Code, Diagrams and Concrete Steps, Excerpts from a 47-Lesson Course.
No endless theory here: open the terminal and practice. Here are the essentials of Scala PySpark Big Data, extracted directly from a complete 47-lesson course, with real code you can copy and paste right now.
- Introduction to Programming Paradigms
- Install the Big Data Environment
- Discover Big Data and Spark
- Scala Fundamentals
- Advanced Scala for Spark
Complete Spark Hands-on Labs
This module is a guided practical lab with spark-shell (Scala) and PySpark: you type the commands one by one and observe the results.
Prerequisites
- Spark installed (see Chapter 01)
- The spark-shell accessible from your terminal
- A working directory (example: C:/Users/user01/Desktop/SPARK/)
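Before starting the lab, the two local prerequisites can be checked from Python. The following is a minimal sketch using only the standard library; the function name check_prerequisites is hypothetical, and the path is the example directory from the list above, which you should replace with your own.

```python
import shutil
from pathlib import Path


def check_prerequisites(work_dir: str) -> dict:
    """Report whether spark-shell is on the PATH and the working directory exists."""
    return {
        "spark_shell_on_path": shutil.which("spark-shell") is not None,
        "work_dir_exists": Path(work_dir).is_dir(),
    }


# Example path from this lab; replace it with your own working directory.
status = check_prerequisites("C:/Users/user01/Desktop/SPARK/")
for name, ok in status.items():
    print(f"{name}: {'OK' if ok else 'missing'}")
```

If spark_shell_on_path is reported as missing, you can still launch the shell with ./bin/spark-shell from the Spark installation directory.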
PART 0 – Mastering the spark-shell (REPL)
spark-shell is a REPL (Read-Eval-Print Loop): an interactive terminal where you type a Scala command, Spark executes it immediately, and displays the result. It is the ideal tool for learning and testing your Spark commands quickly, without having to create a full project.
When you launch spark-shell, Spark automatically creates two variables for you:
- spark: a SparkSession object (Spark's main entry point)
- sc: a SparkContext object (used to create RDDs)
spark.implicits._ is also imported automatically (which allows you to use .toDF()).
0.1 – Launch the spark-shell
# Windows (PowerShell):
spark-shell
# macOS / Linux, from the Spark installation directory:
./bin/spark-shell
# Or, if Spark is in your PATH:
spark-shell
Once the shell has started, the Spark web UI is available at http://localhost:4040. If port 4040 is busy, Spark tries 4041, 4042, etc. This interface lets you view the current jobs, stages and tasks.
0.2 – First test: create a DataFrame
// Type these commands one by one in the spark-shell:
scala> import spark.implicits._
// already imported automatically, but useful if you create your own SparkSession
scala> val data = Seq(("Java", "20000"), ("Python", "100000"), ("Scala", "3000"))
// data: Seq[(String, String)] = List((Java,20000), (Python,100000), (Scala,3000))
scala> val df = data.toDF()
// df: org.apache.spark.sql.DataFrame = [_1: string, _2: string]
scala> df.show()
// +------+------+
// |    _1|    _2|
// +------+------+
// |  Java| 20000|
// |Python|100000|
// | Scala|  3000|
// +------+------+
The columns are named _1 and _2 by default. To give them real names, use .toDF("language", "offers"):
scala> val df = data.toDF("language", "offers")
scala> df.show()
// +--------+------+
// |language|offers|
// +--------+------+
// |    Java| 20000|
// |  Python|100000|
// |   Scala|  3000|
// +--------+------+
0.3 – Special spark-shell commands
Special commands start with : (colon). These commands are not Scala; they are specific to the REPL. Learn them: they will save you a lot of time!
// Display the full help (list of all special commands)
scala> :help
| Command | Description | Example |
|---|---|---|
| :help | Displays the list of all available commands | :help or :he |
| :paste | “Paste” mode: allows you to paste multiple lines of code at once. End with Ctrl+D. | :paste |
| :load <file> | Loads and executes a .scala file line by line | :load hello.scala |
| :load -v <file> | Loads a file in verbose mode (shows each executed line) | :load -v hello.scala |
| :quit | Exits the spark-shell cleanly | :quit or :q |
| :history | Shows the history of typed commands | :history or :history 20 |
| :h? <word> | Searches for a word in the history | :h? toDF |
| :require <jar> | Adds a JAR file to the classpath during the session | :require /path/to/my.jar |
| :type <expr> | Displays the type of an expression without executing it | :type 1 + 2 → Int |
| :imports | Displays all active imports in the session | :imports |
| :implicits | Displays the available implicits | :implicits -v |
| :reset | Resets the REPL (clears all variables) | :reset |
| :replay | Resets and replays all previous commands | :replay |
| :save <file> | Saves the session to a .scala file | :save my_session.scala |
| :sh <cmd> | Executes a shell command (Unix/macOS only) | :sh ls -la |
| :silent | Enables/disables automatic result display | :silent |
| :javap <class> | Disassembles a Java / Scala class | :javap scala.Int |
Why use :paste? In the spark-shell, if you paste multi-line code directly, the REPL tries to execute each line separately, which can cause errors. :paste mode lets you paste an entire block of code and execute it as a whole.
// Step 1: type :paste and press Enter
scala> :paste
// Entering paste mode (ctrl-D to finish)
// Step 2: paste your multi-line code
val names = Seq("Alice", "Bob", "Charlie")
val rdd = sc.parallelize(names)
val upper = rdd.map(_.toUpperCase)
upper.collect()
// Step 3: press Ctrl+D to execute
// Exiting paste mode, now interpreting.
// names: Seq[String] = List(Alice, Bob, Charlie)
// rdd: org.apache.spark.rdd.RDD[String] = ...
// upper: org.apache.spark.rdd.RDD[String] = ...
// res0: Array[String] = Array(ALICE, BOB, CHARLIE)
To exit :paste mode and run the code, press Ctrl+D. Pressing Ctrl+C cancels the pasted code.
Suppose you have a .scala file containing functions or processing you want to run in the spark-shell. Instead of retyping everything, use :load.
Step 1 – Create the Scala file:
# Windows (PowerShell):
@"
println("Hello from the Scala file!")
val animals = Seq("cat", "dog", "bird")
val rdd = sc.parallelize(animals)
println("Number of animals: " + rdd.count())
rdd.collect().foreach(println)
"@ | Out-File -Encoding utf8 "C:\Users\user01\Desktop\SPARK\hello.scala"
# macOS / Linux:
cat > ~/Desktop/SPARK/hello.scala <<'EOF'
println("Hello from the Scala file!")
val animals = Seq("cat", "dog", "bird")
val rdd = sc.parallelize(animals)
println("Number of animals: " + rdd.count())
rdd.collect().foreach(println)
EOF
Spark SQL Practice: AAPL & Income
This lab works with Apple stock data (AAPL.csv) and income data (Income.csv), using case class, RDD, DataFrame and Spark SQL aggregation functions.
0. Data Presentation & Download
0.1 – AAPL.csv file (Apple stock data)
AAPL.csv contains the historical prices of Apple stock (NASDAQ: AAPL). Each line represents a trading day.
| Column | Type | Description | Example |
|---|---|---|---|
| dt | String | Transaction date | 1984-09-07 |
| openprice | Double | Opening price of the day | 25.50 |
| highprice | Double | Highest price of the day | 26.10 |
| lowprice | Double | Lowest price of the day | 24.80 |
| closeprice | Double | Closing price of the day | 25.90 |
| volume | Double | Number of shares traded | 1234567.0 |
| adjcloseprice | Double | Adjusted closing price (dividends, splits) | 25.85 |
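To make the schema concrete, here is a minimal Python sketch that maps one CSV record to a typed object, in the spirit of a Scala case class. The class name Stock and the helper parse_stock are hypothetical, the sample record is built from the Example column of the table above, and the sketch assumes the fields appear in the same order as the table. A real file may start with a header row that has to be skipped.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Stock:
    """One trading day, with the same fields as the schema table."""
    dt: str
    openprice: float
    highprice: float
    lowprice: float
    closeprice: float
    volume: float
    adjcloseprice: float


def parse_stock(fields: list[str]) -> Stock:
    """Convert one CSV record (a list of strings) into a typed Stock."""
    dt, *numbers = fields
    return Stock(dt, *(float(n) for n in numbers))


# Illustrative record built from the "Example" column (assumed column order).
sample = "1984-09-07,25.50,26.10,24.80,25.90,1234567.0,25.85\n"
stocks = [parse_stock(row) for row in csv.reader(io.StringIO(sample))]
print(stocks[0])
```

Keeping dt as a string mirrors the String type in the table; converting it to a date is a separate step.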
Download AAPL.csv: 3 methods
Method 1 – wget (recommended, Windows/macOS/Linux)
Download the raw file directly from GitHub:
wget https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/AAPL.csv
On Windows (PowerShell), use Invoke-WebRequest if wget is unavailable:
# PowerShell: download AAPL.csv into the current folder
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/AAPL.csv" -OutFile "AAPL.csv"
Method 2 – git clone (clone the entire repository)
Clone the full repository and retrieve all data files:
# Clone the inskillflow/data repository
git clone https://github.com/inskillflow/data.git
# The AAPL.csv file will be in:
# data/AAPL.csv
Method 3 – Manual download from the browser
- Open https://github.com/inskillflow/data/blob/main/AAPL.csv
- Click the “Raw” button at the top right of the file
- Press Ctrl+S (or Cmd+S on Mac) to save the page as AAPL.csv
Direct link to the raw file: https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/AAPL.csv
Place AAPL.csv in an accessible folder, then adjust the path in your code:
- Windows: C:\Users\YourName\Desktop\Spark\AAPL.csv
- Linux / macOS: /home/user/data/AAPL.csv
- Spark Shell (relative path): ./AAPL.csv
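If neither wget nor git is available, the same download can be scripted with the Python standard library. This is a sketch under assumptions: the helper names raw_url and download are hypothetical, and the base URL is the raw GitHub address given above. The download call is left commented out because it needs network access.

```python
import urllib.request
from pathlib import Path

# Raw GitHub location of the data files, as given in this section.
BASE_URL = "https://raw.githubusercontent.com/inskillflow/data/refs/heads/main"


def raw_url(filename: str) -> str:
    """Build the raw GitHub URL for a data file in the repository."""
    return f"{BASE_URL}/{filename}"


def download(filename: str, dest_dir: str = ".") -> Path:
    """Download a data file into dest_dir and return its local path."""
    target = Path(dest_dir) / filename
    urllib.request.urlretrieve(raw_url(filename), str(target))
    return target


print(raw_url("AAPL.csv"))
# To fetch the file (requires network access):
# path = download("AAPL.csv")
```

The returned path is the one to use when you adjust the file location in your Spark code.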
0.2 – Income.csv file (income data)
| Column | Type | Description |
|---|---|---|
| id | Double | Unique identifier |
| workclass | String | Employment category (Private, Self-emp, Gov...) |
| education | String | Education level |
| maritalstatus | String | Marital status |
| occupation | String | Occupation |
| relationship | String | Family relationship |
| race | String | Ethnic origin |
| gender | String | Gender |
| nativecountry | String | Country of origin |
| income | String | Income bracket (<=50K or >50K) |
| age | Double | Age of the individual |
| fnlwgt | Double | Statistical weighting factor |
| educationalnum | Double | Number of years of education |
| capitalgain | Double | Capital gains |
| capitalloss | Double | Capital losses |
| hoursperweek | Double | Hours worked per week |
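The split between Double and String columns in this schema can be expressed as a small conversion step. The sketch below is plain Python and makes several assumptions: the helper name to_typed is hypothetical, the record is invented for illustration and is not taken from Income.csv, and stripping whitespace from text fields is a precaution rather than a documented property of the file.

```python
# Columns typed as Double in the schema table.
NUMERIC_COLUMNS = {
    "id", "age", "fnlwgt", "educationalnum",
    "capitalgain", "capitalloss", "hoursperweek",
}


def to_typed(row: dict[str, str]) -> dict[str, object]:
    """Convert numeric columns to float and strip whitespace from text columns."""
    return {
        name: float(value) if name in NUMERIC_COLUMNS else value.strip()
        for name, value in row.items()
    }


# Hypothetical record for illustration only; it is not taken from Income.csv.
raw = {"id": "1", "workclass": " Private", "income": "<=50K",
       "age": "39", "hoursperweek": "40"}
typed = to_typed(raw)
print(typed)
```

Once the numeric columns are real numbers, aggregations such as an average of hoursperweek per income bracket become straightforward.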
Download Income.csv: 3 methods
Method 1 – wget
wget https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/Income.csv
On Windows (PowerShell), if wget is unavailable:
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/inskillflow/data/refs/heads/main/Income.csv" -OutFile "Income.csv"
Method 2 – git clone (clone both files at once)
# If you have already cloned the repository for AAPL.csv, Income.csv is already present
# Otherwise:
git clone https://github.com/inskillflow/data.git
# The Income.csv file will be in: data/Income.csv
Method 3 – Manual download from the browser: follow the same steps as for AAPL.csv.
Pattern Matching and Options
In this chapter you will learn pattern matching, discover the Option[T] type, which helps you avoid NullPointerException, and go further with Either and for-comprehensions.
Pattern matching resembles a switch/case in Java or a match in Python 3.10+, but is much more powerful. If you already know if/elif/else in Python, you will understand it without difficulty. Scala adds the ability to check types, extract data from objects and combine type + value + condition in a single case.
1. The postal sorting analogy
Imagine a postal sorting center. Each letter arrives on a conveyor belt. An employee looks at the address and sends the letter to the correct box: Paris to the left, Lyon to the right, Marseille straight ahead. If the address matches nothing known, the letter goes into an “Other” box.
Pattern matching works exactly like this postal sorting: we examine a value, compare it to several patterns, and execute the code associated with the first matching pattern.
Why not just use if/else? Pattern matching is more readable, safer (with a sealed trait, the compiler checks that all cases are covered) and far more powerful, because it can deconstruct complex objects, extract values, and combine type + value + condition in a single case.
2. The match/case syntax
The match expression takes a value and compares it to a series of case clauses. The first matching clause is executed. Unlike Java’s switch, there is no break to write — Scala automatically stops at the first matching case.
match is an expression, not a statement. This means it always returns a value. You can therefore write val x = value match { ... }.
Example 1: days of the week
val day = "Wednesday"
val dayType = day match {
  case "Monday" | "Tuesday" | "Wednesday" | "Thursday" | "Friday" =>
    "Weekday"
  case "Saturday" | "Sunday" =>
    "Weekend"
  case _ =>
    "Unknown day" // _ = the “Other” box of postal sorting
}
println(dayType) // "Weekday"
The wildcard _: the symbol _ (underscore) matches “everything else”. It must be placed last, because it catches everything and any case written after it would be unreachable. Without it, if no case matches, Scala throws a MatchError at runtime.
Example 2: convert a grade into a mention
val grade = 15
val mention = grade match {
  case 20 => "Perfect"
  case 16 | 17 | 18 | 19 => "Very good"
  case 14 | 15 => "Good"
  case 12 | 13 => "Fairly good"
  case 10 | 11 => "Pass"
  case _ => "Fail"
}
println(mention) // "Good"
In Java (before Java 14) or Python, a switch or match is a statement: it does something but does not directly return a value. In Scala, match is an expression: it always returns a value.
What this changes in practice:
// Scala: match is an expression, it can be assigned directly
val age = 30 // example value so that the snippets below run as-is
val category = age match {
  case a if a < 18 => "Minor"
  case a if a < 65 => "Adult"
  case _ => "Senior"
}
// It can also be used as a function argument
println(age match {
  case a if a < 18 => "Minor"
  case _ => "Adult"
})
// Or inside a string interpolation
val message = s"Status: ${age match {
  case a if a < 18 => "Minor"
  case _ => "Adult"
}}"
// Java: the switch (before Java 14) does not return a value
// You had to write:
String category;
switch (age) {
  case ...: category = "Minor"; break;
  default: category = "Adult";
}
// Java 14+ added switch expressions to fill this gap
Because match returns a value, the compiler infers a single result type from all branches. If one case returns a String and another an Int, Scala infers their common supertype (Any). This is often a sign of a design error.
# Python 3.10+ structural pattern matching
day = "Wednesday"
match day:
    case "Monday" | "Tuesday" | "Wednesday" | "Thursday" | "Friday":
        day_type = "Weekday"
    case "Saturday" | "Sunday":
        day_type = "Weekend"
    case _:
        day_type = "Unknown day"
# Difference: in Python, match is a statement (no direct assignment)
# In Scala: val x = value match { ... } <-- expression that returns a value
# In Python: you must assign in each branch manually
# Python < 3.10: no match/case, we use if/elif/else
day = "Wednesday"
if day in ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"):
    day_type = "Weekday"
elif day in ("Saturday", "Sunday"):
    day_type = "Weekend"
else:
    day_type = "Unknown day"
This article covers the most useful excerpts. The complete Scala PySpark Big Data course (13 chapters, 47 lessons, corrected exercises and final project) takes you all the way.