Jobs, Stages and Tasks in Spark
What is a Job?
- A Job is a complete Spark operation triggered by an action like collect(), count(), or save().
- Every action creates a new Job.
Example: If you say: “Count all even numbers in a dataset.”
- Spark creates 1 Job to do that.
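A minimal sketch of that one-action, one-Job pipeline (the app name is just a placeholder):
from pyspark.sql import SparkSession
# Start Spark
spark = SparkSession.builder.appName("OneActionOneJob").getOrCreate()
numbers = spark.sparkContext.parallelize(range(10))
# filter() is a lazy transformation; count() is the action that triggers exactly 1 Job
even_count = numbers.filter(lambda x: x % 2 == 0).count()
print(even_count)  # 5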
What is a Stage?
- A Stage is a group of operations that Spark can do without moving data between machines (nodes).
- If Spark needs to shuffle data across nodes (due to wide transformations like reduceByKey()), it creates a new Stage.
Example: If you say: “Double all numbers and then add numbers with the same key.”
- Stage 1: Double the numbers (done independently).
- Stage 2: Add numbers with the same key (needs data from multiple nodes).
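In code, that two-stage pipeline could look roughly like this (reusing the spark session from the sketch above; the sample data is made up):
pairs = spark.sparkContext.parallelize([("a", 1), ("a", 2), ("b", 3)], 2)
doubled = pairs.mapValues(lambda v: v * 2)        # Stage 1: narrow, runs independently per partition
summed = doubled.reduceByKey(lambda a, b: a + b)  # Stage 2: wide, shuffles rows with the same key together
print(summed.collect())  # [('a', 6), ('b', 6)] (order may vary)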
What is a Task?
- A Task is a single piece of work that runs on one part of the data (partition).
- Each Stage runs one Task per partition.
Example: If you have a dataset split into 4 partitions, each stage will run 4 Tasks (one task per partition).
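You can inspect the partition layout (and therefore the task count per stage) directly, for example:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8], 4)
print(rdd.getNumPartitions())  # 4 -> each stage over this RDD runs 4 Tasks
print(rdd.glom().collect())    # [[1, 2], [3, 4], [5, 6], [7, 8]] - one list per partition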
Layman Example
- Job: Deliver pizzas to all customers (the entire operation).
- Stages:
  - Stage 1: Make pizzas in the kitchen.
  - Stage 2: Deliver pizzas to customers.
- Tasks:
  - For each delivery driver (partition), assign a delivery task.
Code Example:
from pyspark.sql import SparkSession
# Start Spark
spark = SparkSession.builder.appName("SimpleExample").getOrCreate()
# Create RDD with 4 partitions
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8], 4)
# Transformation (Narrow - stays within each partition)
mapped_rdd = rdd.map(lambda x: x * 2)
# Action (triggers the Job; reduce() aggregates within each partition, then merges the partial results at the driver - no shuffle)
result = mapped_rdd.reduce(lambda x, y: x + y)
print(f"Final Result: {result}")
How Spark Works:
- Job: The action reduce() triggers 1 Job.
- Stages: map() is a narrow transformation and reduce() needs no shuffle, so the whole Job runs as a single Stage.
- Tasks: Since there are 4 partitions, Spark runs 4 Tasks in that stage.
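If you want to verify these counts programmatically instead of in the Spark UI, one option is the status tracker. A rough sketch continuing from the code above (the job group name "demo" is arbitrary):
sc = spark.sparkContext
sc.setJobGroup("demo", "map + reduce demo")
rdd.map(lambda x: x * 2).reduce(lambda x, y: x + y)
tracker = sc.statusTracker()
job_ids = tracker.getJobIdsForGroup("demo")
print(len(job_ids))                                  # 1 -> one action, one Job
print(len(tracker.getJobInfo(job_ids[0]).stageIds))  # 1 -> no shuffle, so a single Stage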
Number of Jobs = Number of Actions
A Job in Spark is triggered by an Action like collect(), count(), or save(). Each action creates a DAG (Directed Acyclic Graph), which Spark breaks down into stages and tasks.
Example:
from pyspark.sql import SparkSession
# Initialize Spark session
spark = SparkSession.builder.appName("JobsStagesTasks").getOrCreate()
# Create an RDD
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6], 3)
# Action 1 (Job 1)
result1 = rdd.filter(lambda x: x % 2 == 0).collect()
# Action 2 (Job 2)
result2 = rdd.count()
Explanation:
- Actions trigger Jobs:
  - collect() → Job 1
  - count() → Job 2
- Total Jobs: 2 (one per action)
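To make the two Jobs easy to spot in the Spark UI's Jobs tab, you can label each one before its action fires; a small sketch continuing from the code above:
sc = spark.sparkContext
sc.setJobDescription("Job 1: filter even numbers + collect")
result1 = rdd.filter(lambda x: x % 2 == 0).collect()
sc.setJobDescription("Job 2: count all elements")
result2 = rdd.count()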
Number of Stages = Number of Wide Transformations + 1
A Stage is a group of tasks that can be executed without requiring data from other partitions. Wide transformations like reduceByKey(), groupByKey(), and join() cause data shuffling, splitting the DAG into multiple stages.
Example:
# Create an RDD of key-value pairs
rdd2 = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], 2)
# Transformation 1: map (Narrow Transformation)
mapped_rdd = rdd2.mapValues(lambda x: x * 2)
# Transformation 2: reduceByKey (Wide Transformation)
reduced_rdd = mapped_rdd.reduceByKey(lambda x, y: x + y)
# Action: collect()
result = reduced_rdd.collect()
Explanation:
- mapValues() → Narrow Transformation (stays in the same stage)
- reduceByKey() → Wide Transformation (new stage due to shuffling)
Total Stages: 2 (1 for mapValues() + 1 for reduceByKey())
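One way to see that stage boundary without opening the Spark UI is the RDD lineage; a shuffle shows up as an indented branch in the debug string (note that toDebugString() returns bytes in PySpark):
lineage = reduced_rdd.toDebugString().decode()
print(lineage)  # the indented "+-" branch marks the shuffle boundary between Stage 1 and Stage 2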
Number of Tasks = Number of Partitions per Stage
A Task is the smallest unit of work in Spark, assigned to each partition of a stage.
How It Works:
- Each stage runs one task per partition.
- If you have 4 partitions, each stage will have 4 tasks.
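Since tasks follow partitions, changing the partitioning changes the task count. A quick sketch (the numbers are arbitrary):
rdd4 = spark.sparkContext.parallelize(range(100), 4)
print(rdd4.getNumPartitions())   # 4 -> stages over this RDD run 4 Tasks
wider = rdd4.repartition(8)      # repartition() itself shuffles, adding a stage
print(wider.getNumPartitions())  # 8 -> stages over this RDD run 8 Tasks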
Full Example: Jobs, Stages, and Tasks
# Create RDD with 4 partitions
rdd3 = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8], 4)
# Transformation 1: map (Narrow)
mapped_rdd = rdd3.map(lambda x: x * 2)
# Transformation 2: map to (key, value) pairs (Narrow; note that doubling makes every value even, so all pairs share key 0)
paired_rdd = mapped_rdd.map(lambda x: (x % 2, x))
# Transformation 3: reduceByKey (Wide - forces a shuffle)
result_rdd = paired_rdd.reduceByKey(lambda x, y: x + y)
# Action: collect() triggers the Job
final_result = result_rdd.collect()
print(final_result)  # [(0, 72)]
Breakdown:
- Jobs:
  - Action: collect() → 1 Job
- Stages:
  - Stage 1: both map() calls (Narrow, pipelined together)
  - Stage 2: reduceByKey() (Wide, shuffling needed)
  - Total Stages: 2
- Tasks:
  - Stage 1: 4 Tasks (since there are 4 partitions)
  - Stage 2: 4 Tasks (again, due to 4 partitions after the shuffle)
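The post-shuffle task count is tunable as well: reduceByKey() accepts an optional numPartitions argument. Reusing paired_rdd from the example above:
# With numPartitions=2, Stage 2 would run 2 Tasks instead of 4
result_rdd2 = paired_rdd.reduceByKey(lambda x, y: x + y, 2)
print(result_rdd2.getNumPartitions())  # 2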
Key Takeaways
- Number of Jobs = Number of Actions ✅
- Number of Stages = Number of Wide Transformations + 1 ✅
- Number of Tasks = Number of Partitions per Stage ✅