Recent posts

Repartition vs Coalesce in Spark

2 minute read

What are Repartition and Coalesce in Spark? Repartition: repartition() increases or decreases the number of partitions in an RDD or DataFrame.

Caching an RDD in Spark

2 minute read

What is Caching an RDD in Spark? Definition: Caching an RDD in Spark means storing it in memory so that subsequent actions on the same RDD can reuse the data ...

Broadcast Join in Spark

3 minute read

What is a Broadcast Join in Spark? A Broadcast Join in Spark is an optimized join strategy where one of the datasets is broadcast (shared) to all the nodes...

reduceByKey() vs groupByKey() in Spark

3 minute read

What are reduceByKey() and groupByKey() in Spark? reduceByKey(). Definition: Combines values of the same key using a specified reduce function (like sum, m...

Transformations and Actions in Spark

1 minute read

In Apache Spark, two key operations work together to process data: Transformations and Actions. Understanding these concepts helps us efficiently work with l...