Recent posts

Cache vs Persist in Spark

2 minute read

What are Cache and Persist in Spark? Cache. Definition: The cache() method stores the RDD or DataFrame in memory. By default, an RDD uses MEMORY_ONLY, while a DataFrame uses the MEMORY_AND_DISK st...

Managed vs external table in Spark

3 minute read

What Are Managed Tables and External Tables in Spark? Managed Tables. Definition: In a managed table, Spark manages both the metadata and the data itself. ...

Repartition vs Coalesce in Spark

2 minute read

What are Repartition and Coalesce in Spark? Repartition: repartition() increases or decreases the number of partitions in an RDD or DataFrame.

Caching an RDD in Spark

2 minute read

What is Caching an RDD in Spark? Definition: Caching an RDD in Spark means storing it in memory so that subsequent actions on the same RDD can reuse the data ...