Difference between repartition and coalesce in Databricks

Spark is known for distributed, in-memory computation. We can partition large files or tables so that a query doesn't need to scan all of the data.

For example, suppose we have a table with data for multiple countries and need to fetch the data for one specific country only. We can partition the table on the country column, and Spark will split the table into separate data files for each country.
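To make the country example concrete, here is a minimal sketch in plain Python (not Spark itself) of what partitioning on a column does. The sample rows, the column names and the `partition_by` helper are all made up for illustration. In PySpark the corresponding write is `df.write.partitionBy("country")`, which produces one `country=<value>` directory per country.

```python
from collections import defaultdict

# Hypothetical sample rows; the column names are assumptions for this sketch.
rows = [
    {"country": "IN", "city": "Pune"},
    {"country": "US", "city": "Austin"},
    {"country": "IN", "city": "Delhi"},
    {"country": "DE", "city": "Berlin"},
]

def partition_by(rows, column):
    """Group rows into one bucket per distinct value of `column`."""
    buckets = defaultdict(list)
    for row in rows:
        # Mimic the "column=value" directory naming used for partitioned tables.
        buckets[f"{column}={row[column]}"].append(row)
    return dict(buckets)

partitions = partition_by(rows, "country")

# A query for one country only has to read that country's bucket.
india_rows = partitions["country=IN"]
print(sorted(partitions))
print(len(india_rows))
```

The point of the layout is that a filter on country can skip every other bucket entirely instead of scanning the whole table.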

Sometimes we need to adjust the number of partitions in our data. Spark gives us two ways to do this, repartition and coalesce, and they differ as follows:

Repartition :

1. Repartition can either increase or decrease the number of partitions.

2. Repartition performs a full shuffle, moving data between partitions across the cluster, and this data movement is expensive.

3. Because of the shuffle, it is usually slower than coalesce.

4. It redistributes the data from all existing partitions and creates the new partitions with roughly equal sizes.
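The behaviour described in the list above can be sketched with a toy model in plain Python. This is an illustration under simplifying assumptions, not Spark's implementation: `toy_repartition` is a hypothetical helper, partitions are plain lists, and rows are dealt out round-robin. In PySpark the real call is `df.repartition(n)`.

```python
def toy_repartition(partitions, n):
    """Redistribute every row round-robin into n new partitions."""
    new_partitions = [[] for _ in range(n)]
    # Every row leaves its old partition: this is the "shuffle" in the toy model.
    all_rows = [row for part in partitions for row in part]
    for i, row in enumerate(all_rows):
        new_partitions[i % n].append(row)
    return new_partitions

# Three skewed partitions holding 9, 1 and 2 rows.
skewed = [list(range(0, 9)), [9], [10, 11]]

balanced = toy_repartition(skewed, 4)   # increase from 3 to 4 partitions
fewer = toy_repartition(skewed, 2)      # decrease from 3 to 2 partitions

print([len(p) for p in balanced])  # [3, 3, 3, 3]
print([len(p) for p in fewer])     # [6, 6]
```

Every row is touched and potentially moved, which is why the result is evenly sized and also why the operation is costly.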

Coalesce : 

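1. Coalesce can only decrease the number of partitions. If we ask for more partitions than currently exist, the count stays the same.

2. It avoids a full shuffle by merging existing partitions instead of redistributing every row.

3. It is usually faster than repartition because far less data moves.

4. Because it merges existing partitions without rebalancing them, the new partitions can be uneven in size.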
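The coalesce behaviour in the list above can also be sketched as a toy model in plain Python. Again this is only an illustration, not how Spark implements it: `toy_coalesce` is a hypothetical helper, and the rule for choosing which partitions get merged is an assumption made for this sketch. In PySpark the real call is `df.coalesce(n)`.

```python
def toy_coalesce(partitions, n):
    """Merge whole partitions into at most n groups without splitting any of them."""
    if n >= len(partitions):
        # Asking for more partitions than exist leaves the count unchanged.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Each old partition is appended as a whole to one target partition.
        merged[i * n // len(partitions)].extend(part)
    return merged

# Four skewed partitions holding 9, 1, 2 and 1 rows.
parts = [list(range(0, 9)), [9], [10, 11], [12]]

two = toy_coalesce(parts, 2)
eight = toy_coalesce(parts, 8)

print([len(p) for p in two])    # [10, 3]
print([len(p) for p in eight])  # [9, 1, 2, 1]
```

The skew in the input carries over to the output, since no partition is ever split up to rebalance the sizes.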
