Dynamic Partition (partitionOverwriteMode) in Spark | Scala | PySpark
In Spark, after processing a large amount of data, we usually partition it by key before saving so that later reads can prune unneeded partitions. Spark 2.3.0 introduced the spark.sql.sources.partitionOverwriteMode setting, which provides two options:
1. Dynamic mode: In dynamic mode, Spark does not delete existing partitions up front; it only overwrites the partitions for which new data is actually written at runtime.
import org.apache.spark.sql.{DataFrame, SparkSession}

def main(args: Array[String]): Unit = {
  // Enable dynamic partition overwrite for this session
  val sparksession: SparkSession = SparkSession.builder()
    .master("local").appName("House rent data")
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()

  // Read the input data and write it back partitioned by City
  val rentDataset: DataFrame = sparksession.read.option("header", "true")
    .csv("/Users/pamkin/Extra/train.csv")

  rentDataset.write.mode("overwrite").partitionBy("City")
    .csv("/Users/pamkin/Extra/result")
}
Output:
In the above output you can see that the result directory contains one partition per city. Now, if the next batch contains data only for the city "Kolkata", only that partition will be overwritten.
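To make this concrete, here is a minimal sketch of that second run, assuming a hypothetical input file train_kolkata.csv that contains rows only for City = "Kolkata" (it continues the example above, so sparksession is the same session configured with dynamic mode):

// Hypothetical second batch containing only Kolkata rows
val kolkataOnly: DataFrame = sparksession.read.option("header", "true")
  .csv("/Users/pamkin/Extra/train_kolkata.csv")

// With partitionOverwriteMode = dynamic, this overwrite replaces only the
// City=Kolkata directory under the result path; the partitions for every
// other city written by the previous run are left untouched.
kolkataOnly.write.mode("overwrite").partitionBy("City")
  .csv("/Users/pamkin/Extra/result")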
2. Static mode: In static mode, Spark deletes all partitions that match the partition specification (e.g. PARTITION(a=1, b)) before overwriting them.
.config("spark.sql.sources.partitionOverwriteMode", "static")
Now, if the next batch contains data only for the city "Kolkata", all other partitions under the output path will be deleted and only the Kolkata partition will remain, as sketched below.
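For comparison, here is the same second run sketched under static mode, reusing the hypothetical kolkataOnly DataFrame from the previous sketch and switching the mode at runtime on the existing session:

// Switch the existing session to static mode (the default)
sparksession.conf.set("spark.sql.sources.partitionOverwriteMode", "static")

// In static mode this overwrite first clears all existing partitions under
// the output path, so after the write only City=Kolkata remains.
kolkataOnly.write.mode("overwrite").partitionBy("City")
  .csv("/Users/pamkin/Extra/result")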
Note: By default, Spark uses static mode to keep the same behavior as versions prior to 2.3.
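If you want to confirm which mode a session is currently using, the value can be read back from the runtime config (a small sketch using the sparksession value created above):

// Prints the mode in effect for this session (static by default,
// unless it was overridden via config() or conf.set)
println(sparksession.conf.get("spark.sql.sources.partitionOverwriteMode"))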
Thanks for reading. Please follow me for more articles like this.