How to control number of files per partition in Spark | Pyspark | Scala

Reduce number of output files in Spark

Parmanand
Aug 30, 2022

We often deal with huge amounts of data that we need to partition by some column, while at the same time keeping the number of files in each partition limited.

In this article, I will demonstrate how to manage partitions in Spark to avoid many small files in each partition.

Let’s get started !

I am going to work with the House Rent dataset, which can be downloaded from the Kaggle website.

Read CSV dataset

import org.apache.spark.sql.{DataFrame, SparkSession}

// Build a local SparkSession
val sparkSession: SparkSession = SparkSession.builder()
  .master("local")
  .appName("House rent data")
  .getOrCreate()

// Read the CSV file, treating the first row as the header
val rentDataset: DataFrame = sparkSession.read
  .option("header", "true")
  .csv("/Users/pamkin/rent.csv")
rentDataset.show()
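Before repartitioning, it helps to see how many partitions the DataFrame currently has, because each Spark partition writes at most one file into every output directory. Here is a minimal, self-contained sketch using a toy stand-in DataFrame (the city names and rents are hypothetical, not the Kaggle data):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("partition-count")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for the rent data (hypothetical values)
val df = Seq(
  ("Mumbai", 15000), ("Delhi", 12000),
  ("Mumbai", 20000), ("Chennai", 9000)
).toDF("City", "Rent")

// Number of partitions before and after an explicit repartition;
// each partition becomes at most one file per output directory.
val before = df.rdd.getNumPartitions
val after  = df.repartition(3).rdd.getNumPartitions
println(s"partitions before: $before, after: $after")
```

The `after` count is exactly 3, which is why `repartition(3)` in the next step caps the number of files written per partition directory.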

Partitioning and saving dataset on local machine

// Repartition into 3 Spark partitions, then write one directory per City value
rentDataset.repartition(3)
  .write
  .partitionBy("City")
  .csv("/Users/pamkin/result/")

Here, you can notice that each City directory contains at most 3 CSV files: `repartition(3)` splits the data into 3 Spark partitions, and each partition writes its own file into every City directory it holds rows for. The repartition count therefore caps the number of output files per partition.
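If you want exactly one file per City directory rather than up to three, a common variant is to repartition by the partition column itself, so that all rows for a given city land in the same Spark partition. Below is a self-contained sketch with the same toy stand-in data and a temporary output directory (both are assumptions for illustration, not the article's actual dataset or path):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("one-file-per-partition")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for the rent data (hypothetical values)
val df = Seq(
  ("Mumbai", 15000), ("Delhi", 12000),
  ("Mumbai", 20000), ("Chennai", 9000)
).toDF("City", "Rent")

// Repartition by the partition column: all rows for a city hash into
// one Spark partition, so each City directory gets exactly one CSV file.
val out = Files.createTempDirectory("rent_result").toString
df.repartition(col("City"))
  .write
  .mode("overwrite")
  .partitionBy("City")
  .csv(out)
```

If a single file per city becomes too large, the writer option `option("maxRecordsPerFile", n)` (available since Spark 2.2) can be added to cap the number of rows per file.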

Thanks for reading!

Please do share the article, if you liked it. Any comments or suggestions are welcome.
