How to limit the size of an output file in Spark (maxRecordsPerFile)

Sep 8, 2022

While working with data, we may come across a situation where we want to restrict the number of records per file. This is useful, for example, when files are submitted to an API that cannot accept a file with more than N records.

In this article, I will demonstrate how to control the size of an output file in Spark.

I am going to work with the House Rent Dataset, which can be downloaded from the Kaggle website.

Read CSV dataset

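import org.apache.spark.sql.SparkSession
val sparksession: SparkSession = SparkSession.builder()
  .master("local").appName("House rent data")
  .getOrCreate() // build the session, or reuse one that already exists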
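
The step that loads the CSV into rentDataset is not shown in the snippet above, so here is a minimal sketch of what it could look like. The file path is hypothetical, and the header and inferSchema options are assumptions about how the Kaggle file should be parsed; only the names sparksession and rentDataset come from this article.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadRentDataset {
  def main(args: Array[String]): Unit = {
    val sparksession: SparkSession = SparkSession.builder()
      .master("local").appName("House rent data")
      .getOrCreate()

    // Hypothetical location of the CSV downloaded from Kaggle; adjust to your machine.
    val inputPath = "path/to/House_Rent_Dataset.csv"

    val rentDataset: DataFrame = sparksession.read
      .option("header", "true")      // assumption: the first line holds column names such as City
      .option("inferSchema", "true") // assumption: let Spark guess column types
      .csv(inputPath)

    // Quick sanity check that the City column used for partitioning is present.
    rentDataset.printSchema()

    // The write step from the next section would go here, before the session is stopped.
    sparksession.stop()
  }
}
```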

Partitioning and saving dataset on local machine

rentDataset.write.option("maxRecordsPerFile", 500).partitionBy("City").csv("/Users/pamkin/Extra/projects/scalaSpark1/src/main/scala/result")

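Here, I have used the maxRecordsPerFile option to set the maximum number of records per file, while partitionBy("City") writes the records of each city into its own subdirectory.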
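
To make the effect of the limit concrete, the following plain Scala sketch (no Spark required) computes the file sizes you would expect when a single write task writes a given number of rows into one City directory. The helper fileSizes is a hypothetical name used only for illustration and is not part of Spark; it also assumes a positive limit, whereas in Spark a value of zero or less means no limit.

```scala
object FileSplitSketch {
  // Hypothetical helper: sizes of the files produced when `rows` records
  // are written by one task with the given per-file limit.
  def fileSizes(rows: Long, maxRecordsPerFile: Long): Seq[Long] = {
    require(rows >= 0 && maxRecordsPerFile > 0, "rows must be >= 0 and the limit > 0")
    val fullFiles = (rows / maxRecordsPerFile).toInt // files filled up to the limit
    val remainder = rows % maxRecordsPerFile         // records left for the last file
    val last = if (remainder > 0) Seq(remainder) else Seq.empty[Long]
    Seq.fill(fullFiles)(maxRecordsPerFile) ++ last
  }

  def main(args: Array[String]): Unit = {
    // 1,200 rows with a limit of 500 give two full files and one with the rest.
    println(fileSizes(1200, 500).mkString(", ")) // prints: 500, 500, 200
  }
}
```

If the DataFrame has several partitions, each write task produces its own files, so a directory may contain more and smaller files than this calculation suggests. The limit is an upper bound per file, not a target size.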

You can see that each file contains at most 500 records; once a file reaches the limit, Spark creates a new file for the remaining records.
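
One way to check this without opening the files by hand is to read the output back and count the rows per physical file. This is only a sketch: the output path is hypothetical, and it assumes the files were written without a header line (as in the write call above) and that no field contains embedded line breaks.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, input_file_name, max}

object VerifyRecordsPerFile {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local").appName("Verify maxRecordsPerFile")
      .getOrCreate()

    // Hypothetical path: use the directory that was passed to .csv(...) when writing.
    val outputDir = "path/to/result"

    // Tag every row with the file it was read from, then count rows per file.
    val perFile = spark.read.csv(outputDir)
      .groupBy(input_file_name().as("file"))
      .count()

    // Largest files first, without truncating the long file paths.
    perFile.orderBy(col("count").desc).show(false)

    // The biggest count should not exceed the limit used when writing (500 here).
    val largest = perFile.agg(max("count")).first().getLong(0)
    println(s"Largest file holds $largest records")

    spark.stop()
  }
}
```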

Note: This option is only available in Spark 2.2.0 and above.
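
In those versions the same limit can also be set once for the whole session through the spark.sql.files.maxRecordsPerFile configuration, instead of passing the option on every write. The sketch below assumes a local session like the one used in this article; the object name is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SessionWideLimit {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local").appName("House rent data")
      // Session-wide default; the Spark default of 0 means "no limit".
      .config("spark.sql.files.maxRecordsPerFile", 500L)
      .getOrCreate()

    // Prints the value currently in effect for this session.
    println(spark.conf.get("spark.sql.files.maxRecordsPerFile"))

    // File writes in this session now fall back to this limit unless a
    // write sets its own .option("maxRecordsPerFile", ...).
    spark.stop()
  }
}
```

The per-write option is the better fit when only one output has a record limit; the session setting is convenient when every output of a job must respect the same cap.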
