How to control number of records per file in Spark | Scala | Pyspark

How to limit the number of records per output file in Spark (maxRecordsPerFile)

Parmanand
2 min read · Sep 8, 2022

While working with data, we may come across a situation where we want to restrict the number of records per file. This is useful, for example, when you need to submit files to an API that cannot accept more than N records per file.

In this article, I will demonstrate how to control the number of records per output file in Spark.

Let’s get started!

I am going to work with the House Rent Dataset, which can be downloaded from the Kaggle website.

Read CSV dataset

import org.apache.spark.sql.{DataFrame, SparkSession}

val sparkSession: SparkSession = SparkSession.builder()
  .master("local").appName("House rent data")
  .getOrCreate()

// Read the House Rent CSV, treating the first row as a header
val rentDataset: DataFrame = sparkSession.read.option("header", "true")
  .csv("/Users/pamkin/rent.csv")
rentDataset.show()

Partitioning and saving the dataset on the local machine

rentDataset.write
  .option("maxRecordsPerFile", 500)
  .partitionBy("City")
  .csv("/Users/pamkin/Extra/projects/scalaSpark1/src/main/scala/result")

Here, I have used the maxRecordsPerFile write option to cap the number of records written to each file.

You can see that each file contains at most 500 records; once that limit is reached, Spark writes the remaining records to a new file.
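If you want to confirm the cap yourself, here is a small sketch (reusing the session and output path from above) that reads the output back and counts rows per physical file using input_file_name():

// Read the written CSV files back and count records per file
import org.apache.spark.sql.functions.{count, input_file_name}

val written = sparkSession.read
  .csv("/Users/pamkin/Extra/projects/scalaSpark1/src/main/scala/result")

written.groupBy(input_file_name().as("file"))
  .agg(count("*").as("records"))
  .show(truncate = false)

Every file should report 500 records or fewer.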

Note: maxRecordsPerFile is only available in Spark 2.2.0 and above.
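As an aside, the same cap can also be applied session-wide through the spark.sql.files.maxRecordsPerFile configuration instead of a per-write option. A minimal sketch, assuming the same session and output path as above (the per-write option takes precedence when both are set):

// Session-wide equivalent of the per-write maxRecordsPerFile option
sparkSession.conf.set("spark.sql.files.maxRecordsPerFile", 500)

rentDataset.write
  .mode("overwrite") // the target directory already exists from the earlier write
  .partitionBy("City")
  .csv("/Users/pamkin/Extra/projects/scalaSpark1/src/main/scala/result")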
Best books for Spark:
https://amzn.to/3VPc886 & https://amzn.to/3zb6xiS
