How to control number of records per file in Spark | Scala | Pyspark
How to limit the size of an output file in Spark (maxRecordsPerFile)
While working with data, we may come across a situation where we want to restrict the number of records per file. This is useful, for example, when you need to submit files to an API that cannot accept a file with more than N records.
In this article, I will demonstrate how to control the size of an output file in Spark.
Let’s get started!
I am going to work with the House Rent dataset, which can be downloaded from the Kaggle website.
Reading the CSV dataset
import org.apache.spark.sql.{DataFrame, SparkSession}

val sparkSession: SparkSession = SparkSession.builder()
  .master("local").appName("House rent data").getOrCreate()

val rentDataset: DataFrame = sparkSession.read
  .option("header", "true").csv("/Users/pamkin/rent.csv")
rentDataset.show()
Partitioning and saving the dataset on the local machine
rentDataset.write
  .option("maxRecordsPerFile", 500)
  .partitionBy("City")
  .csv("/Users/pamkin/Extra/projects/scalaSpark1/src/main/scala/result")
Here, I have used the maxRecordsPerFile option to cap the number of records written to each file.
If you open the output directory, you can see that each file contains at most 500 records; once that limit is reached, Spark rolls over to a new file for the remaining records.
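As a quick check, one way to verify the limit (a sketch, assuming the same output path as above) is to read the result back and count rows per physical file with input_file_name():
import org.apache.spark.sql.functions.input_file_name

// Read the written CSVs back (no header was written above) and count
// rows per physical file; every count should be at most 500.
val written = sparkSession.read
  .csv("/Users/pamkin/Extra/projects/scalaSpark1/src/main/scala/result")
written.groupBy(input_file_name().as("file")).count().show(false)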
Note: the maxRecordsPerFile option is only available in Spark 2.2.0 and above.
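If you want the same limit to apply to every write in the session rather than a single one, Spark also exposes it as the configuration spark.sql.files.maxRecordsPerFile. A minimal sketch; individual writes can still pass the maxRecordsPerFile option as shown above:
// Session-wide default (Spark 2.2.0+): subsequent file writes in this
// session are limited to 500 records per output file.
sparkSession.conf.set("spark.sql.files.maxRecordsPerFile", 500L)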
Best books for Spark:
https://amzn.to/3VPc886 & https://amzn.to/3zb6xiS