Handling Null values in spark scala

Parmanand

2 min read · Nov 4, 2020

Spark is a powerful data processing framework. It offers many functions to handle null values in a Spark DataFrame in different ways, including functions to replace null values. Its na package (reached via DataFrame.na) contains the functions for dealing with nulls.

In this article, we will walk through these functions one by one with examples.

Let’s get started !

drop(): The easiest way to deal with null values is to drop the rows which contain null or NaN values 😄

For example -

val tempDF = sparkSession.createDataFrame(Seq(
  ("rahul sharma", 32, "Patna", 20000, null),
  ("Joy don", 30, "NY", 23455, "27-Sep-20"),
  ("Steve boy", 42, "Delhi", 294884, "27-Sep-20")
)).toDF("name", "age", "city", "salary", "date")

The first row contains a null value.

val finalDF = tempDF.na.drop()
finalDF.show()

Output- the first row, which contains a null date, is dropped:

+---------+---+-----+------+---------+
|     name|age| city|salary|     date|
+---------+---+-----+------+---------+
|  Joy don| 30|   NY| 23455|27-Sep-20|
|Steve boy| 42|Delhi|294884|27-Sep-20|
+---------+---+-----+------+---------+

Note- it is possible to name a few columns which may contain null values, instead of searching all columns.

val finalDF = tempDF.na.drop(Seq("name", "date"))

In this case, a row is removed from the DataFrame only when the name column or the date column contains a null value; nulls in the other columns are ignored.
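drop() has a couple of other standard overloads worth knowing: a mode string ("any" is the default; "all" drops a row only when every listed column is null) and a minimum non-null count. A self-contained sketch using the same sample data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("na-drop").getOrCreate()

val tempDF = spark.createDataFrame(Seq(
  ("rahul sharma", 32, "Patna", 20000, null),
  ("Joy don", 30, "NY", 23455, "27-Sep-20"),
  ("Steve boy", 42, "Delhi", 294884, "27-Sep-20")
)).toDF("name", "age", "city", "salary", "date")

// "all" drops a row only when BOTH name and date are null,
// so here all three rows survive (row 1 still has a name).
val dropAllNull = tempDF.na.drop("all", Seq("name", "date"))
dropAllNull.show()

// drop(minNonNulls) keeps rows with at least that many non-null
// values across all columns; row 1 has only 4 of 5, so it is removed.
val atLeastFive = tempDF.na.drop(5)
atLeastFive.show()
```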

fill(): Returns a new DataFrame that replaces null or NaN values in specified columns.

For example -

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val data = Seq(
  Row("rahul sharma", 32, null),
  Row("Joy don", null, "NY"),
  Row("Steve", 42, "Delhi")
)
val rdd = sparkSession.sparkContext.parallelize(data)
val schema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("city", StringType, true)
))
val tempDF = sparkSession.createDataFrame(rdd, schema)
val finalDF = tempDF.na.fill(25)
finalDF.show()

Output-

+------------+---+-----+
|        name|age| city|
+------------+---+-----+
|rahul sharma| 32| null|
|     Joy don| 25|   NY|
|       Steve| 42|Delhi|
+------------+---+-----+

Earlier, age was null for Joy; now it shows 25, the value we filled in. But the city column still contains null.

Why ?

Because 25 is treated as an integer value, and only age is of integer type. If you write tempDF.na.fill("25") instead, it will replace the null values in the string columns, including city.
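To fill both columns in one call with different values per column, fill() also accepts a Map from column name to replacement value (a standard DataFrameNaFunctions overload). A self-contained sketch:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("na-fill").getOrCreate()

val schema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("city", StringType, true)
))
val tempDF = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(
    Row("rahul sharma", 32, null),
    Row("Joy don", null, "NY"),
    Row("Steve", 42, "Delhi")
  )),
  schema
)

// Per-column replacements: 25 for a null age, "unknown" for a null city.
val filledDF = tempDF.na.fill(Map("age" -> 25, "city" -> "unknown"))
filledDF.show()
```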

coalesce(): The coalesce function (from org.apache.spark.sql.functions) returns the first non-null value from a set of columns.

For example-

val sparkSession = createSparkSession()

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{coalesce, col, lit}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val data = Seq(
  Row(null, 32, null),
  Row("Joy don", null, "NY"),
  Row("Steve", 42, "Delhi")
)
val rdd = sparkSession.sparkContext.parallelize(data)
val schema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", IntegerType, true),
  StructField("city", StringType, true)
))
val tempDF = sparkSession.createDataFrame(rdd, schema)
val finalDF = tempDF.select(
  coalesce(col("name"), lit("unknown")).as("name"),
  col("age"),
  col("city")
)
finalDF.show()
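coalesce is not limited to one column plus a literal: it scans any number of columns left to right and returns the first non-null value per row. A self-contained sketch with hypothetical nickname/name columns (not part of the example above):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("coalesce-chain").getOrCreate()

val schema = StructType(Seq(
  StructField("nickname", StringType, true),
  StructField("name", StringType, true)
))
val people = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(
    Row("Stevie", "Steve"),  // nickname wins
    Row(null, "Joy don"),    // falls back to name
    Row(null, null)          // falls back to the literal
  )),
  schema
)

// Left-to-right scan: first non-null value per row.
val displayName = people.select(
  coalesce(col("nickname"), col("name"), lit("unknown")).as("display_name")
)
displayName.show()  // Stevie, Joy don, unknown
```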

Thanks for reading!

Please do share the article if you liked it. Any comments or suggestions are welcome.
