Program to read a CSV file with a multi-character delimiter | Spark | Scala | PySpark

CSV file with a multi-character delimiter

Parmanand
1 min read · Feb 23, 2023

In this article, we will learn how to handle a multi-character delimiter in a CSV file using Spark with Scala.

Step 1: Read a text file and convert into an RDD

val df: RDD[String] = sparkSession.read.textFile("temp.csv").rdd
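
For illustration, assume temp.csv uses a double comma (,,) as the delimiter and contains a header row plus records matching the Name/Job/Age schema used later. The sample contents below are purely hypothetical:

Name,,Job,,Age
John,,Engineer,,30
Mary,,Doctor,,28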

Step 2: Loop through the records and split each one on the multi-character delimiter

val df1: RDD[Array[String]] = df.map(row => row.split(",,", -1))
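
On a line from the hypothetical file above, the split would behave roughly like this (the -1 limit keeps trailing empty fields):

// "John,,Engineer,,30".split(",,", -1) returns Array("John", "Engineer", "30")
val parts: Array[String] = "John,,Engineer,,30".split(",,", -1)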

Step 3: Filter out the header row and convert the remaining records into Row type

val df2: RDD[Row] = df1.filter(row => !row(0).equals(first)).map(row =>
  Row(row(0), row(1), row(2))
)
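
Note: first holds the first field of the header row, so this filter simply drops the header. It is defined in the complete code below as:

val first: String = df1.first()(0)   // e.g. "Name" from the header row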

Final step: Create a DataFrame from the above RDD and a schema

val dfWithSchema: DataFrame = sparkSession.createDataFrame(df2, schema)
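
The schema referenced here is built with a StructType; in the complete code below all three columns are read as plain strings:

val schema = new StructType()
  .add(StructField("Name", StringType, false))
  .add(StructField("Job", StringType, true))
  .add(StructField("Age", StringType, true))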

Complete code:


import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object Main {

  def main(args: Array[String]): Unit = {

    val sparkSession: SparkSession = SparkSession.builder()
      .appName("temp")
      .master("local")
      .getOrCreate()

    // Read the file as plain text and convert it into an RDD of lines
    val df: RDD[String] = sparkSession.read.textFile("temp.csv").rdd

    // Target schema for the resulting DataFrame
    val schema = new StructType()
      .add(StructField("Name", StringType, false))
      .add(StructField("Job", StringType, true))
      .add(StructField("Age", StringType, true))

    // Split each line on the multi-character delimiter ",,"
    // (-1 keeps trailing empty fields)
    val df1: RDD[Array[String]] = df.map(row => row.split(",,", -1))

    // First field of the header row, used to filter the header out
    val first: String = df1.first()(0)

    // Drop the header row and convert the remaining records into Rows
    val df2: RDD[Row] = df1.filter(row => !row(0).equals(first)).map(row =>
      Row(row(0), row(1), row(2))
    )

    // Build the DataFrame from the RDD[Row] and the schema
    val dfWithSchema: DataFrame = sparkSession.createDataFrame(df2, schema)
    dfWithSchema.show()
  }
}
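
With the hypothetical temp.csv shown earlier, dfWithSchema.show() would print something like:

+----+--------+---+
|Name|     Job|Age|
+----+--------+---+
|John|Engineer| 30|
|Mary|  Doctor| 28|
+----+--------+---+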

Thanks for reading. Please follow me for more articles like this.
