Program to read a CSV file with a multi-character delimiter | Spark | Scala | PySpark
In this article, we will learn how to handle a multi-character delimiter in a CSV file using Spark with Scala.
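Throughout the steps below, assume a sample input file temp.csv that uses a double comma (,,) as the delimiter. The column names match the schema used later; the data rows are only illustrative:
Name,,Job,,Age
John,,Engineer,,30
Mary,,Doctor,,28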
Step 1: Read the file as plain text and convert it into an RDD
val df: RDD[String] = sparkSession.read.textFile("temp.csv").rdd
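To sanity-check the read, you can print the first couple of raw lines; with the hypothetical sample file above, this would show the header line followed by the first data row:
df.take(2).foreach(println)
// Name,,Job,,Age
// John,,Engineer,,30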
Step 2: Split each record on the delimiter
val df1: RDD[Array[String]] = df.map(row => row.split(",,", -1))
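As a quick illustration (using the hypothetical sample rows from above), splitting a line on the two-character delimiter yields one array element per column. The limit -1 tells split to keep trailing empty strings, so a missing last column still produces an (empty) element:
"John,,Engineer,,30".split(",,", -1)   // Array("John", "Engineer", "30")
"John,,Engineer,,".split(",,", -1)     // Array("John", "Engineer", "")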
Step 3: Drop the header row and convert each record into a Row
val first: String = df1.first()(0)   // first column of the header row
val df2: RDD[Row] = df1.filter(row => !row(0).equals(first)).map(row =>
  Row(row(0), row(1), row(2))
)
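Note that Row(row(0), row(1), row(2)) assumes exactly three columns. If your file has many columns, one option (sketched here with an illustrative dfAlt name) is to build each Row directly from the whole array:
val dfAlt: RDD[Row] = df1.filter(row => !row(0).equals(first)).map(row => Row.fromSeq(row))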
Final step: Create a DataFrame from the above RDD and a schema
val dfWithSchema: DataFrame = sparkSession.createDataFrame(df2, schema)
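With the hypothetical sample file shown earlier, calling dfWithSchema.show() (as done in the complete code below) would print roughly the following; your output will depend on your own data:
+----+--------+---+
|Name|     Job|Age|
+----+--------+---+
|John|Engineer| 30|
|Mary|  Doctor| 28|
+----+--------+---+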
Complete code:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
object Main {
  def main(args: Array[String]): Unit = {
    val sparkSession: SparkSession = SparkSession.builder().appName("temp").master("local")
      .getOrCreate()
    // Step 1: read the raw lines of the file as an RDD[String]
    val df: RDD[String] = sparkSession.read.textFile("temp.csv").rdd
    // Target schema for the final DataFrame
    val schema = new StructType()
      .add(StructField("Name", StringType, false))
      .add(StructField("Job", StringType, true))
      .add(StructField("Age", StringType, true))
    // Step 2: split each record on the two-character delimiter ",,"
    // (limit -1 keeps trailing empty strings so missing values are preserved)
    val df1: RDD[Array[String]] = df.map(row => row.split(",,", -1))
    // Step 3: drop the header row and convert each record into a Row
    val first: String = df1.first()(0)
    val df2: RDD[Row] = df1.filter(row => !row(0).equals(first)).map(row =>
      Row(row(0), row(1), row(2))
    )
    // Final step: create the DataFrame from the RDD[Row] and the schema
    val dfWithSchema: DataFrame = sparkSession.createDataFrame(df2, schema)
    dfWithSchema.show()
  }
}
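As a side note, if you are running Spark 3.0 or later, the built-in CSV reader itself accepts a multi-character separator, so the same file can be read without the RDD round trip. This is only a sketch assuming the same temp.csv layout with a header row:
val dfDirect: DataFrame = sparkSession.read
  .option("header", "true")   // treat the first line as column names
  .option("sep", ",,")        // multi-character delimiter (Spark 3.0+)
  .csv("temp.csv")
dfDirect.show()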
Thanks for reading. Please follow me for more articles like this.