I need to read a .dat file (e.g. testdata.dat) that contains special characters, without corrupting them, and load it into a DataFrame in Scala using Spark.
I have a .dat file (e.g. testdata.dat) whose contents include special characters such as the en dash, em dash, and pound symbol. When I read it in Scala using spark.read.format and then show the DataFrame, the special characters are replaced by ?. I have tried several approaches, such as .option("encoding", "UTF-8") and .option("charset", "ISO-8859-1"), but none of them helped. I have also tried regexp_replace() with the pattern "[^\u0000-\u007F]+", but that replaces every non-ASCII character with the single replacement string defined in the call, so it does not help either. Most of the suggestions I have found are along these lines. Is there any other possible approach? Please help!
Sample Code:
val test = sparkSession.read.format("dat")
.option("header", "false")
.option("delimiter", "|")
//.option("charset", "ISO-8859-1")
//.option("encoding","UTF-8")
.csv(csvPath)
.filter(col("_c0").isNotNull)
//.withColumn("_c1",regexp_replace(col("_c1"),"[^\\u0000-\\u007F]+","-"))
.select(col("_c0").alias("id"),
col("_c1").alias("test_string")
).dropDuplicates().cache
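For completeness, here is a variant I could try next (a minimal sketch, assuming the file was actually saved as Windows-1252, where the en dash, em dash, and pound sign are single bytes; on some Spark 2.x versions the encoding option is reportedly only honored together with multiLine):

import org.apache.spark.sql.functions.col

val testCp1252 = sparkSession.read
  .option("header", "false")
  .option("delimiter", "|")
  .option("encoding", "windows-1252") // assumption: the file is cp1252, not UTF-8
  .option("multiLine", "true")        // some Spark versions ignore "encoding" without this
  .csv(csvPath)
  .select(col("_c0").alias("id"), col("_c1").alias("test_string"))

Note also that even when the DataFrame itself is correct, the IntelliJ run console may render non-ASCII characters as ? if its own encoding (-Dfile.encoding) is not UTF-8, so the console output alone can be misleading.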
Sample Output (IntelliJ Output Console):
+---+-----------------------+
|id |test_string |
+---+-----------------------+
|2 |Test Part2 � Test Part2|
|1 |Test Part1 � Test Part1|
+---+-----------------------+
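Update: to rule out corruption on disk, one can dump the raw bytes of the file and compare them against the UTF-8 and Windows-1252 encodings of the characters (a minimal sketch; the path is a placeholder):

import java.nio.file.{Files, Paths}

val bytes = Files.readAllBytes(Paths.get("/path/to/testdata.dat")) // placeholder path
// An en dash is 0xE2 0x80 0x93 in UTF-8 but the single byte 0x96 in Windows-1252;
// the pound sign is 0xC2 0xA3 in UTF-8 but 0xA3 in Windows-1252/ISO-8859-1.
bytes.take(200).foreach(b => print(f"${b & 0xff}%02x "))
println()

If the dump shows single bytes like 0x96 or 0xA3 rather than multi-byte UTF-8 sequences, the file is not UTF-8, and the reader's encoding option has to match the file's real charset.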
Comment: "?" symbols? Is this in some UI? In some files? After which code? Basically you need to give us a minimal example of input/output and the code you're using for us to help you.