I need to read a .dat file (e.g. testdata.dat) that contains special characters, without corrupting them, and load it into a DataFrame in Scala using Spark.
I have a .dat file (e.g. testdata.dat) whose contents include special characters such as the en dash, em dash, and pound symbol. When I read it in Scala using spark.read.format and then show the DataFrame, the special characters are replaced by ?. I have tried several approaches, such as .option("encoding", "UTF-8") and .option("charset", "ISO-8859-1"), but none of them helped. I have also tried regexp_replace() with the pattern "[^\u0000-\u007F]+", but that replaces every non-ASCII character with the single replacement string defined in the call, so it does not help either. Most of the suggestions I have found are along these lines. Is there any other possible approach? Please help!
Sample Code:
val test = sparkSession.read.format("dat")
.option("header", "false")
.option("delimiter", "|")
//.option("charset", "ISO-8859-1")
//.option("encoding","UTF-8")
.csv(csvPath)
.filter(col("_c0").isNotNull)
//.withColumn("_c1",regexp_replace(col("_c1"),"[^\\u0000-\\u007F]+","-"))
.select(col("_c0").alias("id"),
col("_c1").alias("test_string")
).dropDuplicates().cache
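For completeness, here is a variant I could try next (a minimal sketch, assuming the file was actually saved as Windows-1252, where the en dash, em dash, and pound sign are single bytes; on some Spark 2.x versions the encoding option is reportedly only honored together with multiLine):

import org.apache.spark.sql.functions.col

val testCp1252 = sparkSession.read
  .option("header", "false")
  .option("delimiter", "|")
  .option("encoding", "windows-1252") // assumption: the file is cp1252, not UTF-8
  .option("multiLine", "true")        // some Spark versions ignore "encoding" without this
  .csv(csvPath)
  .select(col("_c0").alias("id"), col("_c1").alias("test_string"))

Note also that even when the DataFrame itself is correct, the IntelliJ run console may render non-ASCII characters as ? if its own encoding (-Dfile.encoding) is not UTF-8, so the console output alone can be misleading.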
Sample Output (IntelliJ Output Console):
+---+-----------------------+
|id |test_string |
+---+-----------------------+
|2 |Test Part2 � Test Part2|
|1 |Test Part1 � Test Part1|
+---+-----------------------+
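Update: to rule out corruption on disk, one can dump the raw bytes of the file and compare them against the UTF-8 and Windows-1252 encodings of the characters (a minimal sketch; the path is a placeholder):

import java.nio.file.{Files, Paths}

val bytes = Files.readAllBytes(Paths.get("/path/to/testdata.dat")) // placeholder path
// An en dash is 0xE2 0x80 0x93 in UTF-8 but the single byte 0x96 in Windows-1252;
// the pound sign is 0xC2 0xA3 in UTF-8 but 0xA3 in Windows-1252/ISO-8859-1.
bytes.take(200).foreach(b => print(f"${b & 0xff}%02x "))
println()

If the dump shows single bytes like 0x96 or 0xA3 rather than multi-byte UTF-8 sequences, the file is not UTF-8, and the reader's encoding option has to match the file's real charset.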
Comment: "?" symbols? Is this in some UI? In some files? After which code? Basically you need to give us a minimal example of input/output and the code you're using for us to help you.