
I have written a Spark word count program using the code below:

package com.practice

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val spark  = SparkSession.builder().config(sparkConf).master("local[2]").getOrCreate()
    val input  = args(0)
    val output = args(1)
    val text   = spark.sparkContext.textFile(input)
    val words  = text.flatMap(line => line.split(" "))  // split each line into words
    val pairs  = words.map(w => (w, 1))                 // pair every word with a count of 1
    val wc     = pairs.reduceByKey(_ + _)               // sum the counts per word

    wc.saveAsTextFile(output)
    spark.stop()
  }
}

Using spark-submit, I ran the jar and got the output in the output dir:

SPARK_MAJOR_VERSION=2 spark-submit --master local[2] --class com.practice.WordCount sparkwordcount_2.11-0.1.jar file:///home/hmusr/ReconTest/inputdir/sample file:///home/hmusr/ReconTest/inputdir/output

Both the input and output files are local, not on HDFS. In the output dir I see two files: part-00000.deflate and _SUCCESS. After searching online I understood that the output was saved as a compressed file, but is there any way I can read it?


1 Answer


Try this one.

cat part-00000.deflate | perl -MCompress::Zlib -e 'undef $/; print uncompress(<>)'
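For context: Spark wrote the output through Hadoop's default compression codec, which produces a zlib ("deflate") stream, and the Perl one-liner above simply inflates that stream to stdout. If Perl isn't handy, any zlib tool does the same job; here is a minimal sketch using Python's zlib module from the shell (sample.deflate is a stand-in file created here just for illustration):

```shell
# Create a small zlib/deflate file as a stand-in for part-00000.deflate:
printf 'spark 2\nword 1\n' \
  | python3 -c 'import sys, zlib; sys.stdout.buffer.write(zlib.compress(sys.stdin.buffer.read()))' \
  > sample.deflate

# Inflate it back to plain text, same idea as the Perl one-liner:
python3 -c 'import sys, zlib; sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))' \
  < sample.deflate
```

Note that Spark itself can also read the compressed output back (textFile picks a decompression codec from the .deflate extension), and output compression can usually be disabled in the Hadoop configuration if you would rather get plain-text part files.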

  • While this code may answer the question, it is better to include some context explaining how it works and when to use it. Code-only answers are not useful in the long run. See stackoverflow.com/help/how-to-answer for more info. Commented Jan 2, 2019 at 19:16
  • This answer worked perfectly and saved me a lot of time! Thanks! Commented Jan 21, 2021 at 6:11
