14

I recently set up LZO compression in Hadoop. What is the easiest way to compress a file in HDFS? I want to compress a file and then delete the original. Should I create a MR job with an IdentityMapper and an IdentityReducer that uses LZO compression?

7 Answers

21

For me, it's lower overhead to write a Hadoop Streaming job to compress files.

This is the command I run:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -Dmapred.reduce.tasks=0 \
  -input <input-path> \
  -output $OUTPUT \
  -mapper "cut -f 2"

I'll also typically stash the output in a temp folder in case something goes wrong:

OUTPUT=/tmp/hdfs-gzip-`basename $1`-$RANDOM

One additional note: I do not specify a reducer in the streaming job, but you certainly can. Adding one will force all the lines to be sorted, which can take a long time with a large file. There might be a way to get around this by overriding the partitioner, but I didn't bother figuring that out. The unfortunate part is that you potentially end up with many small files that do not utilize HDFS blocks efficiently. That's one reason to look into Hadoop Archives.
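Why `cut -f 2` is the mapper can be checked locally: streaming hands the mapper lines of "key<TAB>value", and for text input the key is the byte offset. A small sketch simulating that input (the offsets here are illustrative, no cluster needed):

```shell
# Streaming feeds the mapper "key<TAB>value" lines; with TextInputFormat the
# key is the byte offset of the line and the value is the line itself.
# "cut -f 2" keeps only the value:
printf '0\thello world\n12\tsecond line\n' | cut -f 2
# prints:
# hello world
# second line
```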

5
  • why "cut -f 2" instead of, say, "cat" ?
    – dranxo
    Commented Aug 2, 2013 at 23:17
  • 2
    The input to the mapper is a key and a value separated by a tab. The key is the byte offset of the line in the file and the value is the text of the line. cut -f 2 outputs only the value.
    – Jeff Wu
    Commented Aug 3, 2013 at 16:07
  • How can I compress a folder in HDFS? Commented Jan 9, 2014 at 8:20
  • 1
    The answer below actually uses the cat command, which is the correct answer.
    – rjurney
    Commented Nov 14, 2014 at 2:03
  • The above command adds an extra tab character at the end of each line of the compressed output. Commented Apr 28, 2017 at 9:03
7

I suggest you write a MapReduce job that, as you say, just uses the Identity mapper. While you are at it, you should consider writing the data out to sequence files to improve load performance. You can also store sequence files with block-level or record-level compression. You should see what works best for you, since each is optimized for a different type of record.

5

Jeff Wu's streaming command, followed by a concatenation of the compressed part files, will give a single compressed file. When a non-Java mapper is passed to the streaming job and the input format is text, streaming outputs just the value and not the key.

hadoop jar contrib/streaming/hadoop-streaming-1.0.3.jar \
            -Dmapred.reduce.tasks=0 \
            -Dmapred.output.compress=true \
            -Dmapred.compress.map.output=true \
            -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
            -input filename \
            -output /filename \
            -mapper /bin/cat \
            -inputformat org.apache.hadoop.mapred.TextInputFormat \
            -outputformat org.apache.hadoop.mapred.TextOutputFormat
hadoop fs -cat /path/part* | hadoop fs -put - /path/compressed.gz
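The concatenation step works because gzip streams are multi-member: decompressing the concatenation of several gzip files yields the concatenation of their contents. A local sketch demonstrating the property (no cluster needed; file names are illustrative):

```shell
# Concatenated gzip members decompress as if the inputs had been one file.
# This is why cat-ing the part-*.gz files yields one valid compressed.gz.
tmp=$(mktemp -d)
printf 'first\n'  | gzip > "$tmp/part-00000.gz"
printf 'second\n' | gzip > "$tmp/part-00001.gz"
cat "$tmp"/part-*.gz > "$tmp/combined.gz"
gunzip -c "$tmp/combined.gz"
# prints:
# first
# second
```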
3
  • Just want to make sure I understand the commands. The first one produces the output in gzipped file but the actual file isn't in the *.gz format so the second command is to rename it?
    – nevets1219
    Commented Dec 11, 2014 at 18:26
  • No, the first command generates the compressed *.gz part files (many of them). And the second command is for concatenating those part files together into a single 'compressed.gz' file. Commented Apr 28, 2017 at 8:46
  • The above command adds an extra tab character at the end of each line of the compressed output. Commented Apr 28, 2017 at 9:03
4

This is what I've used:

/*
 * Pig script to compress a directory
 * input:   hdfs input directory to compress
 *          hdfs output directory
 * 
 * 
 */

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;

--comma-separated list of hdfs directories to compress
input0 = LOAD '$IN_DIR' USING PigStorage();

--single output directory
STORE input0 INTO '$OUT_DIR' USING PigStorage(); 

Though it's not LZO, so it may be a bit slower.

2
  • Does this compress each individual file in the input directory, or does the compression treat all the files as one big file and compress that, then output potentially many fewer files? If the latter case, is there a way to specify how much data pig should try to compress at a time, e.g. 3Gb at a time?
    – AatG
    Commented Aug 21, 2014 at 21:15
  • Yes, it will load an entire input directory into a single alias and output as ${OUT_DIR}/part-m-*.bz2. If you want to compress 3 GB at a time, control what goes into IN_DIR
    – dranxo
    Commented Aug 23, 2014 at 0:53
4

@Chitra I cannot comment due to reputation issues

Here is everything in one command: instead of using the second command, you can reduce directly into one compressed file

hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
        -Dmapred.reduce.tasks=1 \
        -Dmapred.output.compress=true \
        -Dmapred.compress.map.output=true \
        -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
        -input /input/raw_file \
        -output /archives/ \
        -mapper /bin/cat \
        -reducer /bin/cat \
        -inputformat org.apache.hadoop.mapred.TextInputFormat \
        -outputformat org.apache.hadoop.mapred.TextOutputFormat

Thus, you gain a lot of space by having only one compressed file

For example, let's say I have 4 files of 10 MB each (plain text, JSON-formatted)

The map-only job gives me 4 files of 650 KB. If I map and reduce, I have 1 file of 1.05 MB.

2

I know this is an old thread, but for anyone following it (like me), it is useful to know that either of the following 2 methods gives you a tab (\t) character at the end of each line

 hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
      -Dmapred.output.compress=true \
      -Dmapred.compress.map.output=true \
      -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
      -Dmapred.reduce.tasks=0 \
      -input <input-path> \
      -output $OUTPUT \
      -mapper "cut -f 2"


hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
        -Dmapred.reduce.tasks=1 \
        -Dmapred.output.compress=true \
        -Dmapred.compress.map.output=true \
        -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
        -input /input/raw_file \
        -output /archives/ \
        -mapper /bin/cat \
        -reducer /bin/cat \
        -inputformat org.apache.hadoop.mapred.TextInputFormat \
        -outputformat org.apache.hadoop.mapred.TextOutputFormat

Since hadoop-streaming.jar adds x'09' (a tab) at the end of each line, I found the fix: we need to set the following 2 parameters to the delimiter you actually use (in my case it was ,)

 -Dstream.map.output.field.separator=, \
 -Dmapred.textoutputformat.separator=, \

Full command to execute:

hadoop jar <HADOOP_HOME>/jars/hadoop-streaming-2.6.0-cdh5.4.11.jar \
        -Dmapred.reduce.tasks=1 \
        -Dmapred.output.compress=true \
        -Dmapred.compress.map.output=true \
        -Dstream.map.output.field.separator=, \
        -Dmapred.textoutputformat.separator=, \
        -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec \
        -input file:////home/admin.kopparapu/accenture/File1_PII_Phone_part3.csv \
        -output file:///home/admin.kopparapu/accenture/part3 \
        -mapper /bin/cat \
        -reducer /bin/cat \
        -inputformat org.apache.hadoop.mapred.TextInputFormat \
        -outputformat org.apache.hadoop.mapred.TextOutputFormat
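If the output was already written with the stray trailing tab, it can also be stripped after the fact. A hedged local sketch (the sample lines are illustrative, and `\t` in the pattern assumes GNU sed):

```shell
# Simulate streaming output that carries a trailing x'09' on every line,
# then strip it; the data here is made up, not from a real job.
printf 'a,b\t\nc,d\t\n' | sed 's/\t$//'   # GNU sed: \t matches a tab
# prints:
# a,b
# c,d
```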
-4

Well, if you compress a single file, you may save some space, but you can't really use Hadoop's power to process that file since the decompression has to be done by a single Map task sequentially. If you have lots of files, there's Hadoop Archive, but I'm not sure it includes any kind of compression. The main use case for compression I can think of is compressing the output of Maps to be sent to Reduces (save on network I/O).

Oh, to answer your question more completely, you would probably need to implement your own RecordReader and/or InputFormat to make sure the entire file is read by a single Map task, and that it uses the correct decompression filter.

2
  • Hadoop has integrated compression libraries, see cloudera.com/blog/2009/06/….
    – schmmd
    Commented Aug 22, 2011 at 22:33
  • Interesting. I thought you were talking about input being compressed, not compressing the output, sorry. Do you care about the sorting of the data in the output file? You could easily just use the filesystem APIs and wrap the FSDataOutputStream in the LZO compression filter if you don't care about the sorting of the output file. If you do, then FileOutputFormat.setCompressOutput() and setOutputCompressorClass(). It's right in the Javadoc, found it in 10 seconds via Google.
    – Drizzt321
    Commented Aug 23, 2011 at 16:28
