I found this question while looking for a workaround for a bug in the -text utility in the new version of the hadoop dfs client I just installed. The -text utility works like cat, except that if the file being read is compressed, it transparently decompresses it and outputs the plain text (hence the name).
The answers already posted were definitely helpful, but some of them have one problem when dealing with Hadoop-sized amounts of data: they read everything into memory before decompressing.
So, here are my variations on the Perl and Python answers above that do not have that limitation:
Python:

hadoop fs -cat /path/to/example.deflate |
python -c 'import zlib,sys;d=zlib.decompressobj();map(lambda b:sys.stdout.write(d.decompress(b)),iter(lambda:sys.stdin.read(4096),""))'

(Note the single zlib.decompressobj() shared across chunks: plain zlib.decompress fails on a 4096-byte slice because each slice is not a complete stream on its own, while a decompressor object carries its state from one chunk to the next.)
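Since one-liners are hard to read, here is a sketch of the same streaming approach as an ordinary script (the script and function names are mine, not part of any Hadoop tooling):

```python
import sys
import zlib

def stream_decompress(read_chunk, write, chunk_size=4096):
    """Decompress a zlib/deflate stream chunk by chunk.

    A single decompressor object carries its state across chunks,
    so the whole file never has to fit in memory at once.
    """
    d = zlib.decompressobj()
    while True:
        chunk = read_chunk(chunk_size)
        if not chunk:
            break
        write(d.decompress(chunk))
    write(d.flush())  # emit any bytes still buffered inside the decompressor

if __name__ == "__main__":
    # Read compressed bytes from stdin, write plain text to stdout, e.g.:
    #   hadoop fs -cat /path/to/example.deflate | python stream_decompress.py
    stream_decompress(sys.stdin.buffer.read, sys.stdout.buffer.write)
```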
Perl:

hadoop fs -cat /path/to/example.deflate |
perl -MCompress::Zlib -e '$i=inflateInit();while(sysread(STDIN,$b,4096)){($o)=$i->inflate($b);print $o}'

(The same idea applies here: a single inflateInit() object decompresses incrementally, whereas uncompress() expects a complete stream in one buffer.)
Note the use of the -cat sub-command instead of -text. This is so that my workaround does not break once they fix the bug. Apologies for the readability of the Python version.