I found this question while looking for a work-around for a bug in the -text utility in the new version of the Hadoop DFS client I had just installed. The -text utility works like cat, except that if the file being read is compressed, it transparently decompresses it and outputs the plain text (hence the name).
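
For reference, this is how the buggy utility would normally be invoked (using the same illustrative path as the examples below):

hadoop fs -text /path/to/example.deflate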

The answers already posted were definitely helpful, but some of them have a problem when dealing with Hadoop-sized amounts of data: they read everything into memory before decompressing.

So, here are my variations on the Perl and Python answers above that do not have that limitation:

Python:

hadoop fs -cat /path/to/example.deflate |
  python3 -c 'import sys,zlib;d=zlib.decompressobj();[sys.stdout.buffer.write(d.decompress(c)) for c in iter(lambda:sys.stdin.buffer.read(4096),b"")];sys.stdout.buffer.write(d.flush())'

Perl:

hadoop fs -cat /path/to/example.deflate |
  perl -MCompress::Zlib -e '$i=inflateInit(); print scalar $i->inflate($buf) while sysread(STDIN,$buf,4096)'

Note the use of the -cat sub-command instead of -text, so that my work-around does not break once the bug has been fixed. Apologies for the readability of the Python version.
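
Since the one-liner is dense, here is a more readable sketch of the same streaming approach in plain Python (the 4096-byte chunk size is arbitrary):

import sys
import zlib

# Stream-decompress stdin to stdout without reading the whole file into memory.
d = zlib.decompressobj()
for chunk in iter(lambda: sys.stdin.buffer.read(4096), b""):
    sys.stdout.buffer.write(d.decompress(chunk))
sys.stdout.buffer.write(d.flush())  # emit anything still buffered in the decompressor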
