
I find that if I create a Compute Engine machine (CentOS or Debian) and use gsutil to download (cp) a tgz file, I get a crcmod error:

$ gsutil cp gs://mybucket/data.tgz .
Copying gs://mybucket/data.tgz...
CommandException:
Downloading this composite object requires integrity checking with CRC32c, but
your crcmod installation isn't using the module's C extension, so the the hash
computation will likely throttle download performance. For help installing the
extension, please see:
  $ gsutil help crcmod
To download regardless of crcmod performance or to skip slow integrity checks,
see the "check_hashes" option in your boto config file.

Currently I use "check_hashes = never" to bypass the check...

$ vi /etc/boto.cfg
[GSUtil]
default_project_id = 429100748693
default_api_version = 2
check_hashes = never
...

But what is the root cause, and is there a good solution to the problem?

1 Answer


The object you're trying to download is a composite object, which basically means it was uploaded in parallel chunks. gsutil automatically does this when uploading objects larger than 150M (a configurable threshold), to provide better performance.

Composite objects only have a crc32c checksum (no MD5), so to validate data integrity when downloading a composite object, gsutil needs to compute a crc32c checksum. Unfortunately, the libraries distributed with Python don't include a compiled crc32c implementation, so unless you install one, gsutil falls back to a pure-Python implementation of crc32c that's quite slow. The message is printed to let you know there's a way to fix that performance problem. Please run:

gsutil help crcmod

and follow the instructions there for installing a compiled crc32c. It's pretty easy to do it, and worth the effort.
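To confirm whether the compiled extension is actually being picked up, here is a minimal sketch in Python. It assumes crcmod exposes a private `_usingExtension` flag in its `crcmod.crcmod` submodule; that flag is an implementation detail rather than a documented API, and the helper name `crcmod_status` is made up for illustration.

```python
import importlib


def crcmod_status():
    """Return 'compiled', 'pure-python', or 'missing' for this interpreter."""
    try:
        impl = importlib.import_module("crcmod.crcmod")
    except ImportError:
        # crcmod is not installed for the interpreter running this script.
        return "missing"
    # Assumption: crcmod sets the private flag _usingExtension to True when
    # its C extension was imported, and False when it fell back to Python.
    if getattr(impl, "_usingExtension", False):
        return "compiled"
    return "pure-python"


if __name__ == "__main__":
    print("crcmod:", crcmod_status())
```

Run it with the same Python interpreter that gsutil uses. If crcmod was installed into a different environment, this check and gsutil version -l can disagree.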

One other note: I strongly recommend against setting check_hashes = never in your boto config file. That will disable integrity checking, which means it's possible your download could get corrupted and you wouldn't know it. You want data integrity checking enabled to ensure you're working with correct data.
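To make the integrity check concrete, here is a rough pure-Python sketch of a CRC32C computation. It is not gsutil's code; the function names are hypothetical, and the base64 helper assumes the object's crc32c is reported as the base64 encoding of the 4-byte big-endian checksum. A byte-at-a-time loop like this one is also why the non-compiled path is slow on large files.

```python
import base64
import struct

_POLY = 0x82F63B78  # reflected Castagnoli polynomial used by CRC32C


def _build_table():
    """Precompute the 256-entry lookup table for byte-wise CRC32C."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data, crc=0):
    """Return the CRC32C of data (bytes); pass a previous result to continue."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crc32c_base64(data):
    """Encode the checksum as base64 of its big-endian bytes (assumed format)."""
    return base64.b64encode(struct.pack(">I", crc32c(data))).decode("ascii")


if __name__ == "__main__":
    # Standard CRC32C check value for the ASCII string "123456789".
    assert crc32c(b"123456789") == 0xE3069283
    print(crc32c_base64(b"123456789"))
```

With check_hashes = never, no comparison of this kind happens at all, so a corrupted download would go unnoticed.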

  • Is there any way to pass check_hashes as a parameter to gsutil for a single command execution?
    Commented May 24, 2017 at 9:22
  • Robert - you can pass config file params to gsutil using the gsutil -o option, for example: gsutil -o GSUtil:check_hashes=if_fast_else_fail cp file gs://my-bucket
    Commented Jun 27, 2017 at 15:09
  • @MikeSchwartz I'm working on an HPC cluster with a conda environment and installed crcmod using conda install crcmod. gsutil version -l still shows compiled crcmod: False, and my downloads are being stopped because of this. I don't have admin privileges to compile it from source. Any suggestions, please?
    – Enigma
    Commented Apr 9, 2021 at 16:09
  • @Enigma - I haven't worked with Conda but from a quick web search it looks like it's a package & environment manager. So, I suspect the problem is it's installing crcmod and setting up environment variables and only software run within that environment will use those packages. And I suspect you're not running gsutil from within that environment. Commented Apr 10, 2021 at 22:16
  • @MikeSchwartz I installed gsutil using conda as well. As of now, I've skipped the ones which are failing to download because of this.
    – Enigma
    Commented Apr 12, 2021 at 12:33
