I am trying to upload many thousands of files to Google Cloud Storage with the following command:

gsutil -m cp *.json gs://mybucket/mydir

But I get this error:

-bash: Argument list too long

What is the best way to handle this? I can obviously write a bash script that iterates over different filename prefixes:

gsutil -m cp 92*.json gs://mybucket/mydir
gsutil -m cp 93*.json gs://mybucket/mydir
gsutil -m cp ...*.json gs://mybucket/mydir

But the problem is that I don't know in advance what my filenames are going to be, so writing that command isn't trivial.

Is there either a way to handle this with gsutil natively (I don't think so, from the documentation), or a way to handle it in bash where I can list, say, 10,000 files at a time and pipe them to the gsutil command?

3 Answers


Eric's answer should work, but another option would be to rely on gsutil's built-in wildcarding, by quoting the wildcard expression:

gsutil -m cp "*.json" gs://mybucket/mydir

To explain more: the "Argument list too long" error comes from the shell, which has a limited-size buffer for expanded wildcards. Quoting the wildcard prevents the shell from expanding it; instead, the literal string is passed to gsutil. gsutil then expands the wildcard in a streaming fashion, i.e., it expands while performing the operations, so it never needs to buffer an unbounded amount of expanded text. As a result you can use gsutil wildcards over arbitrarily large expressions. The same is true when using gsutil wildcards over object names, so for example this would work:

gsutil -m cp "gs://my-bucket1/*" gs://my-bucket2

even if there are a billion objects at the top-level of gs://my-bucket1.
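
If you're curious where that shell limit comes from, it's the kernel's ARG_MAX, the maximum combined size of arguments plus environment that a single exec'd command can receive. On Linux or macOS you should be able to check it with getconf:

# prints the maximum combined size (in bytes) of arguments and environment for a new process
getconf ARG_MAX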

  • As good practice, you should still quote gs://my-bucket1/*. The shell will still treat that string as a pattern to match, and although it will almost certainly fail to match anything, it is possible to set a shell option to treat non-matching patterns as an error rather than as a literal string.
    – chepner
    Commented Jun 27, 2017 at 13:29
  • Thanks chepner - I added quotes to my answer per your suggestion. Commented Jun 27, 2017 at 15:05
  • Thanks, saved me some time! Commented Oct 24, 2019 at 14:17

If your filenames don't contain newlines, you can use gsutil cp's ability to read the list of files to copy from stdin, like

find . -maxdepth 1 -type f -name '*.json' | gsutil -m cp -I gs://mybucket/mydir

or, if you're not sure whether your names are safe and your find and xargs support null-delimited input, you could do

find . -maxdepth 1 -type f -name '*.json' -print0 | xargs -0 -I {} gsutil -m cp {} gs://mybucket/mydir
  • The last example would be simpler as find ... -exec gsutil -m cp {} gs://mybucket/mydir \;. (In either case, I think -m is unnecessary, since you are only passing a single file/URL to each instance of gsutil.)
    – chepner
    Commented Jun 27, 2017 at 13:31
  • @chepner does xargs then spawn a new instance of gsutil for each file when the argument isn't at the end, as with -exec using \; instead of +? Your version of the find command would at least be portable, though. Commented Jun 27, 2017 at 13:33
  • @chepner yup, reading the man page does indeed confirm what you said: -I implies -L 1. Commented Jun 27, 2017 at 13:34
  • I think there are ways of using xargs to similarly batch like -exec ... +, but I think the issue here is that gsutil either takes a pattern or a single file. I didn't see a way to write something like -exec cp -T src_dir {} + like you could with GNU cp.
    – chepner
    Commented Jun 27, 2017 at 14:30
  • @chepner I worked out a way to do it, shame I didn't refresh the page before posting!
    – Tom Fenech
    Commented Jun 27, 2017 at 17:45

Here's a way you could do it, using xargs to limit the number of files that are passed to gsutil at once. Null bytes are used to prevent problems with spaces or newlines in the filenames.

printf '%s\0' *.json | xargs -0 sh -c 'copy_all () {
    gsutil -m cp "$@" gs://mybucket/mydir
}
copy_all "$@"' sh

Here we define a function which is used to put the file arguments in the right place in the gsutil command; the trailing sh fills in $0, so none of the filenames are swallowed by it. xargs invokes the shell as few times as possible, passing as many filename arguments as will fit each time.
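
If you want to see how xargs will batch the arguments before doing the real copy, a rough sketch is to substitute echo for gsutil and report the batch sizes:

# dry run: each batch prints how many filenames it received instead of copying them
printf '%s\0' *.json | xargs -0 sh -c 'echo "would copy $# files in this batch"' sh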

Alternatively you can define the function separately and then export it (this is bash-specific):

copy_all () {
    gsutil -m cp "$@" gs://mybucket/mydir
}
export -f copy_all
printf '%s\0' *.json | xargs -0 bash -c 'copy_all "$@"' bash
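
If you'd rather cap each batch explicitly (the question mentions 10,000 files at a time), xargs's -n option should do it, reusing the exported copy_all function from above:

# at most 10,000 filenames per gsutil invocation; xargs uses fewer if the size limit would otherwise be exceeded
printf '%s\0' *.json | xargs -0 -n 10000 bash -c 'copy_all "$@"' bash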
