16

Let's say I've got the following files in a Google Cloud Storage bucket:

file_A1.csv
file_B2.csv
file_C3.csv

Now I want to move a subset of these files, let's say file_A1.csv and file_B2.csv. Currently I do it like this:

gsutil mv gs://bucket/file_A1.csv gs://bucket/file_A11.csv
gsutil mv gs://bucket/file_B2.csv gs://bucket/file_B22.csv

This approach requires two calls of more or less the same command and moves each file separately. I know that if I move a complete directory I can add the -m option to accelerate the process. However, unfortunately I just want to move a subset of all files and keep the rest of the bucket untouched.
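For reference, the whole-directory move with -m that I mean looks something like this (the bucket and folder names here are just placeholders):

gsutil -m mv gs://bucket/old_folder gs://bucket/new_folder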

When moving 100 files this way I need to execute around 100 commands, which becomes quite time consuming. Is there a way to combine the 100 files into just one command, additionally with the -m option?

1
  • Do you have a rule for what the destination's name is? Is that also in a file, or is it "repeat the last letter of the existing file", or something more elaborate? Commented Apr 29, 2015 at 17:59

7 Answers

8

If you have a list of the files you want to move, you can use the -I option of the cp command, which, according to the docs, is also valid for the mv command:

cat filelist | gsutil -m mv -I gs://my-bucket
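Here filelist is just a plain text file with one source URL per line, for example (names are illustrative):

gs://bucket/file_A1.csv
gs://bucket/file_B2.csv

Such a list can also be generated on the fly, e.g. by piping the output of gsutil ls straight into the mv command.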
1
  • that's what I came here for!
    – masterxilo
    Commented Jan 4, 2022 at 19:17
7

This worked for me for moving all .txt files from gs://config to gs://config/new_folder:

gsutil mv 'gs://config/*.txt' gs://config/new_folder/

I had some problems with the wildcard * in zsh, which is the reason for the quotes around the source path.

1
  • 1
    works perfectly in the cloud console shell as well +1
    – 1252748
    Commented Jul 15, 2023 at 1:42
4

gsutil does not currently support this, but what you could do is create a number of shell scripts, each performing a portion of the moves, and run them concurrently (see the sketch below).

Note that gsutil mv is based on the syntax of the unix mv command, which also doesn't support the feature you're asking for.
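A minimal sketch of that idea, assuming the source URLs are listed one per line in to_move.txt; the file name, chunk size, and renaming rule are only placeholders:

# split the list into chunks of 25 URLs each (chunk_aa, chunk_ab, ...)
split -l 25 to_move.txt chunk_

# process each chunk in its own background subshell
for chunk in chunk_*; do
  (
    while read -r src; do
      gsutil mv "$src" "${src%.csv}_renamed.csv"   # substitute your own destination rule
    done < "$chunk"
  ) &
done
wait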

3
Yeah, I already thought about that. However, is there a limit on the number of commands that can be executed concurrently?
    – toom
    Commented Apr 29, 2015 at 15:56
  • Only normal operating system limitations would apply; the tool itself can be executed any number of times concurrently. Commented Apr 29, 2015 at 19:22
Okay, I wrote a small script that moves 100 files in parallel. The result was that just 25 files were moved and the whole process took 10 minutes. Definitely not a solution.
    – toom
    Commented Apr 30, 2015 at 14:55
4

You can achieve that in bash by iterating over the gsutil ls output, for example with:

  • source folder name: old_folder
  • new folder name: new_folder
for x in $(gsutil ls "gs://<bucket_name>/old_folder"); do
  y=$(basename -- "$x")
  gsutil mv "$x" "gs://<bucket_name>/new_folder/$y"
done

You can run it in parallel if you have a huge number of files, using:

N=8 # number of parallel workers
(
for x in $(gsutil ls "gs://<bucket_name>/old_folder"); do
   ((i=i%N)); ((i++==0)) && wait   # after every N jobs, wait for the current batch to finish
   y=$(basename -- "$x")
   gsutil mv "$x" "gs://<bucket_name>/new_folder/$y" &
done
wait   # wait for the final batch
)
3

Not widely documented, but this works every time.

To move the contents of the third folder to the root or any folder above it:

gsutil ls gs://my-bucket/first/second/third/ | gsutil -m mv -I gs://my-bucket/first/

and to copy:

gsutil ls gs://my-bucket/first/second/third/ | gsutil -m cp -I gs://my-bucket/first/
1

To do this you can run the following gsutil command:

gsutil mv gs://bucket_name/common_file_name* gs://destination_bucket_name/

In your case, common_file_name is "file_".

0

The lack of a -m flag is the real hang-up here. Facing the same issue, I originally managed this by using Python multiprocessing and os.system to call gsutil. I had 60k files and it was going to take hours. With some experimenting I found that using the Python client gave a 20x speed-up!

If you are willing to move away from gsutil, it's a better approach.

Here is a copy (or move) method. If you create a list of source keys/URIs, you can call it with multi-threading for fast results (see the usage sketch after the function).

Note: the method returns a tuple of (destination_name, exception), which you can collect into a dataframe or similar to look for failures.

import re
from google.cloud import storage  # pip install google-cloud-storage

# BUCKET_NAME, THING1 and THING2 are placeholders: your default bucket and the
# pattern/replacement that produce the new destination names.

def cp_blob(key=None, bucket=BUCKET_NAME, uri=None, delete_src=False):
    """Copy one blob (and delete the source if delete_src=True, i.e. a move).
    Returns (destination_name, error); error is None on success."""
    try:
        if uri:
            # accept a full gs:// URI instead of a separate bucket/key pair
            uri = re.sub('gs://', '', uri)
            bucket, key = uri.split('/', maxsplit=1)
        client = storage.Client()
        bucket = client.get_bucket(bucket)
        blob = bucket.blob(key)
        dest = re.sub(THING1, THING2, blob.name)  ## OR SOME OTHER WAY TO GET NEW DESTINATIONS
        out = bucket.copy_blob(blob, bucket, dest)
        if delete_src:
            blob.delete()
        return out.name, None
    except Exception as e:
        return None, str(e)
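A minimal sketch of the multi-threaded call, assuming you already have a list of gs:// URIs; the list, worker count, and delete_src flag below are only illustrative:

from concurrent.futures import ThreadPoolExecutor

uris = ["gs://my-bucket/file_A1.csv", "gs://my-bucket/file_B2.csv"]  # your own list

with ThreadPoolExecutor(max_workers=32) as pool:
    # delete_src=True turns the copy into a move
    results = list(pool.map(lambda u: cp_blob(uri=u, delete_src=True), uris))

# each entry is (destination_name, error); error is None on success
failures = [r for r in results if r[1] is not None]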
