Creating and arranging files into folders based on date and time in file name

Question

I have many files in a folder Main which are named like these:

2021_10_15_23_35_SIP_CDR_pid3894_ins2_thread_1_4718.csv.gz
2021_11_24_21_15_Gi_pid25961_ins2_thread_1_6438.csv.gz  2021_11_25_20_55_Gi_pid29741_ins5_thread_4_7540.csv.gz
2021_11_24_21_15_Gi_pid27095_ins2_thread_1_6485.csv.gz  2021_11_25_20_55_Gi_pid30842_ins3_thread_2_7489.csv.gz
2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz  2021_11_25_20_55_Gi_pid30842_ins4_thread_3_7488.csv.gz
2021_11_24_21_15_Gi_pid27095_ins4_thread_3_6485.csv.gz  2021_11_25_20_55_Gi_pid30842_ins5_thread_4_7489.csv.gz
2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz

The first 10 characters are the date, followed by the time (hh_mm) in 24-hour format. The rest of the name is file detail which we can ignore.

I want to create folders within the Main folder based on the date in the file name, and then another folder inside each date folder based on the hour in the file name. Eventually I want to move the files from the Main folder into the respective hour folder.

Main -> Date -> hh -> file.csv.gz

For example, the file 2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz in the Main folder should end up at the path Main/2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz

Can you please help with a bash script that groups the files into folders as described above?

cas · Accepted Answer · 2022-04-11 10:06:22Z

Using the perl rename utility:

Note: perl rename is also known as file-rename, perl-rename, or prename. It is not to be confused with the rename utility from util-linux, which has completely different and incompatible capabilities and command-line options. perl rename is the default rename on Debian. IIRC, it's in the prename package on CentOS, where the command should be executed as prename rather than rename.

$ rename -n 'if (m/(^\d{4}_\d\d_\d\d)_(\d\d)/) {
               my ($date,$hour) = ($1,$2);
               my $dir = "./$date/$hour/";
               mkdir $date;
               mkdir $dir;
               s=^=$dir=
             }' *
rename(2021_10_15_23_35_SIP_CDR_pid3894_ins2_thread_1_4718.csv.gz, ./2021_10_15/23/2021_10_15_23_35_SIP_CDR_pid3894_ins2_thread_1_4718.csv.gz)
rename(2021_11_24_21_15_Gi_pid25961_ins2_thread_1_6438.csv.gz, ./2021_11_24/21/2021_11_24_21_15_Gi_pid25961_ins2_thread_1_6438.csv.gz)
rename(2021_11_24_21_15_Gi_pid27095_ins2_thread_1_6485.csv.gz, ./2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins2_thread_1_6485.csv.gz)
rename(2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz, ./2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz)
rename(2021_11_24_21_15_Gi_pid27095_ins4_thread_3_6485.csv.gz, ./2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins4_thread_3_6485.csv.gz)
rename(2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz, ./2021_11_24/21/2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz)
rename(2021_11_25_20_55_Gi_pid29741_ins5_thread_4_7540.csv.gz, ./2021_11_25/20/2021_11_25_20_55_Gi_pid29741_ins5_thread_4_7540.csv.gz)
rename(2021_11_25_20_55_Gi_pid30842_ins3_thread_2_7489.csv.gz, ./2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins3_thread_2_7489.csv.gz)
rename(2021_11_25_20_55_Gi_pid30842_ins4_thread_3_7488.csv.gz, ./2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins4_thread_3_7488.csv.gz)
rename(2021_11_25_20_55_Gi_pid30842_ins5_thread_4_7489.csv.gz, ./2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins5_thread_4_7489.csv.gz)

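The -n is a dry-run option: rename only shows what it would do without actually renaming anything. Remove it (or replace it with -v for verbose output) when you're sure the rename script is going to do what you want. Note that -n only suppresses the rename itself; the script's own code still runs, so the mkdir calls above create the (empty) directories even during a dry run.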

The script works by first extracting the date and hour portions of each filename (skipping any filenames that don't match). Then it creates the directories for the date and date/hour, and finally prepends that directory path to the filename, which makes rename move the file into it.
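For systems without perl rename, the same three steps (match, create the directories, move) can be sketched in plain shell. This is only a sketch under stated assumptions: sort_by_date_hour is a hypothetical helper name, the optional -n flag is this sketch's own dry-run switch rather than anything taken from rename, and the glob pattern only checks that the prefix consists of digits, not that it is a valid date.

```sh
# sort_by_date_hour [-n] [DIR]: hypothetical helper, not part of any tool.
# Moves DIR/YYYY_MM_DD_hh_* into DIR/YYYY_MM_DD/hh/; with -n it only prints.
# The body runs in a subshell ( ... ) so its variables stay local.
sort_by_date_hour() (
    dry_run=0
    if [ "${1-}" = "-n" ]; then dry_run=1; shift; fi
    dir=${1:-.}
    for f in "$dir"/*; do
        [ -f "$f" ] || continue          # skip directories and an unmatched glob
        name=${f##*/}                    # file name without the leading path
        case $name in
            [0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_*) ;;
            *) continue ;;               # no date/hour prefix: leave it alone
        esac
        rest=${name#??????????_}         # drop "YYYY_MM_DD_"
        day=${name%"_$rest"}             # keep "YYYY_MM_DD"
        hour=${rest%%_*}                 # keep "hh"
        dest=$dir/$day/$hour
        if [ "$dry_run" -eq 1 ]; then
            printf '%s -> %s/%s\n' "$f" "$dest" "$name"
        else
            mkdir -p -- "$dest" && mv -- "$f" "$dest/"
        fi
    done
)
```

From within the Main directory, sort_by_date_hour -n . previews the moves and sort_by_date_hour . performs them. Unlike the rename dry run, this preview does not create any directories.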

This assumes that the filenames are in the current directory. If they aren't, you'll have to adjust the m// matching regex in the first line AND the s=== substitution regex in the second-last line.

Alternate version using the File::Path perl core module (which is included with perl), instead of using mkdir twice (the make_path function works like the mkdir -p shell command):

$ rename -v 'BEGIN {use File::Path qw(make_path)};
             if (m/(^\d{4}_\d\d_\d\d)_(\d\d)/) {
               my $dir = "./$1/$2/";
               make_path $dir;
               s=^=$dir=
             }' *
2021_10_15_23_35_SIP_CDR_pid3894_ins2_thread_1_4718.csv.gz renamed as ./2021_10_15/23/2021_10_15_23_35_SIP_CDR_pid3894_ins2_thread_1_4718.csv.gz
2021_11_24_21_15_Gi_pid25961_ins2_thread_1_6438.csv.gz renamed as ./2021_11_24/21/2021_11_24_21_15_Gi_pid25961_ins2_thread_1_6438.csv.gz
2021_11_24_21_15_Gi_pid27095_ins2_thread_1_6485.csv.gz renamed as ./2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins2_thread_1_6485.csv.gz
2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz renamed as ./2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz
2021_11_24_21_15_Gi_pid27095_ins4_thread_3_6485.csv.gz renamed as ./2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins4_thread_3_6485.csv.gz
2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz renamed as ./2021_11_24/21/2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz
2021_11_25_20_55_Gi_pid29741_ins5_thread_4_7540.csv.gz renamed as ./2021_11_25/20/2021_11_25_20_55_Gi_pid29741_ins5_thread_4_7540.csv.gz
2021_11_25_20_55_Gi_pid30842_ins3_thread_2_7489.csv.gz renamed as ./2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins3_thread_2_7489.csv.gz
2021_11_25_20_55_Gi_pid30842_ins4_thread_3_7488.csv.gz renamed as ./2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins4_thread_3_7488.csv.gz
2021_11_25_20_55_Gi_pid30842_ins5_thread_4_7489.csv.gz renamed as ./2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins5_thread_4_7489.csv.gz

This isn't really any better than the first version, but it does demonstrate that you can use any perl code, any perl module to rename and/or move files.

Third version, this one uses File::Basename to split the input pathname into $path and $file portions. It can cope with filenames in the current directory, or in any other directory. File::Basename is a core perl module, so is included with perl. It provides three useful functions, basename() and dirname() (which work similarly to the shell tools of the same name), and fileparse() which is what I'm using in this script to extract both the basename and the directory into separate variables.

rename -n 'BEGIN {use File::Path qw(make_path); use File::Basename};
           my ($file, $path) = fileparse($_);
           if ($file =~ m/(\d{4}_\d\d_\d\d)_(\d\d)/) {
             my $dir = "$path/$1/$2";
             make_path $dir;
             $_ = "$dir/$file"
           }' /home/cas/rename-test/*
rename(/home/cas/rename-test/2021_10_15_23_35_SIP_CDR_pid3894_ins2_thread_1_4718.csv.gz, /home/cas/rename-test/2021_10_15/23/2021_10_15_23_35_SIP_CDR_pid3894_ins2_thread_1_4718.csv.gz)
rename(/home/cas/rename-test/2021_11_24_21_15_Gi_pid25961_ins2_thread_1_6438.csv.gz, /home/cas/rename-test/2021_11_24/21/2021_11_24_21_15_Gi_pid25961_ins2_thread_1_6438.csv.gz)
rename(/home/cas/rename-test/2021_11_24_21_15_Gi_pid27095_ins2_thread_1_6485.csv.gz, /home/cas/rename-test/2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins2_thread_1_6485.csv.gz)
rename(/home/cas/rename-test/2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz, /home/cas/rename-test/2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins3_thread_2_6485.csv.gz)
rename(/home/cas/rename-test/2021_11_24_21_15_Gi_pid27095_ins4_thread_3_6485.csv.gz, /home/cas/rename-test/2021_11_24/21/2021_11_24_21_15_Gi_pid27095_ins4_thread_3_6485.csv.gz)
rename(/home/cas/rename-test/2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz, /home/cas/rename-test/2021_11_24/21/2021_11_24_21_15_Gi_pid681_ins5_thread_4_6457.csv.gz)
rename(/home/cas/rename-test/2021_11_25_20_55_Gi_pid29741_ins5_thread_4_7540.csv.gz, /home/cas/rename-test/2021_11_25/20/2021_11_25_20_55_Gi_pid29741_ins5_thread_4_7540.csv.gz)
rename(/home/cas/rename-test/2021_11_25_20_55_Gi_pid30842_ins3_thread_2_7489.csv.gz, /home/cas/rename-test/2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins3_thread_2_7489.csv.gz)
rename(/home/cas/rename-test/2021_11_25_20_55_Gi_pid30842_ins4_thread_3_7488.csv.gz, /home/cas/rename-test/2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins4_thread_3_7488.csv.gz)
rename(/home/cas/rename-test/2021_11_25_20_55_Gi_pid30842_ins5_thread_4_7489.csv.gz, /home/cas/rename-test/2021_11_25/20/2021_11_25_20_55_Gi_pid30842_ins5_thread_4_7489.csv.gz)

BTW, it would be trivial to modify this so that it moved the files to a completely different path - just make it do something like my $dir = "/my/new/path/$1/$2"; instead of my $dir = "$path/$1/$2";
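The path handling in the third version can also be illustrated with the shell's own dirname and basename tools. The sketch below only computes the destination; it moves nothing. target_path is a hypothetical helper name, and its optional second argument stands in for a fixed destination such as /my/new/path.

```sh
# target_path FILE [ROOT]: hypothetical helper that prints where FILE would go.
# Runs in a subshell ( ... ) so its variables stay local.
target_path() (
    dir=$(dirname -- "$1")       # directory part of the pathname
    file=$(basename -- "$1")     # file-name part of the pathname
    root=${2:-$dir}              # default: subdirectories next to the file itself
    case $file in
        [0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_*) ;;
        *) exit 1 ;;             # no date/hour prefix: report failure
    esac
    rest=${file#??????????_}     # drop "YYYY_MM_DD_"
    # root / YYYY_MM_DD / hh / original file name
    printf '%s/%s/%s/%s\n' "$root" "${file%"_$rest"}" "${rest%%_*}" "$file"
)
```

Called with one argument it keeps the file under its current directory; called with a second argument it builds the same date/hour layout under that directory instead.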

The key thing to understand about how the perl rename utility works is that iff the rename script modifies the $_ variable then rename will attempt to rename the file to the new value of $_. If $_ is unchanged, it will not try to rename it. This is why you can use any perl code to rename files - all it has to do is change $_. Most often you'll probably use very simple sed-like rename scripts (e.g. rename 's/ +/_/g' * to rename spaces in filenames to an underscore) but the rename algorithm can be as complex as you need it to be.
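The "only rename when the name actually changed" rule is easy to mirror in shell. The sketch below is a rough equivalent of the spaces-to-underscore example, with assumptions: underscore_spaces is a hypothetical helper name, it rewrites only the file-name part of each argument (not any directories in the path), and it does not cope with newlines in names.

```sh
# underscore_spaces FILE...: hypothetical helper, roughly like rename 's/ +/_/g'.
# Runs in a subshell ( ... ) so its variables stay local.
underscore_spaces() (
    for f in "$@"; do
        dir=$(dirname -- "$f")
        old=$(basename -- "$f")
        new=$(printf '%s\n' "$old" | sed 's/  */_/g')   # each run of spaces -> "_"
        [ "$new" = "$old" ] && continue                 # unchanged name: nothing to do
        mv -- "$f" "$dir/$new"
    done
)
```

Something like underscore_spaces * would then process every file in the current directory and leave names without spaces untouched.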

$_ is a very important variable in perl - it's used as the default variable to hold input from file handles and iterators for loops if the programmer doesn't specify one. It's also used as the default operand for several operators (like m//, s///, tr///) and as the default argument for many (but not all) functions. See man perlvar and search for $_ (you'll need to escape that in less as \$_).

BTW, one thing I didn't mention about rename earlier is that it can take filenames either as arguments on the command line or from stdin. It defaults to newline-separated input from stdin (so it won't work with filenames that contain newlines - an annoying but completely valid possibility). You can use the -0 argument to make it use NUL separated input instead of newline-separated...so, it can work with any filenames, taking input from anything that can generate a list of NUL-separated filenames (e.g. find ... -print0, but it's probably better to just use find's -exec ... {} + option).
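To see why NUL separation matters, the following sketch counts the same files twice: once passed NUL-separated through xargs -0, and once read as newline-separated lines. Both function names are hypothetical, and xargs -0 is assumed to be available (it is in the GNU and BSD implementations).

```sh
# count_nul DIR: count regular files under DIR, passing names NUL-separated.
# Assumes few enough files that xargs runs the inner sh only once.
count_nul() {
    find "$1" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh
}

# count_lines DIR: the same count taken from newline-separated output.
# A file name containing a newline is miscounted as two entries here.
count_lines() {
    find "$1" -type f | wc -l
}
```

With a directory holding one ordinary file and one file whose name contains a newline, count_nul reports 2 while count_lines reports 3.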

rename will also refuse to rename a file over an existing file unless you use its -f or --force option.
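A hand-written shell loop gets no such protection from plain mv, so a similar guard has to be added explicitly. A minimal sketch, assuming a hypothetical helper named safe_move; note that the check and the move are two separate steps, so it is not safe against another process creating the target in between.

```sh
# safe_move SRC DEST: hypothetical helper that refuses to overwrite DEST.
safe_move() {
    if [ -e "$2" ] || [ -L "$2" ]; then      # existing file or (dangling) symlink
        printf '%s not moved: %s already exists\n' "$1" "$2" >&2
        return 1
    fi
    mv -- "$1" "$2"
}
```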

Thank you @cas. Amazing answer. I was not aware of prename as I'm new to Linux. Could you please explain the substitution regex s=^=$dir= ? And also how the code would change if I were to put in the path? Thanks again for the brilliant answer :) – nidooooz, Commented Apr 11, 2022 at 8:36
The substitution regex just inserts the new directory at the start (^) of the filename, which causes rename to rename the file. Obviously, this won't work if the start of the "filename" is actually a path. To change it to cope with full pathnames as input, you'd have to either add the new subdirectory in between the existing path and the file's basename, or (easier) replace the entire path with a newly constructed path string. perl's File::Basename core module would help with this, it can easily split a pathname into dir and basename portions. – cas, Commented Apr 11, 2022 at 9:23
Hi @cas, I'm getting the error bash: /bin/find: Argument list too long. I'm running the following command for the third version: find /home/cas/rename-test/ -type f rename -n 'BEGIN {use File::Path qw(make_path); use File::Basename};my ($file, $path) = fileparse($_);if ($file =~ m/(\d{4}_\d\d_\d\d)_(\d\d)/) {my $dir = "$path/$1/$2";make_path $dir;$_ = "$dir/$file"}' {} \; Hope it works – nidooooz, Commented Apr 13, 2022 at 3:39
With find, you can either pipe the filenames into rename (use -print0 with the find command, and -0 with the rename command for NUL-separated filenames), or you can use find's -exec option (-exec rename ..... {} +). If you use + with -exec, find will try to fit as many filenames as will fit into a max length command line, and will run rename as many times as necessary to process all filenames. If you use -exec ... {} \; instead of -exec ... {} +, it will run rename once per filename. In none of these cases will you ever get an arg list too long error. – cas, Commented Apr 13, 2022 at 5:14
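The difference between the two -exec forms can be observed directly by counting how often find starts the command. In this sketch the command is a trivial sh -c that prints one line per invocation; runs_batched and runs_single are hypothetical helper names, and the batched count of 1 assumes the file list fits on a single command line.

```sh
# runs_batched DIR: number of command invocations with -exec ... {} +
runs_batched() {
    find "$1" -type f -exec sh -c 'echo run' sh {} + | wc -l
}

# runs_single DIR: number of command invocations with -exec ... {} \;
runs_single() {
    find "$1" -type f -exec sh -c 'echo run' sh {} \; | wc -l
}
```

For a directory with three files, runs_batched reports 1 and runs_single reports 3. Neither form passes a shell-expanded glob such as * on the command line, which is the usual source of an "Argument list too long" error.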

Stéphane Chazelas · Accepted Answer · 2022-04-11 08:50:49Z


With zsh instead of bash, from within the Main directory:

autoload -Uz zmv   # zmv is an autoloadable function, so load it first
zmodload zsh/files # to get builtin mkdir/mv to speed things up

mkdmv() { mkdir -p -- $2:h && mv -- "$@"; }
zmv -n -P mkdmv '(<->_<->_<->)_(<->)_*.csv.gz' '$1/$2/$f'

(remove the dry-run -n if happy).

zmv will run sanity checks before doing any move to help avoid data loss if there are some collisions, one of its advantages over most other batch renaming utilities.

<-> matches any sequence of ASCII digits. If you want the matching to be more specific, you could do (<1970-2099>_<1-12>_<1-31>)_(<0-23>)_*.csv.gz for instance.
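Outside zsh there is no numeric-range glob, but the same range checks can be approximated in portable shell. A sketch under assumptions: valid_prefix is a hypothetical helper name, the limits are simply the ones from the range pattern above, and it does not check that the day exists in the given month.

```sh
# valid_prefix NAME: succeed only if NAME starts with YYYY_MM_DD_hh_ and each
# field lies within 1970-2099, 1-12, 1-31 and 0-23 respectively.
# Runs in a subshell ( ... ) so its variables stay local.
valid_prefix() (
    name=$1
    case $name in
        [0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]_[0-9][0-9]_*) ;;
        *) exit 1 ;;
    esac
    # Split the first four underscore-separated fields.
    IFS=_ read -r y m d h rest <<EOF
$name
EOF
    # Strip one leading zero so that e.g. 08 is compared as plain decimal 8.
    m=${m#0} d=${d#0} h=${h#0}
    [ "$y" -ge 1970 ] && [ "$y" -le 2099 ] &&
    [ "$m" -ge 1 ] && [ "$m" -le 12 ] &&
    [ "$d" -ge 1 ] && [ "$d" -le 31 ] &&
    [ "$h" -ge 0 ] && [ "$h" -le 23 ]
)
```

A loop over the files could call valid_prefix on each name and skip those for which it fails.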

edited Apr 11, 2022 at 8:50

answered Apr 11, 2022 at 7:45

Stéphane Chazelas


Thank you :) ... – nidooooz, Commented Apr 11, 2022 at 8:55
