
I have file names in the following format, and I would like to concatenate files with cat based on a substring match (Orange, Apple) and a constant (S4, S5).

Further examples of the sample part of the file names: _S6_trimmed_, _S8_trimmed_, _S9_trimmed_, _S10_trimmed_

Orange1_S4_trimmed_1.fastq
Orange1_S4_trimmed_2.fastq
Orange2_S4_trimmed_1.fastq
Orange2_S4_trimmed_2.fastq

Apple1_S4_trimmed_1.fastq
Apple1_S4_trimmed_2.fastq
Apple2_S4_trimmed_1.fastq
Apple2_S4_trimmed_2.fastq

Orange1_S5_trimmed_1.fastq
Orange1_S5_trimmed_2.fastq
Orange2_S5_trimmed_1.fastq
Orange2_S5_trimmed_2.fastq

Apple1_S5_trimmed_1.fastq
Apple1_S5_trimmed_2.fastq
Apple2_S5_trimmed_1.fastq
Apple2_S5_trimmed_2.fastq

What I want to do, repeated for several samples (S4, S5, ...):

cat Orange*_S4_trimmed_1.fastq >Orange_S4_trimmed_1.fastq
cat Orange*_S4_trimmed_2.fastq >Orange_S4_trimmed_2.fastq

cat Apple*_S4_trimmed_1.fastq >Apple_S4_trimmed_1.fastq
cat Apple*_S4_trimmed_2.fastq >Apple_S4_trimmed_2.fastq

Here is a script I wrote in bash:

#!/bin/bash

filename="samples.txt"

while read -r sample;
do
    echo $sample
    cat ${sample}_trimmed_1.fastq >${sample}_trimmed_1.fastq
    cat ${sample}_trimmed_2.fastq >${sample}_trimmed_2.fastq

done <$filename

Here is the format of my samples.txt file:

 samples.txt
 Apple*_S4
 Apple*_S5
 Orange*_S4
 Orange*_S5

Is there a better way to do this on a large group of files? Thanks for your help in advance.

A solution I'm currently working on, based on Bodo's comments:

#!/bin/bash

for file in *1_*_trimmed_1.fastq;
do
    echo $file

    subs=`echo $file | cut -d_ -f1 | tr -d 0-9`
    echo $subs

    sample=`echo $file | cut -d_ -f2`
    echo $sample

    cat ${subs}*_${sample}_trimmed_1.fastq >${subs}_${sample}_trimmed_1.fastq
    cat ${subs}*_${sample}_trimmed_2.fastq >${subs}_${sample}_trimmed_2.fastq

done
  • Please edit your question and add more details. Do the groups of files that should be concatenated always have one file with number 1 (and others with 2 and maybe more numbers) for the placeholder * as in Apple*_S4_trimmed_1.fastq? In this case my suggestion would be to loop over all files with 1 and construct the corresponding concatenation command.
    – Bodo
    Commented Jul 19, 2023 at 16:20
  • Yes, the placeholder (*) will be 1 or 2. Commented Jul 19, 2023 at 17:19
  • Please add this to your question.
    – Bodo
    Commented Jul 19, 2023 at 18:27
  • Please add what's not working with the solution based on my comment. Specify in your question if the file names always contain _S4_trimmed_ or _S5_trimmed_ or show a longer list of example file names. This might help to propose commands to construct the file names. Better use $( ... ) instead of backticks for command substitution.
    – Bodo
    Commented Jul 19, 2023 at 18:37
  • Side note: you may get away with not quoting here, but sooner or later it will bite you. Get used to quoting properly. Commented Jul 20, 2023 at 4:45

1 Answer

1. Paste all source files to their destination

As in the sample code in your question, every file of a file set gets written to its destination file:

#!/bin/sh

destination=""

# Select all source files, but not any existing destination files:
for file in *[0-9]*_S[0-9]*_trimmed_*[0-9].fastq
do
    # Skip the unexpanded pattern if no file matches:
    [ -e "${file}" ] || continue
    if [ "${destination}" != "${file%%[0-9]*}_${file#*_}" ]
    then
        # Switch to next file set:
        destination="${file%%[0-9]*}_${file#*_}"
        echo "${destination}"
    fi
    # Copy the current source file to destination
    cat "${file}" >"${destination}"
done

This is useful almost only when the destination files are named pipes, which pass the received data on elsewhere.
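The destination name in the script above comes from two parameter expansions. The following minimal sketch isolates them so the result can be checked for a single file name; derive_destination is a hypothetical helper name used only for illustration and is not part of the answer's script.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): build the destination name
# with the same two expansions the loop uses.
derive_destination() {
    file=$1
    # ${file%%[0-9]*} removes the longest suffix that starts at a digit,
    #                 leaving the name prefix, e.g. "Orange".
    # ${file#*_}      removes the shortest prefix ending in "_",
    #                 leaving e.g. "S4_trimmed_1.fastq".
    printf '%s\n' "${file%%[0-9]*}_${file#*_}"
}

derive_destination "Orange1_S4_trimmed_1.fastq"   # prints Orange_S4_trimmed_1.fastq
derive_destination "Apple2_S5_trimmed_2.fastq"    # prints Apple_S5_trimmed_2.fastq
```

Because the resulting names have no digit before _S, they do not match the glob used to select the source files, so existing destination files are not picked up as sources.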


2. Paste only the last source file to its destination

If the destination is a regular file, writing every source file to it is a tremendous overhead: the content gets overwritten each time, so only the last file written persists.

This second solution avoids that overhead and minimizes the amount of data written:

#!/bin/sh

destination=""
source=""

# Select all source files, but not any existing destination files:
for file in *[0-9]*_S[0-9]*_trimmed_*[0-9].fastq
do
    # Skip the unexpanded pattern if no file matches:
    [ -e "${file}" ] || continue
    if [ "${destination}" != "${file%%[0-9]*}_${file#*_}" ]
    then
        if [ "${destination}" != "" ]
        then
            # Not the first file set. Write the last file of the previous file set to its destination:
            cat "${source}" >"${destination}"
        fi
        # Switch to next file set:
        destination="${file%%[0-9]*}_${file#*_}"
        echo "${destination}"
    fi
    # Store the current source file name for the next cycle
    source="${file}"
done
# Write last source file of the last file set to the destination, if it exists:
if [ "${destination}" != "" ]
then
    cat "${source}" >"${destination}"
fi

As you used cat for your file duplication, I used it too. Usually, though, one would use cp instead, replacing the line cat "${file}" >"${destination}" (cat "${source}" >"${destination}" in the second script) with

cp "${file}" "${destination}"

If neither the source nor the destination files will be edited anymore, consider creating a hard link. Then no data is duplicated at all (the destination file uses no extra disk space); only the destination file name is created, pointing to the same data area (inode) as the selected source file. Both names must be on the same file system:

cp -l --remove-destination "${file}" "${destination}"
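The long option --remove-destination is a GNU coreutils extension. Where it is not available, ln -f may serve as a substitute, since POSIX specifies that -f removes an existing destination before creating the link. A minimal sketch, with link_file as a hypothetical helper name:

```shell
#!/bin/sh
# Sketch only: link_file is a hypothetical helper name.
# ln -f removes an existing destination path, then creates it as a
# hard link to the source. Both paths must be on the same file system.
link_file() {
    ln -f "$1" "$2"
}

# Usage inside the loop (assumption: same variables as in the scripts above):
#   link_file "${file}" "${destination}"
```

After linking, both names refer to the same data, so a later in-place change made through one name is visible through the other.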

For huge files (like video data), this would be a great speed-up.
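If the goal is instead what the sample commands in the question show (all files of a set concatenated into one destination file), appending with >> rather than overwriting with > would be needed. The following is a sketch under that assumption, not a tested replacement for the scripts above; concat_sets is a hypothetical helper name. It does not assume that the files of one set are adjacent in the glob's sort order, so it empties every destination in a first pass and appends in a second.

```shell
#!/bin/sh
# Sketch only: concat_sets is a hypothetical helper name.
# Run it in the directory that holds the .fastq files.
concat_sets() {
    # Pass 1: create or empty each destination, so a rerun does not
    # append the same data a second time.
    for file in *[0-9]*_S[0-9]*_trimmed_*[0-9].fastq
    do
        [ -e "${file}" ] || continue    # pattern matched nothing
        : >"${file%%[0-9]*}_${file#*_}"
    done
    # Pass 2: append each source file to its destination.
    for file in *[0-9]*_S[0-9]*_trimmed_*[0-9].fastq
    do
        [ -e "${file}" ] || continue
        cat "${file}" >>"${file%%[0-9]*}_${file#*_}"
    done
}

# Usage: cd into the data directory, then call
#   concat_sets
```

Within each destination, the source files appear in the order the shell expands the glob, which for the example names means Orange1 before Orange2.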

  • Answer updated: Added a second version of the script to cover the use-case of named pipes. // Fixed an error in the original script.
    – dodrg
    Commented Jul 21, 2023 at 20:57
  • Did this solve your question ?
    – dodrg
    Commented Jul 31, 2023 at 7:46
