
I have a folder that holds hundreds of thousands of files called hp-temps.txt (there are also tons of subfolders).

The content of these files looks like this for example:

Sensor   Location              Temp       Threshold
------   --------              ----       ---------
#1        PROCESSOR_ZONE       15C/59F    62C/143F 
#2        CPU#1                10C/50F    73C/163F 
#3        I/O_ZONE             25C/77F    68C/154F 
#4        CPU#2                32C/89F    73C/163F 
#5        POWER_SUPPLY_BAY     9C/48F     55C/131F 

I need to parse through all the files and find the highest temperature in the #1 (PROCESSOR_ZONE) line.

I have a working script but it takes a very long time, and I was wondering, if there is any way to improve it.

Since I'm rather new in Shell Scripting, I imagine this code of mine is really inefficient:

#!/bin/bash
highestTemp=0
temps=$(find "$1" -name hp-temps.txt -exec cat {} + | grep 'PROCESSOR' | cut -c 32-33)
for t in $temps
do
  if [ "$t" -gt "$highestTemp" ]; then
    highestTemp=$t
  fi
done
echo "$highestTemp"

EDIT:

There has been a very efficient answer, but I forgot to mention that I don't only need the biggest value.

I would like to loop through all the files, since I'd like to output the path of the file and the temperature whenever a higher value is detected.

So the output could look like this for example:

New MAX: 22 in /path/to/file/hp-temps.txt
New MAX: 24 in /another/path/hp-temps.txt
New MAX: 29 in /some/more/path/hp-temps.txt
  • Since find does not return files in any particular order, it makes little sense to want any result other than the final maximum and its pathname.
    – Kusalananda
    Commented Sep 18, 2022 at 16:34

1 Answer


Storing intermediate data in a string is going to be slow, and it is very seldom necessary. There is an additional issue in the general case: each sub-string may contain spaces or other characters that the shell splits on when you later use the variable unquoted in a for loop (it would be better to use an array).
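For completeness, a sketch of the array variant of the original script (assuming bash 4+ for mapfile; note that grep needs -h here to suppress the filename prefixes that the original cat pipeline did not produce):

```shell
#!/bin/bash
# Sketch: same logic as the original script, but values land in an
# array so each temperature stays a separate element.
mapfile -t temps < <(
    find "$1" -name hp-temps.txt -exec grep -h 'PROCESSOR' {} + | cut -c 32-33
)
highestTemp=0
for t in "${temps[@]}"; do
    if [ "$t" -gt "$highestTemp" ]; then
        highestTemp=$t
    fi
done
echo "$highestTemp"
```

This only fixes the word-splitting hazard; it still builds all values in memory first, which the streaming approach below avoids.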

In this case, finding each file, extracting the temperature, and reading that stream of temperatures would be more efficient. It would also avoid creating a shell variable with a 300 KB (or more) string in it.

You may parse out the temperature in Celsius from one file using

awk '$2 == "PROCESSOR_ZONE" { printf "%d\n", $3 }' file

It outputs the temperature from the 3rd field when the 2nd field is exactly the string PROCESSOR_ZONE. Since the 3rd field is converted to an integer when written with %d, only its leading digits, up to the first non-digit character, are output.
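Running it against the sample line from the question shows the coercion at work:

```shell
# "15C/59F" is converted to the integer 15 by printf "%d"
printf '#1        PROCESSOR_ZONE       15C/59F    62C/143F\n' |
awk '$2 == "PROCESSOR_ZONE" { printf "%d\n", $3 }'
# prints 15
```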

Calling this from find:

find . -name hp-temps.txt \
    -exec awk '$2 == "PROCESSOR_ZONE" { printf "%d\n", $3 }' {} +

This executes the awk command for one or several batches of found files and outputs the temperatures, one after the other on standard output.

If you are using an awk that understands the non-standard nextfile statement, then you may use this to skip ahead to the next file as soon as possible:

find . -name hp-temps.txt \
    -exec awk '$2 == "PROCESSOR_ZONE" { printf "%d\n", $3; nextfile }' {} +

To find the largest value output by the above command, we may use one more awk command:

awk 'NR == 1 || $1 > max { max = $1 } END { print max }'

This sets the awk variable max to the current input value if it is the first value or the largest seen so far. At the end, the value of max is printed.

I would expect this to be many times faster than a shell loop.

Putting this together:

find . -name hp-temps.txt \
    -exec awk '$2 == "PROCESSOR_ZONE" { printf "%d\n", $3; nextfile }' {} + |
awk 'NR == 1 || $1 > max { max = $1 } END { print max }'

There was an additional request to also pick out the filename of the file with the largest value. We'll do that by simply passing the filename along with the value from each file. In awk, the pathname of the current input file is available as the special variable FILENAME.

find . -name hp-temps.txt \
    -exec awk '$2 == "PROCESSOR_ZONE" { printf "%d\t%s\n", $3, FILENAME; nextfile }' {} + |
awk 'NR == 1 || $1 > max { max = $1; fname = $2 } END { print max, fname }'

If several files have the same maximum value, this would report the filename of the first one found by find. The find utility finds files in the same order as ls -f would list them.
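To get the running "New MAX" log requested in the EDIT, the same temperature/filename stream can be read in a shell loop instead of the second awk; a sketch (note that max lives in the pipeline's subshell, so it is not available after the loop ends):

```shell
find . -name hp-temps.txt \
    -exec awk '$2 == "PROCESSOR_ZONE" { printf "%d\t%s\n", $3, FILENAME; nextfile }' {} + |
while IFS=$'\t' read -r temp fname; do
    # Report only when a new maximum is seen
    if [ "$temp" -gt "${max:-0}" ]; then
        max=$temp
        printf 'New MAX: %s in %s\n' "$temp" "$fname"
    fi
done
```

Splitting on a tab (IFS=$'\t') keeps pathnames containing spaces intact, since awk writes exactly one tab between the value and the filename.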

  • This is great. I learned a lot. Thank you for your detailed explanation. I would however like to have a loop so I can handle some more stuff. I tried your approach and it is way, way faster even when using a loop. I would for example also like to get the location of the file with the highest value.
    – Lumnezia
    Commented Sep 18, 2022 at 15:03
  • @Lumnezia You should make sure that your question is complete.
    – Kusalananda
    Commented Sep 18, 2022 at 15:05
  • Sorry, I thought I would be able once I know how to efficiently load the files. I'll update the Question asap.
    – Lumnezia
    Commented Sep 18, 2022 at 15:28
