
I would like to request a review of an Awk program I have written. This script meets an immediate business need and, ideally, serves as a coding style example for new hires (who may not yet be proficient in Awk). Consequently, the script is not as compact as it could be, and not all error conditions (e.g., negative data amounts, empty machine names) are checked. The main review questions are, therefore, more pedagogical in nature:

  • Would you put all of the logic into the Awk program, or leave some of the validation in the shell driver?
  • Does the Awk program need more comments?
  • Can you think of anything else that would make the programs easier for novices to understand?
  • With easy maintenance in mind, is there any reason to change the coding style?

I would also like to ask a more expert-level question.

  • Are there more intuitive ways to sort the output? Prepending 1; or 2; or 3; works, but.... meh.

First, the specification:

Objective

Create an AWK program to process a semicolon-separated data file containing information about machine data usage over two months. The program should validate the data, identify and report any errors or inconsistencies, and calculate usage ratios between the two months for each machine.

Input Data Format

  • Each record is semicolon-separated.
  • Fields:
    1. Machine name (string)
    2. Month (in MM.YYYY format)
    3. Data size (numeric value with a "GB" suffix)

Requirements

  1. Input Validation:

    • Ensure each record's second field (month) follows the MM.YYYY format.
    • Ensure each record's third field (data size) is a valid numeric value followed by "GB".
    • If a record does not conform to these formats, print an error message and terminate the program.
  2. Duplicate Handling:

    • Detect and handle duplicate entries for the same machine and month.
    • If running in strict mode, terminate the program on encountering a duplicate.
    • If not in strict mode, print a warning and continue processing.
  3. Data Consistency:

    • Ensure the data set contains records from exactly two distinct months.
    • Ensure each machine has exactly one record for each of the two months.
    • If a machine does not have records for both months, exclude it from the ratio calculation and print a warning.
  4. Ratio Calculation:

    • For each machine with valid records for both months, calculate the ratio of data size between the second and the first month.
    • If the ratio is outside the 95-105% range, print a warning for that machine.
  5. Output:

    • Print the ratio for each machine along with the original data sizes if the verbose mode is enabled.
    • Print a summary message indicating whether all machines have ratios within the 95-105% range.

Typical input looks like this (cat 04.csv 05.csv | shuf | head -n 20):

05.2024;1178.88 GB;IMBI-0002
04.2024;8259.91 GB;OCI-0001
04.2024;1972.24 GB;MATHI-0001
05.2024;4377.87 GB;TEST-0001
05.2024;42930.72 GB;IUP-0001
05.2024;98.79 GB;URZ-0002
05.2024;6999.23 GB;IDF-0001
04.2024;0.11 GB;INTERN
04.2024;13560.51 GB;IMSE-A-0001
05.2024;125161.29 GB;MEDMA-0002
05.2024;900878.88 GB;HITS-A-0001
05.2024;4704.86 GB;IPMB-0001
05.2024;74438.6 GB;ZMBH-A-0001
05.2024;2935.98 GB;ZUV-0004
04.2024;958.78 GB;DBIO-0001
04.2024;1244.26 GB;ZITI-0003
05.2024;17610.2 GB;UMA-0002
04.2024;654.47 GB;URZ-A-0001
04.2024;90342.36 GB;IWR-0001
05.2024;47060.74 GB;GEOG-0001

Calling the driver:

./monitor-data-changes.sh 04.csv 05.csv lenient verbose

Typical output:

ZUV-0003                    100% [    161.98 GB ->     161.29 GB]
ZUV-0004                     99% [   2966.57 GB ->    2935.98 GB]
ZUV-0005                    101% [      3.06 GB ->       3.08 GB]
Ratio for COS-0001     is outside the 95-105% range
Ratio for DBIO-0001    is outside the 95-105% range
Ratio for DMATH-0001   is outside the 95-105% range

The driver script:

#!/bin/bash

# This is a driver for Awk script "./monitor-data-changes.awk",
# which see for detailed explanations.

# Thure Dührsen, 2024-06-11..12        Initial version

if [ "$#" -lt 3 ] || [ "$#" -gt 4 ]; then
    echo "Usage: "
    echo "             $0 file1 file2 strict"
    echo "             $0 file1 file2 lenient"
    echo "             $0 file1 file2 strict  verbose"
    echo "             $0 file1 file2 lenient verbose"
    echo
    echo "where file1 and file2 are CSV files,"
    echo "and 'strict', 'lenient', 'verbose' are typed exactly as shown."
    exit 2
fi

if [ ! -f "$1" ]; then
    echo "First argument is not a file"
    exit 2
fi

if [ ! -f "$2" ]; then
    echo "Second argument is not a file"
    exit 2
fi

if [ "$3" != 'strict' ] && [ "$3" != 'lenient' ]; then
    echo "Third argument is neither 'strict' nor 'lenient'"
    exit 2
fi

if [ "$#" -eq 4 ] && [ "$4" != 'verbose' ]; then
    echo "Fourth argument, if present, must be 'verbose'"
    exit 2
fi

awk -F';' -v OFS=';' '
/^#/ {next} # Skip comment lines
/^$/ {next} # Skip empty lines
NF != 3 {
    print "need exactly three fields per line" > "/dev/stderr"
    exit 1
}
{print $3, $1, $2}
' "$1" "$2"                                                      |
awk -f ./monitor-data-changes.awk -v strict="$3" -v verbose="$4" |
sort -t';' -k1,1 -k2,2 -k3,3                                     |
cut -d';' -f2-                                                   |
column -s ';' -t

The Awk script:

# Thure Dührsen, 2024-06-11..12        Initial version
# 2024-06-22                           Ensure consistent ratio calc on unsorted input

# For any number of machines, calculates the ratio of data usage
# between two months for each machine: amount of data in later month
# divided by amount of data in earlier month, expressed as a
# percentage.

# Ensures that data is from exactly two months and that each machine
# has at least one record per month.

# Optionally checks whether there is more than one record per month
# and machine and if so, stops processing the entire file.

# Prints the ratio and flags machines with ratios outside
# the 95-105% range.

# Command-line options (to be used in ./monitor-data-changes.sh):

# -v strict="strict"    Enable strict mode: Reject more than one record
#                       for each (machine, month) pair and stop processing
#                       the entire file if this happens
# -v strict="lenient"   Disable strict mode
#
# -v verbose="verbose"  Enable verbose mode: print ratios and sizes for
#                       all machines

# Preconditions:

# Input is CSV, concatenated from two months, with three fields in each line.
# Semicolon as field delimiter,
# first column is a machine name,
# second column is month and year (MM.YYYY),
# third column is a file size given in GB (floating-point
# number plus the string " GB").

# Exit with an error
function errprint(message) {
    print "ERROR: " message > "/dev/stderr"
    error_occurred = 1 # global error flag
    exit 1 # skip to END block
}

# Warn about bad data, but continue processing
function noteprint(message) {
    print "NOTE: " message > "/dev/stderr"
}

BEGIN {
    FS = ";"
    # Traverse arrays ordered by indices in ascending order compared
    # as strings
    # https://www.gnu.org/software/gawk/manual/html_node/Controlling-Scanning.html
    PROCINFO["sorted_in"] = "@ind_str_asc"
    error_occurred = 0  # Initialize error flag, global
    if (strict == "lenient") {strict = 0}
}

{
    key = $1 FS $2
    if (!seen[key]) {
        if ($2 !~ /^(0[1-9]|1[0-2])\.[0-9]{4}$/) {
            # TODO: Better check for year range
            errprint("Not a valid MM.YYYY: " $2)
        }
        if ($3 !~ /^[0-9]+(\.[0-9]+)?[[:space:]]*GB$/) {
            errprint("Not a valid size: " $3)
        }
        data[key] = $3
        seen[key] = 1
    } else {
        msg = "Duplicate data for machine " $1 " in month " $2
        if (strict) {
            errprint(msg)
        } else {
            noteprint(msg)
        }
    }

    months[$2]++
    machines[$1]++
}

END {
    # Skip the END block if an error occurred
    if (error_occurred) {
        exit 1
    }

    if (length(months) != 2) {
        msg="Data across the entire set is not from exactly two months"
        errprint(msg)
    }

    for (machine in machines) {
        count = 0
        for (month in months) {
            if ((machine FS month) in data) {
                count++
            }
        }
        if (count != 2) {
            msg = "Machine " machine " does not have exactly one record per month"
            noteprint(msg)
            delete machines[machine]
        }
    }

    allok = 1
    for (machine in machines) {
        split("", sizes)  # Clear sizes array
        i = 1
        for (month in months) {
            sizes[i++] = data[machine FS month]
        }

        ratio = sizes[2] / sizes[1] * 100

        if (verbose) {
            printf "1;%s;%9.0f%% [%10.2f GB -> %10.2f GB]\n",
                machine, ratio, sizes[1], sizes[2]
        }

        if (ratio < 95 || ratio > 105) {
            allok = 0
            print "2;Ratio for " machine "; is outside the 95-105% range"
        }
    }
    print (allok ? "3;Ratio for all machines in 95-105% range" : "")
}
  • Respectfully, these are the wrong tools for the job. There are myriad ways this data could be loaded into a spreadsheet and then analysed, reported on, or graphed. Imagine that 'uptime' is added as a fourth column to the CSV. Almost every line of these two scripts would have to be evaluated and possibly adapted/extended, then tested. Many years ago I considered writing code to handle a complex year-end reconciliation. Glad I chose to use an XLS instead, because every year there has been a slight or major alteration in the requirements... KISS... – Fe2O3
  • @Fe2O3 Except that I do not have access to a spreadsheet. On the server this runs on, I have the GNU core utilities and not much else.
  • As I type these words of reply, I am one mouse click away from launching the MS-Excel clone known as Google Sheets. You can take advice offered to make your life easier from those who've been this way before, or you can do things your own way... Cheers! – Fe2O3
  • @Fe2O3 No spreadsheets for this use case, thanks.

2 Answers


ISO dates

Yeah, I get it that you have a pre-defined data format.

Input Data Format

Month (in MM.YYYY format)

But that's significantly less than ideal. If feasible, try to get new records produced in ISO 8601 format, so they look like 2024-05. Then lexical order matches chronological order.
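
For illustration, with nothing more than sort(1) (these two toy commands are not part of the scripts under review):

$ printf '%s\n' 12.2023 01.2024 | sort     # MM.YYYY: the later month sorts first
01.2024
12.2023
$ printf '%s\n' 2023-12 2024-01 | sort     # ISO YYYY-MM: lexical order is chronological
2023-12
2024-01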

language choice

awk is great for one-liners, or brief scripts with a handful of lines. For example, your "need exactly three fields per line" validator is a good match.

Once you start worrying about scaleup to the software engineering challenges of an organization full of developers, other scripting languages begin to look like a better fit. Python is not the only one, but that's the example I will use here. You're looking for good tooling such as linters, and a robust ecosystem that can provide things like date conversion libraries.

The title suggested we're using the language of Aho, Weinberger, and Kernighan. But some code details reveal that we're dependent on gawk instead. I will assume we need GNU Awk 5.1.0 or greater.

exceptions

One of the best features offered by modern languages, including modern scripting languages, is the ability to abort via exception. This promotes composition of functions, subject to correctness constraints.

The OP software is mostly mainline code, with a few helpers, so composition is less of a concern. The fatal errprint() function is definitely useful.

pipeline errors

The driver should definitely set -e -o pipefail so we obey error reports and bail out at once. As written, I'm concerned that erroneous input may leave us with lots of "plausible looking" output which subsequent steps might naïvely go on to trust and consume.
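
A minimal sketch of what the top of the driver could look like (whether to also add set -u is a matter of taste):

#!/bin/bash
set -e          # stop at the first failing command
set -u          # treat expansion of unset variables as an error
set -o pipefail # a pipeline fails if any stage fails, not just the last one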

sort order

more intuitive ways to sort the output?

The OP "decorate, sort, strip" makes perfect sense.

In other language environments we might supply key=my_collation_function to accomplish the same thing. More critically, sorting date objects instead of strings would be a natural fit to this use case.

cracking argv

Having settled on bash + awk, the OP approach makes perfect sense. But it is tedious. If written in python we could "import typer", write a one-line function signature, and get {parsing, diagnostics, --help} "for free".

modes

I worry that we've introduced a new mode because "it was too easy", rather than because the design called for it. Commands are easier for end users to learn if they each focus on a single responsibility.

Duplicate Handling

  • If running in strict mode, terminate the program on encountering a duplicate.
  • If not in strict mode, print a warning and continue processing.

There seem to be two distinct use cases here, one for producing numeric results that will be consumed downstream and one for pursuing the action item of "diagnose and fix upstream bug".

Consider breaking out separate validate-data-changes and monitor-data-changes scripts. I anticipate they will re-use core parsing functions, perhaps via import of a common module. The awk solution for such software re-use challenges appears to be synthesizing scripts from fragments using tools like m4.
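
Since the script already depends on gawk, its @include directive (resolved via the current directory and AWKPATH) is one way to share such fragments without reaching for m4. A minimal sketch, with hypothetical file names:

# common.awk -- shared helpers
function errprint(message) {
    print "ERROR: " message > "/dev/stderr"
    exit 1
}

# validate-data-changes.awk -- reuses the shared helper
@include "common.awk"
NF != 3 { errprint("need exactly three fields per line") }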

helpful comment

    # Traverse arrays ordered by indices in ascending order compared as strings
    # https://www.gnu.org/software/gawk/manual/html_node/Controlling-Scanning.html
    PROCINFO["sorted_in"] = "@ind_str_asc"

Thank you, thank you! I would never have understood the meaning without that URL reference.

anchors

I really like the ^...$ "conservative" regex anchoring, thank you. It makes it easier to gain confidence that the expression matches exactly what you think it should match.

uniquifying

I don't understand the double assign:

        data[key] = $3
        seen[key] = 1

Why not just jam $3 into seen, and elide data? (The "not seen" idiom is very nice, do keep that!)
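
Something along these lines, as a sketch of that suggestion (testing membership with in also side-steps any question of whether the stored value is truthy):

BEGIN { FS = ";" }

{
    key = $1 FS $2
    if (!(key in seen)) {
        seen[key] = $3      # the stored value doubles as the data size
    } else {
        print "NOTE: Duplicate data for machine " $1 " in month " $2 > "/dev/stderr"
    }
}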

sorting

Consider asking /usr/bin/sort to order the rows before gawk ever sees them. (A one-liner might re-jigger the field order before that happens.) And then you needn't worry about the memory needs of an associative array. Let /usr/bin/sort worry about memory management for large inputs. When processing sorted rows you can just maintain previous row in prev and compare against that.

If you're keen on pipelines, consider asking join to put both months' observations on a single line. That makes it very easy for your script to find them, validate them, and do some arithmetic.
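
A sketch of that idea against the raw files (april.tmp and may.tmp are hypothetical scratch files; note that join silently drops machines missing from either month unless you add -a):

# raw lines look like "04.2024;8259.91 GB;OCI-0001"; reorder to "machine;size" and sort
awk -F';' -v OFS=';' '{print $3, $2}' 04.csv | sort -t';' -k1,1 > april.tmp
awk -F';' -v OFS=';' '{print $3, $2}' 05.csv | sort -t';' -k1,1 > may.tmp

# one line per machine: name;size-in-april;size-in-may
join -t';' -j1 april.tmp may.tmp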

dup detection

Consider storing sorted rows in a temp file, and compare sort -u output against that, in order to report fatal "Duplicate data" error.
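
For example, with the rows already reordered to machine;month;size by the first awk stage (records.tmp is a hypothetical scratch file; with -u and an explicit key, sort judges uniqueness on the machine and month fields only):

sort -t';' -k1,2    records.tmp > all.tmp
sort -t';' -k1,2 -u records.tmp > uniq.tmp
cmp -s all.tmp uniq.tmp || { echo "ERROR: Duplicate data" >&2; exit 1; }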

random sort order

I just don't even begin to understand this:

    for (machine in machines) { ...
        for (month in months) {

(BTW, thank you kindly for the helpful "Clear sizes array" comment.)

Typically we will scan a month sequence like

  • 09.2023
  • 10.2023

where the leading zero saves us, giving a sensible collation order. But a few months later we encounter this:

  • 12.2023
  • 01.2024

What?!? The zero sorts first, so we visit January followed by December. The effect is that we report the reciprocal of the usage ratio, so an increase is reported as a decrease, and vice versa. If the symmetric 95 .. 105 range became something like 95 .. 110, the impact would be much worse.

The fundamental issue is we chose an inappropriate calendar representation. Yes, we could finesse it with subkeys for sorting. But it would be far more sensible to parse out something like a dt.datetime or even an ISO string.
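
A sketch of that subkey finesse (month_key is a hypothetical helper): turn MM.YYYY into an ISO-like YYYY-MM string and index months[] by that key, so @ind_str_asc visits the earlier month first even across a year boundary:

# "12.2023" -> "2023-12", "01.2024" -> "2024-01"
function month_key(mmyyyy) {
    return substr(mmyyyy, 4) "-" substr(mmyyyy, 1, 2)
}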

fatal error

I'm not sure this comment is accurate.

function errprint( ... ) {
    ...
    exit 1 # skip to END block
}

My man page explains that will "exit immediately", with no skipping involved.

END {
    # Skip the END block if an error occurred
    if (error_occurred) {
        exit 1
    }

I doubt that that particular exit ever executes, since we bailed previously.

multiple errors

It looks like there may have been some intent to scan the entire input, reporting errors along the way, and then exit 1.

As written, it appears the OP code reports just the first error encountered, and immediately bails out.

new machine

I don't understand this test:

        if (count != 2) {
            msg = "Machine " machine " does not have exactly one record per month"
            noteprint(msg)
            delete machines[machine]
        }

Consider this example. We run against {April, May}. Machine Bob was added in May, and we see a single record for that. So count is 1, Bob simply did not exist in April. And then suppose Bob is decommissioned in November, with similar effect.

My difficulty is that

  • The diagnostic is misleading / inaccurate.
  • This is not an error condition, it is completely normal.

This codebase achieves a subset of its design goals, producing correct output for more than half of its intended input pairs. There are some architectural issues that may be worth revisiting.

I would not be willing to delegate or accept maintenance tasks on this codebase in its current form. I would be reluctant to offer it as an exemplar to junior devs who are learning to write production-ready code.

  • Nit: Wrapping a regex with ^...$ does not actually convert an existing regex into a full-match regex, because it does not work if any | appears outside a group, e.g. ^0[1-9]|1[0-2]$. The correct way to do this is to wrap it in a non-capturing group instead (^(?:...)$), which also preserves backreferences. Besides, groups that don't need to be captured should be non-capturing anyway, though this is mostly for performance reasons. – Voile
  • Regarding the exit statement, the manual says, "An exit statement that is not part of a BEGIN or END rule stops the execution of any further automatic rules for the current record, skips reading any remaining input records, and executes the END rule if there is one." I therefore believe the comment "skip to END block" to be accurate, as no exit statement occurs within the BEGIN block.
  • Thank you so much for the comprehensive review! I will also take a page out of your book regarding your style of writing. I will try to get the customer that supplies these records to adhere to ISO 8601 and will incorporate the rest of your suggestions into my script. Rewriting in Python is very much an option.

Ah, good old awk. I believe this program is almost 50 years old!

Your script is a good illustration of what can be achieved with this tool. But as hinted already, it may not be the best tool for the job. If I had to choose, it would probably be Python (probably already installed on your system) and optionally Pandas.

I don't think that sysadmins are really required to be familiar with awk nowadays. Or even Perl. I confess I've done a bit of awk today (it's been years) but there are so many things to learn that I would not ask new hires to spend too much time on "ancient" tech.

awk is a good tool for line-by-line processing, some tallying and so on, but if you need to perform things like aggregation, the limits quickly become apparent and you have to resort to hacky code.

If I could give a word of advice: don't reuse references like $2 and $3 throughout your code; instead, assign them to variables as soon as possible. This makes the code easier to read, and therefore less bug-prone, and also more flexible: the file format could change over time, and it's easier to change the mapping of fields in just one place.
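
A sketch of the main block with named fields, in the machine;month;size order that the driver feeds into the awk script:

BEGIN { FS = ";" }

{
    machine = $1
    month   = $2            # MM.YYYY
    size    = $3            # e.g. "1178.88 GB"

    data[machine FS month] = size
    months[month]++
    machines[machine]++
}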

You are already writing user-defined functions in awk, so you could take this further, for example by moving the validation routines out of the main block. At the moment they are very short but they could get more complex later.
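
For example, the two format checks could become small predicates, keeping the main block short (same regexes as in the current script; the inline print-and-exit stands in for errprint() to keep the sketch self-contained):

BEGIN { FS = ";" }

# 1 if m looks like MM.YYYY, else 0
function is_valid_month(m) {
    return m ~ /^(0[1-9]|1[0-2])\.[0-9]{4}$/
}

# 1 if s looks like a number followed by "GB", else 0
function is_valid_size(s) {
    return s ~ /^[0-9]+(\.[0-9]+)?[[:space:]]*GB$/
}

{
    if (!is_valid_month($2)) { print "ERROR: Not a valid MM.YYYY: " $2 > "/dev/stderr"; exit 1 }
    if (!is_valid_size($3))  { print "ERROR: Not a valid size: " $3  > "/dev/stderr"; exit 1 }
}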

I did some awk today because I have to parse third-party files and have no control over them. I don't have the full picture, but I am assuming (maybe wrongly) that the source file is generated by your own system. In that case, maybe the data should live in a database instead, and you could run plain SQL queries, for example.

Now let's say the data comes from an Active Directory extract or something similar; then PowerShell would be an option. In fact, the job would have been easier in PowerShell (and it runs on Linux too). I am just wondering whether there isn't a more straightforward way to get the results you want, without all those transformations. But this is more an organizational question than a critique of your code.

  • I will definitely take your advice and assign the values of numbered fields to named variables. The data does not originate on my system; the customer sends it.
