How can I turn ugly output into pretty, useful information?

Question

How can I turn this ugly output into pretty, useful data?

The output:

/* ---------- TA#box#AbC_p ---------- */

insert_job: TA#box#AbC_p    job_type: a
#owner: bob
permission: gx
date_conditions: 1
days_of_week: su
start_times: "16:15"
run_window: "16:15-17:30"
description: "Job AbC that runs at 4:15PM on Sundays, and should end before 5:30PM"

    /* ---------- TA#cmd#EfGJob_p ---------- */

    insert_job: TA#cmd#EfGJob_p    job_type: b
    box_name: TA#box#AbC_p
    command: /path/to/shell/script.sh
    machine: vm_machine1
    #owner: alex
    permission: gx
    date_conditions: 2
    run_window: "16:20-16:30"
    description: "job EfG that runs within box AbC"
    term_run_time: 60
    std_out: /path/to/log.log
    std_err: /path/to/err.log
    alarm_if_fail: 1
    profile: /path/to/profile

and so on, for a long time. #cmd# jobs are sometimes under a #box#. If they are under a #box#, the #cmd# section is indented.

My ideal output would be something like:

"Job Name", "Time", "Schedule", "Machine", "Description", "Command"
"TA#box#AbC_p", "16:15", "su", "", "Job AbC that runs at 4:15PM on Sundays, and should end before 5:30PM", ""
"TA#cmd#EfGJob_p", "16:15", "su", "vm_machine1", "job EfG that runs within box AbC", "/path/to/shell/script.sh"

I'm trying awk, perl and grep, but I'm having trouble keeping all the info for one "section" together before I print the CSV line.

I got the ugly output to be: title: TA#box#AbC_p(\n)insert_job: TA#box#AbC_p(\n) job_type: a(\n)#owner: bob. The (\n) is not actually printed, but my comments don't seem to show up properly here. So if this output is easier to handle, let's work with this one. — Joe A, Commented May 19, 2012 at 8:21
If #cmd# is not under a box, where does it get the properties that it inherits from #box#? Do they simply appear in the cmd? For instance, will cmd have a days_of_week? — Kaz, Commented Oct 27, 2013 at 8:09

Joe A · Accepted Answer · 2012-05-19 12:11:07Z

A slightly terrible sed one-liner:

sed -n '
# we split the incoming text into small parts,
# each one, as you mentioned, from /---.*box.*/ to /profile/
/---.*box.*/,/profile/{
     # inside each part we do the following:
     # if a line matches one of our patterns, we extract
     # the value and give it an identifier (which you
     # can see is "ij", "st" and so on),
     # then we copy that tagged value to the hold buffer.
     # we do not replace the content of the hold buffer,
     # we just append (capital H) the new value to it
     /insert_job/{s/[^:]*: /ij"/;s/ .*/",/;H};
     /start_times/{s/[^:]*: /st/;s/$/,/;H};
     /days_of_week/{s/[^:]*: /dw"/;s/$/",/;H};
     /machine/{s/[^:]*: /ma"/;s/$/",/;H};
     /description/{s/[^:]*: /de/;s/$/,/;H};
     /command/{s/[^:]*: /co"/;s/$/",/;H};
     # when line matches next pattern (profile)
     # we think that it is the end of our part,
     # therefore we delete the whole line (s/.*//;)
     # and exchange the pattern and hold buffers (x;)
     # so now in pattern buffer we have several strings with all needed variables
     # but all of them are in pattern space, therefore we can remove
     # all newlines symbols (s/\n//g;). so it is just one string 
     # with a list of variables
     # and we just need to move to the order we want,
     # so in this section we do it with several s commands.
     # after that we print the result (p)
     /profile/{s/.*//;x;s/\n//g;s/ij\("[^"]*box[^"]*",\)/\1/;
          s/,\(.*\)st\("[^"]*",\)\(.*ij"[^"]*",\)/,\2\1\3\2/;
          s/\([^,]*,[^,]*,\)\(.*\)dw\("[^"]*",\)\(.*ij"[^"]*",[^,]*,\)/\1\3\2\4\3/;
          s/de/"",/;s/ij/""\n/;
          s/\(\n[^,]*,[^,]*,[^,]*,\)\(.*\)ma\("[^"]*",\)/\1\3\2/;
          s/co\("[^"]*"\),\(.*\)/\2\1/;s/de//;p}
     };
     # the last command just adds table caption and nothing more.
     # note: if you want to add some new commands,
     # add them before this one
     1i"Job Name", "Time", "Schedule", "Machine", "Description", "Command"'

I wrote it this way because the field order may vary between boxes, while profile is always the last field. If the order were always the same, the script could be a little simpler.

Wow. Let's see if I can explain part of this. sed -n '/---.*box.*/,/profile/{ quiet sed. do work from "/-------- box ------/" to "profile". /insert_job/{s/[^:]*: /ij"/;s/ .*/",/;H}; on the line that contains "insert_job", replace all non-colon characters up to the first ": " with 'ij"', and replace anything after a space with '",' and hold the new line (like, in a variable?). — Joe A, Commented May 19, 2012 at 7:13
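The two substitutions described in that comment can be traced step by step. The following is only an illustrative sketch in Python (not part of the original answer), assuming the sample insert_job line from the question; the list named hold is a hypothetical stand-in for sed's hold buffer.

```python
import re

line = "insert_job: TA#box#AbC_p    job_type: a"

# s/[^:]*: /ij"/  -> replace everything up to the first ": " with the tag ij"
tagged = re.sub(r'[^:]*: ', 'ij"', line, count=1)

# s/ .*/",/  -> replace the first space and everything after it with ",
tagged = re.sub(r' .*', '",', tagged, count=1)

# H appends the pattern space to the hold buffer; a plain list stands in for it
hold = []
hold.append(tagged)

print(hold)  # ['ij"TA#box#AbC_p",']
```

So "hold the new line" does behave roughly like appending to a variable: each tagged value is kept until the profile line triggers the reordering and printing.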

Gilles 'SO- stop being evil' · Accepted Answer · 2012-05-21 07:15:33Z

I'd use Perl for that, or at least awk.

perl -ne '
    BEGIN {
        print "\"Job Name\", \"Time\", \"Schedule\", \"Machine\", \"Description\", \"Command\"\n";
    }
    chomp; s/^\s+//; s/\s+$//;
    if (($_ eq "" || eof) && exists $fields{insert_job}) {
        print "\"", join("\", \"", @fields{qw(insert_job start_times days_of_week machine description command)}), "\"\n";
        delete @fields{qw(insert_job)};
    }
    if (/^([^ :]+): *(.*)/) {$fields{$1} = $2}
'

Explanations:

The BEGIN block is run once at the beginning of the script, the rest runs for every input line.
The line that begins with chomp strips off leading and trailing whitespace.
The first if line triggers on empty lines (paragraph separators), if the field insert_job is present.
The delete line removes the insert_job field. Add other field names that you don't want to spill over from one paragraph to the next.
The last if line stores fields.
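For comparison, here is a rough sketch of the same paragraph-based idea in Python. It is not the answer's script. It assumes records are separated by blank lines as in the question's sample, and it makes three choices of its own: it resets all fields between paragraphs, it cuts the trailing job_type off the insert_job value, and it strips the surrounding quotes from values. The names parse_records, to_csv_line, WANTED and SAMPLE are made up for illustration.

```python
import re

# Column order requested in the question.
WANTED = ["insert_job", "start_times", "days_of_week",
          "machine", "description", "command"]

def parse_records(text):
    """Yield one dict of fields per blank-line-separated paragraph."""
    fields = {}
    # The extra "" acts as a final blank line so the last record is flushed.
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        if line == "":
            if "insert_job" in fields:
                yield fields
            fields = {}          # reset everything, so nothing spills over
            continue
        m = re.match(r"([^ :]+): *(.*)", line)
        if m:
            key, value = m.group(1), m.group(2)
            if key == "insert_job":
                value = value.split()[0]   # drop the trailing "job_type: x"
            fields[key] = value.strip('"')

def to_csv_line(fields):
    """Format one record as quoted, comma-separated fields (no escaping)."""
    return ", ".join('"%s"' % fields.get(k, "") for k in WANTED)

SAMPLE = """\
/* ---------- TA#box#AbC_p ---------- */

insert_job: TA#box#AbC_p    job_type: a
days_of_week: su
start_times: "16:15"

    /* ---------- TA#cmd#EfGJob_p ---------- */

    insert_job: TA#cmd#EfGJob_p    job_type: b
    command: /path/to/shell/script.sh
    machine: vm_machine1
"""

for rec in parse_records(SAMPLE):
    print(to_csv_line(rec))
```

Note that this sketch, like the Perl script, does not copy the box's start time and schedule into the cmd jobs it contains, so those columns stay empty for the cmd record.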

edited May 21, 2012 at 7:15

answered May 19, 2012 at 13:11


hmm, I'm no Perl guru, but there is an error when I try to launch that code: Backslash found where operator expected at -e line 7, near ""\", \", @fields{qw(insert_job start_times days_of_week machine description command)}), "\" . P.S. perl v5.14.2
– rush
Commented May 19, 2012 at 13:52
@Rush Thanks, I'd missed a quote. I minimally tested the script this time, and fixed another couple of typos.
– Gilles 'SO- stop being evil'
Commented May 19, 2012 at 13:58
I don't get any output when I run this except the header row. I pasted your script into "parser.pl", replaced perl -ne with #!perl -n, and removed the single quotes. Can anyone confirm that they do get the expected output?
– Joe A
Commented May 21, 2012 at 3:40
@JoeA Make that #!/usr/bin/perl -n, and change the $fields[$1] = $2 to $fields{$1} = $2 (I forgot to fix this typo somehow, sorry).
– Gilles 'SO- stop being evil'
Commented May 21, 2012 at 7:16


Kaz · Accepted Answer · 2013-10-27 09:13:12Z

Using the TXR language:

@(bind inherit-time nil)
@(bind inherit-sched nil)
@(collect)
@  (all)
@indent/* ---------- @jobname ---------- */
@  (and)
@/ *//* ---------- @nil#@type#@nil ---------- */
@  (end)

@  (bind is-indented @(> (length indent) 0))
@  (gather :vars ((time "") (sched "") (mach "") (descr "") (cmd "")))
@/ */start_times: "@*time"
@/ */days_of_week: @sched
@/ */machine: @mach
@/ */description: "@*descr"
@/ */command: @cmd
@  (until)

@  (end)
@  (cases)
@    (bind type "box")
@    (set (inherit-time inherit-sched) (time sched))
@  (or)
@    (bind type "cmd")
@    (bind is-indented t)
@    (set (time sched) (inherit-time inherit-sched))
@  (end)
@(end)
@(output)
"Job Name", "Time", "Schedule", "Machine", "Description", "Command"
@  (repeat)
"@jobname", "@time", "@sched", "@mach", "@descr", "@cmd"
@  (end)
@(end)

This is a very naive approach. From each record, we extract all fields we are interested in, substituting blanks for ones which are not present (the default values in the :vars argument of @(gather)). We pay attention to the job type (box or cmd), and indentation. When we see a box, we copy a few box properties into global variables; and when we see a cmd which is indented, it copies these properties. (We assume blindly that they have been set up by an earlier box.)
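The inheritance step on its own can be sketched outside TXR as well. The following Python fragment is only an illustration under assumptions: each record is a dict with the hypothetical keys name, indented, time and sched, and the job type is taken from the text between the two hash marks in the name. A cmd that is not indented is left untouched here, which is a choice made for this sketch.

```python
def inherit_schedule(records):
    """Copy time/schedule from the most recent box into indented cmd records."""
    box_time, box_sched = "", ""
    out = []
    for rec in records:
        rec = dict(rec)                        # do not modify the caller's data
        kind = rec["name"].split("#")[1]       # "box" or "cmd"
        if kind == "box":
            # remember the box properties for the jobs that follow
            box_time, box_sched = rec["time"], rec["sched"]
        elif kind == "cmd" and rec["indented"]:
            # blindly assume an earlier box has set these up
            rec["time"], rec["sched"] = box_time, box_sched
        out.append(rec)
    return out

jobs = [
    {"name": "TA#box#AbC_p", "indented": False, "time": "16:15", "sched": "su"},
    {"name": "TA#cmd#EfGJob_p", "indented": True, "time": "", "sched": ""},
]
for job in inherit_schedule(jobs):
    print(job["name"], job["time"], job["sched"])
```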

Run:

$ txr jobs.txr jobs
"Job Name", "Time", "Schedule", "Machine", "Description", "Command"
"TA#box#AbC_p", "16:15", "su", "", "Job AbC that runs at 4:15PM on Sundays, and should end before 5:30PM", ""
"TA#cmd#EfGJob_p", "16:15", "su", "vm_machine1", "job EfG that runs within box AbC", "/path/to/shell/script.sh"

Note that the output is comma-separated quoted fields, but nothing is done with regard to the possibility that the data contains quotes. If quotes are somehow escaped in description:, then that will be preserved, of course. The @*descr notation is a greedy match, and so description: "a b"c\"d" will result in descr taking on the characters a b"c\"d which will be reproduced verbatim in the output.
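If embedded quotes do need to be handled, one option is to hand the formatting to a CSV library instead of printing the quotes by hand. As a hedged illustration (this is not what the TXR script does), Python's standard csv module doubles embedded quotes when writing; csv_line is a made-up helper name.

```python
import csv
import io

def csv_line(values):
    """Return one CSV line with every field quoted and embedded quotes doubled."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(values)
    return buf.getvalue()

print(csv_line(["TA#cmd#X", 'say "hi"']), end="")  # "TA#cmd#X","say ""hi"""
```

The writer puts no space after each comma, so the result differs slightly from the layout asked for in the question.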

The nice thing about this solution is that if we don't have an example of the data, we can guess most of it from the structure of the code, since it expresses an orderly pattern match through the file. We can see that there are sections being collected which begin with a /* --- ... --- */ line, in which a job-name is embedded, and that there is a type field between two hash marks in the middle of the job name. Then an obligatory blank line follows after which properties are gathered until another blank line and so on.
