I am trying to auto-generate tab completions for different commands. I pipe each command's man page into awk and search for command-line options (e.g. -shortopt, --long-option), printing each on a separate line:

for (i=1;i<NF;i++){
    if(match("\<-[0-9a-zA-Z_-]\+\>", $i)){
        print $i
    }
}

For some reason this refuses to work. With single backslashes, awk warns about ignoring the escape sequence \<, then treats it as a literal \\<. If I double the backslashes, awk no longer warns but still fails to match the pattern. The same pattern works if I open the man page in vim and search with it, so I think the pattern itself is correct. The above snippet lives in a file, and I invoke it with man {section} {page} | awk -f find-options.awk. That should rule out the command string being parsed twice, once by bash and once by awk (as in "awk FS with back slashes"), since awk reads the script file directly.

4
  • 3
    You might get better results searching the *roff source markup Commented Sep 29, 2019 at 15:15
  • Yes, as roaima says, it would be easier to parse the roff sources than to parse the generated manual. But there are a few competing roff macro sets for typesetting manuals, so a solution that parses roff sources would have to take Linux man sources and mandoc sources into account (at least) to be generally useful. When you parse generated manuals (as in the question), you may want to take into account that the resulting text may use control sequences for doing highlighting and bold text etc.
    – Kusalananda
    Commented Sep 29, 2019 at 15:20
  • 1
    I didn't understand what result you are looking for, but as a hint: to use word boundaries in awk, you could define your regex in a variable and then match against that variable, like man …| awk 'BEGIN{myRegex="\\<-[0-9a-zA-Z_-]+\\>";} {for(...) if(match($i, myRegex)) do_something}' Commented Sep 29, 2019 at 16:49
  • 1
    You have the match() arguments in the wrong order, assuming that first string is supposed to be your regexp. If so then you're also using a dynamic regexp "..." instead of a static regexp /.../ and so need to double the escapes to account for 1 being consumed by the process of converting the string to a regexp. Also \< is gawk-only - are you using gawk?
    – Ed Morton
    Commented Sep 29, 2019 at 22:52

1 Answer

man ls | col -bx | nawk '
{
    for (ii=1;ii<=NF;ii++) {
        if ( match($ii,/^(-[a-zA-Z0-9]|--[a-zA-Z0-9-]+)/) )
            opt[substr($ii,RSTART,RLENGTH)]++
     }  
} 
END { for (oo in opt) printf("%s\n",oo)  } '

This should work in any "new" awk (nawk, mawk, gawk).

Things that were changed:

  1. loop variable name, and off-by-one error
  2. wrong order in match() arguments, as noted by Ed Morton
  3. use /.../ for literal regex, don't escape +, and remove incorrect use of \< (it won't match because - is not in a "word", only letters, digits, underscores)¹
  4. pipe through col -bx to remove backspacing/overstriking (-x also converts tabs to spaces)
  5. save all the observed options in an array to suppress duplicates on output
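
To try the loop without depending on a particular man page, the sketch below feeds it a few hand-written lines. The function name list_options_demo and the sample text are made up for illustration, plain awk stands in for nawk, and the output is piped through sort because the iteration order of for (oo in opt) is unspecified.

```shell
# Hypothetical demo: run the option-collecting loop on fixed sample text.
list_options_demo() {
    printf '%s\n' \
        '  -a, --all        do not ignore entries starting with .' \
        '  --color[=WHEN]   colorize the output' \
        '  -l               use a long listing format; see also -a' |
    awk '
    {
        for (ii = 1; ii <= NF; ii++) {
            # match(string, regex): the string comes first, the regex second
            if (match($ii, /^(-[a-zA-Z0-9]|--[a-zA-Z0-9-]+)/))
                # keep only the matched part, so "-a," is stored as "-a"
                opt[substr($ii, RSTART, RLENGTH)]++
        }
    }
    END { for (oo in opt) print oo }' | LC_ALL=C sort
}

list_options_demo
```

With this sample input the duplicate -a is collapsed and the trailing comma and [=WHEN] suffix are dropped, leaving --all, --color, -a and -l.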

The escaping error you observed arises from the wrong ordering of the match() arguments: the pattern was being parsed as an ordinary string, and "<" in a literal string does not need to be, and should not be, escaped. Only \< in a regex (with proper /.../ delimiters) has special meaning. If the regex is a literal "string" or held in a variable, then you write "\\<" in the string in order to represent \< in the regex.
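
The difference between the two forms can be sketched with an escape that every awk understands. This example uses \. (a literal dot) rather than \<, since \< is not portable; the function names static_match and dynamic_match are hypothetical.

```shell
# Static regex: written between /.../, one backslash is enough.
static_match() {
    awk '{ if (match($0, /a\.b/)) print "match"; else print "no match" }'
}

# Dynamic regex: written as a string, so the backslash is doubled.
# The string "a\\.b" holds the characters a\.b, which then become the regex.
dynamic_match() {
    awk 'BEGIN { re = "a\\.b" }
         { if (match($0, re)) print "match"; else print "no match" }'
}

printf 'a.b\naxb\n' | static_match
printf 'a.b\naxb\n' | dynamic_match
```

Both functions should report a match for a.b and no match for axb. In both cases the string to search is the first argument of match() and the regex is the second.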

The bash-completion package has a feature that does something similar to what you want: its _longopt function invokes a command with --help to generate completions on the fly, ultimately using something like:

compgen -W "$( LC_ALL=C $COMMAND --help 2>&1 | \
  sed -ne 's/.*\(--[-A-Za-z0-9]\{1,\}=\{0,1\}\).*/\1/p' | sort -u )"
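
As a rough sketch of what that sed expression does on its own, the filter below applies it to text on standard input. The wrapper name extract_longopts and the sample lines are assumptions for illustration and are not part of bash-completion.

```shell
# Hypothetical wrapper: read help text on stdin, print unique long options.
extract_longopts() {
    LC_ALL=C sed -ne 's/.*\(--[-A-Za-z0-9]\{1,\}=\{0,1\}\).*/\1/p' |
        LC_ALL=C sort -u
}

printf '%s\n' \
    '  -a, --all          do not ignore entries' \
    '      --color=WHEN   colorize the output' \
    'a line without options' |
    extract_longopts
```

Lines without a long option print nothing because of -n together with the p flag, and a trailing = is kept so the completion can hint that the option takes a value. Since the leading .* is greedy, only the last long option on each line is captured.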

You might also find the (Perl) implementation of help2man instructive. It processes the output of "command --help" or equivalent to produce a minimal man page.


¹ The \< and \> zero-width assertions are non-POSIX and uncommon now; in PCRE, \b(?<=\W) and \b(?=\W) are used instead. Support in gawk is a GNU-ism, though not documented as such. Solaris ERE matching also supports them, though its awk does not. There they can also match the beginning or end of a string, so they work as intended (e.g. with /usr/xpg4/bin/grep -E).

They don't match start/end of string in GNU awk, but /-\<[0-9a-zA-Z_-]+\>/ would work, having changed \<- to -\< to match following word characters.
