man ls | col -bx | nawk '
{
    for (ii = 1; ii <= NF; ii++) {
        if (match($ii, /^(-[a-zA-Z0-9]|--[a-zA-Z0-9-]+)/))
            opt[substr($ii, RSTART, RLENGTH)]++
    }
}
END { for (oo in opt) printf("%s\n", oo) }'
This should work in any "new" awk (nawk, mawk, gawk).
Things that were changed:
- loop variable name, and off-by-one error
- wrong order in match() arguments, as noted by Ed Morton
- use /.../ for a literal regex, don't escape +, and remove the incorrect use of \< (it won't match because - is not a "word" character; only letters, digits and underscores are)¹
- pipe through col -b to remove backspacing/overstriking
- save all the observed options in an array to suppress duplicates on output
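To check the extraction without depending on a local man page, the same awk program can be fed fixed input. This is only a sketch: the extract_opts name and the sample lines are made up for illustration, plain awk is assumed in place of nawk, and the output is sorted because the order of "for (oo in opt)" is unspecified.

```shell
# Hypothetical helper: the awk program from above, reading stdin
# instead of "man ls | col -bx".
extract_opts() {
    awk '
    {
        for (ii = 1; ii <= NF; ii++)
            if (match($ii, /^(-[a-zA-Z0-9]|--[a-zA-Z0-9-]+)/))
                opt[substr($ii, RSTART, RLENGTH)]++
    }
    END { for (oo in opt) printf("%s\n", oo) }' | LC_ALL=C sort
}

# Made-up sample lines; "-a" appears twice but is printed once.
printf '%s\n' \
    '  -a, --all         do not ignore entries' \
    '  -a                mentioned again' \
    '  --color[=WHEN]    colorize the output' | extract_opts
```

With that input the function prints --all, --color and -a, one per line.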
The escaping error you observed arises from the wrong ordering of the match() arguments. "<" in a literal string does not need to be, and should not be, escaped. Only \< in a regex (with proper /.../ delimiters) has special meaning. If the regex is a literal "string" or in a variable, then you use "\\<" in the literal string in order to represent \< in the regex.
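The doubling rule can be seen with an escape that every awk understands. The sketch below uses \. rather than \< so that it does not depend on gawk; the demo_escape name and the sample input are illustrative only.

```shell
# Hypothetical demo: a regex literal needs one backslash, while a string
# used as a regex needs two, because string processing consumes one.
demo_escape() {
    printf '%s\n' 'a.b' 'axb' | awk '
    BEGIN { dyn = "a\\.b" }    # the string holds a\.b once it is parsed
    $0 ~ /a\.b/ { print "literal: " $0 }
    $0 ~ dyn    { print "dynamic: " $0 }'
}

demo_escape
```

Both forms match only the line a.b, so the output is "literal: a.b" followed by "dynamic: a.b"; axb is not printed by either rule.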
The bash-completion package has a feature which does something similar to your aim: its _longopt function invokes a command with --help in order to generate completions on the fly, ultimately using something like:
compgen -W "$( LC_ALL=C $COMMAND --help 2>&1 | \
sed -ne 's/.*\(--[-A-Za-z0-9]\{1,\}=\{0,1\}\).*/\1/p' | sort -u )"
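To see what the sed expression keeps, it can be run on canned help text. This is a sketch only: sample_help and long_opts are hypothetical names, and the help text is invented, standing in for the output of "$COMMAND --help".

```shell
# Hypothetical stand-in for the output of "$COMMAND --help".
sample_help() {
    printf '%s\n' \
        'Usage: frob [OPTION]... FILE' \
        '  -a, --all              include everything' \
        '      --block-size=SIZE  scale sizes by SIZE' \
        '  -v                     verbose'
}

# The sed expression from above, wrapped so it can be reused.
long_opts() {
    sed -ne 's/.*\(--[-A-Za-z0-9]\{1,\}=\{0,1\}\).*/\1/p' | LC_ALL=C sort -u
}

sample_help | long_opts
```

This prints --all and --block-size= (the trailing = is kept for options written as --name=VALUE). Because the leading .* is greedy, only the last long option on each line is captured, and short options are ignored entirely.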
You might also find the implementation (perl) of help2man instructive; it processes the output of "command --help" or equivalent to produce a minimal man page.
¹ the \< and \> zero-width assertions are non-POSIX and uncommon now; in PCRE, \b(?<=\W) and \b(?=\W) are used. Support in gawk is a GNU-ism, though not documented as such. Solaris ERE matching also supports them, though its awk does not; there they can also match the beginning or end of a string, so work as intended (i.e. with /usr/xpg4/bin/grep -E).
They don't match start/end of string in GNU awk, but /-\<[0-9a-zA-Z_-]+\>/ would work, having changed \<- to -\< to match following word characters.
man sources and mandoc sources into account (at least) to be generally useful. When you parse generated manuals (as in the question), you may want to take into account that the resulting text may use control sequences for highlighting, bold text, etc.
man …| awk 'BEGIN{myRegex="\\<-[0-9a-zA-Z_-]+\\>";} {for(...) if(match($i, myRegex)) do_somehintg}'
"..." instead of a static regexp /.../ and so need to double the escapes to account for 1 being consumed by the process of converting the string to a regexp. Also \< is gawk-only - are you using gawk?