
My system:

  • Ubuntu 22.04.3 LTS
  • GNU bash, version 5.1.16(1)-release (x86_64-pc-linux-gnu)

man ls describes -b as follows:

   -b, --escape
          print C-style escapes for nongraphic characters

The Wikipedia page for "control character" states:

a control character or non-printing character (NPC) is a code point in a character set that does not represent a written character or symbol. All other characters are mainly graphic characters, also known as printing characters (or printable characters), except perhaps for "space" characters.

This is ambiguous.

What authoritative resource explains what nongraphic characters are, and how this term may differ from non-printing characters?

  • look at the first 32 characters at ascii-code.com
    – jsotola
    Commented Mar 19 at 0:06
  • The Awk documentation is usually both readable and reliable: see https://www.gnu.org/software/gawk/manual/gawk.html#Bracket-Expressions. However, one significant issue is that the definitions may depend on the current locale settings. Multi-byte characters (e.g. UTF-8) are a whole new game. Commented Mar 19 at 0:47

2 Answers

2

The graphic characters are the ones for which the isgraph()/iswgraph() standard functions return true, or the ones matched by [[:graph:]] in a regular expression; that is, the ones in the graph character class of the locale.
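
As a rough illustration, the graph class can also be probed from the shell, since case patterns accept the same bracket expressions. This is only a sketch: is_graph is a name made up for the example, and LC_ALL=C is an assumption so that only the ASCII rules of the C locale apply.

```shell
# Sketch only: is_graph is a hypothetical helper name.
# LC_ALL=C restricts the test to the ASCII rules of the C locale.
LC_ALL=C
export LC_ALL

# Succeeds if the single character in $1 belongs to the graph class.
is_graph() {
    case $1 in
        [[:graph:]]) return 0 ;;
        *) return 1 ;;
    esac
}

# Try a letter, a digit, a punctuation mark and a space.
for c in a 7 '~' ' '; do
    if is_graph "$c"; then
        printf '"%s" is graphic\n' "$c"
    else
        printf '"%s" is not graphic\n' "$c"
    fi
done
```

In the C locale the letter, digit and tilde are reported as graphic and the space is not. In a UTF-8 locale the same pattern consults that locale's graph class instead.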

Per POSIX, the print class must be a superset of graph and be disjoint from cntrl. The graph class must be a superset of upper, lower, alpha, digit, xdigit, and punct, and must not include the space (U+0020) character (POSIX makes no mention of other whitespace characters).

The idea is that the graphic characters are the ones for which ink would be used to draw them, while the printable characters are the non-control ones.
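
The relationships between the classes can be spot-checked over the ASCII range with a short loop. This is a sketch under stated assumptions, not a conformance test: in_class and char_of are hypothetical helper names, LC_ALL=C limits the check to the C locale, and NUL is skipped because shell variables cannot hold it.

```shell
# Sketch only: in_class and char_of are hypothetical helper names.
LC_ALL=C
export LC_ALL

# in_class CLASS CHAR: succeeds if CHAR is in the named character class.
in_class() {
    pat="[[:$1:]]"
    case $2 in
        $pat) return 0 ;;
    esac
    return 1
}

# char_of N: print the character with decimal code N (1..127) followed by "x".
# The trailing "x" keeps a newline from being stripped by $( ).
char_of() {
    printf '%bx' "\\0$(printf '%o' "$1")"
}

violations=0
i=1    # NUL (0) is skipped: shell variables cannot hold it
while [ "$i" -lt 128 ]; do
    c=$(char_of "$i"); c=${c%x}
    # graph must be a subset of print
    if in_class graph "$c" && ! in_class print "$c"; then
        violations=$((violations + 1))
    fi
    # print must be disjoint from cntrl
    if in_class print "$c" && in_class cntrl "$c"; then
        violations=$((violations + 1))
    fi
    i=$((i + 1))
done

# The space character (U+0020) must not be graphic.
if in_class graph ' '; then
    violations=$((violations + 1))
fi

echo "violations: $violations"
```

In the C locale this reports zero violations. Running it in another locale only covers that locale's treatment of the ASCII range, since the loop stops at code 127.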

In practice, on GNU systems (such as Ubuntu) at least, print is graph plus the non-control characters from the space class. With glibc 2.35 (as used on Ubuntu 22.04) and in UTF-8 locales, those non-control space characters include:

U+0020 SPACE
U+1680 OGHAM SPACE MARK
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE

While the space class has:

U+0009 CHARACTER TABULATION
U+000A LINE FEED
U+000B LINE TABULATION
U+000C FORM FEED
U+000D CARRIAGE RETURN
U+0020 SPACE
U+1680 OGHAM SPACE MARK
U+2000 EN QUAD
U+2001 EM QUAD
U+2002 EN SPACE
U+2003 EM SPACE
U+2004 THREE-PER-EM SPACE
U+2005 FOUR-PER-EM SPACE
U+2006 SIX-PER-EM SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+200A HAIR SPACE
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE
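
Within ASCII, the difference between the two lists comes down to the control characters in the space class. The sketch below lists the codes that are in space but not in print. It assumes LC_ALL=C, so it says nothing about the non-ASCII entries above, and in_class is a hypothetical helper name.

```shell
# Sketch only: in_class is a hypothetical helper name.
LC_ALL=C
export LC_ALL

# in_class CLASS CHAR: succeeds if CHAR is in the named character class.
in_class() {
    pat="[[:$1:]]"
    case $2 in
        $pat) return 0 ;;
    esac
    return 1
}

only_space=""
i=1    # NUL (0) is skipped: shell variables cannot hold it
while [ "$i" -lt 128 ]; do
    # Build the character with code $i; the "x" guards a trailing newline.
    c=$(printf '%bx' "\\0$(printf '%o' "$i")"); c=${c%x}
    if in_class space "$c" && ! in_class print "$c"; then
        only_space="$only_space$(printf 'U+%04X ' "$i")"
    fi
    i=$((i + 1))
done

echo "in space but not in print: $only_space"
```

In the C locale this lists U+0009 through U+000D, the first five entries of the space list, while U+0020 is left out because it belongs to both classes.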
0

This Bash script tabulates the character classes that each character in the ASCII set belongs to (according to the GNU awk definitions).

#! /bin/bash --

Awk='
BEGIN {
    # Names for codes 1..32; NUL (0) and DEL (127) are added separately.
    Ctl1 = "SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI ";
    Ctl2 = "DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US SPACE";
    split (Ctl1 Ctl2, Ctl); Ctl[0] = "NUL"; Ctl[127] = "DEL";
    C = "cntrl print graph space blank punct alnum alpha digit lower upper xdigit";
    split (C, Class);
}
# Print one row: the code in hex, the character (or its name), and every class it matches.
function Char (n, ch, Local, j) {

    printf ("0x%.2X  %5s", n, (n <= 32 || n == 127) ? Ctl[n] : ch);
    for (j = 1; j in Class; ++j)
        if (ch ~ "[[:" Class[j] ":]]") printf ("  :%s:", Class[j]);
    printf ("\n");
}
{ for (j = 0; j < 128; j++) Char( j, sprintf ("%c", j)); }
'
# The single empty line from echo triggers the main rule exactly once.
echo | awk -f <( printf '%s' "${Awk}" )
    
  • ASCII represents a tiny fraction of all possible characters. See "Command to retrieve the list of characters in a given character class in the current locale" to extend it outside of ASCII. Commented Mar 19 at 8:18
  • @StéphaneChazelas I read (scanned anyway) that, right down to "See 24 more comments". Suddenly, I am nostalgic for my ICL 1901A 6-bit character set - no direct lowercase. We could deal with ASCII peripherals (paper-tape, printers, OCR readers, VDUs): Alpha and Beta shift codes for upper and lower case, Delta for control characters. Commented Mar 19 at 10:06
