I'm wondering: can gawk print text in an encoding other than ASCII?

Currently, I'm using gawk's match() to search through some UTF-8 text. When I print the matches gawk finds, the output looks like this:

Chapter 9\n\xE2\x80\x9COh, I Get By with a Little Help from My Friends\xE2\x80\x9D: Short-Term Writing Center/Community Collaborations

But I want it to look like this:

Chapter 9 “Oh, I Get By with a Little Help from My Friends”: Short-Term Writing Center/Community Collaborations

My code:

gawk '
    match($0, /^[\|\+-].*"([^"]+)".*#([[:digit:]]+)/, m) {
        print m[2]
    }
' file.txt

1 Answer

Yes. gawk operates in the current locale, so it should work with any encoding your system supports. gawk's documentation refers to Unicode and UTF-8 in many places, for example:

The POSIX standard requires that awk function in terms of characters, not bytes. Thus in gawk, length(), substr(), split(), match() and the other string functions (see section String-Manipulation Functions) all work in terms of characters in the local character set, and not in terms of bytes. (Not all awk implementations do so, though).

The GNU Awk User’s Guide

If it doesn't work on your system, the most likely cause is that the current locale's encoding isn't UTF-8. In that case, set LC_ALL to a UTF-8 locale. Here's a simple example:

$ echo 'Chapter 9 “Oh, I Get By with a Little Help from My Friends”:' | \
LC_ALL=en_US.UTF-8 gawk '/with/'
Chapter 9 “Oh, I Get By with a Little Help from My Friends”:
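
Before forcing LC_ALL, it can help to check which encoding your session is actually using and which UTF-8 locales are installed. A minimal sketch, assuming the standard `locale` utility is available (it may be missing on some minimal systems, hence the fallbacks):

```shell
# Encoding of the current locale, e.g. UTF-8 (or ANSI_X3.4-1968 for plain ASCII).
charmap=$(locale charmap 2>/dev/null || echo unknown)
echo "current charmap: $charmap"

# Installed locales whose names mention UTF-8; any of these can be used for LC_ALL.
locale -a 2>/dev/null | grep -iE 'utf-?8' || echo "no UTF-8 locale found"
```

If the first line does not report UTF-8, pick one of the listed locale names and pass it through LC_ALL as in the example above.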
