6

The methods I found break things further down the line because they also uppercase escaped characters such as the \n line break.
For example:

$ message="First Line\nSecond Line"; 
$ echo "${message^^}"
FIRST LINE\NSECOND LINE

Is there an elegant way to convert a string to uppercase while leaving escaped characters alone, to get the following output instead?

FIRST LINE\nSECOND LINE

I could just do something convoluted like changing "\n" to 0001 or something along those lines, apply the conversion and then return 0001 to "\n". But maybe there is a better way.

2
  • Is this for later inclusion as part of some other data, possibly in XML or JSON format? If so, a parser of that format may have routines for turning strings into uppercase in the way you describe, for example ascii_upcase in the JSON parser jq, or the XPath function upper-case() for XML.
    – Kusalananda
    Commented Jul 24, 2022 at 11:50
  • @Kusalananda For me this is only about text processing, but someone else stumbling across this question might have such a use case.
    – Ocean
    Commented Jul 25, 2022 at 9:52

6 Answers

6

With zsh instead of bash:

$ message="First Line\nSecond Line"
$ set -o extendedglob
$ print -r -- ${message//(#b)((\\?)|(?))/$match[2]$match[3]:u}
FIRST LINE\nSECOND LINE

In bash (or any shell) and with the GNU implementation of sed, you can do the same with:

$ printf '%s\n' "$message" | sed -E 's/(\\.)|(.)/\1\u\2/g'
FIRST LINE\nSECOND LINE

Some variants that are potentially more efficient, as they minimise the number of substitutions:

  • zsh

    print -r -- ${message//(#b)((\\?)|([^\\]##))/$match[2]$match[3]:u}
    

    or

    print -r -- ${message//(#b)((\\?)#)([^\\]##)/$match[1]$match[3]:u}
    
  • their GNU sed translations:

    printf '%s\n' "$message" | sed -E 's/(\\.)|([^\\]+)/\1\U\2/g'
    

    or

    printf '%s\n' "$message" | sed -E 's/((\\.)*)([^\\]+)/\1\U\3/g'
    

Beware that they convert \Mx (Meta-x, an escape sequence supported by zsh's print, for instance, which expands to the 0xf8 byte ('x' + 0x80)) to \MX (0xd8). They also convert \x7a to \x7A, \u007a to \u007A and \Cx to \CX, but that shouldn't be a problem as those expand to the same thing.
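
As a quick check of how the alternation treats an escaped backslash, the first GNU sed translation above can be wrapped in a small function. This is only a sketch: the function name upcase_keep_escapes is made up for illustration, and it assumes GNU sed (for \U in the replacement).

```bash
# Hypothetical helper name; requires GNU sed for \U in the replacement.
upcase_keep_escapes() {
    # Either copy a backslash plus the next character verbatim (\1),
    # or uppercase a run of non-backslash characters (\U\2).
    printf '%s\n' "$1" | sed -E 's/(\\.)|([^\\]+)/\1\U\2/g'
}

upcase_keep_escapes 'First Line\nSecond Line'   # FIRST LINE\nSECOND LINE
upcase_keep_escapes 'foo\\bar'                  # FOO\\BAR
```

Because the regex consumes the string left to right, \\ is taken as one escape pair and the b that follows it is uppercased like any other letter.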

3

I'd be tempted to interpret the escape sequences into literal characters:

message="First Line\nSecond Line"
declare -u Message                       # uppercase on assignment
printf -v Message -- "${message//%/%%}"  # assign
declare -p Message                       # inspect

result

declare -u Message="FIRST LINE
SECOND LINE"
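
A possible variation, sketched here under the assumption of bash 4 or later: pass the string as an argument to a %b conversion instead of using it as the format string, so that % characters need no doubling. The variable name upper is made up for this example.

```bash
message='100% First Line\nSecond Line'
declare -u upper                  # "upper" is a made-up name for this sketch
printf -v upper '%b' "$message"   # %b expands \n and similar escapes in the argument
printf '%s\n' "$upper"
```

This should print "100% FIRST LINE" and "SECOND LINE" on two lines. As with the version above, any escape that expands to a lowercase letter ends up uppercased too.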
4
  • 3
    Beware that with message='\141' for instance, you'd get declare -u Message="A" instead of declare -u Message="a"
    Commented Apr 25, 2022 at 19:07
  • Note that any \ will be doubled (\\).
    – user232326
    Commented Apr 25, 2022 at 22:45
  • 1
    Not giving printf a separate format is what makes % special, which you then have to work around by doubling every %. However, printf -v Message -- '%b' "${message}" will interpret backslash escapes the same way echo -e does, without touching the % characters.
    – user232326
    Commented Apr 25, 2022 at 22:57
  • Please read: unix.stackexchange.com/q/700508/232326
    – user232326
    Commented Apr 27, 2022 at 19:26
1
echo "$message"  |  sed -e 's/^[[:lower:]]/\u&/' -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g' \
                                                 -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'
  • -e 's/^[[:lower:]]/\u&/'  If the first character in the string (or, more generally, the first character on a line) is a lower-case letter, capitalize it.  Because the first character on a line can’t be escaped.  Duh.  That’s a no-brainer.

  • -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'  Look at the line two characters at a time.  If a lower-case letter is preceded by something other than a backslash, leave the preceding character alone, and capitalize the lower-case letter.

    You might think that this would be enough to process the entire line.  Unfortunately, since it processes the line two characters at a time, it gets only every other letter:

    $ echo "first line\nsecond line" | sed -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'
    fIrSt LiNe\nSeCoNd LiNe
    

    so,

  • -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'  Do the exact same thing a second time.  This will pick up the letters that were skipped on the first pass.


Alternative version:

echo "$message" | sed -e 's/^[[:lower:]]/\u&/' \
                                  -e ': loop; s/\([^\]\)\([[:lower:]]\)/\1\u\2/g; t loop'

Basically the same as the first version, but, instead of repeating the second s command, it iterates it with a loop.


Unfortunately, this will not work correctly for double backslashes:  foo\\bar will become FOO\\bAR, even though the b should be capitalized, since the \\ is an escaped backslash, and so should not cause the b to be escaped.
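
One way around the double-backslash problem is to scan the string left to right and treat a backslash plus the character after it as a single unit. The following is a minimal sketch of that idea using POSIX awk rather than sed; the function name upper_keep_escapes is hypothetical.

```bash
# Hypothetical helper: uppercase everything except backslash escapes.
upper_keep_escapes() {
    printf '%s\n' "$1" | awk '{
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "\\" && i < n) {
                # Copy the backslash and the character it escapes verbatim.
                out = out c substr($0, i + 1, 1)
                i++
            } else {
                out = out toupper(c)
            }
        }
        print out
    }'
}

upper_keep_escapes 'foo\\bar'                  # FOO\\BAR
upper_keep_escapes 'first line\nsecond line'   # FIRST LINE\nSECOND LINE
```

Since the scan skips two characters at a time after a backslash, \\ is consumed as a pair and the following b is no longer mistaken for an escaped character.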

3
  • No, the first character could be escaped, like when you want to insert a tab at the beginning, which would be "\t".
    – Ocean
    Commented May 9, 2022 at 15:19
  • One of us is not understanding the other.  If the line begins with \t, then the first character is \.  t is the second character.  If I’m misunderstanding you, please explain more clearly.
    Commented May 9, 2022 at 22:36
  • Semantics. If a line begins with "\t", then the first character is an escaped "t". But one can also say that "\" is the first character. Depends on how you look at it, I guess. It could also be an escaped "\" by having "\\t", so one gets "\t" instead of the tab character. Since these constructs are supposed to represent a single character (\t is tab), I treat them as single entities, which was the origin of the misunderstanding.
    – Ocean
    Commented May 10, 2022 at 12:08
1

I'd consider evaluating the \n and other escape sequences at the point that the variable was defined. Here $message actually contains a newline.

message=$(printf '%b' 'First Line\nSecond Line')
echo "${message^^}"

Output

FIRST LINE
SECOND LINE
0

The variable can be iterated line by line. Then concatenate the output again.

bash:

$ message="First Line\nSecond Line";
$ message=$(echo -e ${message} |while read -r line; do echo -n "${line^^}\n" ; done) && message=${message%??}
$ echo ${message} 
FIRST LINE\nSECOND LINE
5
  • 1
    That will likely leave linefeeds alone, but the OP asked for all escaped characters to be left alone.
    Commented Apr 26, 2022 at 8:04
  • 1
    Backslash processing should be removed from the while read loop for sure. Just edited the answer.
    – Kadir
    Commented Apr 26, 2022 at 8:35
  • (1) For starters, ${message} should be "$message".  See ${variable_name} doesn’t mean what you think it does ….  (2) You should explain your answer better, in particular (IMO) the %?? part. (You don’t need to explain it to me; I figured it out.)  Please do not respond in comments; edit your answer to make it clearer and more complete. … (Cont’d)
    Commented May 7, 2022 at 19:04
  • (Cont’d) …  (3) This is a classic example of providing a solution for the example while ignoring the larger question.  foo\012bar will turn into FOO\nBAR, \g\h\i\j\k\l\m\n\o\p\q will turn into \G\H\I\J\K\L\M\n\O\P\Q, and any of \a, \b, \c, \e, \f, \r, \t, \v, and \\ will cause problems.  Also, leading and trailing spaces, and multiple spaces. (4) Strictly speaking, the question didn’t say that you should clobber the original variable.  If you need a multi-step process, you should assign the intermediate value to a temp variable.
    Commented May 7, 2022 at 19:04
0

Using Raku (formerly known as Perl_6)

~$ echo 'a\nb'
a\nb
~$ echo 'a\nb' | raku -pe 's:g/ <!after "\\"> (.) /{$0.uc}/;'
A\nB
~$ echo "a\\nb"
a\nb
~$ echo "a\\nb" | raku -pe 's:g/ <!after "\\"> (.) /{$0.uc}/;'
A\nB

The above uses a negative look-behind assertion, <!after "\\">, to select all characters except those immediately after a \ backslash. The selected characters are then uppercased with Raku's .uc routine.

Certainly it's safer to provide the regex with a custom <-[ … ]> negative character class, sparing backslashed characters like \n and \t from being uppercased. (FYI, custom positive character classes are written <+[ … ]> or more simply <[ … ]> in Raku).

The examples below use Raku's "Q-lang" (quoting language) to feed the substitution operator a string. In all four examples \n is returned (not uppercase \N). Note in the third example how \n is interpreted as a newline character, and this remains unchanged in the fourth example, telling us that \n still exists in that string (i.e. it has NOT been uppercased to \N):

~$ raku -e 'put Q<a\nb>'
a\nb
~$ raku -e 'put Q<a\nb>' | raku -pe 's:g/ <!after "\\"> (<-[nt]>) /{$0.uc}/;'
A\nB
~$ raku -e 'put Q:b<a\nb>'
a
b
~$ raku -e 'put Q:b<a\nb>' | raku -pe 's:g/ <!after "\\"> (<-[nt]>) /{$0.uc}/;'
A
B

NOTE, see: "Place an escape sign before every non-alphanumeric characters" for Raku answers to a related question on StackOverflow.

References:
https://docs.raku.org/language/quoting
https://docs.raku.org/language/regexes#Literals_and_metacharacters
https://raku.org
