I had a problem with a tool that was generating invalid JSON.

Some of the JSON strings contained raw characters in the range 00-1f. So I wanted to convert these characters to the correctly escaped values \u00xx within the string.

The best I have managed to do is:

cat test2.json | jq -aR . | sed -e 's/\\"/"/g' -e 's/^"\(.*\)"$/\1/' | jq

Explanation:

jq -aR .                  Reads the data as raw input and converts
                          each input line into a single JSON string
                          (here that is the whole file, as it is one line).
                          This converts all control characters into
                          the correct form => \u00xx

sed -e 's/^"\(.*\)"$/\1/' Removes the quotes from the beginning and end.

sed -e 's/\\"/"/g'        Looks for escaped quotes and removed the quotes.

jq                        Just makes it pretty again at the end.
                          Also makes sure it is valid JSON.

A few issues that I have spotted (but luckily they don't affect me yet):

  1. Embedded newlines ('\n') in the string are not handled correctly.
  2. Any escaped characters are now probably double escaped (see the sketch after this list).
  3. Probably other things I have not thought about.
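
For example, here is a rough sketch of issue 2 (the file name is just for illustration): an input that already contains a valid \n escape gets its backslash doubled by the raw-read step, and the sed substitutions only restore the quotes, not the backslash:

printf '%s\n' '{ "data": "A\nB" }' > already-escaped.json
jq -aR . already-escaped.json
# "{ \"data\": \"A\\nB\" }"    <- the original \n is now \\n

After the two sed substitutions and the final jq, the value decodes to the two literal characters \n instead of a real newline.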

Some test data can be generated with:

echo -e "{ \"data\": \"XX\001YY\"}" > test2.json

Then I have tested with:

cat test2.json | jq -aR . | sed -e 's/\\"/"/g' -e 's/^"\(.*\)"$/\1/' | jq

Generates:

{
  "data": "XX\u0001YY"
}
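
For reference, here is roughly what each stage of the pipeline produces for that test file (outputs reconstructed by hand, so treat this as a sketch):

jq -aR . test2.json
# "{ \"data\": \"XX\u0001YY\"}"

jq -aR . test2.json | sed -e 's/\\"/"/g'
# "{ "data": "XX\u0001YY"}"

jq -aR . test2.json | sed -e 's/\\"/"/g' -e 's/^"\(.*\)"$/\1/'
# { "data": "XX\u0001YY"}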

Just noticed that this does not handle a newline ('\n', i.e. 0x0a) correctly when it is inside a string.
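
A rough sketch of that failure (the file name is just for illustration): with a raw newline inside the string, jq -R reads the two lines as two separate strings, so the reassembled text still contains the unescaped newline and the final jq rejects it with a parse error:

printf '{ "data": "XX\nYY"}\n' > test3.json
cat test3.json | jq -aR . | sed -e 's/\\"/"/g' -e 's/^"\(.*\)"$/\1/' | jq
# fails: the raw newline is still inside the string, which is not valid JSON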

  • Any chance of seeing the actual data? – Kusalananda Commented Mar 25, 2019 at 18:28
  • @Kusalananda The actual data is huge. But I have been testing with files I generated by hand. BUT it would be pointless to add those to a web page as the characters are not printable and thus cutting and pasting them does not work. Commented Mar 25, 2019 at 19:15
  • Have also just noticed that newline inside the string is not handled correctly. Commented Mar 25, 2019 at 19:16
  • Can you confirm whether there are raw newline characters both inside and outside quotes in your file? Commented Mar 25, 2019 at 20:48
  • @MichaelHomer There do not seem to be newline characters inside any of the data that has been produced so far (as this breaks the script above and generates an error). BUT as this is a bug in the output of the system I can see it as a possibility (the character '\n' or 0x0a is invalid in JSON strings and is in the range 00-1f that could potentially be produced). Commented Mar 25, 2019 at 21:40

1 Answer

You can use perl to replace all the C0 controls with the hex escapes:

perl -pe 's/([\x01-\x1f])/sprintf("\\u%04x", ord($1))/eg' < test.json

This

  1. Runs the program in a loop, printing out the result at the end sed-style (perl -pe)
  2. Matches each byte in the range 01-1f (s/([\x01-\x1f])/...g)
  3. Computes the ordinal value of the byte (ord($1))
  4. Replaces the matched byte with the result of sprintf("\\u%04x", ord($1)) (/e)

That will insert \u0001, \u0002, ..., \u001f in place of the matched bytes.

It will escape all newlines in the same way, so if the file has unquoted line breaks the result will no longer be valid (notably, a text file will have at least a terminating newline character, but that can be removed mechanically either before or afterwards). In that case, the range [\x01-\x09\x0b-\x1f] will skip newlines, but that in turn fails if there are true line breaks inside quotes, since they are left unescaped.
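
For example, a sketch using that narrower range on the test file from the question (test2.json), with jq at the end just to validate and pretty-print the result:

perl -pe 's/([\x01-\x09\x0b-\x1f])/sprintf("\\u%04x", ord($1))/eg' < test2.json | jq
# {
#   "data": "XX\u0001YY"
# }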

If your file has both quoted and unquoted line breaks, this sort of contextless replacement can't work. You will need a liberal JSON parser that accepts the file as-is in order to know which need escaping and which don't. I'm not sure of one off-hand.
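
Not something from the answer itself, but as a rough sketch of that context-tracking idea: a small perl state machine could escape control characters only when it is inside a double-quoted string (honouring backslash escapes). It is not a real JSON parser, so treat it as a starting point only:

# rough, untested sketch: escape control characters only inside strings
perl -0777 -pe '
  my ($in, $esc, $out) = (0, 0, "");
  for my $c (split //) {
    if ($in) {
      if    ($esc)           { $out .= $c; $esc = 0; }   # character following a backslash
      elsif ($c eq "\\")     { $out .= $c; $esc = 1; }   # start of an escape sequence
      elsif ($c eq "\"")     { $out .= $c; $in = 0; }    # closing quote
      elsif (ord($c) < 0x20) { $out .= sprintf("\\u%04x", ord($c)); }  # control character
      else                   { $out .= $c; }
    } else {
      $out .= $c;
      $in = 1 if $c eq "\"";                             # opening quote
    }
  }
  $_ = $out;
' < test.json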
