I had a problem with a tool that was generating invalid JSON.

Some of the JSON strings contained raw characters in the range 00-1f. So I wanted to convert these characters to the correctly escaped values \u00xx within the string.

The best I have managed to do is:

cat test2.json | jq -aR . | sed -e 's/\\"/"/g' -e 's/^"\(.*\)"$/\1/' | jq

Explanation:

jq -aR .                  Reads the data as raw input and converts
                          each input line into a single JSON string
                          (here that is the whole file, as it is one line).
                          This converts all control characters into
                          the correct form => \u00xx

sed -e 's/^"\(.*\)"$/\1/' Removes the quotes from the beginning and end.

sed -e 's/\\"/"/g'        Looks for escaped quotes and removed the quotes.

jq                        Just makes it pretty again at the end.
                          Also makes sure it is valid JSON.

A few issues that I have spotted (but luckily they don't affect me yet):

  1. Embedded newlines ('\n') in the string are not handled correctly.
  2. Any escaped characters are now probably double escaped (see the sketch after this list).
  3. Probably other things I have not thought about.
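
For example, here is a rough sketch of issue 2 (the file name is just for illustration): an input that already contains a valid \n escape gets its backslash doubled by the raw-read step, and the sed substitutions only restore the quotes, not the backslash:

printf '%s\n' '{ "data": "A\nB" }' > already-escaped.json
jq -aR . already-escaped.json
# "{ \"data\": \"A\\nB\" }"    <- the original \n is now \\n

After the two sed substitutions and the final jq, the value decodes to the two literal characters \n instead of a real newline.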

Some test data can be generated with:

echo -e "{ \"data\": \"XX\001YY\"}" > test2.json

Then I have tested with:

cat test2.json | jq -aR . | sed -e 's/\\"/"/g' -e 's/^"\(.*\)"$/\1/' | jq

Generates:

{
  "data": "XX\u0001YY"
}
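
For reference, here is roughly what each stage of the pipeline produces for that test file (outputs reconstructed by hand, so treat this as a sketch):

jq -aR . test2.json
# "{ \"data\": \"XX\u0001YY\"}"

jq -aR . test2.json | sed -e 's/\\"/"/g'
# "{ "data": "XX\u0001YY"}"

jq -aR . test2.json | sed -e 's/\\"/"/g' -e 's/^"\(.*\)"$/\1/'
# { "data": "XX\u0001YY"}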

Just noticed that this does not handle a newline ('\n', i.e. 0x0a) correctly when it is inside a string.
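
A rough sketch of that failure (the file name is just for illustration): with a raw newline inside the string, jq -R reads the two lines as two separate strings, so the reassembled text still contains the unescaped newline and the final jq rejects it with a parse error:

printf '{ "data": "XX\nYY"}\n' > test3.json
cat test3.json | jq -aR . | sed -e 's/\\"/"/g' -e 's/^"\(.*\)"$/\1/' | jq
# fails: the raw newline is still inside the string, which is not valid JSON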

  • Any chance of seeing the actual data? – Kusalananda Commented Mar 25, 2019 at 18:28
  • @Kusalananda The actual data is huge. But I have been testing with files I generated by hand. BUT it would be pointless to add those to a web page as the characters are not printable and thus cutting and pasting them does not work. Commented Mar 25, 2019 at 19:15
  • Have also just noticed that newline inside the string is not handled correctly. Commented Mar 25, 2019 at 19:16
  • Can you confirm whether there are raw newline characters both inside and outside quotes in your file? Commented Mar 25, 2019 at 20:48
  • @MichaelHomer There do not seem to be newline characters inside any of the data that has been produced so far (as this breaks the script above and generates an error). BUT as this is a bug in the output of the system I can see it as a possibility (the character '\n' or 0x0a is invalid in JSON strings and is in the range 00-1f that could potentially be produced). Commented Mar 25, 2019 at 21:40

1 Answer

You can use perl to replace all the C0 controls with the hex escapes:

perl -pe 's/([\x01-\x1f])/sprintf("\\u%04x", ord($1))/eg' < test.json

This

  1. Runs the program in a loop, printing out the result at the end sed-style (perl -pe)
  2. Matches each byte in the range 01-1f (s/([\x01-\x1f])/...g)
  3. Computes the ordinal value of the byte (ord($1))
  4. Replaces the matched byte with the result of sprintf("\\u%04x", ord($1)) (/e)

That will insert \u0001, \u0002, ..., \u001f in place of the matched bytes.

It will escape all newlines in the same way, so if the file has unquoted line breaks the result will no longer be valid (notably, a text file will have at least a terminating newline character, but that can be removed mechanically either before or afterwards). In that case, the range [\x01-\x09\x0b-\x1f] will skip newlines, but that in turn fails if there are true line breaks inside quotes, since they are left unescaped.
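
For example, a sketch using that narrower range on the test file from the question (test2.json), with jq at the end just to validate and pretty-print the result:

perl -pe 's/([\x01-\x09\x0b-\x1f])/sprintf("\\u%04x", ord($1))/eg' < test2.json | jq
# {
#   "data": "XX\u0001YY"
# }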

If your file has both quoted and unquoted line breaks, this sort of contextless replacement can't work. You will need a liberal JSON parser that accepts the file as-is in order to know which need escaping and which don't. I'm not sure of one off-hand.
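
Not something from the answer itself, but as a rough sketch of that context-tracking idea: a small perl state machine could escape control characters only when it is inside a double-quoted string (honouring backslash escapes). It is not a real JSON parser, so treat it as a starting point only:

# rough, untested sketch: escape control characters only inside strings
perl -0777 -pe '
  my ($in, $esc, $out) = (0, 0, "");
  for my $c (split //) {
    if ($in) {
      if    ($esc)           { $out .= $c; $esc = 0; }   # character following a backslash
      elsif ($c eq "\\")     { $out .= $c; $esc = 1; }   # start of an escape sequence
      elsif ($c eq "\"")     { $out .= $c; $in = 0; }    # closing quote
      elsif (ord($c) < 0x20) { $out .= sprintf("\\u%04x", ord($c)); }  # control character
      else                   { $out .= $c; }
    } else {
      $out .= $c;
      $in = 1 if $c eq "\"";                             # opening quote
    }
  }
  $_ = $out;
' < test.json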
