I have been working on string encoding schemes, and while examining how UTF-16 works, I have a question. Why use complex surrogate pairs to represent a 21-bit code point? Why not simply store some of the bits in the first code unit and the remaining bits in the second code unit? Am I missing something? Is there a problem with storing the bits more directly, the way UTF-8 does?

Example of what I am thinking of:

The character '🙃'

Corresponding code point: 128579 (decimal)
The binary form: 1 1111 0110 0100 0011 (17 bits)
It's a 17-bit code point.

  • Based on the UTF-8 scheme, it would be represented as:

    240 : 11110 000
    159 : 10 011111
    153 : 10 011001
    131 : 10 000011
    
  • In UTF-16, why not do something like this rather than using surrogate pairs:

    49159 : 110 0 0000 0000 0111
    30275 : 01 11 0110 0100 0011
    
  • Which bits do you store in the first unit? Which bits do you store in the second unit? How do you know that what you're dealing with is a 21-bit number, not two 16-bit numbers (code points)? You have to be able to determine unambiguously when you have a surrogate pair. That's what the UTF-16 encoding does. It's also why the maximum Unicode code point is U+10FFFF — you can't encode anything bigger with just two surrogates with the scheme as presently designed. Commented Jan 21, 2020 at 7:52
  • Also, you are mistaken about UTF-8; it uses a similar scheme to store 21-bit numbers in 8-bit chars. In fact, UTF-8 is more complex than UTF-16, because it uses four different encodings (with four different lengths) for the various codepoints, rather than only two different ones as UTF-16 does.
    – Mr Lister
    Commented Jan 21, 2020 at 8:36
  • Please look at the question again, I have edited it. Commented Jan 21, 2020 at 13:54
  • There's also a historical dimension: UTF-16 (then called UCS2) was already in use when it was decided that 2 bytes (65536 code points) wasn't enough. At that point, the surrogate pair solution was invented and they reserved two ranges that were still unused. If you think UTF-16 is intricate, the answer might be that it wasn't designed like this from the beginning.
    – lenz
    Commented Jan 21, 2020 at 20:01
  • @MrLister... wow, this is the first time I've heard the claim that UTF-8 is more complex than UTF-16 :) I guess that depends on your definition of the word "complex"
    – JoelFan
    Commented Jan 21, 2020 at 21:19

1 Answer

Proposed alternative to UTF-16

I think you're proposing an alternative format using 16-bit code units analogous to the UTF-8 code scheme — let's designate it UTF-EMF-16.

In your UTF-EMF-16 scheme, code points from U+0000 to U+7FFF would be encoded as a single 16-bit unit with the MSB (most significant bit) always zero. You'd reserve 16-bit units with the two most significant bits set to 10 as 'continuation units', carrying 14 bits of payload data. And you'd encode code points from U+8000 to U+10FFFF (the current maximum Unicode code point) as a lead unit with the three most significant bits set to 110 and up to 13 bits of payload data, followed by one continuation unit. With Unicode as currently defined (U+0000 .. U+10FFFF), you'd never need more than 7 of those 13 bits.

U+0000 .. U+7FFF    — One 16-bit unit: values 0x0000 .. 0x7FFF
U+8000 .. U+10FFFF  — Two 16-bit units:
                      1. First unit  0xC000 .. 0xC043
                      2. Second unit 0x8000 .. 0xBFFF

For your example code point, U+1F643 (binary: 1 1111 0110 0100 0011):

First unit:  1100 0000 0000 0111 = 0xC007
Second unit: 1011 0110 0100 0011 = 0xB643

The second unit differs from your example in reversing the two most significant bits, from 01 in your example to 10 in mine.
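
To make the bit layout concrete, here is a minimal C sketch of an encoder for this hypothetical UTF-EMF-16 scheme (the function name emf16_encode and its interface are made up for illustration; nothing here is part of any standard):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical UTF-EMF-16 encoder: writes 1 or 2 16-bit units to out and
   returns how many were written, or 0 for values outside U+0000 .. U+10FFFF. */
static int emf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp <= 0x7FFF) {                               /* one unit: 0xxxxxxxxxxxxxxx */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {                             /* two units: 110xxxxxxxxxxxxx 10xxxxxxxxxxxxxx */
        out[0] = (uint16_t)(0xC000 | (cp >> 14));     /* high payload bits (at most 7 used) */
        out[1] = (uint16_t)(0x8000 | (cp & 0x3FFF));  /* low 14 payload bits */
        return 2;
    }
    return 0;
}

int main(void)
{
    uint16_t u[2] = {0, 0};
    int n = emf16_encode(0x1F643, u);                 /* the question's '🙃' */
    printf("%d units: 0x%04X 0x%04X\n", n, (unsigned)u[0], (unsigned)u[1]);
    /* prints: 2 units: 0xC007 0xB643 */
    return 0;
}

Compiling and running this reproduces the two units shown above, 0xC007 and 0xB643.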

Why wasn't such a scheme used in UTF-16?

Such a scheme could be made to work. It is unambiguous. It could accommodate many more characters than Unicode currently allows. UTF-8 could be modified to become UTF-EMF-8 so that it could handle the same extended range, with some characters needing 5 bytes instead of the current maximum of 4 bytes. UTF-EMF-8 with 5 bytes would encode up to 26 bits (2 payload bits in the lead byte plus 4 × 6 in the continuation bytes); UTF-EMF-16 could encode 27 bits (13 + 14), but should be limited to 26 bits (roughly 64 million code points, instead of just over 1 million). So, why wasn't it, or something very similar, adopted?

The answer is the very common one – history (plus backwards compatibility).

When Unicode was first defined, it was hoped or believed that a 16-bit code set would be sufficient. The UCS2 encoding was developed using 16-bit values, and many values in the range 0x8000 .. 0xFFFF were given meanings. For example, U+FEFF is the byte order mark.

When Unicode had to be extended into a bigger code set, there were already many defined characters with the 10 and 110 bit patterns in their most significant bits, so backwards compatibility meant that the UTF-EMF-16 scheme outlined above could not be used for UTF-16 without breaking compatibility with UCS2, which would have been a serious problem.

Consequently, the standardizers chose an alternative scheme, where there are high surrogates and low surrogates.

0xD800 .. 0xDBFF   High surrogates (most significant bits of 21-bit value)
0xDC00 .. 0xDFFF   Low surrogates (least significant bits of 21-bit value)

The low surrogate range provides storage for 10 bits of data (the prefix 1101 11 uses 6 of the 16 bits), and the high surrogate range likewise provides storage for 10 bits of data (the prefix 1101 10 also uses 6 of the 16 bits). But because the BMP (Basic Multilingual Plane, U+0000 .. U+FFFF) never needs to be encoded with two 16-bit units, the UTF-16 encoding subtracts 1 from the high-order bits of the code point (equivalently, it subtracts 0x10000 from the code point before splitting it into two 10-bit halves), so the 20 payload bits cover exactly U+10000 .. U+10FFFF. (Note that although Unicode is a 21-bit code, not all 21-bit unsigned numbers are valid Unicode code points: values from 0x110000 .. 0x1FFFFF fit in 21 bits but are not part of Unicode.)
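
The same arithmetic is often written in the equivalent "subtract 0x10000" form: subtract 0x10000 from the code point to get a 20-bit value, then split that into two 10-bit halves. Here is a minimal C sketch using the question's code point (this is just an illustration of the standard UTF-16 algorithm, not code taken from the Unicode FAQ quoted below):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t cp = 0x1F643;                            /* '🙃' */
    uint32_t v  = cp - 0x10000;                       /* 20-bit value 0x0F643 */
    uint16_t hi = (uint16_t)(0xD800 | (v >> 10));     /* high 10 bits -> high surrogate */
    uint16_t lo = (uint16_t)(0xDC00 | (v & 0x3FF));   /* low 10 bits  -> low surrogate */
    printf("U+%04X -> 0x%04X 0x%04X\n", (unsigned)cp, (unsigned)hi, (unsigned)lo);
    /* prints: U+1F643 -> 0xD83D 0xDE43 */
    return 0;
}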

From the Unicode FAQ — UTF-8, UTF-16, UTF-32 & BOM:

Q: What’s the algorithm to convert from UTF-16 to character codes?

A: The Unicode Standard used to contain a short algorithm, now there is just a bit distribution table. Here are three short code snippets that translate the information from the bit distribution table into C code that will convert to and from UTF-16.

Using the following type definitions

typedef unsigned int16 UTF16;
typedef unsigned int32 UTF32;

the first snippet calculates the high (or leading) surrogate from a character code C.

const UTF16 HI_SURROGATE_START = 0xD800;

UTF16 X = (UTF16) C;
UTF32 U = (C >> 16) & ((1 << 5) - 1);
UTF16 W = (UTF16) U - 1;
UTF16 HiSurrogate = HI_SURROGATE_START | (W << 6) | X >> 10;

where X, U and W correspond to the labels used in Table 3-5 UTF-16 Bit Distribution. The next snippet does the same for the low surrogate.

const UTF16 LO_SURROGATE_START = 0xDC00;

UTF16 X = (UTF16) C;
UTF16 LoSurrogate = (UTF16) (LO_SURROGATE_START | X & ((1 << 10) - 1));

Finally, the reverse, where hi and lo are the high and low surrogate, and C the resulting character

UTF32 X = (hi & ((1 << 6) -1)) << 10 | lo & ((1 << 10) -1);
UTF32 W = (hi >> 6) & ((1 << 5) - 1);
UTF32 U = W + 1;

UTF32 C = U << 16 | X;

A caller would need to ensure that C, hi, and lo are in the appropriate ranges.
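
Putting the FAQ snippets together into a self-contained, compilable program (I've substituted the standard <stdint.h> types for the FAQ's placeholder unsigned int16/unsigned int32 typedefs, and the function names to_surrogates and from_surrogates are my own; the bit manipulation is taken directly from the snippets above):

#include <stdint.h>
#include <stdio.h>

typedef uint16_t UTF16;
typedef uint32_t UTF32;

static const UTF16 HI_SURROGATE_START = 0xD800;
static const UTF16 LO_SURROGATE_START = 0xDC00;

/* Split a supplementary code point C (U+10000 .. U+10FFFF) into a surrogate pair. */
static void to_surrogates(UTF32 C, UTF16 *hi, UTF16 *lo)
{
    UTF16 X = (UTF16) C;                       /* low 16 bits of the code point */
    UTF32 U = (C >> 16) & ((1 << 5) - 1);      /* high 5 bits */
    UTF16 W = (UTF16) (U - 1);                 /* the "minus one" adjustment */
    *hi = (UTF16) (HI_SURROGATE_START | (W << 6) | (X >> 10));
    *lo = (UTF16) (LO_SURROGATE_START | (X & ((1 << 10) - 1)));
}

/* Recombine a surrogate pair into the original code point. */
static UTF32 from_surrogates(UTF16 hi, UTF16 lo)
{
    UTF32 X = ((UTF32)(hi & ((1 << 6) - 1)) << 10) | (UTF32)(lo & ((1 << 10) - 1));
    UTF32 W = (hi >> 6) & ((1 << 5) - 1);
    UTF32 U = W + 1;
    return (U << 16) | X;
}

int main(void)
{
    UTF16 hi, lo;
    to_surrogates(0x1F643, &hi, &lo);
    printf("0x%04X 0x%04X -> U+%04X\n", (unsigned)hi, (unsigned)lo,
           (unsigned)from_surrogates(hi, lo));
    /* prints: 0xD83D 0xDE43 -> U+1F643 */
    return 0;
}

As the FAQ notes, real code would first check that C is in U+10000 .. U+10FFFF and that hi and lo really are a high and a low surrogate.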

  • Thanks for your time. So it's all about "backwards compatibility", and with my scheme one could not tell whether a code unit encodes a single character or is part of a surrogate pair. But I wonder: what if two consecutive single-character code units begin with the same marker as a surrogate pair? Commented Jan 23, 2020 at 8:15
  • Your scheme could be used and would be unambiguous — but it wouldn't be backwards compatible with UCS2 whereas UTF-16 is backwards compatible. Yes, lots of odd decisions are made because of 'backwards compatibility'. It's a very common reason for what might otherwise be deemed 'non-obvious' design decisions. Commented Jan 23, 2020 at 8:41
  • You've misencoded the data. In the hypothetical UTF-EMF-16 encoding, decimal 55357 would be encoded with two units — as would any other Unicode character from U+8000 (hex) upwards. So, too, would decimal 56899 (and, indeed, U+55357 and U+56899, in hex). Further, decimal 55357 is U+D83D and decimal 56899 is U+DE43. That's a high surrogate and a low surrogate in UTF-16. Those are reserved, permanently unassigned code points under the UTF-EMF-16 encoding. No valid UTF-EMF-16 data would ever encode any of the surrogates U+D800 .. U+DFFF (any more than valid UTF-8 ever contains encoded surrogates). Commented Jan 23, 2020 at 16:30
  • I finally understand why Unicode introduced surrogate pairs. Thanks for that answer. Commented Mar 14, 2021 at 10:56
  • I was struggling to understand how 2*10 bits could store 21 bits in surrogate pairs; you explained this perfectly with the addition hack, big thanks! Commented Dec 12, 2021 at 14:47
