I have been working on string encoding schemes, and while examining how UTF-16 works, I have a question. Why use complex surrogate pairs to represent a 21-bit code point? Why not simply store some of the bits in the first code unit and the remaining bits in the second code unit? Am I missing something? Is there a problem with storing the bits more directly, the way UTF-8 does?

Example of what I am thinking of:

The character '🙃'

Corresponding code point: 128579 (decimal)
The binary form: 1 1111 0110 0100 0011 (17 bits)
It's a 17-bit code point.

  • Based on the UTF-8 scheme, it would be represented as:

    240 : 11110 000
    159 : 10 011111
    153 : 10 011001
    131 : 10 000011
    
  • In UTF-16, why not do something like this rather than using surrogate pairs:

    49159 : 110 0 0000 0000 0111
    30275 : 01 11 0110 0100 0011
    
  • Which bits do you store in the first unit? Which bits do you store in the second unit? How do you know that what you're dealing with is a 21-bit number, not two 16-bit numbers (code points)? You have to be able to determine unambiguously when you have a surrogate pair. That's what the UTF-16 encoding does. It's also why the maximum Unicode code point is U+10FFFF — you can't encode anything bigger with just two surrogates with the scheme as presently designed. Commented Jan 21, 2020 at 7:52
  • Also, you are mistaken about UTF-8; it uses a similar scheme to store 21-bit numbers in 8-bit chars. In fact, UTF-8 is more complex than UTF-16, because it uses four different encodings (with four different lengths) for the various codepoints, rather than only two different ones as UTF-16 does.
    – Mr Lister
    Commented Jan 21, 2020 at 8:36
  • Please look at the question again, I have edited it. Commented Jan 21, 2020 at 13:54
  • There's also a historical dimension: UTF-16 (then called UCS2) was already in use when it was decided that 2 bytes (65536 code points) wasn't enough. At that point, the surrogate pair solution was invented and they reserved two ranges that were still unused. If you think UTF-16 is intricate, the answer might be that it wasn't designed like this from the beginning.
    – lenz
    Commented Jan 21, 2020 at 20:01
  • @MrLister... wow, this is the first time I've heard the claim that UTF-8 is more complex than UTF-16 :) I guess that depends on your definition of the word "complex"
    – JoelFan
    Commented Jan 21, 2020 at 21:19

1 Answer

Proposed alternative to UTF-16

I think you're proposing an alternative format using 16-bit code units analogous to the UTF-8 code scheme — let's designate it UTF-EMF-16.

In your UTF-EMF-16 scheme, code points from U+0000 to U+7FFF would be encoded as a single 16-bit unit with the MSB (most significant bit) always zero. You'd reserve 16-bit units with the two most significant bits set to 10 as 'continuation units', carrying 14 bits of payload data. And you'd encode code points from U+8000 to U+10FFFF (the current maximum Unicode code point) as a lead unit with the three most significant bits set to 110 and up to 13 bits of payload data, followed by one continuation unit. With Unicode as currently defined (U+0000 .. U+10FFFF), you'd never need more than 7 of those 13 bits.

U+0000 .. U+7FFF    — One 16-bit unit: values 0x0000 .. 0x7FFF
U+8000 .. U+10FFFF  — Two 16-bit units:
                      1. First unit  0xC000 .. 0xC043
                      2. Second unit 0x8000 .. 0xBFFF

For your example code point, U+1F643 (binary: 1 1111 0110 0100 0011):

First unit:  1100 0000 0000 0111 = 0xC007
Second unit: 1011 0110 0100 0011 = 0xB643

The second unit differs from your example in reversing the two most significant bits, from 01 in your example to 10 in mine.
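
To make the bit layout concrete, here is a minimal C sketch of an encoder for this hypothetical UTF-EMF-16 scheme (the function name emf16_encode and its interface are made up for illustration; nothing here is part of any standard):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical UTF-EMF-16 encoder: writes 1 or 2 16-bit units to out and
   returns how many were written, or 0 for values outside U+0000 .. U+10FFFF. */
static int emf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp <= 0x7FFF) {                               /* one unit: 0xxxxxxxxxxxxxxx */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp <= 0x10FFFF) {                             /* two units: 110xxxxxxxxxxxxx 10xxxxxxxxxxxxxx */
        out[0] = (uint16_t)(0xC000 | (cp >> 14));     /* high payload bits (at most 7 used) */
        out[1] = (uint16_t)(0x8000 | (cp & 0x3FFF));  /* low 14 payload bits */
        return 2;
    }
    return 0;
}

int main(void)
{
    uint16_t u[2] = {0, 0};
    int n = emf16_encode(0x1F643, u);                 /* the question's '🙃' */
    printf("%d units: 0x%04X 0x%04X\n", n, (unsigned)u[0], (unsigned)u[1]);
    /* prints: 2 units: 0xC007 0xB643 */
    return 0;
}

Compiling and running this reproduces the two units shown above, 0xC007 and 0xB643.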

Why wasn't such a scheme used in UTF-16?

Such a scheme could be made to work. It is unambiguous. It could accommodate many more characters than Unicode currently allows. UTF-8 could be modified to become UTF-EMF-8 so that it could handle the same extended range, with some characters needing 5 bytes instead of the current maximum of 4 bytes. UTF-EMF-8 with 5 bytes would encode up to 26 bits (2 payload bits in the lead byte plus 4 × 6 in the continuation bytes); UTF-EMF-16 could encode 27 bits (13 + 14), but should be limited to 26 bits (roughly 64 million code points, instead of just over 1 million). So, why wasn't it, or something very similar, adopted?

The answer is the very common one – history (plus backwards compatibility).

When Unicode was first defined, it was hoped or believed that a 16-bit code set would be sufficient. The UCS2 encoding was developed using 16-bit values, and many values in the range 0x8000 .. 0xFFFF were given meanings. For example, U+FEFF is the byte order mark.

When Unicode had to be extended into a bigger code set, there were already many defined characters with the 10 and 110 bit patterns in their most significant bits, so backwards compatibility meant that the UTF-EMF-16 scheme outlined above could not be used for UTF-16 without breaking compatibility with UCS2, which would have been a serious problem.

Consequently, the standardizers chose an alternative scheme, where there are high surrogates and low surrogates.

0xD800 .. 0xDBFF   High surrogates (most significant bits of 21-bit value)
0xDC00 .. 0xDFFF   Low surrogates (least significant bits of 21-bit value)

The low surrogate range provides storage for 10 bits of data (the prefix 1101 11 uses 6 of the 16 bits), and the high surrogate range likewise provides storage for 10 bits of data (the prefix 1101 10 also uses 6 of the 16 bits). But because the BMP (Basic Multilingual Plane, U+0000 .. U+FFFF) never needs to be encoded with two 16-bit units, the UTF-16 encoding subtracts 1 from the high-order bits of the code point (equivalently, it subtracts 0x10000 from the code point before splitting it into two 10-bit halves), so the 20 payload bits cover exactly U+10000 .. U+10FFFF. (Note that although Unicode is a 21-bit code, not all 21-bit unsigned numbers are valid Unicode code points: values from 0x110000 .. 0x1FFFFF fit in 21 bits but are not part of Unicode.)
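
The same arithmetic is often written in the equivalent "subtract 0x10000" form: subtract 0x10000 from the code point to get a 20-bit value, then split that into two 10-bit halves. Here is a minimal C sketch using the question's code point (this is just an illustration of the standard UTF-16 algorithm, not code taken from the Unicode FAQ quoted below):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t cp = 0x1F643;                            /* '🙃' */
    uint32_t v  = cp - 0x10000;                       /* 20-bit value 0x0F643 */
    uint16_t hi = (uint16_t)(0xD800 | (v >> 10));     /* high 10 bits -> high surrogate */
    uint16_t lo = (uint16_t)(0xDC00 | (v & 0x3FF));   /* low 10 bits  -> low surrogate */
    printf("U+%04X -> 0x%04X 0x%04X\n", (unsigned)cp, (unsigned)hi, (unsigned)lo);
    /* prints: U+1F643 -> 0xD83D 0xDE43 */
    return 0;
}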

From the Unicode FAQ — UTF-8, UTF-16, UTF-32 & BOM:

Q: What’s the algorithm to convert from UTF-16 to character codes?

A: The Unicode Standard used to contain a short algorithm, now there is just a bit distribution table. Here are three short code snippets that translate the information from the bit distribution table into C code that will convert to and from UTF-16.

Using the following type definitions

typedef unsigned int16 UTF16;
typedef unsigned int32 UTF32;

the first snippet calculates the high (or leading) surrogate from a character code C.

const UTF16 HI_SURROGATE_START = 0xD800;

UTF16 X = (UTF16) C;
UTF32 U = (C >> 16) & ((1 << 5) - 1);
UTF16 W = (UTF16) U - 1;
UTF16 HiSurrogate = HI_SURROGATE_START | (W << 6) | X >> 10;

where X, U and W correspond to the labels used in Table 3-5 UTF-16 Bit Distribution. The next snippet does the same for the low surrogate.

const UTF16 LO_SURROGATE_START = 0xDC00;

UTF16 X = (UTF16) C;
UTF16 LoSurrogate = (UTF16) (LO_SURROGATE_START | X & ((1 << 10) - 1));

Finally, the reverse, where hi and lo are the high and low surrogate, and C the resulting character

UTF32 X = (hi & ((1 << 6) -1)) << 10 | lo & ((1 << 10) -1);
UTF32 W = (hi >> 6) & ((1 << 5) - 1);
UTF32 U = W + 1;

UTF32 C = U << 16 | X;

A caller would need to ensure that C, hi, and lo are in the appropriate ranges.
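
Putting the FAQ snippets together into a self-contained, compilable program (I've substituted the standard <stdint.h> types for the FAQ's placeholder unsigned int16/unsigned int32 typedefs, and the function names to_surrogates and from_surrogates are my own; the bit manipulation is taken directly from the snippets above):

#include <stdint.h>
#include <stdio.h>

typedef uint16_t UTF16;
typedef uint32_t UTF32;

static const UTF16 HI_SURROGATE_START = 0xD800;
static const UTF16 LO_SURROGATE_START = 0xDC00;

/* Split a supplementary code point C (U+10000 .. U+10FFFF) into a surrogate pair. */
static void to_surrogates(UTF32 C, UTF16 *hi, UTF16 *lo)
{
    UTF16 X = (UTF16) C;                       /* low 16 bits of the code point */
    UTF32 U = (C >> 16) & ((1 << 5) - 1);      /* high 5 bits */
    UTF16 W = (UTF16) (U - 1);                 /* the "minus one" adjustment */
    *hi = (UTF16) (HI_SURROGATE_START | (W << 6) | (X >> 10));
    *lo = (UTF16) (LO_SURROGATE_START | (X & ((1 << 10) - 1)));
}

/* Recombine a surrogate pair into the original code point. */
static UTF32 from_surrogates(UTF16 hi, UTF16 lo)
{
    UTF32 X = ((UTF32)(hi & ((1 << 6) - 1)) << 10) | (UTF32)(lo & ((1 << 10) - 1));
    UTF32 W = (hi >> 6) & ((1 << 5) - 1);
    UTF32 U = W + 1;
    return (U << 16) | X;
}

int main(void)
{
    UTF16 hi, lo;
    to_surrogates(0x1F643, &hi, &lo);
    printf("0x%04X 0x%04X -> U+%04X\n", (unsigned)hi, (unsigned)lo,
           (unsigned)from_surrogates(hi, lo));
    /* prints: 0xD83D 0xDE43 -> U+1F643 */
    return 0;
}

As the FAQ notes, real code would first check that C is in U+10000 .. U+10FFFF and that hi and lo really are a high and a low surrogate.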

  • Thanks for your time. So it's all about "backwards compatibility", and with my scheme one could not tell whether a code unit encodes a single character or is part of a surrogate pair. But I wonder: what if two consecutive single-character code units begin with the same marker as a surrogate pair? Commented Jan 23, 2020 at 8:15
  • Your scheme could be used and would be unambiguous — but it wouldn't be backwards compatible with UCS2 whereas UTF-16 is backwards compatible. Yes, lots of odd decisions are made because of 'backwards compatibility'. It's a very common reason for what might otherwise be deemed 'non-obvious' design decisions. Commented Jan 23, 2020 at 8:41
  • You've misencoded the data. In the hypothetical UTF-EMF-16 encoding, decimal 55357 would be encoded with two units — as would any other Unicode character from U+8000 (hex) upwards. So, too, would decimal 56899 (and, indeed, U+55357 and U+56899, in hex). Further, decimal 55357 is U+D83D and decimal 56899 is U+DE43. That's a high surrogate and a low surrogate in UTF-16. Those are reserved, permanently unassigned code points under the UTF-EMF-16 encoding. No valid UTF-EMF-16 data would ever encode any of the surrogates U+D800 .. U+DFFF (any more than valid UTF-8 ever contains encoded surrogates). Commented Jan 23, 2020 at 16:30
  • I finally understand why Unicode introduced surrogate pairs. Thanks for that answer. Commented Mar 14, 2021 at 10:56
  • I was struggling to understand how 2*10 bits could store 21 bits in surrogate pairs; you explained this perfectly with the addition hack, big thanks! Commented Dec 12, 2021 at 14:47
