Proposed alternative to UTF-16
I think you're proposing an alternative format using 16-bit code units analogous to the UTF-8 code scheme — let's designate it UTF-EMF-16.
In your UTF-EMF-16 scheme, code points from U+0000 to U+7FFF would be encoded as a single 16-bit unit with the MSB (most significant bit) always zero. Then, you'd reserve 16-bit units with the two most significant bits set to 10 as 'continuation units', with 14 bits of payload data. And then you'd encode code points from U+8000 to U+10FFFF (the current maximum Unicode code point) in 16-bit units with the three most significant bits set to 110 and up to 13 bits of payload data. With Unicode as currently defined (U+0000 .. U+10FFFF), you'd never need more than 7 of the 13 bits set.
U+0000 .. U+7FFF — One 16-bit unit: values 0x0000 .. 0x7FFF
U+8000 .. U+10FFFF — Two 16-bit units:
1. First unit 0xC000 .. 0xC043
2. Second unit 0x8000 .. 0xBFFF
For your example code point, U+1F643 (binary: 1 1111 0110 0100 0011):
First unit: 1100 0000 0000 0111 = 0xC007
Second unit: 1011 0110 0100 0011 = 0xB643
The second unit differs from your example in reversing the two most significant bits, from 01 in your example to 10 in mine.
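To make the scheme concrete, here is a minimal sketch of an encoder for this hypothetical UTF-EMF-16 format; the function name emf16_encode and the range check are my own assumptions for illustration, not part of your proposal or any standard.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical UTF-EMF-16 encoder: one unit for U+0000..U+7FFF, or a
 * 110 lead unit (high 13 bits) plus a 10 continuation unit (low 14 bits)
 * for U+8000..U+10FFFF. Returns the number of units written, 0 on error. */
static int emf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp <= 0x7FFF)
    {
        out[0] = (uint16_t)cp;                        /* 0xxx xxxx xxxx xxxx */
        return 1;
    }
    if (cp <= 0x10FFFF)
    {
        out[0] = (uint16_t)(0xC000 | (cp >> 14));     /* 110 + high 13 bits */
        out[1] = (uint16_t)(0x8000 | (cp & 0x3FFF));  /* 10 + low 14 bits */
        return 2;
    }
    return 0;                                         /* outside Unicode's range */
}

int main(void)
{
    uint16_t units[2];
    int n = emf16_encode(0x1F643, units);
    for (int i = 0; i < n; i++)
        printf("0x%.4X\n", (unsigned)units[i]);       /* prints 0xC007 then 0xB643 */
    return 0;
}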
Why wasn't such a scheme used in UTF-16?
Such a scheme could be made to work. It is unambiguous. It could accommodate many more characters than Unicode currently allows. UTF-8 could be modified to become UTF-EMF-8 so that it could handle the same extended range, with some characters needing 5 bytes instead of the current maximum of 4 bytes. UTF-EMF-8 with 5 bytes would encode up to 26 bits; UTF-EMF-16 could encode 27 bits, but should be limited to 26 bits (roughly 64 million code points, instead of just over 1 million). So, why wasn't it, or something very similar, adopted?
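As a rough sketch only, the hypothetical 5-byte UTF-EMF-8 form could follow the familiar UTF-8 pattern of a 111110xx lead byte and 10xxxxxx continuation bytes, giving 2 + 4 × 6 = 26 payload bits; the function name below is made up for illustration.

#include <stdint.h>

/* Hypothetical 5-byte UTF-EMF-8 form: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx,
 * i.e. 2 payload bits in the lead byte plus 4 * 6 = 24 more, 26 bits in total. */
static void emf8_encode5(uint32_t cp, unsigned char out[5])
{
    out[0] = (unsigned char)(0xF8 | ((cp >> 24) & 0x03));
    out[1] = (unsigned char)(0x80 | ((cp >> 18) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[3] = (unsigned char)(0x80 | ((cp >>  6) & 0x3F));
    out[4] = (unsigned char)(0x80 | ( cp        & 0x3F));
}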
The answer is the very common one – history (plus backwards compatibility).
When Unicode was first defined, it was hoped or believed that a 16-bit code set would be sufficient. The UCS2 encoding was developed using 16-bit values, and many values in the range 0x8000 .. 0xFFFF were given meanings. For example, U+FEFF is the byte order mark.
When the Unicode scheme had to be extended to make Unicode into a bigger code set, there were many defined characters with the 10 and 110 bit patterns in the most significant bits, so backwards compatibility meant that the UTF-EMF-16 scheme outlined above could not be used for UTF-16 without breaking compatibility with UCS2, which would have been a serious problem.
Consequently, the standardizers chose an alternative scheme, where there are high surrogates and low surrogates.
0xD800 .. 0xDBFF High surrogates (most significant bits of 21-bit value)
0xDC00 .. 0xDFFF Low surrogates (less significant bits of 21-bit value)
The low surrogates range provides storage for 10 bits of data — the prefix 1101 11 uses 6 of 16 bits. The high surrogates range also provides storage for 10 bits of data — the prefix 1101 10 also uses 6 of 16 bits. But because the BMP (Basic Multilingual Plane — U+0000 .. U+FFFF) doesn't need to be encoded with two 16-bit units, the UTF-16 encoding subtracts 1 from the high order data, and can therefore be used to encode U+10000 .. U+10FFFF. (Note that although Unicode is a 21-bit encoding, not all 21-bit (unsigned) numbers are valid Unicode code points. Values from 0x110000 .. 0x1FFFFF are 21-bit numbers but are not a part of Unicode.)
From the Unicode FAQ — UTF-8, UTF-16, UTF-32 & BOM:
Q: What’s the algorithm to convert from UTF-16 to character codes?
A: The Unicode Standard used to contain a short algorithm, now there is just a bit distribution table. Here are three short code snippets that translate the information from the bit distribution table into C code that will convert to and from UTF-16.
Using the following type definitions
typedef unsigned int16 UTF16;
typedef unsigned int32 UTF32;
the first snippet calculates the high (or leading) surrogate from a character code C.
const UTF16 HI_SURROGATE_START = 0xD800
UTF16 X = (UTF16) C;
UTF32 U = (C >> 16) & ((1 << 5) - 1);
UTF16 W = (UTF16) U - 1;
UTF16 HiSurrogate = HI_SURROGATE_START | (W << 6) | X >> 10;
where X, U and W correspond to the labels used in Table 3-5 UTF-16 Bit Distribution. The next snippet does the same for the low surrogate.
const UTF16 LO_SURROGATE_START = 0xDC00
UTF16 X = (UTF16) C;
UTF16 LoSurrogate = (UTF16) (LO_SURROGATE_START | X & ((1 << 10) - 1));
Finally, the reverse, where hi and lo are the high and low surrogate, and C the resulting character
UTF32 X = (hi & ((1 << 6) -1)) << 10 | lo & ((1 << 10) -1);
UTF32 W = (hi >> 6) & ((1 << 5) - 1);
UTF32 U = W + 1;
UTF32 C = U << 16 | X;
A caller would need to ensure that C, hi, and lo are in the appropriate ranges.
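The snippets above are fragments as quoted from the FAQ (for instance, unsigned int16 and unsigned int32 are placeholder type names rather than standard C types). As a hedged, self-contained illustration of the same bit distribution, using the fixed-width types from <stdint.h> and function names of my own choosing, a complete round trip for the example code point might look like this:

#include <stdio.h>
#include <stdint.h>

/* Build the surrogate pair for a supplementary code point C
 * (U+10000 .. U+10FFFF), following the FAQ's bit distribution. */
static void to_surrogates(uint32_t C, uint16_t *hi, uint16_t *lo)
{
    uint16_t X = (uint16_t)C;                  /* low 16 bits of C */
    uint32_t U = (C >> 16) & ((1 << 5) - 1);   /* bits 16..20 of C */
    uint16_t W = (uint16_t)(U - 1);            /* the FAQ's "subtract 1" step */

    *hi = (uint16_t)(0xD800 | (W << 6) | (X >> 10));
    *lo = (uint16_t)(0xDC00 | (X & ((1 << 10) - 1)));
}

/* Reassemble the code point from a high/low surrogate pair. */
static uint32_t from_surrogates(uint16_t hi, uint16_t lo)
{
    uint32_t X = ((uint32_t)(hi & ((1 << 6) - 1)) << 10) | (lo & ((1 << 10) - 1));
    uint32_t W = (hi >> 6) & ((1 << 5) - 1);
    uint32_t U = W + 1;

    return (U << 16) | X;
}

int main(void)
{
    uint16_t hi, lo;
    to_surrogates(0x1F643, &hi, &lo);
    printf("U+1F643 -> 0x%.4X 0x%.4X -> U+%X\n",
           (unsigned)hi, (unsigned)lo, (unsigned)from_surrogates(hi, lo));
    return 0;
}

Running this should print U+1F643 -> 0xD83D 0xDE43 -> U+1F643, matching what the bit distribution table gives for the example code point.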