10

The current version of UTF-16 is only capable of encoding 1,112,064 different code points: the range 0x0-0x10FFFF (1,114,112 values) minus the 2,048 surrogate values.

Does the Unicode Consortium intend to make UTF-16 run out of characters?

i.e., assign a code point > 0x10FFFF?

If not, why would anyone write a UTF-8 parser that can accept 5- or 6-byte sequences? It would only add unnecessary instructions to the function.

Isn't 1,112,064 enough? Do we actually need MORE characters? I mean: how quickly are we running out?

9
  • 3
    I happen to know a utf8-loose parser that accepts 13-byte code points. This is not unuseful. Obviously this process doesn’t give a fart about UTF-16, which is a very unfortunate legacy we’d all like to forget, since it incorporates the worst disadvantages of both UTF-8 and UTF-32 without enjoying any of the advantages of either: UTF-16 is truly the worst of both worlds. But make no mistake: any strict UTF-8 parser must reject code points over 4 bytes in encoded length. This is to kiss UTF-16’s sweet you know what.
    – tchrist
    Commented Feb 22, 2012 at 0:31
  • 1
    Wake me up when they discover a new civilization with a non-alphabetic writing system. Commented Feb 22, 2012 at 1:09
  • 7
    @HansPassant Time to wake up! Alphabets are just one of the forms that human writing takes. There are also syllabaries and logograms. Bazillions of logograms. CJK Extension E is nearly ready, and that has 6,000 new characters in it — not one of which has anything to do with an “alphabet”.
    – tchrist
    Commented Feb 22, 2012 at 1:44
  • 1
    @GlassGhost By logograms tchrist meant Chinese characters. I don't believe anyone supports all Unicode characters; if you're making a font, feel free to exclude whatever characters you want. By sheer count, the few hundred emoji that were new to Unicode aren't that major, especially when compared to the tens of thousands of Chinese characters being encoded.
    – prosfilaes
    Commented Aug 19, 2012 at 21:36
  • 1
    @GlassGhost He did say what he meant; for example, the Encyclopedic Dictionary of Archaeology says "Writing systems that make use of logograms include Chinese, Egyptian hieroglyphic writing, and early cuneiform writing systems."
    – prosfilaes
    Commented Aug 21, 2012 at 12:02

4 Answers

9

As of 2011 we have consumed 109,449 characters and set aside another 137,468 code points for application use (6,400 + 131,068).

That leaves 1,112,064 - 109,449 - 137,468 = 865,147 unassigned code points, i.e. room for over 860,000 unused characters: plenty for CJK Extension E (~10,000 characters) and 85 more sets just like it, so in the event of contact with the Ferengi culture, we should be ready.

In November 2003 the IETF restricted UTF-8 to end at U+10FFFF with RFC 3629, in order to match the constraints of the UTF-16 character encoding: a UTF-8 parser should not accept 5- or 6-byte sequences that would overflow the UTF-16 range, nor 4-byte sequences that decode to code points greater than 0x10FFFF.
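
As a rough illustration (not from the answer; a minimal sketch in C with a made-up function name), a strict RFC 3629 decoder rejects exactly those cases: lead bytes that would start 5- or 6-byte sequences, surrogates, and anything that decodes above U+10FFFF:

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one UTF-8 sequence from s (n bytes available) into *out.
     * Returns the sequence length (1-4), or -1 for anything RFC 3629
     * forbids: 5/6-byte lead bytes, truncated or overlong sequences,
     * UTF-16 surrogates, and values above U+10FFFF. */
    static int utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *out)
    {
        if (n == 0) return -1;

        uint32_t cp;
        int len;

        if      (s[0] < 0x80)           { cp = s[0];        len = 1; }
        else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
        else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
        else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
        else return -1;   /* 0xF8-0xFD would start a 5- or 6-byte form */

        if ((size_t)len > n) return -1;  /* truncated sequence */

        for (int i = 1; i < len; i++) {
            if ((s[i] & 0xC0) != 0x80) return -1;  /* bad continuation byte */
            cp = (cp << 6) | (s[i] & 0x3F);
        }

        /* Reject overlong encodings (value too small for its length). */
        static const uint32_t min_for_len[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };
        if (cp < min_for_len[len]) return -1;

        /* Reject UTF-16 surrogates and anything past the UTF-16-imposed cap. */
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return -1;

        *out = cp;
        return len;
    }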

Please put edits here listing character sets that threaten the size of the Unicode code point limit, if they are over 1/3 the size of CJK Extension E (~10,000 characters):

1
  • 1
    +1 for Ferengi (and being the most descriptive)
    – Izkata
    Commented Mar 12, 2012 at 15:43
2

At present, the Unicode standard doesn't define any characters above U+10FFFF, so you would be fine coding your app to reject characters above that point.

Predicting the future is hard, but I think you're safe for the near term with this strategy. Honestly, even if Unicode extends past U+10FFFF in the distant future, it almost certainly won't be for mission critical glyphs. Your app might not be compatible with the new Ferengi fonts that come out in 2063, but you can always fix it when it actually becomes an issue.
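
For example, a minimal check for that strategy might look like the following sketch (C; the helper name is made up, not from the answer):

    #include <stdbool.h>
    #include <stdint.h>

    /* True only for values the standard could ever assign as characters:
     * at most U+10FFFF and not a UTF-16 surrogate. */
    static bool is_valid_scalar_value(uint32_t cp)
    {
        return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
    }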

7
  • I dunno, the Star Trek buffs might get mad?? But shouldn't we still have room with that? I think 1,112,064 is a LOT of damn characters. I'm used to English, and with ASCII and all the math symbols and Greek symbols I can think of we only have like 512.
    – GlassGhost
    Commented Feb 21, 2012 at 20:23
  • 3
    Sure, but basic Japanese at a high school level has several thousand. Chinese, more still. Some languages just have more glyphs than others. Still, I agree that one million glyphs ought to stretch a long way. Commented Feb 21, 2012 at 23:44
  • I also agree that one million glyphs ought to stretch a long way.
    – GlassGhost
    Commented Feb 24, 2012 at 15:48
  • 6
    @GlassGhost: Sure, and 640 kilobytes of memory is enough for anyone. Commented Feb 24, 2012 at 18:56
  • 1
    To be fair, human languages aren't affected by Moore's law--and thank goodness for that!! Commented Feb 24, 2012 at 19:49
2

Cutting to the chase:

It is indeed intentional that the encoding system only supports code points up to U+10FFFF.

It does not appear that there is any real risk of running out any time soon.

4
  • It's 10FFFF, not "10FFF", and the already-accepted answer implies all of the above.
    – GlassGhost
    Commented Mar 1, 2012 at 19:39
  • I typoed, obviously. Also, the other answer was not yet accepted when I posted it.
    – Perry
    Commented Mar 1, 2012 at 19:43
  • Actually it was accepted, like a whole week before you posted this; hover your mouse over the accepted checkmark if you don't believe me.
    – GlassGhost
    Commented Mar 1, 2012 at 20:21
  • My mistake. For some reason this popped up as a new question and I made a bad assumption.
    – Perry
    Commented Mar 1, 2012 at 20:49
0

There is no reason to write a UTF-8 parser that supports 5-6 byte sequences, except to support any legacy systems that actually used them. The current official UTF-8 specification does not allow 5-6 byte sequences, in order to accommodate 100% lossless conversions to/from UTF-16. If there is ever a time that Unicode has to support new code points above U+10FFFF, there will be plenty of time to devise new encoding formats for the higher bit counts. Or maybe by the time that happens, memory and computational power will be sufficient that everyone will just switch to UTF-32 for everything, whose 32-bit code units could in principle represent values up to 0xFFFFFFFF, i.e. over 4 billion characters.
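
For reference, here is a sketch (C, illustrative only; the function name is made up) of what those legacy 5- and 6-byte forms covered under the original, pre-RFC 3629 scheme (RFC 2279), and why a modern parser turns them away:

    /* Sequence length implied by a lead byte under the old RFC 2279 scheme:
     *   1 byte  : U+0000    - U+007F
     *   2 bytes : U+0080    - U+07FF
     *   3 bytes : U+0800    - U+FFFF
     *   4 bytes : U+10000   - U+1FFFFF
     *   5 bytes : U+200000  - U+3FFFFFF
     *   6 bytes : U+4000000 - U+7FFFFFFF
     * RFC 3629 cuts this off after 4 bytes and caps 4-byte forms at U+10FFFF. */
    static int utf8_legacy_sequence_length(unsigned char lead)
    {
        if (lead < 0x80)           return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        if ((lead & 0xFC) == 0xF8) return 5;   /* invalid under RFC 3629 */
        if ((lead & 0xFE) == 0xFC) return 6;   /* invalid under RFC 3629 */
        return -1;  /* continuation byte, or 0xFE/0xFF, never valid */
    }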

2
  • 1
    That’s not exactly true. There are systems that use a modified version of the UTF-8 algorithm to allow for non-Unicode code points up to 2⁷²−1. So long as cooperating processes do not pretend these so-called ‘hypers’ are actual Unicode code points, or that the encoding is identical to UTF-8 (although it largely is), there’s nothing in the Standard that forbids them. And if you can’t think of anything creative, interesting, and useful to do with an extra 51 bits of namespace for characters, I certainly know people who can. And no, these people don’t give a rat’s sassy mamma about UTF-16. Who would?
    – tchrist
    Commented Feb 22, 2012 at 0:25
  • 3
    If a system is using a UTF-8-like encoding for non-Unicode values, then it is not really UTF-8; it is just a custom encoding that was inspired by UTF-8. The OP's question was specifically about standard UTF-8 and Unicode, and in that case what I wrote in my answer applies. Commented Mar 31, 2015 at 2:34
