10

The current version of UTF-16 is only capable of encoding 1,112,064 different code points: the range 0x0-0x10FFFF (1,114,112 values) minus the 2,048 surrogate values.

Does the Unicode Consortium intend to make UTF-16 run out of characters?

i.e., assign a code point > 0x10FFFF?

If not, why would anyone write a UTF-8 parser that can accept 5- or 6-byte sequences? It would only add unnecessary instructions to the function.

Isn't 1,112,064 enough? Do we actually need MORE characters? I mean: how quickly are we running out?

9
  • 3
    I happen to know a utf8-loose parser that accepts 13-byte code points. This is not unuseful. Obviously this process doesn’t give a fart about UTF-16, which is a very unfortunate legacy we’d all like to forget, since it incorporates the worst disadvantages of both UTF-8 and UTF-32 without enjoying any of the advantages of either: UTF-16 is truly the worst of both worlds. But make no mistake: any strict UTF-8 parser must reject code points over 4 bytes in encoded length. This is to kiss UTF-16’s sweet you know what.
    – tchrist
    Commented Feb 22, 2012 at 0:31
  • 1
    Wake me up when they discover a new civilization with a non-alphabetic writing system. Commented Feb 22, 2012 at 1:09
  • 7
    @HansPassant Time to wake up! Alphabets are just one of the forms that human writing takes. There are also syllabaries and logograms. Bazillions of logograms. CJK Extension E is nearly ready, and that has 6,000 new characters in it — not one of which has anything to do with an “alphabet”.
    – tchrist
    Commented Feb 22, 2012 at 1:44
  • 1
    @GlassGhost By logograms tchrist meant Chinese characters. I don't believe anyone supports all Unicode characters; if you're making a font, feel free to exclude whatever characters you want. By sheer count, the few hundred emoji that were new to Unicode aren't that major, especially when compared to the tens of thousands of Chinese characters being encoded.
    – prosfilaes
    Commented Aug 19, 2012 at 21:36
  • 1
    @GlassGhost He did say what he meant; for example, the Encyclopedic Dictionary of Archaeology says "Writing systems that make use of logograms include Chinese, Egyptian hieroglyphic writing, and early cuneiform writing systems."
    – prosfilaes
    Commented Aug 21, 2012 at 12:02

4 Answers

9

As of 2011 we have consumed 109,449 characters and set aside another 137,468 code points for application use (6,400 + 131,068).

That leaves 1,112,064 - 109,449 - 137,468 = 865,147 unassigned code points, i.e. room for over 860,000 unused characters: plenty for CJK Extension E (~10,000 characters) and 85 more sets just like it, so in the event of contact with the Ferengi culture, we should be ready.

In November 2003 the IETF restricted UTF-8 to end at U+10FFFF with RFC 3629, in order to match the constraints of the UTF-16 character encoding: a UTF-8 parser should not accept 5- or 6-byte sequences that would overflow the UTF-16 range, nor 4-byte sequences that decode to code points greater than 0x10FFFF.
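
As a rough illustration (not from the answer; a minimal sketch in C with a made-up function name), a strict RFC 3629 decoder rejects exactly those cases: lead bytes that would start 5- or 6-byte sequences, surrogates, and anything that decodes above U+10FFFF:

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one UTF-8 sequence from s (n bytes available) into *out.
     * Returns the sequence length (1-4), or -1 for anything RFC 3629
     * forbids: 5/6-byte lead bytes, truncated or overlong sequences,
     * UTF-16 surrogates, and values above U+10FFFF. */
    static int utf8_decode_strict(const unsigned char *s, size_t n, uint32_t *out)
    {
        if (n == 0) return -1;

        uint32_t cp;
        int len;

        if      (s[0] < 0x80)           { cp = s[0];        len = 1; }
        else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
        else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
        else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
        else return -1;   /* 0xF8-0xFD would start a 5- or 6-byte form */

        if ((size_t)len > n) return -1;  /* truncated sequence */

        for (int i = 1; i < len; i++) {
            if ((s[i] & 0xC0) != 0x80) return -1;  /* bad continuation byte */
            cp = (cp << 6) | (s[i] & 0x3F);
        }

        /* Reject overlong encodings (value too small for its length). */
        static const uint32_t min_for_len[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };
        if (cp < min_for_len[len]) return -1;

        /* Reject UTF-16 surrogates and anything past the UTF-16-imposed cap. */
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return -1;

        *out = cp;
        return len;
    }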

Please put edits here listing character sets that threaten the size of the Unicode code point limit, if they are over 1/3 the size of CJK Extension E (~10,000 characters):

1
  • 1
    +1 for Ferengi (and being the most descriptive)
    – Izkata
    Commented Mar 12, 2012 at 15:43
2

At present, the Unicode standard doesn't define any characters above U+10FFFF, so you would be fine coding your app to reject characters above that point.

Predicting the future is hard, but I think you're safe for the near term with this strategy. Honestly, even if Unicode extends past U+10FFFF in the distant future, it almost certainly won't be for mission critical glyphs. Your app might not be compatible with the new Ferengi fonts that come out in 2063, but you can always fix it when it actually becomes an issue.
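
For example, a minimal check for that strategy might look like the following sketch (C; the helper name is made up, not from the answer):

    #include <stdbool.h>
    #include <stdint.h>

    /* True only for values the standard could ever assign as characters:
     * at most U+10FFFF and not a UTF-16 surrogate. */
    static bool is_valid_scalar_value(uint32_t cp)
    {
        return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
    }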

7
  • I dunno, the Star Trek buffs might get mad?? But shouldn't we still have room with that? I think 1,112,064 is a LOT of damn characters. I'm used to English, and with ASCII and all the math symbols and Greek symbols I can think of we only have like 512.
    – GlassGhost
    Commented Feb 21, 2012 at 20:23
  • 3
    Sure, but basic Japanese at a high school level has several thousand. Chinese, more still. Some languages just have more glyphs than others. Still, I agree that one million glyphs ought to stretch a long way. Commented Feb 21, 2012 at 23:44
  • I also agree that one million glyphs ought to stretch a long way.
    – GlassGhost
    Commented Feb 24, 2012 at 15:48
  • 6
    @GlassGhost: Sure, and 640 kilobytes of memory is enough for anyone. Commented Feb 24, 2012 at 18:56
  • 1
    To be fair, human languages aren't affected by Moore's law--and thank goodness for that!! Commented Feb 24, 2012 at 19:49
2

Cutting to the chase:

It is indeed intentional that the encoding system only supports code points up to U+10FFFF.

It does not appear that there is any real risk of running out any time soon.

4
  • It's 10FFFF, not "10FFF", and the already-accepted answer implies all of the above.
    – GlassGhost
    Commented Mar 1, 2012 at 19:39
  • I typoed, obviously. Also, the other answer was not yet accepted when I posted it.
    – Perry
    Commented Mar 1, 2012 at 19:43
  • Actually it was accepted, like a whole week before you posted this; hover your mouse over the accepted checkmark if you don't believe me.
    – GlassGhost
    Commented Mar 1, 2012 at 20:21
  • My mistake. For some reason this popped up as a new question and I made a bad assumption.
    – Perry
    Commented Mar 1, 2012 at 20:49
0

There is no reason to write a UTF-8 parser that supports 5-6 byte sequences, except to support any legacy systems that actually used them. The current official UTF-8 specification does not allow 5-6 byte sequences, in order to accommodate 100% lossless conversions to/from UTF-16. If there is ever a time that Unicode has to support new code points above U+10FFFF, there will be plenty of time to devise new encoding formats for the higher bit counts. Or maybe by the time that happens, memory and computational power will be sufficient that everyone will just switch to UTF-32 for everything, whose 32-bit code units could in principle represent values up to 0xFFFFFFFF, i.e. over 4 billion characters.
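
For reference, here is a sketch (C, illustrative only; the function name is made up) of what those legacy 5- and 6-byte forms covered under the original, pre-RFC 3629 scheme (RFC 2279), and why a modern parser turns them away:

    /* Sequence length implied by a lead byte under the old RFC 2279 scheme:
     *   1 byte  : U+0000    - U+007F
     *   2 bytes : U+0080    - U+07FF
     *   3 bytes : U+0800    - U+FFFF
     *   4 bytes : U+10000   - U+1FFFFF
     *   5 bytes : U+200000  - U+3FFFFFF
     *   6 bytes : U+4000000 - U+7FFFFFFF
     * RFC 3629 cuts this off after 4 bytes and caps 4-byte forms at U+10FFFF. */
    static int utf8_legacy_sequence_length(unsigned char lead)
    {
        if (lead < 0x80)           return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        if ((lead & 0xFC) == 0xF8) return 5;   /* invalid under RFC 3629 */
        if ((lead & 0xFE) == 0xFC) return 6;   /* invalid under RFC 3629 */
        return -1;  /* continuation byte, or 0xFE/0xFF, never valid */
    }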

2
  • 1
    That’s not exactly true. There are systems that use a modified version of the UTF-8 algorithm to allow for non-Unicode code points up to 2⁷²−1. So long as cooperating processes do not pretend these so-called ‘hypers’ are actual Unicode code points, or that the encoding is identical to UTF-8 (although it largely is), there’s nothing in the Standard that forbids them. And if you can’t think of anything creative, interesting, and useful to do with an extra 51 bits of namespace for characters, I certainly know people who can. And no, these people don’t give a rat’s sassy mamma about UTF-16. Who would?
    – tchrist
    Commented Feb 22, 2012 at 0:25
  • 3
    If a system is using a UTF-8-like encoding for non-Unicode values, then it is not really UTF-8; it is just a custom encoding that was inspired by UTF-8. The OP's question was specifically about standard UTF-8 and Unicode, and in that case what I wrote in my answer applies. Commented Mar 31, 2015 at 2:34
