10

I have a strange problem that I can't explain. I'm trying to manipulate a string that contains an accented character such as "é". The string is the name of an image file selected through an input of type file.

What I can't understand is why, when I go through the string character by character, the accented character is split into two characters. Here is an example:

My é is split into two characters, like this: e and ́.

"é".length
=> 2

Is it possible that UTF-8 is involved?

I really don't understand this at all!

3
  • 6
    Which browser are you using? It returns 1 in my chrome. Commented Sep 2, 2013 at 17:26
  • In some rare cases it is possible to write this letter with two characters. I read about this in the context of LaTeX.
    – rekire
    Commented Sep 2, 2013 at 17:27
  • 2
    Your character also returns 1 on Firefox.
    – Kevin Ji
    Commented Sep 2, 2013 at 17:29

2 Answers

12

They are called Combining Diacritical Marks. They are part of Unicode: diacritics that can be combined ("chained") onto any character. The length of the string in that case is 2, because it contains the e and the combining ´. Precomposed characters like àéèìòù have been kept for compatibility, but now any character can be accented :-) Most programmers don't know about this, and most programs support it very badly. I'm quite sure they could be used as an attack vector somewhere (but I'm not paranoid :-) )
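
A minimal sketch of how the two spellings of "é" differ and how they can be reconciled, assuming a runtime that supports `String.prototype.normalize` (ES2015 or later). The variable names are only illustrative and do not come from the question.

```javascript
// Two ways to write the same visible "é".
const decomposed = "e\u0301";  // "e" followed by U+0301 COMBINING ACUTE ACCENT
const precomposed = "\u00e9";  // the single precomposed code point U+00E9

// They render alike but are different strings with different lengths.
const lengths = [decomposed.length, precomposed.length]; // [2, 1]
const equalRaw = decomposed === precomposed;             // false

// NFC normalization composes base + mark when a precomposed form exists.
const composed = decomposed.normalize("NFC");
const equalNfc = composed === precomposed;               // true

// NFD goes the other way and splits the precomposed form.
const splitLength = precomposed.normalize("NFD").length; // 2

console.log(lengths, equalRaw, equalNfc, splitLength);
```

If the file name really arrives in decomposed form, normalizing it to NFC before working on it is one possible way to get a single "é" back; whether that is appropriate depends on what the name is later compared against.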

I'll add that even Jon Skeet, in 2009, wasn't sure how they worked: http://codeblog.jonskeet.uk/2009/11/02/omg-ponies-aka-humanity-epic-fail/

You see, I couldn't remember whether combining characters came before or after base characters

:-) :-)

5
  • +1 for your explanation, but, why am I getting 1 in my chrome console? Commented Sep 2, 2013 at 17:33
  • 2
    @SayemAhmed Because at some step you have probably forced the recomposition of the e + ´ into an é :-) Try "e\u0301"; (new line) "e\u0301".length
    – xanatos
    Commented Sep 2, 2013 at 17:35
  • @xanatos: Got it, thanks. I had just copied the "é".length portion from the question and pasted it into my console. Commented Sep 2, 2013 at 17:37
  • 1
    @SayemAhmed: OP's "é" in the post is really just the composed letter.
    – kennytm
    Commented Sep 2, 2013 at 17:38
  • @SayemAhmed Note that you can mount more than one diacritical mark on top of a single character, like ẫ
    – xanatos
    Commented Sep 2, 2013 at 17:46
9

Rather than UTF-8, it is more likely that combining diacritical marks are involved.

>>> "e\u0301"
"é"
>>> "e\u0301".length
2

JavaScript strings are usually encoded as UTF-16, so the precomposed "é" (U+00E9) fits in a single code unit and has length 1.


But for characters outside the BMP (those with code points beyond U+FFFF), length returns 2, because they are encoded as 2 UTF-16 code units.

>>> "😐".length
2
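
As a rough illustration of the code unit versus code point distinction, the sketch below inspects the same emoji in a few ways. It assumes an ES2015 runtime (for `codePointAt`, `\u{...}` escapes and code-point-aware string iteration); the variable names are made up for the example.

```javascript
const face = "\u{1F610}"; // "😐", U+1F610, outside the BMP

// .length counts UTF-16 code units: one surrogate pair here.
const codeUnits = face.length;                           // 2

// Iterating a string walks code points, so the pair counts once.
const codePoints = Array.from(face).length;              // 1

// The full code point versus the first half of the surrogate pair.
const firstCodePoint = face.codePointAt(0).toString(16); // "1f610"
const highSurrogate = face.charCodeAt(0).toString(16);   // "d83d"

console.log(codeUnits, codePoints, firstCodePoint, highSurrogate);
```

Note that counting code points still gives 2 for "e\u0301", since the combining accent is a code point of its own.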
2
  • 1
    See also wikipedia about that.
    – rekire
    Commented Sep 2, 2013 at 17:32
  • To make it clearer for the OP: the second part (the one about the BMP) is orthogonal to combining diacritical marks. They are independent things. Each Unicode code point is represented by one or two JavaScript characters (UTF-16 code units). On top of this you can "mount" Combining Diacritical Marks (0...n, with n quite big), so a rendered grapheme can be composed of 1 to x JavaScript characters, with x > 10 :-) Aaah... I wanted to make it clear and it became 5 rows! :-(
    – xanatos
    Commented Sep 2, 2013 at 17:51
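
To count rendered graphemes rather than JavaScript characters, one option in newer runtimes is `Intl.Segmenter`. This is only a sketch: it assumes the runtime provides `Intl.Segmenter` (it did not exist when this thread was written), and `countGraphemes` is a hypothetical helper name.

```javascript
// Count user-perceived characters (grapheme clusters) instead of code units.
// Assumption: the runtime implements Intl.Segmenter.
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text)).length;
}

const accented = "e\u0301";      // "e" + combining acute accent
const stacked = "a\u0302\u0303"; // "a" + combining circumflex + combining tilde
const emoji = "\u{1F610}";       // one code point, two UTF-16 code units

console.log(accented.length, countGraphemes(accented)); // 2 1
console.log(stacked.length, countGraphemes(stacked));   // 3 1
console.log(emoji.length, countGraphemes(emoji));       // 2 1
```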
