Why does the length of the accented string "é" return 2?

Question

I have a strange problem that I can't explain. I'm trying to manipulate a string containing an accented character such as "é". The string comes from the name of an image selected through an input of type file.

What I can't understand is why, when I parse the string, the accented character is split into two characters. Here is an example:

My é is divided into two characters: e and ́.

"é".length
=> 2

Is it possible that UTF-8 is involved?

I really don't understand this at all!

In some rare cases it is possible to write this letter with two characters. I read this in the context of LaTeX. – rekire, Commented Sep 2, 2013 at 17:27

schnaader · Accepted Answer · 2015-01-16 10:36:45Z

12

They are called Combining Diacritical Marks. They are a "piece" of Unicode: combinable diacritics that can be "chained" onto any character. The length of the string in that case is 2 (because there is the e and the combining accent). The precomposed characters like àéèìòù have been kept for compatibility, but now any character can be accented :-) Clearly 99% of programmers don't know this, and 99.9% of programs support it very badly. I'm quite sure they could be used as an attack vector somewhere (but I'm not paranoid :-) )
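To see which form a given string actually uses, it can help to list its code points. The following is only a minimal sketch: `codePoints` is a hypothetical helper name made up for illustration, and the two sample strings are written with escapes so that copy and paste cannot silently change them.

```javascript
// List the code points of a string as "U+XXXX" labels.
// `codePoints` is a hypothetical helper, not part of any library.
function codePoints(str) {
  // Array.from iterates a string by code point, not by UTF-16 code unit.
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const precomposed = "\u00e9"; // é as a single precomposed code point
const combined = "e\u0301";   // e followed by COMBINING ACUTE ACCENT

console.log(codePoints(precomposed)); // [ 'U+00E9' ]
console.log(codePoints(combined));    // [ 'U+0065', 'U+0301' ]
console.log(precomposed.length, combined.length); // 1 2
console.log(precomposed === combined); // false
```

Both strings usually render identically, which is why the difference only shows up in `length` or in comparisons.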

I'll add that even Skeet in 2009 wasn't sure how they worked: http://codeblog.jonskeet.uk/2009/11/02/omg-ponies-aka-humanity-epic-fail/

You see, I couldn't remember whether combining characters came before or after base characters

:-) :-)

edited Jan 16, 2015 at 10:36

schnaader


answered Sep 2, 2013 at 17:26

xanatos


+1 for your explanation, but, why am I getting 1 in my chrome console?
– MD Sayem Ahmed
Commented Sep 2, 2013 at 17:33
2

@SayemAhmed Because at some step you have probably forced the recomposition of e + ́ into é :-) Try "e\u0301"; (new line) "e\u0301".length
– xanatos
Commented Sep 2, 2013 at 17:35
@xanatos: Got it, thanks. Previously I had just copied the "é".length portion from the question and pasted it into my console.
– MD Sayem Ahmed
Commented Sep 2, 2013 at 17:37
1

@SayemAhmed: OP's "é" in the post is really just the composed letter.
– kennytm
Commented Sep 2, 2013 at 17:38
@SayemAhmed Note that you can mount more than one diacritical mark on top of a single character, like ẫ
– xanatos
Commented Sep 2, 2013 at 17:46
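The recomposition mentioned in these comments can also be done on purpose with `String.prototype.normalize`. This is a sketch under the assumption that normalizing to NFC is acceptable for your data (for example, a file name read from a file input); the sample strings are illustrative only.

```javascript
// Decomposed form: e + COMBINING ACUTE ACCENT
const decomposed = "e\u0301";

// NFC recomposes base + mark into the precomposed character where one exists.
const nfc = decomposed.normalize("NFC");
console.log(nfc.length);        // 1
console.log(nfc === "\u00e9");  // true

// NFD goes the other way and splits the precomposed character.
const nfd = "\u00e9".normalize("NFD");
console.log(nfd.length);        // 2

// Several marks on one base: a + circumflex + tilde recomposes to ẫ (U+1EAB).
const stacked = "a\u0302\u0303";
console.log(stacked.length);                        // 3
console.log(stacked.normalize("NFC") === "\u1eab"); // true
```

Not every base and mark combination has a precomposed form, so NFC does not guarantee a length of 1 for every accented letter.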


kennytm · Accepted Answer · 2013-09-02 17:46:57Z

9

Rather than UTF-8 itself, combining diacritical marks are more likely involved.

>>> "e\u0301"
"é"
>>> "e\u0301".length
2

JavaScript strings are exposed as sequences of UTF-16 code units, and length counts those units, so the precomposed "é" (U+00E9) fits in a single code unit and has length 1.

But characters outside the BMP (those with code points above U+FFFF) have a length of 2, as they are encoded as two UTF-16 code units (a surrogate pair).

>>> "😐".length
2
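As a small illustration of the code unit versus code point distinction, here is a sketch that assumes an ES2015+ runtime (for the `\u{...}` escape, string iteration, and `codePointAt`):

```javascript
// U+1F610 (😐) lies outside the BMP, so UTF-16 needs a surrogate pair for it.
const face = "\u{1F610}";

console.log(face.length);      // 2 (UTF-16 code units)
console.log([...face].length); // 1 (spreading iterates by code point)

// The two code units are the high and low surrogates.
console.log(face.charCodeAt(0).toString(16), face.charCodeAt(1).toString(16)); // d83d de10

// codePointAt reads the whole pair as one code point.
console.log(face.codePointAt(0).toString(16)); // 1f610
```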

edited Sep 2, 2013 at 17:46

answered Sep 2, 2013 at 17:29

kennytm


1

See also wikipedia about that.
– rekire
Commented Sep 2, 2013 at 17:32
To make it clearer for the OP: the second part (the one about the BMP) is orthogonal to the combining diacritical marks. They are independent things. Each Unicode code point is represented by one or two JavaScript characters (UTF-16 code units). On top of this you can "mount" Combining Diacritical Marks (0...n, with n quite big), so a rendered grapheme could be composed of 1 to x JavaScript characters, with x > 10 :-) Aaah... I wanted to make it clear and it became 5 rows! :-(
– xanatos
Commented Sep 2, 2013 at 17:51
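If what you want is the number of rendered graphemes rather than code units or code points, one option is `Intl.Segmenter`. This is only a sketch: it assumes a runtime that ships `Intl.Segmenter` (recent browsers and Node.js versions do), and `graphemeCount` is a hypothetical helper name.

```javascript
// Count user-perceived characters (grapheme clusters) in a string.
// `graphemeCount` is a hypothetical helper; Intl.Segmenter must be available.
function graphemeCount(str) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  // segment() returns an iterable with one entry per grapheme cluster.
  return [...segmenter.segment(str)].length;
}

console.log("e\u0301".length, graphemeCount("e\u0301"));             // 2 1
console.log("a\u0302\u0303".length, graphemeCount("a\u0302\u0303")); // 3 1
console.log("\u{1F610}".length, graphemeCount("\u{1F610}"));         // 2 1
```

The base letter with any number of stacked marks counts as one grapheme, and so does a single code point stored as a surrogate pair, which matches the two independent effects described above.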
