So as not to reinvent the wheel, I refer to the existing question "Cyrillic characters in PHP's json_encode".

The question is: what are those sequences, such as \u0435 and \u0434, and what do they mean? I guess they have nothing to do with the number of bytes. Are they just serial numbers in UTF-8 that correspond to the Cyrillic symbols "е", "д", and so on, respectively?

1 Answer

These are Unicode escape sequences that reference characters in the Unicode character set by denoting their code points in hexadecimal.

From the JSON specification (RFC 4627, section 2.5):

Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".
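To make the quoted rule concrete, here is a minimal sketch in Python (not PHP, and not taken from the specification) that builds the six-character escape for a single character in the Basic Multilingual Plane. The helper name escape_bmp is hypothetical and exists only for this illustration.

```python
import json

def escape_bmp(ch: str) -> str:
    """Return the six-character JSON escape for one BMP character.

    Hypothetical helper for illustration; not part of any library.
    """
    code_point = ord(ch)
    if code_point > 0xFFFF:
        raise ValueError("character is outside the Basic Multilingual Plane")
    # Reverse solidus, lowercase u, then four hexadecimal digits.
    return "\\u{:04x}".format(code_point)

print(escape_bmp("е"))   # Cyrillic small letter ie
print(escape_bmp("\\"))  # the reverse solidus itself

# A JSON parser maps the escape back to the character; hex case does not matter.
print(json.loads('"\\u005C"') == json.loads('"\\u005c"') == "\\")
```

The four hex digits are simply the code point written in base 16, padded to four places, which is why the escape says nothing about how many bytes the character occupies in any particular encoding.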

Although these characters do not need to be escaped (see the unescaped rule), json_encode by default escapes every character outside US-ASCII (see the source of json.c) to avoid encoding issues with US-ASCII-based protocols.
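The same trade-off can be observed outside PHP. As a rough analogue (an assumption for illustration, not PHP's implementation), Python's json.dumps also escapes non-ASCII characters by default and only emits them raw when ensure_ascii=False is passed. Both outputs are valid JSON and decode to the same string.

```python
import json

text = "ед"  # Cyrillic "ie" followed by "de"

escaped = json.dumps(text)                    # ensure_ascii=True is the default
raw = json.dumps(text, ensure_ascii=False)    # non-ASCII characters left as-is

print(escaped)                                # pure ASCII, uses \uXXXX escapes
print(escaped.isascii(), raw.isascii())

# Both forms are valid JSON and decode to the same string.
print(json.loads(escaped) == json.loads(raw) == text)
```

In PHP, the corresponding opt-out is the JSON_UNESCAPED_UNICODE flag of json_encode, available since PHP 5.4.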

So inside a JSON string, \u0435 references the character at U+0435, which is CYRILLIC SMALL LETTER IE (е), and \u0434 references the character at U+0434, which is CYRILLIC SMALL LETTER DE (д).
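To check this mapping yourself, the following sketch (Python again, chosen only for illustration) decodes the two escapes and prints each character's code point, its official Unicode name, and its UTF-8 bytes. It shows that the number in the escape is the Unicode code point, not a UTF-8 byte sequence.

```python
import json
import unicodedata

# Decode a JSON string consisting of the two escapes from the question.
decoded = json.loads('"\\u0435\\u0434"')

for ch in decoded:
    code_point = "U+{:04X}".format(ord(ch))
    utf8_hex = ch.encode("utf-8").hex()
    print(code_point, unicodedata.name(ch), utf8_hex)

# U+0435 CYRILLIC SMALL LETTER IE d0b5
# U+0434 CYRILLIC SMALL LETTER DE d0b4
```

Each of these letters takes two bytes in UTF-8 (d0 b5 and d0 b4), while the escape carries the code point 0435 or 0434.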
