
I am trying to parse a UTF-8 JSON message in C. I pass the following string to the parser:

char *text = "{\"mdl\":\"users\",\"fnc\":\"getuserslist\"}";

and everything works. But if the message contains Cyrillic characters, both of my parsers report that the string is "not valid UTF-8 string". Example:

char *text = "{\"mdl\":\"пользователи\",\"fnc\":\"получитьсписокпользователей\"}";

I used the Jansson C parser and the CCAN JSON parser for C. My main function contains the following setlocale call:

setlocale(LC_ALL, "ru_RU.utf8");

How can I get a valid UTF-8 string that contains Cyrillic characters?

  • This is valid JSON, at least according to JSONLint. The library you used has a bug.
    – user529758
    Commented May 2, 2013 at 12:28
  • I tried this line with two popular parsers: Jansson and the CCAN JSON parser for C.
    – Ze..
    Commented May 2, 2013 at 12:33
  • You really should not hard-code locale names, which are not portable. Best practice is to call setlocale(LC_CTYPE, "") to get the configured locale and then assert that nl_langinfo(CODESET) gives you the string "UTF-8" or something similar. Commented May 2, 2013 at 12:54

2 Answers


The relationship between the source encoding (the encoding used to encode the text in the C source) and the target encoding (the encoding used to encode run-time strings) is not obvious. See this question for more discussion about this.

Make sure your source encoding is UTF-8, and that the compiler is preserving this.

Alternatively, encode the strings by hand: replace each non-ASCII character with its UTF-8 bytes written as backslash hex escapes (for example, \xd0\x90 for Cyrillic 'А'). The result then no longer depends on the source encoding.

  • Thanks for the answer. I tried writing the character codes, like char *text = "{\"mdl\":\" \x41 \",\"fnc\":\"getuserslist\"}";. "\x41" (Latin 'A') works, but "\x410" (Cyrillic 'А') is not valid.
    – Ze..
    Commented May 2, 2013 at 12:59
  • @Ze_ \x410 is not UTF-8. That would be \xd0\x90. Commented May 2, 2013 at 14:18

Instead of setlocale(LC_ALL, "ru_RU.utf8"), try setting the console code page to UTF-8 (code page 65001) and redirecting the output to a file.

// Save this file as UTF-8 without a BOM signature.
#include <stdio.h>
#include <windows.h>

int main(void) {
    // Switch the console output code page to UTF-8 (65001).
    SetConsoleOutputCP(65001);
    const char *text = "{\"mdl\":\"пользователи\",\"fnc\":\"получитьсписокпользователей\"}";
    printf("%s", text);
    return 0;
}

The output is then a valid UTF-8 string containing the Cyrillic characters:

{"mdl":"пользователи","fnc":"получитьсписокпользователей"}
