For a project of mine I need to be able to read all possible multi-byte (MBCS) UTF-8 codepoints from the Windows console. Since Windows is well known to work internally with wchar_t (UTF-16), I tried an approach that reads even the "strangest" Unicode characters: Greek and Cyrillic letters work fine, CJK works fine too, and math characters also work fine. But whenever an emoji, or any other codepoint that needs four bytes in UTF-8, is fed to the input function, the sequence EF BF BD is returned, which stands for the "replacement character", instead of the expected bytes. For example, with the emoji 😀 one would expect the sequence F0 9F 98 80, but EF BF BD is returned instead.
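For reference, the expected bytes can be reproduced outside the console with a small standalone sketch (illustrative only, independent of the console input problem): converting the UTF-16 surrogate pair for U+1F600 with WideCharToMultiByte gives the four UTF-8 bytes, while converting only one half of the pair, which is invalid UTF-16 on its own, gives the replacement character.

#include <stdio.h>
#include <windows.h>

int main(void)
{
    // U+1F600 (😀) is the UTF-16 surrogate pair D83D DE00.
    const WCHAR grinning[2] = { 0xD83D, 0xDE00 };
    char utf8[8];

    // Converting the complete pair yields the expected 4-byte UTF-8 sequence.
    int n = WideCharToMultiByte(CP_UTF8, 0, grinning, 2, utf8, sizeof utf8, NULL, NULL);
    for (int k = 0; k < n; k++)
        printf("%02X ", (unsigned char)utf8[k]);   // F0 9F 98 80
    putchar('\n');

    // Converting only one half of the pair is invalid UTF-16, so the
    // conversion substitutes U+FFFD, which encodes as EF BF BD.
    n = WideCharToMultiByte(CP_UTF8, 0, grinning, 1, utf8, sizeof utf8, NULL, NULL);
    for (int k = 0; k < n; k++)
        printf("%02X ", (unsigned char)utf8[k]);   // EF BF BD
    putchar('\n');
    return 0;
}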
I have tried changing the codepage to 65001, and tried both the old legacy conhost, which is known to lack emoji support, and the new Windows Terminal, which does support emoji.
The code I'm using to read the input and convert it to its hexadecimal multi-byte representation is the following:
#include <stdio.h>
#include <windows.h>

int main()
{
    HANDLE hStdin = GetStdHandle(STD_INPUT_HANDLE);
    if (hStdin == INVALID_HANDLE_VALUE)
        return 1;

    // Make sure the console is using the UTF-8 codepage.
    unsigned int codepage = GetConsoleOutputCP();
    if (codepage != 65001)
    {
        fprintf(stderr, "[WARNING] Non Unicode codepage found (%u), changing to 65001\n", codepage);
        SetConsoleOutputCP(65001);
        SetConsoleCP(65001);
    }

    DWORD fdwSaveOldMode;
    INPUT_RECORD irInBuf[128];
    DWORD cNumRead, i;

    if (!GetConsoleMode(hStdin, &fdwSaveOldMode))
        return 1;
    SetConsoleMode(hStdin, fdwSaveOldMode & ~ENABLE_MOUSE_INPUT);

    while (1)
    {
        if (!ReadConsoleInput(hStdin, irInBuf, 128, &cNumRead))
            return 1;

        for (i = 0; i < cNumRead; i++)
        {
            switch (irInBuf[i].EventType)
            {
            case KEY_EVENT: // keyboard input: print the character value on key-down
                if (irInBuf[i].Event.KeyEvent.uChar.UnicodeChar && irInBuf[i].Event.KeyEvent.bKeyDown)
                    printf("Press UChar: %03hd\t0x%02x\n", irInBuf[i].Event.KeyEvent.uChar.UnicodeChar, irInBuf[i].Event.KeyEvent.uChar.UnicodeChar & (~0xff00));
                break;
            }
        }

        if (irInBuf[0].Event.KeyEvent.bKeyDown)
            putchar('\n');
        FlushConsoleInputBuffer(hStdin);
    }

    SetConsoleMode(hStdin, fdwSaveOldMode);
}
When running this code, entering θ shows its correct multi-byte UTF-8 representation; however, once an emoji is entered, the replacement character is returned twice.
Output:
Press UChar: 206 0xce
Press UChar: 184 0xb8
Press UChar: 239 0xef
Press UChar: 191 0xbf
Press UChar: 189 0xbd
Press UChar: 239 0xef
Press UChar: 191 0xbf
Press UChar: 189 0xbd
Comments:

Use ReadConsole() instead of ReadConsoleInput() to read Unicode text from the console.

(Asker) The problem only happens with 4-byte UTF-8 codepoints.

ReadConsoleInput() gives you only one WCHAR (UTF-16 code unit) per key event, which is why it gives you the replacement character for codepoints above U+FFFF.

(Asker) I don't think ReadConsole() will be of any help, considering I also need to be able to retrieve special characters since I'm building a text editor.
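One possible direction, kept as a sketch rather than a tested fix: if the console delivers a character above U+FFFF as two key events carrying the high and the low surrogate (an assumption, though it would match the two events visible in the output above), the surrogate halves could be paired up manually from ReadConsoleInputW and only then converted to UTF-8 with WideCharToMultiByte, so per-key events are preserved.

#include <stdio.h>
#include <windows.h>

int main(void)
{
    HANDLE hStdin = GetStdHandle(STD_INPUT_HANDLE);
    if (hStdin == INVALID_HANDLE_VALUE)
        return 1;

    INPUT_RECORD rec[128];
    DWORD nRead, i;
    WCHAR highSurrogate = 0;    // remembers a high surrogate until its partner arrives

    while (ReadConsoleInputW(hStdin, rec, 128, &nRead))
    {
        for (i = 0; i < nRead; i++)
        {
            if (rec[i].EventType != KEY_EVENT || !rec[i].Event.KeyEvent.bKeyDown)
                continue;

            WCHAR wc = rec[i].Event.KeyEvent.uChar.UnicodeChar;
            if (!wc)
                continue;

            WCHAR units[2];
            int count;
            if (wc >= 0xD800 && wc <= 0xDBFF)           // high surrogate: wait for the low half
            {
                highSurrogate = wc;
                continue;
            }
            else if (wc >= 0xDC00 && wc <= 0xDFFF && highSurrogate)
            {
                units[0] = highSurrogate;               // complete the surrogate pair
                units[1] = wc;
                count = 2;
                highSurrogate = 0;
            }
            else
            {
                units[0] = wc;                          // ordinary BMP character
                count = 1;
            }

            char utf8[8];
            int bytes = WideCharToMultiByte(CP_UTF8, 0, units, count, utf8, sizeof utf8, NULL, NULL);
            for (int k = 0; k < bytes; k++)
                printf("0x%02x ", (unsigned char)utf8[k]);
            putchar('\n');
        }
    }
    return 0;
}

If both halves do arrive as key events, 😀 would come out as 0xf0 0x9f 0x98 0x80 instead of two EF BF BD sequences; if a given console host does not deliver them that way, the ReadConsole()/ReadConsoleW() route suggested in the comments remains the alternative.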