
For a project of mine I need to read any Unicode code point from the Windows console as UTF-8. Since Windows works internally with wchar_t (UTF-16), I tried an approach that should handle even the "strangest" Unicode characters. Greek and Cyrillic letters work fine, as do CJK and math characters. But whenever an emoji or another code point that takes 4 bytes in UTF-8 is entered, the input function returns the sequence EF BF BD, which is the UTF-8 encoding of U+FFFD (the replacement character), instead of the expected bytes. For example, for the emoji 😀 one would expect F0 9F 98 80, but EF BF BD is returned.

I have tried changing the code page to 65001, and I have tried both the legacy conhost, which is known to lack emoji support, and the new Windows Terminal, which does support emojis.

The code I'm using to read the input and print its UTF-8 bytes in hexadecimal is the following:

#include <stdio.h>
#include <windows.h>

int main()
{
    HANDLE hStdin = GetStdHandle(STD_INPUT_HANDLE);
    if (hStdin == INVALID_HANDLE_VALUE)
        return 1;
    unsigned int codepage = GetConsoleOutputCP();

    if (codepage != 65001)
    {
        fprintf(stderr, "[WARNING] Non Unicode codepage found (%u), changing to 65001\n", codepage);
        SetConsoleOutputCP(65001);
        SetConsoleCP(65001);
    }

    DWORD fdwSaveOldMode;
    INPUT_RECORD irInBuf[128];
    DWORD cNumRead, i;
    if (!GetConsoleMode(hStdin, &fdwSaveOldMode))
        return 1;
    SetConsoleMode(hStdin, fdwSaveOldMode & ~ENABLE_MOUSE_INPUT);
    while (1)
    {
        if (!ReadConsoleInput(hStdin, irInBuf, 128, &cNumRead))
            return 1;

        for (i = 0; i < cNumRead; i++)
        {
            switch (irInBuf[i].EventType)
            {
            case KEY_EVENT: // keyboard input
                if (irInBuf[i].Event.KeyEvent.uChar.UnicodeChar && irInBuf[i].Event.KeyEvent.bKeyDown)
                    printf("Press UChar: %03hd\t0x%02x\n", irInBuf[i].Event.KeyEvent.uChar.UnicodeChar, irInBuf[i].Event.KeyEvent.uChar.UnicodeChar & (~0xff00));

                break;
            }
        }
        if (irInBuf[0].Event.KeyEvent.bKeyDown)
            putchar('\n');

        FlushConsoleInputBuffer(hStdin);
    }
    SetConsoleMode(hStdin, fdwSaveOldMode);
}

When I run this code and enter θ, it shows the correct two-byte UTF-8 representation. However, once an emoji is entered, the replacement character is returned twice.

Output (θ first, then 😀):

Press UChar: 206        0xce
Press UChar: 184        0xb8

Press UChar: 239        0xef
Press UChar: 191        0xbf
Press UChar: 189        0xbd
Press UChar: 239        0xef
Press UChar: 191        0xbf
Press UChar: 189        0xbd
  • The emoji 😀 (and any other Unicode character above U+FFFF) requires 2 surrogates in UTF-16, so it can't be represented by the single WCHAR that ReadConsoleInput() gives you. Use ReadConsole() instead to read Unicode text from the console.
    Commented Jun 30 at 17:09
  • The thing is, it can read characters that need three bytes, for example all CJK characters and various symbols that aren't emojis. That's why I'm reading up to 128 bytes instead of just one, which lets me read more than a single WCHAR. The problem only happens with code points that take 4 bytes in UTF-8.
    – Anic17
    Commented Jun 30 at 17:44
  • You are confusing code points and code units. Unicode defines code points; UTFs define how code points are encoded into code units. In UTF-8, a 3-code-unit sequence encodes a code point between U+0800 and U+FFFF. In UTF-16, that same code point is encoded in 1 code unit. A code point that requires 4 code units in UTF-8 requires 2 code units (surrogates) in UTF-16. But ReadConsoleInput() gives you only 1 wchar (UTF-16 code unit) per key event, which is why it gives you the replacement char for code points above U+FFFF.
    Commented Jun 30 at 19:44
  • It seems you're right that once code points go above U+FFFF the replacement sequence is returned. However, I still get the same issue with ReadConsole(), so I don't think it will help, considering I also need to be able to retrieve special characters since I'm building a text editor.
    – Anic17
    Commented Jun 30 at 21:38
  • Possibly related: ReadConsole does not work with utf-8 codepage
    Commented Jul 1 at 3:12
