
I've been using StackOverflow since the beginning, and have on occasion been tempted to post questions, but I've always either figured them out myself or found answers posted eventually... until now. This feels like it should be fairly simple, but I've been wandering around the internet for hours with no success, so I turn here:

I have a pretty standard UTF-16 text file, with a mixture of English and Chinese characters. I would like those characters to end up in a string (technically, a wstring). I've seen a lot of related questions answered (here and elsewhere), but they're either looking to solve the much harder problem of reading arbitrary files without knowing the encoding, or converting between encodings, or are just generally confused about "Unicode" being a range of encodings. I know the source of the text file I'm trying to read: it will always be UTF-16, it has a BOM and everything, and it can stay that way.

I had been using the solution described here, which worked for text files that were all English, but after encountering certain characters, it stopped reading the file. The only other suggestion I found was to use ICU, which would probably work, but I'd really rather not include a whole large library in an application for distribution just to read one text file in one place. I don't care about system independence, though - I only need it to compile and work on Windows. A solution that didn't rely on that fact would be prettier, of course, but I would be just as happy with a solution that used the STL while relying on assumptions about Windows architecture, or even solutions that involved Win32 functions or ATL; I just don't want to have to include another large 3rd-party library like ICU. Am I still totally out of luck unless I want to reimplement it all myself?

edit: I'm stuck using VS2008 for this particular project, so C++11 code sadly won't help.

edit 2: I realized that the code I had been borrowing before didn't fail on non-English characters as I thought. Rather, it fails on specific characters in my test document, among them '：' (FULLWIDTH COLON, U+FF1A) and '）' (FULLWIDTH RIGHT PARENTHESIS, U+FF09). bames53's posted solution also mostly works, but is stumped by those same characters.

edit 3 (and the answer!): the original code I had been using -did- mostly work - as bames53 helped me discover, the ifstream just needed to be opened in binary mode for it to work.
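
To spell it out, the fix amounted to a single line - opening the stream in binary mode (the file name below is just a placeholder for the real one):

#include <fstream>

int main()
{
    // the one-line fix: open the stream in binary mode
    // ("text.txt" stands in for the real file name)
    std::ifstream fin("text.txt", std::ios::in | std::ios::binary);
    // ...then read and reinterpret the bytes as before (see bames53's answer below)
}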

  • Please show us some code. What actual API are you calling? ReadFile? fread? read?
    – bmargulies
    Commented May 8, 2012 at 18:19
  • There shouldn't be a problem if you're actually certain that the text is UTF16. To the best of my knowledge, Chinese typically ends up as an MBCS string, which is an entirely different beast. Commented May 8, 2012 at 18:25
  • _wfopen can open/translate UTF-16 which can then be read into a string by fread msdn.microsoft.com/fr-fr/library/yeby3zcb%28v=vs.80%29.aspx
    – Benj
    Commented May 8, 2012 at 18:25
  • I don't see any reason why the code you linked to shouldn't work. It reads a file of bytes and type-casts it to wchar_t* to initialize a wstring. The only thing I'd check is if the file is opened in binary mode, but I wouldn't expect a mistake there to show your symptom. Commented May 8, 2012 at 20:53
  • @MarkRansom See my response to bames53's post: I now have a better idea just -what- odd symptom the code we had previously been using was displaying: certain specific Unicode characters stopped it reading before it had read the whole file. Not enough of a Unicode expert to guess -why-, though.
    – neminem
    Commented May 8, 2012 at 21:20

3 Answers


The C++11 solution (supported on your platform by Visual Studio since 2010, as far as I know) would be:

#include <fstream>
#include <iostream>
#include <locale>
#include <codecvt>
int main()
{
    // open as a byte stream
    std::wifstream fin("text.txt", std::ios::binary);
    // apply BOM-sensitive UTF-16 facet
    fin.imbue(std::locale(fin.getloc(),
       new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
    // read wide characters one at a time, printing each value in hex
    for (wchar_t c; fin.get(c); )
        std::cout << std::showbase << std::hex << c << '\n';
}
  • On platforms with a two byte wchar_t like Windows this will convert from UTF-16 to UCS-2. Specifically the VS2010 implementation truncates characters outside the BMP.
    – bames53
    Commented May 8, 2012 at 19:12
  • @bames53 Indeed.. VS2010 reads those characters into char32_t correctly, but there's not a lot that can be done with a UCS4 string on Windows. It's probably too early to get rid of compiler-dependent stuff like _O_U16TEXT.
    – Cubbi
    Commented May 8, 2012 at 19:26
  • std::consume_header doesn't seem to work in VS2010 -- BOM is consumed, but byte order is not affected. I had to explicitly use std::little_endian too.
    – Eugene
    Commented May 7, 2013 at 23:14
  • Note that on macOS I had to explicitly set std::little_endian instead of std::consume_header for a file encoded as UTF-16 LE that included the respective BOM. Otherwise I would receive big endian output.
    – bfx
    Commented May 27, 2020 at 10:24
  • @ChrisGuzak std::codecvt was not deprecated. The codecvt header and its contents were - cppreference notes that on en.cppreference.com/w/cpp/… and individual pages
    – Cubbi
    Commented Jul 16, 2021 at 17:43

When you open a file for UTF-16, you must open it in binary mode. This is because in text mode certain characters are interpreted specially - specifically, 0x0d is filtered out completely and 0x1a marks the end of the file. Some UTF-16 characters have one of those bytes as half of their character code, and they will mess up the reading of the file. This is not a bug; it is intentional behavior, and it is the sole reason for having separate text and binary modes.

For the reason why 0x1a is considered the end of a file, see this blog post from Raymond Chen tracing the history of Ctrl-Z. It's basically backwards compatibility run amok.
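
Here's a minimal sketch of that failure mode (it assumes the Windows CRT's text-mode behavior and writes a throwaway file named demo.txt). U+FF1A, one of the characters that tripped up the original code, encodes as the bytes 0x1A 0xFF in UTF-16LE, so its low byte is exactly the Ctrl-Z that text mode treats as end-of-file:

#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
    // U+FF1A (FULLWIDTH COLON) is the byte pair 0x1A 0xFF in UTF-16LE; that
    // 0x1A is the Ctrl-Z byte that Windows text mode treats as end-of-file.
    const char utf16le[] = { '\xFF', '\xFE',   // BOM
                             'A',    '\0',     // U+0041
                             '\x1A', '\xFF',   // U+FF1A
                             'B',    '\0' };   // U+0042
    {
        std::ofstream out("demo.txt", std::ios::binary);
        out.write(utf16le, sizeof utf16le);
    }

    std::ifstream text_mode("demo.txt");                  // default text mode
    std::ifstream bin_mode("demo.txt", std::ios::binary); // raw bytes

    std::string t((std::istreambuf_iterator<char>(text_mode)),
                  std::istreambuf_iterator<char>());
    std::string b((std::istreambuf_iterator<char>(bin_mode)),
                  std::istreambuf_iterator<char>());

    // On Windows, the text-mode read stops at the 0x1A byte; the binary-mode
    // read returns all 8 bytes.
    std::cout << "text mode: " << t.size() << " bytes, binary mode: "
              << b.size() << " bytes\n";
}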


Edit:

So it appears that the issue was that Windows treats certain magic byte sequences as the end of the file in text mode. This is solved by using binary mode to read the file, std::ifstream fin("filename", std::ios::binary);, and then copying the data into a wstring as you already do.



The simplest, non-portable solution would be to just copy the file data into a wchar_t array. This relies on the fact that wchar_t on Windows is 2 bytes and uses UTF-16 as its encoding.


You'll have a bit of difficulty converting UTF-16 to the locale specific wchar_t encoding in a completely portable fashion.

Here's the Unicode conversion functionality available in the standard C++ library (though VS 10 and 11 implement only items 3, 4, and 5):

  1. codecvt<char32_t,char,mbstate_t>
  2. codecvt<char16_t,char,mbstate_t>
  3. codecvt_utf8
  4. codecvt_utf16
  5. codecvt_utf8_utf16
  6. c32rtomb/mbrtoc32
  7. c16rtomb/mbrtoc16

And what each one does

  1. A codecvt facet that always converts between UTF-8 and UTF-32
  2. converts between UTF-8 and UTF-16
  3. converts between UTF-8 and UCS-2 or UCS-4 depending on the size of target element (characters outside BMP are probably truncated)
  4. converts between a sequence of chars using a UTF-16 encoding scheme and UCS-2 or UCS-4
  5. converts between UTF-8 and UTF-16
  6. If the macro __STDC_UTF_32__ is defined these functions convert between the current locale's char encoding and UTF-32
  7. If the macro __STDC_UTF_16__ is defined these functions convert between the current locale's char encoding and UTF-16

If __STDC_ISO_10646__ is defined, then converting directly using codecvt_utf16<wchar_t> should be fine, since that macro indicates that wchar_t values in all locales correspond to the short names of Unicode characters (and so implies that wchar_t is large enough to hold any such value).
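
Where that holds, the direct route looks roughly like this (a C++11 sketch using wstring_convert, so no help for VS2008; the helper name is mine, a 2-byte wchar_t still truncates anything outside the BMP, and per the comments under the other answer some implementations needed std::little_endian spelled out rather than relying on std::consume_header):

#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

std::wstring read_utf16_file(const char *path)
{
    std::ifstream fin(path, std::ios::binary); // raw bytes, no text-mode surprises
    std::stringstream ss;
    ss << fin.rdbuf();                         // slurp the whole file

    // consume_header reads the BOM and takes the byte order from it
    std::wstring_convert<
        std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header> > conv;
    return conv.from_bytes(ss.str());          // UTF-16 bytes -> wchar_t string
}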

Unfortunately there's nothing defined that goes directly from UTF-16 to wchar_t. It's possible to go UTF-16 -> UCS-4 -> mb (if __STDC_UTF_32__) -> wc, but you'll lose anything that's not representable in the locale's multi-byte encoding. And of course, no matter what, converting from UTF-16 to wchar_t will lose anything not representable in the locale's wchar_t encoding.


So it's probably not worth being portable; instead you can just read the data into a wchar_t array, or use some other Windows-specific facility, such as the _O_U16TEXT mode on files.
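
For what it's worth, that Windows-specific route can also be spelled at the stdio level - as far as I know the ccs flag below is the fopen-level equivalent of _O_U16TEXT, and the CRT should consume the BOM itself (the helper name and error handling here are just illustrative):

#include <cstdio>
#include <cwchar>
#include <string>

std::wstring read_utf16_crt(const wchar_t *path)
{
    // "ccs=UTF-16LE" puts the stream in UTF-16 text mode; the CRT decodes
    // the file and hands wide characters to fgetwc.
    FILE *f = _wfopen(path, L"rt, ccs=UTF-16LE");
    if (!f)
        return std::wstring();

    std::wstring ws;
    wint_t c;
    while ((c = fgetwc(f)) != WEOF)
        ws.push_back(static_cast<wchar_t>(c));
    fclose(f);
    return ws;
}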

The following should build and run anywhere, but makes a bunch of assumptions to actually work:

#include <cstring>  // std::memcpy
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main ()
{
    std::stringstream ss;
    std::ifstream fin("filename", std::ios::binary); // binary mode, per the edit above
    ss << fin.rdbuf(); // dump file contents into a stringstream
    std::string const &s = ss.str();
    if (s.size()%sizeof(wchar_t) != 0)
    {
        std::cerr << "file not the right size\n"; // must be even, two bytes per code unit
        return 1;
    }
    std::wstring ws;
    ws.resize(s.size()/sizeof(wchar_t));
    std::memcpy(&ws[0],s.c_str(),s.size()); // copy data into wstring
}

You should probably at least add code to handle endianness and the BOM. Also, Windows newlines don't get converted automatically, so you need to do that manually.
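
A hedged sketch of what that BOM/endianness handling might look like, applied to the byte string read above (it still assumes a 2-byte wchar_t and a little-endian host, and still skips newline translation; the helper name is just for illustration):

#include <algorithm>
#include <cstring>
#include <string>

std::wstring utf16_bytes_to_wstring(std::string s)
{
    // check for a BOM and strip it, remembering the byte order it declared
    bool big_endian = false;
    if (s.size() >= 2)
    {
        unsigned char b0 = static_cast<unsigned char>(s[0]);
        unsigned char b1 = static_cast<unsigned char>(s[1]);
        if (b0 == 0xFF && b1 == 0xFE)      { s.erase(0, 2); }                    // UTF-16LE BOM
        else if (b0 == 0xFE && b1 == 0xFF) { s.erase(0, 2); big_endian = true; } // UTF-16BE BOM
    }

    if (big_endian) // swap each code unit's bytes so they match the host order
        for (std::string::size_type i = 0; i + 1 < s.size(); i += 2)
            std::swap(s[i], s[i + 1]);

    std::wstring ws(s.size() / sizeof(wchar_t), L'\0');
    if (!ws.empty())
        std::memcpy(&ws[0], s.data(), ws.size() * sizeof(wchar_t));
    return ws;
}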

  • Well, turns out your code helped me debug - it stopped reading in exactly the same place in my sample text file as the code I linked to (cfc.kizzx2.com/index.php/…) did. Turns out it wasn't stopping at a Chinese character; it stopped reading at the first instance of a '：' (FULLWIDTH COLON, U+FF1A) character. Removing that, it then stops at '）' (FULLWIDTH RIGHT PARENTHESIS, U+FF09). I'm sensing a theme...
    – neminem
    Commented May 8, 2012 at 21:17
  • @neminem I guess I should have looked more closely at that link, it's just doing the same thing as I show. I'm guessing that for whatever reason, the VS 2008 implementation of fstream does not like reading the byte 0xFF. That byte represents 'delete'. Try opening the file in binary mode std::ifstream fin("...",std::ios::binary);
    – bames53
    Commented May 8, 2012 at 21:32
  • Oh my frelling god. I spent over a day trying to figure it out, and it was that obvious? I tried -other- things that involved opening the file in binary mode, but I never tried the -original- solution only opening it in binary mode? You win so much. You should edit that into your solution, in case other people stumble on this question later (I can't imagine I'm the only person who's ever had this issue) :).
    – neminem
    Commented May 8, 2012 at 21:39
  • It's not a bug - see my answer. Commented May 9, 2012 at 13:38
  • @MarkRansom That makes sense, though I'd have expected it to only have an effect on Windows when 0x0D and 0x0A appear together. The 0x1A seems like a bug by design, but since none of this stuff is standardized it's probably best to never use text mode anywhere.
    – bames53
    Commented May 9, 2012 at 14:25
