
What is the best practice of Unicode processing in C++?

9 Answers

  • Use ICU for dealing with your data (or a similar library)
  • In your own data store, make sure everything is stored in the same encoding
  • Make sure you are always using your Unicode library for mundane tasks like string length and capitalization status. Never use standard library builtins like isalpha unless that is the definition you want.
  • I can't say it enough: never iterate over the indices of a string if you care about correctness, always use your unicode library for this.
  • Unless you are treating the string as binary data.
    – Demi
    Commented May 13, 2016 at 23:59

If you don't care about backwards compatibility with previous C++ standards, the current C++11 standard has built-in Unicode support: http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2011/n3242.pdf

So the truly best practice for Unicode processing in C++ would be to use the built-in facilities for it. That isn't always possible with older code bases, though, with the standard being so new at present.

EDIT: To clarify, C++11 is Unicode aware in that it now has support for Unicode literals and Unicode strings. However, the standard library has only limited support for Unicode processing and conversion. For your current needs this may be enough. However, if you need to do a large amount of heavy lifting right now then you may still need to use something like ICU for more in-depth processing. There are some proposals currently in the works to include more robust support for text conversion between different encodings. My guess (and hope) is that this will be part of the next technical report.

  • That link to a draft standard doc isn't very helpful without a reference to a particular section that describes the "built in Unicode support" you're discussing. Commented Jan 7, 2014 at 16:27
  • 1
    @BenCollins Section 2.14.5 "String literals" - discusses string literals, including string literals for UTF-8, UTF-16 and UTF-32 encodings. Section 22.4.1.4 "Class template codecvt" - discusses the codecvt class used for converting between character encodings (including UTF-8, UTF-16 and UTF-32). There is more about Unicode support peppered throughout the document, but these seem to be the most critical sections on the subject.
    – eestrada
    Commented Nov 25, 2014 at 16:44

Our company (and others) use the open source International Components for Unicode (ICU) library, originally developed by Taligent.

It handles strings, locales, conversions, dates/times, collation, transformations, et al.

Start with the ICU User Guide


Here is a checklist for Windows programming:

  • All strings enclosed in _T("my string")
  • strlen() etc. functions replaced with _tcslen() etc.
  • Use LPTSTR and LPCTSTR instead of char * and const char *
  • When starting new projects in Dev Studio, religiously make sure the Unicode option is selected in your project properties.
  • For C++ strings, use std::wstring instead of std::string
  • 11
    Do not use "T" strings, chars and functions, unless you intend to do both Unicode and ANSI builds. If you only intend to do Unicode builds, just do regular wide character stuff: L"my wide string", wcslen(L"my string"), etc. Commented Sep 11, 2008 at 1:52
  • Agree, only use _T macros if you want generic text, i.e., the ability to code for both Unicode and ASCII/MBCS.
    – user2189331
    Commented Sep 11, 2008 at 2:23
  • 1
    In case you want to do both Unicode and ANSI for C++ strings, use something like typedef std::basic_string<TCHAR> tString;
    – Serge
    Commented Sep 11, 2008 at 7:10
  • Ah yes, I always do #ifdef _UNICODE #define tstring std::wstring #else #define tstring std::string #endif but I like your way better Serge. Commented Sep 17, 2008 at 4:38
  • 4
    Honestly, I think that UTF-16 is a waste; keeping everything in UTF-8 is simpler and far more compatible with *nix.
    – chacham15
    Commented Nov 30, 2012 at 6:26

Look at Case insensitive string comparison in C++

That question has a link to the Microsoft documentation on Unicode: http://msdn.microsoft.com/en-us/library/cc194799.aspx

If you look on the left-hand navigation side on MSDN next to that article, you should find a lot of information pertaining to Unicode functions. It is part of a chapter on "Encoding Characters" (http://msdn.microsoft.com/en-us/library/cc194786.aspx)

It has the following subsections:

  • The Code-Page Model
  • Double-Byte Character Sets in Windows
  • Unicode
  • Compatibility Issues in Mixed Environments
  • Unicode Data Conversion
  • Migrating Windows-Based Programs to Unicode
  • Summary

Although this may not be best practice for everyone, you can write your own C++ Unicode routines if you want!

I just finished doing it over a weekend. I learned a lot. I don't guarantee it's 100% bug-free, but I did a lot of testing and it seems to work correctly.

My code is under the New BSD license and can be found here:

http://code.google.com/p/netwidecc/downloads/list

It is called WSUCONV and comes with a sample main() program that converts between UTF-8, UTF-16, and standard ASCII. If you throw away the main code, you've got a nice library for reading and writing Unicode.


As has been said above, a library is the best bet when building a large system. However, sometimes you want to handle things yourself (perhaps because a library would use too many resources, such as on a microcontroller). In this case you want a simple library that you can copy the parts out of for the things you actually need.

Willow Schlanger's example code seems like a good one (see his answer for more details).

I also found another one with smaller code; it lacks full error checking and only handles UTF-8, but it was simpler to take parts out of.

Here's a list of the embedded libraries that seem decent.

Embedded libraries


Use IBM's International Components for Unicode


Have a look at the recommendations of UTF-8 Everywhere
