Unicode

Always use the Unicode conversion functions offered in carb/extras/Unicode.h when you must convert between representations. Fortunately this is a rare occurrence because we follow the “Unicode sandwich” methodology in this code base. Read on for the details.

Unicode was unfortunately late to the character encoding party. By the time it arrived, all kinds of half-baked solutions already existed, including code pages and wide character encodings. This is legacy that we must still deal with, and it comes with many pitfalls. Even in newer code, like the STL, it is easy to trigger an exception when converting between Unicode representations.

Hopefully this introduction gets your attention. We’ve developed some simple rules for this project that should keep you out of harm’s way and make programming simple when dealing with text.

Unicode sandwich

There are multiple Unicode encodings, such as UTF8, UTF16LE, and UTF32. For this project we have chosen to use the same encoding everywhere, so all text is stored, and therefore interpreted, the same way. This approach is called “the Unicode sandwich”: when we need to interact with external APIs (OS or third-party libraries) that don’t offer our internal encoding, we convert on the boundary between whatever encoding the API uses and our internal encoding. This happens on the input and output boundaries, which act as the two slices of bread in the sandwich analogy. Inside the sandwich everything is in the same encoding. This uniformity makes things simpler for all of us.
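The sandwich can be sketched in a few lines. This is only an illustration under stated assumptions: the external API is imagined to speak ISO-8859-1 (Latin-1), and the helper names latin1ToUtf8, utf8ToLatin1, and makeGreeting are hypothetical, not part of Carbonite. Real boundary code should use the functions in carb/extras/Unicode.h.

```cpp
#include <cstddef>
#include <string>

// Input boundary (first slice of bread): external Latin-1 bytes -> internal UTF-8.
// In Latin-1 every byte value equals its Unicode code point (U+0000..U+00FF).
std::string latin1ToUtf8(const std::string& latin1)
{
    std::string utf8;
    for (unsigned char c : latin1)
    {
        if (c < 0x80)
        {
            utf8 += static_cast<char>(c); // ASCII is unchanged
        }
        else
        {
            utf8 += static_cast<char>(0xC0 | (c >> 6));   // lead byte
            utf8 += static_cast<char>(0x80 | (c & 0x3F)); // continuation byte
        }
    }
    return utf8;
}

// The filling: code inside the sandwich only ever sees UTF-8.
std::string makeGreeting(const std::string& utf8Name)
{
    return "Hello, " + utf8Name + "!";
}

// Output boundary (second slice): internal UTF-8 -> external Latin-1 bytes.
// Code points above U+00FF have no Latin-1 representation and become '?'.
std::string utf8ToLatin1(const std::string& utf8)
{
    std::string latin1;
    for (std::size_t i = 0; i < utf8.size();)
    {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        const bool hasNext = i + 1 < utf8.size();
        const unsigned char next = hasNext ? static_cast<unsigned char>(utf8[i + 1]) : 0;
        if (c < 0x80)
        {
            latin1 += static_cast<char>(c);
            ++i;
        }
        else if ((c == 0xC2 || c == 0xC3) && hasNext && (next & 0xC0) == 0x80)
        {
            // Two-byte sequences with lead 0xC2/0xC3 cover U+0080..U+00FF.
            latin1 += static_cast<char>(((c & 0x03) << 6) | (next & 0x3F));
            i += 2;
        }
        else
        {
            latin1 += '?';
            ++i;
            // Skip the continuation bytes of the unrepresentable sequence.
            while (i < utf8.size() && (static_cast<unsigned char>(utf8[i]) & 0xC0) == 0x80)
                ++i;
        }
    }
    return latin1;
}
```

A round trip then looks like utf8ToLatin1(makeGreeting(latin1ToUtf8(externalName))): conversion happens exactly twice, at the edges, and nothing in between needs to know that another encoding exists.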

UTF8 is our encoding

We use UTF8 encoding for all of our text in Carbonite, both in interfaces and inside plugins. This means that all the text is stored in char arrays. The reasons for choosing UTF8 over other encodings are the following:

  1. Existing tool support: browsers, editors, etc. support it directly, and it is most often the default.

  2. Existing format support: JSON, XML, etc. are all UTF8 encoded by default.

  3. Many string operations written for ASCII char strings (copying, concatenating, comparing for equality, searching for an ASCII delimiter) work equally well on UTF8 strings.

  4. The English alphabet and many other special characters are directly represented in the first 7 bits (7-bit ASCII), making debugging easier when viewing raw memory.

  5. All code points are supported by UTF8 because it is a variable-length encoding.
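Reason 3 deserves a concrete illustration. In UTF8, every byte of a multi-byte sequence has its high bit set, so a byte below 0x80 is always a real ASCII character and never a fragment of something else. That is why byte-oriented code such as the following sketch keeps working on non-ASCII text. The function name splitOnAscii is hypothetical and only serves the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a UTF-8 string on an ASCII delimiter using plain byte search.
// This is safe because bytes below 0x80 never occur inside a multi-byte
// UTF-8 sequence, so find() cannot match in the middle of a character.
std::vector<std::string> splitOnAscii(const std::string& utf8, char delimiter)
{
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;)
    {
        const std::size_t end = utf8.find(delimiter, start);
        if (end == std::string::npos)
        {
            parts.push_back(utf8.substr(start)); // last (or only) part
            break;
        }
        parts.push_back(utf8.substr(start, end - start));
        start = end + 1;
    }
    return parts;
}
```

Splitting "caf\xC3\xA9/\xE5\xB1\xB1" (café/山) on '/' yields the two parts intact. The same code applied to UTF16 or to a legacy multi-byte code page would need encoding-aware logic.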

Entering UTF8 characters in source files

We use .editorconfig to set the encoding of all text files to UTF8. This means that the source files are already UTF8, so you can type or paste text directly into them and it just works!

const char* volcano = "Eyjafjallajökull";
const char* mountain = "Everest";

Notice how we didn’t have to add the u8 prefix in front of the string literal that has characters beyond 7-bit ASCII. You can put u8 in front of these strings, but it’s redundant: it is a no-op because our source files are already in UTF8 format. (Be aware that from C++20 onward a u8 literal has type const char8_t[], so it can no longer initialize a const char*.)
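One consequence worth seeing in code: the byte length of such a literal is not its character count. The sketch below assumes the compiler treats both the source file and the execution character set as UTF8 (the default for GCC and Clang; MSVC needs the /utf-8 switch). The helper countCodePoints is a hypothetical name and assumes valid UTF8 input.

```cpp
#include <cstddef>
#include <cstring>

// Entered directly into a UTF-8 source file; 'ö' occupies two bytes (0xC3 0xB6).
const char* volcano = "Eyjafjallajökull";

// Counts Unicode code points by skipping continuation bytes (10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t countCodePoints(const char* utf8)
{
    std::size_t count = 0;
    for (const char* p = utf8; *p != '\0'; ++p)
    {
        if ((static_cast<unsigned char>(*p) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Under those assumptions std::strlen(volcano) is 17 while countCodePoints(volcano) is 16. Use the byte length for buffers and the code point count only when you actually need to reason about characters.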

NOTE: If you choose to use an editor or merge tool that either:

  • doesn’t support character encoding coming from .editorconfig or,

  • doesn’t use UTF8 by default then

you are responsible for saving files in UTF8 encoding if you push their content beyond 7-bit ASCII.

Viewing UTF8 characters in Windows console

You need to enable the right code page in your Windows console to get the UTF8 encoded characters output by Carbonite FrameworkLogger to render correctly. You can do this temporarily by executing:

chcp 65001

You can also do it permanently by following these steps:

  1. Start -> Run -> regedit

  2. Go to HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor

  3. Create a string value named Autorun (if it doesn’t already exist)

  4. Change the value to @chcp 65001>nul

  5. Launch a new Windows console

Some Unicode characters, such as Chinese characters, will still be displayed as boxes, question marks, or other placeholder symbols, which means the console font has no glyph for the requested character. To fix this, select a different font for the console: right click on the console title bar -> Properties -> Font, and select NSimSun. MS Gothic and MS Mincho also have greater Unicode coverage, but they are known to cause artifacts when displaying certain ASCII symbols.

Converting to and from those tricky wchar_t types

Yes, we used the plural for wchar_t on purpose. There isn’t a single wchar_t type. On Windows it is a 16-bit type that holds UTF16LE code units, while on Linux it is a 32-bit type that holds UTF32 code points. Let that sink in. They don’t have the same capabilities and are not even the same size. This has some serious implications.
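To make the size difference tangible, here is a small sketch. A code point outside the Basic Multilingual Plane, such as U+1F600, fits in one 32-bit unit but needs a surrogate pair in UTF16, so the same text occupies a different number of wchar_t elements depending on the platform. The names encodeUtf16 and wcharUnitsFor are hypothetical, and the input is assumed to be a valid code point (at most U+10FFFF and not itself a surrogate).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Encodes one valid Unicode code point as UTF-16 code units.
// Code points above U+FFFF are split into a high and a low surrogate.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t codePoint)
{
    if (codePoint < 0x10000)
        return { static_cast<std::uint16_t>(codePoint) };

    const std::uint32_t v = codePoint - 0x10000; // 20 significant bits remain
    return { static_cast<std::uint16_t>(0xD800 | (v >> 10)),    // high surrogate
             static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)) }; // low surrogate
}

// Number of wchar_t elements one code point occupies on the current platform:
// UTF-16 rules where wchar_t is 16 bits wide, always 1 where it is 32 bits wide.
std::size_t wcharUnitsFor(std::uint32_t codePoint)
{
    return sizeof(wchar_t) == 2 ? encodeUtf16(codePoint).size() : 1;
}
```

For U+1F600 the UTF16 form is the pair 0xD83D 0xDE00, so wcharUnitsFor returns 2 where wchar_t is 16 bits and 1 where it is 32 bits. Code that assumes one wchar_t per character is therefore wrong on at least one platform.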

  • Don’t write code that converts between UTF16 and UTF8 when the input is wchar_t. That code will (probably) work on Windows but it will mysteriously fail on Linux. Instead, convert from the actual type, not from what you think it contains. The STL supports conversions from both UTF16 and wchar_t. Avoid the UTF16 variants unless you positively know the input is UTF16. The fact that the type is wchar_t does not guarantee this, unless you are writing Windows-specific code. Even then it is safer to convert from wchar_t, since that works correctly on all operating systems, regardless of what the wchar_t type is on that particular OS.

  • With their default settings, the STL-provided conversion objects (codecvt) throw range exceptions (std::range_error) when they encounter code points that are outside the range of the source or target encoding. Few are aware of this and even fewer know how to work around it, so please use the utility functions provided in Unicode.h and never roll your own.

  • Don’t use the std::experimental::filesystem::path facilities. First, they are not supported on all of Carbonite’s target platforms (Tegra). Second, they are error prone: by default they use wide strings with backslashes on Windows but UTF8 strings with forward slashes on Linux, so you must be careful to call the u8string accessor and the u8path constructors in order to not bungle it up. Just avoid this mess altogether and use Path.h.
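For illustration only, the following sketch shows what “convert from the actual type” means and how a converter can avoid throwing. It is not the implementation in Unicode.h, and the names appendUtf8 and wideToUtf8 are hypothetical. It branches on sizeof(wchar_t) and substitutes U+FFFD (the replacement character) for anything invalid. In this code base, keep using the functions from Unicode.h.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends one code point as UTF-8. Surrogates and out-of-range values are
// replaced with U+FFFD instead of raising an error.
void appendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;

    if (cp < 0x80)
    {
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts a wchar_t string to UTF-8 based on what wchar_t really is here:
// UTF-16 code units when it is 16 bits wide, UTF-32 code points otherwise.
std::string wideToUtf8(const std::wstring& wide)
{
    std::string out;
    for (std::size_t i = 0; i < wide.size(); ++i)
    {
        std::uint32_t cp = static_cast<std::uint32_t>(wide[i]);
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < wide.size())
        {
            const std::uint32_t low = static_cast<std::uint32_t>(wide[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF)
            {
                // Combine a valid surrogate pair into a single code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        appendUtf8(out, cp); // unpaired surrogates fall through and become U+FFFD
    }
    return out;
}
```

The same source compiles on both kinds of platform: wideToUtf8(L"\U0001F600") produces the four bytes F0 9F 98 80 whether the compiler stored the literal as a surrogate pair or as a single 32-bit unit.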

Fortunately, we only need to deal with wide characters on Windows, so the conversion functions to and from them are only available on the Windows platform. In general just write:

#include <carb/extras/Unicode.h>

and use the functions provided for your platform, knowing that you are doing it right.