Tag Archives: Unicode

Character encoding and locales

Recently, I've been looking into how character encoding and locales work on Linux, and I thought it might be worthwhile to write down my findings; partly so that I can look them up again later, and partly so that people can correct all the things I've got wrong.

To begin with, let's define some terminology:

  • Character set: a set of symbols which can be used together. This defines the symbols and their semantics, but not how they're encoded in memory. For example: Unicode. (Update: As noted in the comments, the character set doesn't define the appearance of symbols; this is left up to the fonts.)
  • Character encoding: a mapping from a character set to a representation of the characters in memory. For example: UTF-8 is one encoding of the Unicode character set.
  • Nul byte: a single byte which has a value of zero. Typically represented as the C escape sequence ‘\0’.
  • NULL character: the Unicode NULL character (U+0000) in the relevant encoding. In UTF-8, this is just a single nul byte. In UTF-16, however, it's a sequence of two nul bytes.
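To make the distinction between a character set and its encodings concrete, here is a small sketch. It uses Python rather than C purely for brevity, and the variable names are made up for illustration. It encodes the same two Unicode code points in UTF-8 and in little-endian UTF-16 without a byte order mark.

```python
# One character set (Unicode), two encodings of the same code points.
e_acute = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
null_char = "\u0000"  # U+0000, the Unicode NULL character

# The code point belongs to the character set; the bytes belong to the encoding.
utf8_e = e_acute.encode("utf-8")        # two bytes: C3 A9
utf16_e = e_acute.encode("utf-16-le")   # two bytes: E9 00

# The NULL character is one nul byte in UTF-8 but two nul bytes in UTF-16.
utf8_null = null_char.encode("utf-8")
utf16_null = null_char.encode("utf-16-le")

print(utf8_e, utf16_e, utf8_null, utf16_null)
```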

Now, the problem: if I'm writing a (command line) C program, how do strings get from the command line to the program, and how do strings get from the program to the terminal? More concretely, what actually happens with argv[] and printf()?

Let's consider the input direction first. When the main() function of a C program is called, it's passed an array of pointers to char arrays, i.e. strings. These strings can be arbitrary byte sequences (for example, file names), but are generally intended/assumed to be encoded in the user's environment's character encoding. This is set using the LC_ALL, LC_CTYPE or LANG environment variables. These variables specify the user's locale which (among other things) specifies the character encoding they use.

So the program receives as input a series of strings which are in an arbitrary encoding. This means that all programs have to be able to handle all possible character encodings, right? Wrong. A standard solution to this already exists in the form of libiconv. iconv() will convert between any two character encodings known to the system, so we can use it to convert from the user's environment encoding to, for example, UTF-8. How do we find out the user's environment encoding without parsing environment variables ourselves? We use setlocale() and nl_langinfo().

setlocale(), when passed an empty string as the locale name, parses the LC_ALL, LC_CTYPE and LANG environment variables (in that order of precedence) to determine the user's locale, and hence their character encoding. It then stores this locale, which will affect the behaviour of various C runtime functions. For example, it will change the formatting of numbers output by printf() to use the locale's decimal separator. Just calling setlocale() doesn't convert any strings between character encodings, though. It won't, for example, cause printf() to magically convert strings to the user's environment encoding. More on this later.

nl_langinfo() is one function affected by setlocale(). When called with the CODESET type, it will return a string identifying the character encoding set in the user's environment. This can then be passed to iconv_open(), and we can use iconv() to convert strings from argv[] to our internal character encoding (which will typically be UTF-8).
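The same sequence can be sketched in Python, whose locale module wraps the C setlocale() and nl_langinfo() calls on Unix-like systems. This is only an illustration under some assumptions: the helper names environment_codeset() and to_utf8() are invented here, Python's built-in codecs stand in for iconv(), and a locale that the system doesn't know is silently ignored.

```python
import locale


def environment_codeset() -> str:
    """Return the name of the character encoding set in the user's locale."""
    try:
        # The empty string asks setlocale() to consult LC_ALL, LC_CTYPE and LANG.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # Assumption: if the environment names an unknown locale, carry on
        # with whatever locale is already in effect.
        pass
    # Wraps nl_langinfo(CODESET); only available on Unix-like systems.
    return locale.nl_langinfo(locale.CODESET)


def to_utf8(raw: bytes, codeset: str) -> bytes:
    """Convert bytes from `codeset` to UTF-8 (standing in for iconv())."""
    return raw.decode(codeset).encode("utf-8")


if __name__ == "__main__":
    print(environment_codeset())
    # "café" encoded as ISO-8859-1, converted to UTF-8.
    print(to_utf8(b"caf\xe9", "ISO-8859-1"))
```

In a real program the input bytes would come from argv[] and the source encoding from environment_codeset(), rather than being hard-coded as they are here.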

At this point, it's worth noting that most people don't need to care about any of this. If using a library such as GLib – and more specifically, using its GOption command line parsing functionality – all this character encoding conversion is done automatically, and the strings it returns to you are guaranteed to be UTF-8 unless otherwise specified.

Now that we have our input converted to UTF-8, our program can go ahead and do whatever processing it likes on it, safe in the knowledge that the character encoding is well defined and, for example, there aren't any unexpected embedded nul bytes in the strings. (This could happen if, for example, the user's environment character encoding was UTF-16, although this is really unlikely and might not even be possible on Linux. That's a musing for another blog post.)

Having processed the input and produced some output (which we'll assume is in UTF-8, for simplicity), many programs would just printf() the output and be done with it. printf() knows about character encodings, right? Wrong. printf() outputs exactly the bytes which are passed to its format parameter (ignoring all the fancy conversion specifier expansion), so this will only work if the program's internal character encoding is the same as the user's environment character encoding, at least for the characters being output. In many cases, the output of programs is just ASCII, so programs get away with just using printf() because most character encodings are supersets of ASCII. In general, however, more work is required to do things properly.
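A quick way to see the problem is to take the UTF-8 bytes of a non-ASCII character and read them back as though the terminal were using a different encoding. The sketch below (Python, with ISO-8859-1 chosen arbitrarily as the mismatched environment encoding) does this, and also shows why pure ASCII output gets away with it.

```python
# The bytes a UTF-8 program would write for an ellipsis (U+2026).
ellipsis_utf8 = "\u2026".encode("utf-8")   # bytes E2 80 A6

# What an ISO-8859-1 terminal would make of those same three bytes:
# three unrelated characters instead of one ellipsis.
misread = ellipsis_utf8.decode("iso-8859-1")

# Pure ASCII text is byte-for-byte identical in both encodings.
ascii_same = "hello".encode("utf-8") == "hello".encode("iso-8859-1")

print(len(misread), ascii_same)
```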

We need to convert from UTF-8 to the user's environment encoding so that what appears in their terminal is correct. We could just use iconv() again, but that would be boring. Instead, we should be able to use gettext(). This means we get translation support as well, which is always good.

gettext() takes in a msgid string and returns a translated version in the user's locale, if possible. Since these translations are done using message catalogues which may be in a completely different character encoding to the user's environment or the program's internal character encoding (UTF-8), gettext() helpfully converts from the message catalogue encoding to the user's environment encoding (the one returned by nl_langinfo(), discussed above). Great!

But what if no translation exists for a given string? gettext() returns the msgid string, unmodified and unconverted. This means that translatable string literals in our program need to magically be written in the user's environment encoding…and we're back to where we were before we introduced gettext(). Bother.
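The fallback behaviour is easy to demonstrate with Python's gettext module, which follows the same convention of handing back the msgid when no catalogue is found. This is a sketch, not the C API: "no-such-domain" is a made-up text domain assumed to have no catalogue installed, and because Python 3 returns text strings rather than bytes, it shows only the fallback and not the encoding mismatch.

```python
import gettext

# Assumption: no message catalogue exists for this invented text domain,
# so fallback=True yields a translations object that translates nothing.
translations = gettext.translation("no-such-domain", fallback=True)

msgid = "My Unicode string\u2026"
result = translations.gettext(msgid)

# With no translation available, the msgid comes back unmodified.
print(result == msgid)
```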

I see three solutions to this:

  • The gettext() solution: declare that all msgid strings should be in US-ASCII, and thus not use any Unicode characters. This works, provided we make the (reasonable) assumption that the user's environment encoding is a superset of ASCII. This requires that if a program wants to use Unicode characters in its translatable strings, it has to provide an en-US message catalogue to translate the American English msgid strings to American English (with Unicode). Not ideal.
  • The gettext()++ solution: declare that all msgid strings should be in UTF-8, and assume that anybody who's running without message catalogues is using UTF-8 as their environment encoding (this is a big assumption). Also not ideal, but a lot less work.
  • The iconv() solution: instruct gettext() to not return any strings in the user's environment encoding, but to return them all in UTF-8 instead (using bind_textdomain_codeset()), and use UTF-8 for the msgid strings. The program can then pass these translated (and untranslated) strings through iconv() as it did with the input, converting from UTF-8 to the user's environment encoding. More effort, but this should work properly.

An additional complication is that of combining translatable printf() format strings with UTF-8 string output from the program. Since printf() isn't encoding-aware, this requires that both the format string and the parameters are in the same encoding (or we get into a horrible mess with output strings which have substrings encoded in different ways). In this case, since our program output is in UTF-8, we definitely want to go with option 3 from above, and have gettext() return all translated messages in UTF-8. This also means we get to use UTF-8 in msgid strings. Unfortunately, it means that we now can't use printf() directly, and instead have to sprintf() to a string, use iconv() to convert that string from UTF-8 to the user's environment encoding, and then printf() it. Whew.
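The format-then-convert order described above can be sketched as follows. It is written in Python with bytes formatting standing in for sprintf() and the built-in codecs standing in for iconv(); the function name format_then_convert() is hypothetical, and replacing unrepresentable characters with "?" is an assumption made here, not something the C functions do by default.

```python
def format_then_convert(fmt_utf8: bytes, args: tuple, codeset: str) -> bytes:
    """Format a UTF-8 string, then convert the whole result to `codeset`."""
    # Step 1 (the sprintf() step): format while everything is still UTF-8,
    # so the format string and its parameters share one encoding.
    formatted_utf8 = fmt_utf8 % args
    # Step 2 (the iconv() step): convert the finished string in one go.
    # Assumption: characters missing from the target encoding become "?".
    return formatted_utf8.decode("utf-8").encode(codeset, errors="replace")


name_utf8 = "caf\u00e9".encode("utf-8")
line = format_then_convert(b"%s costs %d", (name_utf8, 3), "ISO-8859-1")
print(line)
```

The resulting bytes would then be written to the terminal as-is (in Python, via sys.stdout.buffer.write()), which corresponds to the final printf() of the already-converted string.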

Here's a diagram which hopefully makes some of the journey clearer:

Diagram of the processing of strings from input to output in a C program.

So what does this mean for you? As noted above, in most cases it will mean nothing. Libraries such as GLib should take care of all of this for you, and the world will be a lovely place with ponies (U+1F3A0) and cats (U+1F431) everywhere. Still, I wanted to get this clear in my head, and hopefully it's useful to people who can't make use of libraries like GLib (for whatever reason).

Exploring exactly what GLib does is a matter for another time. Similarly, exploring how Windows does things is also best left to a later post (hint: Windows does things completely differently to Linux and other Unices, and I'm not sure it's for the better).

Unicode in Python

Now that exams are finally over, I can spend more time on GNOMEy things. One problem which has been sitting on my to-do list for a while is that of translatable Unicode strings in Python. It appears that my patch in bug #591496 to get Hamster to use Unicode em-dashes inadvertently broke translation of the strings. Whoops.

It turns out that in order for gettext to properly match and translate a C-locale string which contains Unicode characters, the encoding of the Python file must be specified using a coding: line at the top of the file, and the string in question must be a Unicode object. For example:

# -*- coding: utf-8 -*-
import gettext
my_translated_string = gettext.gettext(u'My Unicode string…')

I don't think this is too common a problem, and I've checked that it doesn't affect any of the other Python modules I've fiddled with, but hopefully this will be useful to someone. As far as I understand it, all translatable strings in Python modules should ideally be u'Unicode objects rather than normal strings' anyway, but don't take my word for it because my Python-fu is weak.

Unicode in GNOME

This is something I’ve been meaning to write about for a while and, I must admit, something I should have written about before I started pushing through changes in GNOME applications. I’m talking about the use of Unicode in GNOME: the use of the proper ellipsis character (“…”), proper en- and em-dashes (“–” and “—”, respectively) and fancy quotation marks.

This is something which has been brought up before, so I’ll try not to reignite the old arguments, and instead concentrate on the unresolved issues. Here are the main points:

  • Proper Unicode characters look nicer than the ASCII versions which substitute for them. The ellipsis is correctly spaced (if one were to use full stops instead, they should technically have non-breaking spaces between them), and the quotation marks are pleasantly curved. This looks nicer, to my eye at least. The difference between en- and em-dashes and the ASCII hyphens used to simulate them is considerable.
  • They’re harder to type on a conventional keyboard, though are easily accessible through the use of the compose key.
  • There are questions about the level of font support for such characters. On my Fedora 11 system, all the fonts except one (“PakTypeTehreer”) have the expected characters (ellipsis, dashes and quotation marks) at the right codepoints, although many of the glyphs are ugly and unloved (e.g. in Hershey and Khmer). DejaVu and Bitstream have excellent support for these characters. There is a suggestion that Pango should be extended to support decomposing the Unicode characters into their ASCII equivalents if a font doesn't support them.
  • There was confusion over what exactly was allowed in source code, and whether UTF-8 characters were allowed in C-locale strings (regardless of their representation in source code). It was decided that they were, but that the most portable way to represent them in C was to use octal escape sequences (e.g. “\342\200\246” instead of “…”). We’ve had Unicode characters in source code since GNOME 2.22, and (apparently) there have been no bug reports on the matter, but there was no conclusive answer about how embedded C compilers (and other, less well-known compilers) cope with such things.
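The octal escapes mentioned in the last point are simply the three bytes of the UTF-8 encoding of the ellipsis, written out one by one. Python byte strings accept the same octal escape syntax as C string literals, so the equivalence can be checked with a short sketch (the variable names are illustrative):

```python
escaped = b"\342\200\246"            # octal escapes, as in the C string literal
literal = "\u2026".encode("utf-8")   # HORIZONTAL ELLIPSIS encoded as UTF-8

# Both spellings produce the same three bytes: E2 80 A6.
same_bytes = escaped == literal
print(same_bytes, escaped.hex())
```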

Obviously, I’m thoroughly in the pro-Unicode camp. I believe it would make our desktop look more professional, and improve legibility of the interface in places. I’ve spoken to Calum Benson of HIG fame and he has no particular objections to mandating use of the appropriate Unicode characters by the HIG.

In the meantime, I’ve been filing bugs against applications to convert them to using proper Unicode characters; this probably wasn’t the best way to go about things, but at least it is a move in the right direction (in my view anyway). Unfortunately, this has come at the cost of inconsistency in the desktop. Most of the changes have been applied after branching for gnome-2-28, however, so if we can work out some guidelines about use of Unicode characters early in the 2.30 cycle (i.e. now), consistency could be maintained in the desktop for the 2.30 release. We might even be able to brag about nice typography for (dare I say it?) GNOME 3.0!

So, should we be expending effort on dealing with fonts which don’t support various Unicode characters, extending Pango to support the appropriate decompositions? Are there any problems with embedded C compilers and Unicode string literals? If we decide to go with a uniform usage of certain Unicode characters, what guidelines shall we go with, and how can we educate translators in how to type them?
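To give a feel for what a decomposition fallback might involve, here is a purely hypothetical sketch in Python. It is not Pango's API or behaviour: the substitution table and the ascii_fallback() name are invented for illustration, and the ASCII replacements chosen are only one possible set.

```python
# Hypothetical substitution table; the replacement choices are illustrative only.
ASCII_FALLBACKS = str.maketrans({
    "\u2026": "...",   # horizontal ellipsis
    "\u2013": "-",     # en dash
    "\u2014": "--",    # em dash
    "\u2018": "'",     # left single quotation mark
    "\u2019": "'",     # right single quotation mark
    "\u201c": '"',     # left double quotation mark
    "\u201d": '"',     # right double quotation mark
})


def ascii_fallback(text: str) -> str:
    """Replace typographic characters with rough ASCII equivalents."""
    return text.translate(ASCII_FALLBACKS)


print(ascii_fallback("Loading\u2026"))
```

A real implementation would presumably apply this only when the selected font lacks a glyph for the character, which this sketch makes no attempt to detect.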