Character encoding and locales

Recently, I've been looking into how character encoding and locales work on Linux, and I thought it might be worthwhile to write down my findings; partly so that I can look them up again later, and partly so that people can correct all the things I've got wrong.

To begin with, let's define some terminology:

  • Character set: a set of symbols which can be used together. This defines the symbols and their semantics, but not how they're encoded in memory. For example: Unicode. (Update: As noted in the comments, the character set doesn't define the appearance of symbols; this is left up to the fonts.)
  • Character encoding: a mapping from a character set to a representation of the characters in memory. For example: UTF-8 is one encoding of the Unicode character set.
  • Nul byte: a single byte which has a value of zero. Typically represented as the C escape sequence ‘\0’.
  • NULL character: the Unicode NULL character (U+0000) in the relevant encoding. In UTF-8, this is just a single nul byte. In UTF-16, however, it's a sequence of two nul bytes.
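
To make the last two definitions concrete, here's a small sketch (the array and function names are made up for illustration). It holds the string "Hi" followed by the NULL character, once in UTF-8 and once in UTF-16LE, and asks strlen() how long it thinks each one is.

```c
#include <stddef.h>
#include <string.h>

/* "Hi" followed by the NULL character (U+0000), in two encodings. */
const unsigned char hi_utf8[]    = { 0x48, 0x69, 0x00 };
const unsigned char hi_utf16le[] = { 0x48, 0x00, 0x69, 0x00, 0x00, 0x00 };

/* Number of bytes a nul-terminated C string function would see. */
size_t c_string_length(const unsigned char *bytes)
{
    return strlen((const char *) bytes);
}
```

For the UTF-8 bytes this gives 2, as you'd hope. For the UTF-16LE bytes it gives 1, because strlen() stops at the nul byte in the middle of the first character.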

Now, the problem: if I'm writing a (command line) C program, how do strings get from the command line to the program, and how do strings get from the program to the terminal? More concretely, what actually happens with argv[] and printf()?

Let's consider the input direction first. When the main() function of a C program is called, it's passed an array of pointers to char arrays, i.e. strings. These strings can be arbitrary byte sequences (for example, file names), but are generally intended/assumed to be encoded in the user's environment's character encoding. This is set using the LC_ALL, LC_CTYPE or LANG environment variables. These variables specify the user's locale which (among other things) specifies the character encoding they use.

So the program receives as input a series of strings which are in an arbitrary encoding. This means that all programs have to be able to handle all possible character encodings, right? Wrong. A standard solution to this already exists in the form of libiconv. iconv() will convert between any two character encodings known to the system, so we can use it to convert from the user's environment encoding to, for example, UTF-8. How do we find out the user's environment encoding without parsing environment variables ourselves? We use setlocale() and nl_langinfo().

setlocale(), when called with an empty string as its locale argument, parses the LC_ALL, LC_CTYPE and LANG environment variables (in that order of precedence) to determine the user's locale, and hence their character encoding. It then stores this locale, which will affect the behaviour of various C runtime functions. For example, it will change the formatting of numbers outputted by printf() to use the locale's decimal separator. Just calling setlocale() doesn't have any effect on character encodings, though. It won't, for example, cause printf() to magically convert strings to the user's environment encoding. More on this later.
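
As a rough sketch of the difference between adopting the environment's locale and merely querying the current one (the helper names here are mine, not part of any API):

```c
#include <locale.h>
#include <stddef.h>

/* Adopt the locale named by the LC_ALL, LC_* and LANG environment variables.
 * Returns the resulting locale name, or NULL if the environment names a
 * locale which isn't available, in which case nothing is changed. */
const char *adopt_environment_locale(void)
{
    return setlocale(LC_ALL, "");
}

/* Query the C runtime's current locale without modifying it. */
const char *current_locale_name(void)
{
    return setlocale(LC_ALL, NULL);
}

/* The decimal separator which printf() uses for "%f" in the current locale. */
const char *current_decimal_separator(void)
{
    return localeconv()->decimal_point;
}
```

A program always starts in the "C" locale, so until adopt_environment_locale() has been called, the separator is "." regardless of what the environment variables say.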

nl_langinfo() is one function affected by setlocale(). When called with the CODESET type, it will return a string identifying the character encoding set in the user's environment. This can then be passed to iconv_open(), and we can use iconv() to convert strings from argv[] to our internal character encoding (which will typically be UTF-8).
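
Here's a minimal sketch of that conversion. The function names convert_string() and locale_to_utf8() are hypothetical, and the sketch makes two simplifying assumptions: the output never needs more than four bytes per input byte, and the target encoding terminates strings with a single nul byte (true for UTF-8, not for UTF-16).

```c
#include <iconv.h>
#include <langinfo.h>
#include <stdlib.h>
#include <string.h>

/* Convert a nul-terminated string between two encodings. Returns a newly
 * allocated string which the caller must free(), or NULL on any failure
 * (unknown encoding, invalid input, unrepresentable character, ...). */
char *convert_string(const char *input, const char *from_codeset,
                     const char *to_codeset)
{
    /* Note the argument order: destination encoding first. */
    iconv_t cd = iconv_open(to_codeset, from_codeset);
    if (cd == (iconv_t) -1)
        return NULL;

    char *in = (char *) input; /* iconv() wants char **, but only reads */
    size_t in_left = strlen(input);
    size_t out_size = in_left * 4 + 16; /* assumed to be generous enough */
    char *result = malloc(out_size);
    if (result == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *out = result;
    size_t out_left = out_size - 1; /* keep one byte for the terminator */

    /* Convert everything, then flush any pending shift state. */
    if (iconv(cd, &in, &in_left, &out, &out_left) == (size_t) -1 ||
        iconv(cd, NULL, NULL, &out, &out_left) == (size_t) -1) {
        free(result);
        iconv_close(cd);
        return NULL;
    }

    *out = '\0';
    iconv_close(cd);
    return result;
}

/* Convert from the environment's encoding to UTF-8. This assumes that
 * setlocale(LC_ALL, "") has already been called by the program. */
char *locale_to_utf8(const char *input)
{
    return convert_string(input, nl_langinfo(CODESET), "UTF-8");
}
```

In main(), this would be used by calling setlocale(LC_ALL, "") once and then passing each argv[] entry through locale_to_utf8(), treating a NULL return as "this argument isn't valid in the user's encoding".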

At this point, it's worth noting that most people don't need to care about any of this. If using a library such as GLib – and more specifically, using its GOption command line parsing functionality – all this character encoding conversion is done automatically, and the strings it returns to you are guaranteed to be UTF-8 unless otherwise specified.

Now that we have our input converted to UTF-8, our program can go ahead and do whatever processing it likes on it, safe in the knowledge that the character encoding is well defined and, for example, there aren't any unexpected embedded nul bytes in the strings. (This could happen if, for example, the user's environment character encoding was UTF-16; although this is really unlikely and might not even be possible on Linux, but that's a musing for another blog post.)

Having processed the input and produced some output (which we'll assume is in UTF-8, for simplicity), many programs would just printf() the output and be done with it. printf() knows about character encodings, right? Wrong. printf() outputs exactly the bytes which are passed to its format parameter (ignoring all the fancy conversion specifier expansion), so this will only work if the program's internal character encoding is equal to the user's environment character encoding, for the characters being outputted. In many cases, the output of programs is just ASCII, so programs get away with just using printf() because most character encodings are supersets of ASCII. In general, however, more work is required to do things properly.

We need to convert from UTF-8 to the user's environment encoding so that what appears in their terminal is correct. We could just use iconv() again, but that would be boring. Instead, we should be able to use gettext(). This means we get translation support as well, which is always good.

gettext() takes in a msgid string and returns a translated version in the user's locale, if possible. Since these translations are done using message catalogues which may be in a completely different character encoding to the user's environment or the program's internal character encoding (UTF-8), gettext() helpfully converts from the message catalogue encoding to the user's environment encoding (the one returned by nl_langinfo(), discussed above). Great!

But what if no translation exists for a given string? gettext() returns the msgid string, unmodified and unconverted. This means that translatable string literals in our program need to magically be written in the user's environment encoding…and we're back to where we were before we introduced gettext(). Bother.
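
The fallback behaviour is easy to observe. In this sketch (is_untranslated() is a name I've made up), gettext() is given a string for which no catalogue exists, and it hands back the very pointer it was given:

```c
#include <libintl.h>

/* Returns non-zero if gettext() had no translation for msgid and so
 * returned the original pointer, with its bytes untouched. */
int is_untranslated(const char *msgid)
{
    return gettext(msgid) == msgid;
}
```

Whatever encoding the string literal was compiled in is therefore the encoding which reaches the caller.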

I see three solutions to this:

  • The gettext() solution: declare that all msgid strings should be in US-ASCII, and thus not use any Unicode characters. This works, provided we make the (reasonable) assumption that the user's environment encoding is a superset of ASCII. This requires that if a program wants to use Unicode characters in its translatable strings, it has to provide an en-US message catalogue to translate the American English msgid strings to American English (with Unicode). Not ideal.
  • The gettext()++ solution: declare that all msgid strings should be in UTF-8, and assume that anybody who's running without message catalogues is using UTF-8 as their environment encoding (this is a big assumption). Also not ideal, but a lot less work.
  • The iconv() solution: instruct gettext() to not return any strings in the user's environment encoding, but to return them all in UTF-8 instead (using bind_textdomain_codeset()), and use UTF-8 for the msgid strings. The program can then pass these translated (and untranslated) strings through iconv() as it did with the input, converting from UTF-8 to the user's environment encoding. More effort, but this should work properly.
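
For the third option, the gettext() setup might look something like the following sketch. The text domain name and the catalogue directory are placeholders I've picked for illustration; a real program would use its own.

```c
#include <libintl.h>
#include <locale.h>
#include <stddef.h>

#define EXAMPLE_DOMAIN    "example-program"   /* hypothetical text domain */
#define EXAMPLE_LOCALEDIR "/usr/share/locale" /* assumed catalogue location */

/* Set up gettext() so that translated strings come back in UTF-8 rather
 * than in the environment's encoding. Returns 0 on success, -1 on failure. */
int init_translations_utf8(void)
{
    /* The result is deliberately ignored: if the environment names an
     * unavailable locale, we just carry on in the "C" locale. */
    setlocale(LC_ALL, "");

    if (bindtextdomain(EXAMPLE_DOMAIN, EXAMPLE_LOCALEDIR) == NULL)
        return -1;
    if (bind_textdomain_codeset(EXAMPLE_DOMAIN, "UTF-8") == NULL)
        return -1;
    if (textdomain(EXAMPLE_DOMAIN) == NULL)
        return -1;

    return 0;
}
```

After this, translated strings arrive in UTF-8, and untranslated msgid strings are returned as written, so they need to be UTF-8 in the source code too.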

An additional complication is that of combining translatable printf() format strings with UTF-8 string output from the program. Since printf() isn't encoding-aware, this requires that both the format string and the parameters are in the same encoding (or we get into a horrible mess with output strings which have substrings encoded in different ways). In this case, since our program output is in UTF-8, we definitely want to go with option 3 from above, and have gettext() return all translated messages in UTF-8. This also means we get to use UTF-8 in msgid strings. Unfortunately, it means that we now can't use printf() directly, and instead have to sprintf() to a string, use iconv() to convert that string from UTF-8 to the user's environment encoding, and then printf() it. Whew.
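
A sketch of that format, convert, then print sequence is below. print_utf8() is a hypothetical helper, not an existing function. I've used vsnprintf() rather than sprintf() so that the buffer can't overflow, and messages which don't fit in the fixed-size buffers are simply rejected to keep the example short. It assumes setlocale(LC_ALL, "") has already been called.

```c
#include <iconv.h>
#include <langinfo.h>
#include <stdarg.h>
#include <stdio.h>

/* printf()-like helper: the format string and any string arguments are
 * UTF-8; the output is written in the environment's encoding.
 * Returns 0 on success, -1 on failure (including unrepresentable
 * characters, since iconv() errors out rather than substituting). */
int print_utf8(FILE *stream, const char *format, ...)
{
    char utf8[1024];
    char converted[4096];
    va_list args;

    va_start(args, format);
    int len = vsnprintf(utf8, sizeof utf8, format, args);
    va_end(args);
    if (len < 0 || (size_t) len >= sizeof utf8)
        return -1;

    iconv_t cd = iconv_open(nl_langinfo(CODESET), "UTF-8");
    if (cd == (iconv_t) -1)
        return -1;

    char *in = utf8;
    char *out = converted;
    size_t in_left = (size_t) len;
    size_t out_left = sizeof converted;

    /* Convert the whole message, then flush any pending shift state. */
    int ok = iconv(cd, &in, &in_left, &out, &out_left) != (size_t) -1 &&
             iconv(cd, NULL, NULL, &out, &out_left) != (size_t) -1;
    iconv_close(cd);
    if (!ok)
        return -1;

    size_t out_len = (size_t) (out - converted);
    return fwrite(converted, 1, out_len, stream) == out_len ? 0 : -1;
}
```

A call such as print_utf8(stdout, gettext("Found %d files\n"), n) would then behave like printf() for a user in a UTF-8 locale, and fail cleanly (rather than print mojibake) for a user whose encoding can't represent the message.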

Here's a diagram which hopefully makes some of the journey clearer:

[Figure: diagram of the processing of strings from input to output in a C program.]

So what does this mean for you? As noted above, in most cases it will mean nothing. Libraries such as GLib should take care of all of this for you, and the world will be a lovely place with ponies (U+1F3A0) and cats (U+1F431) everywhere. Still, I wanted to get this clear in my head, and hopefully it's useful to people who can't make use of libraries like GLib (for whatever reason).

Exploring exactly what GLib does is a matter for another time. Similarly, exploring how Windows does things is also best left to a later post (hint: Windows does things completely differently to Linux and other Unices, and I'm not sure it's for the better).

21 thoughts on “Character encoding and locales”

  1. Sean Burke

    First, a little nitpicking: technically, the character set doesn't define the appearance (the glyph) of a character. That's the job of the fonts only. It's the job of the font designer to understand what the character means and how it should be represented. The examples included in the Unicode standard are informative only.

    On to the subject of how strings should be output. The second two options result in essentially the same restriction: the only way to guarantee that the input and output are dealt with sanely is for the user to use a UTF-8 locale. Using UTF-8 in message catalogues allows developers to use Unicode characters in their translatable strings, and in practice they do. The most widespread example I know of is typographical punctuation.

    Many fewer locales have an equivalent for the Unicode characters in use in UTF-8 message catalogues, so iconv() ends up not being a suitable solution. iconv() will fail to convert those strings and the user won't get the information they need. Restricting developers from using Unicode characters would be difficult and not always a good thing.

    On the other hand, while the first solution isn't totally ideal, it has the advantage of allowing users with legacy encoding locales to keep using them. As you said, ASCII is a subset of just about every encoding with much currency. It also has the additional advantage of simplifying localization a great deal. It provides consistency for users, especially for users still using the C locale. It provides consistency for translators. And while this is fairly Anglocentric of me, it is the language with the widest international use.

    In the end, because encodings are such a nightmare area, my preference is the first option with a healthy dose of encouraging users to move over to UTF-8 locales wherever possible.

    1. Philip Withnall Post author

      Thanks for your comments. I've updated the post to fix the definition of character sets.

      I did intend to have a paragraph on the differences between character sets wrt. input and output conversion, but apparently I forgot about it, and the whole post ended up being about converting character encodings.

      My take on the problem would be that it's not really iconv()'s fault that many character sets are subsets of Unicode. As you say, the only way to guarantee that every character the program uses is representable in the user's environment is for the user's environment to be using a character set which has Unicode as a subset.

      I see two solutions, short of requiring only ASCII characters to be used in msgid strings:
      The first is for the programmer to strike a balance in the number of Unicode characters they use, such that even if they get replaced, the meaning of the strings is still clear. This works under the assumption that any locale with an encoding which forces such characters to be replaced will eventually end up getting its own message catalogue with proper translations of the msgid strings to the appropriate character set (and language, etc.).
      The second is to use some kind of transliteration function on output as well as iconv(). As you say, many of the Unicode characters in use in msgid strings are typographical punctuation, which all have reasonable ASCII fallbacks. If no message catalogue is available for the locale, and the user's environment character set is not a superset of Unicode, the output strings could be transliterated to (hopefully) reduce the number of replacement characters needed. (I should note, however, that I don't know much more about transliteration than this — I certainly have never looked into libraries implementing it.)

      1. Sean Burke

        Again, I see problems with these solutions. The problem with the first is that iconv will stop converting when it encounters something it can't convert. The problem with the second is that it assumes you can be aware ahead of time of the range of Unicode characters that will be used and that you can find a useful representation in ASCII.

        1. Philip Withnall Post author

          Ack, yes. I somehow ended up thinking that iconv() would replace unrepresentable glyphs with a replacement character rather than just erroring.

          For my second suggestion, the programmer can know ahead of time the full range of Unicode characters which may need transliterating, as they're just the ones used in the msgid strings. I'm not considering transliterating translated strings from message catalogues, as that would be ridiculous.

          I guess if you can't find a useful representation of a given Unicode character in ASCII, there's always the replacement character. Not perfect, but the best I could come up with.

          1. Sean Burke

            I still feel like the best option here is to have your msgids written in American English and let gettext handle character set conversion. At least then your fallback strings will be mostly viewable.

  2. Bob Bobson

    Is there anyone on the planet who doesn't just use Unicode with UTF-8? Just assume it's that and you'll be fine.

    1. Philip Withnall Post author

      I believe that Japan, China and other East Asian countries still make moderate use of encodings such as Shift JIS and Big5, which aren't UTF-8 compatible. (Please correct me if I'm wrong.)

      1. Sean Burke

        What do you mean by UTF-8 compatible? Both Shift_JIS and Big5 are subsets of Unicode. Though the thrust seems to be, "don't assume UTF-8". I'd agree, since it's hardly true that everyone's using it.

        1. Philip Withnall Post author

          Not UTF-8 compatible in that they're not encodings of Unicode, and their encodings aren't byte-compatible subsets of UTF-8 in the way that ISO-8859 is, for example.

          1. Sean Burke

            Ahh, right. Well, it's worth noting that just about any locale that isn't Western European (and probably some that are) will still have some people using locales which aren't byte-compatible with UTF-8. The punchline is, assuming people are using a UTF-8 locale is a Bad Idea. (I'm with Philip here.)

    2. Mathias

      Firefox, Java, JavaScript, Qt, Windows NT, ... - They all use UTF-16 to encode Unicode.

    1. Philip Withnall Post author

      That's what I would (and did) recommend. The point of this post was to try and understand what's going on “under the bonnet” a little better.

  3. Jeffrey Stedfast

    > We use setlocale() and nl_langinfo().

    Sadly, this is wrong as I discovered a while back while fixing a bug in gmime.

    It turns out that setlocale() is completely useless across the board as you cannot rely on it returning anything useful. It certainly doesn't parse the LC_ALL, LC_CTYPE, nor LANG environment variables. Try it.

    On Cygwin, it always returns "C" (hard-coded apparently) and on my Linux system, it didn't seem to be affected at all by my environment, also always returning "C".

    As for nl_langinfo (CODESET), it seems that on some systems this will always return "US-ASCII" when that is not, in fact, the correct codeset. So... if you get back US-ASCII, you need to fall back to using the LC_ALL, LC_CTYPE and LANG environment variables (see http://git.gnome.org/browse/gmime/tree/gmime/gmime-charset.c#n269 for an example of how to arrive at the system's locale charset... it's not pretty).

  4. Jeffrey Stedfast

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main (int argc, char **argv)
    {
        char *locale = setlocale (LC_ALL /* or LC_CTYPE */, NULL);

        /* glibc prints "(null)" for a NULL %s argument; this isn't portable */
        printf ("setlocale() = %s\n", locale);
        printf ("LC_ALL = %s\n", getenv ("LC_ALL"));
        printf ("LC_CTYPE = %s\n", getenv ("LC_CTYPE"));
        printf ("LANG = %s\n", getenv ("LANG"));

        return 0;
    }

    Here are my results on Linux (which is probably the only system that you could possibly expect setlocale() to work on):

    [fejj@serenity gmime]$ ./setlocale 
    setlocale() = C
    LC_ALL = (null)
    LC_CTYPE = (null)
    LANG = en_US.utf8
    
    [fejj@serenity gmime]$ LC_ALL=en_US.UTF-8 ./setlocale 
    setlocale() = C
    LC_ALL = en_US.UTF-8
    LC_CTYPE = (null)
    LANG = en_US.utf8
    
    [fejj@serenity gmime]$ LC_ALL=en_US.ISO-8859-1 ./setlocale 
    setlocale() = C
    LC_ALL = en_US.ISO-8859-1
    LC_CTYPE = (null)
    LANG = en_US.utf8
    
    [fejj@serenity gmime]$ LC_CTYPE=ISO-8859-1 ./setlocale 
    setlocale() = C
    LC_ALL = (null)
    LC_CTYPE = ISO-8859-1
    LANG = en_US.utf8

    I also tried passing LC_CTYPE to setlocale() but that made no difference.

    1. Philip Withnall Post author

      You need to pass an empty string as the second parameter of setlocale() to get it to parse the environment variables. Passing NULL is defined to just query the current C runtime locale, and definitely not modify it.

      Changing your code to call setlocale() as setlocale (LC_ALL, ""); works for me on Linux.

        1. Philip Withnall Post author

          I would say that using setlocale(LC_ALL, "") is more portable than parsing the environment yourself. For example, AIX uses funny locale names whereby the case of the locale name specifies the character encoding in use. Its setlocale() function can handle this, but you'd have a hard time getting it to work manually.

          Of course, calling setlocale(LC_ALL, "") also means that the C runtime locale is nicely set for you, and you don't have to make a subsequent call to it with your manually-parsed locale just to get printf()'s formatting to be locale-dependent.

          1. Jeffrey Stedfast

            I meant as far as getting the system locale charset. I don't have the option of calling setlocale(LC_ALL, "") because I'm a library author, so I can't go calling setlocale() on behalf of the program my library is being used from w/o risking destroying what the program may have already set.

          2. Philip Withnall Post author

            That's true, but you could require that the application call setlocale(LC_ALL, "") before initialising gmime; calling setlocale(LC_CTYPE, NULL) within gmime should then work, shouldn't it?

  5. kd_harrington

    Ye whittle meaning from the secrets of black magic.

    Please could you explain supersymmetry to me next?
