Unicode in GNOME

This is something I’ve been meaning to write about for a while and, I must admit, something I should have written about before I started pushing through changes in GNOME applications. I’m talking about the use of Unicode in GNOME: the use of the proper ellipsis character (“…”), proper en- and em-dashes (“–” and “—”, respectively) and fancy quotation marks.

This is something which has been brought up before, so I’ll try not to reignite the old arguments, and instead concentrate on the unresolved issues. Here are the main points:

Proper Unicode characters look nicer than the ASCII versions which substitute for them. The ellipsis is correctly spaced (if one were to use full stops instead, they should technically have non-breaking spaces between them), and the quotation marks are pleasantly curved. This looks nicer, to my eye at least. The difference between en- and em-dashes and the ASCII hyphens used to simulate them is considerable.
These characters are harder to type on a conventional keyboard, though they are easily accessible through the compose key.
There are questions about the level of font support for such characters. On my Fedora 11 system, all the fonts except one (“PakTypeTehreer”) have the expected characters (ellipsis, dashes and quotation marks) at the right codepoints, although many of the glyphs are ugly and unloved (e.g. in Hershey and Khmer). DejaVu and Bitstream have excellent support for these characters. There is a suggestion that Pango should be extended to support decomposing the Unicode characters into their ASCII equivalents if a font doesn’t support them.
There was confusion over what exactly was allowed in source code, and whether UTF-8 characters were allowed in C-locale strings (regardless of their representation in source code). It was decided that they were, but that the most portable way to represent them in C was to use octal slash escaping (e.g. “\342\200\246” instead of “…”). We’ve had Unicode characters in source code since GNOME 2.22, and (apparently) there have been no bug reports on the matter, but there was no conclusive answer about how embedded C compilers (and other, less well-known compilers) cope with such things.
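To make the octal form concrete, here is a minimal sketch of how the escapes map to UTF-8 bytes. The macro and function names are made up for illustration; only the byte values come from the UTF-8 encoding of U+2026, U+2013 and U+2014.

```c
#include <string.h>

/* Hypothetical macro names, for illustration only.
 * U+2026 HORIZONTAL ELLIPSIS is the UTF-8 bytes E2 80 A6,
 * which are 342 200 246 when written in octal. */
#define ELLIPSIS_OCTAL "\342\200\246"
#define EN_DASH_OCTAL  "\342\200\223"  /* U+2013, bytes E2 80 93 */
#define EM_DASH_OCTAL  "\342\200\224"  /* U+2014, bytes E2 80 94 */

/* Returns 1 if the octal-escaped literal holds exactly the bytes E2 80 A6. */
int ellipsis_bytes_match(void)
{
    static const unsigned char expected[] = { 0xE2, 0x80, 0xA6 };

    return strlen(ELLIPSIS_OCTAL) == 3 &&
           memcmp(ELLIPSIS_OCTAL, expected, sizeof expected) == 0;
}
```

Because the escapes are plain ASCII in the source file, the compiler never has to know the file's encoding; it just emits the three bytes.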

Obviously, I’m thoroughly in the pro-Unicode camp. I believe it would make our desktop look more professional, and improve legibility of the interface in places. I’ve spoken to Calum Benson of HIG fame and he has no particular objections to mandating use of the appropriate Unicode characters by the HIG.

In the meantime, I’ve been filing bugs against applications to convert them to using proper Unicode characters; this probably wasn’t the best way to go about things, but at least it is a move in the right direction (in my view anyway). Unfortunately, this has come at the cost of inconsistency in the desktop. Most of the changes have been applied after branching for gnome-2-28, however, so if we can work out some guidelines about use of Unicode characters early in the 2.30 cycle (i.e. now), consistency could be maintained in the desktop for the 2.30 release. We might even be able to brag about nice typography for (dare I say it?) GNOME 3.0!

So, should we be expending effort on dealing with fonts which don’t support various Unicode characters, extending Pango to support the appropriate decompositions? Are there any problems with embedded C compilers and Unicode string literals? If we decide to go with a uniform usage of certain Unicode characters, what guidelines shall we go with, and how can we educate translators in how to type them?
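As a rough picture of what such a decomposition fallback might look like, here is a small sketch. The function `ascii_fallback` is hypothetical and is not a Pango API. The ellipsis mapping follows Unicode's compatibility decomposition of U+2026 to three full stops; the dash and quote mappings are only the conventional ASCII stand-ins and are assumptions on my part.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of Pango): returns an ASCII stand-in for a
 * typographic code point, or NULL if no substitution is defined here. */
const char *ascii_fallback(uint32_t codepoint)
{
    switch (codepoint) {
    case 0x2026: return "...";  /* horizontal ellipsis (compatibility decomposition) */
    case 0x2013: return "-";    /* en dash: assumed stand-in */
    case 0x2014: return "--";   /* em dash: assumed stand-in */
    case 0x2018:                /* left single quotation mark */
    case 0x2019: return "'";    /* right single quotation mark */
    case 0x201C:                /* left double quotation mark */
    case 0x201D: return "\"";   /* right double quotation mark */
    default:     return NULL;   /* leave every other character untouched */
    }
}
```

A renderer would presumably call something like this only after finding that the chosen font has no glyph for the code point, so fonts with good coverage are unaffected.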

Sources:

20 thoughts on “Unicode in GNOME”

Joe Buck October 2, 2009 at 00:13

The ideal case would be that the application developer would specify the proper Unicode characters, but Pango or some other mechanism would automatically support the nearest equivalent (e.g. three periods for an ellipsis) if the chosen font lacked the character. That way the app would look good if possible, but still work in any case.
David October 2, 2009 at 04:18

Hear, hear! “Straight quotes” are evil in so many ways. I wish rounded quotes were part of standard keyboard configurations and that everyone used them for everything (yes, including programming languages). The Compose key is also absolutely essential (though I wish it were quasimodal, which would allow more and more natural sequences). If only keyboards weren’t so tied to their typewriter heritage… At the very least, GTK+ text inputs could be modified to auto-correct for these characters, like word processors already do. (Is there a bug for this?) /rant

As for your last question, my idea of auto-correction would make things very easy for the future. Additionally, perhaps it’s time to suggest that Compose be mapped to Caps Lock by default?
1. Jeroen Hoek October 2, 2009 at 10:39
  
  I'm partial to the right alt for compose myself. I agree that its functionality should be advertised better, and enabled by default. If you've ever shown a non-technical user of any OS how easy it can be to enter characters not on your keyboard, you know that people do want to use them, but just assume they can't.
  
  Autocorrection at the GTK+ level is a step too far I'm afraid. A developer working on some database shouldn't have to worry about a GTK+ editor or database tool converting his ASCII text to multibyte characters.
  1. David October 3, 2009 at 04:12
    
    If autocorrection were available for GTK+ text fields, applications could decide whether it should be enabled or not. Thus, gedit could allow it only on non–source-code. For most users in most text contexts, autocorrecting by default would be a big win. After all, you shouldn’t need to use a word processor just so you have nicely formatted characters in your text. This would also be a fun opportunity to one-up other platforms in æsthetics.
Benjamin Otte October 2, 2009 at 06:39

I'm all for saying it's encouraged to put the proper Unicode characters in source code. After all, lots of code already contains the © sign in the header, and as you said: no compilers ever complained.

There are two things I'd like to have about this:

First, I'd like to have some web page that explains the suggested behavior and the rationale behind it. This is very useful to avoid arguments in bug reports, both about UI and about how to write code.

Second, I care very much about not putting Unicode escape sequences into source code. Applications might be able to handle them portably, but they are not very portable to humans trying to hack on the code. I know I wouldn't spot the bug in the string "Save As\342\246\200" when reviewing a patch.
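To see why a swapped escape is so hard to spot, here is a minimal sketch of a decoder for three-byte UTF-8 sequences. The function name is hypothetical, and it deliberately skips the overlong and surrogate checks a full validator would need. Both byte orders are well-formed UTF-8, so nothing complains; they simply name different characters.

```c
#include <stdint.h>

/* Hypothetical helper: decodes one three-byte UTF-8 sequence at the start of
 * s and returns its code point, or 0 if the bytes are not a three-byte
 * sequence. Not a complete validator (no overlong or surrogate checks). */
uint32_t decode_utf8_3(const char *s)
{
    const unsigned char *b = (const unsigned char *) s;

    /* Lead byte must be 1110xxxx, both continuation bytes 10xxxxxx.
     * Short-circuiting stops the checks before reading past a NUL. */
    if ((b[0] & 0xF0) != 0xE0 ||
        (b[1] & 0xC0) != 0x80 ||
        (b[2] & 0xC0) != 0x80)
        return 0;

    return ((uint32_t) (b[0] & 0x0F) << 12) |
           ((uint32_t) (b[1] & 0x3F) << 6) |
            (uint32_t) (b[2] & 0x3F);
}
```

With this, `decode_utf8_3("\342\200\246")` gives 0x2026, the ellipsis, while `decode_utf8_3("\342\246\200")` gives 0x2980, which is a different character altogether.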
1. Philip Withnall Post author October 2, 2009 at 07:23
  
  I think it was decided that using octal escapes was the only spec-conformant way to use Unicode in C, but I agree: it is ugly. How about some automake magic which replaces Unicode characters in string literals with their octal escapes before compilation? That would also make it easier for translators, since they wouldn't have to deal with the octal escapes then, either (although gettext might convert them back already anyway; I don't know).
  
  On the other hand, bugs in strings are notoriously hard to spot without running the program anyway, so we don't really lose anything by switching to octal escapes.
2. Emmanuele Bassi October 2, 2009 at 08:30
  
  we could add the "HIG approved" octal escaped Unicode glyphs to Pango, as convenience macro:
  
  /**
   * PANGO_ELLIPSIS_S:
   *
   * Evaluates to a string containing the ellipsis glyph in octal form, useful to concatenate
   * strings, e.g.:
   * |[
   * button = gtk_button_new_with_label ("Save as" PANGO_ELLIPSIS_S);
   * ]|
   */
  #define PANGO_ELLIPSIS_S "\342\200\246"
  1. Philip Withnall Post author October 2, 2009 at 16:53
    
    Unfortunately, I don't think gettext can cope with that. I just ran a quick test with _("ABC" FOOBAR "DEF"), with FOOBAR defined as another string, and gettext only ever picked up "ABC".
    1. Jaroslav Smid October 2, 2009 at 18:13
      
      gettext's bug? ...
      1. Philip Withnall Post author October 3, 2009 at 07:15
        
        It's an xgettext problem where it can't understand #defines.
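For what it's worth, the compiler itself handles the macro approach without trouble, as this small sketch suggests. The macro name `ELLIPSIS_S` is hypothetical. Adjacent string literals are merged during translation, so the concatenated label is byte-for-byte the same as the one written out in full. The mismatch arises only because a tool that scans the source text, as xgettext is reported to in this thread, sees the pieces before the preprocessor has expanded anything.

```c
#include <string.h>

/* Hypothetical macro name: the UTF-8 bytes of U+2026 in octal form. */
#define ELLIPSIS_S "\342\200\246"

/* The compiler joins adjacent string literals into a single string. */
static const char label_concat[] = "Save As" ELLIPSIS_S;
static const char label_plain[]  = "Save As\342\200\246";

/* Returns 1 if both spellings produce identical bytes, including the NUL. */
int labels_identical(void)
{
    return sizeof label_concat == sizeof label_plain &&
           memcmp(label_concat, label_plain, sizeof label_plain) == 0;
}
```

So the macro is fine for strings that never need translating, but anything passed through `_()` would have to contain the escape (or the character) directly.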
waldo October 2, 2009 at 08:42

Unfortunately the coreutils maintainers refused my patch to turn the symlink -> arrow into a proper Unicode arrow when the tty supported it. Sniff. But it looks oh so much prettier.
Stuart October 2, 2009 at 08:57

We really need an alternative to the Unicode key combination, Ctrl-Shift-U, because many keyboard layouts (e.g. Greek, Arabic) have no "U" key.

And a representation for Unicode in the terminal.
Marius Gedminas October 2, 2009 at 10:23

I recall a suggestion from somewhere (gettext's documentation perhaps) that strings in the source code should be pure ASCII and use straight quotes etc., but the corresponding en_US.UTF-8 translation ought to use proper Unicode characters for quotes etc.
1. Philip Withnall Post author October 2, 2009 at 16:58
  
  There were such suggestions in the desktop-devel list threads I linked to above, but it was determined that since GTK+ functions were defined to accept UTF-8 strings regardless of locale, it was OK to use Unicode in the C locale (just as long as the program is careful to then not feed such strings to glibc functions, which expect different input encodings depending on the locale).
Alexander Jones October 2, 2009 at 10:32

Good luck with this mate, it was I who set this all off and I got very bored very fast with people's obsession to please obscure, 40-year-old compilers!

By the way, the octal escapes are VERY easy to break: some byte sequences are invalid UTF-8, and one wrong digit can produce one. When you type the characters themselves into your UTF-8 text editor, you will not be able to break the byte sequences, and if for some reason one is broken, it will show up as a very obvious ? placeholder anyway.

@Marius: that means the people writing the English "translations" have to do everything twice. Once in the source code for the C pseudo-locale, and then going through everything again and putting in proper UTF-8 for the en_* locales. DO NOT WANT.
Ben October 2, 2009 at 12:22

Fine with the quotation marks and dashes. And the same for things like math operators (minus, multiplication).

But the ellipsis is a bad example. In Unicode it's equivalent to three normal dots, meaning that it doesn't matter if you write three dots or one ellipsis. And since in monospaced fonts the three dots look much better than the squished ellipsis, it's best to just use three dots.
1. Philip Withnall Post author October 2, 2009 at 17:00
  
  There are situations, such as in terminals, where one would want to use three dots; that's the reason the GLib bug about using proper ellipses was closed as WONTFIX (https://bugzilla.gnome.org/show_bug.cgi?id=596060#c6).
  However, as far as I know, a Unicode ellipsis is not directly equivalent to three dots. An ellipsis actually has more space between the dots than would be produced by three dots, so while it can decompose (for compatibility purposes) to three dots, the two character sequences are not identical.
bochecha October 2, 2009 at 16:12

« and how can we educate translators in how to type them? »

You don't. For an example, just look at the quotes I used above. That's the proper quote in french typography. For another example, in french, we use an unbreakable space before « double » punctuation marks (like « : ; ! ? »).

So you don't tell translators how to use those fancy unicode chars used in english typography. They should (and I suspect most do) already know what are the proper unicode chars to use in their own language. 😉
1. Philip Withnall Post author October 2, 2009 at 17:01
  
  Excellent!
