Category Archives: Tutorials

How-tos, tutorials and guides (mainly on how to code for the web).

(const gchar*) vs. (gchar*) and other memory management stories

Memory management in C is often seen as a minefield, and while it’s not as simple as in more modern languages, if a few conventions are followed it doesn’t have to be hard at all. Here are some pointers about heap memory management in C using GObject and the usual GNOME types and coding conventions, including GObject introspection. A basic knowledge of C is assumed, but no more. Stack memory allocation and other obscure patterns aren’t covered, since they’re much less commonly used in a typical GObject application.

The most important rule when thinking about memory management is: which code has ownership of this memory?

Ownership can be defined as the responsibility to free or deallocate a piece of memory. If code which does own some memory doesn’t deallocate that memory, that’s a memory leak. If code which doesn’t own some memory deallocates it (in addition to the owner doing so), that’s a double-free. Both are bad.

For example, if a function returns a newly allocated string and doesn’t retain a pointer to it, ownership of the string is transferred to the caller. Note that when thinking about ownership transfer, malloc()/free() and reference counting are treated the same: in the former case, a newly allocated piece of heap memory is being transferred; in the latter, a newly incremented reference.

gchar *transfer_full_function (void) {
	return g_strdup ("Newly allocated string.");
}

/* Ownership is transferred here. */
gchar *my_string = transfer_full_function ();

Conversely, if the function does retain a pointer (e.g. inside an object), ownership is not transferred to the caller.

const gchar *transfer_none_function (void) {
	return "Static string.";
}

const gchar *transfer_none_method (MyObject *self) {
	gchar *new_string = g_strdup ("Newly allocated string.");
	/* Ownership is retained by the object here. */
	self->priv->saved_string = new_string;
	return new_string;
}

/* In both of these cases, ownership is not transferred. */
const gchar *my_string1 = transfer_none_function ();
const gchar *my_string2 = transfer_none_method (some_object);

In all of these examples, you may notice the return type of the functions reflects whether they transfer ownership of their return values. const return types indicate no transfer, and non-const return types indicate some transfer.

While this is useful, it’s by no means completely clear. For example, what if a function returns a non-const GList*? Is ownership of the list elements transferred, or just the list? What if the function’s author forgot to make the return type const, and actually there’s no transfer? This is where documentation comments are useful.

There’s a convention in GNOME documentation comments to specify the function which should be used to free a returned value. If such a function is mentioned, ownership of the returned memory is transferred. If no function is mentioned, ownership probably isn’t transferred, but it’s hard to be sure. That’s why it’s good to always be explicit when writing documentation comments.

/**
 * my_object_get_some_string:
 * @self: a #MyObject
 *
 * Gets the value of the #MyObject:some-string property.
 *
 * Return value: (transfer none): some string, owned by the
 * object and must not be freed
 */
const gchar *my_object_get_some_string (MyObject *self) {
	return self->priv->some_string;
}

/**
 * my_object_build_result:
 * @self: a #MyObject
 *
 * Builds a result string which probably represents something
 * meaningful.
 *
 * Return value: (transfer full): a newly allocated result
 * string; free with g_free()
 */
gchar *my_object_build_result (MyObject *self) {
	return g_strdup_printf ("%s %s",
	                        self->priv->some_string,
	                        self->priv->some_other_string);
}

/**
 * my_object_dup_controller:
 * @self: a #MyObject
 *
 * Gets the value of the #MyObject:controller property,
 * incrementing the controller's reference count.
 *
 * Return value: (transfer full): the object's
 * controller; unref with g_object_unref()
 */
MyController *my_object_dup_controller (MyObject *self) {
	return g_object_ref (self->priv->controller);
}

When GObject introspection was introduced, these kinds of documentation comments were formalised as introspection annotations: (transfer full), (transfer container) or (transfer none), as documented on wiki.gnome.org. These allow the runtimes of language bindings to manage memory correctly. Since a C programmer is essentially doing the same job as a language runtime when writing C code, the information provided by transfer annotations is sufficient for perfect memory management. Unfortunately, not all parameters and return types of all functions have these annotations added. The examples above do, however.

Finally, a few libraries use a function naming convention. Functions named *_get_* do not transfer ownership, whereas functions named *_dup_* (for ‘duplicate’) do transfer ownership. This can be seen in the examples above, or with json_node_get_array() vs. json_node_dup_array(). Be aware, though, that only a few libraries use this convention. Other libraries use *_get_* for both functions which do and do not transfer ownership of their results. Other code, such as that generated by gdbus-codegen, uses a different and incompatible convention: *_get_* methods signify full transfer, and *_peek_* methods signify no transfer. For example, goa_object_get_manager() vs. goa_object_peek_manager(). For this reason, going by function naming conventions only works within libraries, not between different libraries.
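As an illustration (a minimal sketch, assuming the JsonNode holds an array), the two json-glib functions would be handled like this:

#include <json-glib/json-glib.h>

static void inspect_node (JsonNode *node) {
	/* No transfer: the node keeps ownership of the returned array,
	 * so it must not be unreffed here. */
	JsonArray *borrowed = json_node_get_array (node);
	g_message ("Array has %u elements", json_array_get_length (borrowed));

	/* Full transfer: a new reference is returned, so it must be
	 * unreffed when finished with. */
	JsonArray *owned = json_node_dup_array (node);
	g_message ("Array has %u elements", json_array_get_length (owned));
	json_array_unref (owned);
}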

Memory management of parameters is analogous to return values: look at whether the parameter is const and whether there are any introspection annotations for it.
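For example, here’s a hedged sketch of the two conventions applied to parameters, using hypothetical setters written in the style of the examples above:

/**
 * my_object_set_name:
 * @self: a #MyObject
 * @name: (transfer none): the new name, which is copied by the object
 */
void my_object_set_name (MyObject *self, const gchar *name) {
	g_free (self->priv->name);
	/* The caller keeps ownership of @name; the object takes a copy. */
	self->priv->name = g_strdup (name);
}

/**
 * my_object_take_name:
 * @self: a #MyObject
 * @name: (transfer full): the new name, ownership of which is taken
 * by the object; the caller must not free it afterwards
 */
void my_object_take_name (MyObject *self, gchar *name) {
	g_free (self->priv->name);
	/* Ownership of @name is transferred to the object. */
	self->priv->name = name;
}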

In summary, here are a set of guidelines one can follow to determine whether ownership of a return value is transferred, and hence whether the caller needs to free it:

  1. If the type has an introspection (transfer) annotation, look at that.
  2. Otherwise, if the type is const, there is no transfer.
  3. Otherwise, if the function documentation explicitly specifies the return value must be freed, there is full or container transfer.
  4. Otherwise, if the function is named *_dup_*, there is full or container transfer.
  5. Otherwise, if the function is named *_peek_*, there is no transfer.
  6. Otherwise, you need to look at the function’s code to determine whether it intends ownership to be transferred. Then file a bug against the documentation for that function, and ask for an introspection annotation to be added.

Some common pitfalls:

  • If you’re using an explicit typecast (e.g. casting a (const gchar*) return value to (gchar*)), it’s likely something’s wrong.
  • Generally, return values which are not transferred (such as (const gchar*)) are freed when the owning object is destroyed — so if you need such a value to persist, you must copy it or increase its reference count, as shown in the sketch below. You then have ownership of the copy or new reference, and are responsible for freeing it (not the original).
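For example, here’s a sketch of the copy-to-persist approach, reusing my_object_get_some_string() from above (the surrounding object handling is hypothetical):

/* No transfer: @borrowed will be freed when @object is destroyed. */
const gchar *borrowed = my_object_get_some_string (object);

/* Take a copy which we own, so it outlives @object. */
gchar *owned = g_strdup (borrowed);
g_object_unref (object);

/* @owned is still valid here, and we are responsible for freeing it. */
g_free (owned);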

How can one check for incorrect memory handling? Use Valgrind. It will detect leaks and double-frees, and is simple to use:

valgrind --tool=memcheck --leak-check=full my-program-name

Or, if running your program from the source directory, use the following to avoid running leak checking on the libtool helper scripts:

libtool --mode=execute valgrind --tool=memcheck --leak-check=full ./my-program-name

Valgrind lists each memory problem it detects, along with a short backtrace (if you’ve compiled your program with debug symbols), allowing the cause of the memory error to be pinpointed and fixed!

Character encoding and locales

Recently, I've been looking into how character encoding and locales work on Linux, and I thought it might be worthwhile to write down my findings; partly so that I can look them up again later, and partly so that people can correct all the things I've got wrong.

To begin with, let's define some terminology:

  • Character set: a set of symbols which can be used together. This defines the symbols and their semantics, but not how they're encoded in memory. For example: Unicode. (Update: As noted in the comments, the character set doesn't define the appearance of symbols; this is left up to the fonts.)
  • Character encoding: a mapping from a character set to a representation of the characters in memory. For example: UTF-8 is one encoding of the Unicode character set.
  • Nul byte: a single byte which has a value of zero. Typically represented as the C escape sequence ‘\0’.
  • NULL character: the Unicode NULL character (U+0000) in the relevant encoding. In UTF-8, this is just a single nul byte. In UTF-16, however, it's a sequence of two nul bytes.

Now, the problem: if I'm writing a (command line) C program, how do strings get from the command line to the program, and how do strings get from the program to the terminal? More concretely, what actually happens with argv[] and printf()?

Let's consider the input direction first. When the main() function of a C program is called, it's passed an array of pointers to char arrays, i.e. strings. These strings can be arbitrary byte sequences (for example, file names), but are generally intended/assumed to be encoded in the user's environment's character encoding. This is set using the LC_ALL, LC_CTYPE or LANG environment variables. These variables specify the user's locale which (among other things) specifies the character encoding they use.

So the program receives as input a series of strings which are in an arbitrary encoding. This means that all programs have to be able to handle all possible character encodings, right? Wrong. A standard solution to this already exists in the form of libiconv. iconv() will convert between any two character encodings known to the system, so we can use it to convert from the user's environment encoding to, for example, UTF-8. How do we find out the user's environment encoding without parsing environment variables ourselves? We use setlocale() and nl_langinfo().

setlocale() parses the LC_ALL, LC_CTYPE and LANG environment variables (in that order of precedence) to determine the user's locale, and hence their character encoding. It then stores this locale, which will affect the behaviour of various C runtime functions. For example, it will change the formatting of numbers outputted by printf() to use the locale's decimal separator. Just calling setlocale() doesn't have any effect on character encodings, though. It won't, for example, cause printf() to magically convert strings to the user's environment encoding. More on this later.

nl_langinfo() is one function affected by setlocale(). When called with the CODESET type, it will return a string identifying the character encoding set in the user's environment. This can then be passed to iconv_open(), and we can use iconv() to convert strings from argv[] to our internal character encoding (which will typically be UTF-8).
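Putting those pieces together, here’s a minimal sketch (with most error handling omitted) of converting a command line argument from the user’s environment encoding to UTF-8:

#include <locale.h>
#include <langinfo.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

int main (int argc, char *argv[]) {
	/* Parse LC_ALL, LC_CTYPE and LANG, and store the locale. */
	setlocale (LC_ALL, "");

	/* Get the user's environment encoding, e.g. "UTF-8" or "ISO-8859-1". */
	const char *charset = nl_langinfo (CODESET);

	if (argc < 2)
		return EXIT_FAILURE;

	iconv_t cd = iconv_open ("UTF-8", charset);
	if (cd == (iconv_t) -1)
		return EXIT_FAILURE;

	char *in = argv[1], out[1024], *outp = out;
	size_t in_left = strlen (in), out_left = sizeof (out) - 1;
	if (iconv (cd, &in, &in_left, &outp, &out_left) == (size_t) -1)
		return EXIT_FAILURE;
	*outp = '\0';  /* @out now holds argv[1] converted to UTF-8 */

	iconv_close (cd);
	return EXIT_SUCCESS;
}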

At this point, it's worth noting that most people don't need to care about any of this. If using a library such as GLib – and more specifically, using its GOption command line parsing functionality – all this character encoding conversion is done automatically, and the strings it returns to you are guaranteed to be UTF-8 unless otherwise specified.

So we now have our input converted to UTF-8, our program can go ahead and do whatever processing it likes on it, safe in the knowledge that the character encoding is well defined and, for example, there aren't any unexpected embedded nul bytes in the strings. (This could happen if, for example, the user's environment character encoding was UTF-16; although this is really unlikely and might not even be possible on Linux — but that's a musing for another blog post).

Having processed the input and produced some output (which we'll assume is in UTF-8, for simplicity), many programs would just printf() the output and be done with it. printf() knows about character encodings, right? Wrong. printf() outputs exactly the bytes which are passed to its format parameter (ignoring all the fancy conversion specifier expansion), so this will only work if the program's internal character encoding is equal to the user's environment character encoding, for the characters being outputted. In many cases, the output of programs is just ASCII, so programs get away with just using printf() because most character encodings are supersets of ASCII. In general, however, more work is required to do things properly.

We need to convert from UTF-8 to the user's environment encoding so that what appears in their terminal is correct. We could just use iconv() again, but that would be boring. Instead, we should be able to use gettext(). This means we get translation support as well, which is always good.

gettext() takes in a msgid string and returns a translated version in the user's locale, if possible. Since these translations are done using message catalogues which may be in a completely different character encoding to the user's environment or the program's internal character encoding (UTF-8), gettext() helpfully converts from the message catalogue encoding to the user's environment encoding (the one returned by nl_langinfo(), discussed above). Great!

But what if no translation exists for a given string? gettext() returns the msgid string, unmodified and unconverted. This means that translatable string literals in our program need to magically be written in the user's environment encoding…and we're back to where we were before we introduced gettext(). Bother.

I see three solutions to this:

  • The gettext() solution: declare that all msgid strings should be in US-ASCII, and thus not use any Unicode characters. This works, provided we make the (reasonable) assumption that the user's environment encoding is a superset of ASCII. This requires that if a program wants to use Unicode characters in its translatable strings, it has to provide an en-US message catalogue to translate the American English msgid strings to American English (with Unicode). Not ideal.
  • The gettext()++ solution: declare that all msgid strings should be in UTF-8, and assume that anybody who's running without message catalogues is using UTF-8 as their environment encoding (this is a big assumption). Also not ideal, but a lot less work.
  • The iconv() solution: instruct gettext() to not return any strings in the user's environment encoding, but to return them all in UTF-8 instead (using bind_textdomain_codeset()), and use UTF-8 for the msgid strings. The program can then pass these translated (and untranslated) strings through iconv() as it did with the input, converting from UTF-8 to the user's environment encoding. More effort, but this should work properly.

An additional complication is that of combining translatable printf() format strings with UTF-8 string output from the program. Since printf() isn't encoding-aware, this requires that both the format string and the parameters are in the same encoding (or we get into a horrible mess with output strings which have substrings encoded in different ways). In this case, since our program output is in UTF-8, we definitely want to go with option 3 from above, and have gettext() return all translated messages in UTF-8. This also means we get to use UTF-8 in msgid strings. Unfortunately, it means that we now can't use printf() directly, and instead have to sprintf() to a string, use iconv() to convert that string from UTF-8 to the user's environment encoding, and then printf() it. Whew.
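To make the iconv() solution concrete, here’s a hedged sketch of the whole output pipeline; the "my-program" domain and catalogue path are made up, and error handling is omitted. It translates with gettext(), formats while everything is still UTF-8, converts to the user’s environment encoding, and only then writes the result out:

#include <libintl.h>
#include <locale.h>
#include <langinfo.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main (void) {
	setlocale (LC_ALL, "");

	/* Option 3: instruct gettext() to return all strings in UTF-8. */
	bindtextdomain ("my-program", "/usr/share/locale");
	bind_textdomain_codeset ("my-program", "UTF-8");
	textdomain ("my-program");

	/* The format string and the arguments are both UTF-8, so it's safe
	 * to combine them with snprintf(). */
	char message[256];
	snprintf (message, sizeof (message), gettext ("Found %u ponies\n"), 3u);

	/* Convert the formatted UTF-8 string to the environment encoding. */
	iconv_t cd = iconv_open (nl_langinfo (CODESET), "UTF-8");
	char *in = message, out[512], *outp = out;
	size_t in_left = strlen (in), out_left = sizeof (out);
	iconv (cd, &in, &in_left, &outp, &out_left);
	iconv_close (cd);

	fwrite (out, 1, sizeof (out) - out_left, stdout);
	return 0;
}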

Here's a diagram which hopefully makes some of the journey clearer (click for a bigger version):

Diagram of the processing of strings from input to output in a C program.

So what does this mean for you? As noted above, in most cases it will mean nothing. Libraries such as GLib should take care of all of this for you, and the world will be a lovely place with ponies (U+1F3A0) and cats (U+1F431) everywhere. Still, I wanted to get this clear in my head, and hopefully it's useful to people who can't make use of libraries like GLib (for whatever reason).

Exploring exactly what GLib does is a matter for another time. Similarly, exploring how Windows does things is also best left to a later post (hint: Windows does things completely differently to Linux and other Unices, and I'm not sure it's for the better).

Reference count debugging with systemtap

I got some really helpful comments on yesterday's post about reference count debugging with gdb which enabled me to get systemtap working.

Getting systemtap working (on Fedora 13)

Install the systemtap-* and kernel-devel packages as per the instructions on the systemtap wiki. Note that the kernel packages need to be for the same version as the kernel you're currently running. I got caught out by this since I hadn't rebooted since I last downloaded an updated kernel package. You then need to add yourself to the stapdev and stapusr groups. Run the command stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}' to test whether everything's installed and working properly. systemtap might ask you to run a make command at this point, which you need to do.

Writing systemtap probes

The probe I'm using to sort out referencing issues is the following, based off the examples Alex Larsson gave when static probes were initially added to GLib and GObject. I've saved it as refs.stp:

global alive
global my_object = "FooObject"

probe gobject.object_new {
	if (type == my_object)
		alive++
}

probe gobject.object_ref {
	if (type == my_object) {
		printf ("%s %p ref (%u)\n", type, object, refcount)
		print_ubacktrace_brief ()
		printf ("\n")
	}
}

probe gobject.object_unref {
	if (type == my_object) {
		printf ("%s %p unref (%u)\n", type, object, old_refcount)
		print_ubacktrace_brief ()
		printf ("\n")

		if (old_refcount == 1)
			alive--
	}
}

probe end {
	printf ("Alive objects: \n")
	if (alive > 0)
		printf ("%d\t%s\n", alive, my_object)
}

This counts how many instances of the FooObject class are created (using a probe on g_object_new()) and destroyed (probing on g_object_unref() and decrementing the alive counter when the last reference is dropped). References and dereferences are also counted, with a short backtrace being outputted for each, which is the key thing I was looking for when debugging reference counting problems.

Using the probes

I was debugging problems in Empathy, so I had to use the following command:

stap refs.stp \
-d ${libdir}/libfolks.so \
-d ${libdir}/libfolks-telepathy.so \
-d ${libdir}/libglib-2.0.so \
-d ${libdir}/libgobject-2.0.so \
-d ${libdir}/libgee.so \
-d ${libdir}/libgtk-x11-2.0.so \
-d ${bindir}/empathy \
-c "${bindir}/empathy"

Each -d option tells systemtap to load unwind data from the given library or executable, which is the key thing I was missing yesterday; these options are necessary for the backtraces to be useful, since systemtap stops unwinding a backtrace at the first frame it can't map to a symbol name. Note that it's necessary to explicitly tell systemtap to load data from the empathy executable, even though it then runs Empathy to trace it.

This gives output like the following when tracing the EmpathyMainWindow object:

EmpathyIndividualStore 0x09c05a10 ref (2)
g_object_ref+0x138
g_value_object_collect_value+0xe0
g_value_set_instance+0x190
.L1016+0x1e0
g_signal_emit_by_name+0x165
gtk_tree_sortable_sort_column_changed+0x78
gtk_tree_store_set_sort_column_id+0xde
gtk_tree_sortable_set_sort_column_id+0xe6
empathy_individual_store_set_sort_criterium+0x108
individual_store_setup+0x162
empathy_individual_store_init+0xb0
g_type_create_instance+0x1c3
g_object_constructor+0x1d
g_object_newv+0x438
.L345+0xfd
g_object_new+0x8d
empathy_individual_store_new+0xb6
empathy_main_window_init+0x890
g_type_create_instance+0x1c3
g_object_constructor+0x1d
empathy_main_window_constructor+0x4c

EmpathyIndividualStore 0x09c05a10 unref (2)
g_object_unref+0x13f
g_value_object_free_value+0x2a
g_value_unset+0x6d
.L1041+0x100
g_signal_emit_by_name+0x165
gtk_tree_sortable_sort_column_changed+0x78
gtk_tree_store_set_sort_column_id+0xde
gtk_tree_sortable_set_sort_column_id+0xe6
empathy_individual_store_set_sort_criterium+0x108
individual_store_setup+0x162
empathy_individual_store_init+0xb0
g_type_create_instance+0x1c3
g_object_constructor+0x1d
g_object_newv+0x438
.L345+0xfd
g_object_new+0x8d
empathy_individual_store_new+0xb6
empathy_main_window_init+0x890
g_type_create_instance+0x1c3
g_object_constructor+0x1d
empathy_main_window_constructor+0x4c

The only thing I need to do now is to figure out how to script systemtap so that it indents each backtrace nicely according to the reference count of the object.

Reference count debugging with gdb

As I was hacking today, I ran into some hard-to-debug reference counting problems with one of my classes. The normal smattering of printf()s didn't help, and neither did this newfangled systemtap, which was a bit disappointing.

It worked, in that my probes were correctly run and correctly highlighted each reference/dereference of the class I was interested in, but printing a backtrace only extended to the g_object_ref()/g_object_unref() call, and no further. I'm guessing this was a problem with the location of the debug symbols for my code (since it was in a development prefix, whereas systemtap was not), but it might be that systemtap hasn't quite finished userspace stuff yet. That's what I read, at least.

In the end, I ended up using conditional breakpoints in gdb. This was a lot slower than systemtap, but it worked. It's the sort of thing I would've killed to know a few years (or even a few months) ago, so hopefully it's useful for someone (even if it's not the most elegant solution out there).

set pagination off
set $foo=0
break main
run

break g_object_ref
condition 2 _object==$foo
commands
	silent
	bt 8
	cont
	end

break g_object_unref
condition 3 _object==$foo
commands
	silent
	bt 8
	cont
	end

break my_object_init
commands
	silent
	set $foo=my_object
	cont
	end
enable once 4
cont

The breakpoint in main() is to stop gdb discarding our breakpoints out of hand because the relevant libraries haven't been loaded yet. $foo contains the address of the first instance of MyObject in the program; if you need to trace the n+1th instance, use ignore 4 n to only fire the my_object_init breakpoint on the n+1th MyObject instantiation.

This can be extended to track (a fixed number of) multiple instances of the object, by using several $fooi variables and gdb's if statements to set them as appropriate. This is left as an exercise to the reader!

I welcome the inevitable feedback and criticism of this approach. It's hacky, ugly and slower than systemtap, but at least it works.

Reviewing and applying a patch

I've been fortunate enough to have been reviewing a lot of patches recently. Fortunate, because it means other people are contributing to my library. However, few of these contributions are without their problems; as with all contributions, each patch generally goes through two or three iterations before I think it's near enough to being ready that it's easier for me to apply the patch than it is to comment on it and request an updated version.

Generally, when patches get to this stage, it’s just the really small, nitpicky things which are still wrong. Rogue whitespace, quirky indentation, typos in documentation…these are all really minor things, but they take time to check through and correct. One of the latest libgdata patches was, I think, a bit of a rushed job; and so there were more of these niggly problems than usual. Instead of going through and fixing all the problems and never giving detailed feedback about them to the patch contributor – as is usually the way, since the changes are so trivial, albeit cumulatively not inconsiderable – I decided to make a screencast of the process I go through when reviewing a patch.

Reviewing and applying a patch

It's about half an hour long, unscripted and unedited, so there are a couple of mistakes and omissions in there (for example, I later noticed I'd forgotten to mention the lack of input validation on the new function). However, I think it's quite comprehensive. It's aimed at those who are getting used to the open-source patch submission and development process, though those more talented than me welcome to watch it and lambast me for not using Emacs. Hopefully it's useful to someone, anyway. It's licenced under cc-by-sa 2.0.

Update on JS string concatenation

Somebody over at Slashdot has just pointed out a method of easily concatenating numbers. Although this is helpful, it doesn’t fix the underlying problem, which is that JavaScript shouldn’t be using the + operator for two different things; and it still doesn’t help with my main complaint of JS concatenating, rather than adding, two numeric strings. :(

JavaScript string concatenation

I've been coding JavaScript quite a lot for ABXY recently, and one thing which has got me really annoyed (apart from JavaScript's odd and overly-flexible "OOP" architecture) is its + operator.

In theory, the + operator is brilliant. Instead of being boring, and only doing one thing, it can be used for both numerical addition, and string concatenation. That's fine, you might say, but tell me; have you ever considered what happens when you want to add together two numerical strings, or concatenate two numbers?

The answer is that in the first instance, JavaScript will concatenate the strings, and in the second, it will add the two numbers together. That's perfect, according to the design, but it's not helpful. :( Most input and other variables in JavaScript are strings, so every time you want to add two numerical strings together, you have to cast them as numbers, using the parseInt or the parseFloat function, which is easy to forget at best, and downright inefficient at worst.
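A quick illustration of both halves of the problem, and the parseInt() workaround:

var a = "2", b = "3";  // numeric strings, e.g. straight from form inputs
alert(a + b);          // "23": both operands are strings, so + concatenates
alert(parseInt(a, 10) + parseInt(b, 10));  // 5: cast to numbers first, then add

var x = 2, y = 3;      // actual numbers
alert(x + y);          // 5: both operands are numbers, so + adds
alert(x + "" + y);     // "23": a single string operand forces concatenation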

This is awkward, and unlike the rest of JavaScript; in other situations, JavaScript will happily cast between types to find two which match the operation you're trying to perform. Why is it different here? It's different because JavaScript doesn't know whether you're wanting to concatenate, or to add, and this is the underlying problem. JavaScript should follow PHP's example, and have separate addition and concatenation operators (+ and ., respectively).

Relative units and elastic layouts

The astute members in my audience might have realised (if they're reading via a web browser) that the site has recently undergone a bit of a re-jig. Not just on the outside, but also on the inside. As well as using all the nice class features of PHP 5, the site now has a completely elastic layout.

What's an elastic layout? I hear one person cry! It's a new name for a layout paradigm which has been technically possible for a while. Basically, it means that the whole site's layout resizes according to the browser's default text size — not just the text in the site. This is a leap from liquid layouts, as it means that the site effectively zooms in when the browser's font size is increased, and zooms out when the browser's font size is decreased.

This isn't actually that hard to achieve, but few developers that I've seen appear to have picked up on it. Even those who have implemented it have typically left borders and padding with absolute values rather than relative ones. I would argue that this isn't a very good idea. Let's assume (for the sake of argument) that you have a nice big 50" television linked to the Internet viewing an elastic web page which has absolute sizes for the paddings, borders and margins. Looking at the screen from a couple of metres away, and assuming standard margins and paddings of 5px, and borders of 1px in width, the text would almost run together, and the margins, paddings and borders would not be discernible. This can't be good, and this is why I believe that every measurement in an elastic page should be elastic, and use relative units (barring special cases where images have to be used and alignment is critical: as images for the web are almost completely bitmap-based, absolute units still have to be used here).

The most commonly-used unit with which people implement elastic layouts is the em. This unit originated in ye olde typography, where it was traditionally the width of a capital "M" in the current font face (not its height, as sometimes believed); in CSS, 1em is simply equal to the current font size. The em migrated to CSS in the first version of CSS, but hasn’t been used too much since (remember debates about relative-vs.-absolute font sizes?).

The main thing to remember about ems is that 1em is equal to the current font size, whatever that happens to be. Therefore, 2em is equal to double the current font size! :o

The other thing to remember about ems is that they multiply together to generate the final dimensions. So, if you had a parent element with font-size: 2em;, and a child element with font-size: 0.8em;, the final font size would be 1.6em, relative to the parent element’s parent (in this example, we can assume that its font size is just 1em). Other properties using em units, such as padding or margin, compound in the same way, but bear in mind that for those properties an em is relative to the element’s own computed font size; it’s only font-size itself which is resolved against the parent’s.
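As a sketch of that compounding (assuming a browser default font size of 16px; the class names are made up):

.parent {
	font-size: 2em;    /* 2 × 16px = 32px */
}

.parent .child {
	font-size: 0.8em;  /* 0.8 × 32px = 25.6px, i.e. 1.6em of the default */
	padding: 0.5em;    /* relative to the child's own font size: 12.8px */
}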

GET and POST

One of the most basic features of a website is a form. You can use them to send data to a website, search for things, or manipulate the URL. Many less experienced web developers will have heard of GET and POST requests, but what are they really, and what are the differences between them?

To explain them, let's go back to fundamentals. Every time you get a web page from a server, your browser sends an HTTP request to the server, and gets a response. A typical HTTP request to retrieve this site is as follows:

GET /index.php?media=rss HTTP/1.1\r\n
Host: tecnocode.co.uk\r\n
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-GB; rv:1.8.0.5) Gecko/20060731 Ubuntu/dapper-security Firefox/1.5.0.5\r\n
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\n
Accept-Language: en-gb,en-us;q=0.7,en;q=0.3\r\n
Accept-Encoding: gzip,deflate\r\n
Accept-Charset: UTF-8,*\r\n
Keep-Alive: 300\r\n
Connection: keep-alive\r\n
Cookie: foo=bar; foo2=bar\r\n
\r\n

I'm not going to explain it all, but basically, it's asking the server for the /index.php?media=rss page on tecnocode.co.uk (second line). All the other lines are there to detail what can and can't be accepted, and how the connection is going to be handled. The whole thing is terminated with a line containing only rn (the UNIX carriage return and newline escape sequences).

It's the first line we're interested in here, as that is the one detailing the fact that this is a GET request. As you can see, GET requests are used to retrieve most pages you download off the web, but you might not realise that they can be used in forms as well.

POST requests are different to GETs: instead of encoding the form’s parameters in the URL, they encode them in a different part of the request.

POST /index.php?page=login&paragraph=login HTTP/1.1\r\n
Host: tecnocode.co.uk\r\n
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-GB; rv:1.8.0.5) Gecko/20060731 Ubuntu/dapper-security Firefox/1.5.0.5\r\n
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\n
Accept-Language: en-gb,en-us;q=0.7,en;q=0.3\r\n
Accept-Encoding: gzip,deflate\r\n
Accept-Charset: UTF-8,*\r\n
Keep-Alive: 300\r\n
Connection: keep-alive\r\n
Cookie: foo=bar; foo2=bar\r\n
Content-Type: application/x-www-form-urlencoded\r\n
Content-Length: 28\r\n
\r\n
username=DrBob&password=test

You can see here that this request is basically the same as the GET request above, apart from a few minor things. Firstly, the first line says POST instead of GET, and secondly, there are some extra lines at the bottom. Content-Type and Content-Length tell the server the type (encoding) and length of the form data, respectively, then the form data itself is sent; in this case, my username "DrBob", and a fictitious password "test". It's because of this separation of the form data from the URL that POST pages can't be referenced by URL, as a GET request would miss out the POSTed form data.

The next step in understanding is how to use both GET and POST requests in forms. You can already make a GET request by using a hyperlink, but that doesn’t enable you to query the user for their input to the URL’s parameters.

<form action="http://tecnocode.co.uk/" method="get">
	<fieldset>
		<legend>Search terms</legend>
		<label for="query">Enter your search terms. You can "-exclude" keywords.</label>
		<input type="text" name="query" id="query" value="" />
		<input type="hidden" name="page" value="search" />
	</fieldset>
	<fieldset>
		<input type="submit" value="Search" />
	</fieldset>
</form>

Here we have a simple search form, which uses GET. The main difference between this and a POST form is that the method for a GET form is "get". However, there is another difference, and that's that if you try to put parameters on the URL in the action attribute, they will be ignored. With GET forms, all parameters must be done as <input /> fields; that means moving any action parameters to hidden inputs, as is shown in the example. If that example was used with a query of "moo", the URL "http://tecnocode.co.uk/?page=search&query=moo" would be returned.

<form action="http://tecnocode.co.uk/?page=login&paragraph=login" method="post">
	<fieldset>
		<legend>Username</legend>
		<label for="username">Your unique username.</label>
		<input name="username" id="username" type="text" value="" />
	</fieldset>
	<fieldset>
		<legend>Password</legend>
		<label for="password">Your personal password.</label>
		<input name="password" id="password" type="password" value="" />
	</fieldset>
	<fieldset class="submit">
		<input type="submit" value="Login" />
	</fieldset>
</form>

This form makes a POST request with login details, because its method attribute is "post". With POST forms, parameters can be put into the action URL, because the form data itself will be submitted separately from them. When used with the username "DrBob", and the password "test", this form will generate the POST request used as an example further up. Note that no inputs are ever securely encoded: the password field is transmitted in plain-text, because this form only operates over HTTP, as opposed to HTTPS; SSL certificates (which are required for HTTPS to work) cost lots of money.

So when should you use POST, and when should you use GET? Well, the best way to remember is that POST requests should change the state of the server. By that, it is meant that they should trigger some action which will result in future page requests returning pages which are different to those returned before the POST. A good example of this would be to add a news item to a site. GET requests are used for every hyperlink in a site, but they shouldn’t be limited to those. For example, if you have a search function on your website, you will most likely want a search form to feed it. However, this form should not use POST! It does not change the state of the server, and thus should make a GET request. Another effect of using GET requests for tasks such as these is that the user’s browser doesn’t prompt them to ask if they want to re-submit the data when they refresh, and they can bookmark/share the URL of the returned page without it appearing differently for other people.

Some common situations for using GET forms follow:

  • Search
  • Selecting something to view/edit/delete/etc. out of a list
  • Navigating pages (pagination)
  • Selecting the page's stylesheet/view mode (e.g. switching to debug mode on a large web application)
  • Selecting a file mirror for a download

Although it's probably obvious, a list of common POST form usages follows:

  • Adding/Editing/Deleting an item
  • Logging into a site
  • Logging out of a site (yes, this changes the site's state)
  • Submitting a comment or trackback

Just remember: POST changes the server's state. ;)

Getting on form

A lot of sites use forms; it's an ideal way to submit information (how else are you supposed to do it?), but how many actually use forms properly? How many have semantically-correct markup?

Let's start by looking at the elements you should use in a form:

  • <form>
  • <fieldset>
  • <legend>
  • <label>
  • <input />
  • <textarea>
  • <select>
  • <option>
  • <optgroup>
  • <button>

When used properly, these create forms which are both accessible and semantically correct, so other computers can extract information from the forms' markup. However, most websites only make use of a subset of this list of elements, with usually only five elements being used:

  • <form>
  • <input />
  • <textarea>
  • <select>
  • <option>

You'll notice that the <fieldset>, <legend> and <label> tags are missing, and that's what makes most forms poor.

The <label> tag is the easiest to add, and solves most of the accessibility problems associated with badly-coded forms. As the name suggests, it provides a label for a form control, but this label is associated with the element through the use of the for attribute, so that when a user clicks on the label, focus will be given to the associated form control.

<label for="example_input">This is an example input, using the <tt>&lt;input /&gt;</tt> element.</label><input id="example_input" type="text" value="Example input" />

You have two options when using a <label> element. The first is to write a short label (e.g. "Example input" for the above markup), and the second is to write a longer and more informative label (e.g. what's in the above markup). As far as the specifications go, neither is incorrect. Personally, I believe the latter is the better, as it provides more help as to how to use the form control, and a simple label can be provided by the <legend> tag, which is discussed later.

One thing many web developers do when using labels is to place the associated form control inside the label, after (or before) the label text. Although this is permitted by the specifications, it provides no advantage, and just makes the markup ungainly. :(
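For reference, that nested form of association looks like this (no for/id pairing is needed):

<label>This is an example input, using the <tt>&lt;input /&gt;</tt> element. <input type="text" value="Example input" /></label>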

The <fieldset> element is next on the list, and it's quite important. It allows form controls to be grouped together in logical sections.

<fieldset>
<label for="example_input">This is an example input, using the <tt>&lt;input /&gt;</tt> element.</label><input id="example_input" type="text" value="Example input" />
</fieldset>

How you group form controls is very much your own choice. Personally, I stray a little on the wild side and usually assign a fieldset to each form control. This is perhaps segregating them too finely, and is something I should work on to change. One example of a good group would be "Personal details".

Fieldsets by default appear without a label, which isn't very helpful to users, and isn't very good for accessibility (screenreaders don't know anything about the grouping), so you can use the <legend> element to assign a label to a fieldset, much like a <label> assigns a label to a form control.

<fieldset>
<legend>Personal details</legend>
<label for="example_input">This is an example input, using the <tt>&lt;input /&gt;</tt> element.</label><input id="example_input" type="text" value="Example input" />
</fieldset>

The <legend> tag must come directly after the opening part of the <fieldset> tag, or it is invalid.

Another little-used form element is the <optgroup> element. It’s old, dating back to HTML 4.0, but not many people are aware of its existence (or if they are, they don’t use it). It’s used to group <option> elements together logically in a selection list.

<select id="example_select">
<optgroup label="First group">
<option value="1" selected="selected">1</option>
<option value="2">2</option>
</optgroup>
<optgroup label="Second group">
<option value="3">3</option>
<option value="4">4</option>
</optgroup>
</select>

The option group's label value is required, and specifies the label to display as the title for the group.

The final element most authors don't use is the <button> element. I must confess that I didn't know much about it until I read the W3C's definition:

Buttons created with the BUTTON element function just like buttons created with the INPUT element, but they offer richer rendering possibilities: the BUTTON element may have content. For example, a BUTTON element that contains an image functions like and may resemble an INPUT element whose type is set to "image", but the BUTTON element type allows content.

Visual user agents may render BUTTON buttons with relief and an up/down motion when clicked, while they may render INPUT buttons as "flat" images.

The <button> element offers rich possibilities, but you must always remember the accessibility concerns when using it. Would a screen reader read content which is inside a button? Common sense should prevail.
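For example, a submit button containing both an image and text might be written as follows (the icon path here is made up):

<button type="submit"><img src="search-icon.png" alt="" /> Search</button>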

The final point to make about form authoring is that of ease-of-use. It is just about always in one’s interest to make a form as easy to use as possible, and just by using the accesskey and tabindex attributes a complex form can be easy to use.

The accesskey attribute can be applied to:

  • <a>
  • <area />
  • <button>
  • <input />
  • <label>
  • <legend>
  • <textarea>

The character (from the document's character set) in the attribute value will – when pressed – give focus to the element. This doesn't have to just be used for form controls; you could give the access key of "H" to a home link. However, form controls are the main use of accesskey. In my opinion, it's best to be careful with your use of access keys, applying them only to general elements such as the "Submit" and "Reset" buttons to cut down on maintenance when you come to add to the form.

The tabindex attribute can be applied to:

  • <a>
  • <area />
  • <button>
  • <input />
  • <object>
  • <select>
  • <textarea>

The number in the attribute specifies the tab order of the element, with lower numbers coming before higher ones (except "0", which puts the element last). Elements with identical tab indexes will be focussed in the order they appear in the markup. Usually there's no need to change the tab order of elements in a form, as it's usually calculated properly by the browser, but in some cases (especially if you're moving things around a lot with CSS) it's necessary to intervene and change the tab order manually.
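Here’s a sketch showing both attributes together, based on the search form from the GET and POST article (the key choice and the tab order are arbitrary):

<form action="http://tecnocode.co.uk/" method="get">
	<fieldset>
		<legend>Search terms</legend>
		<label for="query">Enter your search terms.</label>
		<input type="text" name="query" id="query" value="" tabindex="1" />
		<input type="hidden" name="page" value="search" />
		<input type="submit" value="Search" accesskey="s" tabindex="2" />
	</fieldset>
</form>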

That's it. If you knew how to make decent forms before, the information here should enable you to write splendid ones. :)