Tag Archives: memory management

What’s the difference between int and size_t?

Passing variable-sized buffers around is a common task in C, especially in networking code. Ignoring GLib convenience API like GBytes, GInputVector and GOutputVector, a buffer is simply an offset and length describing a block of memory. This is just a void* and int, right? What’s the problem?

tl;dr: I’m suggesting to use (uint8_t*, size_t) to describe buffers in C code, or (guint8*, gsize) if you’re using GLib. This article is aimed at those who are still getting to grips with C idioms. See also: (const gchar*) vs. (gchar*) and other memory management stories.

There are two problems here: this effectively describes an array of elements, but void* doesn’t describe the width of each element; and (assuming each element is a byte) an int isn’t wide enough to index every element in memory on modern systems.

But void* could be assumed to refer to an array of bytes, and an int is probably big enough for any reasonable situation, right? Probably, but not quite: a 32-bit signed integer can address 2 GiB (assuming negative indices are ignored) and an unsigned integer can address 4 GiB ((int doesn’t have a guaranteed width (C11 standard, §6.2.5¶5), which is another reason to use size_t. However, on any relevant modern platform it is at least 32 bits.)). Fine for network processing, but not when handling large files.

How is size_t better? It’s defined as being big enough to refer to every addressable byte of memory in the current computer system (caveat: this means it’s architecture-specific and not suitable for direct use in network protocols or file formats). Better yet, it’s unsigned, so negative indices aren’t wasted.

What about the offset of the buffer — the void*? Better to use a uint8_t*, I think. This has two advantages: it explicitly defines the width of each element to be one byte; and it’s distinct from char* (it’s unsigned rather than signed ((Technically, it’s architecture-dependent whether char is signed or unsigned (C11 standard, §6.2.5¶15), and whether it’s actually eight bits wide (though in practice it is really always eight bits wide), which is another reason to use uint8_t.))), which makes it more obvious that the data being handled is not necessarily human-readable or nul-terminated.

Why is it important to define the width of each element? This is a requirement of C: it’s impossible to do pointer arithmetic without knowing the size of an element, and hence the C standard forbids arithmetic on void* pointers, since void doesn’t have a width (C11 standard, §6.2.5¶¶19,1). So if you use void* as your offset type, your code will end up casting to uint8_t* as soon as an offset into the buffer is needed anyway.

A note about GLib: guint8 and uint8_t are equivalent, as are gsize and size_t — so if you’re using GLib, you may want to use those type aliases instead.

For a more detailed explanation with some further arguments, see this nice article about size_t and ptrdiff_t by Karpov Andrey.

Using libgee from C to access libfolks

If you’re interfacing with Vala code from C, it’s possible that you’ll have to make calls to libgee, a Vala data structure library. libgee is written in Vala, but its C API is perfectly usable once you get your head around the idioms.

A common idiom is that of the GeeIterator, which is the recommended way to iterate through sets, maps and lists (which all implement the GeeIterable interface). For a loop, you create a new GeeIterator, then call its get and next methods to retrieve the object for the current iteration and advance to the next iteration, respectively.

For example, to list all the phone numbers in a FolksIndividual:

GeeSet *phone_numbers;
GeeIterator *iter;
FolksPhoneDetails *details;

details = FOLKS_PHONE_DETAILS (my_individual);
phone_numbers = folks_phone_details_get_phone_numbers (details);
iter = gee_iterable_iterator (GEE_ITERABLE (phone_numbers));

while (gee_iterator_next (iter)) {
	FolksPhoneFieldDetails *details;
	gchar *number;

	/* Get the current iteration’s element.
	 * Note that ownership is transferred. */
	details = gee_iterator_get (iter);
	number = folks_phone_field_details_get_normalised (details);

	g_message ("Phone number: %s", number);

	g_free (number);
	g_object_unref (details);  /* important! */
}

g_object_unref (iter);  /* important! */

One key thing to note here is that the return values from gee_iterator_get() and gee_iterable_iterator() are transfer-full and need to be unreffed. Failing to unref them is a common mistake when using libgee from C, and results in leaked objects — and was the cause of several memory leaks recently fixed in Empathy. When using libgee from Vala this isn’t a problem, since the memory management is handled automatically.

This has been a public service announcement.

(const gchar*) vs. (gchar*) and other memory management stories

This article now been upstreamed to the GNOME Programming Guidelines, where future updates will be made to it.

Memory management in C is often seen as a minefield, and while it’s not as simple as in more modern languages, if a few conventions are followed it doesn’t have to be hard at all. Here are some pointers about heap memory management in C using GObject and the usual GNOME types and coding conventions, including GObject introspection. A basic knowledge of C is assumed, but no more. Stack memory allocation and other obscure patterns aren’t covered, since they’re much less commonly used in a typical GObject application.

The most important rule when thinking about memory management is: which code has ownership of this memory?

Ownership can be defined as the responsibility to free or deallocate a piece of memory. If code which does own some memory doesn’t deallocate that memory, that’s a memory leak. If code which doesn’t own some memory deallocates it (in addition to the owner doing so), that’s a double-free. Both are bad.

For example, if a function returns a newly allocated string and doesn’t retain a pointer to it, ownership of the string is transferred to the caller. Note that when thinking about ownership transfer, malloc()/free() and reference counting are treated the same: in the former case, a newly allocated piece of heap memory is being transferred; in the latter, a newly incremented reference.

gchar *transfer_full_function (void) {
	return g_strdup ("Newly allocated string.");
}

/* Ownership is transferred here. */
gchar *my_string = transfer_full_function ();

Conversely, if the function does retain a pointer (e.g. inside an object), ownership is not transferred to the caller.

const gchar *transfer_none_function (void) {
	return "Static string.";
}

const gchar *transfer_none_method (MyObject *self) {
	gchar *new_string = g_strdup ("Newly allocated string.");
	/* Ownership is retained by the object here. */
	self->priv->saved_string = new_string;
	return new_string;
}

/* In both of these cases, ownership is not transferred. */
const gchar *my_string1 = transfer_none_function ();
const gchar *my_string2 = transfer_none_method (some_object);

In all of these examples, you may notice the return type of the functions reflects whether they transfer ownership of their return values. const return types indicate no transfer, and non-const return types indicate some transfer.

While this is useful, it’s by no means completely clear. For example, what if a function returns a non-const GList*? Is ownership of the list elements transferred, or just the list? What if the function’s author forgot to make the return type const, and actually there’s no transfer? This is where documentation comments are useful.

There’s a convention in GNOME documentation comments to specify the function which should be used to free a returned value. If such a function is mentioned, ownership of the returned memory is transferred. If no function is mentioned, ownership probably isn’t transferred, but it’s hard to be sure. That’s why it’s good to always be explicit when writing documentation comments.

/**
 * my_object_get_some_string:
 * @self: a #MyObject
 *
 * Gets the value of the #MyObject:some-string property.
 *
 * Return value: (transfer none): some string, owned by the
 * object and must not be freed
 */
const gchar *my_object_get_some_string (MyObject *self) {
	return self->priv->some_string;
}

/**
 * my_object_build_result:
 * @self: a #MyObject
 *
 * Builds a result string which probably represents something
 * meaningful.
 *
 * Return value: (transfer full): a newly allocated result
 * string; free with g_free()
 */
gchar *my_object_build_result (MyObject *self) {
	return g_strdup_printf ("%s %s",
	                        self->priv->some_string,
	                        self->priv->some_other_string);
}

/**
 * my_object_dup_controller:
 * @self: a #MyObject
 *
 * Gets the value of the #MyObject:controller property,
 * incrementing the controller's reference count.
 *
 * Return value: (transfer full): the object's
 * controller; unref with g_object_unref()
 */
MyController *my_object_dup_controller (MyObject *self) {
	return g_object_ref (self->priv->controller);
}

When GObject introspection was introduced, these kinds of documentation comments were formalised as introspection annotations: (transfer full), (transfer container) or (transfer none), as documented on wiki.gnome.org. These allow the runtimes of language bindings to manage memory correctly. Since a C programmer is essentially doing the same job as a language runtime when writing C code, the information provided by transfer annotations is sufficient for perfect memory management. Unfortunately, not all parameters and return types of all functions have these annotations added. The examples above do, however.

Finally, a few libraries use a function naming convention. Functions named *_get_* do not transfer ownership, whereas functions named *_dup_* (for ‘duplicate’) do transfer ownership. This can be seen in the examples above, or with json_node_get_array() vs. json_node_dup_array(). Be aware, though, that only a few libraries use this convention. Other libraries use *_get_* for both functions which do and do not transfer ownership of their results. Other code, such as that generated by gdbus-codegen, uses a different and incompatible convention: *_get_* methods signify full transfer, and *_peek_* methods signify no transfer. For example, goa_object_get_manager() vs. goa_object_peek_manager(). For this reason, going by function naming conventions only works within libraries, not between different libraries.

Memory management of parameters is analogous to return values: look at whether the parameter is const and whether there are any introspection annotations for it.

In summary, here are a set of guidelines one can follow to determine whether ownership of a return value is transferred, and hence whether the caller needs to free it:

  1. If the type has an introspection (transfer) annotation, look at that.
  2. Otherwise, if the type is const, there is no transfer.
  3. Otherwise, if the function documentation explicitly specifies the return value must be freed, there is full or container transfer.
  4. Otherwise, if the function is named *_dup_*, there is full or container transfer.
  5. Otherwise, if the function is named *_peek_*, there is no transfer.
  6. Otherwise, you need to look at the function’s code to determine whether it intends ownership to be transferred. Then file a bug against the documentation for that function, and ask for an introspection annotation to be added.

Some common pitfalls:

  • If you’re using an explicit typecast (e.g. casting a (const gchar*) return value to (gchar*)), it’s likely something’s wrong.
  • Generally, return values which are not transferred (such as (const gchar*)) are freed when the owning object is destroyed — so if you need such a value to persist, you must copy it or increase its reference count. You then have ownership of the copy or new reference, and are responsible for freeing it (not the original).

How can one check for incorrect memory handling? Use Valgrind. It will detect leaks and double-frees, and is simple to use:

valgrind --tool=memcheck --leak-check=full my-program-name

Or, if running your program from the source directory, use the following to avoid running leak checking on the libtool helper scripts:

libtool --mode=execute valgrind --tool=memcheck --leak-check=full ./my-program-name

Valgrind lists each memory problem it detects, along with a short backtrace (if you’ve compiled your program with debug symbols), allowing the cause of the memory error to be pinpointed and fixed!