Tag Archives: systemtap

Instrumenting the GLib main loop with Dunfell

tl;dr: Visualise your main context and sources using Dunfell. Feedback and ideas welcome.

At the DX hackfest, I’ve been working on a new tool for instrumenting and visualising the behaviour of the GLib main context (or main contexts) in your program.

Screenshot from 2016-01-29 11-17-35

It’s called Dunfell (because I’m a sucker for hills) and at a high level it works by using SystemTap to record various GMainContext interactions in your program, saving them to a log file. The log file can then be examined using a viewer program.

The source is available on GitLab or GitHub because I still haven’t decided which is better.

In the screenshot above, each vertical line is a thread, each blue box is one dispatch phase of the main context which is currently running on that thread, each orange blob is a new GSource being created, and the green blob is a GSource which has been selected for closer inspection.

At the moment, it requires a couple of GLib patches to add some more SystemTap probe points, and it also requires a recent version of GTK+. It needs SystemTap, and I’ve only tested it on Fedora, so it might need some patching to work with the SystemTap installed on other distributions.

Screenshot from 2016-01-29 11-57-39

This screenshot is of a trace of the buffered-input-stream test from GIO, showing I/O callbacks being made across threads as idle source callbacks.

More visualisation ideas are welcome! At the moment, what Dunfell draws is quite simplistic. I hope it will be able to solve various common debugging problems eventually but suggestions for ways to do this intuitively, or for other problems to visualise, are welcome. Here are the use cases I was initially thinking about (from the README):

  • Detect GSources which are never added to a GMainContext.
  • Detect GSources which are dispatched too often (i.e. every main context iteration).
  • Detect GSources whose dispatch function takes too long (and hence blocks the main context).
  • Detect GSources which are never removed from their GMainContext after being dispatched (but which are never dispatched again).
  • Detect GMainContexts which have GSources attached or (especially) events pending, but which aren’t being iterated.
  • Monitor the load on each GMainContext, such as how many GSources it has attached, and how many events are processed each iteration.
  • Monitor ongoing asynchronous calls and GTasks, giving insight into their nesting and dependencies.
  • Monitor unfinished or stalled asynchronous calls.
  • Allow users to record logs to send to the developers for debugging on a different machine. The users may have to install additional software to record these logs (some component of Dunfell, plus its dependencies), but should not have to recompile or otherwise modify the program being debugged.
  • Work with programs which purely use GLib, through to programs which use GLib, GIO and GTK+.
  • Allow visualisation of this data, both in a standalone program, and in an IDE such as GNOME Builder.
  • Allow visualising differences between two traces.
  • Minimise runtime overhead of logging a program, to reduce the risk of disturbing race conditions by enabling logging.
  • Connecting to an already-running program is not a requirement, since by the time you’ve decided there’s a problem with a program, it’s already in the wrong state.

Reference count debugging with systemtap

I got some really helpful comments on yesterday's post about reference count debugging with gdb which enabled me to get systemtap working.

Getting systemtap working (on Fedora 13)

Install the systemtap-* and kernel-devel packages as per the instructions on the systemtap wiki. Note that the kernel packages need to be for the same version as the kernel you're currently running. I got caught out by this since I hadn't rebooted since I last downloaded an updated kernel package. You then need to add yourself to the stapdev and stapusr groups. Run the command stap -v -e 'probe vfs.read {printf("read performed\n"); exit()}' to test whether everything's installed and working properly. systemtap might ask you to run a make command at this point, which you need to do.

Writing systemtap probes

The probe I'm using to sort out referencing issues is the following, based off the examples Alex Larsson gave when static probes were initially added to GLib and GObject. I've saved it as refs.stp:

global alive
global my_object = "FooObject"

probe gobject.object_new {
	if (type == my_object)
		alive++
}

probe gobject.object_ref {
	if (type == my_object) {
		printf ("%s %p ref (%u)\n", type, object, refcount)
		print_ubacktrace_brief ()
		printf ("\n")
	}
}

probe gobject.object_unref {
	if (type == my_object) {
		printf ("%s %p unref (%u)\n", type, object, old_refcount)
		print_ubacktrace_brief ()
		printf ("\n")

		if (old_refcount == 1)
			alive--
	}
}

probe end {
	printf ("Alive objects: \n")
	if (alive > 0)
		printf ("%d\t%s\n", alive, my_object)
}

This counts how many instances of the FooObject class are created (using a probe on g_object_new()) and destroyed (probing on g_object_unref() and decrementing the alive counter when the last reference is dropped). References and dereferences are also counted, with a short backtrace being outputted for each, which is the key thing I was looking for when debugging reference counting problems.

Using the probes

I was debugging problems in Empathy, so I had to use the following command:

stap refs.stp \
-d ${libdir}/libfolks.so \
-d ${libdir}/libfolks-telepathy.so \
-d ${libdir}/libglib-2.0.so \
-d ${libdir}/libgobject-2.0.so \
-d ${libdir}/libgee.so \
-d ${libdir}/libgtk-x11-2.0.so \
-d ${bindir}/empathy \
-c "${bindir}/empathy"

Each -d option tells systemtap to load unwind data from the given library or executable, which is the key thing I was missing yesterday; these options are necessary for the backtraces to be useful, since systemtap stops unwinding a backtrace at the first frame it can't map to a symbol name. Note that it's necessary to explicitly tell systemtap to load data from the empathy executable, even though it then runs Empathy to trace it.

This gives output like the following when tracing the EmpathyMainWindow object:

EmpathyIndividualStore 0x09c05a10 ref (2)
g_object_ref+0x138
g_value_object_collect_value+0xe0
g_value_set_instance+0x190
.L1016+0x1e0
g_signal_emit_by_name+0x165
gtk_tree_sortable_sort_column_changed+0x78
gtk_tree_store_set_sort_column_id+0xde
gtk_tree_sortable_set_sort_column_id+0xe6
empathy_individual_store_set_sort_criterium+0x108
individual_store_setup+0x162
empathy_individual_store_init+0xb0
g_type_create_instance+0x1c3
g_object_constructor+0x1d
g_object_newv+0x438
.L345+0xfd
g_object_new+0x8d
empathy_individual_store_new+0xb6
empathy_main_window_init+0x890
g_type_create_instance+0x1c3
g_object_constructor+0x1d
empathy_main_window_constructor+0x4c

EmpathyIndividualStore 0x09c05a10 unref (2)
g_object_unref+0x13f
g_value_object_free_value+0x2a
g_value_unset+0x6d
.L1041+0x100
g_signal_emit_by_name+0x165
gtk_tree_sortable_sort_column_changed+0x78
gtk_tree_store_set_sort_column_id+0xde
gtk_tree_sortable_set_sort_column_id+0xe6
empathy_individual_store_set_sort_criterium+0x108
individual_store_setup+0x162
empathy_individual_store_init+0xb0
g_type_create_instance+0x1c3
g_object_constructor+0x1d
g_object_newv+0x438
.L345+0xfd
g_object_new+0x8d
empathy_individual_store_new+0xb6
empathy_main_window_init+0x890
g_type_create_instance+0x1c3
g_object_constructor+0x1d
empathy_main_window_constructor+0x4c

The only thing I need to do now is to figure out how to script systemtap so that it indents each backtrace nicely according to the reference count of the object.

Reference count debugging with gdb

As I was hacking today, I ran into some hard-to-debug reference counting problems with one of my classes. The normal smattering of printf()s didn't help, and neither did this newfangled systemtap, which was a bit disappointing.

It worked, in that my probes were correctly run and correctly highlighted each reference/dereference of the class I was interested in, but printing a backtrace only extended to the g_object_ref()/g_object_unref() call, and no further. I'm guessing this was a problem with the location of the debug symbols for my code (since it was in a development prefix, whereas systemtap was not), but it might be that systemtap hasn't quite finished userspace stuff yet. That's what I read, at least.

In the end, I ended up using conditional breakpoints in gdb. This was a lot slower than systemtap, but it worked. It's the sort of thing I would've killed to know a few years (or even a few months) ago, so hopefully it's useful for someone (even if it's not the most elegant solution out there).

set pagination off
set $foo=0
break main
run

break g_object_ref
condition 2 _object==$foo
commands
	silent
	bt 8
	cont
	end

break g_object_unref
condition 3 _object==$foo
commands
	silent
	bt 8
	cont
	end

break my_object_init
commands
	silent
	set $foo=my_object
	cont
	end
enable once 4
cont

The breakpoint in main() is to stop gdb discarding our breakpoints out of hand because the relevant libraries haven't been loaded yet. $foo contains the address of the first instance of MyObject in the program; if you need to trace the n+1th instance, use ignore 4 n to only fire the my_object_init breakpoint on the n+1th MyObject instantiation.

This can be extended to track (a fixed number of) multiple instances of the object, by using several $fooi variables and gdb's if statements to set them as appropriate. This is left as an exercise to the reader!

I welcome the inevitable feedback and criticism of this approach. It's hacky, ugly and slower than systemtap, but at least it works.