As with many things in GLib, the allocations in this code use refcounting, rather than new/free, to manage object lifecycles. This makes it quite hard to debug lifecycle issues, because the location of a bug (a ref or unref call which isn’t correct) can be quite far removed (in time and code) from where the effects of that bug become visible. This is because the effects of refcounting problems only become visible when an object’s refcount reaches zero, or when the program ends and its refcount still hasn’t reached zero.
While debugging this code, I tried an approach I haven’t before: changing some of the ref calls on the buggy object to be copy calls instead. (Specifically, changing g_object_ref() to g_dbus_message_copy().) That split up the lifecycle of the object into smaller pieces, narrowing down the sets of ref/unref calls which could be buggy. Ultimately, this allowed me to find some bugs in the code, and hopefully those are the bugs causing the refcounting issue. Since the issue is intermittent, it’s a bit hard to be sure.
This debugging approach was possible in this case because the object I was debugging is immutable, so passing around copies of it doesn’t affect the behaviour of other bits of code vs passing around the original. Hence this approach is only applicable in some situations. But it’s another good reason why using immutable objects is quite helpful when writing code, and it’s certainly an approach I’m going to be using again when I can.
The second part of my project in Endless Orange Week was to look at state persistence for apps and sessions. At its core, this means providing
some way for apps to save and restore their state (such as which part of the UI you’re looking at, what’s selected, and unsaved content);
code in the session manager to save and restore the state of all applications, and a list of which applications are running, when shutting down/restarting/logging out/logging in/starting up.
Those two bullet points hide a lot of complexity, and it’s not surprising that I didn’t get particularly far in this project! It requires coordinated changes in a lot of components: GLib, GTK, gnome-session and applications themselves.
A lot of these changes have been prototyped or worked on before, by various people, but nothing has yet come together. In fact, gnome-session used to fully support restoring apps to a certain degree — before it was ported away from XSMP, it used to support saving the set of apps when closing a session, and re-starting those apps when starting the session again. It did not support restoring the state of each app, though, just the fact that it was running.
Breaking down the problem
Updating and building on the proposals and branches from others so far, we end up with:
Changes to GLib to add a ‘restart data’ property to GApplication, which allows the application to expose its current state to the session manager (to be saved), and which is initialised with the restored state from the session manager on startup. These build heavily on changes proposed by Bastien Nocera, but my tweaks to them are still pretty exploratory and not ready for review.
Code in GTK to support serialising the widget tree. This implements saving the state of the UI, and was prototyped by Matthias Clasen. My additions (also not yet ready for review) tie it in to the gnome-session API. Further work (but not very much of it) would have to be done to tie Matthias’ proposals into the final shape of the GLib API.
The final two pieces of the puzzle above took most of my week, and included a lot of time spent learning the architecture of gnome-session, and working out a bit of a protocol for apps to register for session restore and, in particular, for gnome-session to be able to re-launch apps the right way when restoring a session. That’s something which deserves a blog post of its own at some point in the future, once I’m sure I’ve got everything straight in my head.
In summary, there’s not been much progress in terms of creating merge requests. But the week was, I think, well spent: I’ve learned a lot about the shape of the problem, and feel like I have a better idea of how to split it up into chunks which it might be possible to land over time to get the feature working. Many thanks to Endless for giving me the opportunity to do so.
Certainly, I think this project is too big to easily do in a single GNOME release. There’s too much coordination required between different projects with different cadences and development resources.
The next step will be to land the preparatory gnome-session cleanups, and to discuss and land the GLib API so that people can start building with that.
Recently at Endless we had a week of focused working on projects which are not our day-to-day work. It was called ‘Endless Orange Week’, and everyone was encouraged to explore a topic of their choosing.
I chose to look at two projects, both of which included a D-Bus/API component. My thinking was that review of the new interfaces on each project might take a while, so it would make sense to have two projects running in parallel so I could switch between them when blocked.
I’m going to blog about the two projects separately, to avoid one mega-long post.
The first project was to add a D-Bus debug interface for applications. This would allow debug output from an application to be turned on and off at runtime, rather than just being set with a command line argument or environment variable when the application is first started.
This would allow users and developers to get debug output from long-running applications without having to restart them, as quite often restarting a process will destroy the state you were hoping to debug.
What I came up with is GDebugController, which is awaiting review in GLib now. It’s an interface, currently implemented by GDebugControllerDBus. When instantiated, GDebugControllerDBus creates a new D-Bus object and interface for controlling the debug output from the application. It hooks into the standard g_debug() message functions, and can be hooked into a custom log writer function if your application uses one of those.
It essentially exists to expose one D-Bus property and allow that to be hooked in to your log writer. It has to be a bit more complex than that, though, as it needs to be able to handle authorisation: checking that the D-Bus peer who’s requesting to enable debug output on your application is actually allowed to do so. For services in particular, this is important, as allowing any peer to enable debug output could count as a privilege escalation. They might be able to slow your process down due to the volume of debug output it produces; fill the disk up; or look at the outputted logs and see private information. GDebugControllerDBus has an authorize signal to support this, and it works quite similarly to the GDBusInterfaceSkeleton::g-authorize-method signal.
Using it in an application
Firstly, you need to wait for it to be reviewed and land in GLib. The API might change during review.
Once it’s landed, assuming nothing changes, you just need to create an instance of GDebugControllerDBus. It will create the D-Bus object and hook it all up. When peers change the debug level in your application, the default handler will call g_log_set_debug_enabled() which will change the behaviour of GLib’s default log writer function.
If you have a custom log writer function, you will need to change it to check g_log_get_debug_enabled() and output debug messages if it’s true.
To use polkit for authorisation, set the GssService:debug-controller-action-id property to the ID of the polkit action you want to use for authorising enabling/disabling debug mode. libgsystemservice will handle the polkit checks using that. Here’s an example.
If that property is not set, a default policy will be used, where debug requests will be accepted unconditionally if your service is running on the session bus, and rejected unconditionally if it’s running on the system bus. The thinking is that there’s no security boundary on the session bus (all peers are equally trusted), whereas there are a lot of security boundaries on the system bus so libgsystemservice is best to fail closed and force you to write your own security policy.
That’s it! Reviews and feedback welcome. Many thanks to Endless for running this week and actively encouraging everyone to make use of it.
You could now, for example, define a GError domain for text parser errors which includes context information about a parsing failure, such as the current line and character position. Or attach the filename of a file which was being read, to the GError informing of a read failure. Define an extended error domain using G_DEFINE_EXTENDED_ERROR(). The extended information is stored in a ‘private’ struct provided by you, similarly to how it’s implemented for GObjects with G_DEFINE_TYPE_WITH_PRIVATE().
There are code examples on how to use the new APIs in the GLib documentation, so I won’t reproduce them here.
An important limitation to note is that existing GError domains which have ever been part of a stable public API cannot be extended retroactively unless you are breaking ABI. That’s because extending a GError domain increases the size of the allocated GError instances for that domain, and it’s possible that users of your API will have stack-allocated GErrors in the past.
Please don’t stack-allocate GErrors, as it makes future extensions of the API impossible, and doesn’t buy you notable extra performance, as GErrors should not be used on fast paths. By their very nature, they’re for failure reporting.
The new APIs are currently unstable, so please try them out and provide feedback now. They will be frozen with the release of GLib 2.67.3, scheduled for 11th February 2021.
g_file_set_contents() has worked fine for many years (and will continue to do so). However, it doesn’t provide much flexibility. When writing a file out on Linux there are various ways to do it, some slower but safer — and some faster, but less safe, in the sense that if your program or the system crashes part-way through writing the file, the file might be left in an indeterminate state. It might be garbled, missing, empty, or contain only the old contents.
g_file_set_contents() chose a fairly safe (but not the fastest) approach to writing out files: write the new contents to a temporary file, fsync() it, and then atomically rename() the temporary file over the top of the old file. This approach means that other processes only ever see the old file contents or the new file contents (but not the partially-written new file contents); and it means that if there’s a crash, either the old file will exist or the new file will exist. However, it doesn’t guarantee that the new file will be safely stored on disk by the time g_file_set_contents() returns. It also has fewer guarantees if the old file didn’t exist (i.e. if the file is being written out for the first time).
In most situations, this is the right compromise. But not in all of them — so that’s why g_file_set_contents_full() now exists, to let the caller choose their own compromise.
Situations where your code might want a looser set of guarantees from the defaults might be when writing out cache files (where it typically doesn’t matter if they’re lost or corrupted), or when writing out large numbers of files where you’re going to call fsync() once after the whole lot (rather than once per file).
Conversely, your code might want a tighter set of guarantees when writing out files which are well-formed-but-incorrect when empty or filled with zeroes (as filling a file with zeroes is one of the failure modes of the existing g_file_set_contents() defaults, if the file is being created), or when writing valuable user data.
Following on from the heap profiling I did on gnome-software to try and speed it up for Endless, the next step was to try profiling the computation done when starting up gnome-software — which bits of code are taking time to run?
tl;dr: There is new tooling in sysprof and GLib from git which makes profiling the performance of high-level tasks simpler. Some fixes have landed in gnome-software as a result.
Approaches which don’t work
The two traditional tools for this – callgrind, and print statements – aren’t entirely suitable for gnome-software.
I tried running valgrind --tool=callgrind gnome-software, and then viewing the results in KCachegrind, but it slowed gnome-software down so much that it was unusable, and the test/retry cycle of building and testing changes would have been soul destroyingly slow.
callgrind works by simulating the CPU’s cache and looking at cache reads/writes/hits/misses, and then attributing costs for those back up the call graph. This makes it really good at looking at the cost of a certain function, or the total cost of all the calls to a utility function; but it’s not good at attributing the costs of higher-level dynamic tasks. gnome-software uses a lot of tasks like this (GsPluginJob), where the task to be executed is decided at runtime with some function arguments, rather than at compile time by the function name/call. For example “get all the software categories” or “look up and refine the details of these three GsApp instances”.
That said, it was possible to find and fix a couple of bits of low-hanging optimisation fruit using callgrind.
Print statements are the traditional approach to profiling higher-level dynamic tasks: print one line at the start of a high-level task with the task details and a timestamp, and print another line at the end with another timestamp. The problem comes from the fact that gnome-software runs so many high-level tasks (there are a lot of apps to query, categorise, and display, using tens of plugins) that reading the output is quite hard. And it’s even harder to compare the timings and output between two runs to see if a code change is effective.
Having looked at sysprof briefly for the heap profiling work, and discounted it, I thought it might make sense to come back to it for this speed profiling work. Christian had mentioned at GUADEC in Thessaloniki that the design of sysprof means apps and libraries can send their own profiling events down a socket, and those events will end up in the sysprof capture.
So I played around with this newly-instrumented version of gnome-software for a bit, but found that there were still empty regions in the profiling trace, where time passed and computation was happening, but nothing useful was logged in the sysprof capture. More instrumentation was needed.
sysprof + GLib
gnome-software does a lot of its computation in threads, bringing the results back into the main thread to be rendered in the UI using idle callbacks.
For example, the task to list the apps in a particular category in gnome-software will run in a thread, and then schedule an idle callback in the main thread with the list of apps. The idle callback will then iterate over those apps and add them to (for example) a GtkFlowBox to be displayed.
Adding items to a GtkFlowBox takes some time, and if there are a couple of hundred of apps to be added in a single idle callback, that can take several hundred milliseconds — a long enough time to block the main UI from being redrawn that the user will notice.
How do you find out which idle callback is taking too long? sysprof again! I added sysprof support to GLib so that GSource.dispatch events are logged (along with a few others), and now the long-running idle callbacks are displayed in the sysprof graphs. Thanks to Christian and Richard for their reviews and contributions to those changes.
This capture file was generated using sysprof-cli --gtk --use-trace-fd -- gnome-software, and the ‘gnome-software’ and ‘GLib’ lines in the ‘Timings’ row need to be made visible using the drop-down menu in the ‘Timings’ row.
Marc-André Lureau has landed GUri support in GLib, and it’ll be available in GLib 2.65.1 (due out in the next few days).
GUri is a new API for parsing and building URIs, roughly equivalent to SoupURI already provided by libsoup — but since URIs are so pervasive, and used even if you’re not actually doing HTTP conversations, it makes sense to have a structured representation for them in GLib.
It’s the start of a new GLib release cycle, and so it’s time to share what people have been contributing so far. GLib 2.65.1 will be out soon, and it will contain a new test macro, g_assert_no_errno(). This checks that a POSIX-style function (like, say, rmdir()) succeeds when run. If the function fails (and indicates that by returning a negative integer) then g_assert_no_errno() will print out the error message corresponding to the current value of errno.
This should slightly simplify tests which have to use POSIX-style functions which don’t support GError.
Thanks to Simon McVittie for his help in putting it together and getting it tested and merged.
For a few weeks now, we’ve had coding style and thinko checks running in CI for GLib, long enough to see that it’s working OK (check out this pipeline!) and is perhaps time to share with others.
There are two jobs in the checks, both of which run in a new style-check stage of our CI pipeline, which runs before anything else. One job checks the coding style of a merge request, using clang-format. The other job checks for any lines which introduce TODO comments (or similar). These jobs are intended to be fast, but also to not fail the pipeline if they produce warnings. Coding style is subjective, and nobody has yet come up with a mechanical style description which doesn’t have ugly corner cases. So the intention of these jobs is to help remind people of the coding style, to avoid reviewers having to check coding style manually, and hence to only end up thinking about it or discussing it when the CI says the style is wrong.
The first job, style-check-diff, uses a script to work out the diff from the merge request’s branch point, and then passes that to clang-format-diff.py, a script from LLVM which uses clang-format to reformat only the changed lines in a diff. If any reformatting occurs, that means the merge request differs from the GLib coding style (as described by .clang-format) and should be warned about.
There are some limitations to clang-format: it can’t completely describe how the GLib coding style aligns function arguments. That’s unfortunate, because GNOME Builder aligns function arguments by default (and it’s nice behaviour). Hopefully somebody will implement support in clang-format for this sometime, so that it can accurately describe the coding style which is used by a large number of GNOME projects.
The second job, check-todos, also uses a script to work out the diff from the merge request’s branch point. It passes that to a little Python program which checks for banned words in the commit message and added lines in the diff. The aim is to prevent any comments containing TODO or WIP from accidentally getting merged, as these strings are often used by developers to indicate something they need to come back to before they’re finished — and it’s easy to miss them!
I’ve been using the convention that comments containing FIXME are OK to be merged, as they indicate something that will need updating in future, but can be merged as-is for now (such as a workaround).
Feedback, improvements, and copying welcome. These scripts should run on any CI image which has git, clang and Python.
GLib 2.63.1 has been released. The final new API to mention in this mini-series of blog posts about what’s in 2.63.1 is g_get_os_info().
g_get_os_info() is a way to get identifying information about the OS. On Linux, this is gathered from /etc/os-release. On other OSs, it’s gathered using platform-specific APIs (on other Unixes, this means falling back to uname()).
To use it, pass it a key, like G_OS_INFO_KEY_PRETTY_NAME, and it’ll return a newly-allocated string with the corresponding value for the current OS, or NULL if there was none. Different OSs support different sets of keys, and the amount of support and set of keys may change over time.
g_autofree gchar *os_name = g_get_os_info (G_OS_INFO_KEY_PRETTY_NAME);
g_print ("OS: %s\n", os_name);
/* Prints “OS: Fedora 30 (Workstation Edition)” for me */