
GNOME Software performance in GNOME 40

tl;dr: Use callgrind to profile CPU-heavy workloads. In some cases, moving heap allocations to the stack helps a lot. GNOME Software startup time has decreased from 25 seconds to 12 seconds (-52%) over the GNOME 40 cycle.

To wrap up the sporadic blog series on the progress made with GNOME Software for GNOME 40, I’d like to look at some further startup time profiling which has happened this cycle.

This profiling has focused on libxmlb, which gnome-software uses extensively to query the appstream data that provides all the information about the apps it shows in the UI. The basic idea behind libxmlb is that it pre-compiles a ‘silo’ of information about an XML file, which is cached until the XML file next changes. The silo format encodes the tree structure of the XML, deduplicating strings and allowing fast traversal without string comparisons or parsing. It is memory mappable, so can be loaded quickly and shared (read-only) between multiple processes. It allows XPath queries to be run against the XML tree, and returns the results.
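As a rough illustration, here is a minimal sketch of compiling and querying a silo with libxmlb. The file names and XPath expression are made up for the example, and error handling is abbreviated:

#include <xmlb.h>

static void
query_example (GError **error)
{
  g_autoptr(XbBuilder) builder = xb_builder_new ();
  g_autoptr(XbBuilderSource) source = xb_builder_source_new ();
  g_autoptr(GFile) xml_file = g_file_new_for_path ("components.xml");
  g_autoptr(GFile) blob_file = g_file_new_for_path ("components.xmlb");
  g_autoptr(XbSilo) silo = NULL;
  g_autoptr(GPtrArray) results = NULL;

  /* Compile the XML to a silo, reusing the cached binary form (which can be
   * mmap()ed) if the XML has not changed since it was last compiled. */
  if (!xb_builder_source_load_file (source, xml_file,
                                    XB_BUILDER_SOURCE_FLAG_NONE, NULL, error))
    return;
  xb_builder_import_source (builder, source);
  silo = xb_builder_ensure (builder, blob_file,
                            XB_BUILDER_COMPILE_FLAG_NONE, NULL, error);
  if (silo == NULL)
    return;

  /* Run an XPath query against the compiled tree. */
  results = xb_silo_query (silo, "components/component/id", 0, error);
  if (results == NULL)
    return;
  for (guint i = 0; i < results->len; i++)
    g_message ("%s", xb_node_get_text (g_ptr_array_index (results, i)));
}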

gnome-software executes a lot of XML queries on startup, as it loads all the information needed to show many apps to the user. It may be possible to eliminate some of these queries – and some earlier work did reduce the number by binding query parameters at runtime to pre-prepared queries – but it seems unlikely that we’ll be able to significantly reduce their number further, so better speed them up instead.

Profiling work which happens on a CPU

The work done in executing an XPath query in libxmlb is largely on the CPU — there isn’t much I/O to do as the compiled XML file is only around 7MB in size (see ~/.cache/gnome-software/appstream), so this time the most appropriate tool to profile it is callgrind. I ruled out using callgrind previously for profiling the startup time of gnome-software because it produces too much data, risks hiding the bigger picture of which parts of application startup were taking the most time, and doesn’t show time spent on I/O. However, when looking at a specific part of startup (XML queries) which is largely CPU-bound, callgrind is ideal.

valgrind --tool=callgrind --collect-systime=msec --trace-children=no gnome-software

It takes about 10 minutes for gnome-software to start up and finish loading the main window when running under callgrind, but eventually it’s shown, the process can be interrupted, and the callgrind log loaded in kcachegrind:

Here I’ve selected the main() function and the callee map for it, which shows a 2D map of all the functions called beneath main(), with the area of each function proportional to the cumulative time spent in that function.

The big yellow boxes are all memset(), which is being called on heap-allocated memory to set it to zero before use. That’s low-hanging fruit to optimise.

In particular, it turns out that the XbStack and XbOperand structures which libxmlb creates for evaluating each XPath query were being allocated on the heap. With a few changes, they can be allocated on the stack instead, and don’t need to be zero-filled when set up, which saves a lot of time — stack allocation is a simple increment of the stack pointer, whereas heap allocation can involve page mapping, locking, and updates to various metadata structures.
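The general shape of the change, reduced to a toy example (this isn't the actual libxmlb code), is to replace a zero-filled heap allocation with a fixed-size array in the function's stack frame:

#include <glib.h>

#define N_OPERANDS 16

typedef struct { int value; } Operand;

/* Before: a heap allocation plus a zero-fill on every query evaluation. */
static int
sum_heap (void)
{
  g_autofree Operand *ops = g_new0 (Operand, N_OPERANDS);
  int total = 0;
  for (guint i = 0; i < N_OPERANDS; i++)
    {
      ops[i].value = i;
      total += ops[i].value;
    }
  return total;
}

/* After: the array lives in the stack frame, so 'allocating' it is
 * effectively free, and no zero-fill is needed because every element is
 * written before it is read. */
static int
sum_stack (void)
{
  Operand ops[N_OPERANDS];
  int total = 0;
  for (guint i = 0; i < N_OPERANDS; i++)
    {
      ops[i].value = i;
      total += ops[i].value;
    }
  return total;
}

Of course, the stack-based approach only works when there is a reasonable upper bound on the size of the data, so real code typically needs a heap fallback for larger inputs.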

The changes are here, and should benefit every user of libxmlb without further action needed on their part. With those changes in place, the callgrind callee map is a lot less dominated by one function:

There’s still plenty left to go at, though. Contributions are welcome, and we can help you through the process if you’re new to it.

What’s this mean for gnome-software in GNOME 40?

Overall, after all the performance work in the GNOME 40 cycle, startup time has decreased from 25 seconds to 12 seconds (-52%) when starting for the first time since the silo changed. This is the situation in which gnome-software normally starts, as it sits as a background process after that, and the silo is likely to change every day or two.

There are plans to stop gnome-software running as a background process, but we are not there yet. It needs to start up in 1–2 seconds for that to give a good user experience, so there’s a bit more optimisation to do yet!

Aside from performance work, there are a number of other improvements to gnome-software in GNOME 40, including a new icon, some improvements to parts of the interface, and a lot of bug fixes. But perhaps they should be explored in a separate blog post.

Many thanks to my fellow gnome-software developers – Milan, Phaedrus and Richard – for their efforts this cycle, and my employer the Endless OS Foundation for prioritising working on this.

Controlling safety vs speed when writing files

GLib 2.65.1 has been released with a new g_file_set_contents_full() API which you should consider using instead of g_file_set_contents() for writing out a file — it’s a drop-in replacement. It provides two additional arguments, one to control the trade-off between safety and performance when writing the file, and one to set the file’s mode (permissions).

What’s wrong with g_file_set_contents()?

g_file_set_contents() has worked fine for many years (and will continue to do so). However, it doesn’t provide much flexibility. When writing a file out on Linux there are various ways to do it, some slower but safer — and some faster, but less safe, in the sense that if your program or the system crashes part-way through writing the file, the file might be left in an indeterminate state. It might be garbled, missing, empty, or contain only the old contents.

g_file_set_contents() chose a fairly safe (but not the fastest) approach to writing out files: write the new contents to a temporary file, fsync() it, and then atomically rename() the temporary file over the top of the old file. This approach means that other processes only ever see the old file contents or the new file contents (but not the partially-written new file contents); and it means that if there’s a crash, either the old file will exist or the new file will exist. However, it doesn’t guarantee that the new file will be safely stored on disk by the time g_file_set_contents() returns. It also has fewer guarantees if the old file didn’t exist (i.e. if the file is being written out for the first time).
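In simplified C, the approach looks roughly like this (a sketch only: the real g_file_set_contents() also deals with partial writes, permissions, symlinks and error reporting):

#include <glib.h>
#include <stdio.h>
#include <unistd.h>

static gboolean
write_file_safely (const char *filename, const char *contents, gsize length)
{
  g_autofree char *tmp_name = g_strdup_printf ("%s.XXXXXX", filename);
  int fd = g_mkstemp (tmp_name);
  gboolean ok;

  if (fd < 0)
    return FALSE;

  /* Write the new contents to a temporary file and flush them to disk. */
  ok = (write (fd, contents, length) == (gssize) length) && (fsync (fd) == 0);
  ok = (close (fd) == 0) && ok;

  /* Atomically replace the old file with the fully-written new one. */
  if (ok)
    ok = (rename (tmp_name, filename) == 0);

  if (!ok)
    unlink (tmp_name);

  return ok;
}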

In most situations, this is the right compromise. But not in all of them — so that’s why g_file_set_contents_full() now exists, to let the caller choose their own compromise.

Choose your own tradeoff

The level of safety/speed of g_file_set_contents_full() can be chosen using GFileSetContentsFlags.

Situations where your code might want a looser set of guarantees from the defaults might be when writing out cache files (where it typically doesn’t matter if they’re lost or corrupted), or when writing out large numbers of files where you’re going to call fsync() once after the whole lot (rather than once per file).

In these situations, you might choose G_FILE_SET_CONTENTS_NONE.
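For example, writing a cache file where speed matters more than crash-safety might look something like this (the path and data are placeholders):

g_autoptr(GError) local_error = NULL;

/* Cache file: losing or corrupting it on a crash is acceptable, so skip the
 * extra safety work and take the fastest path. */
if (!g_file_set_contents_full ("/path/to/cache-file", data, length,
                               G_FILE_SET_CONTENTS_NONE,
                               0600, &local_error))
  g_warning ("Failed to write cache file: %s", local_error->message);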

Conversely, your code might want a tighter set of guarantees when writing out files which are well-formed-but-incorrect when empty or filled with zeroes (as filling a file with zeroes is one of the failure modes of the existing g_file_set_contents() defaults, if the file is being created), or when writing valuable user data.

In these situations, you might choose G_FILE_SET_CONTENTS_CONSISTENT | G_FILE_SET_CONTENTS_DURABLE.
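And writing valuable user data with the stronger guarantees might look like this:

g_autoptr(GError) local_error = NULL;

/* User data: make sure the new contents are safely on disk before returning,
 * and never leave a garbled or zero-filled file behind. */
if (!g_file_set_contents_full ("/path/to/user-data", data, length,
                               G_FILE_SET_CONTENTS_CONSISTENT |
                               G_FILE_SET_CONTENTS_DURABLE,
                               0600, &local_error))
  g_warning ("Failed to save data: %s", local_error->message);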

The default flags used by g_file_set_contents() are G_FILE_SET_CONTENTS_CONSISTENT | G_FILE_SET_CONTENTS_ONLY_EXISTING, which makes its definition:

gboolean
g_file_set_contents (const gchar  *filename,
                     const gchar  *contents,
                     gssize        length,
                     GError      **error)
{
  return g_file_set_contents_full (filename, contents, length,
                                   G_FILE_SET_CONTENTS_CONSISTENT |
                                   G_FILE_SET_CONTENTS_ONLY_EXISTING,
                                   0666, error);
}

Check your code

So, maybe now is the time to quickly grep your code for g_file_set_contents() calls, and see whether the default tradeoff is the right one in all the places you call it?
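Something like this will find them:

git grep -n g_file_set_contents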

Startup time profiling of gnome-software

Following on from the heap profiling I did on gnome-software to try and speed it up for Endless, the next step was to try profiling the computation done when starting up gnome-software — which bits of code are taking time to run?

tl;dr: There is new tooling in sysprof and GLib from git which makes profiling the performance of high-level tasks simpler. Some fixes have landed in gnome-software as a result.

Approaches which don’t work

The two traditional tools for this – callgrind and print statements – aren’t entirely suitable for gnome-software.

I tried running valgrind --tool=callgrind gnome-software, and then viewing the results in KCachegrind, but it slowed gnome-software down so much that it was unusable, and the test/retry cycle of building and testing changes would have been soul-destroyingly slow.

callgrind works by simulating the CPU’s cache and looking at cache reads/writes/hits/misses, and then attributing costs for those back up the call graph. This makes it really good at looking at the cost of a certain function, or the total cost of all the calls to a utility function; but it’s not good at attributing the costs of higher-level dynamic tasks. gnome-software uses a lot of tasks like this (GsPluginJob), where the task to be executed is decided at runtime with some function arguments, rather than at compile time by the function name/call. For example “get all the software categories” or “look up and refine the details of these three GsApp instances”.

That said, it was possible to find and fix a couple of bits of low-hanging optimisation fruit using callgrind.

Print statements are the traditional approach to profiling higher-level dynamic tasks: print one line at the start of a high-level task with the task details and a timestamp, and print another line at the end with another timestamp. The problem comes from the fact that gnome-software runs so many high-level tasks (there are a lot of apps to query, categorise, and display, using tens of plugins) that reading the output is quite hard. And it’s even harder to compare the timings and output between two runs to see if a code change is effective.

Enter sysprof

Having looked at sysprof briefly for the heap profiling work, and discounted it, I thought it might make sense to come back to it for this speed profiling work. Christian had mentioned at GUADEC in Thessaloniki that the design of sysprof means apps and libraries can send their own profiling events down a socket, and those events will end up in the sysprof capture.

It turns out that’s remarkably easy: link against libsysprof-capture-4.a and call sysprof_capture_writer_add_mark() every time a high-level task ends, passing the task duration and details to it. There’s even an example app in the sysprof repository.
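A rough sketch of what that instrumentation can look like is below. Note that the writer set-up and the exact argument order of sysprof_capture_writer_add_mark() are from memory and may differ between sysprof versions, so treat the example app in the sysprof repository as the canonical reference:

#include <glib.h>
#include <sysprof-capture.h>
#include <unistd.h>

/* Shared writer for the whole process (not thread-safe; illustration only). */
static SysprofCaptureWriter *writer = NULL;

static void
record_task_mark (gint64 begin_time_nsec, gint64 end_time_nsec, const char *details)
{
  if (writer == NULL)
    writer = sysprof_capture_writer_new ("gnome-software.syscap", 0);

  /* Log one 'mark' per completed high-level task, with its duration. */
  sysprof_capture_writer_add_mark (writer,
                                   begin_time_nsec,
                                   -1,                               /* CPU (unknown) */
                                   getpid (),
                                   end_time_nsec - begin_time_nsec,  /* duration */
                                   "gnome-software",                 /* group */
                                   "plugin-job",                     /* name */
                                   details);                         /* message */
}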

So I played around with this newly-instrumented version of gnome-software for a bit, but found that there were still empty regions in the profiling trace, where time passed and computation was happening, but nothing useful was logged in the sysprof capture. More instrumentation was needed.

sysprof + GLib

gnome-software does a lot of its computation in threads, bringing the results back into the main thread to be rendered in the UI using idle callbacks.

For example, the task to list the apps in a particular category in gnome-software will run in a thread, and then schedule an idle callback in the main thread with the list of apps. The idle callback will then iterate over those apps and add them to (for example) a GtkFlowBox to be displayed.

Adding items to a GtkFlowBox takes some time, and if there are a couple of hundred of apps to be added in a single idle callback, that can take several hundred milliseconds — a long enough time to block the main UI from being redrawn that the user will notice.

How do you find out which idle callback is taking too long? sysprof again! I added sysprof support to GLib so that GSource.dispatch events are logged (along with a few others), and now the long-running idle callbacks are displayed in the sysprof graphs. Thanks to Christian and Richard for their reviews and contributions to those changes.

This capture file was generated using sysprof-cli --gtk --use-trace-fd -- gnome-software, and the ‘gnome-software’ and ‘GLib’ lines in the ‘Timings’ row need to be made visible using the drop-down menu in that row.

It’s important to call g_task_set_source_tag() or g_task_set_name() on all the GTasks in your code, and to call g_source_set_name() on the GSources (like this), so that the marks in the capture file have helpful names.
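For instance, something along these lines (the function names are invented for the example):

#include <gio/gio.h>

static gboolean
refine_results_idle_cb (gpointer user_data)
{
  /* Process the results in the main thread. */
  return G_SOURCE_REMOVE;
}

static void
start_refine (GObject *source_object, GCancellable *cancellable,
              GAsyncReadyCallback callback, gpointer user_data)
{
  g_autoptr(GTask) task = g_task_new (source_object, cancellable, callback, user_data);
  guint idle_id;

  /* Name the task so its marks are identifiable in the capture. */
  g_task_set_source_tag (task, start_refine);
  g_task_set_name (task, "refine results");
  g_task_return_boolean (task, TRUE);

  /* Name the idle GSource for the same reason. */
  idle_id = g_idle_add (refine_results_idle_cb, NULL);
  g_source_set_name_by_id (idle_id, "[example] refine-results-idle");
}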

In the capture, you can see the ‘get-updates’ plugin job on gnome-software’s flatpak plugin is taking 1.5 seconds (in a thread), and then 175ms to process the results in the main thread.

The selected row above that is showing it’s taking 110ms to process the results from a call to gs_plugin_loader_job_get_categories_async() in the main thread.

What’s next?

With the right tooling in place, it should be easier for me and others to find and fix performance issues like these, in gnome-software and in other projects.

I’ve submitted a few fixes, but there are more to do, and I need to shift my focus onto other things for now.

Please try out the new sysprof features, and add libsysprof-capture-4.a support to your project (if it would help you debug high-level performance problems). Ask questions on Discourse (and @ me).

To try out the new features, you’ll need the latest versions of sysprof and GLib from git.

Easily speed up CI by reducing download size

Every time a CI pipeline runs on GitLab, it downloads the git repository for your project. Often, pipeline jobs are set up to make further downloads (of dependencies or subprojects), and these are repeated on every job.

Assuming that you’ve built a Docker image containing all your dependencies, to minimise how often they’re re-downloaded (you really should do this, it speeds up CI a lot), you can make further improvements by:

  1. Limiting the clone depth of your repository in the GitLab settings: Settings → CI/CD, and change it to use a ‘git shallow clone’ of depth 1.
  2. Adding --branch, --no-tags and --depth 1 arguments to every git clone call you make during a CI job. Here’s an example for GLib.
  3. Adding depth=1 to your Meson .wrap files to achieve the same thing when (for example) meson subprojects download is called. See the same example merge request; a sketch of both changes is below.
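Roughly, the two changes from points 2 and 3 look like this, with a placeholder subproject name and URL:

git clone --branch main --no-tags --depth 1 https://gitlab.gnome.org/GNOME/glib.git

# subprojects/libfoo.wrap
[wrap-git]
directory = libfoo
url = https://gitlab.example.org/example/libfoo.git
revision = main
depth = 1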

For GLib, the difference between git clone https://gitlab.gnome.org/GNOME/glib.git and git clone --depth 1 https://gitlab.gnome.org/GNOME/glib.git is 66MB (reducing from 74MB to 8MB), or nearly a factor of 10. It won’t be as much for younger or smaller projects, but it’s still worthwhile.