Add -fno-omit-frame-pointer and -mno-omit-leaf-frame-pointer to #

Date proposed: 2023-12-14
RFC MR: https://gitlab.archlinux.org/archlinux/rfcs/-/merge_requests/26

Summary #

This RFC proposes to add -fno-omit-frame-pointer and -mno-omit-leaf-frame-pointer to the default compilation flags to improve the effectiveness of profiling and debugging tools.

Most of this RFC was adapted from my Fedora change proposal.

Motivation #

Why perform full system profiling #

Generally, when implementing optimizations after receiving a report on a performance issue, there are two hurdles a developer must overcome:

They have to recompile their program with sufficient debugging information to enable accurate and reliable profiling. Frame pointers are an example of such information. They have to reproduce the scenario under which the software performed poorly. They have to gather the necessary profiling data by running the recompiled program in the reproduced scenario. After gathering the profiling data, the developer can use that data to guide possible optimizations. Usually, this ends being an iterative process, where a possible optimization is implemented, and the scenario is rerun with the recompiled program to measure the effects on performance.

When dealing with a single program without dependencies, recompiling the software, reproducing the scenario and gathering the profiling data might not be terribly hard to achieve. However, when dealing with a large program with many dependencies, either in the form of shared libraries or via IPC, recompiling all of these dependencies with debugging information, reproducing the exact scenario under which the performance issue occurs, and gathering all the profiling data from all the dependencies becomes a complicated exercise.

An interesting approach to avoid the above hurdles is to make sure we can do profiling of the entire system directly on a user's machine. This approach means we don't have to recompile our software, don't need to reproduce the scenario under which the software performs poorly, and gives us a single unified approach to gather profiling data for all the applications we're interested in. Naturally, this approach depends on being able to profile the entire system efficiently so that there's no noticeable impact on any running services.

Another requirement (unrelated to this proposal, but interesting nonetheless) is that we need logic to only enable profiling when it's interesting to do so. There's a few different options:

On demand profiling: Only start profiling when we receive an explicit request to do so. Interval based continuous profiling: Profile for a specific amount of time every X seconds/minutes/hours/... Trigger based profiling: Start profiling based on some predefined conditions, such as high CPU or memory usage. If we agree that being able to do full system profiling is useful, the next section explains why we need frame pointers in all software running on the system to be able to do effective full system profiling.

How to do full system profiling #

Probably the most prominent way to do full system profiling on Linux with low overhead is by using the perf sampling profiler. Sampling profilers like perf operate by "statistical profiling" (sampling). They take a sample every N events e.g. "cpu-cycles" to understand the statistical breakdown of time spent in functions or function callstacks executing on the CPU. perf has an accompanying Linux subsystem that allows it to take samples every N events with very low overhead. A perf sample can include all kinds of information, but for profiling, what we're typically interested in is the call stack of the programs that are currently executing.

To record samples for specific hardware/software events using the perf subsystem on Linux, developers can use the perf_event_open() system call. To have the recorded samples include the call stack, the PERF_SAMPLE_CALLCHAIN can be set in the config struct passed to perf_event_open(). For userspace stacks, the call stack can only be sampled in kernelspace if the userspace program and its dependencies are built with frame pointers. If frame pointers are not available, PERF_SAMPLE_STACK_USER can be used to sample the entire stack instead, allowing for unwinding in userspace instead using e.g. DWARF debugging info.

The perf subsystem also has support for attaching BPF programs to a perf event file descriptor. The program will be called every time an event is sampled and is provided the perf event data. This can be used to attach arbitrary logic to perf sampling, and makes it possible to implement custom logic on top of perf's sampling without having to leave kernelspace. To get access to the userspace stack from BPF, bpf provides the bpf_get_stackid() helper function. Similar to PERF_SAMPLE_CALLCHAIN, this function depends on the userspace program and its dependencies to have been compiled with frame pointers for it to be able to traverse the call stack.

To get accurate profiling results, we want to be able to sample events at a relatively high sampling rate. This means that we want to do the minimal amount of work every time we sample an event to avoid overhead. Traversing a stack using frame pointers is cheap, since we only have to traverse the frame pointers until we reach the top of the stack. In comparison, to unwind using DWARF, we first have to copy the full stack from kernelspace to userspace, and then unwind the stack using DWARF debugging info, which is relatively slow (see https://fzn.fr/projects/frdwarf/frdwarf-oopsla19.pdf). Because of this, to make full system profiling using perf work effectively, it's imperative that all software running on the system is compiled with frame pointers so that the call stack can be unwound in kernelspace for minimum overall overhead.

Frame pointers for debugging and tracing with BPF #

The above profiling example was just one use case where we benefit from having access to the frame pointer in BPF. Since frame pointers enable BPF to unwind every userspace stack, we can get an accurate callstack from every BPF program we can think of. This makes certain kinds of debugging much easier, especially tools where we want to investigate who is calling a specific function.

A good example is the ustack() function in bpftrace. bpftrace is a high level tracing language for BPF that can easily hook into system calls, function calls, kernel tracepoints, and more. And since frame pointers guarantee that bpftrace's ustack() helper function works all the time, we’re able to log the full callstack from every bpftrace script.

As another example, the bcc tools directory has around ten BPF tools with a -U option to print the userspace stack when tracing some event, such as tracing calls to cap_capable() for security capability checks, or just tracing slow function calls in general with the funcslower script. All these tools can only work reliably when all software running on the system is built with frame pointers.

All the above tooling enables in-depth debugging of applications without needing to modify the source code of the applications itself. BPF can be used to attach to function calls, kernel tracepoints, kernel functions, system calls, and all of this is presented in an easy to use fashion via bcc and bpftrace.

To summarize, BPF tooling that works with or benefits from stack trace information in general will work much more reliably when all software is built with frame pointers. As a result, implementing this change proposal will make the BPF tracing ecosystem of tools much more useful on Arch Linux in general, whereas currently many of the current BPF tools are hamstrung due to the lack of frame pointers.

Unwinding #

How does the profiler get the list of function names? There are two parts of it:

Unwinding the stack - getting a list of virtual addresses pointing to the executable code.
Symbolization - translating virtual addresses into human-readable information, like function name, inlined functions at the address, or file name and line number.

Unwinding is what we're interested in for the purpose of this proposal. The important things are:

Data on stack is split into frames, each frame belonging to one function.
Right before each function call, the return address is put on the stack. This is the instruction address in the caller to which we will eventually return — and that's what we care about.
One register, called the "frame pointer" or "base pointer" register (RBP), is traditionally used to point to the beginning of the current frame. Every function should back up RBP onto the stack and set it properly at the very beginning.

The “frame pointer” part is achieved by adding push %rbp, mov %rsp,%rbp to the beginning of every function and by adding pop %rbp before returning. Using this knowledge, stack unwinding boils down to traversing a linked list:

https://i.imgur.com/P6pFdPD.png

Frame pointers in other distributions #

Fedora has been defaulting to compiling its packages with frame pointers since Fedora 38. As far as we're aware, nobody has reported any noticeable slowdowns in Fedora after enabling compilation with frame pointers. Also, Ubuntu has announced that for its next LTS release (24.04), packages will be compiled with frame pointers enabled by default.

Specification #

To make it possible to rely on the frame pointer being available, we'll add -fno-omit-frame-pointer and -mno-omit-leaf-frame-pointer to the default C/C++ compilation flags. This will instruct the compiler to make sure the frame pointer is always available. This will in turn allow profiling tools to provide accurate performance data which can drive performance improvements in core libraries and executables. It'll also make stacktraces from all BPF tooling more reliable as they'll be guaranteed to have access to a reliable stacktrace via frame pointers.

For rust, the equivalent compilation flag we'll enable is -Cforce-frame-pointers=yes.

We'll exclude the linux package from this change as the kernel has its own unwinding format (ORC) and as such does not benefit from being compiled with frame pointers.

After 6 months or earlier if substantial (reproducible) feedback shows that frame pointers noticeably slows applications down, the RFC will be discussed again to see if the change should be revisited.

Drawbacks #

The frame pointer register is not necessary to run a compiled binary. It makes it easy to unwind the stack, and some debugging tools rely on frame pointers, but the compiler knows how much data it put on the stack, so it can generate code that doesn't need the RBP. Not using the frame pointer register can make a program more efficient:

We don’t need to back up the value of the register onto the stack, which saves 3 instructions per function.
We can treat the RBP as a general-purpose register and use it for something else.

Whether the compiler sets frame pointer or not is controlled by the -fomit-frame-pointer flag and the default is "omit", meaning we can’t use this method of stack unwinding by default.

Potential performance impact #

From https://hal.inria.fr/hal-02297690/document, a paper on DWARF unwinding, we find that Google compiles all its internal critical software with frame pointers to ensure fast and reliable backtraces.
Given that the kernel uses the ORC debuginfo format and this works well, we'll keep compiling the kernel without frame pointers since there's no benefits for profiling or debugging to be gained by compiling the kernel with frame pointers. This prevents any regressions in kernel performance such as those reported by https://lore.kernel.org/all/20170602104048.jkkzssljsompjdwy@suse.de/T/#u.
Brendan Gregg from Intel advocates making -fno-omit-frame-pointer the default in GCC (https://www.brendangregg.com/Slides/SCALE2015_Linux_perf_profiling.pdf)

Should individual libraries or executables notice a significant performance degradation caused by including the frame pointer everywhere, these packages can opt-out on an individual basis.

Benchmarking of the performance impact #

As part of the change proposal for enabling frame pointers on Fedora by default, we compared a number of benchmarks on a Fedora 37 system where every package is built with frame pointers against the same benchmarks on a regular Fedora 37 system. The source code for the benchmarks can be found in the fpbench repository on GitHub. We used copr to build the packages required to run the benchmarks (and their dependencies) with frame pointers. Then, using mkosi, we build one Fedora 37 container with frame pointers and one Fedora 37 container without frame pointers and run various benchmarks inside these containers.

The full results of the benchmarks can be found in the README of the fpbench repository.

Summarizing the results:

Compiling the kernel with GCC is 2.4% slower with frame pointers
Running Blender to render a frame is 2% slower on our specific testcase
openssl/botan/zstd do not seem to be affected significantly when built with frame pointers
The impact on CPython benchmarks can be anywhere from 1-10% depending on the specific benchmark
Redis benchmarks do not seem to be significantly impacted when built with frame pointers

Aside from the Python performance benchmarks, the impact of building with frame pointers is limited on the benchmarks we performed. Our findings on the impact of the Python benchmarks when CPython is built with frame pointers are discussed in this comment on the FESCo issue.

Alternatives Considered #

There are a few alternative ways to unwind stacks instead of using the frame pointer:

DWARF data - The compiler can emit extra information that allows us to find the beginning of the frame without the frame pointer, which means we can walk the stack exactly as before. The problem is that we need to unwind the stack in kernel space which isn't implemented in the kernel. Given that the kernel implemented its own format (ORC) instead of using DWARF, it's unlikely that we'll see a DWARF unwinder in the kernel any time soon. The perf tool allows you to use the DWARF data with --call-graph=dwarf, but this means that it copies the full stack on every event and unwinds in user space. This has very high overhead. For more details on why DWARF unwinding is slow, please see https://hal.inria.fr/hal-02297690/document which contains detailed information on the problems with DWARF unwinding.
ORC (undwarf) - problems with unwinding in kernel led to creation of another format with the same purpose as DWARF, just much simpler. This can only be used to unwind kernel stack traces; it doesn't help us with userspace stacks. More information on ORC can be found in the LWN article “The ORCs are coming”.
LBR - New Intel CPUs have a feature that gives you source and target addresses for the last 16 (or 32, in newer CPUs) branches with no overhead. It can be configured to record only function calls and to be used as a stack, which means it can be used to get the stack trace. Sadly, you only get the last X calls, and not the full stack trace, so the data can be very incomplete. On top of that, many Arch Linux users might still be using CPUs without LBR support which means we wouldn't be able to assume working profilers on an Arch Linux system by default.
CTF Frame - An in progress RFC will add support to binutils to attach a new ctf_frame section to ELF binaries containing unwinding information. This new unwinding format claims to be more compact than eh_frame, faster to unwind, and simpler to implement an unwinder with. Should this format be accepted into binutils and should the kernel merge a CTF unwinder in the future, we could start building applications with CTF frame unwind information which could then be used in the kernel for unwinding userspace stacks instead of frame pointers. Unfortunately, CTF Frame is still a work-in-progress and won't be available for some time (if at all).
Shadow Stacks - Shadow stacks are a hardware feature found on new Intel and AMD CPUs that improve security by copying return address information to a separate read-only shadow stack so that it's possible to verify the return address on the original stack wasn't modified by for example a buffer overflow. This information could potentially be used to unwind the stack. However, it's very early days for shadow stacks, they're only supported on very new CPU models, there's no kernel support yet and it's not completely certain that we'll be able to use this information for unwinding. As such, it's not a viable option for unwinding at this time but it might become one at some point in the future.

To summarize, if we want complete stacks with reasonably low overhead (which we do, there's no other way to get accurate profiling data from running services), frame pointers are currently the best option.