0002 March

Provide an x86-64-v3 microarchitecture level port

Summary

Provide a second Arch Linux port using -march=x86-64-v3 in the build flags.

Motivation

Arch used to pride itself on providing optimised binaries out of the box. However, the days when our i686 port showed improvements over other distributions are long behind us.

Recently, AMD, Intel, Red Hat, and SUSE collaborated to define three x86-64 microarchitecture levels on top of the x86-64 baseline. The three levels group CPU features together, roughly based on hardware release dates.

The first of these microarchitecture levels, x86-64-v2, assumes the following on top of the baseline x86-64 instructions:

CMPXCHG16B, LAHF-SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3.

This raises the processor feature requirement to roughly Intel Nehalem, and supports any x86-64 processor made in the last decade.

The x86-64-v3 microarchitecture requires the following instruction sets:

AVX, AVX2, BMI1, BMI2, F16C, FMA, LZCNT, MOVBE, XSAVE.

That is close to a Haswell processor, but it excludes some recent low-end Intel CPUs that lack AVX support.

Finally, x86-64-v4 requires:

AVX512F, AVX512BW, AVX512CD, AVX512DQ, AVX512VL

These microarchitecture levels are implemented in GCC version 11 (unreleased) and LLVM version 12 (unreleased), and are supported in glibc-2.33 and binutils-2.36.

You can see which microarchitecture levels your CPU supports by running:

/lib/ld-linux-x86-64.so.2 --help

Subdirectories of glibc-hwcaps directories, in priority order:
  x86-64-v4
  x86-64-v3 (supported, searched)
  x86-64-v2 (supported, searched)
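
On systems where glibc is older than 2.33, the same answer can be approximated from the CPU flags the kernel reports. The following is a minimal sketch, not part of the proposal: the function names has_all and x86_64_level are made up for illustration, and the flag spellings (pni for SSE3, abm for LZCNT, lahf_lm for LAHF-SAHF, bmi1 for BMI1) are the ones Linux uses in /proc/cpuinfo rather than the instruction set names listed above.

```shell
#!/bin/sh
# Sketch: derive the highest x86-64 microarchitecture level from a list of
# /proc/cpuinfo-style flags. Function names are hypothetical.

# has_all FLAGS NAME...: succeed only if every NAME is a whole word in FLAGS.
has_all() {
    flags=" $1 "
    shift
    for name in "$@"; do
        case "$flags" in
            *" $name "*) ;;       # flag present, keep checking
            *) return 1 ;;        # flag missing
        esac
    done
    return 0
}

# x86_64_level [FLAGS]: print x86-64, x86-64-v2, x86-64-v3 or x86-64-v4.
# Without an argument, the flags of the first CPU in /proc/cpuinfo are used.
x86_64_level() {
    f=${1:-$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)}
    has_all "$f" cx16 lahf_lm popcnt pni sse4_1 sse4_2 ssse3 \
        || { echo "x86-64"; return; }
    has_all "$f" avx avx2 bmi1 bmi2 f16c fma abm movbe xsave \
        || { echo "x86-64-v2"; return; }
    has_all "$f" avx512f avx512bw avx512cd avx512dq avx512vl \
        || { echo "x86-64-v3"; return; }
    echo "x86-64-v4"
}
```

Calling x86_64_level with no argument reports the level of the running machine. The output of ld-linux-x86-64.so.2 --help remains the authoritative check, since it reflects what the dynamic loader will actually search.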

RHEL9 will use x86-64-v2 as its baseline.

This RFC proposes adding an x86_64_v3 port to Arch Linux. Assuming SSE4 and AVX2 (and others) while compiling will provide greater out-of-the-box performance in Arch Linux. There are also implications for battery life for laptop users.

Benchmarks

It is difficult to benchmark an entire system, and the workloads that benefit most often have CPU detection built in and already use optimised code paths. Also, the relevant GCC and LLVM releases are not yet available. To run tests equivalent to x86-64-v3 using current GCC and LLVM, we can compile packages using:

CFLAGS="$CFLAGS -mcx16 -msahf -mpopcnt -msse3 -msse4.1 -msse4.2 -mssse3 \
          -mavx -mavx2 -mbmi -mbmi2 -mf16c -mfma -mlzcnt -mmovbe -mxsave"
CXXFLAGS="$CFLAGS"
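
One way to confirm that such a flag set really enables the x86-64-v3 features is to inspect the compiler's predefined macros. The sketch below is an illustration only: missing_v3_macros is a made-up helper, and it checks just the nine macros corresponding to the x86-64-v3 list above, using the names GCC and Clang define for those features.

```shell
#!/bin/sh
# Sketch: report which x86-64-v3 feature macros are absent from a compiler
# macro dump read on stdin. The helper name is hypothetical.
missing_v3_macros() {
    dump=$(cat)
    for macro in __AVX__ __AVX2__ __BMI__ __BMI2__ __F16C__ __FMA__ \
                 __LZCNT__ __MOVBE__ __XSAVE__; do
        # The trailing space keeps __AVX__ from matching __AVX2__.
        printf '%s\n' "$dump" | grep -q "^#define $macro " || echo "$macro"
    done
}

# Example invocation (not executed here): dump the macros enabled by the
# flags under test and list anything that is still missing.
#   gcc $CFLAGS -dM -E -x c /dev/null | missing_v3_macros
```

An empty result means every listed macro is defined. The x86-64-v2 features are not checked here.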

Some benchmarks were performed by rebuilding packages with and without the above CFLAGS additions against repositories from 2021-03-12:

firefox-86.0.1-1: benchmarking on Basemark Web 3.0 (https://web.basemark.com/) seven times (alternating installs) gave a median score of 514.68 for v1 and 565.42 for v3, representing a 9.9% improvement. Note that only firefox itself was rebuilt, and none of its dependencies, so this figure represents a lower bound.

openssl-1.1.1.j-1: benchmarking using openssl speed rsa showed improvements in the range of 3.4% to 5.1% for signing and verifying with keys of different sizes.

Benchmarks posted on the arch-general mailing list [1] show a median performance benefit of around 10% for -march=haswell (roughly x86-64-v3).

[1] https://lists.archlinux.org/pipermail/arch-general/2021-March/048739.html
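
The Firefox figure quoted above is the relative gain of the v3 median over the v1 median. As a quick check of the arithmetic (pct_gain is a helper name invented for this example):

```shell
#!/bin/sh
# pct_gain BASE NEW: print the relative improvement of NEW over BASE in percent.
pct_gain() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

# Basemark Web 3.0 medians from the text: 514.68 (v1) and 565.42 (v3).
pct_gain 514.68 565.42   # prints 9.9
```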

Specification

We will provide a second port where the distributed makepkg.conf includes the following:

CARCH="x86_64_v3"
CHOST="x86_64-pc-linux-gnu"

CFLAGS="-march=x86-64-v3 -mtune=generic ..."
CXXFLAGS="$CFLAGS"

Alternatives Considered

Moving the baseline to x86-64-v2 was discussed, but the gains were not considered enough to justify removal of support for hardware without SSE4.2.

Providing all four architectures would require a lot of resources in terms of build time and mirror space. Providing only x86-64 and x86-64-v3 is a trade-off between offering more optimised binaries for newer hardware (without requiring the absolute latest) and the additional build time and storage associated with providing multiple architectures.

Drawbacks

Providing a second architecture would increase our repo size by approximately 66% (~32GB).

Building two architectures will take additional packager time unless automated.

Some developers may not have the hardware to debug issues found only in x86-64-v3 packages. Such issues are likely to be very rare.

Unresolved Questions

When "Architecture = auto" is set in pacman.conf, pacman will use uname to detect the architecture. As this "port" is more of an optimised rebuild than an architecture change, uname will report x86_64. We could patch pacman to use x86_64_v3 instead, but that may not be the correct solution.

It would be preferable if pacman on x86_64_v3 could still install packages from x86_64, particularly for non-Arch repositories that may not want to build for both architectures. This would also allow a gradual transition to x86_64_v3 in which [core] is rebuilt first, followed by the other repos one at a time. Your friendly pacman developers may be willing to add the ability to specify multiple architectures in pacman.conf.
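
To make the multiple-architecture idea concrete, the check pacman would need could look roughly like the sketch below. This is purely hypothetical: pacman does not currently offer such a setting, arch_accepted is an invented name, and treating the configured value as a simple list of acceptable architectures is an assumption, not a design decision of this RFC.

```shell
#!/bin/sh
# HYPOTHETICAL sketch of a multi-architecture check; not pacman's behaviour.
# arch_accepted PKG_ARCH ACCEPTED...: succeed if a package built for PKG_ARCH
# may be installed when the ACCEPTED architectures are configured.
arch_accepted() {
    pkg_arch=$1
    shift
    # Architecture-independent packages ("any") are always installable.
    [ "$pkg_arch" = "any" ] && return 0
    for arch in "$@"; do
        [ "$pkg_arch" = "$arch" ] && return 0
    done
    return 1
}

# With both architectures configured, a package from a third-party repository
# that only builds for x86_64 would still be accepted on an x86_64_v3 system.
arch_accepted x86_64 x86_64_v3 x86_64 && echo "x86_64 package accepted"
```

A system configured with x86_64 alone would reject x86_64_v3 packages under the same check, which is the behaviour needed on hardware that lacks the v3 features.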