Profile- and Extension-Optimized Distros

This started as a summary of the https://lists.riseproject.dev/g/TSC/topic/homework_assignments_for/99162090?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,99162090,previd%3D1685566967087182717,nextid%3D1683083871956746760&previd=1685566967087182717&nextid=1683083871956746760 thread.

Context

Linux distros generally compile everything with one set of compiler flags and one target baseline that works for all the hardware they support within a given architecture (e.g. x86-64 or aarch64), and they generally support a wide range of hardware. As an example, it's easy to find stories online from folks who pull out a 10-year-old laptop and install the latest Ubuntu on it. This means Ubuntu is compiling for the x86 ISA from 10+ years ago. That's not ideal, but it's also not the end of the world, because 10 years ago x86 already had a pretty solid ISA: maybe it didn't have AVX-512, but earlier forms of SIMD were there, and so on.

[[UD: First, it's about more than just compiling the distribution, and second, there is an assumption at work here which Intel and AMD have kept in mind when revamping their CPUs. For each version of a distribution, a least-common ISA baseline is decided. Usually this stays stable for many years, but it can change. This means no precautions are taken to help deployment on CPUs not meeting the base requirement, except that perhaps some tests for the required features are executed and the system is stopped early if the tests fail. This works well enough for an ecosystem like x86 because there are architected CPU features which are not removed. Some features are effectively made no-ops; others depend on fallback paths which had to be present anyway. Unless this is possible, the baseline cannot be raised. The RISC-V profiles are going in the right direction, but there is no sufficient protection that future revisions of the profiles maintain older profiles as a subset. I know this is not generally desirable, but there should be a version of the profiles with that property. Manufacturers interested in platforms with longevity will adopt those profiles.]]

The concern for RISC-V is that if Linux distros target the least common denominator of all our hardware, we end up with a pretty bad ISA, relatively speaking: no vector extensions, no unaligned access, no bit manipulation, etc. For this reason it is critical for RISC-V, more than for other architectures, that we find or create solutions that allow distros to ship code optimized for our hardware.

The "baseline ISA" targeted by hardware is known as the "RVA ISA Profiles". These are being released on a year-ish basis going forward, but it is recognized that the software ecosystem and the distros are not prepared or willing to change that fast.  Hence there is the idea now formed in RVI of there being major and minor RVA profile releases.  RVA20 was the initial major release of RVA, followed by minor RVA22 and RVA23 releases (the latter is currently gestating and should be ratified in a quarter or so).

Also note that while the RVA profiles mandate a lot (with a rising bar of mandates over time), some RISC-V ISA extensions are considered optional, with the idea that software support will take advantage of them when they are present in a system. Similarly, some of the new ISA extensions that appear in subsequent minor profile releases may be practical to support in this manner as well.

All of this explodes the number of configurations to optimize for. The focus here is specifically userspace: the Linux kernel itself has sufficient mechanisms for enabling features (e.g. Sv48 vs. Sv39) or choosing optimized code paths. In userspace, the problem is optimizing system libraries and some applications.

Local solutions

Some approaches work for an individual library or application, to optimize specific code paths.

ifuncs

ifuncs are best when there's a functional interface and a custom version of the target function can provide *significant* performance benefits. Small wins in the implementation will be nullified by the additional level of indirection that is inherent in the ifunc dispatch. It's primarily used within glibc to dispatch to custom mem*, str* or some math functions where things like FMA capabilities can provide significant speedups.

[[UD: ifuncs are used automatically and widely when gcc's ifunc function attribute is used. This is a documented feature of the compiler; its use is not constrained to the runtime libraries. Possible performance issues with ifuncs are not due to the extra indirection itself; that's not much of a factor on x86 machines with extensive caching and branch prediction. The problem is inlining, more concretely the compiler's inability to inline functions that are ifuncs. Getting around that requires that the function which would benefit from inlining such a specialized function is itself made an ifunc.]]
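
To make the mechanism concrete, here is a minimal sketch of a user-defined ifunc using GCC's ifunc attribute. The popcount variants and the cpu_has_zbb() discovery helper are hypothetical names used for illustration; glibc's real resolvers are hand-written and do their own feature probing.

    #include <stddef.h>

    typedef size_t popcount_fn(unsigned long);

    /* Portable fallback implementation. */
    static size_t popcount_generic(unsigned long x)
    {
        size_t n = 0;
        for (; x != 0; x >>= 1)
            n += x & 1;
        return n;
    }

    /* Variant that would be compiled to use the Zbb cpop instruction. */
    static size_t popcount_zbb(unsigned long x)
    {
        return (size_t)__builtin_popcountl(x);
    }

    /* Hypothetical discovery helper.  In real code the resolver must be
       self-contained (e.g. query hwprobe directly, see below), because it
       runs very early, during relocation processing. */
    extern int cpu_has_zbb(void);

    /* The resolver runs once, when the symbol is bound, and returns the
       implementation to use.  Note the caveat from the comment above:
       calls to popcount() cannot be inlined, since the target is only
       known at load time. */
    static popcount_fn *resolve_popcount(void)
    {
        return cpu_has_zbb() ? popcount_zbb : popcount_generic;
    }

    size_t popcount(unsigned long) __attribute__((ifunc("resolve_popcount")));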

See the LPC 2021 slides "Puzzle for RISC-V ifunc": https://lpc.events/event/11/contributions/1103/attachments/832/1576/Puzzle%20for%20RISC-V%20ifunc.pdf

Is there anything left to do as of 2023?

[[UD: What is completely missing is discovery. The profile specification says:

    3.3. Profile ISA Features

    - Platforms may provide a discovery mechanism to determine what optional extensions are present.

The discovery mechanisms must be architected and mandatory.]]

Function multi-versioning

There's a related capability called function multi-versioning, which tends to be used by user code that wants to specialize itself. Essentially, with a source annotation one can arrange to have a given function or set of functions compiled multiple times with different options, and an ifunc-like resolver selects among them at runtime. It's meant to give applications a cleaner way to implement dynamic dispatch.

[[UD: I think you're mixing up two things. Function versioning is a mechanism to maintain backward compatibility while changing the semantics of a function for current and future uses. It does not use ifunc; it uses symbol versioning. Of course you could also call the use of ifuncs "multi-versioning", but that is redundant since there is already a point above covering ifuncs.]]
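
For reference, GCC's "function multi-versioning" (the target/target_clones attributes) is the ifunc-based compiler feature the paragraph above appears to describe; glibc symbol versioning is indeed a separate mechanism. A minimal sketch follows, using x86-64 target names since target_clones support for RISC-V is only now appearing in toolchains:

    #include <stdio.h>

    /* The compiler emits one clone per listed target plus a resolver that
       picks the best one at load time, using the same ifunc machinery as
       in the previous example. */
    __attribute__((target_clones("default", "avx2")))
    long sum(const int *v, long n)
    {
        long s = 0;
        for (long i = 0; i < n; i++)
            s += v[i];
        return s;
    }

    int main(void)
    {
        int v[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        printf("%ld\n", sum(v, 8));
        return 0;
    }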

Roll-your-own

There are roll-your-own dynamic dispatch implementations in libraries like OpenSSL. These are built on top of executing magic instructions to find out the processor's capabilities and then specializing based on that; on RISC-V they will have to be converted to the new kernel interfaces being implemented for discovery. The Arm LSE implementation is closer to this model than to ifuncs.

[[UD: This is not an alternative. The feature discovery mentioned here also has to be present in the ifunc dispatcher; there is no compiler-generated magic to make this somehow work. The ifunc dispatcher needs to discover the properties of the current CPU and then use an encoded heuristic to decide which implementation to use. The only difference between a mechanism in OpenSSL or the Intel MKL etc. and ifunc is efficiency: once the target function is identified, ifunc calls require just a single indirect function call, while OpenSSL/MKL/… always require a call into a dispatching function which itself performs the indirect call. ifuncs are much more efficient.

The requirement for userlevel code to discover a CPU's feature set is common to all kinds of dispatching code, including what you call "global solutions" below.]]
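
A minimal sketch of the hand-rolled pattern (the names are illustrative, not OpenSSL's): capabilities are probed once at initialization, and hot entry points go through a function pointer, i.e. the extra dispatching call that the comment above contrasts with ifunc binding.

    #include <stddef.h>

    /* Capability bits, filled in once at library init by whatever discovery
       mechanism the OS provides (hwprobe, cpuid, ...). */
    static unsigned long cpu_caps;
    #define CAP_VECTOR (1UL << 0)   /* hypothetical capability bit */

    typedef void sha256_blocks_fn(void *ctx, const void *data, size_t nblocks);

    static void sha256_blocks_scalar(void *ctx, const void *data, size_t nblocks)
    {
        /* portable reference implementation ... */
        (void)ctx; (void)data; (void)nblocks;
    }

    static void sha256_blocks_vector(void *ctx, const void *data, size_t nblocks)
    {
        /* vector-accelerated implementation ... */
        (void)ctx; (void)data; (void)nblocks;
    }

    /* Entry-point pointer; every call pays this extra indirection. */
    static sha256_blocks_fn *sha256_blocks = sha256_blocks_scalar;

    void crypto_lib_init(void)
    {
        /* cpu_caps = probe_cpu_features();  -- hypothetical probe */
        if (cpu_caps & CAP_VECTOR)
            sha256_blocks = sha256_blocks_vector;
    }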

Global solutions

Can a single distro ship multiple versions of all binaries, targeting specific RVA profile levels or extensions?

[[UD: That's really not possible (except by providing completely separate installations), nor is it wanted. For individual binaries there might be multiple versions, and a dispatching wrapper (a binary or just a shell script) can select an appropriate version. What is much more common (based on 20+ years of experience) is that the code in question is in DSOs and multiple versions of a DSO are provided. That is supported by the dynamic linker if the platform provides the necessary features. See below.]]

Debian Multi-Arch

https://wiki.debian.org/Multiarch/HOWTO

This implies treating every RVA profile as a new arch?

[[UD: This is relevant here only insofar as running 32-bit binaries should be supported on 64-bit systems. That is completely orthogonal to the problem discussed here so far. There are reasons to support it and distributions will make their own calls. But there is no agreement on names for existing platforms and I don't expect anyone will agree in future either. E.g., on Fedora and others we have /lib and /lib64 directories for the 32-bit and 64-bit libraries respectively; Debian systems use /lib always for the native form and /lib32 on 64-bit systems for the 32-bit versions. I very much doubt that either camp will diverge from that going forward. This is best left to the individual distribution to decide. As an aside, no features from the system are needed: the decision of which directory to look into is made exclusively based on the dynamic linker and the bitness of the executable.]]

openSUSE x86-64-v3-like approach

https://www.phoronix.com/news/openSUSE-TW-x86-64-v3-HWCAPS

The idea here is that key packages which benefit from being compiled for a more modern architecture are compiled multiple times by the distro: for the classic least-common-denominator baseline, but also for x86-64-v2 and x86-64-v3. The client (apt-get, etc.) installs the best-architecture version of the package available from the package server. At runtime, glibc searches a list of paths to find the .so files to load, starting from the most specific/best architecture down to the most generic. This solution helps at least with packages that ship shared objects, but with more effort it could also work for executables.

[[UD: First, in general the idea is to install all the variants of the package in question; otherwise the remainder of your paragraph doesn't make any sense. Depending on the distribution, packaging the different DSO versions does not even involve multiple packages: the variants can all be in the same package and be unconditionally installed. The idea is that the dynamic linker searches, for each requested DSO, a deterministic list of filesystem locations in a meaningful order. The list and its order are determined based on the system's features. The dynamic linker needs the same kind of discovery mechanisms as ifunc dispatchers, perhaps with a bit less granularity, because it would be too costly to provide too many versions: the cost of testing (or the risk of not testing) all combinations of features, binaries, and DSOs would be too high.

For x86, the sets of features that are deemed worth compiling a separate DSO for were agreed upon in lengthy discussions and are by no means SUSE-specific. A decision on the levels (base, v2, v3) was possible because the differences between the base Opteron and the various SSE/AVX versions were large enough. But it also depends on common availability and on the performance characteristics of the different implementations, which have to be verified.]]
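
For reference, the mechanism behind this is glibc's glibc-hwcaps search path (glibc 2.33 and later): the dynamic linker probes the CPU and prepends matching subdirectories to the library search list. An illustrative layout is shown below; the x86-64 directory names are the real ones, while the RISC-V name is purely hypothetical since none has been standardized:

    /usr/lib64/libcrypto.so.3                            baseline build
    /usr/lib64/glibc-hwcaps/x86-64-v2/libcrypto.so.3     built with -march=x86-64-v2
    /usr/lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3     built with -march=x86-64-v3
    /usr/lib64/glibc-hwcaps/rva22/libcrypto.so.3         hypothetical RVA22-optimized build

On a machine that qualifies for x86-64-v3 the dynamic linker picks the v3 DSO; everywhere else it silently falls back to the baseline.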

FatELF: Universal Binaries for Linux

https://icculus.org/fatelf/

FatELF is a file format that embeds multiple ELF binaries for different architectures into one file. This is the Linux equivalent of what Mac OS X calls "Universal Binaries."

The format is very simple: it adds some accounting info at the start of the file, and then appends all the ELF binaries after it, adding padding for alignment. The end of the file isn't touched, so you can still do things like self-extracting .zip files for multiple architectures with FatELF.

This allows RVA profile-optimized applications and libraries, without disturbing how these are packaged and distributed.

The open question here is how to tag an ELF object as being optimized for an RVA profile or extension. A special .note section?

Also see https://www.reddit.com/r/programming/comments/a0rku/the_fatelf_project_universal_binaries_for_linux/.
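
Purely to make the idea concrete: such a container conceptually amounts to a small index followed by the embedded ELF images. The struct below is an illustrative approximation, not FatELF's exact on-disk format; for the RVA use case each record (or an ELF note inside each embedded image) would additionally have to identify the targeted profile/extension set, which is exactly the open question above.

    #include <stdint.h>

    /* Illustrative only -- not the exact FatELF on-disk layout. */
    struct fat_record {
        uint16_t machine;     /* ELF e_machine of the embedded image */
        uint8_t  osabi;
        uint8_t  word_size;   /* 32 or 64 */
        uint64_t offset;      /* file offset of the embedded ELF image */
        uint64_t size;        /* size of the embedded ELF image */
    };

    struct fat_header {
        uint32_t magic;       /* container magic number */
        uint16_t version;
        uint16_t num_records;
        struct fat_record records[];  /* one per embedded ELF binary */
    };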

[[UD: This is misleading and, I would venture to guess, unrealistic. macOS uses fat binaries to have one binary for x86 and Arm-based systems (maybe even others). That's not what we are talking about here. There is also nothing "simple" about this: there is no tool support, and there are myriad potential compatibility problems. This is a much-too-heavy approach for handling ISA nuances.]]

Extension discovery

In userspace.

[[UD: Yes. But how? Fast and reliably. In the past we relied on reading /proc/cpuinfo for some of this work. That punts discovery to the kernel but would require standardization of the /proc/cpuinfo format; also, people complained about the requirement of /proc being mounted. Therefore most discovery, at least on x86, is now done using the documented and architected CPU features: Intel and AMD both provide the cpuid interface for discovery.]]
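
For comparison, this is what architected userspace discovery looks like on x86, assuming a GCC/Clang toolchain that provides <cpuid.h>:

    #include <cpuid.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 7, sub-leaf 0: structured extended feature flags. */
        if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
            printf("AVX2: %s\n", (ebx & bit_AVX2) ? "yes" : "no");
            printf("SHA:  %s\n", (ebx & bit_SHA)  ? "yes" : "no");
        }
        return 0;
    }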

SIGILL

Use a signal handler to detect support for instructions provided by a specific extension.

This has been used in QEMU to probe for the Zba, Zbb, and Zicond extensions, e.g. https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg00602.html.
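
A minimal sketch of the technique: the instruction is emitted via a raw .insn encoding (here Zbb's andn) so that the file assembles without -march including Zbb, and SIGILL is caught to conclude the extension is absent. A real runtime would run into the caveats listed in the next comment.

    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>

    static sigjmp_buf probe_jmp;

    static void on_sigill(int sig)
    {
        (void)sig;
        siglongjmp(probe_jmp, 1);
    }

    /* Returns 1 if the Zbb "andn" instruction executes, 0 if it raises SIGILL. */
    static int have_zbb(void)
    {
        struct sigaction sa, old;
        int supported = 0;

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigill;
        sigaction(SIGILL, &sa, &old);

        if (sigsetjmp(probe_jmp, 1) == 0) {
            long a = -1, b = 1, c;
            /* andn c, a, b  (R-type: opcode 0x33, funct3 0x7, funct7 0x20) */
            __asm__ volatile (".insn r 0x33, 0x7, 0x20, %0, %1, %2"
                              : "=r"(c) : "r"(a), "r"(b));
            (void)c;
            supported = 1;
        }

        sigaction(SIGILL, &old, NULL);
        return supported;
    }

    int main(void)
    {
        printf("Zbb: %s\n", have_zbb() ? "yes" : "no");
        return 0;
    }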

[[UD: That is perhaps acceptable for an application but not for a runtime. The runtime environment cannot hijack the signal handler. First, signal handlers are a process-wide property: other threads might concurrently rely on the SIGILL handler for something else. Second, signal delivery is slow, and the number of features to test and the functions/DSOs depending on them might be large. Furthermore, until an ISA extension is officially sanctioned it's not even possible to rely on the instruction encoding as a valid way to test for a feature: the CPU in question might use a completely different instruction at that position in the encoding table. This indirect probing cannot possibly ever be a reliable discovery mechanism.]]

RISC-V Hardware Probing Interface

https://docs.kernel.org/riscv/hwprobe.html

[[UD: If the kernel can reliably probe the CPU, that's certainly a possibility. This is the equivalent of the cpuid instruction on x86 systems. But it requires that a) the parameters are part of the kernel ABI (not just the API), b) it is extended beyond the coarse-grained information that is available so far, and c) the ABI is maintained globally (for each OS) and no vendor goes and adds a flag on their own without global agreement.]]
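
A minimal sketch of calling the interface directly, assuming kernel headers new enough to define __NR_riscv_hwprobe and <asm/hwprobe.h> (Linux 6.4 or later); newer glibc also provides a __riscv_hwprobe() wrapper.

    #include <asm/hwprobe.h>    /* struct riscv_hwprobe, RISCV_HWPROBE_* keys */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        struct riscv_hwprobe pairs[] = {
            { .key = RISCV_HWPROBE_KEY_IMA_EXT_0 },   /* extension bitmap */
            { .key = RISCV_HWPROBE_KEY_CPUPERF_0 },   /* misaligned-access performance */
        };

        /* cpusetsize = 0, cpus = NULL: report what holds on all online harts. */
        if (syscall(__NR_riscv_hwprobe, pairs, 2, 0, NULL, 0) != 0) {
            perror("riscv_hwprobe");
            return 1;
        }

        printf("IMA_EXT_0: 0x%llx\n", (unsigned long long)pairs[0].value);
        printf("CPUPERF_0: 0x%llx\n", (unsigned long long)pairs[1].value);
        return 0;
    }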


Feature Selection

[[UD: One of the biggest problems will be selecting the features that are worthwhile and consistent. It makes no sense to specialize for a feature if it exists on different platforms with different performance characteristics. Also, the selection must be minimal: each selectable feature increases the testing matrix significantly and, I expect, also creates dependencies on specific hardware platforms for testing. E.g., there will be a lot of push-back against implementing crypto features like AES and SHA through multiple different ISA extensions; any functionality like that is subject to stringent requirements with regard to testing and certification. This group cannot just rubber-stamp individual hardware vendors' decisions for their respective ISAs, which might not be driven exclusively by the common good. Instead, I expect, vendors want to capitalize on first-mover advantages without standardization, which would slow things down. This must be prevented unless it is clear that other manufacturers are likely to fall in line. The number of participants in the market is just too high. The situation for x86 is different, with just two major participants who over the decades also learned to see the benefit of a common ISA.]]

big.LITTLE Implications

[[UD: CPUs already exist whose cores have different capabilities, and this can create problems. It is nothing specific to RISC-V, but for other architectures it was added after the fact; we should think about it ahead of the problem's introduction. Without consideration being given to it, the use of cores with different capabilities in the same process is problematic. We might need to define something that either prevents the problem from coming up in the first place or handles it transparently.]]