Banana-PI F3 (Spacemit K1/M1) Performance Notes
RISE takes no position on endorsing any given RISC-V implementation and encourages vendors to produce optimization guides for their chips. In the absence of such documentation, we are attempting to gather performance information that may be useful to software developers, particularly compiler developers and those writing vector code by hand. To that end, this page documents notable performance characteristics of the Spacemit K1/M1 core found in the Banana-PI F3 and other systems.
VXRM assignments:
The vxrm register holds the current fixed-point rounding mode. It is used in routines such as pixel_avg within x264 to help implement the ceiling halfword average idiom with vaaddu. Setting vxrm may be an expensive operation on some micro-architectures, including the K1/M1. Compilers try to minimize the number of vxrm assignments using lazy code motion/partial redundancy elimination (LCM/PRE) techniques. However, those techniques are limited because they never perform speculative code motion. In pixel_avg, LCM/PRE hoists the vxrm setting from the innermost loop to the outer loop; hoisting it further is not possible without some degree of speculation. Note, however, that the innermost loop is fully vectorized and does not actually need to iterate. The result is still a 1:1 ratio of vxrm sets to vaaddu instructions, with minimal separation between each vxrm set and its use by vaaddu.
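For readers unfamiliar with the idiom, the following is a minimal scalar sketch of a pixel_avg-style loop nest. It is not x264's actual code: the function names, the 8-bit pixel type, and the parameter list are assumptions made for illustration. The inner loop is the part a vectorizer can map to vaaddu with vxrm set to round-to-nearest-up (rnu).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounding-up average of two 8-bit pixels: ceil((a + b) / 2).
 * Integer promotion widens a and b to int, so the "+ 1" cannot overflow. */
uint8_t avg_round_up_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Hypothetical pixel_avg-style kernel (illustrative, not x264's code). */
void pixel_avg_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                      const uint8_t *src1, ptrdiff_t src1_stride,
                      const uint8_t *src2, ptrdiff_t src2_stride,
                      int width, int height)
{
    assert(width >= 0 && height >= 0);
    for (int y = 0; y < height; y++) {
        /* Inner loop: candidate for vaaddu, which depends on vxrm.
         * For small widths a single vector operation covers the whole
         * row, so after vectorization this loop does not iterate. */
        for (int x = 0; x < width; x++)
            dst[y * dst_stride + x] =
                avg_round_up_u8(src1[y * src1_stride + x],
                                src2[y * src2_stride + x]);
    }
}
```

With the vxrm write placed in the outer loop body, each row still performs one vxrm write and one vaaddu, which is the 1:1 ratio described above.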
Writes to vxrm are expected to be rare in general, and code that needs multiple distinct vxrm settings is also expected to be rare. The simple approach being explored in GCC is therefore to recognize cases where vxrm can be treated as a function invariant rather than just a loop invariant, and in those cases to set vxrm in the function prologue rather than at the placement points computed by LCM/PRE. This makes the vxrm setting somewhat speculative, but in practice the speculation is always profitable on the x264 benchmark. Such speculation is expected to be profitable in the vast majority of cases where the compiler generates vxrm sets to support instructions like vaaddu.
Initial benchmarking shows that this speculation improves the performance of the x264 refrate benchmark in SPEC CPU 2017 by 2% on the Banana-PI F3 board.
Both GCC and LLVM have code that can aggressively hoist vxrm assignments. It may be worth experimenting to see whether a sequence that does not use vaaddu is ultimately faster on some uarchs and, if so, submitting patches that make the use of vaaddu conditional on uarch-specific tuning values.
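As one possible starting point for such an experiment, the sketch below shows a well-known bitwise identity that computes the same rounding-up average without widening and without any rounding mode. The function names are hypothetical, and whether the corresponding vector sequence (an or, an xor, a shift, and a subtract) beats vaaddu plus a vxrm write on any given uarch is unmeasured here.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: widen, add, round up, narrow. */
uint8_t avg_ref_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b + 1u) >> 1);
}

/* Same result with no widening and no rounding mode.
 * Since a + b == 2 * (a & b) + (a ^ b) and (a & b) + (a ^ b) == (a | b),
 * ceil((a + b) / 2) == (a | b) - ((a ^ b) >> 1). */
uint8_t avg_no_vxrm_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a | b) - ((a ^ b) >> 1));
}

/* Exhaustively check that both forms agree for every pair of 8-bit inputs. */
void check_avg_equivalence(void)
{
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            assert(avg_ref_u8((uint8_t)a, (uint8_t)b) ==
                   avg_no_vxrm_u8((uint8_t)a, (uint8_t)b));
}
```

The bitwise form needs more instructions per element, so it would only be a candidate where the cost of writing vxrm dominates.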
Multiplies:
It has been reported that the K1/M1 chip handles 32-bit integer multiplies roughly 3X faster than 64-bit multiplies. If dataflow analysis can determine that the upper 32 bits of the product are not needed, so that a mulw instruction is sufficient, then using mulw can be a highly profitable transformation.
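The reasoning behind this transformation is that the low 32 bits of a product depend only on the low 32 bits of the operands. The sketch below illustrates that equivalence in C; the function names are hypothetical, and the instructions actually selected depend on the compiler, which may already narrow the first form on its own.

```c
#include <assert.h>
#include <stdint.h>

/* Full 64-bit multiply followed by truncation to 32 bits. */
uint32_t mul_low32_wide(uint64_t a, uint64_t b)
{
    return (uint32_t)(a * b);
}

/* Truncate first, then multiply in 32 bits. On RV64 this form is a
 * natural fit for mulw, since only the low 32 bits are computed. */
uint32_t mul_low32_narrow(uint64_t a, uint64_t b)
{
    return (uint32_t)a * (uint32_t)b;
}

/* Check that both forms agree on a handful of sample operands. */
void check_mul_equivalence(void)
{
    static const uint64_t samples[] = {
        0u, 1u, 3u, 0xFFFFFFFFu, 0x100000003ull,
        0x123456789ABCDEF0ull, 0xFFFFFFFFFFFFFFFFull
    };
    const int n = (int)(sizeof samples / sizeof samples[0]);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            assert(mul_low32_wide(samples[i], samples[j]) ==
                   mul_low32_narrow(samples[i], samples[j]));
}
```

In hand-written code, keeping values in 32-bit types where the range allows gives the compiler the best chance of choosing the narrower multiply.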