CT_00_032 -- xz vectorization using "early break" vectorization
About
Early break vectorization is a method for vectorizing loops which scan memory, breaking from the scan loop when a particular condition is met. Traditionally these vectorization opportunities have been recognized with loop idiom recognizers such as rawmemchr. Early break vectorization allows these loops to be vectorized in a more generic way without writing a custom loop recognizer. It is expected this can be used to vectorize the key search loop in the xz benchmark from spec2017.
While Ventana has a custom solution in place, that's really just meant for evaluation, not integration. We've used this work item to track that custom solution's progress. We have a distinct project in 2025 for the more general approach being taken in the vectorizer.
Stakeholders/Partners
RISE:
Ventana: Robin Dapp
Ventana: Jeff Law
SiFive: Craig Topper
External:
Linaro: Alex Coplan, Tamar Christiana
Dependencies
Status
Updates
- Winding down this phase. Ventana's custom solution appears to work well and indicates the more general solution should prove quite profitable. Reach out to Robin or Jeff to get a copy of the custom solution if you would like it.
- Robin has made minor adjustments to the code generation for Ventana's internal version of the xz loop vectorization
- Remove the use of fault-only-first loads and use standard loads instead
- Remove the VL CSR read from the loop (no longer needed after removing fault-on-first loads)
- Fast path using scalar when the number of elements is small
- This is showing a roughly 5% improvement on xz inputs 2 & 3 on Ventana's uarch
- It's still our belief that the early break optimization is the best way to handle this going forward and Ventana will drop its custom recognizer once the early break optimization is working reasonably well.
- Work on "early break" optimization continues upstream
- After further evaluation Robin and Jeff believe that we do not need to use Fault-only-First loads as there is no risk of overrun
- Should simplify the implementation and remove need to expose fault-only-first load semantics generically to GCC
- May allow an implementation to land in time for gcc-15
- No need to worry about uarchs that break down a FOF load at page boundaries (simplifies hardware)
- Allows removal of VL read from loop, which can help some uarchs
- Within Ventana Robin has built a custom recognizer (roughly similar to LLVM's approach) that we're testing on design
- It appears that a fast path using only scalar when the loop trip count is low is useful
- vsetvl hoisting out of the loop does not look feasible, nor does it appear that doing so would help performance
- The key search loop in xz can be vectorized. The compiler team at ARM is currently working on the general form of this optimization
- Ventana has written a custom recognizer & code generation for benchmarking purposes.
- Very clear win on the BPI.
- But also sensitive to VL CSR read latency
- Project added as priority for 2H 2024.