About
x264 is a critical benchmark for vectorization in the SPEC suite, showing a roughly 2x improvement across many architectures once vectorization is enabled. This work item is meant to track further improvements that may be possible in the benchmark through compiler improvements.
- There is an arithmetic right shift followed by a masking operation in quant_4x4 that can be simplified into a logical right shift, eliminating a small amount of code on a critical path.
- Vector setup and element extraction are likely sub-optimal in the SATD routine, particularly for zvl512b
  - Much of this can be fixed by adjusting how we expand the vector setup code
  - Robin owns this
- Rearranging SLP nodes that occur multiple times in the same statement to avoid duplicates, with a vec_perm to restore the original ordering, may have as much as a 10% benefit for vectorized x264.
  - Additional information from GCC's bug database
  - Proposed patch, probably won't go in as-is, but can be used for experimentation
- GCC does not make good use of widening vector ops that overlap source/destination registers. The expectation is that fixing this is worth another 1-2%
- GCC does not hoist vxrm assignments aggressively, which can significantly impact performance if the uarch does not provide fast vxrm access. This is worth about 2% on the BPI
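The shift-and-mask simplification noted for quant_4x4 can be illustrated with a small scalar sketch. This is not the x264 source: the function names, the 32-bit operand width, the shift amount of 16, and the 0xffff mask are all assumptions chosen for illustration. The point is only that when the mask keeps exactly the bits a logical shift would keep, the sign bits brought in by the arithmetic shift are discarded anyway, so one logical shift gives the same result.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: names, width, shift amount and mask are assumptions,
   not taken from x264's quant_4x4. */

/* Two-operation form: arithmetic right shift, then mask to the low 16 bits.
   Right-shifting a negative signed value is implementation-defined in C;
   GCC performs an arithmetic (sign-propagating) shift. */
static uint32_t shift_then_mask(int32_t x)
{
    return (uint32_t)(x >> 16) & 0xffffu;
}

/* One-operation form: logical right shift of the same bit pattern.
   The upper 16 result bits are zero, so no separate mask is needed. */
static uint32_t logical_shift(int32_t x)
{
    return (uint32_t)x >> 16;
}

/* Returns 1 when both forms agree for x, 0 otherwise. */
static int shift_forms_agree(int32_t x)
{
    return shift_then_mask(x) == logical_shift(x);
}
```

Calling shift_forms_agree on negative inputs such as -1 or INT32_MIN is the interesting case, since that is where the arithmetic shift fills the upper bits with ones before the mask clears them again.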
Stakeholders/Partners
RISE:
Ventana: Robin Dapp
Ventana: Jeff Law – currently looking at vxrm hoisting
External:
VRULL: Manolis Tsamis
Dependencies
Status
Updates
- VRULL's patch has been upstreamed and we're seeing desired vectorization for the other loop in the SATD routines
- Noted register overlap with widening ops and vxrm hoisting as improvement opportunities
- Project reported as a priority for 2H2024, broken out from original effort