...
There is an arithmetic right shift followed by a masking operation in quant_4x4 that can be simplified into a logical right shift eliminating a small amount of code on a critical path.- Vector setup and element extraction is likely sub-optimal in the SAD/SATD routines.
- Removal of VIEW_CONVERT_EXPR nodes is likely important
- And optimization of BITFIELD_INSERT_EXPR
Rearrangement of SLP nodes with multiple occurrences in the in the same statement to avoid duplicates with a vec_perm to restore the original ordering may have as much as a 10% benefit for vectorized x264.Additional information from GCC's bug databaseProposed patch, probably won't go in as-is, but can be used for experimentation
- GCC does not make good use of widening vector ops that overlap source/destination registers. Expectation is this is another 1-2% improvement
- GCC does not hoist vxrm assignments aggressively, which can significantly impact performance if the uarch does not provide fast vxrm access. This is about 2% on the BPI
Stakeholders/Partners
RISE:
Ventana: Robin Dapp
Ventana: Jeff Law – currently looking at vxrm hoisting
External:
VRULL: Manolis Tsamis
...
Page Properties | ||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
Updates
- VRULL's patch has been upstreamed and we're seeing desired vectorization for the other loop in the SATD routines
- Note overlapping with widening ops and problems with vxrm hoisting improvement opportunities
- Project reported as a priority for 2H2024, broken out from original effort
...