~~There is an arithmetic right shift followed by a masking operation in quant_4x4 that can be simplified into a logical right shift eliminating a small amount of code on a critical path.~~
Vector setup and element extraction are likely sub-optimal in the SAD/SATD routines.
1. Removing VIEW_CONVERT_EXPR nodes is likely important
2. Optimizing BITFIELD_INSERT_EXPR is likely important as well
Rearranging SLP nodes that occur multiple times in the same statement to avoid duplicates, with a vec_perm to restore the original ordering, may yield as much as a 10% benefit for vectorized x264.
1. ~~Additional information from GCC's bug database~~
2. ~~Proposed patch, probably won't go in as-is, but can be used for experimentation~~
GCC does not make good use of widening vector ops that overlap source/destination registers. Fixing this is expected to yield another 1-2% improvement.
GCC does not hoist vxrm assignments aggressively, which can significantly impact performance if the uarch does not provide fast vxrm access. The cost is about 2% on the BPI.
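
The shift-and-mask simplification in the first item of the list above can be illustrated with a minimal C sketch. The shift count (16), the mask (0xFFFF), and the function names are assumptions chosen for illustration; they are not taken from the actual quant_4x4 code. The point is only that when the mask keeps exactly the bits that a logical shift would leave, the arithmetic shift plus mask collapses into a single logical shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "before" form: arithmetic right shift, then a mask that
 * strips the sign-extension bits. Right-shifting a negative signed value
 * is implementation-defined in C; GCC implements it as an arithmetic shift. */
static inline uint32_t shift_then_mask(int32_t x)
{
    return (uint32_t)(x >> 16) & 0xFFFFu;
}

/* Hypothetical "after" form: one logical right shift on the unsigned value.
 * The upper 16 bits of the result are already zero, so no mask is needed. */
static inline uint32_t logical_shift(int32_t x)
{
    return (uint32_t)x >> 16;
}

/* Returns nonzero when both forms produce the same result for x. */
static inline int shifts_agree(int32_t x)
{
    return shift_then_mask(x) == logical_shift(x);
}
```

The equivalence holds only because the mask width (16 bits) equals the word width minus the shift count; with a narrower mask the masking step would still be required.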

Stakeholders/Partners

RISE:

Ventana: Robin Dapp

Ventana: Jeff Law – currently looking at vxrm hoisting

VRULL: Manolis Tsamis

...

Page Properties

Development

Status

Development Timeline

2H2024

Upstreaming

Status

Upstream Version

gcc-15

Spring 2025

Contacts

Jeff Law (Ventana)

Dependencies

14 Aug 2024

VRULL's patch has been upstreamed and we're seeing the desired vectorization for the other loop in the SATD routines
Noted two further improvement opportunities: register overlap with widening ops, and problems with vxrm hoisting

05 Jun 2024

...