...
- There is an arithmetic right shift followed by a masking operation in quant_4x4 that can be simplified into a logical right shift, eliminating a small amount of code on a critical path.
- Vector setup and element extraction are likely sub-optimal in the SAD/SATD routines.
- Removal of VIEW_CONVERT_EXPR nodes is likely important
- And optimization of BITFIELD_INSERT_EXPR
- SATD routine, particularly for zvl512b
- Much of this can be fixed by adjusting how we expand the vector setup code
- Robin owns this
- Rearrangement of SLP nodes with multiple occurrences in the same statement to avoid duplicates, with a vec_perm to restore the original ordering, may have as much as a 10% benefit for vectorized x264.
- Additional information from GCC's bug database
- Proposed patch, probably won't go in as-is, but can be used for experimentation
- GCC does not make good use of widening vector ops that overlap source/destination registers. Expectation is this is another 1-2% improvement
- GCC does not hoist vxrm assignments aggressively, which can significantly impact performance if the uarch does not provide fast vxrm access. This is about 2% on the BPI
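
As a rough illustration of the quant_4x4 item above: the exact expression in x264 is not reproduced here, so the sketch below only shows the general identity that makes the simplification valid. The helper names (`shift_then_mask`, `logical_shift`, `check_equivalence`) and the 32-bit width are assumptions chosen for the example, not code from x264 or GCC.

```c
#include <assert.h>
#include <stdint.h>

/* Form 1: arithmetic right shift, then mask off the sign-extended high bits.
   Right-shifting a negative signed value is implementation-defined in C;
   GCC performs an arithmetic shift. */
static inline uint32_t shift_then_mask(int32_t x, unsigned n)
{
    uint32_t keep = UINT32_MAX >> n;   /* mask for the low (32 - n) bits */
    return (uint32_t)(x >> n) & keep;
}

/* Form 2: a single logical right shift on the same bit pattern. */
static inline uint32_t logical_shift(int32_t x, unsigned n)
{
    return (uint32_t)x >> n;
}

/* Compare both forms over a few sample values and every shift count 0..31. */
static void check_equivalence(void)
{
    static const int32_t samples[] = {
        0, 1, -1, 12345, -12345, INT32_MAX, INT32_MIN
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        for (unsigned n = 0; n < 32; n++)
            assert(shift_then_mask(samples[i], n) == logical_shift(samples[i], n));
}
```

The mask only clears the bits that the arithmetic shift filled with copies of the sign bit, and those are exactly the bits a logical shift fills with zeros, so the two forms agree and the second needs one fewer operation.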
...