About

x264 is a critical benchmark for vectorization in the SPEC suite, showing a roughly 2x improvement across many architectures once vectorization is enabled. This work item is meant to track further improvements that may be possible in the benchmark through compiler improvements.

SAD optimization
1. Shorter SADs (sad_x3_8x8) can benefit from strided loads
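As a rough illustration of why strided loads could help short SADs: an 8x8 block has only 8 bytes per row, so a row-by-row loop leaves much of a vector register idle. The sketch below is plain scalar C that models the idea by gathering each 8-byte row as one element of a strided access, then reducing all 64 bytes at once. The function names and signatures are hypothetical simplifications, not x264's actual sad_x3_8x8 interface, and the gather helper only stands in for what a strided vector load with 64-bit elements would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Row-by-row reference: eight separate 8-byte row accesses per block. */
static int sad_8x8_rows(const uint8_t *a, ptrdiff_t a_stride,
                        const uint8_t *b, ptrdiff_t b_stride)
{
    int sum = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sum += abs(a[y * a_stride + x] - b[y * b_stride + x]);
    return sum;
}

/* Hypothetical helper modeling a strided load with 64-bit elements:
   each 8-byte row becomes one element, so a single access packs the
   whole 8x8 block into 64 contiguous bytes. */
static void gather_rows_8x8(uint8_t dst[64], const uint8_t *src,
                            ptrdiff_t stride)
{
    assert(dst != NULL && src != NULL);
    for (int y = 0; y < 8; y++)
        memcpy(dst + 8 * y, src + y * stride, 8);
}

/* Same SAD, computed on the packed blocks with one flat reduction. */
static int sad_8x8_strided(const uint8_t *a, ptrdiff_t a_stride,
                           const uint8_t *b, ptrdiff_t b_stride)
{
    uint8_t va[64], vb[64];
    int sum = 0;

    gather_rows_8x8(va, a, a_stride);
    gather_rows_8x8(vb, b, b_stride);
    for (int i = 0; i < 64; i++)
        sum += abs(va[i] - vb[i]);
    return sum;
}
```

Both functions return the same value for any input; the difference is only in how many memory accesses and how much per-row overhead a vectorized form would need.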
SATD
1. Use strided loads to avoid permutations in the first SATD loop
2. Revisit the profitability of deriving permutation constants from each other using vadd.vi; it may not be needed anymore
3. Use wider vectors in the 2nd loop. Smart unrolling seems to be the key here
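For orientation, a minimal scalar SATD on a 4x4 block is sketched below. It assumes the "first loop" and "2nd loop" above correspond to the row pass (load pixel differences, horizontal Hadamard) and the column pass (vertical Hadamard, absolute sum). This is a plain unpacked form written for illustration, not x264's implementation, and the names hadamard4 and satd_4x4 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* 4-point Hadamard butterfly: reads s[0..3], writes d[0..3]. */
static void hadamard4(int d[4], const int s[4])
{
    int t0 = s[0] + s[1], t1 = s[0] - s[1];
    int t2 = s[2] + s[3], t3 = s[2] - s[3];
    d[0] = t0 + t2;
    d[2] = t0 - t2;
    d[1] = t1 + t3;
    d[3] = t1 - t3;
}

static int satd_4x4(const uint8_t *p1, ptrdiff_t s1,
                    const uint8_t *p2, ptrdiff_t s2)
{
    int tmp[4][4];
    int sum = 0;

    assert(p1 != NULL && p2 != NULL);

    /* First loop: per-row pixel differences and horizontal transform.
       The pixel accesses here step through memory by the row stride. */
    for (int y = 0; y < 4; y++) {
        int diff[4];
        for (int x = 0; x < 4; x++)
            diff[x] = p1[y * s1 + x] - p2[y * s2 + x];
        hadamard4(tmp[y], diff);
    }

    /* Second loop: vertical transform per column, then absolute sum. */
    for (int x = 0; x < 4; x++) {
        int col[4] = { tmp[0][x], tmp[1][x], tmp[2][x], tmp[3][x] };
        int out[4];
        hadamard4(out, col);
        sum += abs(out[0]) + abs(out[1]) + abs(out[2]) + abs(out[3]);
    }
    return sum >> 1;
}
```

The second loop only touches four values per column, which is one way to see why wider vectors there depend on processing several blocks or columns together.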
vaaddu
1. Designs that flush the pipeline on VXRM writes may be better off using (a + b + 1) >> 1
2. This implies the expander should probably be conditional on a suitable uarch flag
3. Designs with good VXRM behavior could probably use vaaddu more
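The trade-off above is between two ways of computing the same rounded average. A minimal scalar sketch, assuming 8-bit pixels and the round-to-nearest-up (rnu) rounding mode: vaaddu with VXRM set to rnu forms the sum at higher precision, adds the rounding increment, and shifts right by one, which matches the open-coded (a + b + 1) >> 1. The function names avg_rnu and pixel_avg_row are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of an unsigned averaging add with round-to-nearest-up:
   the sum is formed in a wider type so the carry bit is not lost. */
static uint8_t avg_rnu(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Open-coded form as it might appear in a pixel-averaging loop.
   When vectorized without vaaddu, this needs a widening add, an add
   of 1, and a narrowing shift, but no write to VXRM. */
static void pixel_avg_row(uint8_t *dst, const uint8_t *a,
                          const uint8_t *b, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}
```

Which form is cheaper depends on the cost of the VXRM write relative to the extra widening and narrowing work, which is why a uarch-specific choice seems plausible.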
Revisit conservative vsetvl elimination
1. Jeff's patch is a reasonable start
2. Needs to be re-benchmarked
3. Rather than swapping elements, shift them in the array to perturb the schedule less
4. May want some degree of freedom, particularly if the uarch doesn't handle vsetvl efficiently

Stakeholders/Partners

RISE:

Ventana: Robin Dapp – Cost model, permutation improvements, etc. Overall lead

Ventana: Jeff Law – everything scheduling related

External:

Dependencies

Status

Development	IN PROGRESS
Development Timeline	1H2025
Upstreaming	IN PROGRESS
Upstream Version	gcc-16 Spring 2026
Contacts	Jeff Law (Ventana)
Dependencies

Updates

03 Feb 2025

Project reported as a priority for 1H2025, broken out from original effort

CT_00_050 -- Improve x264 vectorization