CT_00_050 -- Improve x264 vectorization

About

x264 is a critical benchmark for vectorization in the SPEC suite, showing a roughly 2X improvement across many architectures once vectorization is enabled.  This work item is meant to track further improvements that may be possible in the benchmark through compiler improvements.


  1. SAD optimization
    1. Shorter SADs (sad_x3_8x8) can benefit from strided loads
  2. SATD
    1. Use strided loads to avoid permutations in the first SATD loop
    2. Revisit the profitability of deriving permutation constants from each other using vadd.vi; it may not be needed anymore
    3. Use wider vectors in the 2nd loop.  Smart unrolling seems to be the key here
  3. vaaddu
    1. Designs which flush the pipeline on VXRM assignment may be better off using (a + b + 1) >> 1
    2. Implies expander should probably be conditional on a suitable uarch flag
    3. Designs with good vxrm behavior could probably be using vaaddu more
  4. Revisit conservative vsetvl elimination
    1. Jeff's patch is a reasonable start
    2. Needs to be re-benchmarked
    3. Rather than swapping elements, shift them in the array to perturb the schedule less
    4. May want some degree of freedom, particularly if uarch doesn't handle vsetvl efficiently.
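
For reference, the (a + b + 1) >> 1 form from the vaaddu item can be sketched in scalar C as below. This is only an illustration under stated assumptions, not the expander's actual output: the function names are hypothetical, and the pixel type is assumed to be 8-bit unsigned. The widened form matches what vaaddu computes when vxrm is set to round-to-nearest-up (rnu); the second function shows a well-known equivalent that stays in 8 bits, which may matter if the fallback sequence should avoid widening.

```c
#include <stdint.h>

/* Reference form: widen to 16 bits so a + b + 1 cannot overflow, then shift.
 * This is the value vaaddu produces with vxrm = rnu. */
static uint8_t avg_round_widen(uint8_t a, uint8_t b)
{
    return (uint8_t)(((uint16_t)a + (uint16_t)b + 1u) >> 1);
}

/* Equivalent form that never exceeds 8 bits:
 * a + b = 2 * (a | b) - (a ^ b), so the rounded-up average is
 * (a | b) - ((a ^ b) >> 1). */
static uint8_t avg_round_narrow(uint8_t a, uint8_t b)
{
    return (uint8_t)((a | b) - ((a ^ b) >> 1));
}

/* Hypothetical helper: rounding average of two pixel rows, the loop shape
 * a vectorizer could map either to vaaddu or to the shift-based sequence. */
static void pixel_avg_row(uint8_t *dst, const uint8_t *src1,
                          const uint8_t *src2, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = avg_round_widen(src1[i], src2[i]);
}

/* Exhaustively compare the two forms over all 65536 input pairs and
 * return the number of mismatches. */
static int count_mismatches(void)
{
    int bad = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (avg_round_widen((uint8_t)a, (uint8_t)b) !=
                avg_round_narrow((uint8_t)a, (uint8_t)b))
                bad++;
    return bad;
}
```

Which sequence is preferable on a given design is exactly the uarch-dependent question raised above; the sketch only pins down the arithmetic that both choices must preserve.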


Stakeholders/Partners

RISE:

Ventana: Robin Dapp – Cost model, permutation improvements, etc.  Overall lead

Ventana: Jeff Law – everything scheduling related


External:




Dependencies


Status

Development

IN PROGRESS


Development Timeline

1H2025

Upstreaming

IN PROGRESS


Upstream Version

gcc-16

Spring 2026




Contacts

Jeff Law (Ventana)


Dependencies




Updates


   

  • Project reported as a priority for 1H2025, broken out from original effort

