CT_00_035 -- Improve x264 vectorization

About

x264 is a critical benchmark for vectorization in the SPEC suite, showing a roughly 2X improvement across many architectures once vectorization is enabled.  This work item is meant to track further improvements that may be possible in the benchmark through compiler improvements.


  1. There is an arithmetic right shift followed by a masking operation in quant_4x4 that can be simplified into a logical right shift, eliminating a small amount of code on a critical path.
  2. Vector setup and element extraction is likely sub-optimal in the SATD routine, particularly for zvl512b
    1. New vector permute cases need to be pushed upstream
    2. Manolis's patch to optimize permutes feeding permutes actually generates worse code for RISC-V and will become a problem soon
    3. Improve code for permute using merge+masking
  3. Rearranging SLP nodes that occur multiple times in the same statement to avoid duplicates, with a vec_perm to restore the original ordering, may have as much as a 10% benefit for vectorized x264.
    1. Additional information from GCC's bug database
    2. Integrated upstream
  4. GCC does not make good use of widening vector ops that overlap source/destination registers.  The expectation is that fixing this is worth another 1-2% improvement.
  5. Strongly recommend a good uarch description for the scheduler.  We (Ventana) are seeing 30%+ improvements in various pico benchmarks derived from x264 by good scheduling
  6. GCC is emitting an unnecessary load of 0 into a GPR before emitting vmv.s.x to initialize the accumulator in reduction ops.  We should use x0 instead.
  7. Small constants can be splatted across a vector without needing to load the constant into a GPR first using vmv.v.i.
  8. GCC should eliminate vsetvl instructions by better grouping vector instructions using the same vector configuration
    1. Must be done very conservatively so as not to otherwise perturb the schedule causing data or functional hazards
    2. Thinking is to do this by tracking last vector configuration in the scheduler then reordering within the highest priority ready insns to prefer a vector instruction with the same vector configuration.  Seems to save about 1% on x264 from an icount standpoint.
    3. Prototype for this has been posted.  Improves dynamic instruction counts, but actually performs worse at least on Ventana's design
      1. Likely points to a costing problem since it should only be rearranging instructions with the same predicted cost
  9. GCC does not hoist vxrm assignments aggressively, which can significantly impact performance if the uarch does not provide fast vxrm access.   This is about 2% on the BPI
    1. Internal implementation done, needs upstreaming
  10. SAD optimization
    1. Expose to the vectorizer that element-aligned (but not vector-aligned) loads/stores are both safe and performant (-mno-vector-strict-align)
    2. Increase minimum vector length.  Default ZVL is 128.  Increasing to 256 or 512 helps
    3. Combining items 1 and 2 results in 16 lane operations in an unrolled loop with a single vredsum
    4. Currently exploring whether a strided load can increase this to 32 lanes per vector op, and whether doubling the throughput of the vector code offsets the cost of the more complex vector load
    5. Investigating if a SAD instruction (similar to x86 and aarch64) would help
    6. First pass scheduling is not getting the vector loads out of the way fast enough.  This is expected to be a small change to the mapping of insns to specific types for the scheduler
  11. SATD
    1. Manolis's code allows vectorization of the first loop and seems to be performing reasonably well.
    2. Manolis's permutation optimization was a step in the wrong direction, ultimately generating an expensive permute rather than one that was relatively inexpensive
    3. Unclear if Manolis's followup pass to further optimize active lanes in an SLP group will help reduce cost of 10-b.
    4. Scheduling will be important for this routine as well.
    5. Dual ALUs in the uarch is seen as particularly important
    6. It is possible to derive permutation constants from each other with vadd.vi instructions. 
  12. sub4x4_dct
    1. Store-forward-bypass issues likely to trigger here with simple vector store, feeding a wider segmented load
      1. Can avoid using more permutes, which have their own concerns
      2. Segmented loads/stores may not be performant on all uarchs
    2. Improving various permutation patterns may help some here
    3. Unclear if vectorization will help here, especially if scalar uarch is high performing
  13. quant4x4
    1. Benefits greatly from zicond and if-conversion
    2. Unclear if vectorization will help here, especially if scalar uarch is high performing
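
The shift simplification in item 1 of the list above rests on a standard identity: an arithmetic right shift followed by a mask that clears the sign-extended upper bits produces the same value as a single logical right shift. The following is a minimal sketch of that identity only, not the actual quant_4x4 source, and the exact pattern in x264 may differ. The function names are hypothetical, and the code assumes that `>>` on a negative signed value is an arithmetic shift, which is implementation-defined in C but is what GCC does.

```c
#include <stdint.h>

/* Hypothetical helper: arithmetic right shift, then mask off the
   sign-extended upper bits so only the low (32 - n) bits remain.
   Assumes 0 < n < 32 and that >> on a negative int32_t is an
   arithmetic shift, as GCC implements it. */
static uint32_t shift_then_mask(int32_t x, unsigned n)
{
    uint32_t mask = UINT32_MAX >> n;   /* keeps the low 32 - n bits */
    return (uint32_t)(x >> n) & mask;
}

/* Equivalent form: a single logical right shift of the same bits. */
static uint32_t logical_shift(int32_t x, unsigned n)
{
    return (uint32_t)x >> n;
}
```

Both functions return the same value for every `x` and every `n` in range, so the compiler can drop the mask whenever it can prove the mask covers exactly the bits a logical shift would leave.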


Stakeholders/Partners

RISE:

Ventana: Robin Dapp – Cost model, permutation improvements, etc.  Overall lead

Ventana: Raphael Zinsly – currently looking at wide SADs

Ventana: Jeff Law – everything scheduling related


External:

VRULL: Manolis Tsamis



Dependencies


Status

Development

IN PROGRESS


Development Timeline

2H2024

Upstreaming

IN PROGRESS


Upstream Version

gcc-15

Spring 2025




Contacts

Jeff Law (Ventana)




Updates

 

  • Code to adjust mapping from instruction to scheduling type has been posted upstream and is expected to integrate shortly.
  • A patch to derive one permutation constant from another (SATD related) has been posted upstream.
  • Various permutation adjustments have been posted upstream.
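
The idea of deriving one permutation constant from another can be modeled in scalar C. This is a sketch under stated assumptions, not the posted patch: `add_imm_lanes` stands in for vadd.vi (which adds a 5-bit signed immediate, -16..15, to every element) and `permute` stands in for a register gather, assuming all indices are in range. Both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vadd.vi: add a small immediate to every lane.
   The real instruction encodes a 5-bit signed immediate (-16..15). */
static void add_imm_lanes(uint8_t *dst, const uint8_t *src, size_t n, int imm)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(src[i] + imm);
}

/* Scalar model of a gather-style permute: dst[i] = src[idx[i]].
   The caller must ensure every index is within the bounds of src. */
static void permute(int16_t *dst, const int16_t *src,
                    const uint8_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[idx[i]];
}
```

For example, if the index vector {0, 2, 4, 6} selecting the even elements is already in a register, the vector {1, 3, 5, 7} selecting the odd elements is the same vector plus 1, so it can be produced with one add instead of a second constant load.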

 

  • Code to aggressively hoist VXRM assignments has been integrated
  • Manolis's code for SLP grouping seems to be behaving much better now
  • A vector SAD instruction can dramatically help the most important loops in x264 from spec
  • Scheduling is problematic, but looks to be solvable.
  • Several bugfixes/enhancements from Ventana should be landing upstream soon
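
For reference, the SAD (sum of absolute differences) computation discussed on this page has the following general shape. This is a generic scalar sketch, not x264's own routine, which differs in detail; the function and parameter names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over a width x height block of 8-bit
   pixels.  stride1 and stride2 are the distances in bytes between
   consecutive rows of each block. */
static int sad_block(const uint8_t *pix1, int stride1,
                     const uint8_t *pix2, int stride2,
                     int width, int height)
{
    int sum = 0;
    for (int y = 0; y < height; y++) {
        /* This inner loop is the part a vectorizer turns into lane-wise
           absolute differences followed by a reduction. */
        for (int x = 0; x < width; x++)
            sum += abs(pix1[x] - pix2[x]);
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}
```

With 16-pixel-wide rows, each row maps naturally onto one 16-lane vector operation, and the per-row sums can be accumulated in a vector and reduced once at the end.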

 

  • Code to improve loading 0 into the 0th element of a vector has been integrated
  • Code to improve splatting small constants -16..15 across a vector has been integrated
  • Note how we could eliminate more vsetvl instructions
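
The vsetvl idea can be sketched as a tie-breaking rule in the scheduler's pick step: track the last vector configuration issued, and among the ready instructions tied for the highest priority, prefer one that needs the same configuration. The code below is a hypothetical model under that assumption, not GCC's scheduler code; `struct ready_insn`, `pick_next`, and the integer configuration ids are all invented for illustration.

```c
#include <stddef.h>

/* Hypothetical model of a ready instruction; not a GCC data structure. */
struct ready_insn {
    int priority;   /* higher is more urgent */
    int vconfig;    /* id of the required vector configuration, 0 = none */
};

/* Return the index of the insn to issue next, or -1 if the list is empty.
   Only insns tied for the highest priority are candidates, so the
   priority order is never violated; among those, prefer one whose
   vector configuration matches the last one issued. */
static int pick_next(const struct ready_insn *ready, size_t n,
                     int last_vconfig)
{
    if (n == 0)
        return -1;

    /* Find the first insn with the highest priority. */
    int best = 0;
    for (size_t i = 1; i < n; i++) {
        if (ready[i].priority > ready[best].priority)
            best = (int)i;
    }

    /* Among insns tied with it, prefer a matching configuration. */
    if (last_vconfig != 0) {
        for (size_t i = 0; i < n; i++) {
            if (ready[i].priority == ready[best].priority
                && ready[i].vconfig == last_vconfig)
                return (int)i;
        }
    }
    return best;
}
```

Restricting the choice to instructions of equal priority is what keeps the change conservative: the reordering never promotes a lower-priority instruction just to avoid a vsetvl.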

 

  • Note various performance issues found and paths of investigation.

  • VRULL's patch has been upstreamed and we're seeing desired vectorization for the other loop in the SATD routines
  • Note overlapping with widening ops and problems with vxrm hoisting improvement opportunities

 

  • Project reported as a priority for 2H2024, broken out from original effort