...

  1. There is an arithmetic right shift followed by a masking operation in quant_4x4 that can be simplified into a logical right shift, eliminating a small amount of code on a critical path.
  2. Vector setup and element extraction are likely sub-optimal in the SATD routine, particularly for zvl512b.
    1. Much of this can be fixed by adjusting how we expand the vector setup code
    2. Robin owns this. New vector permute cases need to be pushed upstream
    3. Manolis's patch to optimize permutes feeding permutes actually generates worse code for RISC-V and will be a problem soon
    4. Improve code for permute using merge+masking
  3. Rearranging SLP nodes that occur multiple times in the same statement to avoid duplicates, with a vec_perm to restore the original ordering, may have as much as a 10% benefit for vectorized x264.
    1. Additional information from GCC's bug database
    2. Proposed patch, probably won't go in as-is, but can be used for experimentation. Integrated upstream
  4. GCC does not make good use of widening vector ops that overlap source/destination registers. Expectation is this is another 1-2% improvement
  5. GCC does not hoist vxrm assignments aggressively, which can significantly impact performance if the uarch does not provide fast vxrm access. This is about 2% on the BPI
    1. Internal implementation done, needs upstreaming
  6. SAD optimization
    1. Expose to the vectorizer that element-aligned (but not vector-aligned) loads/stores are safe and performant (-mno-vector-strict-align)
    2. Increase minimum vector length. Default ZVL is 128. Increasing to 256 or 512 helps
    3. The combination of the two items above results in 16-lane operations in an unrolled loop with a single vredsum
    4. Currently exploring whether a strided load can increase this to 32 lanes per vector op, and whether doubling the throughput of the vector code offsets the cost of the more complex vector load
    5. Investigating whether a SAD instruction (similar to x86 and aarch64) would help
  7. sub4x4_dct
    1. Store-forward-bypass issues likely to trigger here with simple vector store, feeding a wider segmented load
    2. uarch behavior in that case may be critical
    3. Can avoid using more permutes, which have their own concerns
    4. Segmented loads/stores may not be performant on all uarchs
    5. Unclear if vectorization will help here, especially if scalar uarch is high performing
  8. quant4x4
    1. Benefits greatly from zicond and if-conversion
    2. Unclear if vectorization will help here, especially if scalar uarch is high performing
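
As a rough illustration of items 1 and 6 above, the following C sketch shows the kind of code involved. It is not taken from x264 or GCC: the shift count of 16 and the 0xffff mask are assumed values chosen so that the mask keeps exactly the bits a logical shift would produce, and the function names are hypothetical. The SAD loop follows the usual scalar shape of a 16x16 sum-of-absolute-differences kernel, where each row is 16 bytes and rows are generally only element aligned.

```c
#include <stdint.h>
#include <stdlib.h>

/* Item 1 (assumed shape): arithmetic right shift, then mask to 16 bits.
   Right-shifting a negative signed value is implementation-defined in C;
   GCC implements it as an arithmetic shift. */
static uint32_t shift_then_mask(int32_t x)
{
    return (uint32_t)(x >> 16) & 0xffffu;
}

/* Equivalent form: a single logical right shift. The mask above only
   discards the sign-extension bits, which a logical shift never creates. */
static uint32_t logical_shift(int32_t x)
{
    return (uint32_t)x >> 16;
}

/* Item 6: scalar 16x16 SAD. Each row is 16 contiguous bytes, so a
   vectorized inner loop works on 16 lanes; the row start is only
   guaranteed to be byte (element) aligned. */
static int sad_16x16(const uint8_t *pix1, intptr_t stride1,
                     const uint8_t *pix2, intptr_t stride2)
{
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(pix1[x] - pix2[x]);
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}
```

To try the settings from item 6 on such a loop, one plausible invocation is a RISC-V GCC with -O3, -march=rv64gcv_zvl256b and -mno-vector-strict-align; the exact option set for a given toolchain is an assumption here and should be checked against that toolchain's documentation.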


Stakeholders/Partners

RISE:

...

Page Properties


Development: IN PROGRESS

Development Timeline: 2H2024

Upstreaming: IN PROGRESS

Upstream Version: gcc-15, Spring 2025

Contacts: Jeff Law (Ventana)


Dependencies





Updates

 

  • Note various performance issues found and paths of investigation.

  • VRULL's patch has been upstreamed and we're seeing desired vectorization for the other loop in the SATD routines
  • Note two improvement opportunities: overlapping registers with widening ops, and vxrm hoisting

...