...
- There is an arithmetic right shift followed by a masking operation in quant_4x4 that can be simplified into a logical right shift, eliminating a small amount of code on a critical path
- Vector setup and element extraction are likely sub-optimal in the SATD routine, particularly for zvl512b
- Much of this can be fixed by adjusting how we expand the vector setup code
- Robin owns this
- New vector permute cases need to be pushed upstream
- Manolis's patch to optimize permutes feeding permutes actually generates worse code for RV, will be a problem soon
- Improve code for permute using merge+masking
- Rearrangement of SLP nodes with multiple occurrences in the same statement to avoid duplicates, with a vec_perm to restore the original ordering, may have as much as a 10% benefit for vectorized x264
- Additional information from GCC's bug database
- Proposed patch, probably won't go in as-is, but can be used for experimentation
- Integrated upstream
- GCC does not make good use of widening vector ops that overlap source/destination registers. Expectation is this is another 1-2% improvement
- GCC does not hoist vxrm assignments aggressively, which can significantly impact performance if the uarch does not provide fast vxrm access. This is about 2% on the BPI
- Internal implementation done, needs upstreaming
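The quant_4x4 shift-plus-mask simplification noted above can be illustrated in scalar C. This is a generic sketch (the shift amount and mask are illustrative, not taken from x264): when an arithmetic right shift is followed by a mask that discards exactly the sign-extended high bits, the two operations collapse into one logical shift.

```c
#include <stdint.h>

/* Arithmetic shift + mask: bits [31:28] of the result (sign copies)
 * are cleared by the mask, leaving bits [31:4] of v in [27:0]. */
uint32_t shift_then_mask(int32_t v)
{
    return (uint32_t)(v >> 4) & 0x0fffffff;
}

/* Logical shift alone: zero-fills the top 4 bits, producing the
 * identical bit pattern with one fewer instruction. */
uint32_t logical_shift(int32_t v)
{
    return (uint32_t)v >> 4;
}
```

Both functions compute the same value for every input, so a compiler (or a source-level change) can legally substitute the cheaper form on the critical path.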
- SAD optimization
- Expose to the vectorizer that it can do element-aligned, but not vector-aligned, loads/stores safely and performantly (-mno-vector-strict-align)
- Increase minimum vector length. Default ZVL is 128. Increasing to 256 or 512 helps
- Combination of the two items above results in doing 16-lane operations in an unrolled loop with a single vredsum
- Currently exploring whether a strided load can increase to 32 lanes per vector op, and whether doubling the throughput of the vector code offsets the cost of the more complex vector load
- Investigating if a SAD instruction (similar to x86 and aarch64) would help
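For reference, the scalar shape of the kernel under discussion is the classic sum-of-absolute-differences loop. This is a hypothetical 16x16 variant in the style of x264's pixel_sad routines, not the actual source; it is the loop nest the vectorizer is expected to turn into full-width vector subtract/abs operations with a single final reduction (vredsum):

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 16x16 SAD kernel. The inner loop accumulates the
 * absolute pixel differences of one row; the outer loop walks rows
 * using the two strides. With element-aligned vector accesses allowed
 * and a large enough ZVL, the whole body vectorizes into a handful of
 * wide vector ops plus one reduction. */
int sad_16x16(const uint8_t *pix1, intptr_t stride1,
              const uint8_t *pix2, intptr_t stride2)
{
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(pix1[x] - pix2[x]);
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}
```

Building such a kernel with, e.g., `-O3 -march=rv64gcv_zvl256b -mno-vector-strict-align` corresponds to the two items above; the strided-load idea would instead pull two rows into one vector per operation.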
- sub4x4_dct
- Store-forward-bypass issues likely to trigger here, with a simple vector store feeding a wider segmented load
- uarch behavior in that case may be critical
- Can be avoided by using more permutes, which have their own concerns
- Segmented loads/stores may not be performant on all uarchs
- Unclear if vectorization will help here, especially if scalar uarch is high performing
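To make the store-forwarding concern concrete, here is a simplified scalar sketch of a sub4x4_dct-style routine (modeled on the shape of the x264 function; details are illustrative). The narrow stores into d[][] that immediately feed the wider accesses of the transform passes are exactly where a vector store followed by a wider segmented load would stress the store-forward-bypass path:

```c
#include <stdint.h>

/* Simplified sub4x4_dct sketch: compute a 4x4 block of pixel
 * differences, then apply the H.264-style integer transform to rows
 * and then columns. */
void sub4x4_dct(int16_t dct[4][4],
                const uint8_t *pix1, intptr_t i1,
                const uint8_t *pix2, intptr_t i2)
{
    int16_t d[4][4], tmp[4][4];

    /* Narrow (16-bit) stores of the differences... */
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            d[y][x] = pix1[y * i1 + x] - pix2[y * i2 + x];

    /* ...immediately reloaded by the row transform. */
    for (int i = 0; i < 4; i++) {
        int s02 = d[i][0] + d[i][2], d02 = d[i][0] - d[i][2];
        int s13 = d[i][1] + d[i][3], d13 = d[i][1] - d[i][3];
        tmp[0][i] = s02 + s13;
        tmp[1][i] = 2 * d02 + d13;
        tmp[2][i] = s02 - s13;
        tmp[3][i] = d02 - 2 * d13;
    }
    /* Column transform over the transposed intermediate. */
    for (int i = 0; i < 4; i++) {
        int s02 = tmp[i][0] + tmp[i][2], d02 = tmp[i][0] - tmp[i][2];
        int s13 = tmp[i][1] + tmp[i][3], d13 = tmp[i][1] - tmp[i][3];
        dct[i][0] = s02 + s13;
        dct[i][1] = 2 * d02 + d13;
        dct[i][2] = s02 - s13;
        dct[i][3] = d02 - 2 * d13;
    }
}
```

A permute-based variant would keep the intermediate in registers instead of round-tripping through d[][]/tmp[][], trading the store/load pattern for vector shuffles.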
- quant4x4
- Benefits greatly from zicond and if-conversion
- Unclear if vectorization will help here, especially if scalar uarch is high performing
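The quant kernel's sign-dependent branch is what zicond and if-conversion eliminate. The following is a hedged sketch of the branchy form and its branchless equivalent (parameter names mf/f and the 16-bit shift are illustrative, modeled on the shape of x264's quant rather than the exact source):

```c
#include <stdint.h>

/* Branchy form: one unpredictable branch per coefficient. */
int32_t quant_branchy(int32_t coef, int32_t mf, int32_t f)
{
    if (coef >= 0)
        return (f + coef * mf) >> 16;
    else
        return -((f - coef * mf) >> 16);
}

/* Branchless form: split off the sign, quantize the magnitude,
 * reapply the sign. This is the shape if-conversion produces, and
 * zicond-style conditional ops make the sign select/negate cheap. */
int32_t quant_branchless(int32_t coef, int32_t mf, int32_t f)
{
    int32_t s = coef >> 31;          /* 0 for coef >= 0, -1 otherwise */
    int32_t a = (coef ^ s) - s;      /* |coef| */
    int32_t q = (f + a * mf) >> 16;  /* quantize magnitude */
    return (q ^ s) - s;              /* reapply sign */
}
```

Whether the branchless form wins on a given core depends on how well the scalar pipeline predicts the branch, which is the same question raised above about high-performing scalar uarchs.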
Stakeholders/Partners
RISE:
...
Updates
- Note various performance issues found and paths of investigation.
- VRULL's patch has been upstreamed and we're seeing the desired vectorization for the other loop in the SATD routines
- Note improvement opportunities around overlapping source/destination registers in widening ops and vxrm hoisting
...