...
- There is an arithmetic right shift followed by a masking operation in quant_4x4 that can be simplified into a logical right shift, eliminating a small amount of code on a critical path (see the sketch after this list)
- Vector setup and element extraction are likely sub-optimal in the SATD routine, particularly for zvl512b
- New vector permute cases need to be pushed upstream
- Manolis's patch to optimize permutes feeding permutes actually generates worse code for RISC-V; this will become a problem soon
- Improve code for permute using merge+masking
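A minimal sketch of the shift simplification, assuming a 16-bit shift/mask pair for illustration (the exact constants in quant_4x4 may differ):

```c
#include <stdint.h>

/* When a 32-bit value is arithmetically shifted right and the result is
 * then masked down to its low bits, the mask discards the sign bits the
 * arithmetic shift replicated, so a single logical (unsigned) shift
 * produces the same value. */
static inline uint16_t narrow_asr_mask(int32_t x)
{
    return (uint16_t)((x >> 16) & 0xFFFF);   /* arithmetic shift + mask */
}

static inline uint16_t narrow_lsr(int32_t x)
{
    return (uint16_t)((uint32_t)x >> 16);    /* equivalent logical shift */
}
```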
- Rearranging SLP nodes that occur multiple times in the same statement, using a vec_perm to restore the original ordering rather than duplicating the nodes, may have as much as a 10% benefit for vectorized x264 (illustrated below)
- Additional information is in GCC's bug database
- Integrated upstream
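A hedged illustration of the SLP situation; the function below is hypothetical, not x264 source. The same loaded values feed both statements, so the vectorizer can either duplicate the operand nodes or keep one load node and recover each required lane order with a vec_perm:

```c
/* Butterfly-style pair: a[0] and a[1] each occur in both statements.
 * Instead of building duplicated SLP operand nodes ({a[0], a[0]} and
 * {a[1], a[1]}), the load group {a[0], a[1]} can be built once and a
 * vec_perm used to restore the ordering each operand position needs. */
void butterfly2(int *restrict d, const int *restrict a)
{
    d[0] = a[0] + a[1];
    d[1] = a[0] - a[1];
}
```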
- GCC does not make good use of widening vector ops whose source and destination registers overlap. The expectation is this is worth another 1-2% improvement (see the sketch below)
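A hedged example of the kind of loop involved; the kernel is illustrative, not taken from x264. The multiply widens 16-bit sources into a 32-bit destination, so the destination register group is wider than the source groups and the quality of the overlap handling determines whether extra vector moves are emitted:

```c
#include <stdint.h>

/* Widening multiply-accumulate: SEW=16 inputs, SEW=32 accumulator. */
void widen_mac(int32_t *restrict acc, const int16_t *restrict a,
               const int16_t *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        acc[i] += (int32_t)a[i] * b[i];
}
```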
- GCC is emitting an unnecessary load of 0 into a GPR before emitting vmv.s.x to initialize the accumulator in reduction ops. We should use x0 instead (see the sketch after this list)
- Small constants in the range -16..15 can be splatted across a vector with vmv.v.i, without loading the constant into a GPR first
- GCC should eliminate vsetvl instructions by better grouping vector instructions that use the same vector configuration
- Must be done very conservatively so that it does not otherwise perturb the schedule and introduce data or functional hazards
- The thinking is to track the last vector configuration in the scheduler, then reorder within the highest-priority ready insns to prefer a vector instruction with the same vector configuration. This seems to save about 1% on x264 from an icount standpoint
- GCC does not hoist vxrm assignments aggressively, which can significantly impact performance if the uarch does not provide fast vxrm access. This is about 2% on the BPI
- Internal implementation done, needs upstreaming
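A minimal sketch of the patterns behind the first two items above; the loops are illustrative, not x264 source:

```c
#include <stdint.h>

/* Sum reduction: the accumulator seed of 0 lands in element 0 of the
 * vector accumulator ahead of the vredsum, so it can come from
 * vmv.s.x with x0 rather than from a GPR loaded with 0. */
int32_t sum_i32(const int32_t *a, int n)
{
    int32_t s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Splat of a small constant: an immediate in -16..15 fits vmv.v.i,
 * so no li into a GPR followed by vmv.v.x is needed. */
void fill_i32(int32_t *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = 5;
}
```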
- SAD optimization
- Expose to the vectorizer that it can do element-aligned, but not vector-aligned, loads/stores safely and performantly (-mno-vector-strict-align)
- Increase minimum vector length. Default ZVL is 128. Increasing to 256 or 512 helps
- The combination of the two items above results in 16-lane operations in an unrolled loop with a single vredsum (a scalar reference loop is sketched after this list)
- Currently exploring whether a strided load can increase this to 32 lanes per vector op and whether doubling the throughput of the vector code offsets the cost of the more complex vector load
- Investigating if a SAD instruction (similar to x86 and aarch64) would help
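A scalar reference for the SAD kernel under discussion; the 16x16 shape and the names below are assumptions for illustration. With -mno-vector-strict-align and a larger ZVL each 16-byte row becomes a single 16-lane operation, and the strided-load idea would combine two rows into 32 lanes per op:

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over a 16x16 block of 8-bit pixels.
 * Each row is 16 contiguous bytes, element-aligned but not necessarily
 * vector-aligned; a load strided by the row pitch could cover two rows
 * per vector op. */
int sad_16x16(const uint8_t *pix1, intptr_t stride1,
              const uint8_t *pix2, intptr_t stride2)
{
    int sum = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sum += abs(pix1[x] - pix2[x]);
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}
```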
- sub4x4_dct
- Store-forward-bypass issues are likely to trigger here, with a simple vector store feeding a wider segmented load (see the sketch after this list)
- uarch behavior in that case may be critical
- Can be avoided by using more permutes, which have their own concerns
- Segmented loads/stores may not be performant on all uarchs
- Unclear if vectorization will help here, especially if the scalar uarch is high performing
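A hedged sketch of the data flow raising the store-forwarding concern; the code is illustrative, not x264 source. The difference stage writes narrow 16-bit rows that the transform stage immediately re-reads with a different grouping, which is where a plain vector store feeding a wider segmented load shows up:

```c
#include <stdint.h>

/* Difference stage of a 4x4 DCT: rows of 16-bit pixel differences are
 * stored, then the transform reloads them column-wise.  A direct
 * vectorization stores each row and reloads with a segmented or
 * strided load, the pattern that can defeat store-to-load forwarding. */
void diff_4x4(int16_t d[4][4],
              const uint8_t *pix1, intptr_t s1,
              const uint8_t *pix2, intptr_t s2)
{
    for (int y = 0; y < 4; y++) {
        for (int x = 0; x < 4; x++)
            d[y][x] = (int16_t)(pix1[x] - pix2[x]);
        pix1 += s1;
        pix2 += s2;
    }
}
```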
- quant4x4
- Benefits greatly from zicond and if-conversion (see the sketch after this list)
- Unclear if vectorization will help here, especially if the scalar uarch is high performing
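A sketch of the per-coefficient quantization pattern that benefits; the helper below is an illustrative approximation, not x264 source. The sign-dependent branch is what if-conversion removes, and Zicond lets the resulting selects stay branchless:

```c
#include <stdint.h>

/* One coefficient: quantize the magnitude, then restore the sign.
 * The coef > 0 test is the branch that if-conversion (and Zicond's
 * czero.* instructions) can turn into straight-line code. */
static inline int16_t quant_one(int16_t coef, uint16_t mf, uint32_t bias)
{
    if (coef > 0)
        return (int16_t)(((uint32_t)coef * mf + bias) >> 16);
    else
        return (int16_t)(-(int32_t)(((uint32_t)-coef * mf + bias) >> 16));
}
```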
...
Updates
- Code to improve loading 0 into the 0th element of a vector has been integrated
- Code to improve splatting small constants -16..15 across a vector has been integrated
- Note how we could eliminate more vsetvl instructions
- Note various performance issues found and paths of investigation.
...