CT_00_035 -- Improve x264 vectorization
About
x264 is a critical benchmark for vectorization in the SPEC suite, showing a roughly 2X improvement across many architectures once vectorization is enabled. This work item is meant to track further improvements that may be possible in the benchmark through compiler work.
- There is an arithmetic right shift followed by a masking operation in quant_4x4 that can be simplified into a logical right shift, eliminating a small amount of code on a critical path (see the sketch after this list).
- Vector setup and element extraction is likely sub-optimal in the SATD routine, particularly for zvl512b
- New vector permute cases need to be pushed upstream
- Manolis's patch to optimize permutes feeding permutes actually generates worse code for RISC-V and will become a problem soon
- Improve code for permute using merge+masking
- Rearranging SLP nodes that occur multiple times in the same statement to avoid duplicates, with a vec_perm to restore the original ordering, may have as much as a 10% benefit for vectorized x264.
- Additional information from GCC's bug database
- Integrated upstream
- GCC does not make good use of widening vector ops that overlap source/destination registers. The expectation is that this is worth another 1-2%
- Strongly recommend a good uarch description for the scheduler. We (Ventana) are seeing 30%+ improvements in various pico benchmarks derived from x264 by good scheduling
- GCC is emitting an unnecessary load of 0 into a GPR before the vmv.s.x that initializes the accumulator in reduction ops. We should use x0 instead (see the initialization sketch after this list).
- Small constants can be splatted across a vector with vmv.v.i, without needing to load the constant into a GPR first (also covered in that sketch).
- GCC should eliminate vsetvl instructions by better grouping vector instructions using the same vector configuration
- Must be done very conservatively so as not to otherwise perturb the schedule and cause data or functional hazards
- The thinking is to do this by tracking the last vector configuration in the scheduler, then reordering within the highest-priority ready insns to prefer a vector instruction with the same vector configuration. This seems to save about 1% on x264 from an icount standpoint.
- Prototype for this has been posted. Improves dynamic instruction counts, but actually performs worse at least on Ventana's design
- Likely points to a costing problem since it should only be rearranging instructions with the same predicted cost
- GCC does not hoist vxrm assignments aggressively, which can significantly impact performance if the uarch does not provide fast vxrm access. This is about 2% on the BPI (see the rounding-average sketch after this list)
- Internal implementation done, needs upstreaming
- SAD optimization (a simplified SAD kernel follows this list)
- Expose to the vectorizer that it can do element-aligned, but not vector-aligned, loads/stores safely and performantly (-mno-vector-strict-align)
- Increase minimum vector length. Default ZVL is 128. Increasing to 256 or 512 helps
- The combination of the two items above results in doing 16-lane operations in an unrolled loop with a single vredsum
- Currently exploring whether a strided load can increase this to 32 lanes per vector op and whether doubling the throughput of the vector code offsets the cost of the more complex vector load
- Investigating if a SAD instruction (similar to x86 and aarch64) would help
- First pass scheduling is not getting the vector loads out of the way fast enough. This is expected to be a small change to the mapping of insns to specific types for the scheduler
- SATD
- Manolis's code allows vectorization of the first loop and seems to be performing reasonably well.
- Manolis's permutation optimization was a step in the wrong direction, ultimately generating an expensive permute rather than one that was relatively inexpensive
- Unclear if Manolis's follow-up pass to further optimize active lanes in an SLP group will help reduce the cost of 10-b.
- Scheduling will be important for these routines as well.
- Dual ALUs in the uarch is seen as particularly important
- It is possible to derive permutation constants from each other with vadd.vi instructions (illustrated after this list).
- sub4x4_dct
- Store-forward-bypass issues likely to trigger here with simple vector store, feeding a wider segmented load
- Can be avoided by using more permutes, which have their own concerns
- Segmented loads/stores may not be performant on all uarchs
- Improving various permutation patterns may help some here
- Unclear if vectorization will help here, especially if scalar uarch is high performing
- quant4x4
- Benefits greatly from zicond and if-conversion (see the branchless quant sketch after this list)
- Unclear if vectorization will help here, especially if scalar uarch is high performing
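
The shift-plus-mask simplification mentioned in the first item can be illustrated with a small, hypothetical example (this is not the actual quant_4x4 source): when an arithmetic right shift is followed by a mask that discards the sign-extended high bits anyway, a single logical shift produces the same value and the mask instruction disappears.

    #include <assert.h>
    #include <stdint.h>

    /* Illustrative only, not the x264 quant_4x4 code.  The arithmetic shift
       fills the high bits with copies of the sign bit; the mask then discards
       exactly those bits, so a logical (unsigned) shift alone is equivalent. */
    static uint32_t extract_arith(int32_t x)   { return (x >> 24) & 0xff; }
    static uint32_t extract_logical(int32_t x) { return (uint32_t)x >> 24; }

    int main(void)
    {
        int32_t tests[] = { 0, 1, -1, 0x12345678, (int32_t)0x80000000 };
        for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++)
            assert(extract_arith(tests[i]) == extract_logical(tests[i]));
        return 0;
    }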
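A minimal sketch of the two initialization patterns noted above (reduction accumulator and constant splat). The assembly in the comments is the desired output, not a claim about what GCC currently emits.

    #include <stdint.h>

    /* Reduction: the accumulator's initial 0 can be placed with
       "vmv.s.x v<acc>, x0" instead of "li a5, 0" followed by
       "vmv.s.x v<acc>, a5". */
    int32_t sum_i32(const int32_t *a, int n)
    {
        int32_t s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Splat of a small immediate (-16..15): "vmv.v.i v<x>, 5" avoids the
       "li a5, 5" + "vmv.v.x v<x>, a5" pair. */
    void fill_5(int32_t *a, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = 5;
    }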
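A sketch of one source pattern behind the vxrm concern: the rounding halving-add idiom used heavily in x264's pixel averaging. The function name here is just illustrative. If GCC maps this to vaaddu.vv, that instruction reads the vxrm rounding-mode CSR, so the csrwi that sets vxrm should be hoisted out of the loop rather than re-emitted near every vector op.

    #include <stdint.h>

    /* Rounding average of two pixel rows.  The (a + b + 1) >> 1 idiom is the
       kind of operation that can lower to vaaddu.vv with a rounding mode taken
       from vxrm; the vxrm write belongs outside the loop. */
    void pixel_avg_row(uint8_t *dst, const uint8_t *a, const uint8_t *b, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
    }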
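For the SAD items, a simplified 16x16 SAD kernel in the style of x264's pixel_sad routines (not the benchmark source itself). With -mno-vector-strict-align and a larger minimum vector length (e.g. -march=rv64gcv_zvl256b), the vectorizer can process a row per vector operation and reduce with a single vredsum, as described in the bullets above.

    #include <stdint.h>

    /* Simplified SAD over a 16x16 block; illustrative, not the x264 source. */
    int sad_16x16(const uint8_t *pix1, intptr_t stride1,
                  const uint8_t *pix2, intptr_t stride2)
    {
        int sum = 0;
        for (int y = 0; y < 16; y++) {
            for (int x = 0; x < 16; x++) {
                int d = pix1[x] - pix2[x];
                sum += d < 0 ? -d : d;
            }
            pix1 += stride1;
            pix2 += stride2;
        }
        return sum;
    }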
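The vadd.vi point about permutation constants can be seen with the common even/odd deinterleave pair: the odd-lane index vector is the even-lane index vector plus one, so once the first constant is materialized the second can be derived with a single vadd.vi rather than rebuilt or reloaded. The plain-C loop below only demonstrates the relationship between the two index sets.

    #include <stdio.h>

    /* The two index vectors used to split interleaved data into even and odd
       lanes differ by exactly 1 in every element, so the second can be derived
       from the first with one vadd.vi. */
    int main(void)
    {
        enum { VL = 8 };
        int even[VL], odd[VL];
        for (int i = 0; i < VL; i++) {
            even[i] = 2 * i;         /* 0, 2, 4, ... */
            odd[i]  = even[i] + 1;   /* 1, 3, 5, ... == even[i] + 1 */
        }
        for (int i = 0; i < VL; i++)
            printf("even=%d odd=%d\n", even[i], odd[i]);
        return 0;
    }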
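Finally, a sketch of the quant4x4 pattern that zicond and if-conversion target. This follows the general shape of x264's per-coefficient quantization but is not the benchmark source; the point is the sign-dependent select, which if-conversion can turn into straight-line code and Zicond's czero.eqz/czero.nez can then implement without a branch.

    #include <stdint.h>

    /* Branchy form: the sign test is the branch that if-conversion removes. */
    static inline int16_t quant_one(int16_t coef, uint16_t mf, uint32_t f)
    {
        if (coef > 0)
            return (int16_t)((f + coef) * mf >> 16);
        else
            return (int16_t)(-(int32_t)((f - coef) * mf >> 16));
    }

    /* Branchless shape that if-conversion aims for (when it can prove the
       speculation safe); with Zicond the select can lower to a
       czero.eqz/czero.nez pair plus an or instead of a conditional branch. */
    static inline int16_t quant_one_branchless(int16_t coef, uint16_t mf, uint32_t f)
    {
        int32_t pos =  (int32_t)((f + coef) * mf >> 16);
        int32_t neg = -(int32_t)((f - coef) * mf >> 16);
        return (int16_t)(coef > 0 ? pos : neg);
    }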
Stakeholders/Partners
RISE:
Ventana: Robin Dapp – Cost model, permutation improvements, etc. Overall lead
Ventana: Raphael Zinsly – currently looking at wide SADs
Ventana: Jeff Law – everything scheduling related
External:
VRULL: Manolis Tsamis
Dependencies
Status
Updates
- 2024 efforts are done; relevant 2025 efforts are in a new page.
- Code to adjust mapping from instruction to scheduling type has been posted upstream and is expected to integrate shortly.
- A patch to derive one permutation constant from another (SATD related) has been posted upstream.
- Various permutation adjustments have been posted upstream.
- Code to aggressively hoist VXRM assignments has been integrated
- Manolis's code for SLP grouping seems to be behaving much better now
- A vector SAD instruction can dramatically help the most important loops in x264 from SPEC
- Scheduling is problematic, but looks to be solvable.
- Several bugfixes/enhancements from Ventana should be landing upstream soon
- Code to improve loading 0 into the 0th element of a vector has been integrated
- Code to improve splatting small constants -16..15 across a vector has been integrated
- Note how we could eliminate more vsetvl instructions
- Note various performance issues found and paths of investigation.
- VRULL's patch has been upstreamed and we're seeing desired vectorization for the other loop in the SATD routines
- Note improvement opportunities with overlapping widening ops and with vxrm hoisting
- Project reported as a priority for 2H2024, broken out from original effort
Related content
- CT_00_028 -- Investigate and improve Scalar code generation for cactuBSSN
- CT_00_018 -- Evaluate and potentially improve x264 vectorization
- CT_00_037 -- Zicond with if-conversion improvements (GCC)
- RISCV64 new vector instructions requirements for video / multimedia
- CT_00_033 -- New instruction fusions
- CT_01_001 - Autovectorization -- Basic Functionality (LLVM)