
About

x264 should see roughly a 2X performance improvement from autovectorization, based on data from other architectures. We need to verify that we see similar improvements on RISC-V and, if not, address any shortcomings in the code generation.


The SAD routines are somewhat notorious for having low trip counts on their loops.  As a result, poor vector setup can significantly reduce the benefits of autovectorization.  Using masked loads and/or strided loads can help widen the vectorization factor and improve performance.  Improvements to tree-ssa-forwprop.cc can eliminate the various VIEW_CONVERT_EXPR statements, collapse permutations, simplify bit insertion/extraction, etc.  The goal is to hand off nearly optimal code to the RTL phase of the compiler.
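To illustrate the trip-count problem, a 4x4 SAD kernel has roughly the shape below (a simplified sketch, not the actual x264 source; the name sad_4x4 and its signature are illustrative). The inner loop covers only four bytes and the outer loop only four rows, so without masked or strided loads the vectorizer is stuck with a very narrow vectorization factor.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified sketch of an x264-style 4x4 SAD kernel (not the actual source).
   Both loops have trip count 4, so vector setup cost matters a lot and
   masked/strided loads are needed to use wider vectors across rows. */
static int sad_4x4(const uint8_t *pix1, intptr_t stride1,
                   const uint8_t *pix2, intptr_t stride2)
{
    int sum = 0;
    for (int y = 0; y < 4; y++) {
        /* Only four byte-wide elements per row: a very short inner loop. */
        for (int x = 0; x < 4; x++)
            sum += abs(pix1[x] - pix2[x]);
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}
```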


It is believed that work on encouraging the unrolling of an outer loop, so that an inner loop can be vectorized more widely, would help the SATD routines.  Neither GCC nor LLVM currently does a good job at this.
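A minimal sketch of the transform, hand-unrolled on the SAD-shaped kernel above for illustration (the SATD routines have a more involved body, and this is not the x264 source): unrolling the row loop by two exposes two rows of independent work per iteration, which, combined with strided or masked loads, lets the compiler choose a wider vectorization factor than the 4-element inner loop alone would justify.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative only: outer row loop unrolled by two so each iteration
   exposes eight independent absolute differences instead of four. */
static int sad_4x4_outer_unrolled(const uint8_t *pix1, intptr_t stride1,
                                  const uint8_t *pix2, intptr_t stride2)
{
    int sum = 0;
    for (int y = 0; y < 4; y += 2) {
        for (int x = 0; x < 4; x++) {
            sum += abs(pix1[x]           - pix2[x]);            /* row y     */
            sum += abs(pix1[stride1 + x] - pix2[stride2 + x]);  /* row y + 1 */
        }
        pix1 += 2 * stride1;
        pix2 += 2 * stride2;
    }
    return sum;
}
```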


The SATD routines may contain a loop that is not currently vectorized.  We need to perform variable expansion before vectorization to have any chance of vectorizing the first part of the SATD routines.
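Variable expansion here means rewriting scalar temporaries that are reused on every iteration into per-iteration storage, so the loop no longer appears to carry a dependence through them. Below is a hedged sketch of what the first half of a 4x4 SATD routine might look like after that transform (simplified and renamed; not the actual x264 code, where the temporaries are scalars a0..a3).

```c
#include <stdint.h>

/* Sketch of "variable expansion" on a loop shaped like the first half of an
   x264 SATD routine.  The former scalar temporaries are expanded into the
   array d[][], so each iteration writes its own slots and the difference
   loop becomes a straightforward vectorization candidate. */
static void satd_4x4_first_half(const uint8_t *pix1, intptr_t i_pix1,
                                const uint8_t *pix2, intptr_t i_pix2,
                                int tmp[4][4])
{
    int d[4][4];  /* expanded copies of what were scalar temporaries */

    for (int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2) {
        d[i][0] = pix1[0] - pix2[0];
        d[i][1] = pix1[1] - pix2[1];
        d[i][2] = pix1[2] - pix2[2];
        d[i][3] = pix1[3] - pix2[3];
    }

    /* Horizontal butterfly step consuming the expanded values. */
    for (int i = 0; i < 4; i++) {
        int s01 = d[i][0] + d[i][1], d01 = d[i][0] - d[i][1];
        int s23 = d[i][2] + d[i][3], d23 = d[i][2] - d[i][3];
        tmp[i][0] = s01 + s23;
        tmp[i][1] = s01 - s23;
        tmp[i][2] = d01 + d23;
        tmp[i][3] = d01 - d23;
    }
}
```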


get_ref, sub_dct, and other routines also provide some vectorization opportunities and need to be investigated.


Note that scalar improvements for LLVM are tracked in a separate project.




Stakeholders/Partners

RISE:

Ventana: Robin Dapp

Ventana: Jeff Law


External:



Dependencies


Status

Development

COMPLETE


Development Timeline

1H2024

Upstreaming

COMPLETE


Upstream Version

gcc-14

Spring 2024




Contacts

Jeff Law (Ventana)



Updates

 

  • Testing on the k230 board shows "only" a 17% runtime improvement, against a target of a 50% runtime improvement for x264 vectorization (which would double the SPEC score)
  • However, it appears the cost of a vector ALU op on the k230 is at least 3X LMUL
  • So a performant uarch where vector ALU ops of reasonable size (128 bits) execute in one cycle should see the expected 50% runtime improvement
  • We consider this resolved

 

  • Dynamic instruction counts were cut by 47%, which is in the right ballpark for a 2X performance improvement
    • x86 shows roughly an 88% improvement (i.e., runtime nearly cut in half)
    • aarch64 shows roughly a 104% improvement (i.e., runtime cut by more than half)
    • The 47% reduction in dynamic instruction counts is in the right ballpark
    • Need to do performance testing to reach closure

 

  • Project reported as a priority for 1H2024

