...

Based on data from other architectures, x264 should see roughly a 2X performance improvement from autovectorization. We need to verify that we see similar improvements on RISC-V and, if not, address any shortcomings in the code generation.

The SAD routines are somewhat notorious for having low trip counts on their loops. As a result, poor vector setup can significantly reduce the benefits from autovectorization. Using masked loads and/or strided loads can help widen the vectorization factor and improve performance. Improvements to tree-ssa-forwprop.cc can eliminate the various VIEW_CONVERT_EXPR statements, collapse permutations, simplify bit insertion/extraction, etc. The goal is to hand off nearly optimal code to the RTL phase of the compiler.
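To make the low trip count concern concrete, here is a minimal sketch of the kind of loop nest involved. The function name `sad_8x8` and its signature are hypothetical, chosen for illustration rather than copied from the x264 source. The inner loop runs only 8 times over byte elements, so a single row fills just half of a 128-bit vector and any fixed vector setup cost is amortized over very little work.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 8x8 sum-of-absolute-differences kernel (illustrative only,
 * not the x264 source). Strides are the byte distance between rows. */
int sad_8x8(const uint8_t *pix1, ptrdiff_t stride1,
            const uint8_t *pix2, ptrdiff_t stride2)
{
    int sum = 0;
    for (int y = 0; y < 8; y++) {
        /* Inner trip count is only 8: at most one vector iteration. */
        for (int x = 0; x < 8; x++)
            sum += abs(pix1[x] - pix2[x]);
        pix1 += stride1;   /* rows are not contiguous in general */
        pix2 += stride2;
    }
    return sum;
}
```

Because consecutive rows are separated by a stride, covering several rows with one vector operation requires a strided or masked access, which is one way a wider vectorization factor could be reached.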

It is believed that finding a way to encourage unrolling an outer loop, so that an inner loop can be vectorized more widely, would help the SATD routines. Neither GCC nor LLVM does a good job at this.
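The transformation described here can be sketched by hand. The example below is an assumption-laden illustration, not SATD code from x264: the function names `diff_4xN` and `diff_4xN_unrolled` are hypothetical, and the kernel is a plain row difference standing in for the first step of a 4-wide routine. The baseline exposes only 4 elements per outer iteration; unrolling the outer loop by 2 and fusing the copies exposes 8 independent elements.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline: the inner loop exposes only 4 elements per outer iteration. */
void diff_4xN(int16_t *out, const uint8_t *a, ptrdiff_t stride_a,
              const uint8_t *b, ptrdiff_t stride_b, int rows)
{
    for (int y = 0; y < rows; y++)
        for (int x = 0; x < 4; x++)
            out[y * 4 + x] = (int16_t)(a[y * stride_a + x] - b[y * stride_b + x]);
}

/* Outer loop unrolled by 2 and fused into the inner loop by hand:
 * each outer iteration now covers 8 independent elements (two rows). */
void diff_4xN_unrolled(int16_t *out, const uint8_t *a, ptrdiff_t stride_a,
                       const uint8_t *b, ptrdiff_t stride_b, int rows)
{
    int y = 0;
    for (; y + 1 < rows; y += 2) {
        for (int x = 0; x < 4; x++) {
            out[y * 4 + x] =
                (int16_t)(a[y * stride_a + x] - b[y * stride_b + x]);
            out[(y + 1) * 4 + x] =
                (int16_t)(a[(y + 1) * stride_a + x] - b[(y + 1) * stride_b + x]);
        }
    }
    /* Remainder row when rows is odd. */
    for (; y < rows; y++)
        for (int x = 0; x < 4; x++)
            out[y * 4 + x] = (int16_t)(a[y * stride_a + x] - b[y * stride_b + x]);
}
```

Both functions compute the same result; the second only makes visible the shape a compiler would need to reach on its own before it can use wider vectors.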

The SATD routines may have a loop which is not currently vectorized. We need to perform variable expansion before vectorization to have any chance of vectorizing the first part of the SATD routines.

get_ref, sub_dct and other routines also provide some vector opportunities and need to be investigated.

Note there are scalar improvements for LLVM tracked in a distinct project.

Based on extrapolation of data from the k230 board, it is expected that a uarch which can do 128-bit vector ALU ops in a single cycle will see a runtime reduction of 50% for this benchmark. That translates into a 2X improvement in the spec2017 score for 525.x264_r.
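The step from a 50% runtime reduction to a 2X score can be checked with a small helper. This is a sketch that assumes the score is inversely proportional to runtime; the function name `speedup_from_runtime_reduction` is hypothetical.

```c
/* Speedup factor implied by a fractional runtime reduction r, 0 <= r < 1.
 * Assumes score scales as 1 / runtime, so
 * speedup = old_time / new_time = 1 / (1 - r). */
double speedup_from_runtime_reduction(double r)
{
    return 1.0 / (1.0 - r);
}
```

With r = 0.50 this gives a factor of 2, matching the 2X figure above. A 17% reduction, by the same formula, corresponds to a factor of roughly 1.2.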

Given that's precisely the goal we were shooting for, this is considered done.

Note that Robin is investigating improving the generated code for the SATD routines. Essentially we're doing a lot of byte loads when we should be loading larger values. This may provide another small improvement on top of the basic vectorization.

Stakeholders/Partners

RISE:

...

Page Properties

Development

Status


colour	Green
title	COMPLETE

Development Timeline

1H2024

Upstreaming

Status


colour	Green
title	COMPLETE

Upstream Version

gcc-14

Spring 2024

Contacts

Jeff Law (Ventana)

Dependencies

Closure needs

performance testing

...

Updates

09 May 2024

Note additional opportunities for improvement.

25 Apr 2024

Testing on the k230 board shows "only" a 17% runtime improvement, whereas the target for x264 vectorization is a 50% runtime improvement (which would double the spec score).
However, it looks like the cost of a vector ALU op on that board is at least 3X LMUL.
So a performant uarch where vector ALU ops of reasonable size (128 bits) take a single cycle would see the expected 50% runtime improvement.
Considering this resolved.

14 Mar 2024

Dynamic instruction rates cut by 47%, so in the right ballpark for a 2X performance improvement
- x86 shows a roughly 88% improvement (i.e., runtime nearly cut in half)
- aarch64 shows roughly a 104% improvement (i.e., runtime cut by more than 50%)
- 47% reduction in dynamic cycle counts is in the right ballpark
- Need to do performance testing to reach closure
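The percentages in this update mix two conventions, so a small conversion helps when comparing them. The sketch below assumes "improvement" means an increase in performance (work per unit time), which is consistent with the parenthetical runtime remarks; the function name `runtime_reduction_from_improvement` is hypothetical.

```c
/* Fractional runtime reduction implied by a fractional performance
 * improvement p (p = 0.88 means "88% faster").
 * new_time / old_time = 1 / (1 + p), so reduction = 1 - 1 / (1 + p). */
double runtime_reduction_from_improvement(double p)
{
    return 1.0 - 1.0 / (1.0 + p);
}
```

Under that assumption, an 88% improvement corresponds to a runtime reduction of about 47%, and a 104% improvement to about 51%.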

...
