x264 should see roughly a 2X performance improvement from autovectorization, based on data from other architectures. We need to verify that we see similar improvements on RISC-V and, if not, address any shortcomings in code generation.

The SAD routines are somewhat notorious for having low trip counts on their loops. As a result, poor vector setup can significantly reduce the benefits of autovectorization. Using masked loads and/or strided loads can help widen the vectorization factor and improve performance. Improvements to tree-ssa-forwprop.cc can eliminate the various VIEW_CONVERT_EXPR statements, collapse permutations, simplify bit insertion/extraction, etc. The goal is to hand off nearly optimal code to the RTL phase of the compiler.
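To make the low trip count concrete, here is a minimal scalar sketch of a SAD (sum of absolute differences) kernel. The function name and signature are hypothetical and chosen for illustration; they are not x264's actual symbols. The point is the shape of the loop nest: the inner loop runs only `width` times (commonly 4, 8 or 16), so any fixed vector setup cost is amortized over very few elements.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar SAD over a width x height block of 8-bit pixels.
   Each row is `width` bytes wide and rows are `stride` bytes apart, so
   the inner loop has a small, fixed trip count per row. */
int sad_block(const uint8_t *pix1, intptr_t stride1,
              const uint8_t *pix2, intptr_t stride2,
              int width, int height)
{
    int sum = 0;
    for (int y = 0; y < height; y++) {
        /* Low trip count loop: only `width` iterations. */
        for (int x = 0; x < width; x++)
            sum += abs(pix1[x] - pix2[x]);
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}
```

Because consecutive rows are separated by a stride, a vectorizer that wants to process more than one row per vector operation needs something like strided or masked loads, which is why those are called out above.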

The SATD routines may have a loop which is not currently vectorized. We need to perform variable expansion before vectorization to have any chance of vectorizing the first part of the SATD routines.
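For orientation, the following is a plain reference sketch of a 4x4 SATD (sum of absolute transformed differences) using a Hadamard transform. It is an assumption-laden illustration: the function name is hypothetical, and this straightforward version is not the code x264 ships. The first loop (pixel differences plus horizontal butterflies, one row per iteration) is the kind of "first part" referred to above; the second loop applies the vertical butterflies and accumulates absolute values.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reference 4x4 SATD; not x264's actual implementation. */
int satd_4x4_ref(const uint8_t *pix1, intptr_t stride1,
                 const uint8_t *pix2, intptr_t stride2)
{
    int d[4][4];

    /* First part: per-row differences and horizontal butterflies.
       Note the individual byte loads from each row. */
    for (int i = 0; i < 4; i++, pix1 += stride1, pix2 += stride2) {
        int a0 = pix1[0] - pix2[0];
        int a1 = pix1[1] - pix2[1];
        int a2 = pix1[2] - pix2[2];
        int a3 = pix1[3] - pix2[3];
        int s01 = a0 + a1, d01 = a0 - a1;
        int s23 = a2 + a3, d23 = a2 - a3;
        d[i][0] = s01 + s23;
        d[i][1] = s01 - s23;
        d[i][2] = d01 + d23;
        d[i][3] = d01 - d23;
    }

    /* Second part: per-column vertical butterflies and absolute sum. */
    int sum = 0;
    for (int j = 0; j < 4; j++) {
        int s01 = d[0][j] + d[1][j], d01 = d[0][j] - d[1][j];
        int s23 = d[2][j] + d[3][j], d23 = d[2][j] - d[3][j];
        sum += abs(s01 + s23) + abs(s01 - s23)
             + abs(d01 + d23) + abs(d01 - d23);
    }
    return sum >> 1;
}
```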

get_ref, sub_dct and other routines do provide some vector opportunities as well and need to be investigated.


Based on extrapolation of data from the k230 board, a uarch that can do 128-bit vector ALU ops in a single cycle is expected to see a runtime reduction of 50% for this benchmark. That translates into a 2X improvement in the SPEC2017 score for 525.x264_r.
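The relationship between a runtime reduction and a score improvement can be checked with a small sketch. It assumes the score scales inversely with runtime; the helper names are hypothetical. A 50% runtime reduction gives a factor of 1 / (1 - 0.5) = 2, while the 17% reduction measured on the k230 corresponds to only about 1.2X.

```c
/* Speedup factor implied by a fractional runtime reduction r, 0 <= r < 1.
   Assumes the score is inversely proportional to runtime. */
double speedup_from_runtime_reduction(double r)
{
    return 1.0 / (1.0 - r);
}

/* Fractional runtime reduction implied by a speedup factor s, s > 0. */
double runtime_reduction_from_speedup(double s)
{
    return 1.0 - 1.0 / s;
}
```

The second helper covers the opposite direction: for example, an 88% score improvement (a factor of 1.88) corresponds to a runtime reduction of roughly 47%.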

Given that's precisely the goal we were shooting for, this is considered done.

Note that Robin is investigating improvements to the generated code for the SATD routines. Essentially, we're doing a lot of byte loads when we should be loading larger values. This may provide another small improvement on top of the basic vectorization.

Stakeholders/Partners

RISE:

Ventana: Robin Dapp

Ventana: Jeff Law

External:

...

Page Properties

Development

Status

COMPLETE

Development Timeline

1H2024

Upstreaming

Status

COMPLETE

Upstream Version

gcc-14

Spring 2024

Contacts

Jeff Law (Ventana)

Dependencies

None

Updates

09 May 2024

Note additional opportunities for improvement.

25 Apr 2024

Testing on the k230 board shows "only" a 17% runtime improvement, whereas the target for x264 vectorization is a 50% runtime improvement (which would double the SPEC score).
However, it looks like the cost of a vector ALU op on that board is at least 3X LMUL.
So a performant uarch where vector ALU ops of reasonable size (128 bits) take one cycle (1c) would see the expected 50% runtime improvement.
Considering this resolved.

14 Mar 2024

Dynamic instruction rates cut by 47%, so in the right ballpark for a 2X performance improvement
- x86 shows a roughly 88% improvement (i.e., runtime nearly cut in half)
- aarch64 shows a roughly 104% improvement (i.e., runtime cut by more than 50%)
- 47% reduction in dynamic cycle counts is in the right ballpark
- Need to do performance testing to reach closure

29 Dec 2023

Project reported as a priority for 1H2024
