Based on data from other architectures, x264 should see roughly a 2X performance improvement from autovectorization. Extrapolating from measurements on the k230 board, a uarch that can execute 128-bit vector ALU ops in a single cycle is expected to see a 50% runtime reduction on this benchmark, which translates into a 2X improvement in the SPEC CPU 2017 score for 525.x264_r.
Given that is precisely the goal we were shooting for, this is considered done.
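For context, a SPEC CPU 2017 ratio is the reference time divided by the measured runtime, so a 50% runtime reduction doubles the score:

\[
\text{score} \propto \frac{t_{\text{ref}}}{t_{\text{run}}}, \qquad
t_{\text{run}} \to \tfrac{1}{2}\,t_{\text{run}} \;\Rightarrow\; \text{score} \to 2 \times \text{score}
\]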
Stakeholders/Partners
RISE:
Ventana: Robin Dapp, Jeff Law
External:
Status: Development
Development Timeline: 1H2024
Upstream Version: gcc-14 (Spring 2024)
Contacts: Jeff Law (Ventana)
Dependencies:
Updates
Testing on the k230 board shows "only" a 17% runtime improvement, whereas the target for x264 vectorization is a 50% runtime improvement (which would double the SPEC score).
However, it looks like the cost of a vector ALU op on the k230 is at least 3 cycles times LMUL.
So a performant uarch where vector ALU ops of reasonable size (128 bits) complete in a single cycle should see the expected 50% runtime improvement; see the sketch below.
Considering this resolved.
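A minimal back-of-envelope sketch of that reasoning. The instruction mix below is illustrative, not measured: assume the scalar build retires 100 units of work and that vectorization replaces 60 scalar ops with 13 vector ALU ops (a ~47% cut in dynamic instruction count), with a vector ALU op costing either ~3 cycles (3 x LMUL at LMUL=1, k230-like) or 1 cycle (a performant 128-bit uarch):

#include <stdio.h>

/* First-order cost model: scalar ops retire at 1 cycle each and
 * vector ALU ops at vec_cost cycles each.  All inputs below are
 * hypothetical illustrative numbers, not k230 measurements. */
static double est_cycles(double scalar_ops, double vec_ops, double vec_cost)
{
    return scalar_ops + vec_ops * vec_cost;
}

int main(void)
{
    double base = est_cycles(100.0, 0.0, 0.0);   /* scalar build  */
    double slow = est_cycles(40.0, 13.0, 3.0);   /* ~3c vector op */
    double fast = est_cycles(40.0, 13.0, 1.0);   /* 1c vector op  */

    printf("3c vector ALU op: %.0f%% runtime reduction\n",
           100.0 * (1.0 - slow / base));
    printf("1c vector ALU op: %.0f%% runtime reduction\n",
           100.0 * (1.0 - fast / base));
    return 0;
}

With these made-up numbers the model reproduces the shape of the data: a modest (~21%) runtime reduction when vector ALU ops cost ~3 cycles, versus the expected ~47-50% when they complete in one cycle.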
Dynamic instruction counts were cut by 47%, so we are in the right ballpark for a 2X performance improvement.
x86 shows roughly an 88% improvement (i.e., runtime nearly cut in half).
aarch64 shows roughly a 104% improvement (i.e., runtime cut by more than half).
So a 47% reduction in dynamic instruction counts is in the same ballpark; the conversion arithmetic is worked below.
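To convert between the two ways of stating the numbers: an X% performance improvement means runtime shrinks to 1/(1 + X/100) of the original, so both results sit right around a halved runtime:

\[
\frac{t_{\text{new}}}{t_{\text{old}}} = \frac{1}{1 + X/100}: \qquad
X = 88 \Rightarrow \frac{1}{1.88} \approx 0.53, \qquad
X = 104 \Rightarrow \frac{1}{2.04} \approx 0.49
\]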