About

The cam4 benchmark in spec2017 makes heavy use of complex double-precision division, which is implemented in the libgcc library. Complex division can be very expensive because of the long-latency, non-pipelined division operations and the various special cases needed to handle boundary conditions.

By compiling the benchmarks with "-fcx-limited-range", the compiler can open-code the complex division and ignore many of the corner cases, significantly improving performance. This is considered safe for the spec2017 suite, but it still needs to be tested and verified.
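To make the trade-off concrete, here is a minimal sketch in C of the textbook formula that open-coded complex division roughly corresponds to. The function name div_limited_range is hypothetical, and the exact code GCC emits under "-fcx-limited-range" may differ; this only illustrates which corner cases are given up.

```c
#include <complex.h>
#include <math.h>

/* Hypothetical helper: textbook complex division,
 *   (a+bi)/(c+di) = ((ac+bd) + (bc-ad)i) / (c*c + d*d)
 * There is no scaling and no NaN/Inf recovery, so the intermediate
 * products can overflow or underflow for extreme inputs. */
static double complex div_limited_range(double complex x, double complex y)
{
    double a = creal(x), b = cimag(x);
    double c = creal(y), d = cimag(y);
    double denom = c * c + d * d;        /* may overflow to +Inf */
    double re = (a * c + b * d) / denom;
    double im = (b * c - a * d) / denom;
    return re + im * I;                  /* assemble the result */
}
```

For ordinary inputs such as (1+2i)/(3+4i) this agrees with the full division. With both operands equal to 1e200+1e200i, the intermediate products overflow and the sketch returns NaN, even though the mathematically correct quotient is 1. That is the kind of boundary condition the out-of-line libgcc routine is there to handle.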

My recollection is this only affected the speed, not the rate runs of cam4, but this should be verified.

Note that RISC-V does not have a reciprocal estimator, so we can't turn the divisions into reciprocal multiplications, but even so this should significantly improve performance.

... is reasonably friendly for vectorization; performance gains relative to a single-FPU scalar implementation should be on the order of 17%. On the BPI-F3, vectorization reduces instruction counts by 25.76%, but cycle counts (as measured by perf) are up just over 7%. There is little data on the uarch of the k1 processor, but based on my (Jeff's) observations it's a pretty weak vector architecture, roughly on par with the c908 in the k230 board (which didn't have enough memory to reliably run most of the FP benchmarks).

While these numbers are disappointing from a cycle-count standpoint, that is more likely an artifact of the design of the k1 processor than a weakness in the compiler's vector code generation. I (Jeff) strongly suspect that if we had thorough documentation on the k1 uarch and exposed its various aspects to the vectorizer's cost model, we would find that most vectorization opportunities would be considered unprofitable and dropped.
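As a rough illustration of what the BPI-F3 figures imply, the sketch below computes the relative instructions-per-cycle (IPC) from the fractional changes in instruction and cycle counts. The function name relative_ipc is hypothetical, and the cycle increase is taken as exactly 7% even though the measurement above is "just over 7%".

```c
/* Hypothetical helper: IPC after a change, relative to IPC before it.
 * insn_change and cycle_change are fractional changes, for example
 * -0.2576 for a 25.76% reduction and 0.07 for a 7% increase. */
static double relative_ipc(double insn_change, double cycle_change)
{
    return (1.0 + insn_change) / (1.0 + cycle_change);
}
```

Calling relative_ipc(-0.2576, 0.07) gives roughly 0.69, so under these assumptions the vectorized code retires about 30% fewer instructions per cycle than the scalar code on that core. This is consistent with the vector instructions being expensive on the k1.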

Stakeholders/Partners

RISE:

Ventana: Jeff Law

...

Robin Dapp – lead developer

Rivos:

External:

Rivai: Juzhe

Dependencies

Status

Page Properties

Development

Status


COMPLETED

Development Timeline

1H2024

Upstreaming

Status


COMPLETED

Upstream Version

gcc-14

Spring 2024

Contacts

Robin Dapp (Ventana)

Jeff Law (Ventana)

Dependencies

Closure needs

None

...

Performance testing

Updates

28 May 2024

Added note on performance on the k1 design (BPI-F3 board).

14 Mar 2024

We are currently seeing an 18% reduction in dynamic instruction counts for GCC using vector operations, which is roughly in line with expectations.
- x86 gets an approximate performance improvement of 12% from vectorization
- aarch64 gets an approximate performance improvement of 6% from vectorization
- The 18% reduction for RISC-V doesn't necessarily mean an 18% performance improvement, but in general we should be seeing instruction count improvements at or above the performance improvements seen on the competing architectures
- Conclusion: We're in the ballpark. Next steps are to confirm on real vector hardware, keeping in mind that uarch issues may come into play

29 Dec 2023

Project reported as a priority for 1H2024

...

Versions Compared

Old Version 3

New Version Current
