About

The cam4 benchmark in SPEC CPU 2017 is reasonably friendly to vectorization; performance gains relative to a single FPU scalar implementation should be on the order of 17%. On the BPI-F3, vectorization drops the instruction count by 25.76%, but the cycle count (as measured by perf) goes up by just over 7%. There is little data on the uarch of the k1 processor in that board, but based on my (Jeff's) observations it is a pretty weak vector architecture, roughly on par with the c908 in the k230 board (which didn't have enough memory to reliably run most of the FP benchmarks).
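To put the two BPI-F3 figures side by side, the short sketch below works out what they imply for instructions per cycle (IPC). It is only an illustration: the helper name `implied_ipc_change` is hypothetical, and the cycle increase is taken as exactly 7% even though the measured value is "just over 7%", so the result is approximate.

```python
def implied_ipc_change(instr_change: float, cycle_change: float) -> float:
    """Return the fractional change in IPC implied by fractional changes
    in instruction count and cycle count (e.g. -0.2576 means -25.76%)."""
    return (1.0 + instr_change) / (1.0 + cycle_change) - 1.0


# Figures from the BPI-F3 run: instructions down 25.76%,
# cycles up roughly 7% (assumed to be exactly 7% here).
ipc_change = implied_ipc_change(-0.2576, 0.07)
print(f"Implied IPC change: {ipc_change:+.1%}")  # prints "Implied IPC change: -30.6%"
```

Under that assumption, the vectorized build retires roughly 30% fewer instructions per cycle than the scalar build, which fits the picture of the vector instructions being individually expensive on this core.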


While these numbers are disappointing from a cycle count standpoint, that is more likely an artifact of the design of the k1 processor than a weakness in the compiler's vector code generation. I (Jeff) strongly suspect that if we had thorough documentation on the k1 uarch and exposed its relevant characteristics to the vectorizer's cost model, we would find that most vectorization opportunities would be considered unprofitable and dropped.

Stakeholders/Partners

RISE:

Ventana: Robin Dapp – lead developer

Rivos:


External:

Rivai: Juzhe




Dependencies


Status


Development


Development Timeline

1H2024

Upstreaming


Upstream Version

gcc-14

Spring 2024



Contacts

Robin Dapp (Ventana)

Jeff Law (Ventana)


Dependencies

Closure needs

Performance testing




Updates