CT_00_016 -- Vectorize WRF benchmark from SPEC CPU 2017

About

The WRF benchmark in SPEC CPU 2017 is reasonably friendly to vectorization; performance gains relative to a single-FPU scalar implementation should be on the order of 40%. Verify that the benchmark vectorizes and sees a comparable performance improvement on RISC-V.




Stakeholders/Partners

RISE:

Ventana: Robin Dapp – lead developer

Ventana: Jeff Law


External:

Rivai: Juzhe



Dependencies


Status

Development

COMPLETE


Development Timeline

1H2024

Upstreaming

COMPLETE


Upstream Version

gcc-14

Spring 2024




Contacts

Robin Dapp (Ventana)

Jeff Law (Ventana)


Dependencies

Closure needs

performance testing



Updates

 

  • This benchmark performs an unaligned vector access (aligned to less than the element size), which faults on the k1. Work is underway to make the compiler less aggressive about allowing unaligned vector memory accesses. Once that work lands, we will retest wrf to compare results with and without vectorization. On the k1 we expect a significant reduction in dynamic instructions, but performance (as measured by cycles) may well regress due to the design of the k1 vector unit.

 

  • Currently seeing a 46% reduction in dynamic instructions from vectorization on RISC-V
    • Actual improvement from vectorization seen on x86_64 – 37%
    • Actual improvement from vectorization seen on aarch64 – 37%
    • Note that the RISC-V figure is a dynamic instruction count, while the x86_64 and aarch64 figures are actual performance improvements
    • The dynamic instruction count improvement needs to be at or better than the real improvement seen on the competing architectures
    • Conclusion: Hitting the mark for this phase. Next step is to verify performance on real hardware

 

  • Project reported as a priority for 1H2024
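
The phase goal check described in the 46% update above can be sketched in C as follows. This is only an illustration of the arithmetic, not tooling used by the project: the function names `dyn_insn_reduction_pct` and `meets_phase_goal` are hypothetical, and the sketch assumes the target is the larger of the two improvements measured on x86_64 and aarch64.

```c
#include <assert.h>

/* Hypothetical helper: percentage reduction in dynamic instruction count
   when moving from the scalar build to the vectorized build. */
static double dyn_insn_reduction_pct(unsigned long long scalar_insns,
                                     unsigned long long vector_insns)
{
    assert(scalar_insns > 0);  /* avoid dividing by zero */
    return 100.0 * ((double)scalar_insns - (double)vector_insns)
                 / (double)scalar_insns;
}

/* Hypothetical helper: the phase goal is met when the RISC-V dynamic
   instruction reduction is at or better than the measured improvement on
   both reference architectures (assumption: compare against the larger). */
static int meets_phase_goal(double riscv_reduction_pct,
                            double x86_64_improvement_pct,
                            double aarch64_improvement_pct)
{
    double target = x86_64_improvement_pct > aarch64_improvement_pct
                        ? x86_64_improvement_pct
                        : aarch64_improvement_pct;
    return riscv_reduction_pct >= target;
}
```

With the figures reported above, `meets_phase_goal(46.0, 37.0, 37.0)` is true. A dynamic instruction reduction is not a cycle-count improvement, so this check only covers the current phase and says nothing about performance on real hardware.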