About
Of the FP benchmarks within spec2017, lbm has the highest potential for vectorization. On other architectures, improvements of greater than 2X can be seen when enabling autovectorization. The key routine to vectorize is "LBM_performStreamCollideTRT", and I don't think it's being vectorized at all at this time.
The WRF benchmark in spec2017 is reasonably friendly for vectorization; performance gains relative to a single-FPU scalar implementation should be on the order of 40%. Verify that the benchmark vectorizes and sees a comparable performance improvement on RISC-V.
Stakeholders/Partners
RISE:
Ventana: Robin Dapp – lead developer
Ventana: Jeff Law
External:
Rivai: Juzhe
Dependencies
Status
...
Updates
- This benchmark does an unaligned vector access (less than element alignment), which faults on the k1. Work is underway to be less aggressive about allowing unaligned vector memory accesses; after that work lands, we will retest wrf to compare results with and without vectorization. On the k1, we expect a significant reduction in dynamic instructions, but performance (as measured by cycles) may well show a regression due to the design of the k1 vector unit.
- Currently seeing a 46% reduction in dynamic instructions
- Actual improvement from vectorization seen on x86_64 – 37%
- Actual improvement from vectorization seen on aarch64 – 37%
- Again, we're counting dynamic instructions on RISC-V and actual improvement on the competitive architectures
- The dynamic instruction count improvement on RISC-V needs to be at or better than the real improvement seen on the competitive architectures
- Conclusion: Hitting the mark for this phase. Next step is to verify performance on real hardware
- Project reported as a priority for 1H2024
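
The phase criterion in the updates above (the RISC-V dynamic instruction reduction should be at or better than the real improvement measured on the competitive architectures) can be written down as a small check. This is only an illustrative sketch: the function name and data layout are hypothetical, the numbers are the ones quoted in the updates, and a reduction in dynamic instructions is not the same thing as a cycle-level speedup, which still has to be confirmed on real hardware.

```python
# Hypothetical helper, not part of any project tooling: it only encodes the
# "at or better than" comparison described in the updates above.
def meets_phase_target(riscv_dyn_inst_reduction_pct, reference_improvements_pct):
    """Return True if the RISC-V dynamic instruction reduction (percent)
    is at or above every reference improvement (percent)."""
    return all(
        riscv_dyn_inst_reduction_pct >= improvement
        for improvement in reference_improvements_pct.values()
    )


# Figures quoted in the updates: 46% fewer dynamic instructions on RISC-V,
# 37% real improvement on both x86_64 and aarch64.
reference = {"x86_64": 37.0, "aarch64": 37.0}
riscv_reduction = 46.0

print(meets_phase_target(riscv_reduction, reference))
```

With the figures from the updates, the check passes, which matches the stated conclusion for this phase.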
...