CT_00_028 -- Investigate and improve Scalar code generation for cactuBSSN

About

Per a recent competitive analysis of SPEC 2017 on RISC-V RVV vs. aarch64 SVE2, measured as QEMU dynamic icounts, the worst-performing benchmark is Cactu (2,852,277,890,338 vs. 1,363,212,534,747, respectively).
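To put the two icounts in proportion, the short calculation below derives the ratio and the absolute gap. It assumes the first figure is the RISC-V count and the second the aarch64 count, following the order in which the targets are named; the variable names are illustrative only.

```python
# Dynamic instruction counts quoted above (QEMU icounts).
# Assumption: first figure is RISC-V RVV, second is aarch64 SVE2.
riscv_rvv_icount = 2_852_277_890_338
aarch64_sve2_icount = 1_363_212_534_747

# How many times more instructions RISC-V executes than aarch64.
ratio = riscv_rvv_icount / aarch64_sve2_icount

# Absolute number of extra instructions executed on RISC-V.
excess = riscv_rvv_icount - aarch64_sve2_icount

print(f"ratio:  {ratio:.2f}x")
print(f"excess: {excess:,} instructions")
```

Under that assumption RISC-V executes roughly 2.09x the instructions, which is why halving the count would bring the two targets close to parity.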

Cactu has a lot of array and stack accesses (register spills). Both involve base + offset computations, which are generated poorly on RISC-V. This is in part due to the ISA providing only simple addressing modes for loads/stores and a very limited S12 (signed 12-bit) constant encoding for ALU insns such as ADD. As a result, computing the effective address takes a 3-4 instruction sequence where aarch64 needs only one, e.g.

ldr x1, [x21, x1, lsl #3]
sh3add a4, a4, a3
add    a4, a4, a1
ld     a4, 280(a4)

li     a3,4096
addi   a3,a3,-2048 # 2048
addi   a5,sp,1664  # sp + 1664
add    a3,a3,a5    # sp + 3712
fsd    fs2,0(a3)


  • (a) For constant offsets (and constants in general), gcc follows the 4096 +/- addend idiom to materialize constants larger than S12.
    • This can be optimized for a certain class of constants which can be expressed as the sum of two S12 values, by not generating a standalone constant and instead fusing the two S12 parts into the operations (this also avoids clobbering a reg)
    • This would help both array and stack accesses, any base + offset computation.
  • (d) Cactu spills are coming from sched1 not reducing live range. Debug/fix PR/114729
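The sum-of-two-S12 idea in (a) can be sketched as follows. This is a minimal model, not GCC's implementation: the helper names are hypothetical, and the choice of 2047 (or -2048) as the first part is just one valid split; the split GCC actually picks may differ.

```python
# Range of a RISC-V signed 12-bit immediate (S12).
S12_MIN, S12_MAX = -2048, 2047


def fits_s12(value: int) -> bool:
    """True if value can be encoded directly as an S12 immediate."""
    return S12_MIN <= value <= S12_MAX


def split_sum_of_two_s12(offset: int):
    """Split offset into (first, second), both S12, with first + second == offset.

    Returns None when no split is needed (offset already fits S12) or
    when no split is possible (offset is outside [-4096, 4094]).
    Hypothetical helper for illustration only.
    """
    if fits_s12(offset):
        return None  # a single immediate already suffices
    if 2 * S12_MIN <= offset <= 2 * S12_MAX:
        # Saturate the first part; the remainder is then guaranteed to fit.
        first = S12_MAX if offset > 0 else S12_MIN
        return first, offset - first
    return None  # needs a materialized constant (e.g. li + add)


# The stack example above stores to sp + 3712 (4096 - 2048 + 1664).
print(split_sum_of_two_s12(3712))
```

For the `sp + 3712` store shown earlier, a split such as (2047, 1665) would let the first part go into an `addi` and the second into the store's own offset field, so two instructions could replace the five-instruction sequence.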

Initial work on (a) seems promising, as we are able to shave off around 290 billion instructions, or 10% of Cactu. Note however that this is still "damage control", as the heart of the matter is the extraneous reloads.


It's believed that roughly half of the instructions executed are directly or indirectly related to spilling.  If the first pass scheduler is turned off, then the dynamic instruction count drops from ~2.5T instructions to 1.2T instructions which could bring RISC-V roughly on-par with AArch64.  


It has been observed during testing that the new pattern to allow add with a larger range of constants inhibits shNadd instruction generation in some circumstances, including within a semi-hot loop in 557.xz.  This can be easily addressed with two additional patterns that Ventana has written and started testing internally.

Stakeholders/Partners

RISE:

Rivos: Vineet Gupta

Ventana: Jeff Law – general oversight / guidance, testing, regression fixing, etc.

External:


Dependencies



Status

Development: DONE

Development Timeline: 1H2024

Upstreaming: DONE

Upstream Version: gcc-15 (target), Spring 2025





Contacts

Vineet Gupta (Rivos)

Jeff Law (Ventana)


Dependencies: None


Updates

  • Marking this effort as done.  There's a separate follow-up for 2025 projects. 

 

  • PR/114729: The patch which prevents the deliberate spill has been merged (g7bef3482f27). This brings Cactu icounts to 2.1 trillion (from 2.6 trillion).
  • Another half a trillion still needs to be recovered; more work is needed.

 

  • Vineet has posted a patch series that significantly improves the performance of cactuBSSN on RISC-V as well as AArch64 platforms
    • The first of the two notable changes appears to have broad consensus to move forward after some naming & documentation changes
    • The second patch appears to be somewhat more controversial and will likely need more significant revisions

 

  • Ventana has upstreamed a potential fix for the 557.xz performance regression
  • Ventana has also identified a likely performance regression in 502.gcc, currently under investigation
    • Failure to simplify a fp + C1 address computation that is used later in a temp + C2 memory reference.
    • After register elimination that turns into fp + C1' + C2, where C1' + C2 should simplify enough to fit into a simm12, but doesn't
    • Current thinking is that it looks like a fold-mem-offsets (f-m-o) failure
  • Rivos (Vineet) making slow progress on the larger problem of why we're running out of registers
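The 502.gcc address-folding issue described above can be modeled in a few lines. This is only a sketch of the arithmetic involved, not of the GCC pass: `fold_offsets` is a hypothetical helper and the constants are made-up values chosen to show the two outcomes.

```python
def fits_simm12(value: int) -> bool:
    """True if value fits a RISC-V signed 12-bit immediate."""
    return -2048 <= value <= 2047


def fold_offsets(c1: int, c2: int):
    """Model folding `temp = fp + c1; mem[temp + c2]` into `mem[fp + (c1 + c2)]`.

    Returns the combined offset when it fits simm12, otherwise None.
    Hypothetical helper for illustration only.
    """
    combined = c1 + c2
    return combined if fits_simm12(combined) else None


# Made-up constants: 2064 alone does not fit simm12, but 2064 - 32 does,
# so the separate address computation could be folded away.
print(fold_offsets(2064, -32))
# Here the sum is still out of range, so no folding is possible.
print(fold_offsets(2064, 16))
```

The regression corresponds to the first case: the combined constant would fit, but the compiler still keeps the separate address computation.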

 

  • Item (a) above addressed: the patch to improve array/stack spill code (base+offset) where the offset can be expressed as the sum of two S12 values is now merged
  • This addresses PR/106265: As reported previously this shaves off 290 billion dynamic icounts from Cactu - roughly 10% for zba_zbb_zbs_zicond build
  • There's a follow-up patch to apply the same semantics to prologue/epilogue expansion, which is currently failing a Fortran test.

 

  • 557.xz is negatively impacted by Vineet's patch to improve spill code generation.
  • Essentially it's disturbing GCC's ability to create shNadd insns when the addend is a constant such as 2048.
  • An additional pair of define_insn_and_split patterns can easily fix this based on Ventana's internal testing.

 

  • sched1 failing to reduce live range, leading to needless new pseudo allocation in inner loop, causing spill in outer loop.
  • Tracking the investigation as PR/114729

 

  • Able to reduce a source file from cactu so that it exhibits just one stack spill with -fschedule-insns and none with -fno-schedule-insns.
  • The actual spill insn is emitted by IRA.

 

  • Added note about potential to cut the dynamic instruction count in half and bring this benchmark roughly on-par with AArch64.

 

  • Vineet has posted work to adjust the quality of spill code which eliminates ~300b instructions (10%) from the cactu benchmark.  Showed ~3.5% cycle improvement in Ventana's testing
    • General consensus to move forward.  Minor updates planned.
  • Vineet is going to explore the spilling issue.   Some suspicion this may intersect with "pending list queue" flushing issue in the scheduler.  Hoping that's the case as it's trivial to adjust the size of the pending list.

 

  • Set up this work entry.