While porting x264, the standard H.264 encoder, to RISC-V, we identified several operations that are difficult to implement efficiently with the existing RVV instructions. In some cases, an implementation needs too many instructions or extra transfers to and from memory, which can hurt encoder performance.
...
This is an open collaboration. All ideas and contributions are valuable as we work together to enhance RISC-V's video codec capabilities.
Contact Information
- Yin Tong - yintong.ustc@bytedance.com
- Jiayan Qian - qianjiayan.1@bytedance.com
- Punit Agrawal - punit.agrawal@bytedance.com
...
- Vector transpose
- Absolute difference
- Zero-extended vmv.x.s
- Rounded Shift Right Narrow
- Signed saturate and Narrow to Unsigned
1. Vector transpose instructions
...
In other ISAs, matrix transposition is usually implemented in one of two ways. Below, we introduce these methods using AArch64 and LoongArch as examples. The implementation in x86 is similar to LoongArch, while the implementation in ARM is similar to AArch64.
...
AArch64
In AArch64, there are `trn1` and `trn2` instructions. By combining one `trn1` and one `trn2`, multiple 2x2 matrix transpositions can be completed between two vector registers. Larger matrix transpositions can be achieved by repeatedly applying 2x2 matrix transpositions at different scales. The AArch64 transpose macro implementation in x264 is as follows:
...
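To make the `trn1` / `trn2` idea described above concrete, here is a minimal scalar sketch in C. It is not taken from x264: the helper names `trn` and `transpose4x4_trn` are hypothetical, and the model assumes four 16-bit lanes per register. The `group` parameter stands in for the element arrangement (1 lane for a 16-bit arrangement, 2 lanes for a 32-bit arrangement over the same bits).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 4

/* Scalar model of trn1 (odd = 0) and trn2 (odd = 1) on LANES 16-bit lanes.
 * `group` is the element width measured in lanes (hypothetical parameter):
 * the result interleaves the even (or odd) elements of n and m. */
static void trn(uint16_t *d, const uint16_t *n, const uint16_t *m,
                int group, int odd)
{
    uint16_t tmp[LANES];
    assert(group > 0 && LANES % (2 * group) == 0);
    for (int p = 0; p < LANES / (2 * group); p++) {
        const int src = (2 * p + odd) * group;  /* even or odd source element */
        memcpy(&tmp[(2 * p) * group],     &n[src], (size_t)group * sizeof(uint16_t));
        memcpy(&tmp[(2 * p + 1) * group], &m[src], (size_t)group * sizeof(uint16_t));
    }
    memcpy(d, tmp, sizeof(tmp));
}

/* 4x4 transpose built from 2x2 transposes at two scales. */
static void transpose4x4_trn(uint16_t r[4][LANES])
{
    uint16_t t[4][LANES];

    /* Scale 1: 2x2 transposes of single 16-bit elements. */
    trn(t[0], r[0], r[1], 1, 0);
    trn(t[1], r[0], r[1], 1, 1);
    trn(t[2], r[2], r[3], 1, 0);
    trn(t[3], r[2], r[3], 1, 1);

    /* Scale 2: 2x2 transposes of 32-bit pairs. */
    trn(r[0], t[0], t[2], 2, 0);
    trn(r[2], t[0], t[2], 2, 1);
    trn(r[1], t[1], t[3], 2, 0);
    trn(r[3], t[1], t[3], 2, 1);
}
```

Each scale needs one `trn1` and one `trn2` per register pair, so a 4x4 transpose costs eight register-to-register operations and no memory traffic in this model.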
Using RISC-V RVV, we have found three methods to perform matrix transposition (thanks to camel-cdr for the assistance):
- Using segmented load or store
- Using vrgather
- Using vnsrl
Here, we use the example of transposing a 4x8 (2x4x4) matrix (transposing the left 4x4 and the right 4x4 separately) to illustrate these methods.
Segmented load or store
With this method, we use the `vssseg4e16.v` instruction to store each row of the original matrix into memory by columns, and then read the data back by rows. Since we are transposing a 4x8 matrix, we also need `vslideup` to combine the contents of two registers.
```asm
// Using extra loads and stores, and use vslide to combine them
.macro TRANSPOSE4x8_16 buf, bstride, v0, v1, v2, v3, t0, t1, t2, t3
    vssseg4e16.v    \v0, (\buf), \bstride
    vsetivli        zero, 4, e16, mf2, ta, ma
    vle16.v         \v0, (\buf)
    add             \buf, \buf, \bstride
    vle16.v         \v1, (\buf)
    add             \buf, \buf, \bstride
    vle16.v         \v2, (\buf)
    add             \buf, \buf, \bstride
    vle16.v         \v3, (\buf)
    add             \buf, \buf, \bstride
    vle16.v         \t0, (\buf)
    add             \buf, \buf, \bstride
    vle16.v         \t1, (\buf)
    add             \buf, \buf, \bstride
    vle16.v         \t2, (\buf)
    add             \buf, \buf, \bstride
    vle16.v         \t3, (\buf)
    add             \buf, \buf, \bstride
    vsetivli        zero, 2, e64, m1, tu, ma
    vslideup.vi     \v0, \t0, 1
    vslideup.vi     \v1, \t1, 1
    vslideup.vi     \v2, \t2, 1
    vslideup.vi     \v3, \t3, 1
.endm

// under VLEN=128
function transpose4x8_16_one
    vsetivli        zero, 8, e16, m1, ta, ma
    mv              t0, a0
    vl4re16.v       v0, (a0)
    li              t1, 8
    TRANSPOSE4x8_16 t0, t1, v0, v1, v2, v3, v8, v9, v10, v11
    vs4r.v          v0, (a0)
    ret
endfunc
```
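The data movement of the segmented-store method can be followed more easily in a scalar model. The C sketch below is only an illustration under stated assumptions: `segment_store4` and `transpose4x8_via_memory` are hypothetical names, the stride is 8 bytes as in the macro call above, and the final two copies per row stand in for the `vle16.v` loads plus the `vslideup.vi` merge.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { ROWS = 4, COLS = 8 };

/* Model of a 4-field strided segment store with vl = COLS:
 * segment i holds element i of each of the 4 source rows and is
 * written at byte offset i * stride. */
static void segment_store4(uint8_t *buf, size_t stride, uint16_t rows[ROWS][COLS])
{
    /* Segments must not overlap for the read-back to work. */
    assert(stride >= ROWS * sizeof(uint16_t));
    for (int i = 0; i < COLS; i++)
        for (int f = 0; f < ROWS; f++)
            memcpy(buf + (size_t)i * stride + (size_t)f * sizeof(uint16_t),
                   &rows[f][i], sizeof(uint16_t));
}

/* Transpose the left and right 4x4 halves of a 4x8 matrix through memory. */
static void transpose4x8_via_memory(uint16_t m[ROWS][COLS])
{
    uint8_t buf[COLS * ROWS * sizeof(uint16_t)];
    const size_t stride = ROWS * sizeof(uint16_t);  /* 8 bytes */

    segment_store4(buf, stride, m);
    for (int r = 0; r < ROWS; r++) {
        /* Low half: column r of the left block. */
        memcpy(&m[r][0], buf + (size_t)r * stride, stride);
        /* High half: column r of the right block. */
        memcpy(&m[r][4], buf + (size_t)(r + 4) * stride, stride);
    }
}
```

The model makes the cost visible: every element takes a round trip through the buffer, which is the memory traffic this method pays for.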
...
To create the index by hand, the idea is to set the index for gathering vector `N` to `(i&3)*vl+(i&~3u)+N`, where `i` is the element index obtained by `vid.v`.
```asm
// Using vrgather with index created by hand
.macro TRANSPOSE4x8_16_vrgather v0, v1, v2, v3, t0, t1, t2, t3, t4, t5, t6, t7, s0
    vsetivli        zero, 8, e16, m1, ta, ma
    vid.v           \t0
    li              \s0, 8
    vand.vi         \t1, \t0, 3
    vmul.vx         \t1, \t1, \s0
    vand.vi         \t0, \t0, -4
    vadd.vv         \t4, \t1, \t0
    vadd.vi         \t5, \t4, 1
    vadd.vi         \t6, \t4, 2
    vadd.vi         \t7, \t4, 3
    li              \s0, 32
    vsetvli         zero, \s0, e16, m4, ta, ma
    vrgatherei16.vv \t0, \v0, \t4
    vmv.v.v         \v0, \t0
.endm

// under VLEN=128
function transpose4x8_16_two
    vl4re16.v       v0, (a0)
    TRANSPOSE4x8_16_vrgather v0, v1, v2, v3, v8, v9, v10, v11, v12, v13, v14, v15, t0
    vs4r.v          v0, (a0)
    ret
endfunc
```
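The index formula can be checked with a small scalar model. The following C sketch assumes `vl = 8` and four source registers treated as one flattened group of 32 elements, as in the LMUL=4 gather above. The names `gather_index` and `transpose4x8_gather` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { VL = 8, NVEC = 4 };

/* Source index (into the flattened 4-register group) for element i of
 * output vector n, following (i&3)*vl + (i&~3u) + N. */
static unsigned gather_index(unsigned i, unsigned n)
{
    return (i & 3u) * VL + (i & ~3u) + n;
}

/* Scalar model of the single LMUL=4 gather: out[k] = in[idx[k]]. */
static void transpose4x8_gather(uint16_t out[NVEC * VL],
                                const uint16_t in[NVEC * VL])
{
    for (unsigned n = 0; n < NVEC; n++)
        for (unsigned i = 0; i < VL; i++) {
            const unsigned idx = gather_index(i, n);
            assert(idx < NVEC * VL);  /* every index stays inside the group */
            out[n * VL + i] = in[idx];
        }
}
```

Reading the formula term by term: `(i&3)*vl` selects the source row, `(i&~3u)` selects the left or right 4x4 block, and `N` selects the column inside that block.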
...
This is also one of the main reasons why we want to add instructions similar to `trn1` and `trn2` in RVV.
Vnsrl
Olaf pointed out a new method to achieve matrix transposition, using the vnsrl instruction in RVV along with vslide instructions to achieve the effect of zip1 and zip2 in AArch64. Olaf provided detailed information for this method, and we are very grateful for his work. Below is an approach that works with VLEN=128:
...
Implementation in other ISAs
...
AArch64
AArch64 has several instructions for calculating absolute differences, selected by the signedness and data type of the input and output:
- SABD / UABD - signed / unsigned absolute difference
- SABDL / UABDL - signed / unsigned absolute difference (double-width result)
- SABA / UABA - signed / unsigned absolute difference and add
- SABAL / UABAL - signed / unsigned absolute difference (double-width result) and add
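As a reference for what these instructions compute per element, here is a minimal scalar sketch in C. The function names are hypothetical and the model handles one lane at a time. `sad_u8` shows how the widening accumulate form is typically chained into a sum of absolute differences; the bound on `n` is an assumption of this sketch so that the 16-bit accumulator cannot wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* UABD: unsigned absolute difference, same width as the inputs. */
static uint8_t uabd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : b - a);
}

/* SABD: signed absolute difference; the result always fits in 8 bits
 * when read as an unsigned value (at most 255). */
static uint8_t sabd_s8(int8_t a, int8_t b)
{
    const int d = a - b;
    return (uint8_t)(d < 0 ? -d : d);
}

/* UABAL: absolute difference widened to 16 bits and added to acc. */
static uint16_t uabal_u8(uint16_t acc, uint8_t a, uint8_t b)
{
    return (uint16_t)(acc + uabd_u8(a, b));
}

/* Sum of absolute differences over n pixels using the accumulate form. */
static unsigned sad_u8(const uint8_t *p, const uint8_t *q, size_t n)
{
    uint16_t acc = 0;
    assert(n <= 257);  /* 257 * 255 = 65535 still fits in 16 bits */
    for (size_t i = 0; i < n; i++)
        acc = uabal_u8(acc, p[i], q[i]);
    return acc;
}
```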
x86
Compute sum of absolute differences: `psadbw`
Implementation in RISCV64
...
```asm
// uint16_t with zbb extension
vsetivli    zero, 1, e16, m1, ta, ma
vmv.x.s     a1, v1
zext.h      a1, a1
```
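The extra `zext.h` is needed because `vmv.x.s` sign-extends the element to XLEN when SEW is narrower than XLEN. A minimal C model of the two steps on RV64, with hypothetical function names, might look like this:

```c
#include <assert.h>
#include <stdint.h>

/* vmv.x.s with SEW=16 on RV64: element 0 is sign-extended to 64 bits. */
static uint64_t vmv_x_s_e16(uint16_t elem0)
{
    const uint64_t high = (elem0 & 0x8000u) ? ~(uint64_t)0xFFFFu : 0;
    return high | elem0;
}

/* zext.h (Zbb): keep bits 15..0 and clear bits 63..16. */
static uint64_t zext_h(uint64_t x)
{
    return x & 0xFFFFu;
}

/* The two-instruction sequence: read lane 0 as an unsigned 16-bit value. */
static uint64_t read_u16_lane0(uint16_t elem0)
{
    const uint64_t r = zext_h(vmv_x_s_e16(elem0));
    assert(r == elem0);  /* zero-extended result equals the original element */
    return r;
}
```

For an element such as `0x8001`, the first step yields `0xFFFFFFFFFFFF8001` and only the second step recovers `0x8001`, which is why unsigned pixel data costs one more instruction per scalar read.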
4. Rounded Shift Right Narrow
Introduction
RVV 1.0 has instructions for:
- shift + scaling: vssrl / vssra
- shift + narrow: vnsrl
- clip + narrow: vnclip
Implementation in RISCV64
```asm
// AArch64 implementation
rshrn       v20.8b, v20.8h, #3
rshrn2      v20.16b, v21.8h, #3

// RISCV64 implementation
// Assumes vxrm is set to rnu (round-to-nearest-up) so that vssrl rounds like rshrn
vsetivli    zero, 8, e16, m1, ta, ma
vssrl.vi    v20, v20, 3
vssrl.vi    v21, v21, 3
vsetivli    zero, 8, e8, mf2, ta, ma
vncvt.x.x.w v20, v20
vncvt.x.x.w v21, v21
vsetivli    zero, 16, e8, m1, ta, ma
vslideup.vi v20, v21, 8
```
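The equivalence between the single AArch64 instruction and the RVV sequence can be stated as a scalar model. The C sketch below uses hypothetical function names and assumes the RVV fixed-point rounding mode is round-to-nearest-up (`vxrm = rnu`). It models one 16-bit lane narrowed to 8 bits: both paths round, then truncate without saturating.

```c
#include <assert.h>
#include <stdint.h>

/* rshrn for one lane: add half of the shifted-out range, shift right,
 * then narrow by truncation (no saturation). */
static uint8_t rshrn_u16(uint16_t x, unsigned shift)
{
    assert(shift >= 1 && shift <= 8);
    const uint32_t rounded = ((uint32_t)x + (1u << (shift - 1))) >> shift;
    return (uint8_t)rounded;
}

/* Two-step model of the RVV sequence: scaling shift with rnu rounding
 * (add the most significant shifted-out bit), then a truncating narrow. */
static uint8_t vssrl_then_narrow(uint16_t x, unsigned shift)
{
    assert(shift >= 1 && shift <= 8);
    const uint16_t scaled =
        (uint16_t)((x >> shift) + ((x >> (shift - 1)) & 1u));
    return (uint8_t)scaled;
}
```

With a different `vxrm` setting the second function would no longer match the first, so code using this sequence has to make sure the rounding mode is the expected one.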
5. Signed saturate and Narrow to Unsigned
Introduction
Implementation in RISCV64
...
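For reference, the operation named in this section's title can be written as a scalar model. The C sketch below is an illustration only: the function names are hypothetical, and it assumes a 16-bit signed input narrowed to an 8-bit unsigned output, which is the common case when converting intermediate results back to pixels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a signed 16-bit value to [0, 255] and narrow it to 8 bits. */
static uint8_t sat_s16_to_u8(int16_t x)
{
    if (x < 0)
        return 0;
    if (x > UINT8_MAX)
        return UINT8_MAX;
    return (uint8_t)x;
}

/* Apply the saturating narrow to a row of n values. */
static void sat_narrow_row(uint8_t *dst, const int16_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = sat_s16_to_u8(src[i]);
}
```

The lower bound is what distinguishes this from an unsigned clip: negative inputs must become 0 rather than being reinterpreted as large unsigned values.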