...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
While porting the H.264 encoder x264 to RISC-V, we have identified several operations that are challenging to implement efficiently with existing RVV instructions. In some cases, implementations require too many instructions and/or transfers to/from memory, potentially impacting encoder performance.
On this page, we would like to document these operations to -
Summarize the need for these operations - both for H.264 and hopefully for other multimedia projects too
Contrast with existing support on other architectures
Be a basis for discussion about efficient implementations - both in software and hardware.
For operations that cannot be efficiently implemented in RISC-V, we would like to propose new instructions for video encoding and decoding to boost RISC-V's performance in this domain. We hope that experience from across the broader multimedia / codec project ecosystem can help guide improvements to RISC-V.
Please do reach out to the members below or the RISE Systems Libraries WG if you have suggestions for better implementations of the operations documented here, or if you have come across operations that are needed for multimedia workloads but not well supported today.
This is an open collaboration. All ideas and contributions are valuable as we work together to enhance RISC-V's video codec capabilities.
Contact
...
Information
Yin Tong - yintong.ustc@bytedance.com
Jiayan Qian - qianjiayan.1@bytedance.com
Punit Agrawal - punit.agrawal@bytedance.com
Collection list
Vector transpose
Absolute difference
Zero-extended vmv.x.s
Rounded Shift Right Narrow
Signed saturate and Narrow to Unsigned
1. Vector transpose instructions
...
AArch64
In AArch64, there are trn1
and trn2
instructions. By combining one trn1 and one trn2, multiple 2x2 matrix transpositions can be completed between two vector registers. Larger matrix transpositions can be achieved by repeatedly applying 2x2 transpositions at different element widths. The AArch64 transpose macro implementation in x264 is as follows:
...
In LoongArch, matrix transposition is implemented using the interleave method.
vilvl (Vector Interleave Low)
vilvh (Vector Interleave High)
LoongArch's 4x4 transpose macro implementation in x264 is as follows:
...
These two instructions in LoongArch are essentially the same as zip1 and zip2 in AArch64. Similarly, the punpckl / h instructions in x86 exhibit the same behavior. In x264, x86 also uses punpckl / h for matrix transposition.
...
Using RISC-V RVV, we have found three methods to perform matrix transposition (thanks to camel-cdr for the assistance provided):
Using segmented load or store
Using vrgather
Using vnsrl
Here, we use the example of transposing a 4x8 (2x4x4) matrix (transposing the left 4x4 and the right 4x4 separately) to illustrate the first two methods.
Segmented load or store
With this method, we can use the `vssseg4e16.v` instruction to store each row of the original matrix into memory by columns, and then read the data back by rows. Since we are transposing a 4x8 matrix, we also need `vslideup` to combine the contents of two registers.
Code Block |
---|
// Using extra loads and stores, and use vslide to combine them
.macro TRANSPOSE4x8_16 buf, bstride, v0, v1, v2, v3, t0, t1, t2, t3
    vssseg4e16.v \v0, (\buf), \bstride
    vsetivli zero, 4, e16, mf2, ta, ma
    vle16.v \v0, (\buf)
    add \buf, \buf, \bstride
    vle16.v \v1, (\buf)
    add \buf, \buf, \bstride
    vle16.v \v2, (\buf)
    add \buf, \buf, \bstride
    vle16.v \v3, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t0, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t1, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t2, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t3, (\buf)
    add \buf, \buf, \bstride
    vsetivli zero, 2, e64, m1, tu, ma
    vslideup.vi \v0, \t0, 1
    vslideup.vi \v1, \t1, 1
    vslideup.vi \v2, \t2, 1
    vslideup.vi \v3, \t3, 1
.endm

// under VLEN=128
function transpose4x8_16_one
    vsetivli zero, 8, e16, m1, ta, ma
    mv t0, a0
    vl4re16.v v0, (a0)
    li t1, 8
    TRANSPOSE4x8_16 t0, t1, v0, v1, v2, v3, v8, v9, v10, v11
    vs4r.v v0, (a0)
    ret
endfunc |
...
For creating the index by hand, the idea is to set the index for gathering vector N to (i & 3) * vl + (i & ~3u) + N, where i is the element index obtained by vid.v.
Code Block |
---|
// Using vrgather with index created by hand
.macro TRANSPOSE4x8_16_vrgather v0, v1, v2, v3, t0, t1, t2, t3, t4, t5, t6, t7, s0
    vsetivli zero, 8, e16, m1, ta, ma
    vid.v \t0
    li \s0, 8
    vand.vi \t1, \t0, 3
    vmul.vx \t1, \t1, \s0
    vand.vi \t0, \t0, -4
    vadd.vv \t4, \t1, \t0
    vadd.vi \t5, \t4, 1
    vadd.vi \t6, \t4, 2
    vadd.vi \t7, \t4, 3
    li \s0, 32
    vsetvli zero, \s0, e16, m4, ta, ma
    vrgatherei16.vv \t0, \v0, \t4
    vmv.v.v \v0, \t0
.endm

// under VLEN=128
function transpose4x8_16_two
    vl4re16.v v0, (a0)
    TRANSPOSE4x8_16_vrgather v0, v1, v2, v3, v8, v9, v10, v11, v12, v13, v14, v15, t0
    vs4r.v v0, (a0)
    ret
endfunc |
...
This is also one of the main reasons why we want to add instructions similar to `trn1` and `trn2` in RVV.
Vnsrl
Olaf pointed out a new method to achieve matrix transposition, using the vnsrl instruction in RVV along with vslide instructions to achieve the effect of zip1 and zip2 in AArch64. Olaf provided detailed information for this method, and we are very grateful for his work. Below is an approach that works with VLEN=128:
Code Block |
---|
# VLEN=128
# transpose one 4x4 matrix of 16-bit elements stored in 4 vreg:
# a b c d            a e i m
# e f g h   -----\   b f j n
# i j k l   -----/   c g k o
# m n o p            d h l p
## setup code:
# li t1, 32
vsetvli t0, x0, e32, m1, ta, ma
vslideup.vi v0, v1, 2
vslideup.vi v2, v3, 2
vmv1r.v v1, v2
# v0: a b c d e f g h
# v1: i j k l m n o p
vnsrl.wi v4, v0, 0
vnsrl.wx v6, v0, t1
# v4: a b e f i j m n
# v6: c d g h k l o p
vsetvli t0, x0, e16, mf2, ta, ma
vnsrl.wi v0, v4, 0
vnsrl.wi v1, v4, 16
vnsrl.wi v2, v6, 0
vnsrl.wi v3, v6, 16
# v0: a e i m
# v1: b f j n
# v2: c g k o
# v3: d h l p |
Proposal
...
vtrn1.vv: Interleave alternating even elements from the first and second source vectors and place them in the destination vector (elements from the first source vector are placed in the even index positions, and elements from the second source vector in the odd index positions).
vtrn2.vv: Interleave alternating odd elements from the first and second source vectors and place them in the destination vector (elements from the first source vector are placed in the even index positions, and elements from the second source vector in the odd index positions).
Code Block |
---|
vtrn1.vv vd, vs2, vs1 # vd[2i] = vs1[2i] , vd[2i+1] = vs2[2i]
vtrn2.vv vd, vs2, vs1 # vd[2i] = vs1[2i+1] , vd[2i+1] = vs2[2i+1] |
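To illustrate how the proposed instructions would be used, below is a hypothetical sketch of a 4x4 transpose of 16-bit elements in the AArch64 style (vtrn at 16-bit, then at 32-bit granularity). The vtrn1.vv / vtrn2.vv mnemonics are the proposed extension above, not ratified RVV; the operand order follows the semantics in the previous code block, and the register and element-width choices are our assumptions.
Code Block |
---|
// Hypothetical 4x4 transpose of 16-bit elements using the proposed vtrn1.vv / vtrn2.vv
// (sketch only; vtrn1/vtrn2 are the proposed instructions above, not ratified RVV)
// Semantics assumed: vtrn1.vv vd, vs2, vs1 -> vd[2i] = vs1[2i],   vd[2i+1] = vs2[2i]
//                    vtrn2.vv vd, vs2, vs1 -> vd[2i] = vs1[2i+1], vd[2i+1] = vs2[2i+1]
.macro TRANSPOSE4x4_16_VTRN v0, v1, v2, v3, t0, t1, t2, t3
    vsetivli zero, 4, e16, mf2, ta, ma
    vtrn1.vv \t0, \v1, \v0      // 2x2 transposes of 16-bit elements
    vtrn2.vv \t1, \v1, \v0
    vtrn1.vv \t2, \v3, \v2
    vtrn2.vv \t3, \v3, \v2
    vsetivli zero, 2, e32, mf2, ta, ma
    vtrn1.vv \v0, \t2, \t0      // 2x2 transposes of 32-bit element pairs
    vtrn2.vv \v2, \t2, \t0
    vtrn1.vv \v1, \t3, \t1
    vtrn2.vv \v3, \t3, \t1
.endm |
This mirrors the trn1/trn2 structure described in the AArch64 section above.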
We implemented the proposed instructions on GEM5 and evaluated the performance gain.
Performance of Transpose benchmarks
Each transpose benchmark performs the following steps:
load every row into a separate register
do the in-register transpose
store the result back to memory
Code Block |
---|
function transpose4x8_16_vssseg
vsetivli zero, 8, e16, mf2, ta, ma
mv t0, a0
mv t1, a0
vle16.v v0, (a0)
addi t1, t1, 16
vle16.v v1, (t1)
addi t1, t1, 16
vle16.v v2, (t1)
addi t1, t1, 16
vle16.v v3, (t1)
li t2, 8                     # segment stride in bytes (assumed, matching transpose4x8_16_one)
TRANSPOSE4x8_16 t0, t2, v0, v1, v2, v3, v8, v9, v10, v11
vsetivli zero, 8, e16, mf2, ta, ma
vse16.v v0, (a0)
addi a0, a0, 16
vse16.v v1, (a0)
addi a0, a0, 16
vse16.v v2, (a0)
addi a0, a0, 16
vse16.v v3, (a0)
ret
endfunc |
The results of different transpose implementations are as follows:
Transpose benchmarks | Cycles |
TRNS_4x4_16_VSSSEG | 14 |
TRNS_4x4_16_VRGATHER | 17 |
TRNS_4x4_16_VNSRL | 18 |
TRNS_4x4_16_VTRN_Extension | 15 |
TRNS_4x8_16_VSSSEG | 64 |
TRNS_4x8_16_VRGATHER | 19 |
TRNS_4x8_16_VTRN_MACRO | 25 |
TRNS_4x8_16_VTRN_Extension | 15 |
Performance of SATD functions in x264
According to the test on GEM5, a 40% gain can be achieved for larger SATD functions.
...
2. Absolute difference instructions
Introduction
x264 needs widening absolute-difference-accumulate operations, which account for 5%~6% of the running time in both x264 itself and SPEC CPU 525.x264_r.
https://wiki.videolan.org/X264_asm_intro/#Example_2:_pixel_sad
Implementation in other ISAs
AArch64
AArch64 has a few different instructions, based on the signedness and data type of the input and output, to calculate absolute differences:
SABD / UABD - signed / unsigned absolute difference
SABDL / UABDL - signed / unsigned absolute difference (double-width result)
SABA / UABA - signed / unsigned absolute difference and add
SABAL / UABAL - signed / unsigned absolute difference (double-width result) and add
x86
Compute sum of absolute difference: psadbw
Implementation in RISCV64
These operations need 3~4 RVV instructions each to implement:
Code Block |
---|
.macro uabd d0, s0, s1, t0
    vmaxu.vv \d0, \s0, \s1
    vminu.vv \t0, \s0, \s1
    vsub.vv  \d0, \d0, \t0
.endm

.macro sabd d0, s0, s1, t0
    vmax.vv \d0, \s0, \s1
    vmin.vv \t0, \s0, \s1
    vsub.vv \d0, \d0, \t0
.endm

.macro uabal d0, s0, s1, t0, t1
    vmaxu.vv  \t1, \s0, \s1
    vminu.vv  \t0, \s0, \s1
    vsub.vv   \t0, \t1, \t0
    vwaddu.wv \d0, \d0, \t0
.endm

.macro uabdl d0, s0, s1, t0, t1
    vmaxu.vv  \t1, \s0, \s1
    vminu.vv  \t0, \s0, \s1
    vwsubu.vv \d0, \t1, \t0
.endm |
Proposal
Vector Single-Width Signed/Unsigned Integer Absolute Difference
Code Block |
---|
# Unsigned Absolute Difference.
vabdu.vv vd, vs2, vs1, vm   # vd[i] = abs(unsigned(vs2[i]) - unsigned(vs1[i]))
vabdu.vx vd, vs2, rs1, vm   # vd[i] = abs(unsigned(vs2[i]) - unsigned(x[rs1]))
vabdu.vi vd, vs2, imm, vm   # vd[i] = abs(unsigned(vs2[i]) - unsigned(imm)) |
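For context, here is a minimal sketch of how one 16-pixel row of a SAD kernel might look with the proposed vabdu.vv, assuming VLEN >= 128. The pointer registers (a0, a2), the vector register choices, and the 16-bit widening accumulator are illustrative assumptions, not x264 code.
Code Block |
---|
# One 16-pixel row of a SAD kernel using the proposed vabdu.vv (hypothetical)
# assumes VLEN >= 128; a0 = current block pointer, a2 = reference block pointer
vsetivli  zero, 16, e8, m1, ta, ma
vle8.v    v8, (a0)            # 16 pixels from the current block
vle8.v    v9, (a2)            # 16 pixels from the reference block
vabdu.vv  v10, v8, v9         # |cur - ref| in a single instruction (proposed)
vwaddu.wv v16, v16, v10       # accumulate into 16-bit partial sums (v16/v17 group) |
Compared with the uabd macro above (vmaxu + vminu + vsub), the absolute difference itself becomes a single instruction.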
Performance of SAD functions
According to the test on GEM5, a 30% gain can be achieved for larger SAD functions.
...
3. Zero-extended vmv.x.s
Introduction
The vmv.x.s instruction copies a single SEW-wide element from index 0 of the source vector register to a destination
integer register. If SEW > XLEN, the least-significant XLEN bits are transferred and the upper SEW-XLEN bits are ignored. If
SEW < XLEN, the value is sign-extended to XLEN bits.
It is very common to move a uint16_t element from a vector to a scalar register.
Implementation in RISCV64
Code Block |
---|
// uint16_t with Zbb extension
vsetivli zero, 1, e16, m1, ta, ma
vmv.x.s a1, v1
zext.h a1, a1 |
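Without the Zbb extension, the zero-extension needs a shift pair on RV64. A minimal sketch for comparison, using the same registers as the example above:
Code Block |
---|
// uint16_t without Zbb (RV64): clear the upper 48 bits with a shift pair
vsetivli zero, 1, e16, m1, ta, ma
vmv.x.s a1, v1        // result is sign-extended to XLEN
slli    a1, a1, 48
srli    a1, a1, 48    // zero-extend the 16-bit value |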
4. Rounded Shift Right Narrow
Introduction
RVV 1.0 has instructions for -
shift + scaling (with rounding): vssrl / vssra
shift + narrow: vnsrl
clip + narrow: vnclip
But it does not have a "shift + scaling + narrow" instruction.
Implementation in RISCV64
Code Block |
---|
// AArch64 implementation
rshrn v20.8b, v20.8h, #3
rshrn2 v20.16b, v21.8h, #3
// RISCV64 implementation (assumes vxrm = rnu, round-to-nearest-up, to match rshrn rounding)
vsetivli zero, 8, e16, m1, ta, ma
vssrl.vi v20, v20, 3
vssrl.vi v21, v21, 3
vsetivli zero, 8, e8, mf2, ta, ma
vncvt.x.x.w v20, v20
vncvt.x.x.w v21, v21
vsetivli zero, 16, e8, m1, ta, ma
vslideup.vi v20, v21, 8 |
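As a partial workaround with existing RVV: when the shifted values are known to fit into the narrow element type, so that saturation can never trigger, vnclipu folds the rounding shift and the narrowing into one instruction. This is only a sketch under that assumption (vnclipu saturates while rshrn truncates, so results differ if values overflow the narrow type); it reuses the registers from the example above and assumes vxrm = rnu.
Code Block |
---|
// Alternative using existing RVV, valid only when no clipping can occur
// (vnclipu saturates, rshrn truncates; results match only if values fit in 8 bits)
vsetivli    zero, 8, e8, mf2, ta, ma
vnclipu.wi  v20, v20, 3      // rounding shift right by 3 + narrow to e8
vnclipu.wi  v21, v21, 3
vsetivli    zero, 16, e8, m1, ta, ma
vslideup.vi v20, v21, 8 |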
5. Signed saturate and Narrow to Unsigned
Introduction
Implementation in RISCV64
...