While adapting the H264 standard encoder x264 for RISC-V, we've identified several operations that are challenging to implement efficiently with existing RVV instructions. In some cases, implementations require too many instructions, potentially impacting encoder performance.
We're proposing new instructions for video encoding and decoding to boost RISC-V's performance in this domain. Our current focus on encoding for limited format standards may introduce some bias, so we're gathering instruction requirements here.
If you have new relevant needs, please share them. We may not have found the best RVV implementations, so if you have better solutions, we're open to discussion.
This is an open collaboration. All ideas and contributions are valuable as we work together to enhance RISC-V's video codec capabilities.
Collection list
- vector transpose instruction
- absolute difference instructions
- zero-extended vmv.x.s
- Rounded Shift Right Narrow
- Signed saturate and Narrow to Unsigned
Vector transpose instructions
Intro
In x264, matrix transpose instructions serve two main purposes: performing actual matrix transposition, and permuting elements between vectors. Both uses are quite frequent.
In the x264 scenarios that require matrix transposition, each row of the matrix is held in its own register, and after the transposition each row of the transposed matrix likewise ends up in a separate register. The matrix transposition discussed in this wiki is carried out in this context.
...
Using RISC-V RVV, we have found three methods to perform matrix transposition (thanks to camel-cdr for the assistance provided):
- Using segmented load or store
- Using vrgather
- Using vnsrl
Here, we use the example of transposing a 4x8 (2x4x4) matrix (transposing the left 4x4 and the right 4x4 separately) to illustrate these methods.
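As a reference for what the assembly below must produce, here is a minimal scalar sketch of that operation in C (the function name and array layout are ours, not x264's): each source row sits in its own 8-element register, and output register r holds row r of the transposed left 4x4 in elements 0..3 and of the transposed right 4x4 in elements 4..7.
Code Block |
---|
#include <stdint.h>

/* Scalar sketch (our own naming, not x264 code): transpose the left 4x4 and
 * the right 4x4 of a 4x8 int16 matrix. src[r] is one input row (one vector
 * register); dst[r] gets row r of the transposed left half in elements 0..3
 * and row r of the transposed right half in elements 4..7. */
static void transpose4x8_16_ref(int16_t dst[4][8], const int16_t src[4][8])
{
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++) {
            dst[c][r]     = src[r][c];     /* left 4x4  */
            dst[c][r + 4] = src[r][c + 4]; /* right 4x4 */
        }
} |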
Segmented load or store
With this method, we use the `vssseg4e16.v` instruction to store the rows of the original matrix to memory column-wise, and then read them back row by row. Since we are transposing a 4x8 matrix, we also need `vslide` instructions to combine the contents of two registers into one.
Code Block |
---|
// Using extra loads and stores, and use vslide to combine them
.macro TRANSPOSE4x8_16 buf, bstride, v0, v1, v2, v3, t0, t1, t2, t3
vssseg4e16.v \v0, (\buf), \bstride
vsetivli zero, 4, e16, mf2, ta, ma
vle16.v \v0, (\buf)
add \buf, \buf, \bstride
vle16.v \v1, (\buf)
add \buf, \buf, \bstride
vle16.v \v2, (\buf)
add \buf, \buf, \bstride
vle16.v \v3, (\buf)
add \buf, \buf, \bstride
vle16.v \t0, (\buf)
add \buf, \buf, \bstride
vle16.v \t1, (\buf)
add \buf, \buf, \bstride
vle16.v \t2, (\buf)
add \buf, \buf, \bstride
vle16.v \t3, (\buf)
vsetivli zero, 2, e64, m1, tu, ma
vslideup.vi \v0, \t0, 1
vslideup.vi \v1, \t1, 1
vslideup.vi \v2, \t2, 1
vslideup.vi \v3, \t3, 1
.endm
// under VLEN=128
function transpose4x8_16_one
vsetivli zero, 8, e16, m1, ta, ma
mv t0, a0
vl4re16.v v0, (a0)
li t1, 8
TRANSPOSE4x8_16 t0, t1, v0, v1, v2, v3, v8, v9, v10, v11
vs4r.v v0, (a0)
ret
endfunc |
The drawback of this method is that we have to go through memory, which certainly cannot match the performance ceiling of pure in-register operations. Additionally, we always need a scratch buffer, and sometimes we have to protect its contents from being overwritten (as in dav1d, which would require even more instructions).
Vrgather
`vrgather` can reorganize the elements in a register group based on an index. There are two ways to create the index: one is to create it manually, and the other is to read it from memory.
For creating the index by hand, the idea is to set the index for gathering vector N to `(i&3)*vl + (i&~3u) + N`, where `i` is the element index obtained by `vid.v`.
Code Block |
---|
// Using vrgather with index created by hand
.macro TRANSPOSE4x8_16_vrgather v0, v1, v2, v3, t0, t1, t2, t3, t4, t5, t6, t7, s0
vsetivli zero, 8, e16, m1, ta, ma
vid.v \t0
li \s0, 8
vand.vi \t1, \t0, 3
vmul.vx \t1, \t1, \s0
vand.vi \t0, \t0, -4
vadd.vv \t4, \t1, \t0
vadd.vi \t5, \t4, 1
vadd.vi \t6, \t4, 2
vadd.vi \t7, \t4, 3
li \s0, 32
vsetvli zero, \s0, e16, m4, ta, ma
vrgatherei16.vv \t0, \v0, \t4
vmv.v.v \v0, \t0
.endm
// under VLEN=128
function transpose4x8_16_two
vl4re16.v v0, (a0)
TRANSPOSE4x8_16_vrgather v0, v1, v2, v3, v8, v9, v10, v11, v12, v13, v14, v15, t0
vs4r.v v0, (a0)
ret
endfunc |
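As a sanity check on the formula, the gather can be modelled in C (a sketch under our own naming, with vl = 8 and the four source registers viewed as one flat 32-element array, matching the m4 register group used above):
Code Block |
---|
#include <stdint.h>

/* Sketch: build the gather indices from (i & 3) * vl + (i & ~3u) + N with
 * vl = 8, then apply them. src is the register group v0..v3 flattened to 32
 * elements; dst is the transposed group, output vector N at dst + N * vl. */
static void transpose4x8_16_gather_ref(int16_t dst[32], const int16_t src[32])
{
    const unsigned vl = 8;
    for (unsigned N = 0; N < 4; N++)        /* destination vector N         */
        for (unsigned i = 0; i < vl; i++) { /* element index, as from vid.v */
            unsigned idx = (i & 3) * vl + (i & ~3u) + N;
            dst[N * vl + i] = src[idx];
        }
} |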
Alternatively, we can read the index from memory.
Code Block |
---|
const scan4x8_frame, align=8
.half 0, 8, 16, 24, 4, 12, 20, 28
.half 1, 9, 17, 25, 5, 13, 21, 29
.half 2, 10, 18, 26, 6, 14, 22, 30
.half 3, 11, 19, 27, 7, 15, 23, 31
endconst
// under VLEN=128
function transpose4x8_16_three
vl4re16.v v0, (a0)
movrel t0, scan4x8_frame
vl4re16.v v4, (t0)
li t1, 32
vsetvli zero, t1, e16, m4, ta, ma
vrgatherei16.vv v8, v0, v4
vs4r.v v8, (a0)
ret
endfunc |
Based on our current results, `vrgather` is much slower than segmented load/store (vsseg: 0.277785 seconds, vrgather.vv: 1.545038 seconds). However, we believe there is still significant room for improvement, because even the faster segmented load/store is not a pure in-register operation.
Another issue is that in the hot functions of x264, specifically the SATD series of functions, the AArch64 implementation extensively uses `trn1` and `trn2` operations. These operations can simplify calculations and improve SIMD performance. However, currently performing such operations in RVV is quite expensive.
Code Block |
---|
// Each vtrn macro simulates two instructions in aarch64: trn1 and trn2
.macro vtrn_8h d0, d1, s0, s1, t0, t1, t3
vsetivli zero, 4, e32, m1, ta, ma
vsll.vi \t3, \s0, 16
vsrl.vi \t1, \s1, 16
vsrl.vi \t0, \s0, 16
vsll.vi \d0, \s1, 16
vsll.vi \d1, \t1, 16
vsrl.vi \t3, \t3, 16
vsetivli zero, 8, e16, m1, ta, ma
vor.vv \d0, \d0, \t3
vor.vv \d1, \d1, \t0
.endm
.macro vtrn_4s d0, d1, s0, s1, t0, t1, t3
vsetivli zero, 2, e64, m1, ta, ma
li t5, 32
vsll.vx \t3, \s0, t5
vsrl.vx \t1, \s1, t5
vsrl.vx \t0, \s0, t5
vsll.vx \d0, \s1, t5
vsll.vx \d1, \t1, t5
vsrl.vx \t3, \t3, t5
vsetivli zero, 4, e32, m1, ta, ma
vor.vv \d0, \d0, \t3
vor.vv \d1, \d1, \t0
.endm |
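For readers unfamiliar with the AArch64 instructions, this is the per-element behaviour the macros above emulate, sketched in C for the 16-bit (8h) case (the helper name is ours):
Code Block |
---|
#include <stdint.h>

/* Sketch of AArch64 trn1/trn2 on 16-bit lanes (8h arrangement): trn1
 * interleaves the even lanes of both sources, trn2 the odd lanes. */
static void vtrn_8h_ref(int16_t d0[8], int16_t d1[8],
                        const int16_t s0[8], const int16_t s1[8])
{
    for (int i = 0; i < 4; i++) {
        d0[2 * i]     = s0[2 * i];     /* trn1 */
        d0[2 * i + 1] = s1[2 * i];
        d1[2 * i]     = s0[2 * i + 1]; /* trn2 */
        d1[2 * i + 1] = s1[2 * i + 1];
    }
} |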
This is also one of the main reasons why we want to add instructions similar to `trn1` and `trn2` in RVV.
Vnsrl
Olaf pointed out another method for matrix transposition: using the RVV vnsrl instruction together with vslide instructions to obtain the effect of AArch64's zip1 and zip2. Olaf provided detailed information for this method, and we are very grateful for his work. Below is an approach that works with VLEN=128:
Code Block |
---|
# VLEN=128 transpose one 4x4 matrix of 16-bit elements stored in 4 vreg:
# a b c d a e i m
# e f g h -----\ b f j n
# i j k l -----/ c g k o
# m n o p d h l p
## setup code:
# li t1, 32
vsetvli t0, x0, e32, m1, ta, ma
vslideup.vi v0, v1, 2
vslideup.vi v2, v3, 2
vmv1r.v v1, v2
# v0: a b c d e f g h
# v1: i j k l m n o p
vnsrl.wi v4, v0, 0
vnsrl.wx v6, v0, t1
# v4: a b e f i j m n
# v6: c d g h k l o p
vsetvli t0, x0, e16, mf2, ta, ma
vnsrl.wi v0, v4, 0
vnsrl.wi v1, v4, 16
vnsrl.wi v2, v6, 0
vnsrl.wi v3, v6, 16
# v0: a e i m
# v1: b f j n
# v2: c g k o
# v3: d h l p |
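Conceptually, each vnsrl step splits a vector viewed at twice the element width into its even and odd narrow lanes (a shift of 0 keeps the low half of each pair, a shift of 16 the high half). A small C sketch of one such step, under our own naming:
Code Block |
---|
#include <stdint.h>

/* Sketch of one narrowing step (our own naming): viewing 2 * n_pairs 16-bit
 * elements as n_pairs 32-bit elements, a narrowing shift by 0 keeps the even
 * 16-bit lanes and a shift by 16 keeps the odd ones. */
static void vnsrl_split_ref(int16_t *even, int16_t *odd,
                            const int16_t *src, int n_pairs)
{
    for (int i = 0; i < n_pairs; i++) {
        even[i] = src[2 * i];     /* vnsrl.wi dst, src, 0  */
        odd[i]  = src[2 * i + 1]; /* vnsrl.wi dst, src, 16 */
    }
} |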
Absolute difference instructions
Intro
x264 needs widening absolute-difference-accumulate operations, which account for roughly 5%~6% of the running time both in x264 itself and in SPEC CPU 525.x264_r.
https://wiki.videolan.org/X264_asm_intro/#Example_2:_pixel_sad
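The operation in question is the ordinary sum of absolute differences over a pixel block, as in the pixel_sad example linked above. A minimal scalar sketch in C (not x264's actual code) of the pattern we want to vectorize:
Code Block |
---|
#include <stdint.h>
#include <stdlib.h>

/* Minimal SAD sketch (not x264's code): accumulate |pix1 - pix2| over a
 * w x h block of 8-bit pixels; the per-pixel differences are widened into
 * an integer accumulator, which is the pattern discussed here. */
static int sad_ref(const uint8_t *pix1, intptr_t stride1,
                   const uint8_t *pix2, intptr_t stride2, int w, int h)
{
    int sum = 0;
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            sum += abs(pix1[x] - pix2[x]);
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
} |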
Implementation in other ISAs
AArch64
absolute difference: sabd/uabd
absolute difference (double-width result): sabdl/uabdl
add absolute difference: saba/uaba
add absolute difference (double-width result): sabal/uabal
x86
compute sum of absolute differences: psadbw
Implementation in RISCV64
Each of these operations needs 3~4 RVV instructions to implement:
Code Block |
---|
// unsigned absolute difference: d0 = |s0 - s1|
.macro uabd d0, s0, s1, t0
vmaxu.vv \d0, \s0, \s1
vminu.vv \t0, \s0, \s1
vsub.vv \d0, \d0, \t0
.endm
// signed absolute difference: d0 = |s0 - s1|
.macro sabd d0, s0, s1, t0
vmax.vv \d0, \s0, \s1
vmin.vv \t0, \s0, \s1
vsub.vv \d0, \d0, \t0
.endm
// unsigned absolute difference and accumulate (double-width accumulator)
.macro uabal d0, s0, s1, t0, t1
vmaxu.vv \t1, \s0, \s1
vminu.vv \t0, \s0, \s1
vsub.vv \t0, \t1, \t0
vwaddu.wv \d0, \d0, \t0
.endm
// unsigned absolute difference with double-width result
.macro uabdl d0, s0, s1, t0, t1
vmaxu.vv \t1, \s0, \s1
vminu.vv \t0, \s0, \s1
vwsubu.vv \d0, \t1, \t0
.endm |
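For reference, the per-element semantics these macros implement, sketched in C for the unsigned 8-bit case (helper names are ours; the widening forms produce a double-width result or accumulator):
Code Block |
---|
#include <stdint.h>

/* Per-element semantics of the macros above, unsigned 8-bit case (helper
 * names are ours). uabd: |a - b|; uabdl: |a - b| widened to 16 bits;
 * uabal: accumulate |a - b| into a 16-bit accumulator. */
static uint8_t uabd_ref(uint8_t a, uint8_t b)
{
    return a > b ? a - b : b - a;
}

static uint16_t uabdl_ref(uint8_t a, uint8_t b)
{
    return (uint16_t)uabd_ref(a, b);
}

static uint16_t uabal_ref(uint16_t acc, uint8_t a, uint8_t b)
{
    return (uint16_t)(acc + uabd_ref(a, b));
} |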
Zero-extended vmv.x.s
Intro
The vmv.x.s instruction copies a single SEW-wide element from index 0 of the source vector register to a destination integer register. If SEW > XLEN, the least-significant XLEN bits are transferred and the upper SEW-XLEN bits are ignored. If SEW < XLEN, the value is sign-extended to XLEN bits. In x264 we usually want the unsigned value of an 8- or 16-bit element, so this sign extension forces an extra zero-extension instruction, as the implementation below shows.
Implementation in RISCV64
Code Block |
---|
//uint16_t with zbb extension
vsetivli zero, 1, e16, m1, ta, ma
vmv.x.s a1, v1
zext.h a1, a1 |
Rounded Shift Right Narrow
Intro
RVV currently provides rounding right shifts (vssrl/vssra) and narrowing conversions (vnsrl/vncvt.x.x.w) as separate instructions, but no single instruction that combines the rounding shift with narrowing the way AArch64's rshrn/rshrn2 does, so emulating it takes several instructions and vtype changes.
Implementation in RISCV64
Code Block |
---|
// AArch64 implementation
rshrn v20.8b, v20.8h, #3
rshrn2 v20.16b, v21.8h, #3
// RISCV64 implementation
vsetivli zero, 8, e16, m1, ta, ma
vssrl.vi v20, v20, 3
vssrl.vi v21, v21, 3
vsetivli zero, 8, e8, mf2, ta, ma
vncvt.x.x.w v20, v20
vncvt.x.x.w v21, v21
vsetivli zero, 16, e8, m1, ta, ma
vslideup.vi v20, v21, 8 |
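The per-element operation being emulated (AArch64 rshrn with a shift of 3 on 16-bit elements) adds the rounding constant, shifts right, and truncates to the narrow type; a C sketch, assuming unsigned inputs as in the code above:
Code Block |
---|
#include <stdint.h>

/* Sketch of the rounded-shift-right-narrow used above (shift of 3, 16-bit
 * to 8-bit, unsigned): add the rounding constant 1 << (shift - 1), shift
 * right, then truncate to the narrow element type. */
static uint8_t rshrn3_ref(uint16_t x)
{
    return (uint8_t)((x + (1u << 2)) >> 3);
} |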
Signed saturate and Narrow to Unsigned
Implementation in RISCV64
Code Block |
---|
// AArch64 implementation
sqxtun v0.8b, v0.8h
// RISCV64 implementation
vsetivli zero, 4, e16, m1, ta, ma
vmax.vx v0, v0, zero
vsetivli zero, 4, e8, mf2, ta, ma
vnclipu.wi v4, v0, 0 |
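For reference, sqxtun saturates each signed 16-bit element to the unsigned 8-bit range, which is what the vmax.vx + vnclipu.wi pair above reproduces; a per-element C sketch:
Code Block |
---|
#include <stdint.h>

/* Sketch of sqxtun per-element semantics: saturate a signed 16-bit value
 * to the unsigned 8-bit range [0, 255]. */
static uint8_t sqxtun_ref(int16_t x)
{
    if (x < 0)
        return 0;
    if (x > 255)
        return 255;
    return (uint8_t)x;
} |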