While adapting x264, the standard H.264 encoder, for RISC-V, we identified several operations that are hard to implement efficiently with the existing RVV instructions. In some cases an implementation needs so many instructions that encoder performance may suffer.
We're proposing new instructions for video encoding and decoding to boost RISC-V's performance in this domain. Because our current focus is on encoding for a limited set of format standards, our view may be biased, so we're gathering instruction requirements here.
If you have new relevant needs, please share them. We may not have found the best RVV implementations, so if you have better solutions, we're open to discussion.
This is an open collaboration. All ideas and contributions are valuable as we work together to enhance RISC-V's video codec capabilities.
Vector transpose instructions
Intro
In x264, matrix transpose instructions serve two main purposes: transposing matrices and permuting elements between vectors. Both uses are quite frequent.
In scenarios within x264 where matrix transposition is required, each row of the matrix is individually placed into a register. After the transposition operation, each row of the transposed matrix is placed into a separate register. The matrix transposition discussed in this wiki is carried out in this context.
Implementation in other ISAs
In other ISAs, matrix transposition is usually implemented in one of two ways. Below, we introduce both using AArch64 and LoongArch as examples. The x86 implementation is similar to LoongArch, while the ARM implementation is similar to AArch64.
AArch64
In AArch64, there are `trn1` and `trn2` instructions. By combining one `trn1` and one `trn2`, multiple 2x2 matrix transpositions can be completed between two vector registers. Larger matrix transpositions can be achieved by repeatedly applying 2x2 matrix transpositions of different scales. The AArch64 transpose macro implementation in x264 is as follows:
Code Block |
---|
.macro transpose t1, t2, s1, s2
trn1 \t1, \s1, \s2
trn2 \t2, \s1, \s2
.endm
.macro transpose4x4.h v0, v1, v2, v3, t0, t1, t2, t3
transpose \t0\().2s, \t2\().2s, \v0\().2s, \v2\().2s
transpose \t1\().2s, \t3\().2s, \v1\().2s, \v3\().2s
transpose \v0\().4h, \v1\().4h, \t0\().4h, \t1\().4h
transpose \v2\().4h, \v3\().4h, \t2\().4h, \t3\().4h
.endm
.macro transpose4x8.h v0, v1, v2, v3, t0, t1, t2, t3
transpose \t0\().4s, \t2\().4s, \v0\().4s, \v2\().4s
transpose \t1\().4s, \t3\().4s, \v1\().4s, \v3\().4s
transpose \v0\().8h, \v1\().8h, \t0\().8h, \t1\().8h
transpose \v2\().8h, \v3\().8h, \t2\().8h, \t3\().8h
.endm
.macro transpose8x8.h r0, r1, r2, r3, r4, r5, r6, r7, r8, r9
trn1 \r8\().8h, \r0\().8h, \r1\().8h
trn2 \r9\().8h, \r0\().8h, \r1\().8h
trn1 \r1\().8h, \r2\().8h, \r3\().8h
trn2 \r3\().8h, \r2\().8h, \r3\().8h
trn1 \r0\().8h, \r4\().8h, \r5\().8h
trn2 \r5\().8h, \r4\().8h, \r5\().8h
trn1 \r2\().8h, \r6\().8h, \r7\().8h
trn2 \r7\().8h, \r6\().8h, \r7\().8h
trn1 \r4\().4s, \r0\().4s, \r2\().4s
trn2 \r2\().4s, \r0\().4s, \r2\().4s
trn1 \r6\().4s, \r5\().4s, \r7\().4s
trn2 \r7\().4s, \r5\().4s, \r7\().4s
trn1 \r5\().4s, \r9\().4s, \r3\().4s
trn2 \r9\().4s, \r9\().4s, \r3\().4s
trn1 \r3\().4s, \r8\().4s, \r1\().4s
trn2 \r8\().4s, \r8\().4s, \r1\().4s
trn1 \r0\().2d, \r3\().2d, \r4\().2d
trn2 \r4\().2d, \r3\().2d, \r4\().2d
trn1 \r1\().2d, \r5\().2d, \r6\().2d
trn2 \r5\().2d, \r5\().2d, \r6\().2d
trn2 \r6\().2d, \r8\().2d, \r2\().2d
trn1 \r2\().2d, \r8\().2d, \r2\().2d
trn1 \r3\().2d, \r9\().2d, \r7\().2d
trn2 \r7\().2d, \r9\().2d, \r7\().2d
.endm |
Here, `transpose4x4.h` and `transpose4x8.h` achieve fast transpositions of 4x4 and 4x8 (2x4x4) matrices by repeatedly calling the transpose macro.
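To make the lane movement concrete, here is a minimal Python sketch of `trn1`/`trn2` and of the two-stage structure of `transpose4x4.h`. Registers are modelled as plain lists. The helpers `widen` and `narrow` are illustrative names, not part of x264; they stand in for reinterpreting `.4h` lanes as `.2s` lanes.

```python
def trn1(a, b):
    """Model of trn1: even-numbered lanes of a and b, interleaved."""
    out = []
    for i in range(0, len(a), 2):
        out += [a[i], b[i]]
    return out


def trn2(a, b):
    """Model of trn2: odd-numbered lanes of a and b, interleaved."""
    out = []
    for i in range(0, len(a), 2):
        out += [a[i + 1], b[i + 1]]
    return out


def widen(v):
    """View a register as half as many double-width lanes (hypothetical helper)."""
    return [tuple(v[i:i + 2]) for i in range(0, len(v), 2)]


def narrow(v):
    """Inverse of widen: flatten double-width lanes back into single lanes."""
    return [x for pair in v for x in pair]


def transpose4x4(r0, r1, r2, r3):
    # Stage 1 (the .2s step of the macro): exchange 2x2 blocks of wide lanes.
    t0 = narrow(trn1(widen(r0), widen(r2)))
    t2 = narrow(trn2(widen(r0), widen(r2)))
    t1 = narrow(trn1(widen(r1), widen(r3)))
    t3 = narrow(trn2(widen(r1), widen(r3)))
    # Stage 2 (the .4h step): transpose each 2x2 block of narrow lanes.
    return trn1(t0, t1), trn2(t0, t1), trn1(t2, t3), trn2(t2, t3)


rows = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
for row in transpose4x4(*rows):
    print(row)
```

Each stage is one `trn1`/`trn2` pair per two registers, which is why a 4x4 transpose costs eight instructions in the macro above.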
LoongArch
In LoongArch, matrix transposition is implemented with the interleave method, which uses two instructions:
- vilvl (Vector Interleave Low)
- vilvh (Vector Interleave High)
The LoongArch 4x4 transpose macro implementation in x264 is as follows:
Code Block |
---|
/*
* Description : Transpose 4x4 block with word elements in vectors
* Arguments : Inputs - in0, in1, in2, in3
* Outputs - out0, out1, out2, out3
* Details :
* Example :
* 1, 2, 3, 4 1, 5, 9,13
* 5, 6, 7, 8 to 2, 6,10,14
* 9,10,11,12 =====> 3, 7,11,15
* 13,14,15,16 4, 8,12,16
*/
.macro LSX_TRANSPOSE4x4_W in0, in1, in2, in3, out0, out1, out2, out3, \
tmp0, tmp1
vilvl.w \tmp0, \in1, \in0
vilvh.w \out1, \in1, \in0
vilvl.w \tmp1, \in3, \in2
vilvh.w \out3, \in3, \in2
vilvl.d \out0, \tmp1, \tmp0
vilvl.d \out2, \out3, \out1
vilvh.d \out3, \out3, \out1
vilvh.d \out1, \tmp1, \tmp0
.endm |
Matrix transposition can be achieved by executing a sequence of interleave instructions. The following trace shows how the value of each register changes during a 4x4 matrix transposition using the interleave method:
Code Block |
---|
# input
in0: [a0 a1 a2 a3]
in1: [b0 b1 b2 b3]
in2: [c0 c1 c2 c3]
in3: [d0 d1 d2 d3]
vilvl.w \tmp0, \in1, \in0
// tmp0: [a0 b0 a1 b1]
vilvh.w \out1, \in1, \in0
// out1: [a2 b2 a3 b3]
vilvl.w \tmp1, \in3, \in2
// tmp1: [c0 d0 c1 d1]
vilvh.w \out3, \in3, \in2
// out3: [c2 d2 c3 d3]
vilvl.d \out0, \tmp1, \tmp0
// out0: [a0 b0 c0 d0]
vilvl.d \out2, \out3, \out1
// out2: [a2 b2 c2 d2]
vilvh.d \out3, \out3, \out1
// out3: [a3 b3 c3 d3]
vilvh.d \out1, \tmp1, \tmp0
// out1: [a1 b1 c1 d1]
# output
out0: [a0 b0 c0 d0]
out1: [a1 b1 c1 d1]
out2: [a2 b2 c2 d2]
out3: [a3 b3 c3 d3] |
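The trace above can be replayed with a small Python model. This is a sketch under assumptions: registers are lists of four word lanes, the operand order follows the trace (the last source operand supplies the even positions), and the `width` parameter is an illustrative way to model the `.d` forms as groups of two word lanes.

```python
def _interleave(vj, vk, start, width):
    """Interleave groups of `width` lanes from one half of vk and vj (vk first)."""
    half = len(vj) // 2
    out = []
    for i in range(start, start + half, width):
        out += vk[i:i + width] + vj[i:i + width]
    return out


def vilvl(vj, vk, width=1):
    """Model of vilvl: interleave the low halves of vk and vj."""
    return _interleave(vj, vk, 0, width)


def vilvh(vj, vk, width=1):
    """Model of vilvh: interleave the high halves of vk and vj."""
    return _interleave(vj, vk, len(vj) // 2, width)


def lsx_transpose4x4_w(in0, in1, in2, in3):
    # Same instruction order as LSX_TRANSPOSE4x4_W.
    tmp0 = vilvl(in1, in0)             # [a0 b0 a1 b1]
    out1 = vilvh(in1, in0)             # [a2 b2 a3 b3]
    tmp1 = vilvl(in3, in2)             # [c0 d0 c1 d1]
    out3 = vilvh(in3, in2)             # [c2 d2 c3 d3]
    out0 = vilvl(tmp1, tmp0, width=2)  # [a0 b0 c0 d0]
    out2 = vilvl(out3, out1, width=2)  # [a2 b2 c2 d2]
    out3 = vilvh(out3, out1, width=2)  # [a3 b3 c3 d3]
    out1 = vilvh(tmp1, tmp0, width=2)  # [a1 b1 c1 d1]
    return out0, out1, out2, out3


rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
for row in lsx_transpose4x4_w(*rows):
    print(row)
```

As with the `trn1`/`trn2` approach, a 4x4 transpose takes eight instructions; the difference is that the first stage interleaves adjacent rows instead of exchanging 2x2 blocks.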
Implementation in RISCV64
Using RISC-V RVV, we have found two methods to perform matrix transposition (thanks to camel-cdr for the assistance provided):
- Using segmented load or store
- Using vrgather
Here, we use the example of transposing a 4x8 (2x4x4) matrix (transposing the left 4x4 and the right 4x4 separately) to illustrate these two methods.
Segmented load or store
With this method, we use the `vssseg4e16.v` instruction to store each row of the original matrix into memory by columns, and then read the data back by rows. Since we are transposing a 4x8 matrix, we also need `vslideup` to combine the contents of two registers.
Code Block |
---|
// Using extra loads and stores, and use vslide to combine them
.macro TRANSPOSE4x8_16 buf, bstride, v0, v1, v2, v3, t0, t1, t2, t3
vssseg4e16.v \v0, (\buf), \bstride
vsetivli zero, 4, e16, mf2, ta, ma
vle16.v \v0, (\buf)
add \buf, \buf, \bstride
vle16.v \v1, (\buf)
add \buf, \buf, \bstride
vle16.v \v2, (\buf)
add \buf, \buf, \bstride
vle16.v \v3, (\buf)
add \buf, \buf, \bstride
vle16.v \t0, (\buf)
add \buf, \buf, \bstride
vle16.v \t1, (\buf)
add \buf, \buf, \bstride
vle16.v \t2, (\buf)
add \buf, \buf, \bstride
vle16.v \t3, (\buf)
add \buf, \buf, \bstride
vsetivli zero, 2, e64, m1, tu, ma
vslideup.vi \v0, \t0, 1
vslideup.vi \v1, \t1, 1
vslideup.vi \v2, \t2, 1
vslideup.vi \v3, \t3, 1
.endm
function transpose4x8_16_one
vsetivli zero, 8, e16, m1, ta, ma
mv t0, a0
vl4re16.v v0, (a0)
li t1, 8
TRANSPOSE4x8_16 t0, t1, v0, v1, v2, v3, v8, v9, v10, v11
vs4r.v v0, (a0)
ret
endfunc |
The drawback of this method is that it goes through memory, so it cannot reach the peak throughput of pure register operations. It also always requires a buffer, and sometimes the buffer's existing contents must be protected from corruption (as in dav1d), which costs further instructions.
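The data flow of the segmented-store method can be sketched in Python as follows. This is a simplified model, not a cycle-accurate one: memory is a list of 16-bit elements, so the 8-byte stride of the macro becomes a stride of 4 elements, and the function names mirror the instructions for readability only.

```python
def vssseg4e16(mem, base, stride, regs, vl):
    """Strided segment store with 4 fields: segment i holds regs[0][i]..regs[3][i]
    and starts at element offset base + i * stride."""
    for i in range(vl):
        for field, reg in enumerate(regs):
            mem[base + i * stride + field] = reg[i]


def vle16(mem, base, vl):
    """Unit-stride load of vl elements."""
    return mem[base:base + vl]


def transpose4x8_16(rows):
    mem = [None] * 32
    stride = 4  # 8 bytes in the macro, expressed here in 16-bit elements
    # Store the four row registers so that each column becomes contiguous.
    vssseg4e16(mem, 0, stride, rows, vl=8)
    # Read the eight columns back, four elements at a time.
    cols = [vle16(mem, k * stride, 4) for k in range(8)]
    # Models vslideup.vi at e64: column k+4 goes into the upper half beside column k.
    return [cols[k] + cols[k + 4] for k in range(4)]


rows = [list(range(r * 8, r * 8 + 8)) for r in range(4)]
for row in transpose4x8_16(rows):
    print(row)
```

The model makes the cost visible: one store, eight loads, and four slides, all of which depend on the round trip through the buffer.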
Vrgather
`vrgather` can reorganize the elements in a register group based on an index. There are two ways to create the index: one is to create it manually, and the other is to read it from memory.
For creating the index by hand, the idea is to set the index for gathering vector N to `(i&3)*vl+(i&~3u)+N`, where `i` is the element index obtained by `vid.v`.
Code Block |
---|
// Using vrgather with index created by hand
.macro TRANSPOSE4x8_16_vrgather v0, v1, v2, v3, t0, t1, t2, t3, t4, t5, t6, t7, s0
vsetivli zero, 8, e16, m1, ta, ma
vid.v \t0
li \s0, 8
vand.vi \t1, \t0, 3
vmul.vx \t1, \t1, \s0
vand.vi \t0, \t0, -4
vadd.vv \t4, \t1, \t0
vadd.vi \t5, \t4, 1
vadd.vi \t6, \t4, 2
vadd.vi \t7, \t4, 3
li \s0, 32
vsetvli zero, \s0, e16, m4, ta, ma
vrgatherei16.vv \t0, \v0, \t4
vmv.v.v \v0, \t0
.endm
function transpose4x8_16_two
vl4re16.v v0, (a0)
TRANSPOSE4x8_16_vrgather v0, v1, v2, v3, v8, v9, v10, v11, v12, v13, v14, v15, t0
vs4r.v v0, (a0)
ret
endfunc |
Alternatively, we can read the index from memory.
Code Block |
---|
const scan4x8_frame, align=8
.half 0, 8, 16, 24, 4, 12, 20, 28
.half 1, 9, 17, 25, 5, 13, 21, 29
.half 2, 10, 18, 26, 6, 14, 22, 30
.half 3, 11, 19, 27, 7, 15, 23, 31
endconst
function transpose4x8_16_three
vl4re16.v v0, (a0)
movrel t0, scan4x8_frame
vl4re16.v v4, (t0)
li t1, 32
vsetvli zero, t1, e16, m4, ta, ma
vrgatherei16.vv v8, v0, v4
vs4r.v v8, (a0)
ret
endfunc |
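As a cross-check of the two index sources, the following Python sketch computes the index vectors from the formula `(i&3)*vl+(i&~3u)+N` and compares them with the `scan4x8_frame` table. It assumes `vl = 8` (one 8-element row per register, as in the macro) and models `vrgather` only for in-range indices; `gather_indices` and `transpose4x8_gather` are illustrative names.

```python
VL = 8  # elements per row register, matching "li s0, 8" in the macro


def gather_indices(n, vl=VL):
    """Index vector for output row n: (i & 3) * vl + (i & ~3) + n."""
    return [(i & 3) * vl + (i & ~3) + n for i in range(vl)]


def vrgather(src, idx):
    """Gather for in-range indices: out[i] = src[idx[i]]."""
    return [src[j] for j in idx]


# Copied from the scan4x8_frame constant in the assembly above.
SCAN4X8_FRAME = [
    [0, 8, 16, 24, 4, 12, 20, 28],
    [1, 9, 17, 25, 5, 13, 21, 29],
    [2, 10, 18, 26, 6, 14, 22, 30],
    [3, 11, 19, 27, 7, 15, 23, 31],
]


def transpose4x8_gather(rows):
    # The four row registers form one register group, modelled as a flat list.
    flat = [x for row in rows for x in row]
    return [vrgather(flat, gather_indices(n)) for n in range(4)]


# The formula reproduces the table row by row.
print([gather_indices(n) for n in range(4)] == SCAN4X8_FRAME)

rows = [[f"r{r}c{c}" for c in range(8)] for r in range(4)]
for row in transpose4x8_gather(rows):
    print(row)
```

Both variants therefore perform the same permutation; they differ only in whether the index costs arithmetic instructions or a memory load.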
In our current measurements, `vrgather` is much slower than segmented load/store (vsseg: 0.277785 seconds, vrgather.vv: 1.545038 seconds). Even so, segmented load/store is not a pure in-register operation, so we believe there is still significant room for improvement over it.
Another issue is that in the hot functions of x264, specifically the SATD series of functions, the AArch64 implementation makes extensive use of `trn1` and `trn2`. These operations simplify the calculations and improve SIMD performance. In RVV, however, performing such operations is currently quite expensive, as the following emulation shows:
Code Block |
---|
// Each vtrn macro simulates two AArch64 instructions: trn1 and trn2
.macro vtrn_8h d0, d1, s0, s1, t0, t1, t3
vsetivli zero, 4, e32, m1, ta, ma
vsll.vi \t3, \s0, 16
vsrl.vi \t1, \s1, 16
vsrl.vi \t0, \s0, 16
vsll.vi \d0, \s1, 16
vsll.vi \d1, \t1, 16
vsrl.vi \t3, \t3, 16
vsetivli zero, 8, e16, m1, ta, ma
vor.vv \d0, \d0, \t3
vor.vv \d1, \d1, \t0
.endm
.macro vtrn_4s d0, d1, s0, s1, t0, t1, t3
vsetivli zero, 2, e64, m1, ta, ma
li t5, 32
vsll.vx \t3, \s0, t5
vsrl.vx \t1, \s1, t5
vsrl.vx \t0, \s0, t5
vsll.vx \d0, \s1, t5
vsll.vx \d1, \t1, t5
vsrl.vx \t3, \t3, t5
vsetivli zero, 4, e32, m1, ta, ma
vor.vv \d0, \d0, \t3
vor.vv \d1, \d1, \t0
.endm |
This is also one of the main reasons why we want to add instructions similar to `trn1` and `trn2` in RVV.
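The shift-and-or sequence in `vtrn_8h` can be checked against the `trn1`/`trn2` semantics with a short Python sketch. It assumes little-endian lane packing (two 16-bit lanes per 32-bit element, lower lane in the low bits); `pack` and `unpack` are illustrative helpers that model switching between the e16 and e32 views of a register.

```python
MASK32 = 0xFFFFFFFF


def pack(h):
    """View 16-bit lanes as 32-bit elements (little-endian lane order)."""
    return [h[i] | (h[i + 1] << 16) for i in range(0, len(h), 2)]


def unpack(w):
    """View 32-bit elements as 16-bit lanes again."""
    out = []
    for x in w:
        out += [x & 0xFFFF, x >> 16]
    return out


def vtrn_8h(s0, s1):
    """Shift/or emulation of trn1 (d0) and trn2 (d1), following the macro."""
    w0, w1 = pack(s0), pack(s1)
    # d0 = (s1 << 16) | ((s0 << 16) >> 16): even lanes of s0 and s1.
    d0 = [((b << 16) & MASK32) | (((a << 16) & MASK32) >> 16) for a, b in zip(w0, w1)]
    # d1 = ((s1 >> 16) << 16) | (s0 >> 16): odd lanes of s0 and s1.
    d1 = [((b >> 16) << 16) | (a >> 16) for a, b in zip(w0, w1)]
    return unpack(d0), unpack(d1)


def trn1(a, b):
    """Reference semantics: even lanes of a and b, interleaved."""
    return [x for i in range(0, len(a), 2) for x in (a[i], b[i])]


def trn2(a, b):
    """Reference semantics: odd lanes of a and b, interleaved."""
    return [x for i in range(0, len(a), 2) for x in (a[i + 1], b[i + 1])]


s0 = list(range(0x100, 0x108))
s1 = list(range(0x200, 0x208))
d0, d1 = vtrn_8h(s0, s1)
print(d0 == trn1(s0, s1), d1 == trn2(s0, s1))
```

The emulation is correct, but the `vtrn_8h` macro spends ten instructions (two `vsetivli`, six shifts, two `vor`) on what AArch64 does in two.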