While adapting x264, the standard H.264 encoder, for RISC-V, we identified several operations that are hard to implement efficiently with existing RVV instructions. Some implementations need so many instructions that encoder performance may suffer.
We are proposing new instructions for video encoding and decoding to improve RISC-V's performance in this domain. Because our work so far focuses on encoding for a limited set of format standards, our view may be biased, so we are gathering instruction requirements here.
If you have new relevant needs, please share them. We may not have found the best RVV implementations, so if you have better solutions, we're open to discussion.
This is an open collaboration. All ideas and contributions are valuable as we work together to enhance RISC-V's video codec capabilities.
Vector transpose instructions
Intro
In x264, transpose instructions serve two main purposes: transposing matrices and permuting elements between vectors. Both uses are frequent.
Implementation in other ISAs
In other ISAs, matrix transposition is usually implemented in one of two ways. We introduce both below, using aarch64 and loongarch as examples. The x86 implementation is similar to loongarch, while the ARM implementation is similar to aarch64.
Aarch64
In aarch64, there are trn1 and trn2 instructions. By combining one trn1 and one trn2, multiple 2x2 matrix transpositions can be completed between two vector registers. Larger matrix transpositions can be achieved by repeatedly applying 2x2 transpositions at different element widths. The aarch64 transpose macro implementation in x264 is as follows:
    .macro transpose t1, t2, s1, s2
        trn1        \t1, \s1, \s2
        trn2        \t2, \s1, \s2
    .endm

    .macro transpose4x4.h v0, v1, v2, v3, t0, t1, t2, t3
        transpose   \t0\().2s, \t2\().2s, \v0\().2s, \v2\().2s
        transpose   \t1\().2s, \t3\().2s, \v1\().2s, \v3\().2s
        transpose   \v0\().4h, \v1\().4h, \t0\().4h, \t1\().4h
        transpose   \v2\().4h, \v3\().4h, \t2\().4h, \t3\().4h
    .endm

    .macro transpose4x8.h v0, v1, v2, v3, t0, t1, t2, t3
        transpose   \t0\().4s, \t2\().4s, \v0\().4s, \v2\().4s
        transpose   \t1\().4s, \t3\().4s, \v1\().4s, \v3\().4s
        transpose   \v0\().8h, \v1\().8h, \t0\().8h, \t1\().8h
        transpose   \v2\().8h, \v3\().8h, \t2\().8h, \t3\().8h
    .endm

    .macro transpose8x8.h r0, r1, r2, r3, r4, r5, r6, r7, r8, r9
        trn1        \r8\().8h, \r0\().8h, \r1\().8h
        trn2        \r9\().8h, \r0\().8h, \r1\().8h
        trn1        \r1\().8h, \r2\().8h, \r3\().8h
        trn2        \r3\().8h, \r2\().8h, \r3\().8h
        trn1        \r0\().8h, \r4\().8h, \r5\().8h
        trn2        \r5\().8h, \r4\().8h, \r5\().8h
        trn1        \r2\().8h, \r6\().8h, \r7\().8h
        trn2        \r7\().8h, \r6\().8h, \r7\().8h

        trn1        \r4\().4s, \r0\().4s, \r2\().4s
        trn2        \r2\().4s, \r0\().4s, \r2\().4s
        trn1        \r6\().4s, \r5\().4s, \r7\().4s
        trn2        \r7\().4s, \r5\().4s, \r7\().4s
        trn1        \r5\().4s, \r9\().4s, \r3\().4s
        trn2        \r9\().4s, \r9\().4s, \r3\().4s
        trn1        \r3\().4s, \r8\().4s, \r1\().4s
        trn2        \r8\().4s, \r8\().4s, \r1\().4s

        trn1        \r0\().2d, \r3\().2d, \r4\().2d
        trn2        \r4\().2d, \r3\().2d, \r4\().2d
        trn1        \r1\().2d, \r5\().2d, \r6\().2d
        trn2        \r5\().2d, \r5\().2d, \r6\().2d
        trn2        \r6\().2d, \r8\().2d, \r2\().2d
        trn1        \r2\().2d, \r8\().2d, \r2\().2d
        trn1        \r3\().2d, \r9\().2d, \r7\().2d
        trn2        \r7\().2d, \r9\().2d, \r7\().2d
    .endm
Here, transpose4x4.h and transpose4x8.h achieve fast transpositions of 4x4 and 4x8 (2x4x4) matrices by repeatedly calling the transpose macro.
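To make the lane movement concrete, here is a minimal scalar C sketch of the 4x4 case. It is only a model of the semantics, not x264 code: the names `trn` and `transpose4x4_h_model` are hypothetical, a 64-bit register is assumed to be an array of four 16-bit lanes, and the `.2s` form is modeled as pairs of adjacent 16-bit lanes moving together.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 4  /* one 64-bit register viewed as four 16-bit lanes (.4h) */

/* Scalar model of trn1/trn2 (hypothetical helper, not an x264 function).
 * w   : element width in 16-bit lanes (1 models .4h, 2 models .2s)
 * odd : 0 models trn1 (even-numbered elements), 1 models trn2 (odd ones)
 * For each element pair, the result takes one element from a, then the
 * element at the same position from b. */
static void trn(uint16_t dst[LANES], const uint16_t a[LANES],
                const uint16_t b[LANES], int w, int odd)
{
    uint16_t tmp[LANES];  /* temporary so dst may alias a or b */
    assert(w > 0 && LANES % (2 * w) == 0);
    for (int p = 0; p < LANES; p += 2 * w) {
        memcpy(&tmp[p],     &a[p + odd * w], (size_t)w * sizeof(uint16_t));
        memcpy(&tmp[p + w], &b[p + odd * w], (size_t)w * sizeof(uint16_t));
    }
    memcpy(dst, tmp, sizeof(tmp));
}

/* Follows the same four steps as the transpose4x4.h macro:
 * two wide (.2s) transposes, then two narrow (.4h) transposes.
 * Rows v0..v3 are transposed in place. */
static void transpose4x4_h_model(uint16_t v0[LANES], uint16_t v1[LANES],
                                 uint16_t v2[LANES], uint16_t v3[LANES])
{
    uint16_t t0[LANES], t1[LANES], t2[LANES], t3[LANES];

    trn(t0, v0, v2, 2, 0);  trn(t2, v0, v2, 2, 1);  /* .2s on rows 0 and 2 */
    trn(t1, v1, v3, 2, 0);  trn(t3, v1, v3, 2, 1);  /* .2s on rows 1 and 3 */
    trn(v0, t0, t1, 1, 0);  trn(v1, t0, t1, 1, 1);  /* .4h on t0 and t1 */
    trn(v2, t2, t3, 1, 0);  trn(v3, t2, t3, 1, 1);  /* .4h on t2 and t3 */
}
```

Calling `transpose4x4_h_model` on rows `{1,2,3,4}`, `{5,6,7,8}`, `{9,10,11,12}`, `{13,14,15,16}` leaves the columns of the input matrix in `v0` to `v3`. The model uses eight `trn` calls for a 4x4 block, matching the eight trn1/trn2 instructions the macro expands to.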
Loongarch
In loongarch, matrix transposition is implemented with the interleave instructions:
- vilvl (Vector Interleave Low)
- vilvh (Vector Interleave High)
The Loongarch 4x4 transpose macro implementation in x264 is as follows:
    /*
     * Description : Transpose 4x4 block with word elements in vectors
     * Arguments   : Inputs  - in0, in1, in2, in3
     *               Outputs - out0, out1, out2, out3
     * Details     :
     * Example     :
     *               1, 2, 3, 4            1, 5, 9,13
     *               5, 6, 7, 8    to      2, 6,10,14
     *               9,10,11,12  =====>    3, 7,11,15
     *              13,14,15,16            4, 8,12,16
     */
    .macro LSX_TRANSPOSE4x4_W in0, in1, in2, in3, out0, out1, out2, out3, \
                              tmp0, tmp1
        vilvl.w   \tmp0,   \in1,    \in0
        vilvh.w   \out1,   \in1,    \in0
        vilvl.w   \tmp1,   \in3,    \in2
        vilvh.w   \out3,   \in3,    \in2

        vilvl.d   \out0,   \tmp1,   \tmp0
        vilvl.d   \out2,   \out3,   \out1
        vilvh.d   \out3,   \out3,   \out1
        vilvh.d   \out1,   \tmp1,   \tmp0
    .endm
Matrix transposition can be achieved by issuing several interleave instructions. The following shows how the value of each register changes during a 4x4 matrix transposition using the interleave method:
    # input
    in0: [a0 a1 a2 a3]
    in1: [b0 b1 b2 b3]
    in2: [c0 c1 c2 c3]
    in3: [d0 d1 d2 d3]

    vilvl.w \tmp0, \in1,  \in0    // tmp0: [a0 b0 a1 b1]
    vilvh.w \out1, \in1,  \in0    // out1: [a2 b2 a3 b3]
    vilvl.w \tmp1, \in3,  \in2    // tmp1: [c0 d0 c1 d1]
    vilvh.w \out3, \in3,  \in2    // out3: [c2 d2 c3 d3]
    vilvl.d \out0, \tmp1, \tmp0   // out0: [a0 b0 c0 d0]
    vilvl.d \out2, \out3, \out1   // out2: [a2 b2 c2 d2]
    vilvh.d \out3, \out3, \out1   // out3: [a3 b3 c3 d3]
    vilvh.d \out1, \tmp1, \tmp0   // out1: [a1 b1 c1 d1]

    # output
    out0: [a0 b0 c0 d0]
    out1: [a1 b1 c1 d1]
    out2: [a2 b2 c2 d2]
    out3: [a3 b3 c3 d3]
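The register trace above can be reproduced with a small scalar C model. This is a sketch under stated assumptions rather than code from x264: the helper names `vilv` and `lsx_transpose4x4_w_model` are hypothetical, a 128-bit register is assumed to be an array of four 32-bit words, and the operand order (destination, then the two sources, with even result elements taken from the last operand) is inferred from the trace.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WORDS 4  /* one 128-bit register viewed as four 32-bit words */

/* Scalar model of vilvl/vilvh with operand order (vd, vj, vk).
 * w    : element width in 32-bit words (1 models .w, 2 models .d)
 * high : 0 models vilvl (low half of the sources), 1 models vilvh
 * Even result elements come from vk, odd result elements from vj. */
static void vilv(uint32_t vd[WORDS], const uint32_t vj[WORDS],
                 const uint32_t vk[WORDS], int w, int high)
{
    uint32_t tmp[WORDS];  /* temporary so vd may alias vj or vk */
    assert(w > 0 && WORDS % (2 * w) == 0);
    int n = WORDS / w;    /* elements per register at this width */
    for (int i = 0; i < n / 2; i++) {
        int src = (i + high * (n / 2)) * w;
        memcpy(&tmp[(2 * i) * w],     &vk[src], (size_t)w * sizeof(uint32_t));
        memcpy(&tmp[(2 * i + 1) * w], &vj[src], (size_t)w * sizeof(uint32_t));
    }
    memcpy(vd, tmp, sizeof(tmp));
}

/* Same instruction sequence as LSX_TRANSPOSE4x4_W.
 * As in the macro, the outputs must be distinct from the inputs. */
static void lsx_transpose4x4_w_model(
    const uint32_t in0[WORDS], const uint32_t in1[WORDS],
    const uint32_t in2[WORDS], const uint32_t in3[WORDS],
    uint32_t out0[WORDS], uint32_t out1[WORDS],
    uint32_t out2[WORDS], uint32_t out3[WORDS])
{
    uint32_t tmp0[WORDS], tmp1[WORDS];

    vilv(tmp0, in1,  in0,  1, 0);  /* vilvl.w: [a0 b0 a1 b1] */
    vilv(out1, in1,  in0,  1, 1);  /* vilvh.w: [a2 b2 a3 b3] */
    vilv(tmp1, in3,  in2,  1, 0);  /* vilvl.w: [c0 d0 c1 d1] */
    vilv(out3, in3,  in2,  1, 1);  /* vilvh.w: [c2 d2 c3 d3] */
    vilv(out0, tmp1, tmp0, 2, 0);  /* vilvl.d: [a0 b0 c0 d0] */
    vilv(out2, out3, out1, 2, 0);  /* vilvl.d: [a2 b2 c2 d2] */
    vilv(out3, out3, out1, 2, 1);  /* vilvh.d: [a3 b3 c3 d3] */
    vilv(out1, tmp1, tmp0, 2, 1);  /* vilvh.d: [a1 b1 c1 d1] */
}
```

Like the aarch64 model, this takes eight steps for a 4x4 block. The difference is that the first four steps interleave elements from the low or high half of each source, whereas trn1/trn2 pick even or odd elements across the whole register.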