While porting x264, the H.264 encoder, to RISC-V, we have identified several operations that are hard to implement efficiently with the existing RVV instructions. In some cases the implementation needs too many instructions and/or round trips through memory, which can hurt encoder performance.

On this page, we document these operations in order to:

  • Summarize the need for these operations, both for H.264 and, we hope, for other multimedia projects

  • Contrast them with the existing support on other architectures

  • Provide a basis for discussing efficient implementations, in both software and hardware

For operations that cannot be efficiently implemented in RISC-V, we would like to propose new instructions for video encoding and decoding to boost RISC-V's performance in this domain. We hope that experience from across the broader multimedia and codec ecosystem can help guide improvements to RISC-V.

Please reach out to the members below or to the RISE Systems Libraries WG if you have suggestions for better implementations of the operations listed here, or if you have come across other operations that multimedia workloads need but that are not well supported today.

This is an open collaboration. All ideas and contributions are valuable as we work together to enhance RISC-V's video codec capabilities.


Contact Information

Collection list

  1. Vector transpose

  2. Absolute difference

  3. Zero-extended vmv.x.s

  4. Rounded Shift Right Narrow

  5. Signed saturate and Narrow to Unsigned

1. Vector transpose instructions

Introduction

In x264, transpose instructions are used for two main purposes: transposing matrices and permuting elements between vectors. Both uses are frequent.

Where x264 needs a matrix transposition, each row of the matrix is held in its own register. After the transposition, each row of the transposed matrix is again held in its own register. The matrix transposition discussed on this page always refers to this setting.

Implementation in other ISAs

Other ISAs usually implement matrix transposition in one of two ways. Below, we introduce both using AArch64 and LoongArch as examples. The x86 implementation is similar to LoongArch, while the ARM implementation is similar to AArch64.

AArch64

AArch64 provides the trn1 and trn2 instructions. One trn1 combined with one trn2 performs multiple 2x2 matrix transpositions between two vector registers. Larger transpositions are built by repeating 2x2 transpositions at different element widths. The AArch64 transpose macros in x264 are as follows:

Code Block
.macro transpose t1, t2, s1, s2
    trn1        \t1,  \s1,  \s2
    trn2        \t2,  \s1,  \s2
.endm

.macro transpose4x4.h v0, v1, v2, v3, t0, t1, t2, t3
    transpose   \t0\().2s,  \t2\().2s,  \v0\().2s,  \v2\().2s
    transpose   \t1\().2s,  \t3\().2s,  \v1\().2s,  \v3\().2s
    transpose   \v0\().4h,  \v1\().4h,  \t0\().4h,  \t1\().4h
    transpose   \v2\().4h,  \v3\().4h,  \t2\().4h,  \t3\().4h
.endm

.macro transpose4x8.h v0, v1, v2, v3, t0, t1, t2, t3
    transpose   \t0\().4s,  \t2\().4s,  \v0\().4s,  \v2\().4s
    transpose   \t1\().4s,  \t3\().4s,  \v1\().4s,  \v3\().4s
    transpose   \v0\().8h,  \v1\().8h,  \t0\().8h,  \t1\().8h
    transpose   \v2\().8h,  \v3\().8h,  \t2\().8h,  \t3\().8h
.endm

.macro transpose8x8.h r0, r1, r2, r3, r4, r5, r6, r7, r8, r9
    trn1        \r8\().8h,  \r0\().8h,  \r1\().8h
    trn2        \r9\().8h,  \r0\().8h,  \r1\().8h
    trn1        \r1\().8h,  \r2\().8h,  \r3\().8h
    trn2        \r3\().8h,  \r2\().8h,  \r3\().8h
    trn1        \r0\().8h,  \r4\().8h,  \r5\().8h
    trn2        \r5\().8h,  \r4\().8h,  \r5\().8h
    trn1        \r2\().8h,  \r6\().8h,  \r7\().8h
    trn2        \r7\().8h,  \r6\().8h,  \r7\().8h

    trn1        \r4\().4s,  \r0\().4s,  \r2\().4s
    trn2        \r2\().4s,  \r0\().4s,  \r2\().4s
    trn1        \r6\().4s,  \r5\().4s,  \r7\().4s
    trn2        \r7\().4s,  \r5\().4s,  \r7\().4s
    trn1        \r5\().4s,  \r9\().4s,  \r3\().4s
    trn2        \r9\().4s,  \r9\().4s,  \r3\().4s
    trn1        \r3\().4s,  \r8\().4s,  \r1\().4s
    trn2        \r8\().4s,  \r8\().4s,  \r1\().4s

    trn1        \r0\().2d,  \r3\().2d,  \r4\().2d
    trn2        \r4\().2d,  \r3\().2d,  \r4\().2d

    trn1        \r1\().2d,  \r5\().2d,  \r6\().2d
    trn2        \r5\().2d,  \r5\().2d,  \r6\().2d

    trn2        \r6\().2d,  \r8\().2d,  \r2\().2d
    trn1        \r2\().2d,  \r8\().2d,  \r2\().2d

    trn1        \r3\().2d,  \r9\().2d,  \r7\().2d
    trn2        \r7\().2d,  \r9\().2d,  \r7\().2d
.endm

Here, transpose4x4.h and transpose4x8.h achieve fast transpositions of 4x4 and 4x8 (2x4x4) matrices by repeatedly calling the transpose macro.
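
To make the data flow of the trn1/trn2 macros easier to follow, here is a minimal scalar sketch in C. It is not x264 code: the helper `trn` and the function `transpose4x4_u16` are hypothetical names introduced for illustration, and the parameter `k` (lane width counted in 16-bit elements) is an assumption used to model the `.2s` and `.4h` arrangements with one routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of trn1 (odd = 0) and trn2 (odd = 1).
 * A "lane" is k consecutive 16-bit elements: k = 1 models .4h, k = 2 models .2s.
 * d must not alias a or b. */
static void trn(uint16_t *d, const uint16_t *a, const uint16_t *b,
                size_t n, size_t k, int odd)
{
    assert(k > 0 && n % (2 * k) == 0);
    for (size_t i = 0; i < n; i += 2 * k) {
        const size_t src = i + (odd ? k : 0);
        memcpy(&d[i],     &a[src], k * sizeof(uint16_t)); /* lane from a */
        memcpy(&d[i + k], &b[src], k * sizeof(uint16_t)); /* lane from b */
    }
}

/* Same sequence as the transpose4x4.h macro: first swap 32-bit lanes
 * between rows (0,2) and (1,3), then 16-bit lanes between the results. */
void transpose4x4_u16(uint16_t m[4][4])
{
    uint16_t t[4][4];
    trn(t[0], m[0], m[2], 4, 2, 0);
    trn(t[2], m[0], m[2], 4, 2, 1);
    trn(t[1], m[1], m[3], 4, 2, 0);
    trn(t[3], m[1], m[3], 4, 2, 1);
    trn(m[0], t[0], t[1], 4, 1, 0);
    trn(m[1], t[0], t[1], 4, 1, 1);
    trn(m[2], t[2], t[3], 4, 1, 0);
    trn(m[3], t[2], t[3], 4, 1, 1);
}
```

Calling `transpose4x4_u16` on a matrix whose rows are 1..4, 5..8, 9..12 and 13..16 leaves the columns of the input in the rows of `m`.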

LoongArch

In LoongArch, matrix transposition is implemented with interleave instructions:

  • vilvl (Vector Interleave Low)

  • vilvh (Vector Interleave High)

The LoongArch 4x4 transpose macro in x264 is as follows:

Code Block
/*
 * Description : Transpose 4x4 block with word elements in vectors
 * Arguments   : Inputs  - in0, in1, in2, in3
 *               Outputs - out0, out1, out2, out3
 * Details     :
 * Example     :
 *               1, 2, 3, 4            1, 5, 9,13
 *               5, 6, 7, 8    to      2, 6,10,14
 *               9,10,11,12  =====>    3, 7,11,15
 *              13,14,15,16            4, 8,12,16
 */
.macro LSX_TRANSPOSE4x4_W in0, in1, in2, in3, out0, out1, out2, out3, \
                          tmp0, tmp1

    vilvl.w    \tmp0,   \in1,    \in0
    vilvh.w    \out1,   \in1,    \in0
    vilvl.w    \tmp1,   \in3,    \in2
    vilvh.w    \out3,   \in3,    \in2

    vilvl.d    \out0,   \tmp1,   \tmp0
    vilvl.d    \out2,   \out3,   \out1
    vilvh.d    \out3,   \out3,   \out1
    vilvh.d    \out1,   \tmp1,   \tmp0
.endm

Matrix transposition can be achieved by executing several interleave instructions. The register contents at each step of a 4x4 transposition using the interleave method are shown below:

Code Block
# input
in0: [a0 a1 a2 a3]
in1: [b0 b1 b2 b3]
in2: [c0 c1 c2 c3]
in3: [d0 d1 d2 d3] 

    vilvl.w    \tmp0,   \in1,    \in0
// tmp0: [a0 b0 a1 b1]
    vilvh.w    \out1,   \in1,    \in0
// out1: [a2 b2 a3 b3]
    vilvl.w    \tmp1,   \in3,    \in2
// tmp1: [c0 d0 c1 d1]
    vilvh.w    \out3,   \in3,    \in2
// out3: [c2 d2 c3 d3]
    vilvl.d    \out0,   \tmp1,   \tmp0
// out0: [a0 b0 c0 d0]
    vilvl.d    \out2,   \out3,   \out1
// out2: [a2 b2 c2 d2]
    vilvh.d    \out3,   \out3,   \out1
// out3: [a3 b3 c3 d3]
    vilvh.d    \out1,   \tmp1,   \tmp0
// out1: [a1 b1 c1 d1]

# output
out0: [a0 b0 c0 d0]
out1: [a1 b1 c1 d1]
out2: [a2 b2 c2 d2]
out3: [a3 b3 c3 d3]

These two instructions in LoongArch are essentially the same as zip1 and zip2 in AArch64. Similarly, the punpckl / h instructions in x86 exhibit the same behavior. In x264, x86 also uses punpckl / h for matrix transposition.
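
The interleave-based transposition can also be sketched in scalar C. This is only an illustrative model under stated assumptions: `zip` and `transpose4x4_u32` are hypothetical names, the operand order is simplified so that the first source supplies the first element of each pair, and `k` (lane width in 32-bit words) is an assumption used to model the `.w` and `.d` forms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of interleave-low (high = 0) and interleave-high (high = 1).
 * A "lane" is k consecutive 32-bit words: k = 1 models .w, k = 2 models .d.
 * d must not alias a or b. */
static void zip(uint32_t *d, const uint32_t *a, const uint32_t *b,
                size_t n, size_t k, int high)
{
    assert(k > 0 && n % (2 * k) == 0);
    const size_t half = n / 2;
    const size_t base = high ? half : 0;
    for (size_t i = 0; i < half; i += k) {
        memcpy(&d[2 * i],     &a[base + i], k * sizeof(uint32_t));
        memcpy(&d[2 * i + k], &b[base + i], k * sizeof(uint32_t));
    }
}

/* Two interleave stages, following the register trace above. */
void transpose4x4_u32(uint32_t m[4][4])
{
    uint32_t t[4][4];
    zip(t[0], m[0], m[1], 4, 1, 0);  /* [a0 b0 a1 b1] */
    zip(t[1], m[0], m[1], 4, 1, 1);  /* [a2 b2 a3 b3] */
    zip(t[2], m[2], m[3], 4, 1, 0);  /* [c0 d0 c1 d1] */
    zip(t[3], m[2], m[3], 4, 1, 1);  /* [c2 d2 c3 d3] */
    zip(m[0], t[0], t[2], 4, 2, 0);  /* [a0 b0 c0 d0] */
    zip(m[1], t[0], t[2], 4, 2, 1);  /* [a1 b1 c1 d1] */
    zip(m[2], t[1], t[3], 4, 2, 0);  /* [a2 b2 c2 d2] */
    zip(m[3], t[1], t[3], 4, 2, 1);  /* [a3 b3 c3 d3] */
}
```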

Implementation in RISCV64

Using RISC-V RVV, we have found three methods to perform matrix transposition (thanks to camel-cdr for the assistance):

  • Using segmented load or store

  • Using vrgather

  • Using vnsrl

Here, we use the transposition of a 4x8 (2x4x4) matrix, where the left 4x4 and the right 4x4 are transposed separately, to illustrate the first two methods.

Segmented load or store

With this method, we use the `vssseg4e16.v` instruction to store each row of the original matrix into memory by columns, and then read the data back by rows. Since we are transposing a 4x8 matrix, we also need `vslideup` to combine the contents of two registers.

Code Block
// Using extra loads and stores, and use vslide to combine them
.macro TRANSPOSE4x8_16 buf, bstride, v0, v1, v2, v3, t0, t1, t2, t3
    vssseg4e16.v \v0, (\buf), \bstride
    vsetivli zero, 4, e16, mf2, ta, ma
    vle16.v \v0, (\buf)
    add \buf, \buf, \bstride
    vle16.v \v1, (\buf)
    add \buf, \buf, \bstride
    vle16.v \v2, (\buf)
    add \buf, \buf, \bstride
    vle16.v \v3, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t0, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t1, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t2, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t3, (\buf)
    add \buf, \buf, \bstride
    vsetivli zero, 2, e64, m1, tu, ma
    vslideup.vi \v0, \t0, 1
    vslideup.vi \v1, \t1, 1
    vslideup.vi \v2, \t2, 1
    vslideup.vi \v3, \t3, 1
.endm

// under VLEN=128
function transpose4x8_16_one
    vsetivli zero, 8, e16, m1, ta, ma
    mv          t0, a0
    vl4re16.v   v0, (a0)
    li          t1, 8
    TRANSPOSE4x8_16 t0, t1, v0, v1, v2, v3, v8, v9, v10, v11
    vs4r.v   v0, (a0)
    ret
endfunc

The drawback of this method is that it has to go through memory, so it cannot reach the performance ceiling of pure in-register operations. In addition, it always needs a buffer, and sometimes the buffer contents must be protected from corruption (as in dav1d), which costs more instructions.
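
The memory layout produced by the strided segment store can be modelled in a few lines of C. This is a sketch under assumptions, not a description of any hardware: `sseg4_store_u16` is a hypothetical name, registers are modelled as arrays of eight 16-bit elements (VLEN=128), and the model only accepts strides that do not make segments overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { NFIELDS = 4 };  /* a 4-field segment: one element from each of four registers */

/* Scalar model of a strided 4-field segment store of 16-bit elements:
 * segment i holds element i of each register and is written contiguously
 * at buf + i * stride_bytes. */
void sseg4_store_u16(uint8_t *buf, size_t stride_bytes,
                     uint16_t regs[NFIELDS][8], size_t vl)
{
    assert(vl <= 8 && stride_bytes >= NFIELDS * sizeof(uint16_t));
    for (size_t i = 0; i < vl; i++)
        for (size_t f = 0; f < NFIELDS; f++)
            memcpy(buf + i * stride_bytes + f * sizeof(uint16_t),
                   &regs[f][i], sizeof(uint16_t));
}
```

With a stride of 8 bytes the segments are packed back to back, so the buffer holds column 0 of the 4x8 matrix, then column 1, and so on; unit-stride loads of four elements then pick up one column each.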

Vrgather

`vrgather` rearranges the elements of a register group according to an index vector. There are two ways to obtain the index: build it with instructions, or load it from memory.

To build the index by hand, the index for gathered vector N is set to (i&3)*vl+(i&~3u)+N, where i is the element index produced by vid.v.

Code Block
// Using vrgather with index created by hand
.macro TRANSPOSE4x8_16_vrgather v0, v1, v2, v3, t0, t1, t2, t3, t4, t5, t6, t7, s0
    vsetivli    zero, 8, e16, m1, ta, ma
    vid.v       \t0
    li          \s0, 8
    vand.vi     \t1, \t0, 3
    vmul.vx     \t1, \t1, \s0
    vand.vi     \t0, \t0, -4

    vadd.vv     \t4, \t1, \t0
    vadd.vi     \t5, \t4, 1
    vadd.vi     \t6, \t4, 2
    vadd.vi     \t7, \t4, 3
    
    li          \s0, 32
    vsetvli    zero, \s0, e16, m4, ta, ma
    vrgatherei16.vv \t0, \v0, \t4
    vmv.v.v     \v0, \t0
.endm

// under VLEN=128
function transpose4x8_16_two
    vl4re16.v   v0, (a0)
    TRANSPOSE4x8_16_vrgather v0, v1, v2, v3, v8, v9, v10, v11, v12, v13, v14, v15, t0
    vs4r.v   v0, (a0)
    ret
endfunc

Alternatively, we can read the index from memory.

Code Block
const   scan4x8_frame, align=8
    .half   0, 8, 16, 24, 4, 12, 20, 28
    .half   1, 9, 17, 25, 5, 13, 21, 29
    .half   2, 10, 18, 26, 6, 14, 22, 30
    .half   3, 11, 19, 27, 7, 15, 23, 31
endconst

// under VLEN=128
function transpose4x8_16_three
    vl4re16.v   v0, (a0) 
    movrel      t0, scan4x8_frame
    vl4re16.v   v4, (t0)

    li          t1, 32
    vsetvli 	zero, t1, e16, m4, ta, ma
    vrgatherei16.vv v8, v0, v4

    vs4r.v      v8, (a0)
    ret
endfunc
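
As a cross-check of the index formula (i&3)*vl+(i&~3u)+N against the scan4x8_frame table, the gather can be modelled in scalar C. The names `gather_index` and `transpose4x8_gather` are hypothetical, and the LMUL=4 register group is assumed to behave as a flat array of 32 elements (VLEN=128, 16-bit elements).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { ROWS = 4, VL = 8 };  /* four registers of eight 16-bit elements */

/* Index of the source element for element i of gathered register n. */
static uint16_t gather_index(unsigned n, unsigned i)
{
    assert(n < ROWS && i < VL);
    return (uint16_t)((i & 3u) * VL + (i & ~3u) + n);
}

/* Scalar model of the gather over the whole register group:
 * dst[j] = src[index[j]] for the 32 flat element positions. */
void transpose4x8_gather(uint16_t dst[ROWS * VL], const uint16_t src[ROWS * VL])
{
    for (unsigned n = 0; n < ROWS; n++)
        for (unsigned i = 0; i < VL; i++)
            dst[n * VL + i] = src[gather_index(n, i)];
}
```

If `src[k] = k`, then `dst` reproduces the index table itself, which makes the formula easy to compare with the constants loaded from memory.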

Based on our current results, `vrgather` is much slower than segmented load/store (vsseg: 0.277785 seconds, vrgather.vv: 1.545038 seconds). However, we believe there is still significant room to improve on segmented load/store, since it is not a pure in-register operation.

Another issue is that the hot functions of x264, specifically the SATD family, use `trn1` and `trn2` extensively in the AArch64 implementation. These operations simplify the calculations and improve SIMD performance, but performing them in RVV is currently quite expensive.

Code Block
// Each vtrn macro simulates two AArch64 instructions: trn1 and trn2

.macro vtrn_8h d0, d1, s0, s1, t0, t1, t3
    vsetivli        zero, 4, e32, m1, ta, ma
    vsll.vi         \t3, \s0, 16
    vsrl.vi         \t1, \s1, 16
    vsrl.vi         \t0, \s0, 16
    vsll.vi         \d0, \s1, 16
    vsll.vi         \d1, \t1, 16
    vsrl.vi         \t3, \t3, 16
    vsetivli        zero, 8, e16, m1, ta, ma
    vor.vv          \d0, \d0, \t3
    vor.vv          \d1, \d1, \t0
.endm

.macro vtrn_4s d0, d1, s0, s1, t0, t1, t3
    vsetivli        zero, 2, e64, m1, ta, ma
    li              t5, 32
    vsll.vx         \t3, \s0, t5
    vsrl.vx         \t1, \s1, t5
    vsrl.vx         \t0, \s0, t5
    vsll.vx         \d0, \s1, t5
    vsll.vx         \d1, \t1, t5
    vsrl.vx         \t3, \t3, t5
    vsetivli        zero, 4, e32, m1, ta, ma
    vor.vv          \d0, \d0, \t3
    vor.vv          \d1, \d1, \t0
.endm

This is also one of the main reasons why we want to add instructions similar to `trn1` and `trn2` in RVV.

Vnsrl

Olaf pointed out a new method to achieve matrix transposition, using the vnsrl instruction in RVV along with vslide instructions to achieve the effect of zip1 and zip2 in AArch64. Olaf provided detailed information for this method, and we are very grateful for his work. Below is an approach that works with VLEN=128:

Code Block
# VLEN=128 transpose one 4x4 matrix of 16-bit elements stored in 4 vreg:
#   a b c d          a e i m
#   e f g h  -----\  b f j n
#   i j k l  -----/  c g k o
#   m n o p          d h l p
    
## setup code:
# li t1, 32

vsetvli t0, x0, e32, m1, ta, ma
vslideup.vi v0, v1, 2
vslideup.vi v2, v3, 2
vmv1r.v v1, v2

# v0: a b c d e f g h
# v1: i j k l m n o p

vnsrl.wi v4, v0, 0
vnsrl.wx v6, v0, t1

# v4: a b e f i j m n
# v6: c d g h k l o p

vsetvli t0, x0, e16, mf2, ta, ma
vnsrl.wi v0, v4, 0
vnsrl.wi v1, v4, 16
vnsrl.wi v2, v6, 0
vnsrl.wi v3, v6, 16

# v0: a e i m
# v1: b f j n
# v2: c g k o
# v3: d h l p
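
The two narrowing-shift stages can be modelled in scalar C to see why they yield the transpose. This is an illustrative sketch only: `transpose4x4_vnsrl` is a hypothetical name, and each 64-bit value is assumed to hold one matrix row with element 0 in the least-significant bits, matching the register layout after the vslideup step.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the vnsrl-based 4x4 transpose of 16-bit elements.
 * Stage 1 narrows 64 -> 32 bits (shift 0 or 32),
 * stage 2 narrows 32 -> 16 bits (shift 0 or 16). */
void transpose4x4_vnsrl(uint16_t out[4][4], uint16_t in[4][4])
{
    assert((void *)out != (void *)in);  /* the model does not work in place */
    for (int r = 0; r < 4; r++) {
        uint64_t row = 0;
        for (int c = 0; c < 4; c++)
            row |= (uint64_t)in[r][c] << (16 * c);

        uint32_t lo = (uint32_t)row;          /* stage 1, shift 0:  elements 0,1 */
        uint32_t hi = (uint32_t)(row >> 32);  /* stage 1, shift 32: elements 2,3 */

        out[0][r] = (uint16_t)lo;             /* stage 2, shift 0  */
        out[1][r] = (uint16_t)(lo >> 16);     /* stage 2, shift 16 */
        out[2][r] = (uint16_t)hi;
        out[3][r] = (uint16_t)(hi >> 16);
    }
}
```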

Proposal

vtrn1.vv: Interleave the even-indexed elements of the first and second source vectors into the destination vector (elements from the first source vector go to the even index positions, and elements from the second source vector go to the odd index positions).

vtrn2.vv: Interleave the odd-indexed elements of the first and second source vectors into the destination vector (elements from the first source vector go to the even index positions, and elements from the second source vector go to the odd index positions).

Code Block
vtrn1.vv vd, vs2, vs1   # vd[2i] = vs1[2i], vd[2i+1] = vs2[2i]
vtrn2.vv vd, vs2, vs1   # vd+1[2i] = vs1[2i+1], vd+1[2i+1] = vs2[2i+1]

We implemented the instructions on GEM5 and evaluated the performance gain.

Performance of transpose benchmarks

Each benchmark performs the following steps:

  • Load every row into a separate register

  • Do the in-register transpose

  • Store the result back to memory

Code Block
function transpose4x8_16_vssseg
    // Note: vl=8 at e16/mf2 requires VLEN >= 256
    vsetivli zero, 8, e16, mf2, ta, ma
    mv          t0, a0
    mv          t1, a0
    li          t2, 8       // assumed segment stride in bytes, as in transpose4x8_16_one
    vle16.v     v0, (a0)
    addi        t1, t1, 16
    vle16.v     v1, (t1)
    addi        t1, t1, 16
    vle16.v     v2, (t1)
    addi        t1, t1, 16
    vle16.v     v3, (t1)
    
    TRANSPOSE4x8_16 t0, t2, v0, v1, v2, v3, v8, v9, v10, v11
    
    vsetivli zero, 8, e16, mf2, ta, ma
    vse16.v     v0, (a0)
    addi        a0, a0, 16
    vse16.v     v1, (a0)
    addi        a0, a0, 16
    vse16.v     v2, (a0)
    addi        a0, a0, 16
    vse16.v     v3, (a0)

    ret
endfunc

The results of different transpose implementations are as follows:

Transpose benchmarks            Cycles
TRNS_4x4_16_VSSSEG              14
TRNS_4x4_16_VRGATHER            17
TRNS_4x4_16_VNSRL               18
TRNS_4x4_16_VTRN_Extension      15
TRNS_4x8_16_VSSSEG              64
TRNS_4x8_16_VRGATHER            19
TRNS_4x8_16_VTRN_MACRO          25
TRNS_4x8_16_VTRN_Extension      15

Performance of SATD functions in x264

According to the test on GEM5, a 40% gain can be achieved for larger SATD functions.

...

2. Absolute difference instructions

Introduction

x264 needs widening absolute-difference-accumulate operations, which account for 5%~6% of the running time of both x264 and SPEC CPU 525.x264_r.

https://wiki.videolan.org/X264_asm_intro/#Example_2:_pixel_sad

Implementation in other ISAs

AArch64

AArch64 has several instructions for calculating absolute differences, depending on the signedness and data type of the input and output:

  1. SABD / UABD - signed / unsigned absolute difference

  2. SABDL / UABDL - signed / unsigned absolute difference (double-width result)

  3. SABA / UABA - signed / unsigned absolute difference and add

  4. SABAL / UABAL - signed / unsigned absolute difference (double-width result) and add

x86

Compute sum of absolute difference: psadbw

Implementation in RISCV64

RVV needs 3~4 instructions to implement each of these operations:

Code Block
.macro uabd d0, s0, s1, t0
	vmaxu.vv \d0, \s0, \s1
	vminu.vv \t0, \s0, \s1
	vsub.vv \d0, \d0, \t0 
.endm

.macro sabd d0, s0, s1, t0
	vmax.vv \d0, \s0, \s1
	vmin.vv \t0, \s0, \s1
	vsub.vv \d0, \d0, \t0 
.endm

.macro uabal d0, s0, s1, t0, t1
	vmaxu.vv \t1, \s0, \s1
	vminu.vv \t0, \s0, \s1
	vsub.vv \t0, \t1, \t0
	vwaddu.wv \d0, \d0, \t0 
.endm

.macro uabdl d0, s0, s1, t0, t1
	vmaxu.vv \t1, \s0, \s1
	vminu.vv \t0, \s0, \s1
	vwsubu.vv \d0, \t1, \t0 
.endm

Proposal

  • Vector Single-Width Signed/Unsigned Integer Absolute Difference

Code Block
# Unsigned Absolute Difference.
vabdu.vv   vd, vs2, vs1, vm   # vd[i] = abs(unsigned(vs2[i]) - unsigned(vs1[i]))
vabdu.vx   vd, vs2, rs1, vm   # vd[i] = abs(unsigned(vs2[i]) - unsigned(x[rs1]))
vabdu.vi   vd, vs2, imm, vm   # vd[i] = abs(unsigned(vs2[i]) - unsigned(imm))
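
For reference, the per-element result of the proposed unsigned absolute difference, and the way a SAD kernel would accumulate it, can be written in scalar C. This is a sketch of the intended semantics as we read the comments above; `abdu8` and `sad_row_u8` are hypothetical names, and 8-bit pixels with a 32-bit accumulator are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-element semantics of an unsigned absolute difference:
 * the larger operand minus the smaller one, so the result never wraps. */
static inline uint8_t abdu8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : b - a);
}

/* Widening absolute-difference accumulate over n pixels, as in a SAD kernel. */
uint32_t sad_row_u8(const uint8_t *p, const uint8_t *q, size_t n)
{
    assert(p != NULL && q != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += abdu8(p[i], q[i]);
    return sum;
}
```

The max/min/sub sequence in the RVV macros computes the same per-element value as `abdu8`; a single instruction would replace those three.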

Performance of SAD functions

According to the test on GEM5, a 30% gain can be achieved for larger SAD functions.

...

3. Zero-extended vmv.x.s

Introduction

The vmv.x.s instruction copies a single SEW-wide element from index 0 of the source vector register to a destination integer register. If SEW > XLEN, the least-significant XLEN bits are transferred and the upper SEW-XLEN bits are ignored. If SEW < XLEN, the value is sign-extended to XLEN bits.

Moving a uint16_t element from a vector register to a scalar register is very common, and in that case the sign extension has to be undone with an extra zero-extension instruction.

Implementation in RISCV64

Code Block
//uint16_t with zbb extension
vsetivli zero, 1, e16, m1, ta, ma
vmv.x.s a1, v1
zext.h a1, a1
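
The effect of the sign extension, and of the extra zext.h, can be shown with a small scalar model in C. The function names are hypothetical and XLEN=64 is assumed; the sign extension is written with unsigned arithmetic so that the C code has fully defined behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Value left in a 64-bit integer register by vmv.x.s with SEW=16:
 * the 16-bit element sign-extended to XLEN bits. */
static inline uint64_t vmv_x_s_e16(uint16_t elem)
{
    uint64_t x = elem;
    return (x ^ 0x8000u) - 0x8000u;  /* branch-free sign extension from bit 15 */
}

/* Effect of zext.h: keep the low 16 bits, clear the rest. */
static inline uint64_t zext_h(uint64_t x)
{
    return x & 0xFFFFu;
}

/* The two-instruction sequence used today to move a uint16_t to a scalar. */
uint64_t move_u16_to_scalar(uint16_t elem)
{
    uint64_t x = zext_h(vmv_x_s_e16(elem));
    assert(x == elem);
    return x;
}
```

For an element such as 0x8000, `vmv_x_s_e16` alone yields 0xFFFFFFFFFFFF8000, which is why unsigned 16-bit data needs the second instruction.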

4. Rounded Shift Right Narrow

Introduction

RVV 1.0 has instructions for:

  • shift + scaling:  vssra

  • shift + narrow:  vnsrl

  • clip + narrow: vnclip

However, it does not have a combined "shift + scaling + narrow" instruction.

Implementation in RISCV64

Code Block
// AArch64 implementation
rshrn v20.8b, v20.8h, #3
rshrn2 v20.16b, v21.8h, #3


// RISCV64 implementation
vsetivli zero, 8, e16, m1, ta, ma
vssrl.vi v20, v20, 3
vssrl.vi v21, v21, 3
vsetivli zero, 8, e8, mf2, ta, ma
vncvt.x.x.w v20, v20
vncvt.x.x.w v21, v21
vsetivli zero, 16, e8, m1, ta, ma
vslideup.vi v20, v21, 8
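
For one lane, the result of the AArch64 rshrn instruction above can be written as a scalar C reference. This is a sketch, with `rshrn_u16` as a hypothetical name; it assumes round-to-nearest-up, which for the RVV sequence means the fixed-point rounding mode vxrm is set to rnu.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded shift right narrow for one 16-bit lane:
 * add half of the discarded range, shift right, keep the low 8 bits.
 * There is no saturation; bits above bit 7 are simply dropped. */
static inline uint8_t rshrn_u16(uint16_t x, unsigned shift)
{
    assert(shift >= 1 && shift <= 8);
    return (uint8_t)(((uint32_t)x + (1u << (shift - 1))) >> shift);
}
```

With a shift of 3, inputs 3 and 4 round to 0 and 1 respectively, which is the behaviour the vssrl + vncvt pair has to reproduce in two steps.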

5. Signed saturate and Narrow to Unsigned

Introduction

Implementation in RISCV64

Code Block
// AArch64 implementation
sqxtun v0.8b, v0.8h


// RISCV64 implementation
vsetivli zero, 4, e16, m1, ta, ma
vmax.vx v0, v0, zero

vsetivli zero, 4, e8, mf2, ta, ma
vnclipu.wi v4, v0, 0
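
The per-lane result that both sequences aim for can be stated as a scalar C reference. This is a minimal sketch with hypothetical names (`sqxtun_s16`, `sqxtun_row`): a signed 16-bit value is clamped to the unsigned 8-bit range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Signed saturate and narrow to unsigned for one lane:
 * negative values become 0, values above 255 become 255. */
static inline uint8_t sqxtun_s16(int16_t x)
{
    if (x < 0)
        return 0;
    if (x > UINT8_MAX)
        return UINT8_MAX;
    return (uint8_t)x;
}

/* Apply the conversion to n lanes. */
void sqxtun_row(uint8_t *dst, const int16_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = sqxtun_s16(src[i]);
}
```

In the RVV sequence, the lower clamp corresponds to vmax.vx with zero and the upper clamp to the unsigned saturation in vnclipu.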