Page Comparison

While adapting the H264 standard encoder x264 for RISC-V, we've identified several operations that are challenging to implement efficiently with existing RVV instructions. In some cases, implementations require too many instructions, potentially impacting encoder performance.

...

This is an open collaboration. All ideas and contributions are valuable as we work together to enhance RISC-V's video codec capabilities.


Contact Information

...

Code Block

# input
in0: [a0 a1 a2 a3]
in1: [b0 b1 b2 b3]
in2: [c0 c1 c2 c3]
in3: [d0 d1 d2 d3] 

    vilvl.w    \tmp0,   \in1,    \in0
// tmp0: [a0 b0 a1 b1]
    vilvh.w    \out1,   \in1,    \in0
// out1: [a2 b2 a3 b3]
    vilvl.w    \tmp1,   \in3,    \in2
// tmp1: [c0 d0 c1 d1]
    vilvh.w    \out3,   \in3,    \in2
// out3: [c2 d2 c3 d3]
    vilvl.d    \out0,   \tmp1,   \tmp0
// out0: [a0 b0 c0 d0]
    vilvl.d    \out2,   \out3,   \out1
// out2: [a2 b2 c2 d2]
    vilvh.d    \out3,   \out3,   \out1
// out3: [a3 b3 c3 d3]
    vilvh.d    \out1,   \tmp1,   \tmp0
// out1: [a1 b1 c1 d1]

# output
out0: [a0 b0 c0 d0]
out1: [a1 b1 c1 d1]
out2: [a2 b2 c2 d2]
out3: [a3 b3 c3 d3]

These two instructions in LoongArch are essentially the same as zip1 and zip2 in AArch64. Similarly, the punpckl / h instructions in x86 exhibit the same behavior. In x264, x86 also uses punpckl / h for matrix transposition.

Implementation in RISCV64

Using RISC-V RVV, we have discovered two methods to perform matrix transposition(thanks camel-cdr for the assistance provided):

Using segmented load or store
Using vrgather
Using vnsrl

Here, we use the example of transposing a 4x8 (2x4x4) matrix (transposing the left 4x4 and the right 4x4 separately) to illustrate these two methods.

Segmented load or store

In this way, we can use the `vssseg4e16.v` instruction to store each row of the original matrix into memory by columns, and then read them back by rows. Since we are transposing a 4x8 matrix, we also need to use `vslide` to combine the contents of the two registers together.

Code Block

// Using extra loads and stores, and use vslide to combine them
.macro TRANSPOSE4x8_16 buf, bstride, v0, v1, v2, v3, t0, t1, t2, t3
    vssseg4e16.v \v0, (\buf), \bstride
    vsetivli zero, 4, e16, mf2, ta, ma
    vle16.v \v0, (\buf)
    add \buf, \buf, \bstride
    vle16.v \v1, (\buf)
    add \buf, \buf, \bstride
    vle16.v \v2, (\buf)
    add \buf, \buf, \bstride
    vle16.v \v3, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t0, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t1, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t2, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t3, (\buf)
    add \buf, \buf, \bstride
    vsetivli zero, 2, e64, m1, tu, ma
    vslideup.vi \v0, \t0, 1
    vslideup.vi \v1, \t1, 1
    vslideup.vi \v2, \t2, 1
    vslideup.vi \v3, \t3, 1
.endm

function transpose4x8_16_one
    vsetivli zero, 8, e16, m1, ta, ma
    mv          t0, a0
    vl4re16.v   v0, (a0)
    li          t1, 8
    TRANSPOSE4x8_16 t0, t1, v0, v1, v2, v3, v8, v9, v10, v11
    vs4r.v   v0, (a0)
    ret
endfunc

...

For creating index by hand, the idea is to set the index for gathering vector N to (i&3)*vl+(i&~3u)+N, where i is the element index obtained by vid.v.

Code Block

// Using vrgather with index created by hand
.macro TRANSPOSE4x8_16_vrgather v0, v1, v2, v3, t0, t1, t2, t3, t4, t5, t6, t7, s0
    vsetivli    zero, 8, e16, m1, ta, ma
    vid.v       \t0
    li          \s0, 8
    vand.vi     \t1, \t0, 3
    vmul.vx     \t1, \t1, \s0
    vand.vi     \t0, \t0, -4

    vadd.vv     \t4, \t1, \t0
    vadd.vi     \t5, \t4, 1
    vadd.vi     \t6, \t4, 2
    vadd.vi     \t7, \t4, 3
    
    li          \s0, 32
    vsetvli    zero, \s0, e16, m4, ta, ma
    vrgatherei16.vv \t0, \v0, \t4
    vmv.v.v     \v0, \t0
.endm

function transpose4x8_16_two
    vl4re16.v   v0, (a0)
    TRANSPOSE4x8_16_vrgather v0, v1, v2, v3, v8, v9, v10, v11, v12, v13, v14, v15, t0
    vs4r.v   v0, (a0)
    ret
endfunc

...

This is also one of the main reasons why we want to add instructions similar to `trn1` and `trn2` in RVV.

Vnsrl

Olaf pointed out a new method to achieve matrix transposition, using the vnsrl instruction in RVV along with vslide instructions to achieve the effect of zip1 and zip2 in AArch64. Olaf provided detailed information for this method, and we are very grateful for his work. Below is an approach that works with VLEN=128:

Code Block

# VLEN=128 transpose one 4x4 matrix of 16-bit elements stored in 4 vreg:
#   a b c d          a e i m
#   e f g h  -----\  b f j n
#   i j k l  -----/  c g k o
#   m n o p          d h l p
    
## setup code:
# li t1, 32

vsetvli t0, x0, e32, m1, ta, ma
vslideup.vi v0, v1, 2
vslideup.vi v2, v3, 2
vmv1r.v v1, v2

# v0: a b c d e f g h
# v1: i j k l m n o p

vnsrl.wi v4, v0, 0
vnsrl.wx v6, v0, t1

# v4: a b e f i j m n
# v6: c d g h k l o p

vsetvli t0, x0, e16, mf2, ta, ma
vnsrl.wi v0, v4, 0
vnsrl.wi v1, v4, 16
vnsrl.wi v2, v6, 0
vnsrl.wi v3, v6, 16

# v0: a e i m
# v1: b f j n
# v2: c g k o
# v3: d h l p

Versions Compared

Old Version 7

New Version 8

Key

Implementation in RISCV64

Segmented load or store

Vnsrl