While porting the H.264 encoder x264 to RISC-V, we have identified several operations that are challenging to implement efficiently with existing RVV instructions. In some cases, implementations require too many instructions and/or transfers to/from memory, potentially impacting encoder performance.
...
This is an open collaboration. All ideas and contributions are valuable as we work together to enhance RISC-V's video codec capabilities.
Contact Information
- Yin Tong - yintong.ustc@bytedance.com
- Jiayan Qian - qianjiayan.1@bytedance.com
- Punit Agrawal - punit.agrawal@bytedance.com
...
- Vector transpose
- Absolute difference
- Zero-extended vmv.x.s
- Rounded Shift Right Narrow
- Signed saturate and Narrow to Unsigned
1. Vector transpose instructions
...
Introduction
In x264, matrix transpose instructions serve two purposes: performing actual matrix transposition, and permuting elements between vectors. Both uses are quite frequent.
...
Using RISC-V RVV, we have discovered three methods to perform matrix transposition (thanks to camel-cdr for the assistance provided):
- Using segmented load or store
- Using vrgather
- Using vnsrl
Here, we use the example of transposing a 4x8 (2x4x4) matrix (transposing the left 4x4 and the right 4x4 separately) to illustrate these methods.
Segmented load or store
With this approach, we use the `vssseg4e16.v` instruction to store each row of the original matrix into memory by columns, and then read the data back by rows. Since we are transposing a 4x8 matrix, we also need `vslide` instructions to combine the contents of two registers.
Code Block
// Using extra loads and stores, and use vslide to combine them
.macro TRANSPOSE4x8_16 buf, bstride, v0, v1, v2, v3, t0, t1, t2, t3
    vssseg4e16.v \v0, (\buf), \bstride
    vsetivli zero, 4, e16, mf2, ta, ma
    vle16.v \v0, (\buf)
    add \buf, \buf, \bstride
    vle16.v \v1, (\buf)
    add \buf, \buf, \bstride
    vle16.v \v2, (\buf)
    add \buf, \buf, \bstride
    vle16.v \v3, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t0, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t1, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t2, (\buf)
    add \buf, \buf, \bstride
    vle16.v \t3, (\buf)
    add \buf, \buf, \bstride
    vsetivli zero, 2, e64, m1, tu, ma
    vslideup.vi \v0, \t0, 1
    vslideup.vi \v1, \t1, 1
    vslideup.vi \v2, \t2, 1
    vslideup.vi \v3, \t3, 1
.endm

// under VLEN=128
function transpose4x8_16_one
    vsetivli zero, 8, e16, m1, ta, ma
    mv t0, a0
    vl4re16.v v0, (a0)
    li t1, 8
    TRANSPOSE4x8_16 t0, t1, v0, v1, v2, v3, v8, v9, v10, v11
    vs4r.v v0, (a0)
    ret
endfunc
...
To create the index by hand, the idea is to set the index for gathering vector N to (i&3)*vl+(i&~3u)+N, where i is the element index obtained by vid.v.
Code Block
// Using vrgather with index created by hand
.macro TRANSPOSE4x8_16_vrgather v0, v1, v2, v3, t0, t1, t2, t3, t4, t5, t6, t7, s0
    vsetivli zero, 8, e16, m1, ta, ma
    vid.v \t0
    li \s0, 8
    vand.vi \t1, \t0, 3
    vmul.vx \t1, \t1, \s0
    vand.vi \t0, \t0, -4
    vadd.vv \t4, \t1, \t0
    vadd.vi \t5, \t4, 1
    vadd.vi \t6, \t4, 2
    vadd.vi \t7, \t4, 3
    li \s0, 32
    vsetvli zero, \s0, e16, m4, ta, ma
    vrgatherei16.vv \t0, \v0, \t4
    vmv.v.v \v0, \t0
.endm

// under VLEN=128
function transpose4x8_16_two
    vl4re16.v v0, (a0)
    TRANSPOSE4x8_16_vrgather v0, v1, v2, v3, v8, v9, v10, v11, v12, v13, v14, v15, t0
    vs4r.v v0, (a0)
    ret
endfunc
...
This is also one of the main reasons why we want to add instructions similar to `trn1` and `trn2` in RVV.
Vnsrl
Olaf pointed out a new method to achieve matrix transposition, using the vnsrl instruction in RVV along with vslide instructions to achieve the effect of zip1 and zip2 in AArch64. Olaf provided detailed information for this method, and we are very grateful for his work. Below is an approach that works with VLEN=128:
...
2. Absolute difference instructions
...
Introduction
x264 needs widening absolute-difference-accumulate operations, which account for 5%~6% of running time in both x264 and SPEC CPU 525.x264_r.
https://wiki.videolan.org/X264_asm_intro/#Example_2:_pixel_sad
Implementation in other ISAs
Aarch64
absolute difference: sabd/uabd
Aarch64 has a few different instructions, depending on the signedness and data type of input and output, to calculate absolute differences:
- SABD / UABD - signed / unsigned absolute difference
- SABDL / UABDL - signed / unsigned absolute difference (double-width result)
...
add absolute difference: saba/uaba
...
- SABA / UABA - signed / unsigned absolute difference and add
- SABAL / UABAL - signed / unsigned absolute difference and add (double-width result)
x86
compute sum of absolute differences: psadbw
Implementation in RISCV64
...
Code Block
.macro uabd d0, s0, s1, t0
    vmaxu.vv \d0, \s0, \s1
    vminu.vv \t0, \s0, \s1
    vsub.vv \d0, \d0, \t0
.endm

.macro sabd d0, s0, s1, t0
    vmax.vv \d0, \s0, \s1
    vmin.vv \t0, \s0, \s1
    vsub.vv \d0, \d0, \t0
.endm

.macro uabal d0, s0, s1, t0, t1
    vmaxu.vv \t1, \s0, \s1
    vminu.vv \t0, \s0, \s1
    vsub.vv \t0, \t1, \t0
    vwaddu.wv \d0, \d0, \t0
.endm

.macro uabdl d0, s0, s1, t0, t1
    vmaxu.vv \t1, \s0, \s1
    vminu.vv \t0, \s0, \s1
    vwsubu.vv \d0, \t1, \t0
.endm
3. Zero-extended vmv.x.s
...
Introduction
The vmv.x.s instruction copies a single SEW-wide element from index 0 of the source vector register to a destination integer register, sign-extending it to XLEN.
...
Code Block
//uint16_t with zbb extension
vsetivli zero, 1, e16, m1, ta, ma
vmv.x.s a1, v1
zext.h a1, a1
4. Rounded Shift Right Narrow
...
Introduction
RVV 1.0 already has instructions for:
- shift + scaling: vssra
- shift + narrow: vnsrl
- clip + narrow: vnclip
Implementation in RISCV64
Code Block
// AArch64 implementation
rshrn v20.8b, v20.8h, #3
rshrn2 v20.16b, v21.8h, #3
// RISCV64 implementation
vsetivli zero, 8, e16, m1, ta, ma
vssrl.vi v20, v20, 3
vssrl.vi v21, v21, 3
vsetivli zero, 8, e8, mf2, ta, ma
vncvt.x.x.w v20, v20
vncvt.x.x.w v21, v21
vsetivli zero, 16, e8, m1, ta, ma
vslideup.vi v20, v21, 8
5. Signed saturate and Narrow to Unsigned
Introduction
Implementation in RISCV64
Code Block
// AArch64 implementation
sqxtun v0.8b, v0.8h
// RISCV64 implementation
vsetivli zero, 8, e16, m1, ta, ma
vmax.vx v0, v0, zero
vsetivli zero, 8, e8, mf2, ta, ma
vnclipu.wi v4, v0, 0
...