Content Comparison

...

Aarch64

In aarch64, there are trn1 and trn2 instructions. By combining one trn1 and one trn2, multiple 2x2 matrix transpositions can be completed between two vector registers. Larger matrix transpositions can be achieved by repeatedly calling 2x2 matrix transpositions of different scales. The aarch64's transpose macro implementation in x264 is as follows:

Code Block

.macro transpose t1, t2, s1, s2
    trn1        \t1,  \s1,  \s2
    trn2        \t2,  \s1,  \s2
.endm

.macro transpose4x4.h v0, v1, v2, v3, t0, t1, t2, t3
    transpose   \t0\().2s,  \t2\().2s,  \v0\().2s,  \v2\().2s
    transpose   \t1\().2s,  \t3\().2s,  \v1\().2s,  \v3\().2s
    transpose   \v0\().4h,  \v1\().4h,  \t0\().4h,  \t1\().4h
    transpose   \v2\().4h,  \v3\().4h,  \t2\().4h,  \t3\().4h
.endm

.macro transpose4x8.h v0, v1, v2, v3, t0, t1, t2, t3
    transpose   \t0\().4s,  \t2\().4s,  \v0\().4s,  \v2\().4s
    transpose   \t1\().4s,  \t3\().4s,  \v1\().4s,  \v3\().4s
    transpose   \v0\().8h,  \v1\().8h,  \t0\().8h,  \t1\().8h
    transpose   \v2\().8h,  \v3\().8h,  \t2\().8h,  \t3\().8h
.endm

.macro transpose8x8.h r0, r1, r2, r3, r4, r5, r6, r7, r8, r9
    trn1        \r8\().8h,  \r0\().8h,  \r1\().8h
    trn2        \r9\().8h,  \r0\().8h,  \r1\().8h
    trn1        \r1\().8h,  \r2\().8h,  \r3\().8h
    trn2        \r3\().8h,  \r2\().8h,  \r3\().8h
    trn1        \r0\().8h,  \r4\().8h,  \r5\().8h
    trn2        \r5\().8h,  \r4\().8h,  \r5\().8h
    trn1        \r2\().8h,  \r6\().8h,  \r7\().8h
    trn2        \r7\().8h,  \r6\().8h,  \r7\().8h

    trn1        \r4\().4s,  \r0\().4s,  \r2\().4s
    trn2        \r2\().4s,  \r0\().4s,  \r2\().4s
    trn1        \r6\().4s,  \r5\().4s,  \r7\().4s
    trn2        \r7\().4s,  \r5\().4s,  \r7\().4s
    trn1        \r5\().4s,  \r9\().4s,  \r3\().4s
    trn2        \r9\().4s,  \r9\().4s,  \r3\().4s
    trn1        \r3\().4s,  \r8\().4s,  \r1\().4s
    trn2        \r8\().4s,  \r8\().4s,  \r1\().4s

    trn1        \r0\().2d,  \r3\().2d,  \r4\().2d
    trn2        \r4\().2d,  \r3\().2d,  \r4\().2d

    trn1        \r1\().2d,  \r5\().2d,  \r6\().2d
    trn2        \r5\().2d,  \r5\().2d,  \r6\().2d

    trn2        \r6\().2d,  \r8\().2d,  \r2\().2d
    trn1        \r2\().2d,  \r8\().2d,  \r2\().2d

    trn1        \r3\().2d,  \r9\().2d,  \r7\().2d
    trn2        \r7\().2d,  \r9\().2d,  \r7\().2d
.endm

Here, transpose4x4.h and transpose4x8.h achieve fast transpositions of 4x4 and 4x8 (2x4x4) matrices by repeatedly calling the transpose macro.

Loongarch

In loongarch, matrix transposition is implemented using the Interleave method.

vilvl (Vector Interleave Low)

vilvh (Vector Interleave High)

The Loongarch's 4x4 transpose macro implementation in x264 is as follows:

Code Block

/*
 * Description : Transpose 4x4 block with word elements in vectors
 * Arguments   : Inputs  - in0, in1, in2, in3
 *               Outputs - out0, out1, out2, out3
 * Details     :
 * Example     :
 *               1, 2, 3, 4            1, 5, 9,13
 *               5, 6, 7, 8    to      2, 6,10,14
 *               9,10,11,12  =====>    3, 7,11,15
 *              13,14,15,16            4, 8,12,16
 */
.macro LSX_TRANSPOSE4x4_W in0, in1, in2, in3, out0, out1, out2, out3, \
                          tmp0, tmp1

    vilvl.w    \tmp0,   \in1,    \in0
    vilvh.w    \out1,   \in1,    \in0
    vilvl.w    \tmp1,   \in3,    \in2
    vilvh.w    \out3,   \in3,    \in2

    vilvl.d    \out0,   \tmp1,   \tmp0
    vilvl.d    \out2,   \out3,   \out1
    vilvh.d    \out3,   \out3,   \out1
    vilvh.d    \out1,   \tmp1,   \tmp0
.endm

By performing multiple interleaved instrutions, matrix transposition can be achieved. Here is the value change of each register during the process of 4x4 matrix transposition using the Interleave method:

Code Block

# input
in0: [a0 a1 a2 a3]
in1: [b0 b1 b2 b3]
in2: [c0 c1 c2 c3]
in3: [d0 d1 d2 d3] 

    vilvl.w    \tmp0,   \in1,    \in0
// tmp0: [a0 b0 a1 b1]
    vilvh.w    \out1,   \in1,    \in0
// out1: [a2 b2 a3 b3]
    vilvl.w    \tmp1,   \in3,    \in2
// tmp1: [c0 d0 c1 d1]
    vilvh.w    \out3,   \in3,    \in2
// out3: [c2 d2 c3 d3]
    vilvl.d    \out0,   \tmp1,   \tmp0
// out0: [a0 b0 c0 d0]
    vilvl.d    \out2,   \out3,   \out1
// out2: [a2 b2 c2 d2]
    vilvh.d    \out3,   \out3,   \out1
// out3: [a3 b3 c3 d3]
    vilvh.d    \out1,   \tmp1,   \tmp0
// out1: [a1 b1 c1 d1]

# output
out0: [a0 b0 c0 d0]
out1: [a1 b1 c1 d1]
out2: [a2 b2 c2 d2]
out3: [a3 b3 c3 d3]

Version	Old Version 3	New Version 4
Changes made by	Yin Tong	Yin Tong
Saved on	Jul 01, 2024	Jul 01, 2024

Versions Compared

Key

Aarch64

Loongarch