...
AArch64 provides the `trn1` and `trn2` instructions. A `trn1`/`trn2` pair applied to the same two vector registers performs several 2x2 matrix transpositions at once, one for each pair of adjacent lanes. Larger matrices are transposed by repeating this 2x2 step at different element sizes. The AArch64 transpose macros in x264 are as follows:
```asm
.macro transpose t1, t2, s1, s2
    trn1        \t1,  \s1,  \s2
    trn2        \t2,  \s1,  \s2
.endm

.macro transpose4x4.h v0, v1, v2, v3, t0, t1, t2, t3
    transpose   \t0\().2s,  \t2\().2s,  \v0\().2s,  \v2\().2s
    transpose   \t1\().2s,  \t3\().2s,  \v1\().2s,  \v3\().2s
    transpose   \v0\().4h,  \v1\().4h,  \t0\().4h,  \t1\().4h
    transpose   \v2\().4h,  \v3\().4h,  \t2\().4h,  \t3\().4h
.endm

.macro transpose4x8.h v0, v1, v2, v3, t0, t1, t2, t3
    transpose   \t0\().4s,  \t2\().4s,  \v0\().4s,  \v2\().4s
    transpose   \t1\().4s,  \t3\().4s,  \v1\().4s,  \v3\().4s
    transpose   \v0\().8h,  \v1\().8h,  \t0\().8h,  \t1\().8h
    transpose   \v2\().8h,  \v3\().8h,  \t2\().8h,  \t3\().8h
.endm

.macro transpose8x8.h r0, r1, r2, r3, r4, r5, r6, r7, r8, r9
    trn1        \r8\().8h,  \r0\().8h,  \r1\().8h
    trn2        \r9\().8h,  \r0\().8h,  \r1\().8h
    trn1        \r1\().8h,  \r2\().8h,  \r3\().8h
    trn2        \r3\().8h,  \r2\().8h,  \r3\().8h
    trn1        \r0\().8h,  \r4\().8h,  \r5\().8h
    trn2        \r5\().8h,  \r4\().8h,  \r5\().8h
    trn1        \r2\().8h,  \r6\().8h,  \r7\().8h
    trn2        \r7\().8h,  \r6\().8h,  \r7\().8h

    trn1        \r4\().4s,  \r0\().4s,  \r2\().4s
    trn2        \r2\().4s,  \r0\().4s,  \r2\().4s
    trn1        \r6\().4s,  \r5\().4s,  \r7\().4s
    trn2        \r7\().4s,  \r5\().4s,  \r7\().4s
    trn1        \r5\().4s,  \r9\().4s,  \r3\().4s
    trn2        \r9\().4s,  \r9\().4s,  \r3\().4s
    trn1        \r3\().4s,  \r8\().4s,  \r1\().4s
    trn2        \r8\().4s,  \r8\().4s,  \r1\().4s

    trn1        \r0\().2d,  \r3\().2d,  \r4\().2d
    trn2        \r4\().2d,  \r3\().2d,  \r4\().2d
    trn1        \r1\().2d,  \r5\().2d,  \r6\().2d
    trn2        \r5\().2d,  \r5\().2d,  \r6\().2d
    trn2        \r6\().2d,  \r8\().2d,  \r2\().2d
    trn1        \r2\().2d,  \r8\().2d,  \r2\().2d
    trn1        \r3\().2d,  \r9\().2d,  \r7\().2d
    trn2        \r7\().2d,  \r9\().2d,  \r7\().2d
.endm
```
Here, `transpose4x4.h` and `transpose4x8.h` transpose 4x4 and 4x8 (two 4x4) matrices of 16-bit elements by invoking the `transpose` macro repeatedly, once at 32-bit and once at 16-bit element size.
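The lane movement behind these macros can be checked with a small scalar model. The following is a minimal Python sketch, not x264 code: it assumes each register is a list of 16-bit lanes, and the names `trn`, `transpose`, `transpose4x4_h`, and `transpose_pow2` are hypothetical helpers introduced only for illustration. The `width` parameter stands in for the arrangement specifier (1 for `.4h`/`.8h`, 2 for `.2s`/`.4s`, 4 for `.2d`).

```python
def trn(a, b, width, odd):
    """Scalar model of TRN1 (odd=False) or TRN2 (odd=True).

    a and b are lists of 16-bit lanes; `width` lanes form one element.
    """
    # Group the fine lanes into elements of `width` lanes each.
    ea = [a[i:i + width] for i in range(0, len(a), width)]
    eb = [b[i:i + width] for i in range(0, len(b), width)]
    k = 1 if odd else 0
    out = []
    # From every element pair, take the even (TRN1) or odd (TRN2) element
    # of a, followed by the same element of b.
    for p in range(0, len(ea), 2):
        out += ea[p + k] + eb[p + k]
    return out


def transpose(s1, s2, width):
    """Model of the `transpose` macro: returns (trn1 result, trn2 result)."""
    return trn(s1, s2, width, False), trn(s1, s2, width, True)


def transpose4x4_h(v0, v1, v2, v3):
    """Follows the two stages of the transpose4x4.h macro (assumed model)."""
    # Stage 1 (.2s): exchange 32-bit elements between rows 0/2 and rows 1/3.
    t0, t2 = transpose(v0, v2, 2)
    t1, t3 = transpose(v1, v3, 2)
    # Stage 2 (.4h): exchange 16-bit elements between the temporaries.
    v0, v1 = transpose(t0, t1, 1)
    v2, v3 = transpose(t2, t3, 1)
    return v0, v1, v2, v3


def transpose_pow2(rows):
    """Transpose an NxN matrix (N a power of two) in log2(N) TRN stages."""
    n = len(rows)
    rows = [list(r) for r in rows]
    width = 1
    while width < n:
        nxt = list(rows)
        for i in range(n):
            if i & width == 0:
                # Pair row i with row i + width at element size `width`.
                nxt[i], nxt[i + width] = transpose(rows[i], rows[i + width], width)
        rows = nxt
        width *= 2
    return rows


# Usage: transpose a 4x4 matrix whose entries encode (row, column) as 10*r + c.
matrix = [[10 * r + c for c in range(4)] for r in range(4)]
for row in transpose4x4_h(*matrix):
    print(row)
```

Each stage exchanges one bit of the row index with the same bit of the column index, so the stages can run in any order. `transpose4x4.h` starts with the wider elements, while `transpose8x8.h` goes from `.8h` to `.4s` to `.2d`.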
...
On LoongArch, matrix transposition is implemented with the interleave instructions:
- `vilvl` (Vector Interleave Low)
- `vilvh` (Vector Interleave High)
The LoongArch 4x4 transpose macro implementation in x264 is as follows:
...
```asm
# input
#   in0: [a0 a1 a2 a3]
#   in1: [b0 b1 b2 b3]
#   in2: [c0 c1 c2 c3]
#   in3: [d0 d1 d2 d3]
vilvl.w    \tmp0,   \in1,    \in0    // tmp0: [a0 b0 a1 b1]
vilvh.w    \out1,   \in1,    \in0    // out1: [a2 b2 a3 b3]
vilvl.w    \tmp1,   \in3,    \in2    // tmp1: [c0 d0 c1 d1]
vilvh.w    \out3,   \in3,    \in2    // out3: [c2 d2 c3 d3]
vilvl.d    \out0,   \tmp1,   \tmp0   // out0: [a0 b0 c0 d0]
vilvl.d    \out2,   \out3,   \out1   // out2: [a2 b2 c2 d2]
vilvh.d    \out3,   \out3,   \out1   // out3: [a3 b3 c3 d3]
vilvh.d    \out1,   \tmp1,   \tmp0   // out1: [a1 b1 c1 d1]
# output
#   out0: [a0 b0 c0 d0]
#   out1: [a1 b1 c1 d1]
#   out2: [a2 b2 c2 d2]
#   out3: [a3 b3 c3 d3]
```
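The interleave sequence above can be replayed in scalar code to confirm the lane comments. This is a minimal Python sketch, assuming each 128-bit register is modeled as a list of four 32-bit lanes; `vilv` and `transpose4x4_w` are hypothetical names used only here, and `width` stands in for the element suffix (1 for `.w`, 2 for `.d`).

```python
def vilv(vj, vk, width, high):
    """Scalar model of vilvl (high=False) or vilvh (high=True).

    vj and vk are lists of 32-bit lanes; `width` lanes form one element.
    Elements of vk land in the even result positions, elements of vj in
    the odd ones, matching the operand order `vilvl vd, vj, vk`.
    """
    ej = [vj[i:i + width] for i in range(0, len(vj), width)]
    ek = [vk[i:i + width] for i in range(0, len(vk), width)]
    half = len(ej) // 2
    start = half if high else 0  # low half or high half of the sources
    out = []
    for i in range(start, start + half):
        out += ek[i] + ej[i]
    return out


def transpose4x4_w(in0, in1, in2, in3):
    """Replays the eight interleave steps of the 4x4 sequence in order."""
    tmp0 = vilv(in1, in0, 1, False)    # [a0 b0 a1 b1]
    out1 = vilv(in1, in0, 1, True)     # [a2 b2 a3 b3]
    tmp1 = vilv(in3, in2, 1, False)    # [c0 d0 c1 d1]
    out3 = vilv(in3, in2, 1, True)     # [c2 d2 c3 d3]
    out0 = vilv(tmp1, tmp0, 2, False)  # [a0 b0 c0 d0]
    out2 = vilv(out3, out1, 2, False)  # [a2 b2 c2 d2]
    out3 = vilv(out3, out1, 2, True)   # [a3 b3 c3 d3]
    out1 = vilv(tmp1, tmp0, 2, True)   # [a1 b1 c1 d1]
    return out0, out1, out2, out3


# Usage: rows a, b, c, d with symbolic lane names.
rows = [[f"{name}{i}" for i in range(4)] for name in "abcd"]
for out in transpose4x4_w(*rows):
    print(out)
```

The order of the last four steps matters in this model as it does in the assembly: `out3` and `out1` are both read as temporaries before being overwritten with their final values.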