lavu/fixed_dsp: optimise R-V V fmul_reverse

Gathers are (unsurprisingly) a notable exception to the rule that R-V V
gets faster with larger group multipliers. So re-roll the function with
smaller multipliers (m1/m2 instead of m4/m8) to speed it up.
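
For context, a minimal C sketch of the operation being vectorised (illustrative
only, not FFmpeg's actual source; the function and variable names are made up):
a Q31 fixed-point multiply of src0 against src1 read back to front, rounded to
nearest as selected by the csrwi vxrm, 0 below. Reversing src1 in registers is
what forces the gather.

    /* Illustrative scalar reference for vector_fmul_reverse on Q31 data.
     * Rounding matches vxrm = 0 (round-to-nearest-up); the saturation that
     * vsmul applies to the -1.0 * -1.0 corner case is omitted here. */
    #include <stdint.h>

    static void fmul_reverse_q31(int32_t *dst, const int32_t *src0,
                                 const int32_t *src1, int len)
    {
        for (int i = 0; i < len; i++) {
            int64_t p = (int64_t)src0[i] * src1[len - 1 - i];
            dst[i] = (int32_t)((p + 0x40000000) >> 31);
        }
    }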

Before:
vector_fmul_reverse_fixed_c:       2840.7
vector_fmul_reverse_fixed_rvv_i32: 2430.2

After:
vector_fmul_reverse_fixed_c:       2841.0
vector_fmul_reverse_fixed_rvv_i32:  962.2

It might be possible to further optimise the function by moving the
reverse-subtract out of the loop and adding ad-hoc tail handling.
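
A hypothetical shape for that follow-up, sketched in plain C rather than
assembly (it shows the loop split only, not actual FFmpeg code): run the bulk
of the work with a fixed vector length so the reversed index vector produced
by vrsub.vx is loop-invariant and can be built once up front, then handle the
leftover elements in a separate tail.

    /* Hypothetical restructuring: a fixed-size main loop (so the reverse
     * index vector would only be built once) plus an ad-hoc scalar tail. */
    #include <stdint.h>

    static void fmul_reverse_q31_split(int32_t *dst, const int32_t *src0,
                                       const int32_t *src1, int len, int vl)
    {
        int i = 0;
        /* Main loop: always vl elements, so indices [vl-1 .. 0] are reusable. */
        for (; i + vl <= len; i += vl)
            for (int j = 0; j < vl; j++)
                dst[i + j] = (int32_t)(((int64_t)src0[i + j] *
                                        src1[len - 1 - i - j] + 0x40000000) >> 31);
        /* Tail: the remaining len % vl elements. */
        for (; i < len; i++)
            dst[i] = (int32_t)(((int64_t)src0[i] *
                                src1[len - 1 - i] + 0x40000000) >> 31);
    }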
Author: Rémi Denis-Courmont
Date:   2023-11-19 13:24:29 +02:00
Commit: 3a134e8299
Parent: 4adb93dff0
1 changed file with 4 additions and 3 deletions

@@ -83,16 +83,17 @@ endfunc
 func ff_vector_fmul_reverse_fixed_rvv, zve32x
         csrwi   vxrm, 0
-        vsetvli t0, zero, e16, m4, ta, ma
+        // e16/m4 and e32/m8 are possible but slow the gathers down.
+        vsetvli t0, zero, e16, m1, ta, ma
         sh2add  a2, a3, a2
         vid.v   v0
         vadd.vi v0, v0, 1
 1:
-        vsetvli t0, a3, e16, m4, ta, ma
+        vsetvli t0, a3, e16, m1, ta, ma
         slli    t1, t0, 2
         vrsub.vx v4, v0, t0 // v4[i] = [VL-1, VL-2... 1, 0]
         sub     a2, a2, t1
-        vsetvli zero, zero, e32, m8, ta, ma
+        vsetvli zero, zero, e32, m2, ta, ma
         vle32.v v8, (a2)
         sub     a3, a3, t0
         vle32.v v16, (a1)