1
0
mirror of https://git.ffmpeg.org/ffmpeg.git synced 2025-02-21 14:26:59 +00:00
Commit Graph

97 Commits

Author SHA1 Message Date
sunyuechi
98596f90f4 lavc/aacencdsp: R-V V abs_pow34
C908:
abs_pow34_c: 535.5
abs_pow34_rvv_f32: 337.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-11 18:42:07 +02:00
Rémi Denis-Courmont
272d0c164d lavc/lpc: R-V V apply_welch_window
apply_welch_window_even_c:       617.5
apply_welch_window_even_rvv_f64: 235.0
apply_welch_window_odd_c:        709.0
apply_welch_window_odd_rvv_f64:  256.5
2023-12-11 18:17:43 +02:00
Rémi Denis-Courmont
b3825bbe45 riscv: test for assembler support
This should fix the build on LLVM 16 and earlier, at the cost of turning
all non-RVV optimisations off.
2023-12-08 17:21:09 +02:00
sunyuechi
0b9d009b4a lavc/vc1dsp: R-V V inv_trans
C908:
vc1dsp.vc1_inv_trans_4x4_dc_c:      125.7
vc1dsp.vc1_inv_trans_4x4_dc_rvv_i32: 53.5
vc1dsp.vc1_inv_trans_4x8_dc_c:      230.7
vc1dsp.vc1_inv_trans_4x8_dc_rvv_i32: 65.5
vc1dsp.vc1_inv_trans_8x4_dc_c:      228.7
vc1dsp.vc1_inv_trans_8x4_dc_rvv_i64: 64.5
vc1dsp.vc1_inv_trans_8x8_dc_c:      476.5
vc1dsp.vc1_inv_trans_8x8_dc_rvv_i64: 80.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-08 17:20:48 +02:00
sunyuechi
8bdb663062 lavc/ac3dsp: R-V V float_to_fixed24
c910
    float_to_fixed24_c: 2207.2
    float_to_fixed24_rvv_f32: 696.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-06 16:04:22 +02:00
Rémi Denis-Courmont
0fa421c8f1 lavc/llvidencdsp: add R-V V diff_bytes
diff_bytes_c:      163.0
diff_bytes_rvv_i32: 52.7
2023-11-23 18:57:18 +02:00
Rémi Denis-Courmont
0183c2c830 lavc/aacpsdsp: use LMUL=2 and amortise strides
The input is laid out in 16 segments, of which 13 actually need to be
loaded. There are no really efficient ways to deal with this:
1) If we load 8 segments wit unit stride, then narrow to 16 segments with
   right shifts, we can only get one half-size vector per segment, or just 2
   elements per vector (EMUL=1/2) - at least with 128-bit vectors.
   This ends up unsurprisingly about as fas as the C code.
2) The current approach is to load with strides. We keep that approach,
   but improve it using three 4-segmented loads instead of 12 single-segment
   loads. This divides the number of distinct loaded addresses by 4.
3) A potential third approach would be to avoid segmentation altogether
   and splat the scalar coefficient into vectors. Then we can use a
   unit-stride and maximum EMUL. But the downside then is that we have to
   multiply the 3 (of 16) unused segments with zero as part of the
   multiply-accumulate operations.

In addition, we also reuse vectors mid-loop so as to increase the EMUL
from 1 to 2, which also improves performance a little bit.

Oeverall the gains are quite small with the device under test, as it does
not deal with segmented loads very well. But at least the code is tidier,
and should enjoy bigger speed-ups on better hardware implementation.

Before:
ps_hybrid_analysis_c:       1819.2
ps_hybrid_analysis_rvv_f32: 1037.0 (before)
ps_hybrid_analysis_rvv_f32:  990.0 (after)
2023-11-23 18:57:18 +02:00
Rémi Denis-Courmont
b88d4058f9 lavc/g722dsp: optimise R-V V apply_qmf
This stores the constant coefficients deinterleaved, so that they can be
loaded directly with NF=0. Unfortunately, we cannot optimise loading the
input, due to insufficient memory alignment (not 32-bit).

Before:
g722_apply_qmf_c:       82.5
g722_apply_qmf_rvv_i32: 78.2

After:
g722_apply_qmf_c:       82.5
g722_apply_qmf_rvv_i32: 65.2
2023-11-23 18:57:18 +02:00
Rémi Denis-Courmont
fbc7adba67 lavc/llviddsp: R-V V add_bytes
add_bytes_c:      2077.2
add_bytes_rvv_i32: 105.0
2023-11-18 22:07:14 +02:00
Rémi Denis-Courmont
ca664f2254 lavc/flacdsp: R-V V LPC16 function
In this case, the inner loop computing the scalar product can be reduced
to just one multiplication and one sum even with 128-bit vectors. The
result is a lot simpler, but also brings more modest performance gains:

flac_lpc_16_13_c:       15241.0
flac_lpc_16_13_rvv_i32: 11230.0
flac_lpc_16_16_c:       17884.0
flac_lpc_16_16_rvv_i32: 12125.7
flac_lpc_16_29_c:       27847.7
flac_lpc_16_29_rvv_i32: 10494.0
flac_lpc_16_32_c:       30051.5
flac_lpc_16_32_rvv_i32: 10355.0
2023-11-18 22:06:57 +02:00
Rémi Denis-Courmont
295092b46d lavc/flacdsp: R-V V LPC32
The entire set of 32 coefficients and corresponding past 32 samples can
fit in a single vector (with LMUL=8) exactly, but... since widening
double the needed vector sizes, we still end up too short with 128-bit
vectors. This adds a very simple version for future 256+-bit hardware,
and for pred_orders values up to 16, and a bit more involved loop for
for 128-bit hardware with pred_orders between 17 and 32.

With 128-bit hardware, the benchmarks look like this:
flac_lpc_32_13_c:       30152.0
flac_lpc_32_13_rvv_i32: 10244.7
flac_lpc_32_16_c:       37314.2
flac_lpc_32_16_rvv_i32: 10126.2
flac_lpc_32_29_c:       61910.0
flac_lpc_32_29_rvv_i32: 14495.2
flac_lpc_32_32_c:       68204.0
flac_lpc_32_32_rvv_i32: 13273.7
2023-11-18 22:05:43 +02:00
Rémi Denis-Courmont
07c303b708 lavc/flacdsp: R-V V decorrelate_indep 16-bit packed
flac_decorrelate_indep2_16_c:        981.7
flac_decorrelate_indep2_16_rvv_i32:  199.2
flac_decorrelate_indep4_16_c:       1749.7
flac_decorrelate_indep4_16_rvv_i32:  401.2
flac_decorrelate_indep6_16_c:       2517.7
flac_decorrelate_indep6_16_rvv_i32:  858.0
flac_decorrelate_indep8_16_c:       3285.7
flac_decorrelate_indep8_16_rvv_i32: 1123.5
2023-11-17 23:59:56 +02:00
Rémi Denis-Courmont
fb0295e5fd lavc/flacdsp: R-V V decorrelate_indep 32-bit packed
flac_decorrelate_indep2_32_c:       981.7
flac_decorrelate_indep2_32_rvv_i32: 183.7
flac_decorrelate_indep4_32_c:      1749.7
flac_decorrelate_indep4_32_rvv_i32: 362.5
flac_decorrelate_indep6_32_c:      2517.7
flac_decorrelate_indep6_32_rvv_i32: 715.2
flac_decorrelate_indep8_32_c:      3285.7
flac_decorrelate_indep8_32_rvv_i32: 909.0
2023-11-17 23:59:56 +02:00
Rémi Denis-Courmont
6183a69c0b lavc/flacdsp: R-V V decorrelate_ms packed
flac_decorrelate_ms_16_c:       585.5
flac_decorrelate_ms_16_rvv_i32: 263.0
flac_decorrelate_ms_32_c:       584.7
flac_decorrelate_ms_32_rvv_i32: 250.0
2023-11-17 23:59:23 +02:00
Rémi Denis-Courmont
636ae0e0bc lavc/flacdsp: R-V V packed decorrelate_{l,r}s
flac_decorrelate_ms_16_c:       457.2
flac_decorrelate_ms_16_rvv_i32: 203.0
flac_decorrelate_ms_32_c:       457.2
flac_decorrelate_ms_32_rvv_i32: 203.5
flac_decorrelate_rs_16_c:       456.2
flac_decorrelate_rs_16_rvv_i32: 207.0
flac_decorrelate_rs_32_c:       456.2
flac_decorrelate_rs_32_rvv_i32: 210.5
2023-11-17 23:59:22 +02:00
Rémi Denis-Courmont
d076517056 lavc/llauddsp: R-V V scalarproduct_and_madd_int32
scalarproduct_and_madd_int32_c:      10899.7
scalarproduct_and_madd_int32_rvv_i32: 1749.0
2023-11-16 16:53:44 +02:00
Rémi Denis-Courmont
45d0eb3f70 lavc/llauddsp: R-V V scalarproduct_and_madd_int16
scalarproduct_and_madd_int16_c:      10355.7
scalarproduct_and_madd_int16_rvv_i32: 1480.0
2023-11-16 16:53:44 +02:00
Rémi Denis-Courmont
90a779bed6 lavc/huffyuvdsp: basic R-V V add_hfyu_left_pred_bgr32
Better performance can probably be achieved with a more intricate
unrolled loop, but this is a start:

add_hfyu_left_pred_bgr32_c: 15084.0
add_hfyu_left_pred_bgr32_rvv_i32: 10280.2

This would actually be cleaner with the RISC-V P extension, but that is
not ratified yet (I think?) and usually not supported if V is supported.
2023-11-15 16:51:07 +02:00
Rémi Denis-Courmont
c536e92207 lavc/sbrdsp: R-V V hf_apply_noise functions
This is restricted to 128-bit vectors as larger vector sizes could read
past the end of the noise array. Support for future hardware with larger
vector sizes is left for some other time.

hf_apply_noise_0_c:       2319.7
hf_apply_noise_0_rvv_f32: 1229.0
hf_apply_noise_1_c:       2539.0
hf_apply_noise_1_rvv_f32: 1244.7
hf_apply_noise_2_c:       2319.7
hf_apply_noise_2_rvv_f32: 1232.7
hf_apply_noise_3_c:       2541.2
hf_apply_noise_3_rvv_f32: 1244.2
2023-11-13 18:34:29 +02:00
Rémi Denis-Courmont
5b33104fca lavc/sbrdsp: R-V V hf_gen
hf_gen_c:      2922.7
hf_gen_rvv_f32: 731.5
2023-11-13 18:33:02 +02:00
Rémi Denis-Courmont
cd7b352c53 lavc/sbrdsp: R-V V autocorrelate
With 5 accumulator vectors and 6 inputs, this can only use LMUL=2.
Also the number of vector loop iterations is small, just 5 on 128-bit
vector hardware.

The vector loop is somewhat unusual in that it processes data in
descending memory order, in order to save on vector slides:
in descending order, we can extract elements to carry over to the next
iteration from the bottom of the vectors directly. With ascending order
(see in the Opus postfilter function), there are no ways to get the top
elements directly. On the downside, this requires the use of separate
shift and sub (the would-be SH3SUB instruction does not exist), with
a small pipeline stall on the vector load address.

The edge cases in scalar are done in scalar as this saves on loads
and remains significantly faster than C.

autocorrelate_c: 669.2
autocorrelate_rvv_f32: 421.0
2023-11-12 14:03:09 +02:00
Rémi Denis-Courmont
f576a0835b lavc/aacpsdsp: rework R-V V hybrid_synthesis_deint
Given the size of the data set, strided memory accesses cannot be avoided.
We can still do better than the current code.

ps_hybrid_synthesis_deint_c:       12065.5
ps_hybrid_synthesis_deint_rvv_i32: 13650.2 (before)
ps_hybrid_synthesis_deint_rvv_i64:  8181.0 (after)
2023-11-12 14:03:09 +02:00
Rémi Denis-Courmont
eb508702a8 lavc/aacpsdsp: rework R-V V add_squares
Segmented loads may be slower than not. So this advantageously uses a
unit-strided load and narrowing shifts instead.

Before:
ps_add_squares_c: 60757.7
ps_add_squares_rvv_f32: 22242.5

After:
ps_add_squares_c: 60516.0
ps_add_squares_rvv_i64: 17067.7
2023-11-12 14:03:09 +02:00
Rémi Denis-Courmont
adc87a5f7c lavc/opusdsp: rewrite R-V V postfilter
This uses a more traditional approach allowing up processing of up to
period minus two elements per iteration. This also allows the algorithm
to work for all and any vector length.

As the T-Head C908 device under test can load 16 elements loop, there is
unsurprisingly a little performance drop when the period is minimal and
the parallelism is capped at 13 elements:

Before:
postfilter_15_c:         21222.2
postfilter_15_rvv_f32:   22007.7
postfilter_512_c:        20189.7
postfilter_512_rvv_f32:  22004.2
postfilter_1022_c:       20189.7
postfilter_1022_rvv_f32: 22004.2

After:
postfilter_15_c:         20189.5
postfilter_15_rvv_f32:    7057.2
postfilter_512_c:        20189.5
postfilter_512_rvv_f32:   5667.2
postfilter_1022_c:       20192.7
postfilter_1022_rvv_f32:  5667.2
2023-11-06 22:09:30 +02:00
Rémi Denis-Courmont
02594c8c01 lavc/pixblockdsp: rework R-V V get_pixels_unaligned
As in the aligned case, we can use VLSE64.V, though the way of doing so
gets more convoluted, so the performance gains are more modest:

get_pixels_unaligned_c:       126.7
get_pixels_unaligned_rvv_i32: 145.5 (before)
get_pixels_unaligned_rvv_i64:  62.2 (after)

For the reference, those are the aligned benchmarks (unchanged) on the
same T-Head C908 hardware:

get_pixels_c:                 126.7
get_pixels_rvi:                85.7
get_pixels_rvv_i64:            33.2
2023-11-06 19:42:49 +02:00
Rémi Denis-Courmont
f68ad5d2de lavc/sbrdsp: R-V V sbr_hf_g_filt
hf_g_filt_c:      1552.5
hf_g_filt_rvv_f32: 679.5
2023-11-06 19:42:49 +02:00
Rémi Denis-Courmont
d06fd18f8f lavc/sbrdsp: R-V V neg_odd_64
With 128-bit vectors, this is mostly pointless but also harmless.
Performance gains should be more noticeable with larger vector sizes.

neg_odd_64_c:       76.2
neg_odd_64_rvv_i64: 74.7
2023-11-01 22:53:26 +02:00
Rémi Denis-Courmont
b0aba7dd0c lavc/sbrdsp: R-V V sum_square
sum_square_c:       803.5
sum_square_rvv_f32: 283.2
2023-11-01 22:53:26 +02:00
Rémi Denis-Courmont
86bee42473 lavc/sbrdsp: R-V V sum64x5
sum64x5_c:       385.0
sum64x5_rvv_f32: 116.0
2023-11-01 22:53:26 +02:00
Rémi Denis-Courmont
92bcc6703a lavc/pixblockdsp: remove R-V V get_pixels_16
In the aligned case, the existing RVI assembler is actually much
faster. In the unaligned case, there is nothing much to gain over C.
2023-11-01 19:27:22 +02:00
Rémi Denis-Courmont
28840cf499 lavc/jpeg2000dsp: R-V V rct_int
jpeg2000_rct_int_c:       2592.2
jpeg2000_rct_int_rvv_i32: 1154.2
2023-11-01 18:52:55 +02:00
Rémi Denis-Courmont
73dea2bb91 lavc/jpeg2000dsp: R-V V ict_float
jpeg2000_ict_float_c:       3112.2
jpeg2000_ict_float_rvv_f32: 1225.0
2023-11-01 18:52:55 +02:00
Rémi Denis-Courmont
424c8ceb08 lavc/huffyuvdsp: R-V V add_int16
add_int16_128_c:      2390.5
add_int16_128_rvv_i32: 832.0
add_int16_rnd_width_c:      2390.2
add_int16_rnd_width_rvv_i32: 832.5
2023-10-31 21:33:25 +02:00
Rémi Denis-Courmont
7e1cdc69fb lavc/utvideodsp: R-V V restore_rgb_planes10
restore_rgb_planes10_c:      185852.2
restore_rgb_planes10_rvv_i32: 90130.5
2023-10-31 21:33:25 +02:00
Rémi Denis-Courmont
4aea0da230 lavc/utvideodsp: R-V V restore_rgb_planes
restore_rgb_planes_c:      133065.7
restore_rgb_planes_rvv_i32: 33317.2
2023-10-31 21:33:25 +02:00
Rémi Denis-Courmont
ae72412aa8 lavc/idctdsp: improve R-V V put_pixels_clamped 2023-10-30 18:14:16 +02:00
Rémi Denis-Courmont
d48810f3a5 lavc/idctdsp: improve R-V V add_pixels_clamped 2023-10-30 18:14:16 +02:00
Rémi Denis-Courmont
600c6f1b55 lavc/idctdsp: improve R-V V put_signed_pixels_clamped
This follows the same idea as with pixblockdsp, but applied at the
other end, whilst writing data at the end of the function.
2023-10-30 18:14:16 +02:00
Rémi Denis-Courmont
3ea2310e89 lavc/idctdsp: require Zve64x for R-V V functions
This will be required for the following changesets.
2023-10-30 18:14:16 +02:00
Rémi Denis-Courmont
300ee8b02d lavc/pixblockdsp: aligned R-V V 8-bit functions
If the scan lines are aligned, we can load each row as a 64-bit value,
thus avoiding segmentation. And then we can factor the conversion or
subtraction.

In principle, the same optimisation should be possible for high depth,
but would require 128-bit elements, for which no FFmpeg CPU flag
exists.
2023-10-30 18:14:16 +02:00
Rémi Denis-Courmont
722765687b lavc/pixblockdsp: rename unaligned R-V V functions 2023-10-30 18:14:16 +02:00
Rémi Denis-Courmont
3c6516330f lavc/exrdsp: R-V V reoder_pixels 2023-10-09 19:52:51 +03:00
Rémi Denis-Courmont
89c10d8d20 lavc/ac3: add R-V Zbb extract_exponents 2023-10-05 18:13:00 +03:00
Rémi Denis-Courmont
cec48e3b32 riscv: factor out the bswap32 assembler 2023-10-02 22:28:21 +03:00
Rémi Denis-Courmont
b36f3d5330 lavc/fmtconvert: unroll R-V V int32_to_float_fmul_scalar 2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont
f3dfd4ccf2 lavc/aacpsdsp: unroll RISC-V V hybrid_synthesis_deint 2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont
0f1336b285 lavc/aacpsdsp: unroll RISC-V V hybrid_analysis_ileave 2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont
69d7486e59 lavc/aacpsdsp: unroll RISC-V V mul_pair_single 2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont
c270928cc0 lavc/aacpsdsp: unroll R-V V stereo interpolate 2023-10-02 18:08:23 +03:00
Rémi Denis-Courmont
27d74fc1ef lavc/aacpsdsp: simplify R-V V stereo interpolate
Remove some useless vector splat.
2023-10-02 18:08:23 +03:00