Commit Graph

336 Commits

Author SHA1 Message Date
Logan Lyu e79686be96 lavc/aarch64: new optimization for 8-bit hevc_qpel_h hevc_qpel_uni_w_hv
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-06-06 12:50:18 +03:00
Logan Lyu 15972cce8c lavc/aarch64: new optimization for 8-bit hevc_qpel_uni_w_h
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-06-06 12:50:18 +03:00
Logan Lyu 0b7356c1b4 lavc/aarch64: new optimization for 8-bit hevc_pel_uni_w_pixels and qpel_uni_w_v
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-06-06 12:50:18 +03:00
xufuji456 bd2f00f665 codec/aarch64/hevc: add transform_luma_neon
got 56% speed up (run_count=1000, CPU=Cortex A53)
transform_4x4_luma_neon: 45 transform_4x4_luma_c: 103

Signed-off-by: xufuji456 <839789740@qq.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-04-14 12:07:57 +03:00
xufuji456 00a062b8d5 codec/aarch64/hevc:add idct_32x32_neon
got 73% speed up (run_count=1000, CPU=Cortex A53)
idct_32x32_neon: 4826 idct_32x32_c: 18236
idct_32x32_neon: 4824 idct_32x32_c: 18149
idct_32x32_neon: 4937 idct_32x32_c: 18333

Signed-off-by: Martin Storsjö <martin@martin.st>
2023-04-12 15:58:09 +03:00
J. Dekker b564ad8eac lavc/aarch64: add hevc deblock chroma 8-12bit
Benched on Ampere Altra:

hevc_h_loop_filter_chroma8_c: 367.7
hevc_h_loop_filter_chroma8_neon: 31.0
hevc_h_loop_filter_chroma10_c: 396.7
hevc_h_loop_filter_chroma10_neon: 27.5
hevc_h_loop_filter_chroma12_c: 377.0
hevc_h_loop_filter_chroma12_neon: 31.7
hevc_v_loop_filter_chroma8_c: 369.0
hevc_v_loop_filter_chroma8_neon: 55.0
hevc_v_loop_filter_chroma10_c: 389.0
hevc_v_loop_filter_chroma10_neon: 54.0
hevc_v_loop_filter_chroma12_c: 389.5
hevc_v_loop_filter_chroma12_neon: 53.0

Signed-off-by: J. Dekker <jdek@itanimul.li>
2023-04-06 06:54:26 +02:00
J. Dekker 37cde570bc lavc/aarch64: add clip N macro
Signed-off-by: J. Dekker <jdek@itanimul.li>
2023-03-22 14:48:13 +01:00
xufuji456 4b4de07721 libavcodec/hevc: add hevc idct4x4 neon of aarch64
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-02-28 13:12:52 +02:00
Martin Storsjö ec7fa13eb0 aarch64: hevcdsp_idct: Reuse preexisting macros for transposes
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-02-28 11:48:54 +02:00
Lynne e0661fc805
dca_core: convert to lavu/tx
Thanks to Martin Storsjö <martin@martin.st> for fixing and testing the
arm32 and aarch64 changes.
2022-11-06 14:39:36 +01:00
J. Dekker 9bed814e1d lavc/aarch64: add hevc horizontal qpel/uni/bi
checkasm --benchmark on Ampere Altra (Neoverse N1):

put_hevc_qpel_bi_h4_8_c: 170.7
put_hevc_qpel_bi_h4_8_neon: 64.5
put_hevc_qpel_bi_h6_8_c: 373.7
put_hevc_qpel_bi_h6_8_neon: 130.2
put_hevc_qpel_bi_h8_8_c: 662.0
put_hevc_qpel_bi_h8_8_neon: 138.5
put_hevc_qpel_bi_h12_8_c: 1529.5
put_hevc_qpel_bi_h12_8_neon: 422.0
put_hevc_qpel_bi_h16_8_c: 2735.5
put_hevc_qpel_bi_h16_8_neon: 560.5
put_hevc_qpel_bi_h24_8_c: 6015.7
put_hevc_qpel_bi_h24_8_neon: 1636.0
put_hevc_qpel_bi_h32_8_c: 10779.0
put_hevc_qpel_bi_h32_8_neon: 2204.5
put_hevc_qpel_bi_h48_8_c: 24375.0
put_hevc_qpel_bi_h48_8_neon: 4984.0
put_hevc_qpel_bi_h64_8_c: 42768.0
put_hevc_qpel_bi_h64_8_neon: 8795.7
put_hevc_qpel_h4_8_c: 149.0
put_hevc_qpel_h4_8_neon: 55.7
put_hevc_qpel_h6_8_c: 321.2
put_hevc_qpel_h6_8_neon: 106.0
put_hevc_qpel_h8_8_c: 578.7
put_hevc_qpel_h8_8_neon: 133.2
put_hevc_qpel_h12_8_c: 1279.0
put_hevc_qpel_h12_8_neon: 391.7
put_hevc_qpel_h16_8_c: 2286.2
put_hevc_qpel_h16_8_neon: 519.7
put_hevc_qpel_h24_8_c: 5100.7
put_hevc_qpel_h24_8_neon: 1546.2
put_hevc_qpel_h32_8_c: 9022.0
put_hevc_qpel_h32_8_neon: 2060.2
put_hevc_qpel_h48_8_c: 20293.5
put_hevc_qpel_h48_8_neon: 4656.7
put_hevc_qpel_h64_8_c: 36037.0
put_hevc_qpel_h64_8_neon: 8262.7
put_hevc_qpel_uni_h4_8_c: 162.2
put_hevc_qpel_uni_h4_8_neon: 61.7
put_hevc_qpel_uni_h6_8_c: 355.2
put_hevc_qpel_uni_h6_8_neon: 114.2
put_hevc_qpel_uni_h8_8_c: 651.0
put_hevc_qpel_uni_h8_8_neon: 135.7
put_hevc_qpel_uni_h12_8_c: 1412.5
put_hevc_qpel_uni_h12_8_neon: 402.7
put_hevc_qpel_uni_h16_8_c: 2551.0
put_hevc_qpel_uni_h16_8_neon: 533.5
put_hevc_qpel_uni_h24_8_c: 5782.2
put_hevc_qpel_uni_h24_8_neon: 1578.7
put_hevc_qpel_uni_h32_8_c: 10586.5
put_hevc_qpel_uni_h32_8_neon: 2102.2
put_hevc_qpel_uni_h48_8_c: 23812.0
put_hevc_qpel_uni_h48_8_neon: 4739.5
put_hevc_qpel_uni_h64_8_c: 42958.7
put_hevc_qpel_uni_h64_8_neon: 8366.5

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-10-25 14:56:38 +02:00
Reimar Döffinger 38cd829dce
aarch64: Implement stack spilling in a consistent way.
Currently it is done in several different ways, which
might cause needless dependencies or in case of
tx_float_neon.S is incorrect.

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Reimar Döffinger <Reimar.Doeffinger@gmx.de>
2022-10-11 09:12:02 +02:00
Grzegorz Bernacki 8f4b000c37 lavc/aarch64: Add neon implementation for vsse_intra8
Provide optimized implementation for vsse_intra8 for arm64.

Performance tests are shown below.
- vsse_5_c: 87.7
- vsse_5_neon: 26.2

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Co-authored-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-10-04 13:24:20 +03:00
Grzegorz Bernacki bad67cb9fd lavc/aarch64: Provide optimized implementation of vsse8 for arm64.
Provide optimized implementation of vsse8 for arm64.

Performance comparison tests are shown below.
- vsse_1_c: 141.5
- vsse_1_neon: 32.5

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-10-04 13:24:20 +03:00
Grzegorz Bernacki faea56c9c7 lavc/aarch64: Provide neon implementation of nsse8
Add vectorized implementation of nsse8 function.

Performance comparison tests are shown below.
- nsse_1_c: 256.0
- nsse_1_neon: 82.7

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-10-04 13:24:20 +03:00
Grzegorz Bernacki f401a2af21 lavc/aarch64: Add neon implementation for pix_abs8 functions.
Provide optimized implementation of pix_abs8 function for arm64.

Performance comparison tests are shown below:
pix_abs_1_1_c: 162.5
pix_abs_1_1_neon: 27.0
pix_abs_1_2_c: 174.0
pix_abs_1_2_neon: 23.5
pix_abs_1_3_c: 203.2
pix_abs_1_3_neon: 34.7

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Co-authored-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-10-04 13:24:04 +03:00
Martin Storsjö 8089fe072e aarch64: me_cmp: Avoid using the non-unrolled codepath for the minimum unroll size
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-29 10:29:11 +03:00
Martin Storsjö 6f2ad7f951 aarch64: me_cmp: Avoid redundant loads in ff_pix_abs16_y2_neon
This avoids one redundant load per row; pix3 from the previous
iteration can be used as pix2 in the next one.

Before:       Cortex A53    A72    A73
pix_abs_0_2_neon:  138.0   59.7   48.0
After:
pix_abs_0_2_neon:  109.7   50.2   39.5

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-29 10:29:10 +03:00
Andreas Rheinhardt 9beba05311 avcodec/fmtconvert: Remove unused AVCodecContext parameter
Unused since d74a8cb7e4.

Reviewed-by: Rémi Denis-Courmont <remi@remlab.net>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-09-21 20:26:40 +02:00
Hubert Mazur b2732115dd lavc/aarch64: Add neon implementation for pix_median_abs8
Provide optimized implementation for pix_median_abs8 function.

Performance comparison tests are shown below.
- median_sad_1_c: 277.0
- median_sad_1_neon: 82.0

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-21 12:57:56 +03:00
Hubert Mazur e9a6170213 lavc/aarch64: Add neon implementation for vsad8_intra
Provide optimized implementation for vsad8_intra function.

Performance comparison tests are shown below.
- vsad_5_c: 94.7
- vsad_5_neon: 20.7

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-21 12:57:56 +03:00
Hubert Mazur 0ee535b1db lavc/aarch64: Add neon implementation for pix_median_abs16
Provide optimized implementation for pix_median_abs16 function.

Performance comparison tests are shown below.
 - median_sad_0_c: 720.5
 - median_sad_0_neon: 127.2

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-21 12:57:56 +03:00
Rémi Denis-Courmont b52034270a lavc/vorbisdsp: use ptrdiff_t rather than intptr_t
... for a difference between pointers.
2022-09-19 13:51:00 -03:00
Andreas Rheinhardt a54e53a1c4 avcodec/vp8dsp: Constify src in vp8_mc_func
Reviewed-by: Peter Ross <pross@xvid.org>
Reviewed-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-09-11 20:57:51 +02:00
Hubert Mazur 06b98e396a lavc/aarch64: Provide neon implementation of nsse16
Add vectorized implementation of nsse16 function.

Performance comparison tests are shown below.
- nsse_0_c: 682.2
- nsse_0_neon: 116.5

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Co-authored-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-09 10:19:46 +03:00
Hubert Mazur 908abe8032 lavc/aarch64: Add neon implementation for vsse_intra16
Provide optimized implementation for vsse_intra16 for arm64.

Performance tests are shown below.
- vsse_4_c: 155.2
- vsse_4_neon: 36.2

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-09 10:19:46 +03:00
Hubert Mazur ce03ea3e79 lavc/aarch64: Add neon implementation for vsad_intra16
Provide optimized implementation for vsad_intra16 function for arm64.

Performance comparison tests are shown below.
- vsad_4_c: 177.5
- vsad_4_neon: 23.5

Benchmarks and tests are run with checkasm tool on AWS Gravtion 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-09 10:19:46 +03:00
Hubert Mazur c495a4b32d lavc/aarch64: Add neon implementation of vsse16
Provide optimized implementation of vsse16 for arm64.

Performance comparison tests are shown below.
- vsse_0_c: 257.7
- vsse_0_neon: 59.2

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-09 10:19:46 +03:00
Hubert Mazur 200f5e578f lavc/aarch64: Add neon implementation for vsad16
Provide optimized implementation of vsad16 function for arm64.

Performance comparison tests are shown below.
- vsad_0_c: 285.2
- vsad_0_neon: 39.5

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Co-authored-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-09-09 10:19:46 +03:00
Lynne f99d15cca0
arm/fft: disable NEON optimizations for 131072pt transforms
This has been broken since the start, and it was only discovered
when I started testing my replacement for the FFT.
Disable it, since there's no point in fixing slower code that's about
to be removed anyway.

The vfp version is not affected.
2022-08-29 07:13:43 +02:00
J. Dekker ce2f47318b lavc/aarch64: hevc_add_res add 12bit variants
hevc_add_res_4x4_12_c: 46.0
hevc_add_res_4x4_12_neon: 18.7
hevc_add_res_8x8_12_c: 194.7
hevc_add_res_8x8_12_neon: 25.2
hevc_add_res_16x16_12_c: 716.0
hevc_add_res_16x16_12_neon: 69.7
hevc_add_res_32x32_12_c: 3820.7
hevc_add_res_32x32_12_neon: 261.0

Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-08-18 15:04:43 +02:00
Martin Storsjö 48be6616d0 aarch64: me_cmp: Remove a leftover unnecessary instruction
This was missed in a2e45ad407.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:14:53 +03:00
Hubert Mazur 70efa4d011 lavc/aarch64: Add neon implementation for pix_abs8
Provide optimized implementation of pix_abs8 function for arm64.

Performance comparison tests are shown below.
- pix_abs_1_0_c: 101.2
- pix_abs_1_0_neon: 22.5
- sad_1_c: 101.2
- sad_1_neon: 22.5

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
Hubert Mazur 74312e80d7 lavc/aarch64: Add neon implementation for sse8
Provide optimized implementation of sse8 function for arm64.

Performance comparison tests are shown below.
- sse_1_c: 130.7
- sse_1_neon: 29.7

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
Hubert Mazur a2e45ad407 lavc/aarch64: Add neon implementation for pix_abs16_y2
Provide optimized implementation of pix_abs16_y2 function for arm64.

Performance comparison tests are shown below.
pix_abs_0_2_c: 317.2
pix_abs_0_2_neon: 37.5

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
Hubert Mazur d7abb7d143 lavc/aarch64: Add neon implementation for sse4
Provide neon implementation for sse4 function.

Performance comparison tests are shown below.
- sse_2_c: 80.7
- sse_2_neon: 31.0

Benchmarks and tests are run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
Hubert Mazur ad251fd262 lavc/aarch64: Add neon implementation for sse16
Provide neon implementation for sse16 function.

Performance comparison tests are shown below.
- sse_0_c: 268.2
- sse_0_neon: 43.5

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
Martin Storsjö 60109d5b3d aarch64: me_cmp: Fix the indentation of function declarations
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-18 12:07:26 +03:00
J. Dekker aa9eabb7a5 lavc/aarch64: reformat add_res funcs
Signed-off-by: J. Dekker <jdek@itanimul.li>
2022-08-16 14:00:34 +02:00
Andreas Rheinhardt 333b32af8e avcodec/h264chroma: Constify src in h264_chroma_mc_func
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-08-05 03:02:13 +02:00
Andreas Rheinhardt b3bbbb14d0 avcodec/hevcdsp: Constify src pointers
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-08-05 02:54:04 +02:00
Andreas Rheinhardt abb85429f3 avcodec/me_cmp: Constify me_cmp_func buffer parameters
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-07-31 03:31:53 +02:00
Andreas Rheinhardt af43da3e4d avcodec/videodsp: Constify buf in VideoDSPContext.prefetch
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-07-31 03:14:34 +02:00
Martin Storsjö 4136405c86 aarch64: me_cmp: Don't do uaddlv once per iteration
The max height is currently documented as 16; the max difference per
pixel is 255, and a .8h element can easily contain 16*255, thus keep
accumulating in two .8h vectors, and just do the final accumulationat the
end. This should work for heights up to 256.

This requires a minor register renumbering in ff_pix_abs16_xy2_neon.

Before:       Cortex A53    A72    A73   Graviton 3
pix_abs_0_0_neon:   97.7   47.0   37.5   22.7
pix_abs_0_1_neon:  154.0   59.0   52.0   25.0
pix_abs_0_3_neon:  179.7   96.7   87.5   41.2
After:
pix_abs_0_0_neon:   96.0   39.2   31.2   22.0
pix_abs_0_1_neon:  150.7   59.7   46.2   23.7
pix_abs_0_3_neon:  175.7   83.7   81.7   38.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-16 17:26:17 +03:00
Martin Storsjö 68a03f6424 aarch64: me_cmp: Switch from uabd to uabal in ff_pix_abs16_xy2_neon
Using absolute-difference-accumulate does use twice the amount of
absolute-difference instructions, but avoids the need for the
uaddl and add instructions, reducing the total number of instructions
by 3.

These can be interleaved in the rest of the calculation, to avoid
tight dependencies at the end. Unfortunately, this is marginally
slower on Cortex A53, but faster on A72 and A73.

Before:       Cortex A53    A72    A73   Graviton 3
pix_abs_0_3_neon:  175.7  109.2   92.0   41.2
After:
pix_abs_0_3_neon:  179.7   96.7   87.5   41.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-16 17:25:54 +03:00
Martin Storsjö b46de9aba4 aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neon
Before:       Cortex A53    A72    A73   Graviton 3
pix_abs_0_3_neon:  183.7  112.7   97.5   41.2
After:
pix_abs_0_3_neon:  175.7  109.2   92.0   41.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-16 17:25:44 +03:00
Martin Storsjö 02e7853fd9 libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon
Checkasm doesn't currently test this codepath.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-16 17:25:11 +03:00
Hubert Mazur 01e190dc99 lavc/aarch64: Add pix_abs16_x2 neon implementation
Provide neon implementation for pix_abs16_x2 function.

Performance tests of implementation are below.
 - pix_abs_0_1_c: 283.5
 - pix_abs_0_1_neon: 39.0

Benchmarks and tests run with checkasm tool on AWS Graviton 3.

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-13 23:25:22 +03:00
Hubert Mazur eb7ab3928f lavc/aarch64: Hook up the existing ff_pix_abs16_neon to the sad[0] function pointer
Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-07-11 23:58:28 +03:00
Swinney, Jonathan c471cc7474 lavc/aarch64: motion estimation functions in neon
- ff_pix_abs16_neon
 - ff_pix_abs16_xy2_neon

In direct micro benchmarks of these ff functions verses their C implementations,
these functions performed as follows on AWS Graviton 3.

ff_pix_abs16_neon:
pix_abs_0_0_c: 141.1
pix_abs_0_0_neon: 19.6

ff_pix_abs16_xy2_neon:
pix_abs_0_3_c: 269.1
pix_abs_0_3_neon: 39.3

Tested with:
./tests/checkasm/checkasm --test=motion --bench --disable-linux-perf

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-06-28 00:51:39 +03:00