Commit Graph

60 Commits

Martin Storsjö
73f4668ef8 swscale: aarch64: Simplify the assignment of lumToYV12
We normally don't need else statements here; the common pattern
is to assign lower level SIMD implementations first, then
conditionally reassign higher level ones afterwards, if supported.
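
A minimal C sketch of that pattern, with hypothetical function and flag names
(not the actual swscale init code):

#include <stdint.h>

/* Hypothetical names, for illustration only. */
typedef void (*lum_to_yv12_fn)(uint8_t *dst, const uint8_t *src, int width);

void lum_to_yv12_neon(uint8_t *dst, const uint8_t *src, int width);
void lum_to_yv12_dotprod(uint8_t *dst, const uint8_t *src, int width);

static lum_to_yv12_fn select_lum_to_yv12(int have_neon, int have_dotprod)
{
    lum_to_yv12_fn fn = NULL;
    if (have_neon)
        fn = lum_to_yv12_neon;       /* assign the baseline SIMD version first      */
    if (have_dotprod)
        fn = lum_to_yv12_dotprod;    /* then reassign the higher level one, no else */
    return fn;
}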

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-10 14:03:58 +02:00
Krzysztof Pyrkosz
d765e5f043 swscale/aarch64: dotprod implementation of rgba32_to_Y
The idea is to split the 16-bit coefficients into lower and upper halves,
invoke udot for the lower half, shift by 8, and follow up with udot for the
upper half.
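
In scalar C, the decomposition being exploited is roughly the following (a
sketch of the arithmetic only, not of the assembly's exact ordering):

#include <stdint.h>

/* Each 16-bit coefficient c can be written as c = (hi << 8) + lo, so
 * sum(px[i] * c[i]) == (sum(px[i] * hi[i]) << 8) + sum(px[i] * lo[i]),
 * and each of the two byte-wise sums is what udot computes. */
static uint32_t dot_rgba(const uint8_t px[4], const uint16_t coeff[4])
{
    uint32_t dot_lo = 0, dot_hi = 0;
    for (int i = 0; i < 4; i++) {
        dot_lo += px[i] * (coeff[i] & 0xff);  /* low coefficient bytes  */
        dot_hi += px[i] * (coeff[i] >> 8);    /* high coefficient bytes */
    }
    return (dot_hi << 8) + dot_lo;
}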

Benchmark on A78:
bgra_to_y_128_c:                                       682.0 ( 1.00x)
bgra_to_y_128_neon:                                    181.2 ( 3.76x)
bgra_to_y_128_dotprod:                                 117.8 ( 5.79x)
bgra_to_y_1080_c:                                     5742.5 ( 1.00x)
bgra_to_y_1080_neon:                                  1472.5 ( 3.90x)
bgra_to_y_1080_dotprod:                                906.5 ( 6.33x)
bgra_to_y_1920_c:                                    10194.0 ( 1.00x)
bgra_to_y_1920_neon:                                  2589.8 ( 3.94x)
bgra_to_y_1920_dotprod:                               1573.8 ( 6.48x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-04 10:16:44 +02:00
Krzysztof Pyrkosz
38929b824b swscale/aarch64: Refactor hscale_16_to_15__fs_4
This patch removes the use of the stack for temporary state and replaces
the interleaved ld4 loads with ld1.

Before/after:

A78
hscale_16_to_15__fs_4_dstW_8_neon:                      86.8 ( 1.72x)
hscale_16_to_15__fs_4_dstW_24_neon:                    147.5 ( 2.73x)
hscale_16_to_15__fs_4_dstW_128_neon:                   614.0 ( 3.14x)
hscale_16_to_15__fs_4_dstW_144_neon:                   680.5 ( 3.18x)
hscale_16_to_15__fs_4_dstW_256_neon:                  1193.2 ( 3.19x)
hscale_16_to_15__fs_4_dstW_512_neon:                  2305.0 ( 3.27x)

hscale_16_to_15__fs_4_dstW_8_neon:                      86.0 ( 1.74x)
hscale_16_to_15__fs_4_dstW_24_neon:                    106.8 ( 3.78x)
hscale_16_to_15__fs_4_dstW_128_neon:                   404.0 ( 4.81x)
hscale_16_to_15__fs_4_dstW_144_neon:                   451.8 ( 4.80x)
hscale_16_to_15__fs_4_dstW_256_neon:                   760.5 ( 5.06x)
hscale_16_to_15__fs_4_dstW_512_neon:                  1520.0 ( 5.01x)

A72
hscale_16_to_15__fs_4_dstW_8_neon:                     156.8 ( 1.52x)
hscale_16_to_15__fs_4_dstW_24_neon:                    217.8 ( 2.52x)
hscale_16_to_15__fs_4_dstW_128_neon:                   906.8 ( 2.90x)
hscale_16_to_15__fs_4_dstW_144_neon:                  1014.5 ( 2.91x)
hscale_16_to_15__fs_4_dstW_256_neon:                  1751.5 ( 2.96x)
hscale_16_to_15__fs_4_dstW_512_neon:                  3469.3 ( 2.97x)

hscale_16_to_15__fs_4_dstW_8_neon:                     151.2 ( 1.54x)
hscale_16_to_15__fs_4_dstW_24_neon:                    173.4 ( 3.15x)
hscale_16_to_15__fs_4_dstW_128_neon:                   660.0 ( 3.98x)
hscale_16_to_15__fs_4_dstW_144_neon:                   735.7 ( 4.00x)
hscale_16_to_15__fs_4_dstW_256_neon:                  1273.5 ( 4.09x)
hscale_16_to_15__fs_4_dstW_512_neon:                  2488.2 ( 4.16x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-03-02 01:17:29 +02:00
Martin Storsjö
b137347278 aarch64: Fix a few misindented lines
Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-28 23:23:09 +02:00
Krzysztof Pyrkosz
b92577405b swscale/aarch64/rgb2rgb_neon: Implemented {yuyv, uyvy}toyuv{420, 422}
A78:
uyvytoyuv420_neon:                                    6112.5 ( 6.96x)
uyvytoyuv422_neon:                                    6696.0 ( 6.32x)
yuyvtoyuv420_neon:                                    6113.0 ( 6.95x)
yuyvtoyuv422_neon:                                    6695.2 ( 6.31x)

A72:
uyvytoyuv420_neon:                                    9512.1 ( 6.09x)
uyvytoyuv422_neon:                                    9766.8 ( 6.32x)
yuyvtoyuv420_neon:                                    9639.1 ( 6.00x)
yuyvtoyuv422_neon:                                    9779.0 ( 6.03x)

A53:
uyvytoyuv420_neon:                                   12720.1 ( 9.10x)
uyvytoyuv422_neon:                                   14282.9 ( 6.71x)
yuyvtoyuv420_neon:                                   12637.4 ( 9.15x)
yuyvtoyuv422_neon:                                   14127.6 ( 6.77x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-17 11:39:42 +02:00
Krzysztof Pyrkosz
64107e22f5 swscale/aarch64/rgb24toyv12: skip early right shift by 2
It's a minor improvement that shaves 5-8% off the execution time.
Instead of shifting by 2 right away and by 7 soon after, shift by 9
once.
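
In C terms the identity being used is simply this (a sketch; any rounding
constant has to be adapted to the new shift amount):

#include <stdint.h>

/* For an unsigned accumulator, (acc >> 2) >> 7 == acc >> 9, so the early
 * shift by 2 can be dropped and a single shift by 9 done at the end. */
static uint8_t scale_down(uint32_t acc)
{
    return (uint8_t)(acc >> 9);   /* instead of (uint8_t)((acc >> 2) >> 7) */
}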

Times before and after:

A78:
rgb24toyv12_16_200_neon:                              5366.8 ( 3.62x)
rgb24toyv12_128_60_neon:                             13574.0 ( 3.34x)
rgb24toyv12_512_16_neon:                             14463.8 ( 3.33x)
rgb24toyv12_1920_4_neon:                             13508.2 ( 3.34x)
rgb24toyv12_1920_4_negstride_neon:                   13525.0 ( 3.34x)

rgb24toyv12_16_200_neon:                              5293.8 ( 3.66x)
rgb24toyv12_128_60_neon:                             12955.0 ( 3.50x)
rgb24toyv12_512_16_neon:                             13784.0 ( 3.50x)
rgb24toyv12_1920_4_neon:                             12900.8 ( 3.49x)
rgb24toyv12_1920_4_negstride_neon:                   12902.8 ( 3.49x)

A72:
rgb24toyv12_16_200_neon:                              9695.8 ( 2.50x)
rgb24toyv12_128_60_neon:                             20286.6 ( 2.70x)
rgb24toyv12_512_16_neon:                             22276.6 ( 2.57x)
rgb24toyv12_1920_4_neon:                             19154.1 ( 2.77x)
rgb24toyv12_1920_4_negstride_neon:                   19055.1 ( 2.78x)

rgb24toyv12_16_200_neon:                              9214.8 ( 2.65x)
rgb24toyv12_128_60_neon:                             20731.5 ( 2.65x)
rgb24toyv12_512_16_neon:                             21145.0 ( 2.70x)
rgb24toyv12_1920_4_neon:                             17586.5 ( 2.99x)
rgb24toyv12_1920_4_negstride_neon:                   17571.0 ( 2.98x)

A53:
rgb24toyv12_16_200_neon:                             12880.4 ( 3.76x)
rgb24toyv12_128_60_neon:                             27776.3 ( 3.94x)
rgb24toyv12_512_16_neon:                             29411.3 ( 3.94x)
rgb24toyv12_1920_4_neon:                             27253.1 ( 3.98x)
rgb24toyv12_1920_4_negstride_neon:                   27474.3 ( 3.95x)

rgb24toyv12_16_200_neon:                             12196.3 ( 3.95x)
rgb24toyv12_128_60_neon:                             26943.1 ( 4.07x)
rgb24toyv12_512_16_neon:                             28642.3 ( 4.07x)
rgb24toyv12_1920_4_neon:                             26676.6 ( 4.08x)
rgb24toyv12_1920_4_negstride_neon:                   26713.8 ( 4.07x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-17 10:49:41 +02:00
Krzysztof Pyrkosz
c85a748979 swscale/aarch64/rgb2rgb: Implemented NEON shuf routines
The key idea is to pass the pre-generated tables to the TBL instruction
and churn through the data 16 bytes at a time. The remaining 4 elements
are handled with a specialized block located at the end of the routine.

The 3210 variant can be implemented using rev32; surprisingly, it is
slower than the generic TBL on A78 but much faster on A72.

There may be some room for improvement. Possibly, instead of handling the
last 8 and then 4 bytes separately, we could load those 4 bytes into
{v0.s}[2] and process them along with the last 8 bytes.
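
An intrinsics-level sketch of the main loop (the actual routine is
hand-written assembly; names here are illustrative):

#include <arm_neon.h>
#include <stdint.h>

/* Permute 16 bytes (four 4-byte pixels) at a time with a pre-generated
 * byte-index table, which is what TBL does in the assembly. */
static void shuffle_bytes_16(const uint8_t *src, uint8_t *dst, int len,
                             uint8x16_t table)
{
    int i;
    for (i = 0; i + 16 <= len; i += 16)
        vst1q_u8(dst + i, vqtbl1q_u8(vld1q_u8(src + i), table));
    /* the remaining (len % 16) bytes are handled by a smaller tail block */
}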

Speeds measured with checkasm --test=sw_rgb --bench --runs=10 | grep shuf

- A78
shuffle_bytes_0321_c:                                   75.5 ( 1.00x)
shuffle_bytes_0321_neon:                                26.5 ( 2.85x)
shuffle_bytes_1203_c:                                  136.2 ( 1.00x)
shuffle_bytes_1203_neon:                                27.2 ( 5.00x)
shuffle_bytes_1230_c:                                  135.5 ( 1.00x)
shuffle_bytes_1230_neon:                                28.0 ( 4.84x)
shuffle_bytes_2013_c:                                  138.8 ( 1.00x)
shuffle_bytes_2013_neon:                                22.0 ( 6.31x)
shuffle_bytes_2103_c:                                   76.5 ( 1.00x)
shuffle_bytes_2103_neon:                                20.5 ( 3.73x)
shuffle_bytes_2130_c:                                  137.5 ( 1.00x)
shuffle_bytes_2130_neon:                                28.0 ( 4.91x)
shuffle_bytes_3012_c:                                  138.2 ( 1.00x)
shuffle_bytes_3012_neon:                                21.5 ( 6.43x)
shuffle_bytes_3102_c:                                  138.2 ( 1.00x)
shuffle_bytes_3102_neon:                                27.2 ( 5.07x)
shuffle_bytes_3210_c:                                  138.0 ( 1.00x)
shuffle_bytes_3210_neon:                                22.0 ( 6.27x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  139.0 ( 1.00x)
shuffle_bytes_3210_neon:                                28.5 ( 4.88x)

- A72
shuffle_bytes_0321_c:                                  120.0 ( 1.00x)
shuffle_bytes_0321_neon:                                36.0 ( 3.33x)
shuffle_bytes_1203_c:                                  188.2 ( 1.00x)
shuffle_bytes_1203_neon:                                37.8 ( 4.99x)
shuffle_bytes_1230_c:                                  195.0 ( 1.00x)
shuffle_bytes_1230_neon:                                36.0 ( 5.42x)
shuffle_bytes_2013_c:                                  195.8 ( 1.00x)
shuffle_bytes_2013_neon:                                43.5 ( 4.50x)
shuffle_bytes_2103_c:                                  117.2 ( 1.00x)
shuffle_bytes_2103_neon:                                53.5 ( 2.19x)
shuffle_bytes_2130_c:                                  203.2 ( 1.00x)
shuffle_bytes_2130_neon:                                37.8 ( 5.38x)
shuffle_bytes_3012_c:                                  183.8 ( 1.00x)
shuffle_bytes_3012_neon:                                46.8 ( 3.93x)
shuffle_bytes_3102_c:                                  180.8 ( 1.00x)
shuffle_bytes_3102_neon:                                37.8 ( 4.79x)
shuffle_bytes_3210_c:                                  195.8 ( 1.00x)
shuffle_bytes_3210_neon:                                37.8 ( 5.19x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  194.8 ( 1.00x)
shuffle_bytes_3210_neon:                                30.8 ( 6.33x)

- x13s:
shuffle_bytes_0321_c:                                   49.4 ( 1.00x)
shuffle_bytes_0321_neon:                                18.1 ( 2.72x)
shuffle_bytes_1203_c:                                   98.4 ( 1.00x)
shuffle_bytes_1203_neon:                                18.4 ( 5.35x)
shuffle_bytes_1230_c:                                   97.4 ( 1.00x)
shuffle_bytes_1230_neon:                                19.1 ( 5.09x)
shuffle_bytes_2013_c:                                  101.4 ( 1.00x)
shuffle_bytes_2013_neon:                                16.9 ( 6.01x)
shuffle_bytes_2103_c:                                   53.9 ( 1.00x)
shuffle_bytes_2103_neon:                                13.9 ( 3.88x)
shuffle_bytes_2130_c:                                  100.9 ( 1.00x)
shuffle_bytes_2130_neon:                                19.1 ( 5.27x)
shuffle_bytes_3012_c:                                   97.4 ( 1.00x)
shuffle_bytes_3012_neon:                                17.1 ( 5.69x)
shuffle_bytes_3102_c:                                  100.9 ( 1.00x)
shuffle_bytes_3102_neon:                                19.1 ( 5.27x)
shuffle_bytes_3210_c:                                  100.6 ( 1.00x)
shuffle_bytes_3210_neon:                                16.9 ( 5.96x)

shuf3210 using rev32
shuffle_bytes_3210_c:                                  100.6 ( 1.00x)
shuffle_bytes_3210_neon:                                18.6 ( 5.40x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-07 12:54:55 +02:00
Krzysztof Pyrkosz
e25a19fc7c swscale/aarch64/output.S: refactor ff_yuv2plane1_8_neon
The benchmarks (before vs after) were gathered using
./tests/checkasm/checkasm --test=sw_scale --bench --runs=6 | grep yuv2yuv1

A78 before:
yuv2yuv1_0_512_accurate_c:                            2039.5 ( 1.00x)
yuv2yuv1_0_512_accurate_neon:                          385.5 ( 5.29x)
yuv2yuv1_0_512_approximate_c:                         2110.5 ( 1.00x)
yuv2yuv1_0_512_approximate_neon:                       385.5 ( 5.47x)
yuv2yuv1_3_512_accurate_c:                            2061.2 ( 1.00x)
yuv2yuv1_3_512_accurate_neon:                          381.2 ( 5.41x)
yuv2yuv1_3_512_approximate_c:                         2099.2 ( 1.00x)
yuv2yuv1_3_512_approximate_neon:                       381.2 ( 5.51x)
yuv2yuv1_8_512_accurate_c:                            2054.2 ( 1.00x)
yuv2yuv1_8_512_accurate_neon:                          385.5 ( 5.33x)
yuv2yuv1_8_512_approximate_c:                         2112.2 ( 1.00x)
yuv2yuv1_8_512_approximate_neon:                       385.5 ( 5.48x)
yuv2yuv1_11_512_accurate_c:                           2036.0 ( 1.00x)
yuv2yuv1_11_512_accurate_neon:                         381.2 ( 5.34x)
yuv2yuv1_11_512_approximate_c:                        2115.0 ( 1.00x)
yuv2yuv1_11_512_approximate_neon:                      381.2 ( 5.55x)
yuv2yuv1_16_512_accurate_c:                           2066.5 ( 1.00x)
yuv2yuv1_16_512_accurate_neon:                         385.5 ( 5.36x)
yuv2yuv1_16_512_approximate_c:                        2100.8 ( 1.00x)
yuv2yuv1_16_512_approximate_neon:                      385.5 ( 5.45x)
yuv2yuv1_19_512_accurate_c:                           2059.8 ( 1.00x)
yuv2yuv1_19_512_accurate_neon:                         381.2 ( 5.40x)
yuv2yuv1_19_512_approximate_c:                        2102.8 ( 1.00x)
yuv2yuv1_19_512_approximate_neon:                      381.2 ( 5.52x)

After:
yuv2yuv1_0_512_accurate_c:                            2206.0 ( 1.00x)
yuv2yuv1_0_512_accurate_neon:                          139.2 (15.84x)
yuv2yuv1_0_512_approximate_c:                         2050.0 ( 1.00x)
yuv2yuv1_0_512_approximate_neon:                       139.2 (14.72x)
yuv2yuv1_3_512_accurate_c:                            2205.2 ( 1.00x)
yuv2yuv1_3_512_accurate_neon:                          138.0 (15.98x)
yuv2yuv1_3_512_approximate_c:                         2052.5 ( 1.00x)
yuv2yuv1_3_512_approximate_neon:                       138.0 (14.87x)
yuv2yuv1_8_512_accurate_c:                            2171.0 ( 1.00x)
yuv2yuv1_8_512_accurate_neon:                          139.2 (15.59x)
yuv2yuv1_8_512_approximate_c:                         2064.2 ( 1.00x)
yuv2yuv1_8_512_approximate_neon:                       139.2 (14.82x)
yuv2yuv1_11_512_accurate_c:                           2164.8 ( 1.00x)
yuv2yuv1_11_512_accurate_neon:                         138.0 (15.69x)
yuv2yuv1_11_512_approximate_c:                        2048.8 ( 1.00x)
yuv2yuv1_11_512_approximate_neon:                      138.0 (14.85x)
yuv2yuv1_16_512_accurate_c:                           2154.5 ( 1.00x)
yuv2yuv1_16_512_accurate_neon:                         139.2 (15.47x)
yuv2yuv1_16_512_approximate_c:                        2047.2 ( 1.00x)
yuv2yuv1_16_512_approximate_neon:                      139.2 (14.70x)
yuv2yuv1_19_512_accurate_c:                           2144.5 ( 1.00x)
yuv2yuv1_19_512_accurate_neon:                         138.0 (15.54x)
yuv2yuv1_19_512_approximate_c:                        2046.0 ( 1.00x)
yuv2yuv1_19_512_approximate_neon:                      138.0 (14.83x)

A72 before:
yuv2yuv1_0_512_accurate_c:                            3779.8 ( 1.00x)
yuv2yuv1_0_512_accurate_neon:                          527.8 ( 7.16x)
yuv2yuv1_0_512_approximate_c:                         4128.2 ( 1.00x)
yuv2yuv1_0_512_approximate_neon:                       528.2 ( 7.81x)
yuv2yuv1_3_512_accurate_c:                            3836.2 ( 1.00x)
yuv2yuv1_3_512_accurate_neon:                          527.0 ( 7.28x)
yuv2yuv1_3_512_approximate_c:                         3991.0 ( 1.00x)
yuv2yuv1_3_512_approximate_neon:                       526.8 ( 7.58x)
yuv2yuv1_8_512_accurate_c:                            3732.8 ( 1.00x)
yuv2yuv1_8_512_accurate_neon:                          525.5 ( 7.10x)
yuv2yuv1_8_512_approximate_c:                         4060.0 ( 1.00x)
yuv2yuv1_8_512_approximate_neon:                       527.0 ( 7.70x)
yuv2yuv1_11_512_accurate_c:                           3836.2 ( 1.00x)
yuv2yuv1_11_512_accurate_neon:                         530.0 ( 7.24x)
yuv2yuv1_11_512_approximate_c:                        4014.0 ( 1.00x)
yuv2yuv1_11_512_approximate_neon:                      530.0 ( 7.57x)
yuv2yuv1_16_512_accurate_c:                           3726.2 ( 1.00x)
yuv2yuv1_16_512_accurate_neon:                         525.5 ( 7.09x)
yuv2yuv1_16_512_approximate_c:                        4114.2 ( 1.00x)
yuv2yuv1_16_512_approximate_neon:                      526.2 ( 7.82x)
yuv2yuv1_19_512_accurate_c:                           3812.2 ( 1.00x)
yuv2yuv1_19_512_accurate_neon:                         530.0 ( 7.19x)
yuv2yuv1_19_512_approximate_c:                        4012.2 ( 1.00x)
yuv2yuv1_19_512_approximate_neon:                      530.0 ( 7.57x)

After:
yuv2yuv1_0_512_accurate_c:                            3716.8 ( 1.00x)
yuv2yuv1_0_512_accurate_neon:                          215.1 (17.28x)
yuv2yuv1_0_512_approximate_c:                         3877.8 ( 1.00x)
yuv2yuv1_0_512_approximate_neon:                       222.8 (17.40x)
yuv2yuv1_3_512_accurate_c:                            3717.1 ( 1.00x)
yuv2yuv1_3_512_accurate_neon:                          217.8 (17.06x)
yuv2yuv1_3_512_approximate_c:                         3801.6 ( 1.00x)
yuv2yuv1_3_512_approximate_neon:                       220.3 (17.25x)
yuv2yuv1_8_512_accurate_c:                            3716.6 ( 1.00x)
yuv2yuv1_8_512_accurate_neon:                          213.8 (17.38x)
yuv2yuv1_8_512_approximate_c:                         3831.8 ( 1.00x)
yuv2yuv1_8_512_approximate_neon:                       218.1 (17.57x)
yuv2yuv1_11_512_accurate_c:                           3717.1 ( 1.00x)
yuv2yuv1_11_512_accurate_neon:                         219.1 (16.97x)
yuv2yuv1_11_512_approximate_c:                        3801.6 ( 1.00x)
yuv2yuv1_11_512_approximate_neon:                      216.1 (17.59x)
yuv2yuv1_16_512_accurate_c:                           3716.6 ( 1.00x)
yuv2yuv1_16_512_accurate_neon:                         213.6 (17.40x)
yuv2yuv1_16_512_approximate_c:                        3831.6 ( 1.00x)
yuv2yuv1_16_512_approximate_neon:                      215.1 (17.82x)
yuv2yuv1_19_512_accurate_c:                           3717.1 ( 1.00x)
yuv2yuv1_19_512_accurate_neon:                         223.8 (16.61x)
yuv2yuv1_19_512_approximate_c:                        3801.6 ( 1.00x)
yuv2yuv1_19_512_approximate_neon:                      219.1 (17.35x)

x13s before:
yuv2yuv1_0_512_accurate_c:                            1435.1 ( 1.00x)
yuv2yuv1_0_512_accurate_neon:                          221.1 ( 6.49x)
yuv2yuv1_0_512_approximate_c:                         1405.4 ( 1.00x)
yuv2yuv1_0_512_approximate_neon:                       219.1 ( 6.41x)
yuv2yuv1_3_512_accurate_c:                            1418.6 ( 1.00x)
yuv2yuv1_3_512_accurate_neon:                          215.9 ( 6.57x)
yuv2yuv1_3_512_approximate_c:                         1405.9 ( 1.00x)
yuv2yuv1_3_512_approximate_neon:                       224.1 ( 6.27x)
yuv2yuv1_8_512_accurate_c:                            1433.9 ( 1.00x)
yuv2yuv1_8_512_accurate_neon:                          218.6 ( 6.56x)
yuv2yuv1_8_512_approximate_c:                         1412.9 ( 1.00x)
yuv2yuv1_8_512_approximate_neon:                       218.9 ( 6.46x)
yuv2yuv1_11_512_accurate_c:                           1449.1 ( 1.00x)
yuv2yuv1_11_512_accurate_neon:                         217.6 ( 6.66x)
yuv2yuv1_11_512_approximate_c:                        1410.9 ( 1.00x)
yuv2yuv1_11_512_approximate_neon:                      221.1 ( 6.38x)
yuv2yuv1_16_512_accurate_c:                           1402.1 ( 1.00x)
yuv2yuv1_16_512_accurate_neon:                         214.6 ( 6.53x)
yuv2yuv1_16_512_approximate_c:                        1422.4 ( 1.00x)
yuv2yuv1_16_512_approximate_neon:                      222.9 ( 6.38x)
yuv2yuv1_19_512_accurate_c:                           1421.6 ( 1.00x)
yuv2yuv1_19_512_accurate_neon:                         217.4 ( 6.54x)
yuv2yuv1_19_512_approximate_c:                        1421.6 ( 1.00x)
yuv2yuv1_19_512_approximate_neon:                      221.4 ( 6.42x)

After:
yuv2yuv1_0_512_accurate_c:                            1413.6 ( 1.00x)
yuv2yuv1_0_512_accurate_neon:                           80.6 (17.53x)
yuv2yuv1_0_512_approximate_c:                         1455.6 ( 1.00x)
yuv2yuv1_0_512_approximate_neon:                        80.6 (18.05x)
yuv2yuv1_3_512_accurate_c:                            1429.1 ( 1.00x)
yuv2yuv1_3_512_accurate_neon:                           77.4 (18.47x)
yuv2yuv1_3_512_approximate_c:                         1462.6 ( 1.00x)
yuv2yuv1_3_512_approximate_neon:                        80.6 (18.14x)
yuv2yuv1_8_512_accurate_c:                            1425.4 ( 1.00x)
yuv2yuv1_8_512_accurate_neon:                           77.9 (18.30x)
yuv2yuv1_8_512_approximate_c:                         1436.6 ( 1.00x)
yuv2yuv1_8_512_approximate_neon:                        80.9 (17.76x)
yuv2yuv1_11_512_accurate_c:                           1429.4 ( 1.00x)
yuv2yuv1_11_512_accurate_neon:                          76.1 (18.78x)
yuv2yuv1_11_512_approximate_c:                        1447.1 ( 1.00x)
yuv2yuv1_11_512_approximate_neon:                       78.4 (18.46x)
yuv2yuv1_16_512_accurate_c:                           1439.9 ( 1.00x)
yuv2yuv1_16_512_accurate_neon:                          77.6 (18.55x)
yuv2yuv1_16_512_approximate_c:                        1422.1 ( 1.00x)
yuv2yuv1_16_512_approximate_neon:                       78.1 (18.20x)
yuv2yuv1_19_512_accurate_c:                           1447.1 ( 1.00x)
yuv2yuv1_19_512_accurate_neon:                          78.1 (18.52x)
yuv2yuv1_19_512_approximate_c:                        1474.4 ( 1.00x)
yuv2yuv1_19_512_approximate_neon:                       78.1 (18.87x)

Signed-off-by: Martin Storsjö <martin@martin.st>
2025-02-07 12:05:06 +02:00
Ramiro Polla
ca889b1328 swscale/aarch64: add neon {lum,chr}ConvertRange16
aarch64 A55:
chrRangeFromJpeg16_1920_c:    32684.2
chrRangeFromJpeg16_1920_neon:  8431.2 (3.88x)
chrRangeToJpeg16_1920_c:      24996.8
chrRangeToJpeg16_1920_neon:    9395.0 (2.66x)
lumRangeFromJpeg16_1920_c:    17305.2
lumRangeFromJpeg16_1920_neon:  4586.5 (3.77x)
lumRangeToJpeg16_1920_c:      21144.8
lumRangeToJpeg16_1920_neon:    5069.8 (4.17x)

aarch64 A76:
chrRangeFromJpeg16_1920_c:    11523.8
chrRangeFromJpeg16_1920_neon:  3367.5 (3.42x)
chrRangeToJpeg16_1920_c:      11655.2
chrRangeToJpeg16_1920_neon:    4087.2 (2.85x)
lumRangeFromJpeg16_1920_c:     5762.0
lumRangeFromJpeg16_1920_neon:  1815.8 (3.17x)
lumRangeToJpeg16_1920_c:       5946.2
lumRangeToJpeg16_1920_neon:    2148.2 (2.77x)
2024-12-05 21:10:29 +01:00
Ramiro Polla
6fe4a4ffb6 swscale/aarch64/range_convert: update neon range_convert functions to new API
aarch64 A55:
chrRangeFromJpeg8_1920_c:    28835.2 (1.00x)
chrRangeFromJpeg8_1920_neon:  5313.9 (5.43x)  5308.4 (5.43x)
chrRangeToJpeg8_1920_c:      23074.7 (1.00x)
chrRangeToJpeg8_1920_neon:    5551.3 (4.16x)  5549.2 (4.16x)
lumRangeFromJpeg8_1920_c:    15389.7 (1.00x)
lumRangeFromJpeg8_1920_neon:  3152.3 (4.88x)  3147.7 (4.89x)
lumRangeToJpeg8_1920_c:      19227.8 (1.00x)
lumRangeToJpeg8_1920_neon:    3628.7 (5.30x)  3630.2 (5.30x)

aarch64 A76:
chrRangeFromJpeg8_1920_c:    6324.4 (1.00x)
chrRangeFromJpeg8_1920_neon: 2344.5 (2.70x) 2304.2 (2.74x)
chrRangeToJpeg8_1920_c:      9656.0 (1.00x)
chrRangeToJpeg8_1920_neon:   2824.2 (3.42x) 2794.2 (3.46x)
lumRangeFromJpeg8_1920_c:    4422.0 (1.00x)
lumRangeFromJpeg8_1920_neon: 1104.5 (4.00x) 1106.2 (4.00x)
lumRangeToJpeg8_1920_c:      5949.1 (1.00x)
lumRangeToJpeg8_1920_neon:   1329.8 (4.47x) 1328.2 (4.48x)
2024-12-05 21:10:29 +01:00
Ramiro Polla
384fe39623 swscale/range_convert: fix mpeg ranges in yuv range conversion for non-8-bit pixel formats
There is an issue with the constants used in YUV to YUV range conversion,
where the upper bound is not respected when converting to mpeg range.

With this commit, the constants are calculated at runtime, depending on
the bit depth. This approach also allows us to more easily understand how
the constants are derived.

For bit depths <= 14, the number of fixed-point bits has been set to 14
for all conversions, to simplify the code.
For bit depths > 14, the number of fixed-point bits has been raised to 18,
to allow the conversion to be accurate enough for the mpeg range to be
respected.

The convert functions now take the conversion constants (coeff and offset)
as function arguments.
For bit depths <= 14, coeff is unsigned 16-bit and offset is 32-bit.
For bit depths > 14, coeff is unsigned 32-bit and offset is 64-bit.
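
A heavily simplified sketch of the kind of derivation meant here, for the
limited-to-full luma case; it operates directly on depth-bit samples, widens
the output types for simplicity, and is not the exact swscale code (which
works on its internal intermediate precision):

#include <stdint.h>

static void lum_to_jpeg_consts(int depth, int64_t *coeff, int64_t *offset)
{
    const int     shift = depth <= 14 ? 14 : 18;   /* fixed-point bits    */
    const int64_t ymin  = 16  << (depth - 8);      /* limited-range black */
    const int64_t ymax  = 235 << (depth - 8);      /* limited-range white */
    const int64_t fmax  = (1  << depth) - 1;       /* full-range white    */
    const int64_t range = ymax - ymin;

    /* out = (in - ymin) * fmax / range  ==>  out = (in * coeff + offset) >> shift */
    *coeff  = ((fmax << shift) + range / 2) / range;
    *offset = (1 << (shift - 1)) - ymin * *coeff;  /* rounding + black level */
}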

x86_64:
chrRangeFromJpeg8_1920_c:    2127.4   2125.0  (1.00x)
chrRangeFromJpeg16_1920_c:   2325.2   2127.2  (1.09x)
chrRangeToJpeg8_1920_c:      3166.9   3168.7  (1.00x)
chrRangeToJpeg16_1920_c:     2152.4   3164.8  (0.68x)
lumRangeFromJpeg8_1920_c:    1263.0   1302.5  (0.97x)
lumRangeFromJpeg16_1920_c:   1080.5   1299.2  (0.83x)
lumRangeToJpeg8_1920_c:      1886.8   2112.2  (0.89x)
lumRangeToJpeg16_1920_c:     1077.0   1906.5  (0.56x)

aarch64 A55:
chrRangeFromJpeg8_1920_c:   28835.2  28835.6  (1.00x)
chrRangeFromJpeg16_1920_c:  28839.8  32680.8  (0.88x)
chrRangeToJpeg8_1920_c:     23074.7  23075.4  (1.00x)
chrRangeToJpeg16_1920_c:    17318.9  24996.0  (0.69x)
lumRangeFromJpeg8_1920_c:   15389.7  15384.5  (1.00x)
lumRangeFromJpeg16_1920_c:  15388.2  17306.7  (0.89x)
lumRangeToJpeg8_1920_c:     19227.8  19226.6  (1.00x)
lumRangeToJpeg16_1920_c:    15387.0  21146.3  (0.73x)

aarch64 A76:
chrRangeFromJpeg8_1920_c:    6324.4   6268.1  (1.01x)
chrRangeFromJpeg16_1920_c:   6339.9  11521.5  (0.55x)
chrRangeToJpeg8_1920_c:      9656.0   9612.8  (1.00x)
chrRangeToJpeg16_1920_c:     6340.4  11651.8  (0.54x)
lumRangeFromJpeg8_1920_c:    4422.0   4420.8  (1.00x)
lumRangeFromJpeg16_1920_c:   4420.9   5762.0  (0.77x)
lumRangeToJpeg8_1920_c:      5949.1   5977.5  (1.00x)
lumRangeToJpeg16_1920_c:     4446.8   5946.2  (0.75x)

NOTE: all simd optimizations for range_convert have been disabled.
      they will be re-enabled when they are fixed for each architecture.

NOTE2: the same issue still exists in rgb2yuv conversions, which is not
       addressed in this commit.
2024-12-05 21:10:29 +01:00
Ramiro Polla
58bcdeb742 swscale/aarch64/range_convert: saturate output instead of limiting input
aarch64 A55:
chrRangeFromJpeg8_1920_c:    28836.2 (1.00x)
chrRangeFromJpeg8_1920_neon:  5312.6 (5.43x)  5313.9 (5.43x)
chrRangeToJpeg8_1920_c:      44196.2 (1.00x)
chrRangeToJpeg8_1920_neon:    6034.6 (7.32x)  5551.3 (7.96x)
lumRangeFromJpeg8_1920_c:    15388.5 (1.00x)
lumRangeFromJpeg8_1920_neon:  3150.7 (4.88x)  3152.3 (4.88x)
lumRangeToJpeg8_1920_c:      23069.7 (1.00x)
lumRangeToJpeg8_1920_neon:    3873.2 (5.96x)  3628.7 (6.36x)

aarch64 A76:
chrRangeFromJpeg8_1920_c:     6334.7 (1.00x)
chrRangeFromJpeg8_1920_neon:  2264.5 (2.80x)  2344.5 (2.70x)
chrRangeToJpeg8_1920_c:      11474.5 (1.00x)
chrRangeToJpeg8_1920_neon:    2646.5 (4.34x)  2824.2 (4.06x)
lumRangeFromJpeg8_1920_c:     4453.2 (1.00x)
lumRangeFromJpeg8_1920_neon:  1104.8 (4.03x)  1104.5 (4.03x)
lumRangeToJpeg8_1920_c:       6645.0 (1.00x)
lumRangeToJpeg8_1920_neon:    1310.5 (5.07x)  1329.8 (5.00x)
2024-12-05 21:10:29 +01:00
Ramiro Polla
2d1358a84d swscale/range_convert: saturate output instead of limiting input
For bit depths <= 14, the result is saturated to 15 bits.
For bit depths > 14, the result is saturated to 19 bits.
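
A minimal sketch of the saturation, assuming it means clamping to the
unsigned 15-bit (or 19-bit) intermediate range:

#include <stdint.h>

static int32_t saturate(int64_t v, int depth)
{
    const int32_t max = depth <= 14 ? (1 << 15) - 1 : (1 << 19) - 1;
    return v < 0 ? 0 : v > max ? max : (int32_t)v;
}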

x86_64:
chrRangeFromJpeg8_1920_c:    2126.5   2127.4  (1.00x)
chrRangeFromJpeg16_1920_c:   2331.4   2325.2  (1.00x)
chrRangeToJpeg8_1920_c:      3163.0   3166.9  (1.00x)
chrRangeToJpeg16_1920_c:     3163.7   2152.4  (1.47x)
lumRangeFromJpeg8_1920_c:    1262.2   1263.0  (1.00x)
lumRangeFromJpeg16_1920_c:   1079.5   1080.5  (1.00x)
lumRangeToJpeg8_1920_c:      1860.5   1886.8  (0.99x)
lumRangeToJpeg16_1920_c:     1910.2   1077.0  (1.77x)

aarch64 A55:
chrRangeFromJpeg8_1920_c:   28836.2  28835.2  (1.00x)
chrRangeFromJpeg16_1920_c:  28840.1  28839.8  (1.00x)
chrRangeToJpeg8_1920_c:     44196.2  23074.7  (1.92x)
chrRangeToJpeg16_1920_c:    36527.3  17318.9  (2.11x)
lumRangeFromJpeg8_1920_c:   15388.5  15389.7  (1.00x)
lumRangeFromJpeg16_1920_c:  15389.3  15388.2  (1.00x)
lumRangeToJpeg8_1920_c:     23069.7  19227.8  (1.20x)
lumRangeToJpeg16_1920_c:    19227.8  15387.0  (1.25x)

aarch64 A76:
chrRangeFromJpeg8_1920_c:    6334.7   6324.4  (1.00x)
chrRangeFromJpeg16_1920_c:   6336.0   6339.9  (1.00x)
chrRangeToJpeg8_1920_c:     11474.5   9656.0  (1.19x)
chrRangeToJpeg16_1920_c:     9640.5   6340.4  (1.52x)
lumRangeFromJpeg8_1920_c:    4453.2   4422.0  (1.01x)
lumRangeFromJpeg16_1920_c:   4414.2   4420.9  (1.00x)
lumRangeToJpeg8_1920_c:      6645.0   5949.1  (1.12x)
lumRangeToJpeg16_1920_c:     6005.2   4446.8  (1.35x)

NOTE: all simd optimizations for range_convert have been disabled
      except for x86, which already had the same behaviour.
      they will be re-enabled when they are fixed for each architecture.
2024-12-05 21:10:29 +01:00
Niklas Haas
2d077f9acd swscale/internal: group user-facing options together
This is a preliminary step to separating these into a new struct. This
commit contains no functional changes; it is a pure search-and-replace.

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2024-11-21 12:49:56 +01:00
Ramiro Polla
f7ee0195df swscale/range_convert: drop redundant conditionals from arch-specific init functions
These conditions are already checked for in the main init function.
2024-10-27 13:20:56 +01:00
Ramiro Polla
7728b3357d swscale/range_convert: call arch-specific init functions from main init function
This commit also fixes the issue that the call to ff_sws_init_range_convert()
from sws_init_swscale() was not setting up the arch-specific optimizations.
2024-10-27 13:20:56 +01:00
Niklas Haas
67adb30322 swscale: rename SwsContext to SwsInternal
And preserve the public SwsContext as a separate name. The motivation here
is that I want to turn SwsContext into a public struct, while keeping the
internal implementation hidden. Additionally, I also want to be able to
use multiple internal implementations, e.g. for GPU devices.

This commit does not include any functional changes. For the most part, it is
a simple rename. The only complications arise from the public facing API
functions, which preserve their current type (and hence require an additional
unwrapping step internally), and the checkasm test framework, which directly
accesses SwsInternal.

For consistency, the affected functions that need to maintain a distinction
have generally been changed to refer to the SwsContext as *sws, and the
SwsInternal as *c.

In an upcoming commit, I will provide a backing definition for the public
SwsContext, and update `sws_internal()` to dereference the internal struct
instead of merely casting it.
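
The unwrapping step mentioned above can be pictured roughly like this (a
sketch of the current, cast-based version):

typedef struct SwsContext  SwsContext;    /* public handle                 */
typedef struct SwsInternal SwsInternal;   /* private implementation struct */

static inline SwsInternal *sws_internal(SwsContext *sws)
{
    /* For now the two share one allocation, so this is a plain cast; once
     * SwsContext gets a backing definition this will dereference a member. */
    return (SwsInternal *)sws;
}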

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2024-10-24 22:50:00 +02:00
Martin Storsjö
b9145fcab2 swscale: Fix aarch64 and i386 compilation failures
This unbreaks builds after c1a0e65763,
which broke with errors like

src/libswscale/aarch64/rgb2rgb.c:66:25: error: incompatible function pointer types assigning to 'void (*)(const uint8_t *, uint8_t *, uint8_t *, uint8_t *, int, int, int, int, int, const int32_t *)' (aka 'void (*)(const unsigned char *, unsigned char *, unsigned char *, unsigned char *, int, int, int, int, int, const int *)') from 'void (const uint8_t *, uint8_t *, uint8_t *, uint8_t *, int, int, int, int, int, int32_t *)' (aka 'void (const unsigned char *, unsigned char *, unsigned char *, unsigned char *, int, int, int, int, int, int *)') [-Wincompatible-function-pointer-types]
   66 |         ff_rgb24toyv12  = rgb24toyv12;
      |                         ^ ~~~~~~~~~~~

and

src/libswscale/aarch64/swscale_unscaled.c:213:29: error: incompatible function pointer types assigning to 'SwsFunc' (aka 'int (*)(struct SwsContext *, const unsigned char *const *, const int *, int, int, unsigned char *const *, const int *)') from 'int (SwsContext *, const uint8_t *const *, const int *, int, int, const uint8_t **, const int *)' (aka 'int (struct SwsContext *, const unsigned char *const *, const int *, int, int, const unsigned char **, const int *)') [-Wincompatible-function-pointer-types]
  213 |         c->convert_unscaled = nv24_to_yuv420p_neon_wrapper;
      |                             ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
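
Both errors boil down to qualifier/type mismatches between the assigned
function and the (now constified) pointer type; a hedged sketch of the first
case, with parameter names invented for illustration:

#include <stdint.h>

/* Pointer type from the error message (last argument is const int32_t *): */
typedef void (*rgb24toyv12_fn)(const uint8_t *src, uint8_t *ydst, uint8_t *udst,
                               uint8_t *vdst, int width, int height,
                               int lumStride, int chromStride, int srcStride,
                               const int32_t *rgb2yuv);

/* The implementation is only assignable if its prototype carries the same
 * qualifiers, i.e. const int32_t * rather than int32_t * for the table. */
void rgb24toyv12(const uint8_t *src, uint8_t *ydst, uint8_t *udst, uint8_t *vdst,
                 int width, int height, int lumStride, int chromStride,
                 int srcStride, const int32_t *rgb2yuv);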

Signed-off-by: Martin Storsjö <martin@martin.st>
2024-10-08 09:29:07 +03:00
Niklas Haas
c1a0e65763 swscale/internal: constify SwsFunc
I want to move away from having random leaf processing functions mutate
plane pointers, and while we're at it, we might as well make the strides
and tables const as well.

Sponsored-by: Sovereign Tech Fund
Signed-off-by: Niklas Haas <git@haasn.dev>
2024-10-07 19:51:34 +02:00
Zhao Zhili
e18b46d95f swscale/aarch64: Fix rgb24toyv12 only works with aligned width
Since c0666d8b, rgb24toyv12 is broken for widths not aligned to 16.
Add a simple wrapper to handle the non-aligned part.
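
A sketch of that kind of wrapper, with simplified, hypothetical signatures
(the real code also deals with strides and processes two rows at a time):

#include <stdint.h>

void rgb24toyv12_neon16(const uint8_t *rgb, uint8_t *y, uint8_t *u, uint8_t *v, int width);
void rgb24toyv12_c(const uint8_t *rgb, uint8_t *y, uint8_t *u, uint8_t *v, int width);

/* Run the NEON kernel on the 16-aligned part of the row and let the C
 * version finish the remaining (width % 16) pixels. */
static void rgb24toyv12_wrapper(const uint8_t *rgb, uint8_t *y, uint8_t *u,
                                uint8_t *v, int width)
{
    const int aligned = width & ~15;
    if (aligned)
        rgb24toyv12_neon16(rgb, y, u, v, aligned);
    if (width > aligned)
        rgb24toyv12_c(rgb + 3 * aligned, y + aligned,
                      u + aligned / 2, v + aligned / 2, width - aligned);
}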

Co-authored-by: johzzy <hellojinqiang@gmail.com>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-09-24 10:24:14 +08:00
Ramiro Polla
c0666d8bed swscale/aarch64/rgb2rgb: add neon implementation for rgb24toyv12
A55               A76
rgb24toyv12_16_200_c:     36890.6           17275.5
rgb24toyv12_16_200_neon:  12460.1 ( 2.96x)   5360.8 ( 3.22x)
rgb24toyv12_128_60_c:     83205.1           39884.8
rgb24toyv12_128_60_neon:  27468.4 ( 3.03x)  13552.5 ( 2.94x)
rgb24toyv12_512_16_c:     88111.6           42346.8
rgb24toyv12_512_16_neon:  29126.6 ( 3.03x)  14411.2 ( 2.94x)
rgb24toyv12_1920_4_c:     82068.1           39620.0
rgb24toyv12_1920_4_neon:  27011.6 ( 3.04x)  13492.2 ( 2.94x)
2024-09-06 23:11:13 +02:00
Ramiro Polla
d8848325a6 swscale/aarch64/rgb2rgb: add deinterleaveBytes neon implementation
A55               A76
deinterleave_bytes_c:             70342.0           34497.5
deinterleave_bytes_neon:          21594.5 ( 3.26x)   5535.2 ( 6.23x)
deinterleave_bytes_aligned_c:     71340.8           34651.2
deinterleave_bytes_aligned_neon:   8616.8 ( 8.28x)   3996.2 ( 8.67x)
2024-09-06 23:05:09 +02:00
Ramiro Polla
420d443600 swscale/aarch64: cosmetics fix (spaces inside curly braces) 2024-08-26 11:07:49 +02:00
Ramiro Polla
52887683e9 swscale/aarch64: add nv24/nv42 to yuv420p unscaled converter
A55               A76
nv24_yuv420p_128_c:       4956.1            1267.0
nv24_yuv420p_128_neon:    3109.1 ( 1.59x)    640.0 ( 1.98x)
nv24_yuv420p_1920_c:     35728.4           11736.2
nv24_yuv420p_1920_neon:   8011.1 ( 4.46x)   2436.0 ( 4.82x)
nv42_yuv420p_128_c:       4956.4            1270.5
nv42_yuv420p_128_neon:    3074.6 ( 1.61x)    639.5 ( 1.99x)
nv42_yuv420p_1920_c:     35685.9           11732.5
nv42_yuv420p_1920_neon:   7995.1 ( 4.46x)   2437.2 ( 4.81x)
2024-08-26 11:04:46 +02:00
Martin Storsjö
cfe0a36352 libswscale: aarch64: Fix the indentation of some macro invocations
Signed-off-by: Martin Storsjö <martin@martin.st>
2024-08-22 14:40:30 +03:00
Ramiro Polla
181cd260db swscale/aarch64/yuv2rgb: add neon yuv42{0,2}p -> gbrp unscaled colorspace converters
checkasm --bench on a Raspberry Pi 5 Model B Rev 1.0:
yuv420p_gbrp_128_c: 1243.0
yuv420p_gbrp_128_neon: 453.5
yuv420p_gbrp_1920_c: 18165.5
yuv420p_gbrp_1920_neon: 6700.0
yuv422p_gbrp_128_c: 1463.5
yuv422p_gbrp_128_neon: 471.5
yuv422p_gbrp_1920_c: 21343.7
yuv422p_gbrp_1920_neon: 6743.5
2024-08-18 22:26:17 +02:00
Zhao Zhili
4d90a76986 swscale/aarch64: Add argb/abgr to yuv
Test on Apple M1 with kperf:
				: -O3		: -O3 -fno-vectorize
abgr_to_uv_8_c			: 19.4		: 26.1
abgr_to_uv_8_neon		: 29.9		: 51.1
abgr_to_uv_128_c		: 146.4		: 558.9
abgr_to_uv_128_neon		: 85.1		: 83.4
abgr_to_uv_1080_c		: 1162.6	: 4786.4
abgr_to_uv_1080_neon		: 819.6		: 826.6
abgr_to_uv_1920_c		: 2063.6	: 8492.1
abgr_to_uv_1920_neon		: 1435.1	: 1447.1
abgr_to_uv_half_8_c		: 16.4		: 11.4
abgr_to_uv_half_8_neon		: 35.6		: 20.4
abgr_to_uv_half_128_c		: 108.6		: 359.4
abgr_to_uv_half_128_neon	: 75.4		: 42.6
abgr_to_uv_half_1080_c		: 883.4		: 2885.6
abgr_to_uv_half_1080_neon	: 460.6		: 481.1
abgr_to_uv_half_1920_c		: 1553.6	: 5106.9
abgr_to_uv_half_1920_neon	: 817.6		: 820.4
abgr_to_y_8_c			: 6.1		: 26.4
abgr_to_y_8_neon		: 40.6		: 6.4
abgr_to_y_128_c			: 99.9		: 390.1
abgr_to_y_128_neon		: 67.4		: 55.9
abgr_to_y_1080_c		: 735.9		: 3170.4
abgr_to_y_1080_neon		: 534.6		: 536.6
abgr_to_y_1920_c		: 1279.4	: 6016.4
abgr_to_y_1920_neon		: 932.6		: 927.6

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-07-05 16:32:31 +08:00
Zhao Zhili
52422133ae swscale/aarch64: Add bgra/rgba to yuv
Test on Apple M1 with kperf
				: -O3		: -O3 -fno-vectorize
bgra_to_uv_8_c			: 13.4		: 27.5
bgra_to_uv_8_neon		: 37.4		: 41.7
bgra_to_uv_128_c		: 155.9		: 550.2
bgra_to_uv_128_neon		: 91.7		: 92.7
bgra_to_uv_1080_c		: 1173.2	: 4558.2
bgra_to_uv_1080_neon		: 822.7		: 809.5
bgra_to_uv_1920_c		: 2078.2	: 8115.2
bgra_to_uv_1920_neon		: 1437.7	: 1438.7
bgra_to_uv_half_8_c		: 17.9		: 14.2
bgra_to_uv_half_8_neon		: 37.4		: 10.5
bgra_to_uv_half_128_c		: 103.9		: 326.0
bgra_to_uv_half_128_neon	: 73.9		: 68.7
bgra_to_uv_half_1080_c		: 850.2		: 3732.0
bgra_to_uv_half_1080_neon	: 484.2		: 490.0
bgra_to_uv_half_1920_c		: 1479.2	: 4942.7
bgra_to_uv_half_1920_neon	: 824.2		: 824.7
bgra_to_y_8_c			: 8.2		: 29.5
bgra_to_y_8_neon		: 18.2		: 32.7
bgra_to_y_128_c			: 101.4		: 361.5
bgra_to_y_128_neon		: 74.9		: 73.7
bgra_to_y_1080_c		: 739.4		: 3018.0
bgra_to_y_1080_neon		: 613.4		: 544.2
bgra_to_y_1920_c		: 1298.7	: 5326.0
bgra_to_y_1920_neon		: 918.7		: 934.2

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-07-05 16:32:31 +08:00
Zhao Zhili
b8b71be07a swscale/aarch64: Add bgr24 to yuv
Test on Apple M1 with kperf
				: -O3		: -O3 -fno-vectorize
bgr24_to_uv_8_c			: 28.5		: 52.5
bgr24_to_uv_8_neon		: 54.5		: 59.7
bgr24_to_uv_128_c		: 294.0		: 830.7
bgr24_to_uv_128_neon		: 99.7		: 112.0
bgr24_to_uv_1080_c		: 965.0		: 6624.0
bgr24_to_uv_1080_neon		: 751.5		: 754.7
bgr24_to_uv_1920_c		: 1693.2	: 11554.5
bgr24_to_uv_1920_neon		: 1292.5	: 1307.5
bgr24_to_uv_half_8_c		: 54.2		: 37.0
bgr24_to_uv_half_8_neon		: 27.2		: 22.5
bgr24_to_uv_half_128_c		: 127.2		: 392.5
bgr24_to_uv_half_128_neon	: 63.0		: 52.0
bgr24_to_uv_half_1080_c		: 880.2		: 3329.0
bgr24_to_uv_half_1080_neon	: 401.5		: 390.7
bgr24_to_uv_half_1920_c		: 1585.7	: 6390.7
bgr24_to_uv_half_1920_neon	: 694.7		: 698.7
bgr24_to_y_8_c			: 21.7		: 22.5
bgr24_to_y_8_neon		: 797.2		: 25.5
bgr24_to_y_128_c		: 88.0		: 280.5
bgr24_to_y_128_neon		: 63.7		: 55.0
bgr24_to_y_1080_c		: 616.7		: 2208.7
bgr24_to_y_1080_neon		: 900.0		: 452.0
bgr24_to_y_1920_c		: 1093.2	: 3894.7
bgr24_to_y_1920_neon		: 777.2		: 767.5

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-07-05 16:32:31 +08:00
Ramiro Polla
75f1a8e071 swscale/aarch64: add neon {lum,chr}ConvertRange
chrRangeFromJpeg_8_c: 29.2
chrRangeFromJpeg_8_neon: 19.5
chrRangeFromJpeg_24_c: 80.5
chrRangeFromJpeg_24_neon: 34.0
chrRangeFromJpeg_128_c: 413.7
chrRangeFromJpeg_128_neon: 156.0
chrRangeFromJpeg_144_c: 471.0
chrRangeFromJpeg_144_neon: 174.2
chrRangeFromJpeg_256_c: 842.0
chrRangeFromJpeg_256_neon: 305.5
chrRangeFromJpeg_512_c: 1699.0
chrRangeFromJpeg_512_neon: 608.0
chrRangeToJpeg_8_c: 51.7
chrRangeToJpeg_8_neon: 22.7
chrRangeToJpeg_24_c: 149.7
chrRangeToJpeg_24_neon: 38.0
chrRangeToJpeg_128_c: 761.7
chrRangeToJpeg_128_neon: 176.7
chrRangeToJpeg_144_c: 866.2
chrRangeToJpeg_144_neon: 198.7
chrRangeToJpeg_256_c: 1516.5
chrRangeToJpeg_256_neon: 348.7
chrRangeToJpeg_512_c: 3067.2
chrRangeToJpeg_512_neon: 692.7
lumRangeFromJpeg_8_c: 24.0
lumRangeFromJpeg_8_neon: 17.0
lumRangeFromJpeg_24_c: 56.7
lumRangeFromJpeg_24_neon: 21.0
lumRangeFromJpeg_128_c: 294.5
lumRangeFromJpeg_128_neon: 76.7
lumRangeFromJpeg_144_c: 332.5
lumRangeFromJpeg_144_neon: 86.7
lumRangeFromJpeg_256_c: 586.0
lumRangeFromJpeg_256_neon: 152.2
lumRangeFromJpeg_512_c: 1190.0
lumRangeFromJpeg_512_neon: 298.0
lumRangeToJpeg_8_c: 31.7
lumRangeToJpeg_8_neon: 19.5
lumRangeToJpeg_24_c: 83.5
lumRangeToJpeg_24_neon: 24.2
lumRangeToJpeg_128_c: 440.5
lumRangeToJpeg_128_neon: 91.0
lumRangeToJpeg_144_c: 504.2
lumRangeToJpeg_144_neon: 101.0
lumRangeToJpeg_256_c: 879.7
lumRangeToJpeg_256_neon: 177.2
lumRangeToJpeg_512_c: 1794.2
lumRangeToJpeg_512_neon: 354.0
2024-06-18 23:12:41 +02:00
Zhao Zhili
9dac8495b0 swscale/aarch64: Add rgb24 to yuv implementation
Test on Apple M1:

rgb24_to_uv_8_c: 0.0
rgb24_to_uv_8_neon: 0.2
rgb24_to_uv_128_c: 1.0
rgb24_to_uv_128_neon: 0.5
rgb24_to_uv_1080_c: 7.0
rgb24_to_uv_1080_neon: 5.7
rgb24_to_uv_1920_c: 12.5
rgb24_to_uv_1920_neon: 9.5
rgb24_to_uv_half_8_c: 0.2
rgb24_to_uv_half_8_neon: 0.2
rgb24_to_uv_half_128_c: 1.0
rgb24_to_uv_half_128_neon: 0.5
rgb24_to_uv_half_1080_c: 6.2
rgb24_to_uv_half_1080_neon: 3.0
rgb24_to_uv_half_1920_c: 11.2
rgb24_to_uv_half_1920_neon: 5.2
rgb24_to_y_8_c: 0.2
rgb24_to_y_8_neon: 0.0
rgb24_to_y_128_c: 0.5
rgb24_to_y_128_neon: 0.5
rgb24_to_y_1080_c: 4.7
rgb24_to_y_1080_neon: 3.2
rgb24_to_y_1920_c: 8.0
rgb24_to_y_1920_neon: 5.7

On Pixel 6:

rgb24_to_uv_8_c: 30.7
rgb24_to_uv_8_neon: 56.9
rgb24_to_uv_128_c: 213.9
rgb24_to_uv_128_neon: 173.2
rgb24_to_uv_1080_c: 1649.9
rgb24_to_uv_1080_neon: 1424.4
rgb24_to_uv_1920_c: 2907.9
rgb24_to_uv_1920_neon: 2480.7
rgb24_to_uv_half_8_c: 36.2
rgb24_to_uv_half_8_neon: 33.4
rgb24_to_uv_half_128_c: 167.9
rgb24_to_uv_half_128_neon: 99.4
rgb24_to_uv_half_1080_c: 1293.9
rgb24_to_uv_half_1080_neon: 778.7
rgb24_to_uv_half_1920_c: 2292.7
rgb24_to_uv_half_1920_neon: 1328.7
rgb24_to_y_8_c: 19.7
rgb24_to_y_8_neon: 27.7
rgb24_to_y_128_c: 129.9
rgb24_to_y_128_neon: 96.7
rgb24_to_y_1080_c: 995.4
rgb24_to_y_1080_neon: 767.7
rgb24_to_y_1920_c: 1747.4
rgb24_to_y_1920_neon: 1337.2

Note both tests use clang as compiler, which has vectorization
enabled by default with -O3.

Reviewed-by: Rémi Denis-Courmont <remi@remlab.net>
Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-06-11 01:12:09 +08:00
xufuji456
cc86343b96 lavc/hevcdsp_qpel_neon: using movi.16b instead of movi.2d
When building for iOS with arm64, the compiler emits a warning: "instruction movi.2d with immediate #0 may not function correctly on this CPU, converting to movi.16b"

Signed-off-by: xufuji456 <839789740@qq.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-11-28 15:54:49 +02:00
Martin Storsjö
a76b409dd0 aarch64: Reindent all assembly to 8/24 column indentation
libavcodec/aarch64/vc1dsp_neon.S is skipped here, as it intentionally
uses a layered indentation style to visually show how different
unrolled/interleaved phases fit together.

Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-21 23:25:54 +03:00
Martin Storsjö
93cda5a9c2 aarch64: Lowercase UXTW/SXTW and similar flags
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-21 23:25:23 +03:00
Martin Storsjö
184103b310 aarch64: Consistently use lowercase for vector element specifiers
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-10-21 23:25:18 +03:00
Hubert Mazur
2537fdc510 sw_scale: Add specializations for hscale 16 to 19
Provide arm64 neon optimized implementations for hscale16To19 with
filter sizes 4, 8 and X4.

The tests and benchmarks run on AWS Graviton 2 instances.
The results from a checkasm tool are shown below.

hscale_16_to_19__fs_4_dstW_512_c: 6216.0
hscale_16_to_19__fs_4_dstW_512_neon: 2257.0
hscale_16_to_19__fs_8_dstW_512_c: 10417.7
hscale_16_to_19__fs_8_dstW_512_neon: 3112.5
hscale_16_to_19__fs_12_dstW_512_c: 14890.5
hscale_16_to_19__fs_12_dstW_512_neon: 3899.0
hscale_16_to_19__fs_16_dstW_512_c: 19006.5
hscale_16_to_19__fs_16_dstW_512_neon: 5341.2
hscale_16_to_19__fs_32_dstW_512_c: 36629.5
hscale_16_to_19__fs_32_dstW_512_neon: 9502.7
hscale_16_to_19__fs_40_dstW_512_c: 45477.5
hscale_16_to_19__fs_40_dstW_512_neon: 11552.0

(Note, the checkasm tests for these functions haven't been
merged since they fail on x86.)

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-11-01 15:24:58 +02:00
Hubert Mazur
9ccf8c5bfc sw_scale: Add specializations for hscale 16 to 15
Add arm64 neon implementations for hscale 16 to 15 with filter
sizes 4, 8 and X4.

The tests and benchmarks run on AWS Graviton 2 instances.
The results from a checkasm tool are shown below.

hscale_16_to_15__fs_4_dstW_512_c: 6703.5
hscale_16_to_15__fs_4_dstW_512_neon: 2298.0
hscale_16_to_15__fs_8_dstW_512_c: 10983.0
hscale_16_to_15__fs_8_dstW_512_neon: 3216.5
hscale_16_to_15__fs_12_dstW_512_c: 15526.0
hscale_16_to_15__fs_12_dstW_512_neon: 3993.0
hscale_16_to_15__fs_16_dstW_512_c: 20183.5
hscale_16_to_15__fs_16_dstW_512_neon: 5369.7
hscale_16_to_15__fs_32_dstW_512_c: 39315.2
hscale_16_to_15__fs_32_dstW_512_neon: 9511.2
hscale_16_to_15__fs_40_dstW_512_c: 48995.7
hscale_16_to_15__fs_40_dstW_512_neon: 11570.0

(Note, the checkasm tests for these functions haven't been
merged since they fail on x86.)

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-11-01 15:24:53 +02:00
Hubert Mazur
1e9cfa5bb0 sw_scale: Add specializations for hscale 8 to 19
Add arm64 neon implementations for hscale 8 to 19 with filter
sizes 4, 4X and 8. Both implementations are based on very similar ones
dedicated to hscale 8 to 15. The major change concerns how the data is
saved: instead of writing the result as int16_t, it is written as
int32_t.

These functions are heavily inspired by patches provided by J. Swinney
and M. Storsjö for hscale8to15, which were slightly adapted for
hscale8to19.

The tests and benchmarks run on AWS Graviton 2 instances. The results
from a checkasm tool are shown below.

hscale_8_to_19__fs_4_dstW_512_c: 5663.2
hscale_8_to_19__fs_4_dstW_512_neon: 1259.7
hscale_8_to_19__fs_8_dstW_512_c: 9306.0
hscale_8_to_19__fs_8_dstW_512_neon: 2020.2
hscale_8_to_19__fs_12_dstW_512_c: 12932.7
hscale_8_to_19__fs_12_dstW_512_neon: 2462.5
hscale_8_to_19__fs_16_dstW_512_c: 16844.2
hscale_8_to_19__fs_16_dstW_512_neon: 4671.2
hscale_8_to_19__fs_32_dstW_512_c: 32803.7
hscale_8_to_19__fs_32_dstW_512_neon: 5474.2
hscale_8_to_19__fs_40_dstW_512_c: 40948.0
hscale_8_to_19__fs_40_dstW_512_neon: 6669.7

Signed-off-by: Hubert Mazur <hum@semihalf.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-11-01 15:24:43 +02:00
Martin Storsjö
cb803a0072 swscale: aarch64: Fix yuv2rgb with negative strides
Treat the 32-bit stride registers as signed.

Alternatively, we could make the stride arguments ptrdiff_t instead
of int, and change all of the assembly to operate on these
registers with their full 64-bit width, but that's a much larger
and more intrusive change (and risks missing some operation, which
would still clamp the intermediates to 32 bits).
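
A C-level picture of the bug (the actual fix is in how the assembly extends
the 32-bit stride registers):

#include <stdint.h>

static const uint8_t *next_line(const uint8_t *p, int stride)
{
    /* Wrong: zero-extending a negative 32-bit stride to 64 bits turns it
     * into a huge positive offset instead of stepping backwards:
     *     return p + (uint64_t)(uint32_t)stride;
     * Right: sign-extend before the 64-bit pointer arithmetic: */
    return p + (int64_t)stride;
}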

Fixes: https://trac.ffmpeg.org/ticket/9985

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-10-27 21:49:26 +03:00
Swinney, Jonathan
0d7caa5b09 swscale/aarch64: add vscale specializations
This commit adds new code paths for vscale when filterSize is 2, 4, or
8. By using specialized code with unrolling to match the filterSize we
can improve performance.
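
A hedged sketch of the kind of dispatch this implies (illustrative names and
signatures, not the actual init code):

#include <stdint.h>

/* One unrolled kernel per specialized filterSize, plus a generic fallback. */
void vscale_2_neon(const int16_t **src, const int16_t *filter, uint8_t *dst, int dstW);
void vscale_4_neon(const int16_t **src, const int16_t *filter, uint8_t *dst, int dstW);
void vscale_8_neon(const int16_t **src, const int16_t *filter, uint8_t *dst, int dstW);
void vscale_x_neon(const int16_t **src, const int16_t *filter, int filterSize,
                   uint8_t *dst, int dstW);

static void vscale(const int16_t **src, const int16_t *filter, int filterSize,
                   uint8_t *dst, int dstW)
{
    switch (filterSize) {
    case 2:  vscale_2_neon(src, filter, dst, dstW);             break;
    case 4:  vscale_4_neon(src, filter, dst, dstW);             break;
    case 8:  vscale_8_neon(src, filter, dst, dstW);             break;
    default: vscale_x_neon(src, filter, filterSize, dst, dstW); break;
    }
}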

On AWS c7g (Graviton 3, Neoverse V1) instances:
                                 before   after
yuv2yuvX_2_0_512_accurate_neon:  558.8    268.9
yuv2yuvX_4_0_512_accurate_neon:  637.5    434.9
yuv2yuvX_8_0_512_accurate_neon:  1144.8   806.2
yuv2yuvX_16_0_512_accurate_neon: 2080.5   1853.7

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-16 13:40:42 +03:00
Swinney, Jonathan
3e708722a2 swscale/aarch64: vscale optimization
Use scalar-times-vector multiply-accumulate instructions instead of
vector-times-vector ones to remove the need for replicating load
instructions, which are slightly slower.
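
In intrinsics terms, the change corresponds roughly to moving from a
vector-by-vector widening multiply-accumulate (which needs the coefficient
replicated into a full vector) to the by-scalar form (a sketch only):

#include <arm_neon.h>

/* Before (conceptually): replicate the coefficient, then vector * vector. */
static int32x4_t mla_vec_vec(int32x4_t acc, int16x4_t pixels, int16_t coeff)
{
    return vmlal_s16(acc, pixels, vdup_n_s16(coeff));
}

/* After (conceptually): multiply-accumulate by the scalar directly, which
 * corresponds to the by-scalar/lane form and avoids the replicating load. */
static int32x4_t mla_vec_scalar(int32x4_t acc, int16x4_t pixels, int16_t coeff)
{
    return vmlal_n_s16(acc, pixels, coeff);
}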

On AWS c7g (Graviton 3, Neoverse V1) instances:
yuv2yuvX_8_0_512_accurate_neon:  1144.8  987.4
yuv2yuvX_16_0_512_accurate_neon: 2080.5 1869.4

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-16 13:40:42 +03:00
Swinney, Jonathan
75ffca7eef libswscale/aarch64: add another hscale specialization
This specialization handles the case where the filter size is 4 mod 8, e.g.
12, 20, etc. AArch64 was previously using the C function for this case;
this implementation speeds it up significantly.

hscale_8_to_15__fs_12_dstW_512_c: 6234.1
hscale_8_to_15__fs_12_dstW_512_neon: 1505.6

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-16 12:08:38 +03:00
Swinney, Jonathan
0ea61725b1 swscale/aarch64: add hscale specializations
This patch adds code to support specializations of the hscale function
and adds a specialization for filterSize == 4.

ff_hscale8to15_4_neon is a complete rewrite. Since the main bottleneck
here is loading the data from src, this data is loaded a whole block
ahead and stored back to the stack to be loaded again with ld4. This
arranges the data for most efficient use of the vector instructions and
removes the need for completion adds at the end. The number of
iterations of the C code per iteration of the assembly is increased from 4
to 8, but because of the prefetching, there must be a special section
without prefetching when dstW < 16.
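
For reference, a scalar sketch of the operation being specialized here
(based on the hscale_8_to_15 naming; the final shift/clamp is approximate):

#include <stdint.h>

static void hscale_8_to_15(int16_t *dst, int dstW, const uint8_t *src,
                           const int16_t *filter, const int32_t *filterPos,
                           int filterSize /* == 4 in this specialization */)
{
    for (int i = 0; i < dstW; i++) {
        int val = 0;
        for (int j = 0; j < filterSize; j++)
            val += src[filterPos[i] + j] * filter[filterSize * i + j];
        /* scale down to the 15-bit intermediate range (approximate) */
        val >>= 7;
        dst[i] = (int16_t)(val > 32767 ? 32767 : val);
    }
}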

This improves speed on Graviton 2 (Neoverse N1) dramatically in the case
where previously fs=8 would have been required.

before: hscale_8_to_15__fs_8_dstW_512_neon: 1962.8
after : hscale_8_to_15__fs_4_dstW_512_neon: 1220.9

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-05-28 01:09:05 +03:00
Martin Storsjö
70db14376c swscale: aarch64: Optimize the final summation in the hscale routine
Before:                     Cortex A53      A72      A73  Graviton 2  Graviton 3
hscale_8_to_15_width8_neon:     8273.0   4602.5   4289.5      2429.7      1629.1
hscale_8_to_15_width16_neon:   12405.7   6803.0   6359.0      3549.0      2378.4
hscale_8_to_15_width32_neon:   21258.7  11491.7  11469.2      5797.2      3919.6
hscale_8_to_15_width40_neon:   25652.0  14173.7  12488.2      6893.5      4810.4

After:
hscale_8_to_15_width8_neon:     7633.0   3981.5   3350.2      1980.7      1261.1
hscale_8_to_15_width16_neon:   11666.7   5951.0   5512.0      3080.7      2131.4
hscale_8_to_15_width32_neon:   20900.7  10733.2   9481.7      5275.2      3862.1
hscale_8_to_15_width40_neon:   24826.0  13536.2  11502.0      6397.2      4731.9

Thus, this gives an overall 8-29% speedup for the smaller filter
sizes, and around 1-8% for the larger filter sizes.

Inspired by a patch by Jonathan Swinney <jswinney@amazon.com>.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-22 10:49:46 +03:00
Anton Khirnov
1f80789bf7 sws: rename SwsContext.swscale to convert_unscaled
That function pointer is now used only for unscaled conversion.
2021-07-03 15:57:53 +02:00
Andreas Rheinhardt
f3c197b129 Include attributes.h directly
Some files currently rely on libavutil/cpu.h to include it for them;
yet said file won't include it any more after the currently
deprecated functions are removed, so include attributes.h directly.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2021-04-19 14:34:10 +02:00
Lynne
3e098cca6e
aarch64/yuv2rgb_neon: fix return value
We return 0 for this particular architecture, but should instead be
returning the number of lines.
This fixes users who check that the return value matches what they expect.
2020-07-09 10:33:14 +01:00
Martin Storsjö
e0604d508e swscale: aarch64: Add a NEON implementation of interleaveBytes
This allows speeding up format conversions from yuv420 to nv12.

                             Cortex A53      A72      A73
interleave_bytes_c:             86077.5  51433.0  66972.0
interleave_bytes_neon:          19701.7  23019.2  15859.2
interleave_bytes_aligned_c:     86603.0  52017.2  67484.2
interleave_bytes_aligned_neon:   9061.0   7623.0   6309.0

Signed-off-by: Martin Storsjö <martin@martin.st>
2020-05-15 23:38:17 +03:00
Josh de Kock
718c8f9aa5 swscale: fix NEON hscale init
The NEON hscale function only supports X8 filter sizes and should only
be selected when these are being used. At the moment filterAlign is
set to 8, but in the future, when extra NEON assembly for specific sizes is
added, those sizes will need checks here too.

The immediate use case for this change is making the hscale checkasm
test simpler and free of NEON-specific edge cases (x86 already has these
guards).

Signed-off-by: Josh de Kock <josh@itanimul.li>
2020-05-15 10:29:30 +01:00
Martin Storsjö
9025d5c5ce swscale: aarch64: Don't clobber callee-saved registers v8-v15
Signed-off-by: Martin Storsjö <martin@martin.st>
2020-04-21 23:41:13 +03:00