Commit Graph

6 Commits

Author SHA1 Message Date
Martin Storsjö
70db14376c swscale: aarch64: Optimize the final summation in the hscale routine
Before:                     Cortex A53      A72      A73  Graviton 2  Graviton 3
hscale_8_to_15_width8_neon:     8273.0   4602.5   4289.5      2429.7      1629.1
hscale_8_to_15_width16_neon:   12405.7   6803.0   6359.0      3549.0      2378.4
hscale_8_to_15_width32_neon:   21258.7  11491.7  11469.2      5797.2      3919.6
hscale_8_to_15_width40_neon:   25652.0  14173.7  12488.2      6893.5      4810.4

After:
hscale_8_to_15_width8_neon:     7633.0   3981.5   3350.2      1980.7      1261.1
hscale_8_to_15_width16_neon:   11666.7   5951.0   5512.0      3080.7      2131.4
hscale_8_to_15_width32_neon:   20900.7  10733.2   9481.7      5275.2      3862.1
hscale_8_to_15_width40_neon:   24826.0  13536.2  11502.0      6397.2      4731.9

Thus, this gives overall a 8-29% speedup for the smaller filter
sizes, around 1-8% for the larger filter sizes.

Inspired by a patch by Jonathan Swinney <jswinney@amazon.com>.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-22 10:49:46 +03:00
Martin Storsjö
9025d5c5ce swscale: aarch64: Don't clobber callee-saved registers v8-v15
Signed-off-by: Martin Storsjö <martin@martin.st>
2020-04-21 23:41:13 +03:00
Martin Storsjö
872790b1f9 swscale: aarch64: Avoid using the x18 register
The x18 is a reserved platform register on Darwin and Windows.

x8/w8 seems to be unused in this function though (and same about
x10 and x14), so there's really no reason to use x18 here - just change
the uses of x18/w18 into x8/w8 instead without any further rewrites.

Signed-off-by: Martin Storsjö <martin@martin.st>
2020-04-20 00:09:34 +03:00
Sebastian Pop
bd83191271 swscale/aarch64: use multiply accumulate and increase vector factor to 4
This patch implements ff_hscale_8_to_15_neon with NEON fused multiply accumulate
and bumps the vectorization factor from 2 to 4.
The speedup is of 25% on Graviton1 A1 instances based on A-72 cpus:

$ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
before: t:0.040303 avg:0.040287 max:0.040371 min:0.039214
after:  t:0.032168 avg:0.032215 max:0.033081 min:0.032146

The speedup is of 39% on Graviton2 m6g instances based on Neoverse-N1 cpus:
$ ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -
before: t:0.019446 avg:0.019423 max:0.019493 min:0.019181
after:  t:0.014015 avg:0.014096 max:0.015018 min:0.013971

Tested with `make check` on aarch64-linux.

Signed-off-by: Sebastian Pop <spop@amazon.com>
Reviewed-by: Jean-Baptiste Kempf <jb@videolan.org>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2019-12-17 23:41:47 +01:00
Clément Bœsch
040598218f sws/aarch64: restore ff_hscale_8_to_15_neon()
Fix final scaling and required filter alignment. Pass FATE.
2016-04-05 12:00:36 +02:00
Clément Bœsch
263eb76bdf sws/aarch64: add ff_hscale_8_to_15_neon
./ffmpeg -nostats -f lavfi -i testsrc2=4k:d=2 -vf bench=start,scale=1024x1024,bench=stop -f null -

    before: t:0.489726 avg:0.489883 max:0.491852 min:0.489482
    after:  t:0.256515 avg:0.256458 max:0.256999 min:0.253755
2016-03-31 10:12:55 +02:00