10 Commits

Author SHA1 Message Date
Wu Jianhua 7bbad32d5a libavfilter/x86/vf_gblur: correct the order of loop step
The problem occurred when the width of the processed block minus 1
was a multiple of the aligned number: the instruction
jle .bscale_scalar would then skip the Optimized Loop Step, which
led to incorrect sampling when more than one step was specified.
Move the Optimized Loop Step after .bscale_scalar to ensure the
loop step always runs.

Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
2021-09-18 12:38:01 +02:00
Wu Jianhua fcf10c925d libavfilter/x86/vf_gblur: fix the fate test failure on macOS
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
2021-09-18 12:37:56 +02:00
Wu Jianhua 4041c1029b libavfilter/x86/vf_gblur: add localbuf and ff_horiz_slice_avx2/512()
This introduces ff_horiz_slice_avx2/512(), implemented with a new algorithm.
In a nutshell, the new algorithm does three things: it gathers data from
8/16 rows, blurs the data, and scatters the data back to the image buffer.
A customized 8x8/16x16 transpose is used to avoid the huge overhead of the
gather and scatter instructions; it relies on a newly added temporary
buffer called localbuf.

Performance data:
ff_horiz_slice_avx2(old): 109.89
ff_horiz_slice_avx2(new): 666.67
ff_horiz_slice_avx512: 1000

Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
Co-authored-by: Jin Jun <jun.i.jin@intel.com>
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
2021-08-29 19:58:33 +02:00
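
The gather / blur / scatter idea described in this commit can be sketched
in plain C. This is only an illustration under stated assumptions, not the
AVX2/AVX-512 code from the patch: the names horiz_slice_ref,
horiz_slice_blocked and BLOCK_ROWS are hypothetical, the recursive filter
(scale, filter rightwards, scale, filter leftwards) is assumed to match the
scalar horizontal pass, and scalar loops over r stand in for SIMD lanes.

```c
#include <stddef.h>

/* Hypothetical: rows handled together; 8 mirrors one AVX2 register of floats. */
#define BLOCK_ROWS 8

/* Reference: blur each row in place, one row at a time (assumed filter shape). */
static void horiz_slice_ref(float *buffer, int width, int height,
                            int steps, float nu, float bscale)
{
    for (int y = 0; y < height; y++) {
        float *ptr = buffer + (size_t)width * y;
        for (int step = 0; step < steps; step++) {
            int x;
            ptr[0] *= bscale;
            for (x = 1; x < width; x++)      /* filter rightwards */
                ptr[x] += nu * ptr[x - 1];
            ptr[x = width - 1] *= bscale;
            for (; x > 0; x--)               /* filter leftwards */
                ptr[x - 1] += nu * ptr[x];
        }
    }
}

/* Blocked variant: localbuf must hold width * BLOCK_ROWS floats. */
static void horiz_slice_blocked(float *buffer, float *localbuf, int width,
                                int height, int steps, float nu, float bscale)
{
    for (int y0 = 0; y0 < height; y0 += BLOCK_ROWS) {
        int rows = height - y0 < BLOCK_ROWS ? height - y0 : BLOCK_ROWS;

        /* 1. Gather: store the block transposed, so the samples of column x
         *    from all rows of the block are contiguous in localbuf. */
        for (int r = 0; r < rows; r++)
            for (int x = 0; x < width; x++)
                localbuf[(size_t)x * BLOCK_ROWS + r] =
                    buffer[(size_t)(y0 + r) * width + x];

        /* 2. Blur: every row advances in lockstep; each inner loop over r
         *    is what a single vector instruction would do. */
        for (int step = 0; step < steps; step++) {
            float *last = localbuf + (size_t)(width - 1) * BLOCK_ROWS;

            for (int r = 0; r < rows; r++)
                localbuf[r] *= bscale;
            for (int x = 1; x < width; x++) {
                float *cur  = localbuf + (size_t)x * BLOCK_ROWS;
                float *prev = cur - BLOCK_ROWS;
                for (int r = 0; r < rows; r++)
                    cur[r] += nu * prev[r];
            }
            for (int r = 0; r < rows; r++)
                last[r] *= bscale;
            for (int x = width - 1; x > 0; x--) {
                float *cur  = localbuf + (size_t)x * BLOCK_ROWS;
                float *prev = cur - BLOCK_ROWS;
                for (int r = 0; r < rows; r++)
                    prev[r] += nu * cur[r];
            }
        }

        /* 3. Scatter: transpose back into the image buffer. */
        for (int r = 0; r < rows; r++)
            for (int x = 0; x < width; x++)
                buffer[(size_t)(y0 + r) * width + x] =
                    localbuf[(size_t)x * BLOCK_ROWS + r];
    }
}
```

Both functions perform the same per-sample operations in the same order, so
they should agree on any input; the blocked layout only changes which
samples are adjacent in memory, which is what lets a vector unit process
8 (or 16) rows per instruction without gather/scatter instructions.
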
Wu Jianhua 68a2722aee libavfilter/x86/vf_gblur: add ff_verti_slice_avx2/512()
The new vertical slice with AVX2/512 acceleration can significantly
improve the performance of Gaussian Filter 2D.

Performance data:
ff_verti_slice_c: 32.57
ff_verti_slice_avx2: 476.19
ff_verti_slice_avx512: 833.33

Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
Co-authored-by: Jin Jun <jun.i.jin@intel.com>
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
2021-08-29 19:58:33 +02:00
Wu Jianhua 4a5e24721c libavfilter/x86/vf_gblur: add ff_postscale_slice_avx512()
Co-authored-by: Cheng Yanfei <yanfei.cheng@intel.com>
Co-authored-by: Jin Jun <jun.i.jin@intel.com>
Signed-off-by: Wu Jianhua <jianhua.wu@intel.com>
2021-08-29 19:58:33 +02:00
James Almer 1628409b18 x86/vf_gblur: fix reg name in UNIX64 prologue
Signed-off-by: James Almer <jamrial@gmail.com>
2021-02-17 15:51:28 -03:00
James Almer 2b4da1cb8c x86/vf_gblur: fix postscale_slice prologue
The x86_32 ABI does not pass float arguments directly in xmm regs, and the
Win64 ABI uses only the first four regs for this purpose.

Signed-off-by: James Almer <jamrial@gmail.com>
2021-02-17 13:33:20 -03:00
Paul B Mahol 44cf3a2b16 avfilter/x86/vf_gblur: add postscale SIMD
2021-02-16 21:12:11 +01:00
Paul B Mahol 64a805883d avfilter/vf_gblur: fix heap-buffer overflow
Fixes #8282
2019-10-16 12:13:04 +02:00
Ruiling Song 83f9da7768 avfilter/vf_gblur: add x86 SIMD optimizations
The horizontal pass gets ~2x performance with the patch
when running single-threaded.

Tested overall performance using these commands (avx2 enabled):
./ffmpeg -i 1080p.mp4 -vf gblur -f null /dev/null
./ffmpeg -i 1080p.mp4 -vf gblur=threads=1 -f null /dev/null
For single thread, the fps improves from 43 to 60, about 40%.
For multi-thread, the fps improves from 110 to 130, about 18%.

Signed-off-by: Ruiling Song <ruiling.song@intel.com>
2019-06-12 08:53:11 +08:00
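
The percentages quoted in this commit are relative changes,
(after - before) / before. A minimal sketch, using a hypothetical helper
named percent_gain, makes the arithmetic explicit:

```c
/* Hypothetical helper: relative improvement in percent. */
static double percent_gain(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

With the fps figures from the message, percent_gain(43, 60) is roughly 39.5
and percent_gain(110, 130) is roughly 18.2.
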