Commit Graph

516 Commits

Author SHA1 Message Date
Lynne 997f9bdb99
x86/tx_float: correctly load the transform length
The field is a standard field, yet we were loading it as if it was
a quadword. This worked for forward transforms by chance, but broke
when the transform was inverse.
checkasm couldn't catch that because we only test forward transforms,
which are identical to inverse transforms but with a different revtab.
2021-07-18 15:04:57 +02:00
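The failure mode described in 997f9bdb99 can be illustrated in C. This is a minimal sketch, not the real context layout: `DemoCtx`, its field order, and both loader names are hypothetical, and it assumes a little-endian target such as x86. A 64-bit load of a 32-bit field also picks up the bytes that follow it, so the result is only right while the neighbouring field happens to be zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: a 32-bit length directly followed by a 32-bit flag. */
typedef struct DemoCtx {
    int32_t len; /* transform length */
    int32_t inv; /* 0 = forward, 1 = inverse (assumed neighbouring field) */
} DemoCtx;

/* Buggy: 64-bit load starting at `len`; it also reads the bytes of `inv`. */
static uint64_t load_len_qword(const DemoCtx *c)
{
    uint64_t v;
    assert(sizeof(*c) >= sizeof(v));
    memcpy(&v, c, sizeof(v));
    return v;
}

/* Fixed: 32-bit load, widened to the register size afterwards. */
static int64_t load_len_dword(const DemoCtx *c)
{
    return (int64_t)c->len;
}
```

With `inv == 0` both loaders agree, which is why a forward-only test would not notice the difference under these assumptions.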
James Almer 7a6ea6ce2a x86/tx_float: remove ff_ prefix from external constant tables
Fixes compilation with some assemblers.

Reviewed-by: Lynne
Signed-off-by: James Almer <jamrial@gmail.com>
2021-04-25 18:42:38 -03:00
Lynne bb40f800bd
x86/tx_float: fix forgotten 2-argument mulps
Yasm *really* cannot deal with any omitted arguments at all.
2021-04-24 22:33:42 +02:00
Lynne e2cf0a1f68
x86/tx_float: use all arguments on vperm2f and vpermilps and reindent comments
Apparently even old nasm isn't required to accept incomplete instructions.
2021-04-24 22:21:13 +02:00
James Almer fddddc7ec2 x86/tx_float: Fixes compilation with old yasm
Use three operand format on some instructions, and lea to load effective
addresses of tables.

Signed-off-by: James Almer <jamrial@gmail.com>
2021-04-24 17:02:31 -03:00
Lynne e448a4b4ea
lavu/x86/tx_float: fix FMA3 implying AVX2 is available
It's the other way around - AVX2 implies FMA3 is available.
2021-04-24 19:00:27 +02:00
Lynne 119a3f7e8d
lavu/x86: add FFT assembly
This commit adds a pure x86 assembly SIMD version of the FFT in libavutil/tx.
The design of this pure assembly FFT is pretty unconventional.

On the lowest level, instead of splitting the complex numbers into
real and imaginary parts, we keep complex numbers together but split
them in terms of parity. This saves a number of shuffles in each transform,
but more importantly, it splits each transform into two independent
paths, which we process using separate registers in parallel.
This allows us to keep all units saturated and lets us use all available
registers to avoid dependencies.
Moreover, it allows us to double the granularity of our per-load permutation,
skipping many expensive lookups and allowing us to use just 4 loads per register
rather than 8, or, when AVX2 (and by extension, FMA3) is available, to use the
vgatherdpd instruction, which is at least as fast as 4 separate loads on old
hardware and quite a bit faster on modern CPUs.
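
As a rough scalar illustration of the parity split described above (a sketch only; `cplx` and `split_by_parity` are hypothetical names, and the actual register layout used by the assembly may differ), each complex number stays intact as a (re, im) pair while the input is separated into even-indexed and odd-indexed halves that can then be processed independently:

```c
#include <assert.h>

typedef struct { float re, im; } cplx;

/* Split n complex inputs into two independent streams by index parity.
 * Real and imaginary parts are never separated from each other. */
static void split_by_parity(const cplx *in, cplx *even, cplx *odd, int n)
{
    assert(n > 0 && n % 2 == 0);
    for (int i = 0; i < n / 2; i++) {
        even[i] = in[2 * i];     /* inputs 0, 2, 4, ... */
        odd[i]  = in[2 * i + 1]; /* inputs 1, 3, 5, ... */
    }
}
```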

Higher up, we go for a bottom-up construction of large transforms, foregoing
the traditional per-transform call-return recursion chains. Instead, we always
start at the bottom-most basis transform (in this case, a 32-point transform),
and continue constructing larger and larger transforms until we return to the
top-most transform.
This way, we only touch the stack 3 times per complete target transform:
once for the 1/2 length transform and two times for the 1/4 length transform.

The combination algorithm we use is a standard Split-Radix algorithm,
as used in our C code. Although a version with fewer operations exists
(Steven G. Johnson and Matteo Frigo's "A modified split-radix FFT with fewer
arithmetic operations", IEEE Trans. Signal Process. 55 (1), 111–119 (2007),
which is the one FFTW uses), it only has 2% fewer operations and requires at least 4x
the binary code (due to it needing 4 different paths to do a single transform).
That version also has other issues which prevent it from being implemented
with SIMD code as efficiently, which makes it lose the marginal gains it offered,
and cannot be performed bottom-up, requiring many recursive call-return chains,
whose overhead adds up.
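
For reference, one standard split-radix combination step can be sketched in scalar C. This is the textbook decimation-in-time form, not a transcription of the assembly or of the C code in libavutil/tx; `cplx`, `cmul` and `split_radix_combine` are hypothetical names, and the caller is assumed to supply the twiddles `w1[k] = exp(-2*pi*i*k/n)` and `w3[k] = exp(-2*pi*i*3k/n)`. It merges one n/2-point transform of the even inputs (`e`) with two n/4-point transforms of the inputs at indices 4m+1 (`o1`) and 4m+3 (`o3`):

```c
#include <assert.h>

typedef struct { float re, im; } cplx;

static cplx cmul(cplx a, cplx b)
{
    cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* out: n outputs; e: n/2 entries; o1, o3, w1, w3: n/4 entries each. */
static void split_radix_combine(cplx *out, const cplx *e,
                                const cplx *o1, const cplx *o3,
                                const cplx *w1, const cplx *w3, int n)
{
    int q = n / 4;
    assert(n >= 4 && n % 4 == 0);
    for (int k = 0; k < q; k++) {
        cplx t1 = cmul(w1[k], o1[k]);
        cplx t3 = cmul(w3[k], o3[k]);
        cplx s  = { t1.re + t3.re, t1.im + t3.im }; /* t1 + t3 */
        cplx d  = { t1.re - t3.re, t1.im - t3.im }; /* t1 - t3 */

        out[k]         = (cplx){ e[k].re + s.re, e[k].im + s.im };
        out[k + 2 * q] = (cplx){ e[k].re - s.re, e[k].im - s.im };
        /* multiplying d by -i gives (d.im, -d.re); by +i gives (-d.im, d.re) */
        out[k + q]     = (cplx){ e[k + q].re + d.im, e[k + q].im - d.re };
        out[k + 3 * q] = (cplx){ e[k + q].re - d.im, e[k + q].im + d.re };
    }
}
```

For n = 4 the only twiddle is 1, so the real input {1, 2, 3, 4} can be checked by hand: `e` = {4, -2}, `o1` = {2}, `o3` = {4}, and the step yields {10, -2+2i, -2, -2-2i}.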

We go through a lot of effort to minimize load/stores by keeping as much as
possible in registers between constructing transforms. This saves us around 32 cycles,
on paper, but in reality a lot more due to load/store aliasing (a load from a
memory location cannot be issued while there's a store pending, and there are
only so many (2 for Zen 3) load/store units in a CPU).
Also, we interleave coefficients during the last stage to save on a store+load
per register.

Each of the smallest, basis transforms (4, 8 and 16-point in our case)
has been extremely optimized. Our 8-point transform is barely 20 instructions
in total, beating our old implementation's 8-point transform by 1 instruction.
Our 2x8-point transform is 23 instructions, beating our old implementation by
6 instructions and needing 50% fewer cycles. Our 16-point transform's combination
code takes slightly more instructions than our old implementation, but makes up
for it by requiring far fewer arithmetic operations.

Overall, the transform was optimized for the timings of Zen 3, which at the
time of writing has the highest IPC of all documented CPUs. Shuffles were
preferred over arithmetic operations due to their 1/0.5 latency/throughput.

On average, this code is 30% faster than our old libavcodec implementation.
It's able to trade blows with the previously-untouchable FFTW on small transforms,
and due to its tiny size and better prediction, outdoes FFTW on larger transforms
by 11% on the largest currently supported size.
2021-04-24 17:19:18 +02:00
Andreas Rheinhardt f3c197b129 Include attributes.h directly
Some files currently rely on libavutil/cpu.h to include it for them;
yet said file won't include it any more after the currently
deprecated functions are removed, so include attributes.h directly.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2021-04-19 14:34:10 +02:00
Henrik Gramner 0b2b03568f avutil/x86inc: fix warnings when assembling with Nasm 2.15
Some new warnings regarding use of empty macro parameters have
been added, so adjust some x86inc code to silence those.

Fixes part of ticket #8771

Signed-off-by: James Almer <jamrial@gmail.com>
2020-07-12 11:30:23 -03:00
Martin Storsjö 1001b6a750 libavutil: x86: Include stdlib.h before using _byteswap_ulong
When clang works in MSVC mode, it does have the _byteswap_ulong
builtin, but one has to include stdlib.h before using it.

Signed-off-by: Martin Storsjö <martin@martin.st>
2020-01-23 18:30:26 +02:00
James Almer 9d002d7818 x86/float_dsp: add ff_vector_dmul_{sse2,avx}
~3x to 5x faster.

Signed-off-by: James Almer <jamrial@gmail.com>
2018-09-14 12:54:42 -03:00
James Almer 481741ece0 x86/pixelutils: don't use the AVX2 functions on CPUs known to be slow with them
Signed-off-by: James Almer <jamrial@gmail.com>
2018-07-31 22:14:53 -03:00
James Almer d5b3077ecf x86/pixelutils: add missing preprocessor wrapper to the AVX2 functions
Should fix compilation with old yasm/nasm

Signed-off-by: James Almer <jamrial@gmail.com>
2018-07-31 22:14:42 -03:00
Jun Zhao d36b8394f4 avutil/pixelutils: sad_32x32 sse2/avx2 optimizations.
add ff_pixelutils_sad_32x32_sse2, ff_pixelutils_sad_{a,u}_32x32_sse2,
ff_pixelutils_sad_32x32_avx2, ff_pixelutils_sad_{a,u}_32x32_avx2

Profiling with perf record/report (instructions:u) for avx2 sad_32x32 gives:

  72.05%  pixelutils  pixelutils     [.] block_sad_32x32_c
  18.50%  pixelutils  pixelutils     [.] block_sad_16x16_c
   4.78%  pixelutils  pixelutils     [.] block_sad_8x8_c
   2.69%  pixelutils  pixelutils     [.] block_sad_4x4_c
   0.89%  pixelutils  pixelutils     [.] block_sad_2x2_c
   0.16%  pixelutils  pixelutils     [.] ff_pixelutils_sad_32x32_avx2
   0.16%  pixelutils  pixelutils     [.] ff_pixelutils_sad_u_32x32_avx2
   0.12%  pixelutils  pixelutils     [.] ff_pixelutils_sad_a_32x32_avx2

For sse2 sad_32x32, instructions:u looks like:

  71.86%  pixelutils  pixelutils     [.] block_sad_32x32_c
  18.42%  pixelutils  pixelutils     [.] block_sad_16x16_c
   4.81%  pixelutils  pixelutils     [.] block_sad_8x8_c
   2.68%  pixelutils  pixelutils     [.] block_sad_4x4_c
   0.88%  pixelutils  pixelutils     [.] block_sad_2x2_c
   0.29%  pixelutils  pixelutils     [.] ff_pixelutils_sad_32x32_sse2
   0.26%  pixelutils  pixelutils     [.] ff_pixelutils_sad_u_32x32_sse2
   0.23%  pixelutils  pixelutils     [.] ff_pixelutils_sad_a_32x32_sse2

Signed-off-by: Jun Zhao <mypopydev@gmail.com>
2018-07-31 19:17:51 +08:00
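For context on d36b8394f4, the operation being optimized is a sum of absolute differences over a pixel block. A scalar reference can be sketched as follows; `demo_block_sad` is a hypothetical name and signature chosen for illustration, not the library's actual `block_sad_32x32_c`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between two w x h blocks of 8-bit pixels.
 * Strides are in bytes and may be larger than w. */
static unsigned demo_block_sad(const uint8_t *a, ptrdiff_t stride_a,
                               const uint8_t *b, ptrdiff_t stride_b,
                               int w, int h)
{
    unsigned sum = 0;
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++)
            sum += (unsigned)abs(a[x] - b[x]);
        a += stride_a;
        b += stride_b;
    }
    return sum;
}
```

A SIMD version computes the same value many pixels at a time; the aligned (`_a_`) and unaligned (`_u_`) variants named in the commit presumably differ only in the loads they may use.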
alexander schmid b23c4a9dbd lavu/x86/cpu: Fix aesni detection 2018-07-19 20:17:44 +02:00
Jun Zhao 09628cb1b4 avutil/pixelutils: correct the function name in comments
Signed-off-by: Jun Zhao <mypopydev@gmail.com>
2018-07-11 20:12:33 +08:00
James Almer 35347e7e9b Merge commit '4cf84e254ae75b524e1cacae499a97d7cc9e5906'
* commit '4cf84e254ae75b524e1cacae499a97d7cc9e5906':
  Drop some unnecessary config.h #includes

Merged-by: James Almer <jamrial@gmail.com>
2018-02-11 23:08:48 -03:00
Diego Biurrun 4cf84e254a Drop some unnecessary config.h #includes 2018-02-06 10:03:15 +01:00
Henrik Gramner 6f62b0bd4f x86inc: Drop cpuflags_slowctz 2018-01-20 19:23:37 +01:00
Henrik Gramner eb5f063e7c x86inc: Correctly set mmreg variables 2018-01-20 19:23:37 +01:00
Henrik Gramner 6b6edd1216 x86inc: Support creating global symbols from local labels
On ELF platforms such symbols need to be flagged as functions with the
correct visibility to please certain linkers in some scenarios.
2018-01-20 19:23:37 +01:00
Henrik Gramner 9e4b3675f2 x86inc: Use .rdata instead of .rodata on Windows
The standard section for read-only data on Windows is .rdata. Nasm will
flag non-standard sections as executable by default which isn't ideal.
2018-01-20 19:23:37 +01:00
Henrik Gramner 3a02cbe3fa x86inc: Enable AVX emulation for floating-point pseudo-instructions
There are 32 pseudo-instructions for each floating-point comparison
instruction, but only 8 of them are actually valid in legacy-encoded mode.
The remaining 24 require the use of VEX-encoded (v-prefixed) instructions
and can therefore be disregarded for this purpose.
2018-01-20 19:23:37 +01:00
James Almer 90d216cb90 x86inc: set the correct amount of simd regs in x86_64 when avx512 is enabled but not used
Fixes compilation of libavresample/x86/audio_mix.asm

Reviewed-by: Gramner
Signed-off-by: James Almer <jamrial@gmail.com>
2017-12-24 23:02:54 -03:00
Henrik Gramner f7197f68dc x86inc: AVX-512 support
AVX-512 consists of a plethora of different extensions, but in order to keep
things a bit more manageable we group together the following extensions
under a single baseline cpu flag which should cover SKL-X and future CPUs:
 * AVX-512 Foundation (F)
 * AVX-512 Conflict Detection Instructions (CD)
 * AVX-512 Byte and Word Instructions (BW)
 * AVX-512 Doubleword and Quadword Instructions (DQ)
 * AVX-512 Vector Length Extensions (VL)

On x86-64 AVX-512 provides 16 additional vector registers; prefer using
those over existing ones since it allows us to avoid using `vzeroupper`
unless more than 16 vector registers are required. They also happen to
be volatile on Windows which means that we don't need to save and restore
existing xmm register contents unless more than 22 vector registers are
required.

Big thanks to Intel for their support.
2017-12-24 22:02:41 +01:00
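The register thresholds mentioned in f7197f68dc can be written out as arithmetic. This is a sketch under stated assumptions, not x86inc's allocation logic: both function names are hypothetical, `n` is the number of vector registers a function asks for, and the save count assumes the Win64 convention in which xmm6 to xmm15 are callee-saved, so the 16 new registers plus xmm0 to xmm5 give 22 that are free to clobber.

```c
#include <assert.h>

/* The 16 new registers are handed out first, so the legacy ones (and with
 * them the need for vzeroupper) only come into play beyond 16. */
static int demo_needs_vzeroupper(int n)
{
    assert(n >= 0 && n <= 32);
    return n > 16;
}

/* Number of callee-saved xmm registers to save and restore on Win64. */
static int demo_win64_xmm_to_save(int n)
{
    assert(n >= 0 && n <= 32);
    return n > 22 ? n - 22 : 0;
}
```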
James Darnley e2218ed8ce avutil: add alignment needed for AVX-512 2017-12-24 22:02:41 +01:00
James Darnley 4783a01c11 avutil: detect when AVX-512 is available 2017-12-24 22:02:41 +01:00
James Darnley 8b81eabe57 avutil: add AVX-512 flags 2017-12-24 22:02:41 +01:00
Martin Vignali b37196adff avutil/x86util: add macro for loading a 128-bit constant into an xmm or into each half of a ymm in order to simplify avx2 asm functions 2017-12-02 18:25:15 +01:00
Dale Curtis 50e30d9bb7 Don't use _tzcnt intrinsics with clang for windows w/o BMI.
Technically _tzcnt* intrinsics are only available when the BMI
instruction set is present. However the instruction encoding
degrades to "rep bsf" on older processors.

Clang for Windows debatably restricts the _tzcnt* intrinsics behind
the __BMI__ architecture define, so check for its presence or
exclude the usage of these intrinsics when clang is present.

See also:
https://ffmpeg.org/pipermail/ffmpeg-devel/2015-November/183404.html
https://bugs.llvm.org/show_bug.cgi?id=30506
http://lists.llvm.org/pipermail/cfe-dev/2016-October/051034.html

Signed-off-by: Dale Curtis <dalecurtis@chromium.org>
Reviewed-by: Matt Oliver <protogonoi@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-10-25 21:50:37 +02:00
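The kind of guard described in 50e30d9bb7 can be sketched as follows. This is a simplified illustration, not the condition used in libavutil: `demo_ctz32` is a hypothetical name, the check is reduced to `__BMI__` alone (the real one also distinguishes compilers), and a portable loop stands in for the non-BMI path.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__BMI__)
#include <immintrin.h>
#endif

/* Count trailing zero bits of a non-zero 32-bit value. */
static int demo_ctz32(uint32_t v)
{
    assert(v != 0);
#if defined(__BMI__)
    /* Only reference the intrinsic when the compiler advertises BMI. */
    return (int)_tzcnt_u32(v);
#else
    int n = 0;
    while (!(v & 1)) {
        v >>= 1;
        n++;
    }
    return n;
#endif
}
```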
James Almer 2904db9045 Merge commit '994c4bc10751e39c7ed9f67ffd0c0dea5223daf2'
* commit '994c4bc10751e39c7ed9f67ffd0c0dea5223daf2':
  x86util: Port all macros to cpuflags

See d5f8a642f6

Merged-by: James Almer <jamrial@gmail.com>
2017-10-21 12:15:57 -03:00
James Almer 3d828c9fd5 cpu: split flag checks per arch in av_cpu_max_align()
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
2017-10-09 11:48:24 +02:00
James Almer 3b345d389b avutil/cpu: split flag checks per arch in av_cpu_max_align()
Signed-off-by: James Almer <jamrial@gmail.com>
2017-09-27 23:10:09 -03:00
James Almer 0c005fa86f Merge commit '7abdd026df6a9a52d07d8174505b33cc89db7bf6'
* commit '7abdd026df6a9a52d07d8174505b33cc89db7bf6':
  asm: Consistently uppercase SECTION markers

Merged-by: James Almer <jamrial@gmail.com>
2017-09-26 18:48:06 -03:00
Ivan Kalvachev 30ae07d7ef Add macros to x86util.asm.
Improved version of VBROADCASTSS that works like the avx2 instruction.
Emulation of vpbroadcastd.
Horizontal sum HSUMPS that places the result in all elements.
Emulation of blendvps and pblendvb.

Signed-off-by: Ivan Kalvachev <ikalvachev@gmail.com>
2017-08-18 17:18:32 +01:00
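The intended semantics of two of the macros in 30ae07d7ef can be modelled in scalar C. This is a sketch of the behaviour only, with hypothetical function names, and it assumes 4-lane (128-bit) vectors: the horizontal sum leaves the total in every lane, and the variable blend picks the second source wherever the mask lane has its sign bit set, as the blendvps instruction does.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a horizontal sum that places the result in all four lanes. */
static void demo_hsum_broadcast(float v[4])
{
    float s;
    assert(v);
    s = (v[0] + v[1]) + (v[2] + v[3]);
    for (int i = 0; i < 4; i++)
        v[i] = s;
}

/* Model of a variable blend: lane i comes from b when bit 31 of mask[i]
 * is set, and from a otherwise. */
static void demo_blendv(float dst[4], const float a[4], const float b[4],
                        const uint32_t mask[4])
{
    assert(dst && a && b && mask);
    for (int i = 0; i < 4; i++)
        dst[i] = (mask[i] >> 31) ? b[i] : a[i];
}
```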
James Almer 4d62ee6746 x86inc: don't use read-only data sections on COFF targets
Yasm:
src/libavfilter/x86/af_volume.asm:24: warning: Standard COFF does not support read-only data sections
src/libavfilter/x86/af_volume.asm:24: warning: Unrecognized qualifier `align'

Nasm:
src/libavfilter/x86/af_volume.asm:24: error: standard COFF does not support section alignment specification
src/libavutil/x86/x86inc.asm:92: ... from macro `SECTION_RODATA' defined here

Tested-by: Clément Bœsch <u@pkh.me>
Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-27 12:48:04 -03:00
Diego Biurrun fd502f4f5f build: Generalize yasm/nasm-related variable names
None of them are specific to the YASM assembler.

(Cherry-picked from libav commit 39e208f4d4)

Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-21 17:00:29 -03:00
James Almer e229df9478 x86/aacpsdsp: add ff_ps_hybrid_synthesis_deint_{sse,sse4}
About 2x faster than the c version.
2017-06-18 22:33:27 -03:00
Henrik Gramner aad1b6786e x86inc: Add some additional cpuflag relations
Simplifies writing assembly code that depends on available instructions.

LZCNT implies SSE2
BMI1 implies AVX+LZCNT
AVX2 implies BMI2
2017-06-12 11:41:25 +02:00
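One common way to encode relations like those in aad1b6786e is to fold the implied flags into each flag's own bit pattern, so a single mask comparison answers "is X available?". The sketch below uses hypothetical names and bit positions and includes only relations stated in this log (LZCNT implies SSE2, BMI1 implies AVX and LZCNT, AVX2 implies BMI2 and FMA3); it is not x86inc's actual definition list.

```c
#include <assert.h>

enum {
    F_SSE2  = 1 << 0,
    F_LZCNT = (1 << 1) | F_SSE2,          /* LZCNT implies SSE2 */
    F_AVX   = 1 << 2,
    F_BMI1  = (1 << 3) | F_AVX | F_LZCNT, /* BMI1 implies AVX + LZCNT */
    F_BMI2  = 1 << 4,
    F_FMA3  = 1 << 5,
    F_AVX2  = (1 << 6) | F_BMI2 | F_FMA3  /* AVX2 implies BMI2 and FMA3 */
};

/* True when every bit of `need` (including implied ones) is set in `have`. */
static int demo_cpu_has(unsigned have, unsigned need)
{
    assert(need != 0);
    return (have & need) == need;
}
```

The implications only run one way: a function built for `F_AVX2` may use FMA3 instructions, but one built for `F_FMA3` cannot assume AVX2.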
Anton Mitrofanov d991b3e8a8 x86inc: Remove argument from WIN64_RESTORE_XMM
The use of rsp was pretty much hardcoded there and probably didn't work
otherwise with stack_size > 0.
2017-06-09 13:43:01 +02:00
Henrik Gramner cd4ca82459 x86inc: Prefer r14/r15 over r12/r13 on x86-64
Due to a peculiarity in the ModR/M addressing encoding, the r12 and r13
registers sometimes require an additional byte when used as a base register.

r14 and r15 don't have that issue, so prefer using them.
2017-06-09 13:43:00 +02:00
Henrik Gramner 88dcdfad09 x86inc: Make REP_RET identical to RET in SSSE3+ functions
There's no point in emitting a rep prefix before ret on modern CPUs.
2017-06-09 13:43:00 +02:00
Henrik Gramner 406e0ddc0b x86inc: Fix call with memory operands
We overload the `call` instruction with a macro, but it would misbehave when
the macro argument wasn't a valid identifier. Fix it by explicitly checking
if the argument is an identifier.
2017-06-09 13:43:00 +02:00
James Almer 0fbc7a2169 x86/float_dsp: remove usage of integer instructions 2017-05-12 23:34:49 -03:00
James Almer f1d80bc630 x86/float_dsp: add ff_vector_fmul_reverse_avx2
~20% faster than AVX.

Signed-off-by: James Almer <jamrial@gmail.com>
2017-04-11 21:35:35 -03:00
James Almer ed9b25a148 x86/float_dsp: add ff_vector_dmac_scalar_{sse2,avx,fma3} 2017-04-10 12:18:55 -03:00
Clément Bœsch f291a9a1ad Merge commit '99434f4df81b6801b2b535d5b9143305595784f6'
* commit '99434f4df81b6801b2b535d5b9143305595784f6':
  float_dsp: Have implementation match function pointer prototype

Merged-by: Clément Bœsch <cboesch@gopro.com>
2017-03-30 10:23:25 +02:00
James Almer c97e986e90 Merge commit '7911186ed616ae81dd8617d6d0e8b08c818db9d8'
* commit '7911186ed616ae81dd8617d6d0e8b08c818db9d8':
  emms: Give avpriv_emms_yasm() a more general name

Merged-by: James Almer <jamrial@gmail.com>
2017-03-23 18:28:56 -03:00
James Almer 29db87af52 Merge commit '6be7944ee2ec2f045e6eb9a93237e992c8b20ac4'
* commit '6be7944ee2ec2f045e6eb9a93237e992c8b20ac4':
  x86: Add missing colons after assembly labels

Merged-by: James Almer <jamrial@gmail.com>
2017-03-23 18:05:27 -03:00
James Almer d8962ffbd8 avutil/x86util: don't use movss in VBROADCASTSS macro when src and dst args are the same
Reviewed-by: Henrik Gramner <henrik@gramner.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2017-03-21 19:15:00 -03:00