Commit Graph

530 Commits

Author SHA1 Message Date
Andreas Rheinhardt 636631d9db Remove unnecessary libavutil/(avutil|common|internal).h inclusions
Some of these were made possible by moving several common macros to
libavutil/macros.h.

While just at it, also improve the other headers a bit.

Reviewed-by: Martin Storsjö <martin@martin.st>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-02-24 12:56:49 +01:00
Andreas Rheinhardt 6c694074e1 avutil/x86/emms: Don't unnecessarily include lavu/cpu.h
Only include it if it is needed, namely if __MMX__ is undefined.

X86 is currently the only arch where lavu/cpu.h is basically
automatically included (for internal development): #if ARCH_X86
is true, lavu/internal.h (which is basically included everywhere)
includes lavu/x86/emms.h which can mask missing inclusions
of lavu/cpu.h if the developer works on x86/x64. This has happened
in 8e825ec3ab and also earlier
(see 6d2365882f).
By including said header only if necessary ordinary developer machines
will behave like non-x86 arches, so that missing inclusions of cpu.h
won't go unnoticed any more.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-02-21 12:37:51 +01:00
Alexander Kanavin 91326dc942 libavutil: include assembly with full path from source root
Otherwise nasm writes the full host-specific paths into .o
output, which breaks binary reproducibility.

Signed-off-by: Alexander Kanavin <alex.kanavin@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2022-02-08 10:42:26 +01:00
Lynne 3bbe9c5e38
lavu/tx: refactor assembly codelet definition
This commit does some refactoring to make defining assembly codelets
smaller, and fixes compiler redefinition warnings. It also allows
for other assembly versions to reuse the same boilerplate code as
x86.

Finally, it also adds the out_of_place flag to all assembly codelets.
This changes nothing, as out-of-place operation was assumed to be
available anyway, but this makes it more explicit.
2022-02-07 03:56:45 +01:00
Lynne 2e82c61055
x86/tx_float: avoid redefining macros
FFT16_FN was used for fft8 and for fft16 afterwards.
2022-02-02 07:51:45 +01:00
Lynne 35080149ef
x86/tx_float: mark AVX2 functions as AVXSLOW
Makes Bulldozer prefer AVX functions rather than AVX2,
which are 64% slower:

AVX:  117653 decicycles in av_tx (fft), 1048535 runs,     41 skips
AVX2: 193385 decicycles in av_tx (fft), 1048561 runs,     15 skips

The only difference between both is that vgatherdpd is used in
the former. We don't want to mark them with the new SLOW_GATHER
flag however, since gathers are still faster on Haswell/Zen 2/3
than plain loads.
2022-01-29 03:08:16 +01:00
Lynne 6c397f6bb5
x86/tx_float: add missing FF_TX_OUT_OF_PLACE flag to functions
This caused smaller length dedicated transforms to not be picked up.
2022-01-27 02:18:35 +01:00
Lynne 9787005846
x86/tx_float: do not build tx_float_init.c if x86 assembly is disabled
This broke builds with --disable-mmx, which also disabled assembly
entirely, but ARCH_X86 was still true, so the init file tried to find
assembly that didn't exist.
Instead of checking for architecture, check if external x86 assembly
is enabled.
2022-01-27 02:17:46 +01:00
Lynne 28bff6ae54
x86/tx_float: add permute-free FFT versions
These are used in the PFA transforms and MDCTs.
2022-01-26 04:13:58 +01:00
Lynne ef4bd81615
lavu/tx: rewrite internal code as a tree-based codelet constructor
This commit rewrites the internal transform code into a constructor
that stitches transforms (codelets).
This allows for transforms to reuse arbitrary parts of other
transforms, and allows transforms to be stacked onto one
another (such as a full iMDCT using a half-iMDCT which in turn
uses an FFT). It also permits for each step to be individually
replaced by assembly or a custom implementation (such as an ASIC).
2022-01-26 04:12:44 +01:00
James Almer 8c2d2fd6cc avutil/cpu: move slow gather checks below in the function
Put them together with other similar slow flag checks.

Signed-off-by: James Almer <jamrial@gmail.com>
2021-12-21 17:51:17 -03:00
Alan Kelly ffbab99f2c libavutil/cpu: Add AV_CPU_FLAG_SLOW_GATHER.
This flag is set on Haswell and earlier and all AMD cpus.
2021-12-21 17:44:44 -03:00
James Almer 67b92d68c6 x86/intmath: add VEX encoded versions of av_clipf() and av_clipd()
Prevents mixing inlined SSE instructions and AVX instructions when the compiler
generates the latter.

Signed-off-by: James Almer <jamrial@gmail.com>
2021-11-19 11:21:03 -03:00
Mark Reid c3502f4f75 libavutil/common: clip nan value to amin
Changes av_clipf to return amin if a is nan.
Before if a is nan av_clipf_c returned nan and
av_clipf_sse would return amax. Now the both
should behave the same.

This works because nan > amin is false.
The max(nan, amin) will be amin.

Signed-off-by: James Almer <jamrial@gmail.com>
2021-11-15 16:50:08 -03:00
Lynne 997f9bdb99
x86/tx_float: correctly load the transform length
The field is a standard field, yet we were loading it as if it was
a quadword. This worked for forward transforms by chance, but broke
when the transform was inverse.
checkasm couldn't catch that because we only test forward transforms,
which are identical to inverse transforms but with a different revtab.
2021-07-18 15:04:57 +02:00
James Almer 7a6ea6ce2a x86/tx_float: remove ff_ prefix from external constant tables
Fixes compilation with some assemblers.

Reviewed-by: Lynne
Signed-off-by: James Almer <jamrial@gmail.com>
2021-04-25 18:42:38 -03:00
Lynne bb40f800bd
x86/tx_float: fix forgotten 2-argument mulps
Yasm *really* cannot deal with any omitted arguments at all.
2021-04-24 22:33:42 +02:00
Lynne e2cf0a1f68
x86/tx_float: use all arguments on vperm2f and vpermilps and reindent comments
Apparently even old nasm isn't required to accept incomplete instructions.
2021-04-24 22:21:13 +02:00
James Almer fddddc7ec2 x86/tx_float: Fixes compilation with old yasm
Use three operand format on some instructions, and lea to load effective
addresses of tables.

Signed-off-by: James Almer <jamrial@gmail.com>
2021-04-24 17:02:31 -03:00
Lynne e448a4b4ea
lavu/x86/tx_float: fix FMA3 implying AVX2 is available
It's the other way around - AVX2 implies FMA3 is available.
2021-04-24 19:00:27 +02:00
Lynne 119a3f7e8d
lavu/x86: add FFT assembly
This commit adds a pure x86 assembly SIMD version of the FFT in libavutil/tx.
The design of this pure assembly FFT is pretty unconventional.

On the lowest level, instead of splitting the complex numbers into
real and imaginary parts, we keep complex numbers together but split
them in terms of parity. This saves a number of shuffles in each transform,
but more importantly, it splits each transform into two independent
paths, which we process using separate registers in parallel.
This allows us to keep all units saturated and lets us use all available
registers to avoid dependencies.
Moreover, it allows us to double the granularity of our per-load permutation,
skipping many expensive lookups and allowing us to use just 4 loads per register,
rather than 8, or in case FMA3 (and by extension, AVX2), use the vgatherdpd
instruction, which is at least as fast as 4 separate loads on old hardware,
and quite a bit faster on modern CPUs).

Higher up, we go for a bottom-up construction of large transforms, foregoing
the traditional per-transform call-return recursion chains. Instead, we always
start at the bottom-most basis transform (in this case, a 32-point transform),
and continue constructing larger and larger transforms until we return to the
top-most transform.
This way, we only touch the stack 3 times per a complete target transform:
once for the 1/2 length transform and two times for the 1/4 length transform.

The combination algorithm we use is a standard Split-Radix algorithm,
as used in our C code. Although a version with less operations exists
(Steven G. Johnson and Matteo Frigo's "A modified split-radix FFT with fewer
arithmetic operations", IEEE Trans. Signal Process. 55 (1), 111–119 (2007),
which is the one FFTW uses), it only has 2% less operations and requires at least 4x
the binary code (due to it needing 4 different paths to do a single transform).
That version also has other issues which prevent it from being implemented
with SIMD code as efficiently, which makes it lose the marginal gains it offered,
and cannot be performed bottom-up, requiring many recursive call-return chains,
whose overhead adds up.

We go through a lot of effort to minimize load/stores by keeping as much in
registers in between construcring transforms. This saves us around 32 cycles,
on paper, but in reality a lot more due to load/store aliasing (a load from a
memory location cannot be issued while there's a store pending, and there are
only so many (2 for Zen 3) load/store units in a CPU).
Also, we interleave coefficients during the last stage to save on a store+load
per register.

Each of the smallest, basis transforms (4, 8 and 16-point in our case)
has been extremely optimized. Our 8-point transform is barely 20 instructions
in total, beating our old implementation 8-point transform by 1 instruction.
Our 2x8-point transform is 23 instructions, beating our old implementation by
6 instruction and needing 50% less cycles. Our 16-point transform's combination
code takes slightly more instructions than our old implementation, but makes up
for it by requiring a lot less arithmetic operations.

Overall, the transform was optimized for the timings of Zen 3, which at the
time of writing has the most IPC from all documented CPUs. Shuffles were
preferred over arithmetic operations due to their 1/0.5 latency/throughput.

On average, this code is 30% faster than our old libavcodec implementation.
It's able to trade blows with the previously-untouchable FFTW on small transforms,
and due to its tiny size and better prediction, outdoes FFTW on larger transforms
by 11% on the largest currently supported size.
2021-04-24 17:19:18 +02:00
Andreas Rheinhardt f3c197b129 Include attributes.h directly
Some files currently rely on libavutil/cpu.h to include it for them;
yet said file won't use include it any more after the currently
deprecated functions are removed, so include attributes.h directly.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2021-04-19 14:34:10 +02:00
Henrik Gramner 0b2b03568f avutil/x86inc: fix warnings when assembling with Nasm 2.15
Some new warnings regarding use of empty macro parameters has
been added, so adjust some x86inc code to silence those.

Fixes part of ticket #8771

Signed-off-by: James Almer <jamrial@gmail.com>
2020-07-12 11:30:23 -03:00
Martin Storsjö 1001b6a750 libavutil: x86: Include stdlib.h before using _byteswap_ulong
When clang works in MSVC mode, it does have the _byteswap_ulong
builtin, but one has to include stdlib.h before using it.

Signed-off-by: Martin Storsjö <martin@martin.st>
2020-01-23 18:30:26 +02:00
James Almer 9d002d7818 x86/float_dsp: add ff_vector_dmul_{sse2,avx}
~3x to 5x faster.

Signed-off-by: James Almer <jamrial@gmail.com>
2018-09-14 12:54:42 -03:00
James Almer 481741ece0 x86/pixelutils: don't use the AVX2 functions on CPUs known to be slow with them
Signed-off-by: James Almer <jamrial@gmail.com>
2018-07-31 22:14:53 -03:00
James Almer d5b3077ecf x86/pixelutils: add missing preprocessor wrapper to the AVX2 functions
Should fix compilation with old yasm/nasm

Signed-off-by: James Almer <jamrial@gmail.com>
2018-07-31 22:14:42 -03:00
Jun Zhao d36b8394f4 avutil/pixelutils: sad_32x32 sse2/avx2 optimizations.
add ff_pixelutils_sad_32x32_sse2, ff_pixelutils_sad_{a,u}_32x32_sse2,
ff_pixelutils_sad_32x32_avx22, ff_pixelutils_sad_{a,u}_32x32_avx2

use perf record/report profiling, get instructions:u for avx2 sad_32x32:

  72.05%  pixelutils  pixelutils     [.] block_sad_32x32_c
  18.50%  pixelutils  pixelutils     [.] block_sad_16x16_c
   4.78%  pixelutils  pixelutils     [.] block_sad_8x8_c
   2.69%  pixelutils  pixelutils     [.] block_sad_4x4_c
   0.89%  pixelutils  pixelutils     [.] block_sad_2x2_c
   0.16%  pixelutils  pixelutils     [.] ff_pixelutils_sad_32x32_avx2
   0.16%  pixelutils  pixelutils     [.] ff_pixelutils_sad_u_32x32_avx2
   0.12%  pixelutils  pixelutils     [.] ff_pixelutils_sad_a_32x32_avx2

sse2 sad_32x32 instructions:u like:

  71.86%  pixelutils  pixelutils     [.] block_sad_32x32_c
  18.42%  pixelutils  pixelutils     [.] block_sad_16x16_c
   4.81%  pixelutils  pixelutils     [.] block_sad_8x8_c
   2.68%  pixelutils  pixelutils     [.] block_sad_4x4_c
   0.88%  pixelutils  pixelutils     [.] block_sad_2x2_c
   0.29%  pixelutils  pixelutils     [.] ff_pixelutils_sad_32x32_sse2
   0.26%  pixelutils  pixelutils     [.] ff_pixelutils_sad_u_32x32_sse2
   0.23%  pixelutils  pixelutils     [.] ff_pixelutils_sad_a_32x32_sse2

Signed-off-by: Jun Zhao <mypopydev@gmail.com>
2018-07-31 19:17:51 +08:00
alexander schmid b23c4a9dbd lavu/x86/cpu: Fix aesni detection 2018-07-19 20:17:44 +02:00
Jun Zhao 09628cb1b4 avutil/pixelutils: correct the function name in comments
Signed-off-by: Jun Zhao <mypopydev@gmail.com>
2018-07-11 20:12:33 +08:00
James Almer 35347e7e9b Merge commit '4cf84e254ae75b524e1cacae499a97d7cc9e5906'
* commit '4cf84e254ae75b524e1cacae499a97d7cc9e5906':
  Drop some unnecessary config.h #includes

Merged-by: James Almer <jamrial@gmail.com>
2018-02-11 23:08:48 -03:00
Diego Biurrun 4cf84e254a Drop some unnecessary config.h #includes 2018-02-06 10:03:15 +01:00
Henrik Gramner 6f62b0bd4f x86inc: Drop cpuflags_slowctz 2018-01-20 19:23:37 +01:00
Henrik Gramner eb5f063e7c x86inc: Correctly set mmreg variables 2018-01-20 19:23:37 +01:00
Henrik Gramner 6b6edd1216 x86inc: Support creating global symbols from local labels
On ELF platforms such symbols needs to be flagged as functions with the
correct visibility to please certain linkers in some scenarios.
2018-01-20 19:23:37 +01:00
Henrik Gramner 9e4b3675f2 x86inc: Use .rdata instead of .rodata on Windows
The standard section for read-only data on Windows is .rdata. Nasm will
flag non-standard sections as executable by default which isn't ideal.
2018-01-20 19:23:37 +01:00
Henrik Gramner 3a02cbe3fa x86inc: Enable AVX emulation for floating-point pseudo-instructions
There are 32 pseudo-instructions for each floating-point comparison
instruction, but only 8 of them are actually valid in legacy-encoded mode.
The remaining 24 requires the use of VEX-encoded (v-prefixed) instructions
and can therefore be disregarded for this purpose.
2018-01-20 19:23:37 +01:00
James Almer 90d216cb90 x86inc: set the correct amount of simd regs in x86_64 when avx512 is enabled but not used
Fixes compilation of libavresample/x86/audio_mix.asm

Reviewed-by: Gramner
Signed-off-by: James Almer <jamrial@gmail.com>
2017-12-24 23:02:54 -03:00
Henrik Gramner f7197f68dc x86inc: AVX-512 support
AVX-512 consists of a plethora of different extensions, but in order to keep
things a bit more manageable we group together the following extensions
under a single baseline cpu flag which should cover SKL-X and future CPUs:
 * AVX-512 Foundation (F)
 * AVX-512 Conflict Detection Instructions (CD)
 * AVX-512 Byte and Word Instructions (BW)
 * AVX-512 Doubleword and Quadword Instructions (DQ)
 * AVX-512 Vector Length Extensions (VL)

On x86-64 AVX-512 provides 16 additional vector registers, prefer using
those over existing ones since it allows us to avoid using `vzeroupper`
unless more than 16 vector registers are required. They also happen to
be volatile on Windows which means that we don't need to save and restore
existing xmm register contents unless more than 22 vector registers are
required.

Big thanks to Intel for their support.
2017-12-24 22:02:41 +01:00
James Darnley e2218ed8ce avutil: add alignment needed for AVX-512 2017-12-24 22:02:41 +01:00
James Darnley 4783a01c11 avutil: detect when AVX-512 is available 2017-12-24 22:02:41 +01:00
James Darnley 8b81eabe57 avutil: add AVX-512 flags 2017-12-24 22:02:41 +01:00
Martin Vignali b37196adff avutil/x86util : add macro for loading a 128 bits constants in an xmm or in each part of an ymm in order to simplify avx2 asm func 2017-12-02 18:25:15 +01:00
Dale Curtis 50e30d9bb7 Don't use _tzcnt instrinics with clang for windows w/o BMI.
Technically _tzcnt* intrinsics are only available when the BMI
instruction set is present. However the instruction encoding
degrades to "rep bsf" on older processors.

Clang for Windows debatably restricts the _tzcnt* instrinics behind
the __BMI__ architecture define, so check for its presence or
exclude the usage of these intrinics when clang is present.

See also:
https://ffmpeg.org/pipermail/ffmpeg-devel/2015-November/183404.html
https://bugs.llvm.org/show_bug.cgi?id=30506
http://lists.llvm.org/pipermail/cfe-dev/2016-October/051034.html

Signed-off-by: Dale Curtis <dalecurtis@chromium.org>
Reviewed-by: Matt Oliver <protogonoi@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2017-10-25 21:50:37 +02:00
James Almer 2904db9045 Merge commit '994c4bc10751e39c7ed9f67ffd0c0dea5223daf2'
* commit '994c4bc10751e39c7ed9f67ffd0c0dea5223daf2':
  x86util: Port all macros to cpuflags

See d5f8a642f6

Merged-by: James Almer <jamrial@gmail.com>
2017-10-21 12:15:57 -03:00
James Almer 3d828c9fd5 cpu: split flag checks per arch in av_cpu_max_align()
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
2017-10-09 11:48:24 +02:00
James Almer 3b345d389b avutil/cpu: split flag checks per arch in av_cpu_max_align()
Signed-off-by: James Almer <jamrial@gmail.com>
2017-09-27 23:10:09 -03:00
James Almer 0c005fa86f Merge commit '7abdd026df6a9a52d07d8174505b33cc89db7bf6'
* commit '7abdd026df6a9a52d07d8174505b33cc89db7bf6':
  asm: Consistently uppercase SECTION markers

Merged-by: James Almer <jamrial@gmail.com>
2017-09-26 18:48:06 -03:00
Ivan Kalvachev 30ae07d7ef Add macros to x86util.asm .
Improved version of VBROADCASTSS that works like the avx2 instruction.
Emulation of vpbroadcastd.
Horizontal sum HSUMPS that places the result in all elements.
Emulation of blendvps and pblendvb.

Signed-off-by: Ivan Kalvachev <ikalvachev@gmail.com>
2017-08-18 17:18:32 +01:00
James Almer 4d62ee6746 x86inc: don't use read-only data sections on COFF targets
Yasm:
src/libavfilter/x86/af_volume.asm:24: warning: Standard COFF does not support read-only data sections
src/libavfilter/x86/af_volume.asm:24: warning: Unrecognized qualifier `align'

Nasm:
src/libavfilter/x86/af_volume.asm:24: error: standard COFF does not support section alignment specification
src/libavutil/x86/x86inc.asm:92: ... from macro `SECTION_RODATA' defined here

Tested-by: Clément Bœsch <u@pkh.me>
Signed-off-by: James Almer <jamrial@gmail.com>
2017-06-27 12:48:04 -03:00