Commit Graph

53 Commits

Author SHA1 Message Date
Lynne bbe95f7353
x86: replace explicit REP_RETs with RETs
From x86inc:
> On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either
> a branch or a branch target. So switch to a 2-byte form of ret in that case.
> We can automatically detect "follows a branch", but not a branch target.
> (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.)

x86inc can automatically determine whether to use REP_RET rather than
REP in most of these cases, so impact is minimal. Additionally, a few
REP_RETs were used unnecessary, despite the return being nowhere near a
branch.

The only CPUs affected were AMD K10s, made between 2007 and 2011, 16
years ago and 12 years ago, respectively.

In the future, everyone involved with x86inc should consider dropping
REP_RETs altogether.
2023-02-01 04:23:55 +01:00
James Almer bda3a9faf4 x86/float_dsp: use three operand form for some instructions
Fixes compilation with old yasm

Signed-off-by: James Almer <jamrial@gmail.com>
2022-09-13 13:50:09 -03:00
Paul B Mahol 72acff9f59 avutil/x86/float_dsp: add fma3 for scalarproduct 2022-09-13 17:43:15 +02:00
Andreas Rheinhardt 2718a3be1f avutil/x86/float_dsp: Remove obsolete 3dnowext function
x64 always has MMX, MMXEXT, SSE and SSE2 and this means
that some functions for MMX, MMXEXT, SSE and 3dnow are always
overridden by other functions (unless one e.g. explicitly
disables SSE2). So given that the only systems which benefit
from ff_vector_fmul_window_3dnowext are truely ancient 32bit
AMD x86s it is removed.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-06-22 13:37:22 +02:00
Alexander Kanavin 91326dc942 libavutil: include assembly with full path from source root
Otherwise nasm writes the full host-specific paths into .o
output, which breaks binary reproducibility.

Signed-off-by: Alexander Kanavin <alex.kanavin@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2022-02-08 10:42:26 +01:00
James Almer 9d002d7818 x86/float_dsp: add ff_vector_dmul_{sse2,avx}
~3x to 5x faster.

Signed-off-by: James Almer <jamrial@gmail.com>
2018-09-14 12:54:42 -03:00
James Almer 0fbc7a2169 x86/float_dsp: remove usage of integer instructions 2017-05-12 23:34:49 -03:00
James Almer f1d80bc630 x86/float_dsp: add ff_vector_fmul_reverse_avx2
~20% faster than AVX.

Signed-off-by: James Almer <jamrial@gmail.com>
2017-04-11 21:35:35 -03:00
James Almer ed9b25a148 x86/float_dsp: add ff_vector_dmac_scalar_{sse2,avx,fma3} 2017-04-10 12:18:55 -03:00
James Almer dc79824deb x86/float_dsp: zero extend offset from ff_scalarproduct_float_sse
Reviewed-by: Christophe Gisquet <christophe.gisquet@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-01-08 16:14:32 -03:00
James Almer 4ee38ed7f9 x86/float_dsp: zero extend len from ff_butterflies_float_sse implicitly
Reviewed-by: Christophe Gisquet <christophe.gisquet@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-01-08 16:14:27 -03:00
James Almer 7f520524f6 x86/float_dsp: remove len check from ff_butterflies_float_sse
The function documentation explicitly mentions it needs to be a multiple of 4.

Reviewed-by: Christophe Gisquet <christophe.gisquet@gmail.com>
Signed-off-by: James Almer <jamrial@gmail.com>
2016-01-08 16:14:24 -03:00
James Almer 4d2c014a8f x86/float_dsp: add missing colon to labels
Silences warnings with Nasm

Signed-off-by: James Almer <jamrial@gmail.com>
2015-07-26 02:51:08 -03:00
James Almer 85065d2a7c x86/float_dsp: add missing femms
It was lost during the port.
Should fix fate on 3dnowext machines.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-08 20:06:28 +02:00
James Almer dcaf9660b6 x86/float_dsp: port vector_fmul_window to yasm
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-06-08 12:41:32 +02:00
James Almer 3b06208a57 x86/float_dsp: remove duplicated code from vector_dmul_scalar
Use the xm# and ym# aliases as they remain in sync with m# after a SWAP.
No actual changes to the assembly.

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-19 14:21:51 +02:00
James Almer 11b36b1ee0 x86/float_dsp: unroll loop in vector_fmac_scalar
~6% faster SSE2 performance. AVX/FMA3 are unaffected.

Signed-off-by: James Almer <jamrial@gmail.com>
Reviewed-by: Christophe Gisquet <christophe.gisquet@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-16 18:36:52 +02:00
James Almer 3b808900af x86/float_dsp: use SWAP in vector_fmac_scalar Win64
The mova is unnecessary

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-04-16 15:46:21 +02:00
James Almer 7d7487e85c x86/float_dsp: add ff_vector_{fmul_add, fmac_scalar}_fma3
~7% faster than AVX

Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-03-13 04:34:05 +01:00
Christophe Gisquet 133b34207c x86: float dsp: unroll SSE versions
vector_fmul and vector_fmac_scalar are guaranteed that they can process in
batch of 16 elements, but their SSE versions only does 8 at a time.

Therefore, unroll them a bit.
299 to 261c for 256 elements in vector_fmac_scalar on Arrandale/Win64.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2014-02-15 18:54:21 +01:00
Michael Niedermayer e91339cde2 Merge commit '566b7a20fd0cab44d344329538d314454a0bcc2f'
* commit '566b7a20fd0cab44d344329538d314454a0bcc2f':
  x86: float dsp: butterflies_float SSE

Conflicts:
	libavutil/x86/float_dsp.asm

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-05-03 11:57:59 +02:00
Christophe Gisquet 566b7a20fd x86: float dsp: butterflies_float SSE
97c -> 49c
Some codecs could benefit from more unrolling, but AAC doesn't.
2013-05-03 08:08:02 +02:00
Michael Niedermayer 92218aad00 butterflies_float: replace 2 lea by 2 add
adds are simpler instructions and should be faster or equally fast
on all cpus

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-04-17 00:10:06 +02:00
Christophe Gisquet 1a4007964c x86: float dsp: butterflies_float SSE
97c -> 49c
Some codecs could benefit from more unrolling, but AAC doesn't.

Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
2013-04-17 00:03:25 +02:00
Michael Niedermayer 8102f27b5b Merge commit '73b704ac609d83e0be124589f24efd9b94947cf9'
* commit '73b704ac609d83e0be124589f24efd9b94947cf9':
  arm: Add some missing header #includes
  floatdsp: move scalarproduct_float from dsputil to avfloatdsp.

Conflicts:
	libavcodec/acelp_pitch_delay.c
	libavcodec/amrnbdec.c
	libavcodec/amrwbdec.c
	libavcodec/ra288.c
	libavcodec/x86/dsputil_mmx.c
	libavutil/x86/float_dsp.asm

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-01-23 14:31:55 +01:00
Michael Niedermayer 6e6e170898 Merge commit '42d324694883cdf1fff1612ac70fa403692a1ad4'
* commit '42d324694883cdf1fff1612ac70fa403692a1ad4':
  floatdsp: move vector_fmul_reverse from dsputil to avfloatdsp.

Conflicts:
	libavcodec/arm/dsputil_init_vfp.c
	libavcodec/arm/dsputil_vfp.S
	libavcodec/dsputil.c
	libavcodec/ppc/float_altivec.c
	libavcodec/x86/dsputil.asm
	libavutil/x86/float_dsp.asm

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-01-23 14:04:50 +01:00
Michael Niedermayer b1b870fbd7 Merge commit '55aa03b9f8f11ebb7535424cc0e5635558590f49'
* commit '55aa03b9f8f11ebb7535424cc0e5635558590f49':
  floatdsp: move vector_fmul_add from dsputil to avfloatdsp.

Conflicts:
	libavcodec/dsputil.c
	libavcodec/x86/dsputil.asm

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2013-01-23 13:54:34 +01:00
Ronald S. Bultje 42d3246948 floatdsp: move vector_fmul_reverse from dsputil to avfloatdsp.
Now, nellymoserenc and aacenc no longer depends on dsputil. Independent
of this patch, wmaprodec also does not depend on dsputil, so I removed
it from there also.
2013-01-22 11:55:42 -08:00
Ronald S. Bultje 55aa03b9f8 floatdsp: move vector_fmul_add from dsputil to avfloatdsp. 2013-01-22 11:55:42 -08:00
Ronald S. Bultje d56668bd80 floatdsp: move scalarproduct_float from dsputil to avfloatdsp.
This makes the aac decoder and all voice codecs independent of dsputil.
2013-01-22 11:55:42 -08:00
Michael Niedermayer 5c076205a6 Merge remote-tracking branch 'qatar/master'
* qatar/master:
  golomb: use unsigned arithmetics in svq3_get_ue_golomb()
  x86: float_dsp: fix loading of the len parameter on x86-32
  takdec: fix initialisation of LOCAL_ALIGNED array
  takdec: fix initialisation of LOCAL_ALIGNED array

Conflicts:
	libavcodec/rv30.c
	libavcodec/svq3.c
	libavcodec/takdec.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2012-12-08 16:36:47 +01:00
Justin Ruggles 1c012e6bfb x86: float_dsp: fix loading of the len parameter on x86-32 2012-12-07 21:19:29 -05:00
Michael Niedermayer af164d7d9f Merge commit 'c25fc5c2bb6ae8c93541c9427df3e47206d95152'
* commit 'c25fc5c2bb6ae8c93541c9427df3e47206d95152':
  fate: dpcm: Add dependencies
  SBR DSP x86: implement SSE sbr_hf_gen
  AAC SBR: use AVFloatDSPContext's vector_fmul
  fate: image: Add dependencies
  Changelog: add an entry for deprecating the avconv -vol option
  x86: float_dsp: fix compilation of ff_vector_dmul_scalar_avx() on x86-32

Conflicts:
	Changelog
	libavutil/x86/float_dsp.asm
	tests/fate/image.mak

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2012-12-07 15:21:41 +01:00
Michael Niedermayer 15784c2bab Merge commit '9d5c62ba5b586c80af508b5914934b1c439f6652'
* commit '9d5c62ba5b586c80af508b5914934b1c439f6652':
  lavu/opt: do not filter out the initial sign character except for flags
  eval: treat dB as decibels instead of decibytes
  float_dsp: add vector_dmul_scalar() to multiply a vector of doubles

Conflicts:
	libavutil/eval.c
	tests/ref/fate/eval

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2012-12-06 14:33:38 +01:00
Justin Ruggles ecc8b02194 x86: float_dsp: fix compilation of ff_vector_dmul_scalar_avx() on x86-32
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
2012-12-06 14:11:15 +01:00
Justin Ruggles ac7eb4cb20 float_dsp: add vector_dmul_scalar() to multiply a vector of doubles
Include x86-optimized versions for SSE2 and AVX.
2012-12-05 11:23:36 -05:00
Michael Niedermayer b4d4e51027 Merge commit '3c370f5abc55739a261534b9f9bdc739cedbbbb9'
* commit '3c370f5abc55739a261534b9f9bdc739cedbbbb9':
  riff: only warn on a bad INFO chunk code size instead of failing
  configure: Add separate list for libraries and use where appropriate
  x86: float_dsp: add SSE version of vector_fmul_scalar()

Conflicts:
	configure
	libavformat/riff.c
	libavutil/x86/float_dsp.asm

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2012-11-27 14:10:05 +01:00
Justin Ruggles 947f933687 x86: float_dsp: add SSE version of vector_fmul_scalar() 2012-11-26 11:30:19 -05:00
Diego Biurrun 2b479bcab0 build: Drop AVX assembly ifdefs
An assembler able to cope with AVX instructions is now required.
2012-11-11 20:43:28 +01:00
Michael Niedermayer 3174616f59 Merge commit '6860b4081d046558c44b1b42f22022ea341a2a73'
* commit '6860b4081d046558c44b1b42f22022ea341a2a73':
  x86: include x86inc.asm in x86util.asm
  cng: Reindent some incorrectly indented lines
  cngdec: Allow flushing the decoder
  cngdec: Make the dbov variable have the right unit
  cngdec: Fix the memset size to cover the full array
  cngdec: Update the LPC coefficients after averaging the reflection coefficients
  configure: fix print_config() with broke awks

Conflicts:
	libavcodec/x86/ac3dsp.asm
	libavcodec/x86/dct32.asm
	libavcodec/x86/deinterlace.asm
	libavcodec/x86/dsputil.asm
	libavcodec/x86/dsputilenc.asm
	libavcodec/x86/fft.asm
	libavcodec/x86/fmtconvert.asm
	libavcodec/x86/h264_chromamc.asm
	libavcodec/x86/h264_deblock.asm
	libavcodec/x86/h264_deblock_10bit.asm
	libavcodec/x86/h264_idct.asm
	libavcodec/x86/h264_idct_10bit.asm
	libavcodec/x86/h264_intrapred.asm
	libavcodec/x86/h264_intrapred_10bit.asm
	libavcodec/x86/h264_weight.asm
	libavcodec/x86/vc1dsp.asm
	libavcodec/x86/vp3dsp.asm
	libavcodec/x86/vp56dsp.asm
	libavcodec/x86/vp8dsp.asm

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2012-10-31 13:43:33 +01:00
Diego Biurrun 6860b4081d x86: include x86inc.asm in x86util.asm
This is necessary to allow refactoring some x86util macros with cpuflags.
2012-10-31 00:37:42 +01:00
Michael Niedermayer 7beadfe1f7 Merge remote-tracking branch 'qatar/master'
* qatar/master:
  mov_chan: Only set the channel_layout if setting it to a nonzero value
  mov_chan: Reindent an incorrectly indented line
  mp2 muxer: mark as AVFMT_NOTIMESTAMPS.
  x86: float_dsp: fix ff_vector_fmac_scalar_avx() on Win64
  x86: more specific checks for availability of required assembly capabilities
  x86: avcodec: Drop silly "_mmx" suffix from dsputil template names
  fate: Drop redundant setting of FUZZ to 1
  cavsdsp: set idct permutation independently of dsputil
  x86: allow using add_hfyu_median_prediction_cmov on any cpu with cmov

Conflicts:
	libavcodec/x86/dsputil_mmx.c
	libavformat/mp3enc.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2012-09-08 12:53:44 +02:00
Justin Ruggles 7327525997 x86: float_dsp: fix ff_vector_fmac_scalar_avx() on Win64
The SWAP macro does not work for explicit xmm/ymm usage, so instead just move
the scalar value from xmm2 to xmm0.
2012-09-07 14:49:10 -04:00
Michael Niedermayer c617bed34f Merge remote-tracking branch 'qatar/master'
* qatar/master:
  MSS1 and MSS2: set final pixel format after common stuff has been initialised
  MSS2 decoder
  configure: handle --disable-asm before check_deps
  x86: Split inline and external assembly #ifdefs
  configure: x86: Separate inline from standalone assembler capabilities
  pktdumper: Use a custom define instead of PATH_MAX for buffers
  pktdumper: Use av_strlcpy instead of strncpy
  pktdumper: Use sizeof(variable) instead of the direct buffer length

Conflicts:
	Changelog
	configure
	libavcodec/allcodecs.c
	libavcodec/avcodec.h
	libavcodec/codec_desc.c
	libavcodec/dct-test.c
	libavcodec/imgconvert.c
	libavcodec/mss12.c
	libavcodec/version.h
	libavfilter/x86/gradfun.c
	libswscale/x86/yuv2rgb.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2012-08-31 13:34:32 +02:00
Diego Biurrun 17337f54c0 x86: Split inline and external assembly #ifdefs 2012-08-31 01:53:25 +02:00
Michael Niedermayer 2fc7c818cb Merge remote-tracking branch 'qatar/master'
* qatar/master:
  x86: fix build with nasm 2.08
  x86: use nop cpu directives only if supported
  x86: fix rNmp macros with nasm
  build: add trailing / to yasm/nasm -I flags
  x86: use 32-bit source registers with movd instruction
  x86: add colons after labels

Conflicts:
	Makefile
	libavutil/x86/x86inc.asm

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2012-08-07 23:04:55 +02:00
Mans Rullgard a3df4781f4 x86: add colons after labels
nasm prints a warning if the colon is missing.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-08-07 15:20:56 +01:00
Michael Niedermayer c6963a220d Merge remote-tracking branch 'qatar/master'
* qatar/master:
  proresdsp: port x86 assembly to cpuflags.
  lavr: x86: improve non-SSE4 version of S16_TO_S32_SX macro
  lavfi: better channel layout negotiation
  alac: check for truncated packets
  alac: reverse lpc coeff order, simplify filter
  lavr: add x86-optimized mixing functions
  x86: add support for fmaddps fma4 instruction with abstraction to avx/sse
  tscc2: fix typo in array index
  build: use COMPILE template for HOSTOBJS
  build: do full flag handling for all compiler-type tools
  eval: fix printing of NaN in eval fate test.
  build: Rename aandct component to more descriptive aandcttables
  mpegaudio: bury inline asm under HAVE_INLINE_ASM.
  x86inc: automatically insert vzeroupper for YMM functions.
  rtmp: Check the buffer length of ping packets
  rtmp: Allow having more unknown data at the end of a chunk size packet without failing
  rtmp: Prevent reading outside of an allocate buffer when receiving server bandwidth packets

Conflicts:
	Makefile
	configure
	libavcodec/x86/proresdsp.asm
	libavutil/eval.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2012-07-27 23:42:19 +02:00
Ronald S. Bultje 30b45d9c38 x86inc: automatically insert vzeroupper for YMM functions. 2012-07-26 13:43:16 -07:00
Michael Niedermayer cabbd271a5 Merge remote-tracking branch 'qatar/master'
* qatar/master: (24 commits)
  flvdec: remove incomplete, disabled seeking code
  mem: add support for _aligned_malloc() as found on Windows
  lavc: Extend the documentation for avcodec_init_packet
  flvdec: remove incomplete, disabled seeking code
  http: replace atoll() with strtoll()
  mpegts: remove unused/incomplete/broken seeking code
  af_amix: allow float planar sample format as input
  af_amix: use AVFloatDSPContext.vector_fmac_scalar()
  float_dsp: add x86-optimized functions for vector_fmac_scalar()
  float_dsp: Move vector_fmac_scalar() from libavcodec to libavutil
  lavr: Add x86-optimized function for flt to s32 conversion
  lavr: Add x86-optimized function for flt to s16 conversion
  lavr: Add x86-optimized functions for s32 to flt conversion
  lavr: Add x86-optimized functions for s32 to s16 conversion
  lavr: Add x86-optimized functions for s16 to flt conversion
  lavr: Add x86-optimized function for s16 to s32 conversion
  rtpenc: Support packetizing iLBC
  rtpdec: Add a depacketizer for iLBC
  Implement the iLBC storage file format
  mov: Support muxing/demuxing iLBC
  ...

Conflicts:
	Changelog
	configure
	libavcodec/avcodec.h
	libavcodec/dsputil.c
	libavcodec/version.h
	libavformat/movenc.c
	libavformat/mpegts.c
	libavformat/version.h
	libavutil/mem.c

Merged-by: Michael Niedermayer <michaelni@gmx.at>
2012-06-19 20:53:27 +02:00