From x86inc:
> On AMD cpus <=K10, an ordinary ret is slow if it immediately follows either
> a branch or a branch target. So switch to a 2-byte form of ret in that case.
> We can automatically detect "follows a branch", but not a branch target.
> (SSSE3 is a sufficient condition to know that your cpu doesn't have this problem.)
x86inc can automatically determine whether to use REP_RET rather than
REP in most of these cases, so impact is minimal. Additionally, a few
REP_RETs were used unnecessary, despite the return being nowhere near a
branch.
The only CPUs affected were AMD K10s, made between 2007 and 2011, 16
years ago and 12 years ago, respectively.
In the future, everyone involved with x86inc should consider dropping
REP_RETs altogether.
x64 always has MMX, MMXEXT, SSE and SSE2 and this means
that some functions for MMX, MMXEXT, SSE and 3dnow are always
overridden by other functions (unless one e.g. explicitly
disables SSE2). So given that the only systems that
benefit from these functions are truely ancient 32bit x86s
they are removed.
Moreover, some of the removed code was buggy/not bitexact
and lead to failures involving the f32le and f32be versions of
gray, gbrp and gbrap on x86-32 when SSE2 was not disabled.
See e.g.
https://fate.ffmpeg.org/report.cgi?time=20220609221253&slot=x86_32-debian-kfreebsd-gcc-4.4-cpuflags-mmx
Notice that yuv2yuvX_mmx is not removed, because it is used
by SSE3 and AVX2 as fallback in case of unaligned data and
also for tail processing. I don't know why yuv2yuvX_mmxext
isn't being used for this; an earlier version [1] of
554c2bc708 used it, but
the version that was eventually applied does not.
[1]: https://ffmpeg.org/pipermail/ffmpeg-devel/2020-November/272124.html
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
256 bits is just wide enough to fit all the operands needed to vectorize
the software implementation, but AVX2 is needed to for a couple of
instructions like cross-lane permutation.
Output is bit-for-bit identical to C.
Signed-off-by: Nelson Gomez <nelson.gomez@microsoft.com>
* commit '6860b4081d046558c44b1b42f22022ea341a2a73':
x86: include x86inc.asm in x86util.asm
cng: Reindent some incorrectly indented lines
cngdec: Allow flushing the decoder
cngdec: Make the dbov variable have the right unit
cngdec: Fix the memset size to cover the full array
cngdec: Update the LPC coefficients after averaging the reflection coefficients
configure: fix print_config() with broke awks
Conflicts:
libavcodec/x86/ac3dsp.asm
libavcodec/x86/dct32.asm
libavcodec/x86/deinterlace.asm
libavcodec/x86/dsputil.asm
libavcodec/x86/dsputilenc.asm
libavcodec/x86/fft.asm
libavcodec/x86/fmtconvert.asm
libavcodec/x86/h264_chromamc.asm
libavcodec/x86/h264_deblock.asm
libavcodec/x86/h264_deblock_10bit.asm
libavcodec/x86/h264_idct.asm
libavcodec/x86/h264_idct_10bit.asm
libavcodec/x86/h264_intrapred.asm
libavcodec/x86/h264_intrapred_10bit.asm
libavcodec/x86/h264_weight.asm
libavcodec/x86/vc1dsp.asm
libavcodec/x86/vp3dsp.asm
libavcodec/x86/vp56dsp.asm
libavcodec/x86/vp8dsp.asm
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Add support for all x86-64 registers
Prefer caller-saved register over callee-saved on WIN64
Support up to 15 function arguments
Also (by Ronald S. Bultje)
Fix up our asm to work with new x86inc.asm.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Justin Ruggles <justin.ruggles@gmail.com>
* qatar/master: (27 commits)
cmdutils: use new avcodec_is_decoder/encoder() functions.
lavc: make codec_is_decoder/encoder() public.
lavc: deprecate AVCodecContext.sub_id.
libcdio: add a forgotten AVClass to the private context.
swscale: remove "cpu flags" from -sws_flags description.
proresenc: give user a possibility to alter some encoding parameters
vorbisenc: add output buffer overwrite protection
libopencore-amrnbenc: fix end-of-stream handling
ra144enc: fix end-of-stream handling
nellymoserenc: zero any leftover packet bytes
nellymoserenc: use proper MDCT overlap delay
qpeg: Use bytestream2 functions to prevent buffer overreads.
swscale: make %rep unconditional.
vp8: convert simple loopfilter x86 assembly to use named arguments.
vp8: convert idct x86 assembly to use named arguments.
vp8: convert mc x86 assembly to use named arguments.
vp8: convert loopfilter x86 assembly to use cpuflags().
vp8: convert idct/mc x86 assembly to use cpuflags().
swscale: remove now unnecessary hack.
x86inc: don't "bake" stack_offset in named arguments.
...
Conflicts:
cmdutils.c
doc/APIchanges
libavcodec/mpeg12.c
libavcodec/options.c
libavcodec/qpeg.c
libavcodec/utils.c
libavcodec/version.h
libavdevice/libcdio.c
tests/lavf-regression.sh
Merged-by: Michael Niedermayer <michaelni@gmx.at>
They were introduced in an earlier commit that introduced use of named
arguments. One cause was a typo, a second cause appears to be a bug in
x264asm that I work around by not using named arguments.
* qatar/master:
swscale: convert yuv2yuvX() to using named arguments.
swscale: rename "dstw" to "w" to prevent name collisions.
swscale: use named registers in yuv2yuv1_plane() place.
lavf: fix aspect ratio mismatch message.
avconv: set AVFormatContext.duration from '-t'
cljr: implement encode2.
cljr: set the properties of the coded_frame, not input frame.
dnxhdenc: switch to encode2.
bmpenc: switch to encode2().
Conflicts:
libavcodec/bmpenc.c
libavcodec/cljr.c
libavformat/utils.c
tests/ref/vsynth1/cljr
tests/ref/vsynth2/cljr
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
pixdesc: mark pseudopaletted formats with a special flag.
avconv: switch to avcodec_encode_video2().
libx264: implement encode2().
libx264: split extradata writing out of encode_nals().
lavc: add avcodec_encode_video2() that encodes from an AVFrame -> AVPacket
cmdutils: update copyright year to 2012.
swscale: sign-extend integer function argument to qword on x86-64.
x86inc: support yasm -f win64 flag also.
h264: manually save/restore XMM registers for functions using INIT_MMX.
x86inc: allow manual use of WIN64_SPILL_XMM.
aacdec: Use correct speaker order for 7.1.
aacdec: Remove incorrect comment.
aacdec: Simplify output configuration.
Remove Sun medialib glue code.
dsputil: set STRIDE_ALIGN to 16 for x86 also.
pngdsp: swap argument inversion.
Conflicts:
cmdutils.c
configure
doc/APIchanges
ffmpeg.c
libavcodec/aacdec.c
libavcodec/dsputil.h
libavcodec/libx264.c
libavcodec/mlib/dsputil_mlib.c
libavcodec/utils.c
libavfilter/vf_scale.c
libavutil/avutil.h
libswscale/mlib/yuv2rgb_mlib.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
swscale: make yuv2yuv1 use named registers.
h264: mark h264_idct_add8_10 with number of XMM registers.
swscale: fix V plane memory location in bilinear/unscaled RGB/YUYV case.
vp8: always update next_framep[] before returning from decode_frame().
avconv: estimate next_dts from framerate if it is set.
avconv: better next_dts usage.
avconv: rename InputStream.pts to last_dts.
avconv: reduce overloading for InputStream.pts.
avconv: rename InputStream.next_pts to next_dts.
avconv: rework -t handling for encoding.
avconv: set encoder timebase for subtitles.
pva-demux test: add -vn
swscale: K&R formatting cosmetics for SPARC code
apedec: allow the user to set the maximum number of output samples per call
apedec: do not unnecessarily zero output samples for mono frames
apedec: allocate a single flat buffer for decoded samples
apedec: use sizeof(field) instead of sizeof(type)
swscale: split C output functions into separate file.
swscale: Split C input functions into separate file.
bytestream: Add bytestream2 writing API.
The avconv changes are due to massive regressions and bugs not merged yet.
Conflicts:
ffmpeg.c
libavcodec/vp8.c
libswscale/swscale.c
libswscale/x86/swscale_template.c
tests/fate/demux.mak
tests/ref/lavf/asf
tests/ref/lavf/avi
tests/ref/lavf/mkv
tests/ref/lavf/mpg
tests/ref/lavf/nut
tests/ref/lavf/ogg
tests/ref/lavf/rm
tests/ref/lavf/ts
tests/ref/seek/lavf_avi
tests/ref/seek/lavf_mkv
tests/ref/seek/lavf_rm
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* qatar/master:
sgidec: Use bytestream2 functions to prevent buffer overreads.
cosmetics: Move static and inline attributes to more standard places.
configure: provide libavfilter/version.h header to get_version()
swscale: change yuv2yuvX code to use cpuflag().
libx264: Don't leave max_b_frames as -1 if the user didn't set it
FATE: convert output to rgba for the targa tests which currently output pal8
fate: add missing reference files for targa tests in 9c2f9b0e2
FATE: enable the 2 remaining targa conformance suite tests
targa: add support for rgb555 palette
FATE: fix targa tests on big-endian systems
Conflicts:
libavcodec/sgidec.c
libavcodec/targa.c
libswscale/x86/output.asm
tests/fate/image.mak
Merged-by: Michael Niedermayer <michaelni@gmx.at>