Add support for all x86-64 registers
Prefer caller-saved register over callee-saved on WIN64
Support up to 15 function arguments
Also (by Ronald S. Bultje)
Fix up our asm to work with new x86inc.asm.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Signed-off-by: Justin Ruggles <justin.ruggles@gmail.com>
Revert "swscale: update context offsets after removal of AlpMmxFilter."
(commit a95e3fa90b)
and
Revert "swscale: Remove some write-only variables related to alpha handling."
(commit 9d03cb9fc5).
They broke alpha handling - it's the evil inline asm that still uses that
variable, so it's not truely write-only.
Disadvantage is that it no longer allows modifying brightness through
adjustment of the RGB lookup table. Advantage is that now monowhite/black
no longer need to be identified as a RGB format.
They were introduced in an earlier commit that introduced use of named
arguments. One cause was a typo, a second cause appears to be a bug in
x264asm that I work around by not using named arguments.
At very small dimensions, this calculation could lead to zero-sized
filters, which leads to uninitialized output, zero-sized allocations,
loop overflows in SIMD that uses do{..}while(i++<filtersize); instead
of for(i=0;i<filtersize;i++){..} and several other similar failures.
Therefore, require a minimum filtersize of 1.
Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
CC: libav-stable@libav.org
This will be useful to test more aggressively for failures to mark XMM
registers as clobbered in Win64 builds, and prevent regressions thereof.
Based on a patch by Ramiro Polla <ramiro.polla@gmail.com>
Fixes problems where rgbToRgbWrapper() is called even though it doesn't
support this particular conversion (e.g. converting from RGB444 to
anything). Thirdly, fixes issues where rgbToRgbWrapper() is called for
non-native endiannness conversions (e.g. RGB555BE on a LE system).
Fourthly, fixes crashes when converting from e.g. monowhite to
monowhite, which calls planarCopyWrapper() and overwrites/reads because
n_bytes != n_pixels.
This fixes integer multiplication overflows in RGB48 output
(vertical) scaling as detected by IOC. What happens is that for
certain types of filters (lanczos, spline, bicubic), the
intermediate sum of coefficients in the middle of a filter can
be larger than the fixed-point equivalent of 1.0, even if the
final sum is 1.0. This is fine and we support that.
However, at frame edges, initFilter() will merge the coefficients
for the off-screen pixels into the top or bottom pixel, such as
to emulate edge extension. This means that suddenly, a single
coefficient can be larger than the fixed-point equivalent of
1.0, which the vertical scaling routines do not support.
Therefore, remove the merging of coefficients for edges for
the vertical scaling filter, and instead add edge detection
to the scaler itself so that it copies the pointers (not data)
for the edges (i.e. it uses line[0] for line[-1] as well), so
that a single coefficient is never larger than the fixed-point
equivalent of 1.0.
This fixes the same overflow as in the RGB48/16-bit YUV scaling;
some filters can overflow both negatively and positively (e.g.
spline/lanczos), so we bias a signed integer so it's "half signed"
and "half unsigned", and can cover overflows in both directions
while maintaining full 31-bit depth.
Signed-off-by: Mans Rullgard <mans@mansr.com>
We're shifting individual components (8-bit, unsigned) left by 24,
so making them unsigned should give the same results without the
overflow.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
For certain types of filters where the intermediate sum of coefficients
can go above the fixed-point equivalent of 1.0 in the middle of a filter,
the sum of a 31-bit calculation can overflow in both directions and can
thus not be represented in a 32-bit signed or unsigned integer. To work
around this, we subtract 0x40000000 from a signed integer base, so that
we're halfway signed/unsigned, which makes it fit even if it overflows.
After the filter finishes, we add the scaled bias back after a shift.
We use the same trick for 16-bit bpc YUV output routines.
Signed-off-by: Mans Rullgard <mans@mansr.com>
As old bits are shifted out of the accumulator, they cause signed
overflows when they reach the end. Making the variable unsigned fixes
this.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This was removed erroneously in
046f081b46. This define still is
necessary for getting MAP_ANONYMOUS defined on linux/glibc,
despite the define reshuffling done in that commit.
Without MAP_ANONYMOUS defined, the mprotect calls for setting the
generated mmx2 scaler code pages executable are left out, causing
crashes if that codepath is chosen.
This patch fixes scaling from 192x144 to 320x240 with
-sws_flags fast_bilinear, which crashes on linux at the
moment.
Signed-off-by: Martin Storsjö <martin@martin.st>
Although gcc guarantees 16 byte stack alignment, threads under WinXP
don't appear to be guaranteed to start stack aligned. So fix the
alignment.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Altivec does unaligned reads from this buffer in
hscale_altivec_real(), and can thus read up to 16 bytes beyond
the end of the buffer. Therefore, add an extra 16 bytes of
padding at the end of the conversion buffer.
This fixes fate-lavfi-pixfmts_scale on AltiVec-enabled builds
under valgrind.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
Use uintptr_t instead of plain int. Without this change, the
comparisons will come out wrong for pointers in certain ranges.
Fixes random failures on ppc64. Also fixes some compiler warnings.
Signed-off-by: Mans Rullgard <mans@mansr.com>
SSE-optimized hScale() scales up to 4 pixels at once, so we need to
allocate up to 3 padding pixels to prevent overreads. This fixes
valgrind errors in various swscale-tests on fate.
Speed: from 3.9x to 9.6x speed improvement over C, and some small
(up to 15%) speed improvements over existing MMX code (particularly
for bigger filters).
This allows using more specific implementations for chroma/luma, e.g.
we can make assumptions on filterSize being constant, thus avoiding
that test at runtime.
It just does that part in scalar form, I doubt using a vector store
over 2 array would speed it up particularly.
The function should be written to not use a scratch buffer.
The logged information is possibly false, and it tends to be outdated
after each change since the logging code needs to be manually updated.
Simplify and prevent confusing wrong debug messages.
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Also remove the unnecessary isSupportedIn/Out macros.
Make the code more compact/readable, and simplify the access to
lsws-specific pixel format information.
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
When converting RGB format to RGB format with the same bits per sample,
unscaled path performs conversion on the whole buffer at once. For
non-multiple-of-16 BGR24 to RGB24 conversion it means that padding at the
end of line will be converted too. Since it may be of arbitrary length
(e.g. 8 bytes), operating on the whole buffer produces obviously wrong
results.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
ptrdiff_t can be 4 bytes, which leads to the next element being 4-byte
aligned and thus at a different offset than intended. Forcing 8-byte
alignment forces equal offset of dither16/32 on x86-32 and x86-64.
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
We operated on 31-bits, but with e.g. lanczos scaling, values can
add up to beyond 0x80000000, thus leading to output of zeroes. Drop
one bit of precision fixes this.