Commit Graph

2448 Commits

Author SHA1 Message Date
Rémi Denis-Courmont a1bfb5290e sws/rgb2rgb: RISC-V 64-bit V packed YUYV/UYVY to planar 4:2:2
This is currently 64-bit only because the stack spilling code would not
assemble on RV32I (and it would corrupt s0 and s1 on RV128I, in theory).

This could be added later in the unlikely that someone wants it.
2022-09-30 07:25:44 +02:00
Rémi Denis-Courmont 9181835a24 sws/rgb2rgb: RISC-V V interleaveBytes 2022-09-30 07:24:09 +02:00
Rémi Denis-Courmont 66a03f4053 sws/rgb2rgb: RISC-V V shuffle_bytes_xxxx functions 2022-09-30 07:24:09 +02:00
Andreas Rheinhardt 888a02a126 swscale/output: Don't call av_pix_fmt_desc_get() in a loop
Up until now, libswscale/output.c used a macro to write
an output pixel which involved a call to av_pix_fmt_desc_get()
to find out whether the input pixel format is BE or LE
despite this being known at compile-time (there are templates
per pixfmt). Even worse, these calls are made in a loop,
so that e.g. there are eight calls to av_pix_fmt_desc_get()
for every pixel processed in yuv2rgba64_X_c_template()
for 64bit RGB formats.

This commit modifies these macros to ensure that isBE()
is evaluated at compile-time. This saved 41184B of .text
for me (GCC 11.2, -O3). Of course, it also improved performance.
E.g. ffmpeg_g -f lavfi -i testsrc2,format=yuva420p -pix_fmt rgba64le \
-threads 1  -t 1:00  -f null - (which uses yuv2rgba64le_X_c,
which is an invocation of yuv2rgba64_X_c_template() mentioned above),
performance improved from 95589 to 41387 decicycles for one call
to yuv2packedX; for the be variant the numbers went down from
76087 to 43024 decicycles.

Reviewed-by: Anton Khirnov <anton@khirnov.net>
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-09-19 23:40:41 +02:00
Andreas Rheinhardt 4d7a1a4619 swscale/input: Avoid calls to av_pix_fmt_desc_get()
Up until now, libswscale/input.c used a macro to read
an input pixel which involved a call to av_pix_fmt_desc_get()
to find out whether the input pixel format is BE or LE
despite this being known at compile-time (there are templates
per pixfmt). Even worse, these calls are made in a loop,
so that e.g. there are six calls to av_pix_fmt_desc_get()
for every pair of UV pixel processed in
rgb64ToUV_half_c_template().

This commit modifies these macros to ensure that isBE()
is evaluated at compile-time. This saved 9743B of .text
for me (GCC 11.2, -O3). For a simple RGB64LE->YUV420P
transformation like
ffmpeg -f lavfi -i haldclutsrc,format=rgba64le -pix_fmt yuv420p \
-threads 1  -t 1:00  -f null -
the amount of decicycles spent in rgb64LEToUV_half_c
(which is created via the template mentioned above)
decreases from 19751 to 5341; for RGBA64BE the number
went down from 11945 to 5393. For shared builds (where
the call to av_pix_fmt_desc_get() is indirect) the old numbers
are 15230 for RGBA64BE and 27502 for RGBA64LE, whereas
the numbers with this patch are indistinguishable from
the numbers from a static build.

Also make the macros that are touched conform to the
usual convention of using uppercase names while just at it.

Reviewed-by: Anton Khirnov <anton@khirnov.net>
Reviewed-by: Paul B Mahol <onemda@gmail.com>
Reviewed-by: Michael Niedermayer <michael@niedermayer.cc>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-09-19 23:40:41 +02:00
Hao Chen 925ac0da32
swscale/la: Add output_lasx.c file.
ffmpeg -i 1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -s 640x480 -pix_fmt
rgb24 -y /dev/null -an
before: 150fps
after:  183fps

Signed-off-by: Hao Chen <chenhao@loongson.cn>
Reviewed-by: yinshiyou-hf@loongson.cn
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2022-09-10 22:56:39 +02:00
Hao Chen 74d09b068d
swscale/la: Add yuv2rgb_lasx.c and rgb2rgb_lasx.c files
ffmpeg -i 1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -pix_fmt rgb24 -y /dev/null -an
before: 178fps
after:  210fps

Signed-off-by: Hao Chen <chenhao@loongson.cn>
Reviewed-by: yinshiyou-hf@loongson.cn
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2022-09-10 22:56:38 +02:00
Hao Chen 38cacce22a
swscale/la: Optimize hscale functions with lasx.
ffmpeg -i 1_h264_1080p_30fps_3Mbps.mp4 -f rawvideo -s 640x480 -y /dev/null -an
before: 101fps
after:  138fps

Signed-off-by: Hao Chen <chenhao@loongson.cn>
Reviewed-by: yinshiyou-hf@loongson.cn
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2022-09-10 22:56:38 +02:00
Philip Langdale 09a8e5debb swscale/output: add support for Y210LE and Y212LE 2022-09-10 12:29:12 -07:00
Philip Langdale 68181623e9 swscale/output: add support for XV30LE 2022-09-10 12:29:12 -07:00
Philip Langdale 366f073c62 swscale/output: add support for XV36LE 2022-09-10 12:29:12 -07:00
Philip Langdale caf8d4d256 swscale/output: add support for P012
This generalises the existing P010 support.
2022-09-10 12:29:12 -07:00
Andreas Rheinhardt d2428d80ce swscale/input: Remove spec-incompliant ';'
These macros are definitions, not only declarations and therefore
should not contain a semicolon. Such a semicolon is actually
spec-incompliant, but compilers happen to accept them.

Reviewed-by: Philip Langdale <philipl@overt.org>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-09-08 19:21:30 +02:00
Philip Langdale 4a59eba227 swscale/input: add support for Y212LE 2022-09-06 12:49:10 -07:00
Philip Langdale 198b5b90d5 swscale/input: add support for XV30LE 2022-09-06 12:49:10 -07:00
Philip Langdale 5bdd726115 swscale/input: add support for P012
As we now have three of these formats, I added macros to generate the
conversion functions.
2022-09-06 12:49:10 -07:00
Philip Langdale 8d9462844a swscale/input: add support for XV36LE 2022-09-06 12:49:10 -07:00
Philip Langdale 45726aa117 libswscale: add support for VUYX format
As we already have support for VUYA, I figured I should do the small
amount of work to support VUYX as well. That means a little refactoring
to share code.
2022-08-25 19:03:49 -07:00
Andreas Rheinhardt de33506e4b swscale/x86/rgb_2_rgb: Empty MMX state in ff_shuffle_bytes_2103_mmxext
Fixes FATE-failures with the the filter-2xbr filter-3xbr filter-4xbr
filter-ep2x filter-ep3x filter-hq2x filter-hq3x filter-hq4x
filter-paletteuse-bayer filter-paletteuse-bayer0
filter-paletteuse-nodither and filter-paletteuse-sierra2_4a tests
when using 32bit x86 with CPUFLAGS ranging from "mmx+mmxext" to
"mmx+mmxext+sse+sse2+sse3" (the relevant function is only overwritten
when using SSSE3).

Reviewed-by: Lynne <dev@lynne.ee>
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-08-23 12:21:00 +02:00
Timo Rothenpieler aca569aad2 swscale/input: add rgbaf16 input support
This is by no means perfect, since at least ddagrab will return scRGB
data with values outside of 0.0f to 1.0f for HDR values.
Its primary purpose is to be able to work with the format at all.
2022-08-19 22:09:36 +02:00
Timo Rothenpieler f2de911818 swscale: add opaque parameter to input functions 2022-08-19 22:09:36 +02:00
Andreas Rheinhardt 8bec225c3c swscale/x86/yuv2yuvX: Remove unused ff_yuv2yuvX_mmx()
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-08-19 12:01:34 +02:00
Alan Kelly a38293e444 libswscale: Enable hscale_avx2 for all input sizes.
ff_shuffle_filter_coefficients shuffles the tail as required.

Signed-off-by: Anton Khirnov <anton@khirnov.net>
2022-08-18 16:24:48 +02:00
Alan Kelly a6724285fd sws: allow avx2 hscale to process inputs of any size.
The main loop processes blocks of 16 pixels. The tail processes blocks
of size 4.

Signed-off-by: Anton Khirnov <anton@khirnov.net>
2022-08-18 16:24:48 +02:00
Alan Kelly 51a34e8525 sws: Replace call to yuv2yuvX_mmx by yuv2yuvX_mmxext
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-08-18 16:19:13 +02:00
Swinney, Jonathan 0d7caa5b09 swscale/aarch64: add vscale specializations
This commit adds new code paths for vscale when filterSize is 2, 4, or
8. By using specialized code with unrolling to match the filterSize we
can improve performance.

On AWS c7g (Graviton 3, Neoverse V1) instances:
                                 before   after
yuv2yuvX_2_0_512_accurate_neon:  558.8    268.9
yuv2yuvX_4_0_512_accurate_neon:  637.5    434.9
yuv2yuvX_8_0_512_accurate_neon:  1144.8   806.2
yuv2yuvX_16_0_512_accurate_neon: 2080.5   1853.7

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-16 13:40:42 +03:00
Swinney, Jonathan 3e708722a2 swscale/aarch64: vscale optimization
Use scalar times vector multiply accumlate instructions instead of
vector times vector to remove the need for replicating load instructions
which are slightly slower.

On AWS c7g (Graviton 3, Neoverse V1) instances:
yuv2yuvX_8_0_512_accurate_neon:  1144.8  987.4
yuv2yuvX_16_0_512_accurate_neon: 2080.5 1869.4

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-16 13:40:42 +03:00
Swinney, Jonathan 4dcd191a50 checkasm: updated tests for sw_scale
Change the reference to exactly match the C reference in swscale,
instead of exactly matching the x86 SIMD implementations (which
differs slightly). Test with and without SWS_ACCURATE_RND - if this
flag isn't set, the output must match the C reference exactly,
otherwise it is allowed to be off by 2.

Mark a couple x86 functions as unavailable when SWS_ACCURATE_RND
is set - apparently this discrepancy hasn't been noticed in other
exact tests before.

Add a test for yuv2plane1.

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-16 13:40:42 +03:00
Swinney, Jonathan 75ffca7eef libswscale/aarch64: add another hscale specialization
This specialization handles the case where filtersize is 4 mod 8, e.g.
12, 20, etc. Aarch64 was previously using the c function for this case.
This implementation speeds up that case significantly.

hscale_8_to_15__fs_12_dstW_512_c: 6234.1
hscale_8_to_15__fs_12_dstW_512_neon: 1505.6

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-08-16 12:08:38 +03:00
Timo Rothenpieler b77fff47d0 configure: always enable gnu_windres if available
Use the appropiate Makefile variable to ensure the resource file is
only built into shared libraries instead.
2022-08-13 14:42:36 +02:00
James Almer 68e017c487 swscale/output: fix reading chroma values when generating vuya output
Signed-off-by: James Almer <jamrial@gmail.com>
2022-08-08 09:39:33 -03:00
James Almer 1974813261 swscale/output: add VUYA output support
Signed-off-by: James Almer <jamrial@gmail.com>
2022-08-07 09:33:16 -03:00
James Almer f0abd07996 swscale/input: add VUYA input support
Reviewed-by: Philip Langdale <philipl@overt.org>
Signed-off-by: James Almer <jamrial@gmail.com>
2022-08-05 09:39:21 -03:00
Andreas Rheinhardt da668fa7d2 swscale/rgb2rgb: Don't cast const away
Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-07-31 01:09:52 +02:00
Matthieu Bouron 0a6bb7da55 swscale: add NV16 input/output
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2022-07-19 12:20:16 +02:00
Michael Niedermayer fd26b07e8b Bump versions after 5.1 branch
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2022-07-13 00:29:05 +02:00
Michael Niedermayer 6f1b144358 Bump Versions for 5.1 branch
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2022-07-13 00:27:37 +02:00
Andreas Rheinhardt 81d3472031 swscale/x86/swscale: Simplify macro
This is possible now that it is no longer used by MMX.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-06-22 13:36:18 +02:00
Andreas Rheinhardt a05f22eaf3 swscale/x86/swscale: Remove obsolete and harmful MMX(EXT) functions
x64 always has MMX, MMXEXT, SSE and SSE2 and this means
that some functions for MMX, MMXEXT, SSE and 3dnow are always
overridden by other functions (unless one e.g. explicitly
disables SSE2). So given that the only systems that
benefit from these functions are truely ancient 32bit x86s
they are removed.

Moreover, some of the removed code was buggy/not bitexact
and lead to failures involving the f32le and f32be versions of
gray, gbrp and gbrap on x86-32 when SSE2 was not disabled.
See e.g.
https://fate.ffmpeg.org/report.cgi?time=20220609221253&slot=x86_32-debian-kfreebsd-gcc-4.4-cpuflags-mmx

Notice that yuv2yuvX_mmx is not removed, because it is used
by SSE3 and AVX2 as fallback in case of unaligned data and
also for tail processing. I don't know why yuv2yuvX_mmxext
isn't being used for this; an earlier version [1] of
554c2bc708 used it, but
the version that was eventually applied does not.

[1]: https://ffmpeg.org/pipermail/ffmpeg-devel/2020-November/272124.html

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-06-22 13:36:04 +02:00
Andreas Rheinhardt 2831837182 swscale/x86/yuv2rgb: Remove obsolete MMX functions
x64 always has MMX, MMXEXT, SSE and SSE2 and this means
that some functions for MMX, MMXEXT and 3dnow are always
overridden by other functions (unless one e.g. explicitly
disables SSE2) for x64. So given that the only systems that
benefit from these functions are truely ancient 32bit x86s
they are removed.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-06-22 13:35:50 +02:00
Andreas Rheinhardt 608319a311 swscale/x86/rgb2rgb: Remove obsolete MMX, 3dnow functions
x64 always has MMX, MMXEXT, SSE and SSE2 and this means
that some functions for MMX, MMXEXT and 3dnow are always
overridden by other functions (unless one e.g. explicitly
disables SSE2) for x64. So given that the only systems that
benefit from these functions are truely ancient 32bit x86s
they are removed.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-06-22 13:35:38 +02:00
Andreas Rheinhardt 40e6575aa3 all: Replace if (ARCH_FOO) checks by #if ARCH_FOO
This is more spec-compliant because it does not rely
on dead-code elimination by the compiler. Especially
MSVC has problems with this, as can be seen in
https://ffmpeg.org/pipermail/ffmpeg-devel/2022-May/296373.html
or
https://ffmpeg.org/pipermail/ffmpeg-devel/2022-May/297022.html

This commit does not eliminate every instance where we rely
on dead code elimination: It only tackles branching to
the initialization of arch-specific dsp code, not e.g. all
uses of CONFIG_ and HAVE_ checks. But maybe it is already
enough to compile FFmpeg with MSVC with whole-programm-optimizations
enabled (if one does not disable too many components).

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-06-15 04:56:37 +02:00
Vardan Margaryan 73302aa193 swscale/x86/yuv_2_rgb: fix access to memory past the frame data in yuv to rgb conversion
Y, U, V data is loaded at the end of the current iteration for the next
iteration.
It results in memory access past the frame data on the last iteration
(that data is never used after the loading).

So load data at the start of the iteration, so that only useful data is
loaded.

Signed-off-by: Vardan Margaryan <v.t.margaryan@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
2022-06-06 09:51:17 +02:00
Swinney, Jonathan 0ea61725b1 swscale/aarch64: add hscale specializations
This patch adds code to support specializations of the hscale function
and adds a specialization for filterSize == 4.

ff_hscale8to15_4_neon is a complete rewrite. Since the main bottleneck
here is loading the data from src, this data is loaded a whole block
ahead and stored back to the stack to be loaded again with ld4. This
arranges the data for most efficient use of the vector instructions and
removes the need for completion adds at the end. The number of
iterations of the C per iteration of the assembly is increased from 4 to
8, but because of the prefetching, there must be a special section
without prefetching when dstW < 16.

This improves speed on Graviton 2 (Neoverse N1) dramatically in the case
where previously fs=8 would have been required.

before: hscale_8_to_15__fs_8_dstW_512_neon: 1962.8
after : hscale_8_to_15__fs_4_dstW_512_neon: 1220.9

Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-05-28 01:09:05 +03:00
Andreas Rheinhardt f2b79c5b85 lib*/version: Move library version functions into files of their own
This avoids having to rebuild big files every time FFMPEG_VERSION
changes (which it does with every commit).

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2022-05-10 06:49:32 +02:00
Martin Storsjö 70db14376c swscale: aarch64: Optimize the final summation in the hscale routine
Before:                     Cortex A53      A72      A73  Graviton 2  Graviton 3
hscale_8_to_15_width8_neon:     8273.0   4602.5   4289.5      2429.7      1629.1
hscale_8_to_15_width16_neon:   12405.7   6803.0   6359.0      3549.0      2378.4
hscale_8_to_15_width32_neon:   21258.7  11491.7  11469.2      5797.2      3919.6
hscale_8_to_15_width40_neon:   25652.0  14173.7  12488.2      6893.5      4810.4

After:
hscale_8_to_15_width8_neon:     7633.0   3981.5   3350.2      1980.7      1261.1
hscale_8_to_15_width16_neon:   11666.7   5951.0   5512.0      3080.7      2131.4
hscale_8_to_15_width32_neon:   20900.7  10733.2   9481.7      5275.2      3862.1
hscale_8_to_15_width40_neon:   24826.0  13536.2  11502.0      6397.2      4731.9

Thus, this gives overall a 8-29% speedup for the smaller filter
sizes, around 1-8% for the larger filter sizes.

Inspired by a patch by Jonathan Swinney <jswinney@amazon.com>.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-04-22 10:49:46 +03:00
Martin Storsjö 2d368392a5 Keep including the full version.h when headers are included externally
This avoids unnecessary churn and build breakage for users, by
making sure the whole version.h is included like it has been so far,
while keeping the benefit of not needing to rebuild most files in
the ffmpeg tree on minor/micro bumps.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-03-19 00:01:57 +02:00
Martin Storsjö f3a0e2ee2b doc: Add an entry to APIchanges about changes to version.h and version_major.h
Also bump the minor versions of all libraries, to signify the
API change of splitting the version.h headers and adding the
new version_major.h header.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-03-16 14:12:46 +02:00
Martin Storsjö 6cd2ac388d libswscale: Split version.h
Signed-off-by: Martin Storsjö <martin@martin.st>
2022-03-16 14:05:26 +02:00
Martin Storsjö c523724c69 swscale: Take the destination range into account for yuv->rgb->yuv conversions
The range parameters need to be set up before calling
sws_init_context (which selects which fastpaths can be used;
this gets called by sws_getContext); solely passing them via
sws_setColorspaceDetails isn't enough.

This fixes producing full range YUV range output when doing
YUV->YUV conversions between different YUV color spaces.

Signed-off-by: Martin Storsjö <martin@martin.st>
2022-02-25 11:01:17 +02:00