Commit Graph

116725 Commits

Author SHA1 Message Date
Rémi Denis-Courmont
677f28b310 lavc/h264dsp: stick R-V V weight to 16-bit precision
T-Head C908 (ns):
h264_weight2_8_c:        1607.8
h264_weight2_8_rvv_i32:   515.0 (before)
h264_weight2_8_rvv_i32:   348.5 (after)
h264_weight4_8_c:        2255.8
h264_weight4_8_rvv_i32:  1015.0 (before)
h264_weight4_8_rvv_i32:   691.0 (after)
h264_weight8_8_c:        3857.5
h264_weight8_8_rvv_i32:  2218.8 (before)
h264_weight8_8_rvv_i32:  1561.3 (after)
h264_weight16_8_c:       7431.5
h264_weight16_8_rvv_i32: 2737.3 (before)
h264_weight16_8_rvv_i32: 1848.3 (after)

SpacemiT X60 (ns):
h264_weight2_8_c:        1624.1
h264_weight2_8_rvv_i32:   352.6 (before)
h264_weight2_8_rvv_i32:   259.3 (after)
h264_weight4_8_c:        2259.3
h264_weight4_8_rvv_i32:   685.8 (before)
h264_weight4_8_rvv_i32:   530.3 (after)
h264_weight8_8_c:        4103.3
h264_weight8_8_rvv_i32:  1581.8 (before)
h264_weight8_8_rvv_i32:  1238.6 (after)
h264_weight16_8_c:       7624.3
h264_weight16_8_rvv_i32: 2738.1 (before)
h264_weight16_8_rvv_i32: 1853.3 (after)
2024-08-02 21:24:01 +03:00
Rémi Denis-Courmont
afd45c7ff7 lavc/h264dsp: stick R-V V biweight to 16-bit
T-Head C908 (ns):
h264_biweight2_8_c:        2414.5
h264_biweight2_8_rvv_i32:   701.8 (before)
h264_biweight2_8_rvv_i32:   468.5 (after)
h264_biweight4_8_c:        4655.3
h264_biweight4_8_rvv_i32:  1377.5 (before)
h264_biweight4_8_rvv_i32:   931.8 (after)
h264_biweight8_8_c:        9701.5
h264_biweight8_8_rvv_i32:  2896.0 (before)
h264_biweight8_8_rvv_i32:  2070.5 (after)
h264_biweight16_8_c:      18025.0
h264_biweight16_8_rvv_i32: 3460.8 (before)
h264_biweight16_8_rvv_i32: 1978.0 (after)

SpacemiT X60 (ns):
h264_biweight2_8_c:        2415.5
h264_biweight2_8_rvv_i32:   478.2 (before)
h264_biweight2_8_rvv_i32:   362.8 (after)
h264_biweight4_8_c:        4655.3
h264_biweight4_8_rvv_i32:   946.7 (before)
h264_biweight4_8_rvv_i32:   727.3 (after)
h264_biweight8_8_c:        9061.8
h264_biweight8_8_rvv_i32:  2071.7 (before)
h264_biweight8_8_rvv_i32:  1685.8 (after)
h264_biweight16_8_c:      18020.5
h264_biweight16_8_rvv_i32: 3457.2 (before)
h264_biweight16_8_rvv_i32: 1935.8 (after)
2024-08-02 21:24:01 +03:00
Zhao Zhili
670ff6c7ce avcodec/nvenc: rework on DTS generation
Before the patch, the method to generate DTS only works with
timebase equal to 1/fps. With timebase like 1/1000

./ffmpeg -i foo.mp4 -an -c:v h264_nvenc -enc_time_base 1/1000 bar.mp4

pts 0    dts -3
pts 160  dts 37
pts 80   dts 77
pts 40   dts 117 <-- invalid
pts 120  dts 157
pts 320  dts 197
pts 240  dts 237
pts 200  dts 277 <-- invalid
pts 280  dts 317 <-- invalid

The generated DTS can be larger than PTS, since it only reorder the
input PTS and minus the number of frame delay, which doesn't take
timebase into account. It should minus the "time" of frame delay.

9a245bd trying to fix the issue, but the implementation is incomplete,
which only use time_base.num. Then it got reverted by ac7c265b33.

After this patch:

pts 0    dts -120
pts 160  dts -80
pts 80   dts -40
pts 40   dts 0
pts 120  dts 40
pts 320  dts 80
pts 240  dts 120
pts 200  dts 160
pts 280  dts 200

Signed-off-by: Timo Rothenpieler <timo@rothenpieler.org>
2024-08-02 17:57:19 +02:00
Roman Arzumanyan
bcea693f75 avcodec/cuviddec: more accurately guess probed sw pixel format
Signed-off-by: Timo Rothenpieler <timo@rothenpieler.org>
2024-08-02 17:38:46 +02:00
Gnattu OC
d50f9701b6 avutil/hwcontext_videotoolbox: Correctly set trc
The color trc key was assigned a color primaries value which causes
the resulting colorspace is always SDR.

Fixes #10884.

Signed-off-by: Gnattu OC <gnattuoc@me.com>
Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-08-02 10:24:09 +08:00
Rémi Denis-Courmont
1b2a925e94 lavc/riscv: drop probing for F & D extensions
F and D extensions are included in all RISC-V application profiles ever
made (so starting from RV64GC a.k.a. RVA20). Realistically they need to be
selected at compilation time.

Currently, there are no consumers for these two flags. If there is ever a
need to reintroduce F- or D-specific optimisations, we can always use
__riscv_f or __riscv_d compiler predefined macros respectively.
2024-08-01 22:56:50 +03:00
Rémi Denis-Courmont
2f083fd581 lavc/audiodsp: drop R-V F vector_clipf
This is now firmly slower than C.

SiFive-U74 (cycles):
audiodsp.vector_clipf_c:   31.2
audiodsp.vector_clipf_rvf: 39.5
2024-08-01 19:29:40 +03:00
Rémi Denis-Courmont
c48213b2dc lavc/audiodsp: drop opposite sign optimisation
This was added along side the original SSE(one) DSP function in
0a68cd876e without rationale. This was
presumably faster on x87, which is no longer relevant since we pretty
much assume SSE2 or later on x86.

Meanwhile this function is ~2.5x slower than the normal floating point
one on SiFive-U74.
2024-08-01 19:29:40 +03:00
Rémi Denis-Courmont
d86b6767ce lavc/audiodsp: properly unroll vector_clipf
Given that source and destination can alias, the compiler was forced to
perform each read-modify-write sequentially. We cannot use the `restrict`
qualifier to avoid this here because the AC-3 encoder uses the function
in-place. Instead this commit provides an explicit guarantee to the
compiler that batches of 8 elements will not overlap, so that it can
interleave calculations.

In practice contemporary optimising compilers are able to unroll and keep
the temporary array in FPU registers (without spilling).

On SiFive-U74, this speeds the same signs branch by 4x, and the
opposite signs branch 1.5x.
2024-08-01 19:29:40 +03:00
Rémi Denis-Courmont
d527d23872 lavc/pixblockdsp: specialise aligned 16-bit get_pixels
The current code assumes that we have unaligned rows, which hurts on
platforms with slower unaligned accesses. (Also, this lets the compiler
unroll manually, which it seems to do in practice.)
2024-08-01 18:44:01 +03:00
Rémi Denis-Courmont
656a9664bf checkasm/riscv: preserve T1 whilst calling...
This preserves T1 whilst calling the instrumented function. In a Sci-Fi
setting where type-based Control Flow Integrity (CFI) is supported, the
calling code (i.e., the `checkasm` test case) will set T1 to the expected
value of the landing pad label (LPL) of the instrumented function.

The call wrapper will always use LPL zero which is a wild card. We should
preserve the value of T1 at least until the indirect call to the
instrumented function. Of course this is Sci-Fi, because:
1) there is no hardware (or even QEMU) support yet,
2) all our assembler functions currently use LPL zero anyway.

This uses T3 rather than T2 because indirect branches with T2 is reserved
for notionally direct calls made with an indirect call instruction (e.g.
due to GOT indirection), and are exempted from forward-edge CFI checks.
2024-08-01 18:44:01 +03:00
Rémi Denis-Courmont
54b1970c60 lavu/riscv: fix return type 2024-08-01 18:44:01 +03:00
Rémi Denis-Courmont
54ae270213 lavc/rv34dsp: use saturating add/sub for R-V V DC add
T-Head C908 (cycles):
rv34_idct_dc_add_c:      113.2
rv34_idct_dc_add_rvv_i32: 48.5 (before)
rv34_idct_dc_add_rvv_i32: 39.5 (after)
2024-08-01 18:43:04 +03:00
Rémi Denis-Courmont
952b426f3b lavc/bswapdsp: add RV Zvbb bswap16 and bswap32 2024-08-01 18:43:04 +03:00
James Almer
f4daf633b2 avcodec/aacps_tablegen_template: don't redefine CONFIG_HARDCODED_TABLES
Fixes relevant warnings when compiling with --enable-hardcoded-tables

Signed-off-by: James Almer <jamrial@gmail.com>
2024-08-01 12:13:53 -03:00
James Almer
6f8e365a2a avutil/hwcontext_vaapi: use the correct type for VASurfaceAttribExternalBuffers.buffers
Should fix ticket #11115.

Signed-off-by: James Almer <jamrial@gmail.com>
2024-08-01 12:13:53 -03:00
Marvin Scholz
ca7fcf5089 avutil/hwcontext_videotoolbox: Fix build with older SDKs
The previous fix was not sufficient.
To make things easier to reason about, split the function and
add the guards there instead of complicating the call site more.

Signed-off-by: Zhao Zhili <zhilizhao@tencent.com>
2024-08-01 20:58:27 +08:00
Anton Khirnov
bcf08c1171 lavc/ffv1: change FFV1SliceContext.plane into a RefStruct object
Frame threading in the FFV1 decoder works in a very unusual way - the
state that needs to be propagated from the previous frame is not decoded
pixels(¹), but each slice's entropy coder state after decoding the slice.

For that purpose, the decoder's update_thread_context() callback stores
a pointer to the previous frame thread's private data. Then, when
decoding each slice, the frame thread uses the standard progress
mechanism to wait for the corresponding slice in the previous frame to
be completed, then copies the entropy coder state from the
previously-stored pointer.

This approach is highly dubious, as update_thread_context() should be
the only point where frame-thread contexts come into direct contact.
There are no guarantees that the stored pointer will be valid at all, or
will contain any particular data after update_thread_context() finishes.

More specifically, this code can break due to the fact that keyframes
reset entropy coder state and thus do not need to wait for the previous
frame. As an example, consider a decoder process with 2 frame threads -
thread 0 with its context 0, and thread 1 with context 1 - decoding a
previous frame P, current frame F, followed by a keyframe K. Then
consider concurrent execution consistent with the following sequence of
events:
* thread 0 starts decoding P
* thread 0 reads P's slice header, then calls
  ff_thread_finish_setup() allowing next frame thread to start
* main thread calls update_thread_context() to transfer state from
  context 0 to context 1; context 1 stores a pointer to context 0's private
  data
* thread 1 starts decoding F
* thread 1 reads F's slice header, then calls
  ff_thread_finish_setup() allowing the next frame thread to start
  decoding
* thread 0 finishes decoding P
* thread 0 starts decoding K; since K is a keyframe, it does not
  wait for F and reallocates the arrays holding entropy coder state
* thread 0 finishes decoding K
* thread 1 reads entropy coder state from its stored pointer to context
  0, however it finds state from K rather than from P

This execution is currently prevented by special-casing FFV1 in the
generic frame threading code, however that is supremely ugly. It also
involves unnecessary copies of the state arrays, when in fact they can
only be used by one thread at a time.

This commit addresses these deficiencies by changing the array of
PlaneContext (each of which contains the allocated state arrays)
embedded in FFV1SliceContext into a RefStruct object. This object can
then be propagated across frame threads in standard manner. Since the
code structure guarantees only one thread accesses it at a time, no
copies are necessary. It is also re-created for keyframes, solving the
above issue cleanly.

Special-casing of FFV1 in the generic frame threading code will be
removed in a later commit.

(¹) except in the case of a damaged slice, when previous frame's pixels
    are used directly
2024-08-01 10:09:26 +02:00
Anton Khirnov
c335218a81 lavc/ffv1dec: inline copy_fields() into update_thread_context()
It is now only called from a single place, so there is no point in it
being a separate function.
2024-08-01 10:09:26 +02:00
Anton Khirnov
d44812f7cf lavc/ffv1dec: stop using per-slice FFV1Context
All remaining accesses to them are for fields that have the same value
in the main encoder context.

Drop now-unused FFV1Context.slice_contexts.
2024-08-01 10:09:26 +02:00
Anton Khirnov
2b21cdff6e lavc/ffv1dec: move slice_damaged to per-slice context 2024-08-01 10:09:26 +02:00
Anton Khirnov
f2aeba56c4 lavc/ffv1dec: move slice_reset_contexts to per-slice context 2024-08-01 10:09:26 +02:00
Anton Khirnov
84dda32202 lavc/ffv1enc: stop using per-slice FFV1Context
All remaining accesses to them are for fields that have the same value
in the main encoder context.
2024-08-01 10:09:26 +02:00
Anton Khirnov
96e8af6c4d lavc/ffv1: move ac_byte_count to per-slice context 2024-08-01 10:09:26 +02:00
Anton Khirnov
e7d0f44138 lavc/ffv1enc: store per-slice rc_stat(2?) in FFV1SliceContext
Instead of the per-slice FFV1Context, which will be removed in future
commits.
2024-08-01 10:09:26 +02:00
Anton Khirnov
7b2bfba55d lavc/ffv1: move RangeCoder to per-slice context 2024-08-01 10:09:26 +02:00
Anton Khirnov
28769f6bc1 lavc/ffv1: move FFV1Context.plane to per-slice context 2024-08-01 10:09:26 +02:00
Anton Khirnov
9b86ba5a92 lavc/ffv1: always use the main context values of ac
It cannot change between slices.
2024-08-01 10:09:26 +02:00
Anton Khirnov
a57c88d67b lavc/ffv1: move FFV1Context.slice_{coding_mode,rct_.y_coef} to per-slice context 2024-08-01 10:09:26 +02:00
Anton Khirnov
39486a2b29 lavc/ffv1: always use the main context values of plane_count/transparency
They cannot change between slices.
2024-08-01 10:09:26 +02:00
Anton Khirnov
492df65201 lavc/ffv1: drop write-only PlaneContext.interlace_bit_state 2024-08-01 10:09:26 +02:00
Anton Khirnov
a411fc5a84 lavc/ffv1: drop redundant PlaneContext.quant_table
It is a copy of FFV1Context.quant_tables[quant_table_index].
2024-08-01 10:09:26 +02:00
Anton Khirnov
4b9f7c7e3a lavc/ffv1: drop redundant FFV1Context.quant_table
In all cases except decoding version 1 it's either not used, or contains
a copy of a table from quant_tables, which we can just as well use
directly.

When decoding version 1, we can just as well decode into
quant_tables[0], which would otherwise be unused.
2024-08-01 10:09:26 +02:00
Anton Khirnov
d2f507233a lavc/ffv1enc: move bit writer to per-slice context 2024-08-01 10:09:26 +02:00
Anton Khirnov
889faedd26 lavc/ffv1dec: move the bitreader to stack
There is no reason to place it in persistent state.
2024-08-01 10:09:25 +02:00
Anton Khirnov
19e9f3d5f2 lavc/ffv1: move run_index to the per-slice context 2024-08-01 10:09:25 +02:00
Anton Khirnov
91d3c1ac47 lavc/ffv1: move sample_buffer to the per-slice context 2024-08-01 10:09:25 +02:00
Anton Khirnov
54aa33f116 lavc/ffv1: add a per-slice context
FFV1 decoder and encoder currently use the same struct - FFV1Context -
both as codec private data and per-slice context. For this purpose
FFV1Context contains an array of pointers to per-slice FFV1Context
instances.

This pattern is highly confusing, as it is not clear which fields are
per-slice and which per-codec.

Address this by adding a new struct storing only per-slice data. Start
by moving slice_{x,y,width,height} to it.
2024-08-01 10:09:25 +02:00
Anton Khirnov
d845ea49c5 lavc/ffv1dec: move copy_fields() under HAVE_THREADS
It is unused otherwise
2024-08-01 10:09:25 +02:00
Anton Khirnov
3a5c814b19 lavc/ffv1dec: drop a pointless variable in decode_slice()
fsdst is by construction always equal to fs, there is even an
av_assert1() checking that. Just use fs directly.
2024-08-01 10:09:25 +02:00
Anton Khirnov
4da146ba83 lavc/ffv1dec: drop FFV1Context.cur
It is merely a pointer to FFV1Context.picture.f, which can just as well
be used directly.
2024-08-01 10:09:25 +02:00
Anton Khirnov
e1fa107fd1 lavc/ffv1dec: simplify slice index calculation 2024-08-01 10:09:25 +02:00
Anton Khirnov
d776fa4e4d lavc/ffv1dec: declare loop variables in the loop where possible 2024-08-01 10:09:25 +02:00
Anton Khirnov
8e19c24634 tests/fate/vcodec: add vsynth tests for FFV1 version 2 2024-08-01 10:09:25 +02:00
Michael Niedermayer
06f5ed40f8
avcodec/snow: Fix off by 1 error in run_buffer
Fixes: out of array access
Fixes: 70741/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_SNOW_fuzzer-5703668010647552

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2024-08-01 00:18:02 +02:00
Michael Niedermayer
58fbeb59e7
avcodec/utils: apply the same alignment to YUV410 as we do to YUV420 for snow
The snow encoder uses block based motion estimation which can read out of array if
insufficient alignment is used

It may be better to only apply this for the encoder, as it would safe a few bytes of memory
for the decoder. Until then, this fixes the issue in a simple way.

Fixes: out of array access
Fixes: 68963/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_SNOW_fuzzer-4979988435632128
Fixes: 68969/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_SNOW_fuzzer-6239933667803136.fuzz
Fixed: 70497/clusterfuzz-testcase-minimized-ffmpeg_AV_CODEC_ID_SNOW_fuzzer-5751882631413760

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2024-08-01 00:18:02 +02:00
Michael Niedermayer
ed96ac87a9
avformat/iamf_parse: Check for 0 samples
Fixes: division by zero
Fixes: 70561/clusterfuzz-testcase-minimized-ffmpeg_IO_DEMUXER_fuzzer-6199435013455872
Fixes: 70565/clusterfuzz-testcase-minimized-ffmpeg_dem_MOV_fuzzer-5783790316748800

Found-by: continuous fuzzing process https://github.com/google/oss-fuzz/tree/master/projects/ffmpeg
Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2024-08-01 00:18:02 +02:00
Nathan E. Egge
8280ec7a32 lavu/riscv: Revert d808070, removing AV_READ_TIME
The implementation of ff_read_time() for RISC-V uses rdtime which has
 precision on existing hardware too low (!) for benchmarking purposes.
Deleting this implementation falls back on clock_gettime() which was
 added as the default ff_read_time() implementation in 33e4cc9.
Below are metrics gathered on SpacemiT K1, before and after this commit:

Before:

$ tests/checkasm/checkasm --bench
benchmarking with native FFmpeg timers
nop: 0.0
checkasm: using random seed 3473665261
checkasm: bench runs 1024 (1 << 10)
RVI:
 - pixblockdsp.get_pixels                [OK]
 - vc1dsp.mspel_pixels                   [OK]
RVF:
 - audiodsp.audiodsp                     [OK]
checkasm: all 4 tests passed
audiodsp.vector_clipf_c: 1388.7
audiodsp.vector_clipf_rvf: 261.5
get_pixels_c: 2.0
get_pixels_rvi: 1.5
vc1dsp.put_vc1_mspel_pixels_tab[0][0]_c: 8.0
vc1dsp.put_vc1_mspel_pixels_tab[0][0]_rvi: 1.0
vc1dsp.put_vc1_mspel_pixels_tab[1][0]_c: 2.0
vc1dsp.put_vc1_mspel_pixels_tab[1][0]_rvi: 0.5

After:

$ tests/checkasm/checkasm --bench
benchmarking with native FFmpeg timers
nop: 56.4
checkasm: using random seed 1021411603
checkasm: bench runs 1024 (1 << 10)
RVI:
 - pixblockdsp.get_pixels                [OK]
 - vc1dsp.mspel_pixels                   [OK]
RVF:
 - audiodsp.audiodsp                     [OK]
checkasm: all 4 tests passed
audiodsp.vector_clipf_c: 23236.4
audiodsp.vector_clipf_rvf: 11038.4
get_pixels_c: 79.6
get_pixels_rvi: 48.4
vc1dsp.put_vc1_mspel_pixels_tab[0][0]_c: 329.6
vc1dsp.put_vc1_mspel_pixels_tab[0][0]_rvi: 38.1
vc1dsp.put_vc1_mspel_pixels_tab[1][0]_c: 89.9
vc1dsp.put_vc1_mspel_pixels_tab[1][0]_rvi: 17.1

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-07-31 17:48:50 +03:00
James Almer
ab5c612137 avcodec/Makefile: use the correct path for aacdec_fixed.o when setting its dependencies
Fixes ticket #11112

Signed-off-by: James Almer <jamrial@gmail.com>
2024-07-31 11:32:56 -03:00
Anton Khirnov
43f702a253 lavfi/framesync: avoid forcing frame writability unnecessarily
Callers of ff_framesync_get_frame() generally do not expect the result
to be writable, those that do (e.g. ff_framesync_dualinput_get_writable())
ensure writability themselves.

Significantly reduces memory consumption in complex graphs with
framesync-based filters (e.g. scale, ssim).

Reported-By: Mark Shwartzman
2024-07-31 11:12:45 +02:00