Moves much of the setup logic for VAAPI decoding into lavc; the user
now need only provide the hw_frames_ctx.
(cherry picked from commit 123ccd07c5)
(cherry picked from commit 5e879b54a3)
(cherry picked from commit 0aec37e625)
(cherry picked from commit cfa4eb4fba)
* commit 'f450cc7bc595155bacdb9f5d2414a076ccf81b4a':
h264: eliminate decode_postinit()
Also includes fixes from 1f7b4f9abc and e344e65109.
Original patch replace H264Context.next_output_pic (H264Picture *) by
H264Context.output_frame (AVFrame *). This change is discarded as it
is incompatible with the frame reconstruction and motion vectors
display code which needs the extra information from the H264Picture.
Merged-by: Clément Bœsch <u@pkh.me>
Merged-by: Matthieu Bouron <matthieu.bouron@gmail.com>
We can pick the correct slice index directly from the ID3D11VideoDecoderOutputView
casted from data[3].
Also added myself as maintainer for DXVA2 and D3D11VA.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
No need to loop through the known surfaces, we'll use the requested surface
anyway.
The loop is only done for DXVA2.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This fixes heap-buffer-overflows in libopenmpt caused by interpreting
the negative size value as unsigned size_t.
Signed-off-by: Andreas Cadhalpun <Andreas.Cadhalpun@googlemail.com>
Reviewed-by: Jörn Heusipp <osmanx@problemloesungsmaschine.de>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
When support for this was added the details weren't yet finalized.
This is no longer the case.
Fixes writing of mkv/webm files with HDR.
Reported-by: Kagami Hiiragi <kagami@genshiken.org>
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
Reviewed-by: James Almer <jamrial@gmail.com>
This work is sponsored by, and copyright, Google.
Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:
vp9_inv_dct_dct_16x16_sub16_add_neon: 1373.2
vp9_inv_dct_dct_32x32_sub32_add_neon: 8089.0
By skipping individual 8x16 or 8x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:
vp9_inv_dct_dct_16x16_sub1_add_neon: 235.3
vp9_inv_dct_dct_16x16_sub2_add_neon: 1036.7
vp9_inv_dct_dct_16x16_sub4_add_neon: 1036.7
vp9_inv_dct_dct_16x16_sub8_add_neon: 1036.7
vp9_inv_dct_dct_16x16_sub12_add_neon: 1372.1
vp9_inv_dct_dct_16x16_sub16_add_neon: 1372.1
vp9_inv_dct_dct_32x32_sub1_add_neon: 555.1
vp9_inv_dct_dct_32x32_sub2_add_neon: 5190.2
vp9_inv_dct_dct_32x32_sub4_add_neon: 5180.0
vp9_inv_dct_dct_32x32_sub8_add_neon: 5183.1
vp9_inv_dct_dct_32x32_sub12_add_neon: 6161.5
vp9_inv_dct_dct_32x32_sub16_add_neon: 6155.5
vp9_inv_dct_dct_32x32_sub20_add_neon: 7136.3
vp9_inv_dct_dct_32x32_sub24_add_neon: 7128.4
vp9_inv_dct_dct_32x32_sub28_add_neon: 8098.9
vp9_inv_dct_dct_32x32_sub32_add_neon: 8098.8
I.e. in general a very minor overhead for the full subpartition case due
to the additional cmps, but a significant speedup for the cases when we
only need to process a small part of the actual input data.
This is cherrypicked from libav commits
cad42fadcd and
a0c443a398.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This work is sponsored by, and copyright, Google.
Previously all subpartitions except the eob=1 (DC) case ran with
the same runtime:
Cortex A7 A8 A9 A53
vp9_inv_dct_dct_16x16_sub16_add_neon: 3188.1 2435.4 2499.0 1969.0
vp9_inv_dct_dct_32x32_sub32_add_neon: 18531.7 16582.3 14207.6 12000.3
By skipping individual 4x16 or 4x32 pixel slices in the first pass,
we reduce the runtime of these functions like this:
vp9_inv_dct_dct_16x16_sub1_add_neon: 274.6 189.5 211.7 235.8
vp9_inv_dct_dct_16x16_sub2_add_neon: 2064.0 1534.8 1719.4 1248.7
vp9_inv_dct_dct_16x16_sub4_add_neon: 2135.0 1477.2 1736.3 1249.5
vp9_inv_dct_dct_16x16_sub8_add_neon: 2446.7 1828.7 1993.6 1494.7
vp9_inv_dct_dct_16x16_sub12_add_neon: 2832.4 2118.3 2266.5 1735.1
vp9_inv_dct_dct_16x16_sub16_add_neon: 3211.7 2475.3 2523.5 1983.1
vp9_inv_dct_dct_32x32_sub1_add_neon: 756.2 456.7 862.0 553.9
vp9_inv_dct_dct_32x32_sub2_add_neon: 10682.2 8190.4 8539.2 6762.5
vp9_inv_dct_dct_32x32_sub4_add_neon: 10813.5 8014.9 8518.3 6762.8
vp9_inv_dct_dct_32x32_sub8_add_neon: 11859.6 9313.0 9347.4 7514.5
vp9_inv_dct_dct_32x32_sub12_add_neon: 12946.6 10752.4 10192.2 8280.2
vp9_inv_dct_dct_32x32_sub16_add_neon: 14074.6 11946.5 11001.4 9008.6
vp9_inv_dct_dct_32x32_sub20_add_neon: 15269.9 13662.7 11816.1 9762.6
vp9_inv_dct_dct_32x32_sub24_add_neon: 16327.9 14940.1 12626.7 10516.0
vp9_inv_dct_dct_32x32_sub28_add_neon: 17462.7 15776.1 13446.2 11264.7
vp9_inv_dct_dct_32x32_sub32_add_neon: 18575.5 17157.0 14249.3 12015.1
I.e. in general a very minor overhead for the full subpartition case due
to the additional loads and cmps, but a significant speedup for the cases
when we only need to process a small part of the actual input data.
In common VP9 content in a few inspected clips, 70-90% of the non-dc-only
16x16 and 32x32 IDCTs only have nonzero coefficients in the upper left
8x8 or 16x16 subpartitions respectively.
This is cherrypicked from libav commit
9c8bc74c2b.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This avoids reloading them if they haven't been clobbered, if the
first pass also was idct.
This is similar to what was done in the aarch64 version.
This is cherrypicked from libav commit
3c87039a40.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Since the same parameter is used for both input and output,
the name inout is more fitting.
This matches the naming used below in the dmbutterfly macro.
This is cherrypicked from libav commit
79566ec8c7.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
The clobbering tests in checkasm are only invoked when testing
correctness, so this bug didn't show up when benchmarking the
dc-only version.
This is cherrypicked from libav commit
4d960a1185.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This is one instruction less for thumb, and only have got
1/2 arm/thumb specific instructions.
This is cherrypicked from libav commit
e5b0fc170f.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
The latter is 1 cycle faster on a cortex-53 and since the operands are
bytewise (or larger) bitmask (impossible to overflow to zero) both are
equivalent.
This is cherrypicked from libav commit
e7ae8f7a71.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Since aarch64 has enough free general purpose registers use them to
branch to the appropiate storage code. 1-2 cycles faster for the
functions using loop_filter 8/16, ... on a cortex-a53. Mixed results
(up to 2 cycles faster/slower) on a cortex-a57.
This is cherrypicked from libav commit
d7595de0b2.
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Seemingly ff_clear_block_sse assumed that the block array is aligned,
so make sure it is.
Fixes ticket #6079
Signed-off-by: James Almer <jamrial@gmail.com>