909 Commits

Author SHA1 Message Date
Aurelien Jacobs f677718bc8 sbcenc: add armv6 and neon asm optimizations
This was originally based on libsbc, and was fully integrated into ffmpeg.
2018-03-07 22:26:53 +01:00
Michael Niedermayer 7dbbb75ee3 avcodec/arm/sbrdsp_neon: Use a free register instead of putting 2 things in one
Fixes high pitched shriek
Fixes: 25420848_1478428308873746_4255813235963330560_n.mp4

Reported-by: Dale Curtis <dalecurtis@google.com>
Reviewed-by: Dale Curtis <dalecurtis@chromium.org>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
2018-01-12 22:45:02 +01:00
James Almer 36de24d5b7 arm/hevc_idct: fix compilation on Android
Fix the "out of range" compilation error for armeabi-v7a. Compilation failed
when building libvlc.aar for ARMv7 Android on an Ubuntu 16.04 host, with the
error message "Offset out of range". The cause is that the assembler LDR
directives in the function "ff_hevc_transform_luma_4x4_neon_8" need local
storage within a range of <1k, but no such storage was provided.

Based on a patch by Ihor Bobalo <bob@eleks.com>

Suggested-by: wbs
Signed-off-by: James Almer <jamrial@gmail.com>
2017-12-09 21:46:34 +02:00
James Almer 68e479e3ad Merge commit 'b487add7ecf78efda36d49815f8f8757bd24d4cb'
* commit 'b487add7ecf78efda36d49815f8f8757bd24d4cb':
  arm: Remove a redundant check in fmtconvert_init_arm.c

Merged-by: James Almer <jamrial@gmail.com>
2017-11-11 23:30:31 -03:00
James Almer 640073eceb Merge commit '9dde6ab06c48f9447cd16f39bee33569cddb7be4'
* commit '9dde6ab06c48f9447cd16f39bee33569cddb7be4':
  arm: Fix SIGBUS on ARM when compiled with binutils 2.29

Merged-by: James Almer <jamrial@gmail.com>
2017-11-11 13:44:07 -03:00
James Almer 921993503b Merge commit 'd7320ca3ed10f0d35b3740fa03341161e74275ea'
* commit 'd7320ca3ed10f0d35b3740fa03341161e74275ea':
  arm: Avoid using .dn register aliases

Merged-by: James Almer <jamrial@gmail.com>
2017-10-30 21:00:51 -03:00
James Almer 62d86c41b7 Merge commit 'ce080f47b8b55ab3d41eb00487b138d9906d114d'
* commit 'ce080f47b8b55ab3d41eb00487b138d9906d114d':
  hevc: Add NEON 32x32 IDCT

Merged-by: James Almer <jamrial@gmail.com>
2017-10-30 19:59:01 -03:00
James Almer e9e7e1cc6b Merge commit '118dd4a321a2d67f67c21b076abd0b4d939ab642'
* commit '118dd4a321a2d67f67c21b076abd0b4d939ab642':
  hevc: 16x16 NEON idct: Use the right element size for loads/stores

Merged-by: James Almer <jamrial@gmail.com>
2017-10-30 19:56:29 -03:00
James Almer 31a4112936 Merge commit 'edbf0fffb15dde7a1de70b05855529d5fc769f14'
* commit 'edbf0fffb15dde7a1de70b05855529d5fc769f14':
  hevc: Add NEON add_residual for bitdepth 10

Merged-by: James Almer <jamrial@gmail.com>
2017-10-30 18:07:31 -03:00
James Almer 05beee44c6 Merge commit 'e1c2453a4fac1f7116244d0d05310935c20887e6'
* commit 'e1c2453a4fac1f7116244d0d05310935c20887e6':
  arm: hevc_idct: Tune the add_res_8x8 and add_res_32x32 functions

Merged-by: James Almer <jamrial@gmail.com>
2017-10-30 17:41:08 -03:00
James Almer 999c2271a5 Merge commit '0d4d43513786f1df4d561e1fac924fb0722c6700'
* commit '0d4d43513786f1df4d561e1fac924fb0722c6700':
  hevc: Add NEON add_residual for bitdepth 8

See 03cecf45c1

Merged-by: James Almer <jamrial@gmail.com>
2017-10-30 17:39:37 -03:00
James Almer f9c3fbc00c Merge commit '3d69dd65c6771c28d3bf4e8e53a905aa8cd01fd9'
* commit '3d69dd65c6771c28d3bf4e8e53a905aa8cd01fd9':
  hevc: Add support for bitdepth 10 for IDCT DC

Merged-by: James Almer <jamrial@gmail.com>
2017-10-30 16:03:27 -03:00
James Almer cc8c2d3609 Merge commit '358adef0305618219522858e471edf7e0cb4043e'
* commit '358adef0305618219522858e471edf7e0cb4043e':
  hevc: Add NEON IDCT DC functions for bitdepth 8

See 03cecf45c1

Merged-by: James Almer <jamrial@gmail.com>
2017-10-30 15:58:40 -03:00
James Almer 9840ca70e7 Merge commit '89d9869d2491d4209d707a8e7f29c58227ae5a4e'
* commit '89d9869d2491d4209d707a8e7f29c58227ae5a4e':
  hevc: Add NEON 16x16 IDCT

Merged-by: James Almer <jamrial@gmail.com>
2017-10-27 18:22:39 -03:00
James Almer c0683dce89 Merge commit '0b9a237b2386ff84a6f99716bd58fa27a1b767e7'
* commit '0b9a237b2386ff84a6f99716bd58fa27a1b767e7':
  hevc: Add NEON 4x4 and 8x8 IDCT

[15:12:59] <@ubitux> hevc_idct_4x4_8_c: 389.1
[15:13:00] <@ubitux> hevc_idct_4x4_8_neon: 126.6
[15:13:02] <@ubitux> our ^
[15:13:06] <@ubitux> hevc_idct_4x4_8_c: 389.3
[15:13:08] <@ubitux> hevc_idct_4x4_8_neon: 107.8
[15:13:10] <@ubitux> hevc_idct_4x4_10_c: 418.6
[15:13:12] <@ubitux> hevc_idct_4x4_10_neon: 108.1
[15:13:14] <@ubitux> libav ^
[15:13:30] <@ubitux> so yeah, we can probably trash our versions here

Merged-by: James Almer <jamrial@gmail.com>
2017-10-24 19:10:22 -03:00
Martin Storsjö b487add7ec arm: Remove a redundant check in fmtconvert_init_arm.c
This was missed in e2710e790c, where checks of the form
have_vfp && !have_vfpv3 were converted into have_vfp_vm.
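
To illustrate why such a leftover check is redundant, here is a minimal C sketch. The flag names and helper functions are hypothetical, and it assumes the combined flag is derived once from "VFP present and VFPv3 absent"; the real flag derivation in the tree may include further conditions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical CPU feature bits, for illustration only. */
enum {
    CPU_FLAG_VFP    = 1 << 0,
    CPU_FLAG_VFPV3  = 1 << 1,
    CPU_FLAG_VFP_VM = 1 << 2   /* derived bit, never set by the caller */
};

/* Old style: two separate tests at every call site. */
static bool old_check(int flags)
{
    return (flags & CPU_FLAG_VFP) && !(flags & CPU_FLAG_VFPV3);
}

/* Derive the combined bit once, when the CPU flags are detected
 * (assumed rule: VFP present and VFPv3 absent). */
static int derive_flags(int flags)
{
    assert(!(flags & CPU_FLAG_VFP_VM));
    if ((flags & CPU_FLAG_VFP) && !(flags & CPU_FLAG_VFPV3))
        flags |= CPU_FLAG_VFP_VM;
    return flags;
}

/* New style: a single test of the derived bit. */
static bool have_vfp_vm(int flags)
{
    return (flags & CPU_FLAG_VFP_VM) != 0;
}

/* The leftover pattern: the second test can never fail once the
 * first one has passed, so it adds nothing. */
static bool redundant_check(int flags)
{
    return have_vfp_vm(flags) && !(flags & CPU_FLAG_VFPV3);
}
```

Under that assumption, redundant_check() and have_vfp_vm() agree for every input, so the extra condition can be dropped.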

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-10-24 09:07:01 +03:00
Martin Storsjö 9dde6ab06c arm: Fix SIGBUS on ARM when compiled with binutils 2.29
In binutils 2.29, the behavior of the ADR instruction changed so that 1 is
added to the address of a Thumb function (previously nothing was added). This
allows the loaded address to be passed to a BLX instruction and the correct
mode change will occur.

See: https://sourceware.org/bugzilla/show_bug.cgi?id=21458

By using adr with a label that isn't annotated as a thumb function,
we avoid the new behaviour in binutils 2.29 and get the same behaviour
as in prior releases, and as in other assemblers (ms armasm.exe,
clang's built in assembler) - an idea that Janne Grunau came up with.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-09-02 22:18:20 +03:00
Muhammad Faiz 0780ad9c68 avcodec/rdft: remove sintable
It is redundant with costable. The first half of sintable is
identical to the second half of costable. The second half
of sintable is the negation of the first half of sintable.

The computation is changed to handle the sign of the sin values, in
both the C code and the ARM assembly code.

Signed-off-by: Muhammad Faiz <mfcc64@gmail.com>
2017-07-11 13:22:02 +07:00
Clément Bœsch b12a36170b lavc/aacpsdsp: use ptrdiff_t for stride in hybrid_analysis 2017-06-28 12:22:39 +02:00
Clément Bœsch e4a27e2f2d lavc/arm: fix lack of precision in ff_ps_stereo_interpolate_neon
The code originally pre-multiplied the steps by 2, causing the running sum
of the h factors to drift due to the lack of precision. This quickly
causes an inaccuracy > 0.01.

I tried various approaches, such as multiplying by 2.0 (instead of adding
the value to itself), without success.

I'm unable to bench the impact of this change, feel free to compare.

This commit fixes the incoming aacpsdsp tests.

Following is an alternative simplified function (matching the incoming
AArch64 code) that may be used:

function ff_ps_stereo_interpolate_neon, export=1
        vld1.32         {q0}, [r2]
        vld1.32         {q1}, [r3]
        ldr             r12, [sp]
        vmov.f32        q8, q0
        vmov.f32        q9, q1
        vzip.32         q8, q0
        vzip.32         q9, q1
1:
        vld1.32         {d4}, [r0,:64]
        vld1.32         {d6}, [r1,:64]
        vadd.f32        q8, q8, q9
        vadd.f32        q0, q0, q1
        vmov.f32        d5, d4
        vmov.f32        d7, d6
        vmul.f32        q2, q2, q8
        vmla.f32        q2, q3, q0
        vst1.32         {d4}, [r0,:64]!
        vst1.32         {d5}, [r1,:64]!
        subs            r12, r12, #1
        bgt             1b
        bx              lr
endfunc
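
For readers less familiar with NEON, the listing above appears to compute the following, written here as a scalar C sketch. The function name and the flat h/h_step parameter layout are assumptions inferred from the register usage, not taken from the tree.

```c
#include <assert.h>

/* l, r:   len complex samples each, stored as (re, im) pairs.
 * h:      four running interpolation factors.
 * h_step: per-sample increment for each factor.
 * The parameter layout is an assumption for this sketch. */
static void ps_stereo_interpolate_ref(float (*l)[2], float (*r)[2],
                                      const float h[4],
                                      const float h_step[4], int len)
{
    float h0 = h[0], h1 = h[1], h2 = h[2], h3 = h[3];

    assert(len >= 0);
    for (int n = 0; n < len; n++) {
        float l_re = l[n][0], l_im = l[n][1];
        float r_re = r[n][0], r_im = r[n][1];

        /* Advance each factor by one step per sample, instead of
         * using a pre-doubled step. */
        h0 += h_step[0];
        h1 += h_step[1];
        h2 += h_step[2];
        h3 += h_step[3];

        l[n][0] = h0 * l_re + h2 * r_re;
        l[n][1] = h0 * l_im + h2 * r_im;
        r[n][0] = h1 * l_re + h3 * r_re;
        r[n][1] = h1 * l_im + h3 * r_im;
    }
}
```

One difference to keep in mind: the assembly loop always runs at least once, whereas this sketch does nothing when len is 0.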
2017-06-28 11:59:34 +02:00
Martin Storsjö d7320ca3ed arm: Avoid using .dn register aliases
clang now (in the upcoming 5.0 version) is capable of building our
arm assembly without relying on gas-preprocessor, although clang/LLVM
doesn't support .dn register aliases.

The VC1 MC assembly was only built and used if the chosen assembler
supported the .dn directives though. This was supported as long as
gas-preprocessor was used.

This means that VC1 decoding got a speed regression on clang 5.0,
unless the user manually chose to use gas-preprocessor again.

By avoiding using the .dn register aliases, we can build the VC1 MC
assembly with the latest clang version.

Support for the .dn/.qn directives in clang/LLVM isn't actively planned,
see https://bugs.llvm.org/show_bug.cgi?id=18199.

This partially reverts 896a5bff64.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-05-15 09:52:18 +03:00
Alexandra Hájková ce080f47b8 hevc: Add NEON 32x32 IDCT
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-05-04 14:08:39 +02:00
Alexandra Hájková 118dd4a321 hevc: 16x16 NEON idct: Use the right element size for loads/stores
This doesn't change the actual behaviour of the code but improves
readability.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-05-04 14:08:27 +02:00
Alexandra Hájková edbf0fffb1 hevc: Add NEON add_residual for bitdepth 10
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-05-01 23:39:55 +03:00
Martin Storsjö e1c2453a4f arm: hevc_idct: Tune the add_res_8x8 and add_res_32x32 functions
Before:              Cortex     A7      A8      A9     A53
hevc_add_res_8x8_8_neon:     116.0    58.7    80.2    90.7
hevc_add_res_32x32_8_neon:  1230.0   737.5  1187.5   974.4
After:
hevc_add_res_8x8_8_neon:      97.7    57.0    73.7    80.0
hevc_add_res_32x32_8_neon:  1216.0   698.7  1127.5   827.1

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-04-28 12:02:14 +03:00
Seppo Tomperi 0d4d435137 hevc: Add NEON add_residual for bitdepth 8
Optimized by Alexandra Hájková.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-04-27 23:05:27 +03:00
Alexandra Hájková 3d69dd65c6 hevc: Add support for bitdepth 10 for IDCT DC
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-04-25 22:48:45 +03:00
Seppo Tomperi 358adef030 hevc: Add NEON IDCT DC functions for bitdepth 8
Signed-off-by: Alexandra Hájková <alexandra@khirnov.net>
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-04-25 22:48:45 +03:00
Alexandra Hájková 89d9869d24 hevc: Add NEON 16x16 IDCT
The speedup vs C code is around 6-13x.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-04-12 22:40:54 +03:00
Ronald S. Bultje 40cbd686dc idct_arm: remove use of ff_put/add_pixels_clamped function pointer.
Instead, hardcode the use of the _arm implementation of add_pixels,
and use the C version for put_pixels (as no arm-optimized version
exists). Since there are separate implementations of idct{,_put,_add}
for neon, this has no practical impact on performance.
2017-04-06 10:03:27 -04:00
Ronald S. Bultje 0c46641784 vp9: split out generic decoding skeleton interface API from VP9 types.
This allows vp9dsp.h to only include the VP9 types header, and not the
decoder skeleton interface which is for hardware decoders (dxva2/vaapi).
2017-03-28 18:04:27 -04:00
Ronald S. Bultje f8c019944d vp9: re-split the decoder/format/dsp interface header files.
The advantage here is that the internal software decoder interface is
not exposed to the DSP functions or the hardware accelerations.
2017-03-28 18:04:26 -04:00
Martin Storsjö fbc6f190a6 arm: Always build the hevcdsp_init_arm.c file
The main hevcdsp.c file calls this init function if HAVE_ARM is set,
regardless of whether neon support is available or not.

This fixes builds where neon isn't supported by the build tools at all.
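
The arrangement described above can be sketched roughly as follows. All names are hypothetical, and the compile-time NEON switch is modelled as a plain parameter so that the sketch stays self-contained; it is not the actual init code.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*idct_fn)(short *coeffs);

typedef struct {
    idct_fn idct_4x4;
} HevcDspSketch;

#define CPU_FLAG_NEON_SKETCH 1   /* hypothetical runtime CPU flag */

static void idct_4x4_c(short *c)         { (void)c; }
static void idct_4x4_neon_stub(short *c) { (void)c; }

/* Always built on ARM. When NEON is unavailable, whether in the
 * build tools or on the CPU, it simply leaves the C pointers alone. */
static void hevc_dsp_init_arm_sketch(HevcDspSketch *c, int cpu_flags,
                                     int build_has_neon)
{
    assert(c != NULL);
    if (build_has_neon && (cpu_flags & CPU_FLAG_NEON_SKETCH))
        c->idct_4x4 = idct_4x4_neon_stub;
}

/* Generic init: installs the C versions, then calls the ARM init
 * unconditionally, so that symbol must exist in every ARM build. */
static void hevc_dsp_init_sketch(HevcDspSketch *c, int cpu_flags,
                                 int build_has_neon)
{
    assert(c != NULL);
    c->idct_4x4 = idct_4x4_c;
    hevc_dsp_init_arm_sketch(c, cpu_flags, build_has_neon);
}
```

If the ARM init file were only built when NEON is supported, the unconditional call in the generic init would fail to link in NEON-less builds.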

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-28 11:36:01 +03:00
Alexandra Hájková 0b9a237b23 hevc: Add NEON 4x4 and 8x8 IDCT
Optimized by Martin Storsjö <martin@martin.st>.

The speedup vs C code is around 3.2-4.4x.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-27 22:56:23 +03:00
Clément Bœsch 1c9f4b5078 lavc/vp9: split into vp9{block,data,mvs}
This is following Libav layout to ease merges.
2017-03-27 21:38:21 +02:00
James Almer 9a0fbb9ca9 Merge commit '2caa93b813adc5dbb7771dfe615da826a2947d18'
* commit '2caa93b813adc5dbb7771dfe615da826a2947d18':
  mpegaudiodsp: Change type of array stride parameters to ptrdiff_t

Merged-by: James Almer <jamrial@gmail.com>
2017-03-21 16:04:22 -03:00
James Almer a8474df944 Merge commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c'
* commit 'e4a94d8b36c48d95a7d412c40d7b558422ff659c':
  h264chroma: Change type of stride parameters to ptrdiff_t

Merged-by: James Almer <jamrial@gmail.com>
2017-03-21 15:20:45 -03:00
James Almer 5a49097b42 Merge commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428'
* commit '2ec9fa5ec60dcd10e1cb10d8b4e4437e634ea428':
  idct: Change type of array stride parameters to ptrdiff_t

Merged-by: James Almer <jamrial@gmail.com>
2017-03-21 14:29:52 -03:00
Clément Bœsch 51b5672f49 Merge commit '92c5755a185086067fe49e7e64c23a8e7011be31'
* commit '92c5755a185086067fe49e7e64c23a8e7011be31':
  hpeldsp: arm: Update comments left behind in 25841dfe80

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-21 15:10:46 +01:00
Clément Bœsch ad98af27f7 Merge commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0'
* commit 'de2ae3c1fae5a2eb539b9abd7bc2a9ca8c286ff0':
  lavc: add clobber tests for the new encoding/decoding API

The merge only reorders what we already have.

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-21 14:43:53 +01:00
Clément Bœsch 83cd80d10a Merge commit '12004a9a7f20e44f4da2ee6c372d5e1794c8d6c5'
* commit '12004a9a7f20e44f4da2ee6c372d5e1794c8d6c5':
  audiodsp/x86: yasmify vector_clipf_sse
  audiodsp: reorder arguments for vector_clipf

Merged the version from Libav after a discussion with James Almer on
IRC:

19:22 <ubitux> jamrial: opinion on 12004a9a7f20e44f4da2ee6c372d5e1794c8d6c5?
19:23 <ubitux> it was apparently yasmified differently
19:23 <ubitux> (it depends on the previous commit arg shuffle)
19:24 <ubitux> i don't see the magic movsxdifnidn in your port btw
19:24 <ubitux> it's a port from 1d36defe94
19:25 <jamrial> seems better thanks to said arg shuffle
19:25 <jamrial> the loop is the same, but init is simpler
19:25 <jamrial> probably worth merging
19:25 <ubitux> OK
19:25 <ubitux> thanks
19:26 <jamrial> curious they didn't make len ptrdiff_t after the previous bunch of commits, heh
19:26 <ubitux> yeah indeed

Both commits are merged at the same time to prevent a conflict with our
existing yasmified ff_vector_clipf_sse.

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-20 22:35:07 +01:00
Clément Bœsch b78243c504 lavc/arm: fix indent in blockdsp_init_neon 2017-03-20 19:01:25 +01:00
Clément Bœsch e07fa3008b Merge commit 'de452e503734ebb0fdbce86e9d16693b3530fad3'
* commit 'de452e503734ebb0fdbce86e9d16693b3530fad3':
  pixblockdsp: Change type of stride parameters to ptrdiff_t

Merged-by: Clément Bœsch <u@pkh.me>
2017-03-20 15:58:32 +01:00
Martin Storsjö eabc5abf94 arm: vp9itxfm16: Do a simpler half/quarter idct16/idct32 when possible
This work is sponsored by, and copyright, Google.

This avoids loading and calculating coefficients that we know will
be zero, and avoids filling the temp buffer with zeros in places
where we know the second pass won't read.

This gives a pretty substantial speedup for the smaller subpartitions.

The code size increases from 14516 bytes to 22484 bytes.

The idct16/32_end macros are moved above the individual functions; the
instructions themselves are unchanged, but since new functions are added
at the same place where the code is moved from, the diff looks rather
messy.

Before:                                 Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    270.7    418.5    295.4
vp9_inv_dct_dct_16x16_sub2_add_10_neon:    3840.2   3244.8   3700.1   2337.9
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4212.5   3575.4   3996.9   2571.6
vp9_inv_dct_dct_16x16_sub8_add_10_neon:    5174.4   4270.5   4615.5   3031.9
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5676.0   4908.5   5226.5   3491.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6403.9   5589.0   5839.8   3948.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1710.7    944.7   1582.1   1045.4
vp9_inv_dct_dct_32x32_sub2_add_10_neon:   21040.7  16706.1  18687.7  13193.1
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22197.7  18282.7  19577.5  13918.6
vp9_inv_dct_dct_32x32_sub8_add_10_neon:   24511.5  20911.5  21472.5  15367.5
vp9_inv_dct_dct_32x32_sub12_add_10_neon:  26939.5  24264.3  23239.1  16830.3
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  29419.5  26845.1  25020.6  18259.9
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31146.4  29633.5  26803.3  19721.7
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33376.3  32507.8  28642.4  21174.2
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35629.4  35439.6  30416.5  22625.7
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37269.9  37914.9  32271.9  24078.9

After:
vp9_inv_dct_dct_16x16_sub1_add_10_neon:     454.0    276.0    418.5    295.1
vp9_inv_dct_dct_16x16_sub2_add_10_neon:    2336.2   1886.0   2251.0   1458.6
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    2531.0   2054.7   2402.8   1591.1
vp9_inv_dct_dct_16x16_sub8_add_10_neon:    3848.6   3491.1   3845.7   2554.8
vp9_inv_dct_dct_16x16_sub12_add_10_neon:   5703.8   4831.6   5230.8   3493.4
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6399.5   5567.0   5832.4   3951.5
vp9_inv_dct_dct_32x32_sub1_add_10_neon:    1722.1    938.5   1577.3   1044.5
vp9_inv_dct_dct_32x32_sub2_add_10_neon:   15003.5  11576.8  13105.8   9602.2
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   15768.5  12677.2  13726.0  10138.1
vp9_inv_dct_dct_32x32_sub8_add_10_neon:   17278.8  14825.4  14907.5  11185.7
vp9_inv_dct_dct_32x32_sub12_add_10_neon:  22335.7  21544.5  20379.5  15019.8
vp9_inv_dct_dct_32x32_sub16_add_10_neon:  24165.6  23881.7  21938.6  16308.2
vp9_inv_dct_dct_32x32_sub20_add_10_neon:  31082.2  30860.9  26835.3  19711.3
vp9_inv_dct_dct_32x32_sub24_add_10_neon:  33102.6  31922.8  28638.3  21161.0
vp9_inv_dct_dct_32x32_sub28_add_10_neon:  35104.9  34867.5  30411.7  22621.2
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37438.1  39103.4  32217.8  24067.6

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:33 +02:00
Martin Storsjö 0ea603203d arm: vp9itxfm16: Make the larger core transforms standalone functions
This work is sponsored by, and copyright, Google.

This reduces the code size of libavcodec/arm/vp9itxfm_16bpp_neon.o from
17500 to 14516 bytes.

This gives a small slowdown of a couple tens of cycles, up to around
150 cycles for the full case of the largest transform, but makes
it more feasible to add more optimized versions of these transforms.

Before:                                 Cortex A7       A8       A9      A53
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4237.4   3561.5   3971.8   2525.3
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6371.9   5452.0   5779.3   3910.5
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22068.8  17867.5  19555.2  13871.6
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37268.9  38684.2  32314.2  23969.0

After:
vp9_inv_dct_dct_16x16_sub4_add_10_neon:    4375.1   3571.9   4283.8   2567.2
vp9_inv_dct_dct_16x16_sub16_add_10_neon:   6415.6   5578.9   5844.6   3948.3
vp9_inv_dct_dct_32x32_sub4_add_10_neon:   22653.7  18079.7  19603.7  13905.3
vp9_inv_dct_dct_32x32_sub32_add_10_neon:  37593.2  38862.2  32235.8  24070.9

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:54:19 +02:00
Martin Storsjö 32e273c111 arm: vp9itxfm16: Avoid reloading the idct32 coefficients
Keep the idct32 coefficients in narrow form in q6-q7, and idct16
coefficients in lengthened 32 bit form in q0-q3. Avoid clobbering
q0-q3 in the pass1 function, and squeeze the idct16 coefficients
into q0-q1 in the pass2 function to avoid reloading them.

The idct16 coefficients are clobbered and reloaded within idct32_odd
though, since that turns out to be faster than narrowing them and
swapping them into q6-q7.

Before:                            Cortex       A7        A8        A9      A53
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    22653.8   18268.4   19598.0  14079.0
vp9_inv_dct_dct_32x32_sub32_add_10_neon:   37699.0   38665.2   32542.3  24472.2
After:
vp9_inv_dct_dct_32x32_sub4_add_10_neon:    22270.8   18159.3   19531.0  13865.0
vp9_inv_dct_dct_32x32_sub32_add_10_neon:   37523.3   37731.6   32181.7  24071.2

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:57 +02:00
Martin Storsjö c1619318e5 arm: vp9itxfm16: Fix vertical alignment
Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:48 +02:00
Martin Storsjö b46d37e93a arm: vp9itxfm16: Use the right lane size
This makes the code slightly clearer, but doesn't make any functional
difference.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:43 +02:00
Martin Storsjö 21c89f3a26 arm/aarch64: vp9: Fix vertical alignment
Align the second/third operands as they usually are.

Due to the wildly varying sizes of the written out operands
in aarch64 assembly, the column alignment is usually not as clear
as in arm assembly.

This is cherrypicked from libav commit
7995ebfad1.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:32 +02:00
Martin Storsjö 70317b25aa arm/aarch64: vp9itxfm: Skip loading the min_eob pointer when it won't be used
In the half/quarter cases where we don't use the min_eob array, defer
loading the pointer until we know it will be needed.

This is cherrypicked from libav commit
3a0d5e206d.

Signed-off-by: Martin Storsjö <martin@martin.st>
2017-03-19 22:53:28 +02:00