Commit Graph

10 Commits

Author SHA1 Message Date
Diego Biurrun e2b5b09789 x86: rv40dsp: Use PAVGB instruction macro where appropriate 2013-11-04 21:14:39 +01:00
Diego Biurrun 26301caaa1 x86: mmx2 ---> mmxext in asm constructs 2012-11-14 00:58:51 +01:00
Diego Biurrun 04581c8c77 x86: yasm: Use complete source path for macro helper %includes
This is more consistent with the way we handle C #includes and
it simplifies the build system.
2012-10-31 00:37:42 +01:00
Diego Biurrun 6860b4081d x86: include x86inc.asm in x86util.asm
This is necessary to allow refactoring some x86util macros with cpuflags.
2012-10-31 00:37:42 +01:00
Mans Rullgard 2b140a3d09 x86: use 32-bit source registers with movd instruction
yasm tolerates mismatch between movd/movq and source register size,
adjusting the instruction according to the register.  nasm is more
strict.

Signed-off-by: Mans Rullgard <mans@mansr.com>
2012-08-07 15:21:20 +01:00
Michael Kostylev 6797d1948b x86: rv40: Mark rv40_weight functions as MMX2; they use MMX2 instructions. 2012-05-15 23:54:08 +02:00
Christophe Gisquet 110d0cdc9d rv40dsp x86: MMX/MMX2/3DNow/SSE2/SSSE3 implementations of MC
Code mostly inspired by vp8's MC, however:
- its MMX2 horizontal filter is worse because it can't take advantage of
  the coefficient redundancy
- that same coefficient redundancy allows better code for non-SSSE3 versions

Benchmark (rounded to tens of unit):
        V8x8  H8x8  2D8x8  V16x16  H16x16  2D16x16
C       445    358   985    1785    1559    3280
MMX*    219    271   478     714     929    1443
SSE2    131    158   294     425     515     892
SSSE3   120    122   248     387     390     763

End result is overall around a 15% speedup for SSSE3 version (on 6 sequences);
all loop filter functions now take around 55% of decoding time, while luma MC
dsp functions are around 6%, chroma ones are 1.3% and biweight around 2.3%.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-05-10 18:42:43 +02:00
Christophe GISQUET 2130bd8f5b rv40dsp x86: use only one register, for both increment and loop counter
Around 10 cycles faster for luma.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-10 10:07:09 -07:00
Christophe GISQUET 272b252c01 rv40dsp: implement prescaled versions for biweight.
Quite often, the original weights are multiple of 512. By prescaling them
by 1/512 when they are computed (once per frame), no intermediate shifting
is needed, and no prescaling on each call either.

The x86 code already used that trick.

Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
2012-04-10 10:06:48 -07:00
Christophe Gisquet e5c9de2ab7 rv40: x86 SIMD for biweight
Provide MMX, SSE2 and SSSE3 versions, with a fast-path when the weights are
multiples of 512 (which is often the case when the values round up nicely).

*_TIMER report for the 16x16 and 8x8 cases:
C:
9015 decicycles in 16, 524257 runs, 31 skips
2656 decicycles in 8, 524271 runs, 17 skips
MMX:
4156 decicycles in 16, 262090 runs, 54 skips
1206 decicycles in 8, 262131 runs, 13 skips
MMX on fast-path:
2760 decicycles in 16, 524222 runs, 66 skips
995 decicycles in 8, 524252 runs, 36 skips
SSE2:
2163 decicycles in 16, 262131 runs, 13 skips
832 decicycles in 8, 262137 runs, 7 skips
SSE2 with fast path:
1783 decicycles in 16, 524276 runs, 12 skips
711 decicycles in 8, 524283 runs, 5 skips
SSSE3:
2117 decicycles in 16, 262136 runs, 8 skips
814 decicycles in 8, 262143 runs, 1 skips
SSSE3 with fast path:
1315 decicycles in 16, 524285 runs, 3 skips
578 decicycles in 8, 524286 runs, 2 skips

This means around a 4% speedup for some sequences.

Signed-off-by: Diego Biurrun <diego@biurrun.de>
2012-01-30 23:58:25 +01:00