mirror of https://github.com/mpv-player/mpv
130 lines
7.1 KiB
Plaintext
130 lines
7.1 KiB
Plaintext
DIRECT RENDERING METHODS -- by A'rpi
|
|
======================== (based on a mail to -dev-eng)
|
|
|
|
Ok. It seems none of you really knows what direct rendering means...
|
|
I'll try to explain now! :)
|
|
|
|
At first, there are 2 different way, both called direct rendering.
|
|
The main point is the same, but they work different.
|
|
|
|
method 1: decoding directly to externally provided buffers.
|
|
so, the codec decodes macroblocks directly to the buffer provided by the
|
|
caller. as this buffer will be read later (for MC of next frame) it's not
|
|
a good idea to place such buffers in slow video ram. but.
|
|
there are many video out drivers using buffers in system ram, and using some
|
|
way of memcpy or DMA to blit it to video ram at display time.
|
|
for example, Xv and X11 (normal and Shm too) are such thingie.
|
|
XImage will be a buffer in system ram (!) and X*PutImage will copy it to
|
|
video ram. Only nvidia and ati rage128 Xv drivers use DMA, others just
|
|
memcpy it. Also some opengl drivers (including Matrox) uses DMA to copy from
|
|
texsubimage to video ram.
|
|
The current mplayer way mean: codec allocates some buffer, and decode image
|
|
to that buffer. then this buffer is copied to X11's buffer. then Xserver
|
|
copies this buffer to video ram. So one more memcpy than required...
|
|
direct rendering can remove this extra memcpy, and use Xserver's memory
|
|
buffers for decoding buffer. Note again: it helps only if the external
|
|
buffer is in fast system ram.
|
|
|
|
method 2: decoding to internal buffers, but blit after each macroblocks,
|
|
including optional colorspace conversion.
|
|
advantages: it can blit into video ram, as it keeps the copy in its internal
|
|
buffers for next frame's MC. skipped macroblocks won't be copied again to
|
|
video ram (except if video buffer address changes between frames -> hw
|
|
double/triple buffering)
|
|
Just avoiding blitting of skipped MBs mean about 100% speedup (2 times
|
|
faster) for low bitrate (<700kbit) divxes. It even makes possible to watch
|
|
VCD resolution divx on p200mmx with DGA.
|
|
how does it work? the codec works as normally, decodes macroblocks into its
|
|
internal buffer. but after each decoded macroblock, it immediatelly copies
|
|
this macroblock to the video ram. it's in the L1 cache, so it will be fast.
|
|
skipped macroblocks can be skipped easily -> less vram write -> more speedup.
|
|
but, as it copies directly to video ram, it must do colorspace conversion if
|
|
needed (for example divx -> rgb DGA), and cannot be used with scaling.
|
|
another interesting question of such direct rendering is the planar formats.
|
|
Eugene K. of Divx4 told me that he experienced worse performance blittig
|
|
yv12 blocks (copied 3 blocks to 3 different (Y,U,V) buffers) than doing
|
|
(really unneeded) yv12->yuy2 conversion on-the-fly.
|
|
so, divx4 codec (with -vc divx4 api) converts from its internal yv12 buffer
|
|
to the external yuy2.
|
|
|
|
method 2a:
|
|
libmpeg2 already uses simplified variation of this: when it finish decoding a
|
|
slice (a horizontal line of MBs) it copies it to external (video ram) buffer
|
|
(using callback to libvo), so at least it copies from L2 cache instead of
|
|
slow ram. for non-predictive (B) frames it can re-use this cached memory
|
|
for the next slice - so it uses less memory and has better cache utilization:
|
|
it gave me 23% -> 20% VOB decoding speedup on p3. libavcodec supports
|
|
per-slice callbacks too, but no slice-memory reusing for B frames yet.
|
|
|
|
method 2b:
|
|
some codecs (indeo vfw 3/4 using IF09, and libavcodec) can export the 'bitmap'
|
|
of skipped macroblocks - so libvo driver can do selective blitting: copy only
|
|
the changed macroblocks to slow vram.
|
|
|
|
so, again: the main difference between method 1 and 2:
|
|
method1 stores decoded data only once: in the external read/write buffer.
|
|
method2 stores decoded data twice: in its internal read/write buffer (for
|
|
later reading) and in the write-only slow video ram.
|
|
|
|
both methods can make big speedup, depending on codec behaviour and libvo
|
|
driver. for example, IPB mpegs could combine these, use method 2 for I/P
|
|
frames and method 1 for B frams. mpeg2dec does already this.
|
|
for I-only type video (like mjpeg) method 1 is better. for I/P type video
|
|
with MC (like divx, h263 etc) method 2 is the best choice.
|
|
for I/P type videos without MC (like FLI, CVID) could use method 1 with
|
|
static buffer or method 2 with double/triple buffering.
|
|
|
|
i hope it is clear now.
|
|
and i hope even nick understand what are we talking about...
|
|
|
|
ah, and at the end, the abilities of codecs:
|
|
libmpeg2,libavcodec: can do method 1 and 2 (but slice level copy, not MB level)
|
|
vfw, dshow: can do method 2, with static or variable address external buffer
|
|
odivx, and most native codecs like fli, cvid, rle: can do method 1
|
|
divx4: can do method 2 (with old odivx api it does method 1)
|
|
xanim: they currently can't do DR, but they exports their
|
|
internal buffers. but it's very easy to implement menthod 1 support,
|
|
and a bit harder but possible without any rewrite to do method 2.
|
|
|
|
so, dshow and divx4 already implements all requirements of method 2.
|
|
libmpeg2 and libavcodec implements method 1 and 2a (lavc 2b too)
|
|
|
|
anyway, in the ideal world, we need all codecs support both methods.
|
|
anyway 2: in ideal world, there are no libvo drivers having buffer in system
|
|
ram and memcpy to video ram...
|
|
anyway 3: in our really ideal world, all libvo driver has its buffers in
|
|
fast sytem ram and does blitting with DMA... :)
|
|
|
|
============================================================================
|
|
MPlayer NOW! -- The libmpcodecs way.
|
|
|
|
libmpcodecs replaced old draw callbacks with mpi (mplayer image) struct.
|
|
steps of decoding with libmpcodecs:
|
|
1. codec requests an mpi from libmpcodecs core (vd.c)
|
|
2. vd creates an mpi struct filled by codec's requirements (size, stride,
|
|
colorspace, flags, type)
|
|
3. vd asks libvo (control(VOCTRL_GET_IMAGE)), if it can provide such buffer:
|
|
- if it can -> do direct rendering
|
|
- it it can not -> allocate system ram area with memalign()/malloc()
|
|
Note: codec may request EXPORT buffer, it means buffer allocation is
|
|
done inside the codec, so we cannot do DR :(
|
|
4. codec decodes one frame to the mpi struct (system ram or direct rendering)
|
|
5. if it isn't DR, we call libvo's draw functions to blit image to video ram
|
|
|
|
current possible buffer setups:
|
|
- EXPORT - codec handles buffer allocation and it exports its buffer pointers
|
|
used for opendivx, xanim and libavcodec
|
|
- STATIC - codec requires a single static buffer with constant preserved content
|
|
used by codecs which do partial updating of image, but doesn't require reading
|
|
of previous frame. most rle-based codecs, like cvid, rle8, qtrle, qtsmc etc.
|
|
- TEMP - codec requires a buffer, but it doesn't depend on previous frame at all
|
|
used for I-only codecs (like mjpeg) and for codecs supporting method-2 direct
|
|
rendering with variable buffer address (vfw, dshow, divx4).
|
|
- IP - codec requires 2 (or more) read/write buffers. it's for codecs supporting
|
|
method-1 direct rendering but using motion compensation (ie. reading from
|
|
previous frame buffer). could be used for libavcodec (divx3/4,h263).
|
|
IP buffer stays from 2 (or more) STATIC buffers.
|
|
- IPB - similar to IP, but also have one (or more) TEMP buffers for B frames.
|
|
it will be used for libmpeg2 and libavcodec (mpeg1/2/4).
|
|
IPB buffer stays from 2 (or more) STATIC buffers and 1 (or more) TEMP buffer.
|