diff --git a/DOCS/tech/dr-methods.txt b/DOCS/tech/dr-methods.txt
new file mode 100644
index 0000000000..c0e2d8126f
--- /dev/null
+++ b/DOCS/tech/dr-methods.txt
@@ -0,0 +1,120 @@
+DIRECT RENDERING METHODS -- by A'rpi
+======================== (based on a mail to -dev-eng)
+
+Ok. It seems none of you really knows what direct rendering means...
+I'll try to explain now! :)
+
+First of all, there are 2 different ways, both called direct rendering.
+The main point is the same, but they work differently.
+
+method 1: decoding directly to externally provided buffers.
+The codec decodes macroblocks directly into the buffer provided by the
+caller. As this buffer will be read back later (for the MC of the next
+frame), it's not a good idea to place such buffers in slow video ram.
+But: there are many video out drivers that keep their buffers in system
+ram and use some kind of memcpy or DMA to blit them to video ram at
+display time. For example, Xv and X11 (both normal and Shm) are like
+this: the XImage is a buffer in system ram (!) and X*PutImage copies it
+to video ram. Only the nvidia and ati rage128 Xv drivers use DMA, the
+others just memcpy it. Some opengl drivers (including Matrox) also use
+DMA to copy from the TexSubImage to video ram.
+The current MPlayer way means: the codec allocates some buffer, and
+decodes the image into that buffer. Then this buffer is copied to X11's
+buffer, and then the Xserver copies that buffer to video ram. So there is
+one more memcpy than required...
+Direct rendering can remove this extra memcpy, using the Xserver's memory
+buffer as the decoding buffer. Note again: it helps only if the external
+buffer is in fast system ram.
+
+method 2: decoding to internal buffers, but blitting after each macroblock,
+including optional colorspace conversion.
+Advantages: it can blit into video ram, as it keeps a copy in its internal
+buffers for the next frame's MC. Skipped macroblocks won't be copied to
+video ram again (except if the video buffer address changes between frames
+-> hw double/triple buffering).
+Just avoiding the blitting of skipped MBs means about 100% speedup (2 times
+faster) for low bitrate (<700kbit) divxes. It even makes it possible to
+watch VCD resolution divx on a p200mmx with DGA.
+How does it work? The codec works as usual, decoding macroblocks into its
+internal buffer. But after each decoded macroblock, it immediately copies
+that macroblock to video ram. It's still in the L1 cache, so this will be
+fast. Skipped macroblocks can be skipped easily -> fewer vram writes ->
+more speedup. But as it copies directly to video ram, it must do colorspace
+conversion if needed (for example divx -> rgb for DGA), and it cannot be
+used with scaling.
+Another interesting question with this kind of direct rendering is planar
+formats. Eugene K. of Divx4 told me that he experienced worse performance
+blitting yv12 blocks (copying 3 blocks to 3 different (Y,U,V) buffers)
+than doing a (really unneeded) yv12->yuy2 conversion on the fly.
+So the divx4 codec (with the -vc divx4 API) converts from its internal
+yv12 buffer to the external yuy2 one.
+
+libmpeg2 already uses a simplified variation of this: when it finishes
+decoding a slice (a horizontal row of MBs), it copies it to the external
+(video ram) buffer (using a callback to libvo), so at least it copies from
+L2 cache instead of slow ram. It gave me a 23% -> 20% VOB decoding speedup
+on a p3.
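+
+To make the copy-while-cached idea concrete, here is a minimal sketch of
+method 2 at slice level, in the spirit of the libmpeg2 callback described
+above. All names (dec_ctx, decode_slice, ...) are invented for
+illustration; this is not the real libmpeg2/MPlayer interface:
+
+  #include <string.h>
+
+  typedef struct {
+      unsigned char *internal_buf; /* codec's own r/w buffer (kept for MC) */
+      unsigned char *vram;         /* external, possibly write-only video ram */
+      int stride;                  /* bytes per line, same for both buffers */
+      int slice_h;                 /* height of one slice, in lines */
+      int height;                  /* frame height (multiple of slice_h here) */
+  } dec_ctx;
+
+  /* placeholder for the real decoder: fills one slice of internal_buf.
+     a real codec would do VLD+IDCT+MC here. */
+  static void decode_slice(dec_ctx *c, int y)
+  {
+      memset(c->internal_buf + y * c->stride, 0, c->slice_h * c->stride);
+  }
+
+  static void decode_frame_method2(dec_ctx *c)
+  {
+      int y;
+      for (y = 0; y < c->height; y += c->slice_h) {
+          decode_slice(c, y);
+          /* blit the fresh slice now, while it is still in the cache: */
+          memcpy(c->vram + y * c->stride,
+                 c->internal_buf + y * c->stride,
+                 c->slice_h * c->stride);
+      }
+      /* internal_buf stays intact -> the next frame's MC can read it,
+         and skipped slices/MBs simply would not be memcpy'd again */
+  }
+
+Method 1 is the degenerate case of this: decode_slice() writing straight
+into the caller's buffer, with no memcpy at all.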
+
+So, again, the main difference between method 1 and 2:
+method1 stores decoded data only once: in the external read/write buffer.
+method2 stores decoded data twice: in its internal read/write buffer (for
+later reading) and in the write-only slow video ram.
+
+Both methods can give a big speedup, depending on the codec behaviour and
+the libvo driver. For example, IPB mpegs could combine them: use method 2
+for I/P frames and method 1 for B frames. mpeg2dec already does this.
+For I-only type video (like mjpeg), method 1 is better. For I/P type video
+with MC (like divx, h263 etc.), method 2 is the best choice. I/P type
+videos without MC (like FLI, CVID) could use method 1 with a static buffer,
+or method 2 with double/triple buffering.
+
+I hope it is clear now.
+And I hope even Nick understands what we are talking about...
+
+Ah, and at the end, the abilities of the codecs:
+libmpeg2: can do method 1 and 2 (but with slice level copy, not MB level)
+vfw, dshow: can do method 2, with a static or variable address external buffer
+odivx, and most native codecs like fli, cvid, rle: can do method 1
+divx4: can do method 2 (with the old odivx api it does method 1)
+libavcodec, xanim: they currently can't do DR, but they export their
+internal buffers. It's very easy to implement method 1 support for them,
+and a bit harder, but possible without any rewrite, to do method 2.
+
+So dshow and divx4 already implement all the requirements of method 2.
+libmpeg2 implements method 1, and it's easy to add to libavcodec.
+
+Anyway, in an ideal world we'd need all codecs to support both methods.
+Anyway 2: in an ideal world there are no libvo drivers keeping their
+buffer in system ram and memcpy'ing it to video ram...
+Anyway 3: in our really ideal world, all libvo drivers have their buffers
+in fast system ram and do the blitting with DMA... :)
+
+============================================================================
+MPlayer NOW! -- The libmpcodecs way.
+
+libmpcodecs replaced the old draw callbacks with the mpi (mplayer image)
+struct. The steps of decoding with libmpcodecs:
+1. the codec requests an mpi from the libmpcodecs core (vd.c)
+2. vd creates an mpi struct filled in with the codec's requirements (size,
+   stride, colorspace, flags, type)
+3. vd asks libvo (control(VOCTRL_GET_IMAGE)) if it can provide such a buffer:
+   - if it can -> do direct rendering
+   - if it cannot -> allocate a system ram area with memalign()/malloc()
+   Note: the codec may request an EXPORT buffer, which means buffer
+   allocation is done inside the codec, so we cannot do DR :(
+4. the codec decodes the frame into the mpi struct (system ram or direct
+   rendering)
+5. if it isn't DR, we call libvo's draw functions to blit the image to
+   video ram
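+
+A rough sketch of step 3 as seen from vd's side. VOCTRL_GET_IMAGE and the
+EXPORT idea above are real, but the structs and the get_buffer() helper
+below are heavily simplified/invented for illustration (the real
+mp_image_t and vo_functions_t have many more fields):
+
+  #include <stdlib.h>
+
+  #define VOCTRL_GET_IMAGE 1       /* request id; arbitrary value here */
+  #define VO_TRUE          1
+
+  enum { IMGTYPE_EXPORT, IMGTYPE_STATIC, IMGTYPE_TEMP,
+         IMGTYPE_IP, IMGTYPE_IPB };
+
+  typedef struct mp_image {
+      int width, height, stride;
+      int type;                    /* one of the IMGTYPE_* above */
+      int direct;                  /* set if planes point into vo's buffer */
+      unsigned char *planes[3];
+  } mp_image_t;
+
+  typedef struct vo_functions {
+      /* on success, fills mpi->planes with pointers into its own buffer */
+      int (*control)(int request, void *data);
+  } vo_functions_t;
+
+  static mp_image_t *get_buffer(vo_functions_t *vo, mp_image_t *mpi)
+  {
+      /* EXPORT type: the codec allocates internally, no DR possible :( */
+      if (mpi->type != IMGTYPE_EXPORT &&
+          vo->control(VOCTRL_GET_IMAGE, mpi) == VO_TRUE) {
+          mpi->direct = 1;         /* step 4 decodes right into vo's ram */
+          return mpi;
+      }
+      /* fallback: plain buffer in system ram (only plane 0 shown here,
+         a real implementation would allocate/align all planes) */
+      mpi->direct = 0;
+      mpi->planes[0] = malloc(mpi->stride * mpi->height);
+      return mpi;
+  }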
+
+Current possible buffer setups:
+- EXPORT - the codec handles buffer allocation itself and exports its
+  buffer pointers. Used for opendivx, xanim and libavcodec.
+- STATIC - the codec requires a single static buffer with preserved
+  content. Used by codecs which do partial updating of the image but don't
+  require reading the previous frame: most RLE-based codecs, like cvid,
+  rle8, qtrle, qtsmc etc.
+- TEMP - the codec requires a buffer, but it doesn't depend on the previous
+  frame at all. Used for I-only codecs (like mjpeg) and for codecs
+  supporting method-2 direct rendering with a variable buffer address
+  (vfw, dshow, divx4).
+- IP - the codec requires 2 (or more) read/write buffers. It's for codecs
+  supporting method-1 direct rendering but using motion compensation (i.e.
+  reading from the previous frame's buffer). Could be used for libavcodec
+  (divx3/4, h263) later. An IP setup consists of 2 (or more) STATIC
+  buffers (see the sketch below).
+- IPB - similar to IP, but it also has one (or more) TEMP buffers for the
+  B frames. It will be used for libmpeg2 and libavcodec (mpeg1/2/4). An
+  IPB setup consists of 2 (or more) STATIC buffers and 1 (or more) TEMP
+  buffers.
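+
+To illustrate the IP case, a minimal sketch of how 2 STATIC buffers can be
+rotated between frames (the names are invented; real codecs track this in
+their own context structs):
+
+  /* two STATIC read/write buffers: the codec writes the current frame
+     into one while MC reads the previous frame from the other. */
+  typedef struct {
+      unsigned char *buf[2];
+      int cur;                     /* index of the buffer being written */
+  } ip_buffers;
+
+  static unsigned char *ip_write_buf(ip_buffers *ip)
+  {
+      ip->cur ^= 1;                /* old "current" becomes "previous" */
+      return ip->buf[ip->cur];
+  }
+
+  static unsigned char *ip_prev_buf(ip_buffers *ip)
+  {
+      return ip->buf[ip->cur ^ 1]; /* MC reads from here */
+  }
+
+The IPB case adds one (or more) TEMP buffers on top of this for the B
+frames: they are never read back for MC, so they can go straight into an
+external buffer (that's why IPB mpegs can use method 1 for the B frames).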