mpv/DOCS/tech/dr-methods.txt

DIRECT RENDERING METHODS -- by A'rpi
======================== (based on a mail to -dev-eng)

Ok. It seems none of you really knows what direct rendering means...
I'll try to explain now! :)

At first, there are 2 different way, both called direct rendering.
The main point is the same, but they work different.

method 1: decoding directly to externally provided buffers.
so, the codec decodes macroblocks directly to the buffer provided by teh
caller. as this buffer will be readed later (for MC of next frame) it's not
a good idea to place such buffers in slow video ram. but.
there are many video out drivers using buffers in system ram, and using some
way of memcpy or DMA to blit it to vieo ram at display time.
for example, Xv and X11 (normal and Shm too) are such thingie.
XImage will be a buffer in system ram (!) and X*PutImage will copy it to
video ram. Only nvidia and ati rage128 Xv drivers use DMA, others just
memcpy it. Also some opengl drivers (including Matrox) uses DMA to copy from
subteximage to video ram.
The current mpalyer way mean: codec allocates some buffer, and decode image
to that buffer. then this buffer is copied to X11's buffer. then Xserver
copies this buffer to video ram. So one more memcpy than required...
direct rendering can remove this extar memcpy, and use the Xserver's memory
buffers for decoding buffer. Note again: it helps only if the external
buffer is in fast system ram.

method 2: decoding to internal buffers, but blit after each macroblocks,
including optional colorspace conversion.
advantages: it can blit into video ram, as it keeps the copy in its internal
buffers for next frame's MC. skipped macroblocks won't be copied again to
video ram (except if video buffer address changes between frames -> hw
double/triple buffering)
Just avoiding blitting of skipped MBs mean about 100% speedup (2 times
faster) for low bitrate (<700kbit) divxes. It even makes possible to watch
VCD resolution divx on p200mmx with DGA.
how does it work? the codec works as normally, decodes macroblocks into its
internal buffer. but after each decoded macroblock, it immediatelly copies
this macroblock to the video ram. it's in the L1 cache, so it will be fast.
skipped macroblocks can be skipped easily -> less vram write -> more speedup.
but, as it copies directly to video ram, it must do colorspace conversion if
needed (for example divx -> rgb DGA), and cannot be used with scaling.
another interesting question of such direct rendering is the planar formats.
Eugene K. of Divx4 told me that he experienced worse performance blittig
yv12 blocks (copied 3 blocks to 3 different (Y,U,V) buffers) than doing
(really unneeded) yv12->yuy2 conversion on-the-fly.
so, divx4 codec (with -vc divx4 api) converts from its internal yv12 buffer
to the external yuy2.

libmpeg2 already uses simplified variation of this: when it finish decoding a
slice (a horizontal line of MBs) it copies it to external (video ram) buffer
(using callback to libvo), so at least it copies from L2 cache instead of
slow ram. it gave me 23% -> 20% VOB decoding speedup on p3.

so, again: the main difference between method 1 and 2:
method1 stores decoded data only once: in the external read/write buffer.
method2 stores decoded data twice: in its internal read/write buffer (for
later reading) and in the write-only slow video ram.

both methods can make big speedup, depending on codec behaviour and libvo
driver. for example, IPB mpegs could combine these, use method 2 for I/P
frames and method 1 for B frams. mpeg2dec does already this.
for I-only type video (like mjpeg) method 1 is better. for I/P type video
with MC (like divx, h263 etc) method 2 is the best choice.
for I/P type videos without MC (like FLI, CVID) could use method 1 with
static buffer or method 2 with double/triple buffering.

i hope it is clear now.
and i hope even nick understand what are we talking about...

ah, and at the end, the abilities of codecs:
libmpeg2: can do method 1 and 2 (but slice level copy, not MB level)
vfw, dshow: can do method 2, with static or variable address external buffer
odivx, and most native codecs like fli, cvid, rle: can do method 1
divx4: can do method 2 (with old odivx api it does method 1)
libavcodec, xanim: they currently can't do DR, but they exports their
internal buffers. but it's very easy to implement menthod 1 support,
and a bit harder but possible without any rewrite to do method 2.

so, dshow and divx4 already implements all requirements of method 2.
libmpeg2 implements method 1, and it's easy to add to libavcodec.

anyway, in the ideal world, we need all codecs support both methods.
anyway 2: in ideal world, there are no libvo drivers having buffer in system
ram and memcpy to video ram...
anyway 3: in our really ideal world, all libvo driver has its buffers in
fast sytem ram and does blitting with DMA... :)

============================================================================
MPlayer NOW! -- The libmpcodecs way.

libmpcodecs replaced old draw callbacks with mpi (mplayer image) struct.
steps of decoding with libmpcodecs:
1. codec requests an mpi from libmpcodecs core (vd.c)
2. vd creates an mpi struct filled by codec's requirements (size, stride,
   colorspace, flags, type)
3. vd asks libvo (control(VOCTRL_GET_IMAGE)), if it can provide such buffer:
   - if it can -> do direct rendering
   - it it can not -> allocate system ram area with memalign()/malloc()
   Note: codec may request EXPORT buffer, it means buffer allocation is 
   done inside the codec, so we cannot do DR :(
4. codec decodes frame to the mpi struct (system ram or direct rendering)
5. if it isn't DR, we call libvo's draw functions to blit image to video ram

current possible buffer setups:
- EXPORT - codec handles buffer allocation and it exports its buffer pointers
  used for opendivx, xanim and libavcodec
- STATIC - codec requires a single static buffer with constant preserved content
  used by codecs which do partial updating of image, but doesn't require reading
  of previous frame. most rle-based codecs, like cvid, rle8, qtrle, qtsmc etc.
- TEMP - codec requires a buffer, but it doesn't depend on previous frame at all
  used for I-only codecs (like mjpeg) and for codecs supporting method-2 direct
  rendering with variable buffer address (vfw, dshow, divx4).
- IP - codec requires 2 (or more) read/write buffers. it's for codecs supporting
  method-1 direct rendering but using motion compensation (ie. reading from
  previous frame buffer). could be used for libavcodec (divx3/4,h263) later.
  IP buffer stays from 2 (or more) STATIC buffers.
- IPB - similar to IP, but also have one (or more) TEMP buffers for B frames.
  it will be used for libmpeg2 and libavcodec (mpeg1/2/4).
  IPB buffer stays from 2 (or more) STATIC buffers and 1 (or more) TEMP buffer.
dr + libmpcodecs git-svn-id: svn://svn.mplayerhq.hu/mplayer/trunk@4995 b3059339-0415-0410-9bf9-f77b7e298cf2 2002-03-08 20:17:33 +00:00			`DIRECT RENDERING METHODS -- by A'rpi`
			`======================== (based on a mail to -dev-eng)`

			`Ok. It seems none of you really knows what direct rendering means...`
			`I'll try to explain now! :)`

			`At first, there are 2 different way, both called direct rendering.`
			`The main point is the same, but they work different.`

			`method 1: decoding directly to externally provided buffers.`
			`so, the codec decodes macroblocks directly to the buffer provided by teh`
			`caller. as this buffer will be readed later (for MC of next frame) it's not`
			`a good idea to place such buffers in slow video ram. but.`
			`there are many video out drivers using buffers in system ram, and using some`
			`way of memcpy or DMA to blit it to vieo ram at display time.`
			`for example, Xv and X11 (normal and Shm too) are such thingie.`
			`XImage will be a buffer in system ram (!) and X*PutImage will copy it to`
			`video ram. Only nvidia and ati rage128 Xv drivers use DMA, others just`
			`memcpy it. Also some opengl drivers (including Matrox) uses DMA to copy from`
			`subteximage to video ram.`
			`The current mpalyer way mean: codec allocates some buffer, and decode image`
			`to that buffer. then this buffer is copied to X11's buffer. then Xserver`
			`copies this buffer to video ram. So one more memcpy than required...`
			`direct rendering can remove this extar memcpy, and use the Xserver's memory`
			`buffers for decoding buffer. Note again: it helps only if the external`
			`buffer is in fast system ram.`

			`method 2: decoding to internal buffers, but blit after each macroblocks,`
			`including optional colorspace conversion.`
			`advantages: it can blit into video ram, as it keeps the copy in its internal`
			`buffers for next frame's MC. skipped macroblocks won't be copied again to`
			`video ram (except if video buffer address changes between frames -> hw`
			`double/triple buffering)`
			`Just avoiding blitting of skipped MBs mean about 100% speedup (2 times`
			`faster) for low bitrate (<700kbit) divxes. It even makes possible to watch`
			`VCD resolution divx on p200mmx with DGA.`
			`how does it work? the codec works as normally, decodes macroblocks into its`
			`internal buffer. but after each decoded macroblock, it immediatelly copies`
			`this macroblock to the video ram. it's in the L1 cache, so it will be fast.`
			`skipped macroblocks can be skipped easily -> less vram write -> more speedup.`
			`but, as it copies directly to video ram, it must do colorspace conversion if`
			`needed (for example divx -> rgb DGA), and cannot be used with scaling.`
			`another interesting question of such direct rendering is the planar formats.`
			`Eugene K. of Divx4 told me that he experienced worse performance blittig`
			`yv12 blocks (copied 3 blocks to 3 different (Y,U,V) buffers) than doing`
			`(really unneeded) yv12->yuy2 conversion on-the-fly.`
			`so, divx4 codec (with -vc divx4 api) converts from its internal yv12 buffer`
			`to the external yuy2.`

			`libmpeg2 already uses simplified variation of this: when it finish decoding a`
			`slice (a horizontal line of MBs) it copies it to external (video ram) buffer`
			`(using callback to libvo), so at least it copies from L2 cache instead of`
			`slow ram. it gave me 23% -> 20% VOB decoding speedup on p3.`

			`so, again: the main difference between method 1 and 2:`
			`method1 stores decoded data only once: in the external read/write buffer.`
			`method2 stores decoded data twice: in its internal read/write buffer (for`
			`later reading) and in the write-only slow video ram.`

			`both methods can make big speedup, depending on codec behaviour and libvo`
			`driver. for example, IPB mpegs could combine these, use method 2 for I/P`
			`frames and method 1 for B frams. mpeg2dec does already this.`
			`for I-only type video (like mjpeg) method 1 is better. for I/P type video`
			`with MC (like divx, h263 etc) method 2 is the best choice.`
			`for I/P type videos without MC (like FLI, CVID) could use method 1 with`
			`static buffer or method 2 with double/triple buffering.`

			`i hope it is clear now.`
			`and i hope even nick understand what are we talking about...`

			`ah, and at the end, the abilities of codecs:`
			`libmpeg2: can do method 1 and 2 (but slice level copy, not MB level)`
			`vfw, dshow: can do method 2, with static or variable address external buffer`
			`odivx, and most native codecs like fli, cvid, rle: can do method 1`
			`divx4: can do method 2 (with old odivx api it does method 1)`
			`libavcodec, xanim: they currently can't do DR, but they exports their`
			`internal buffers. but it's very easy to implement menthod 1 support,`
			`and a bit harder but possible without any rewrite to do method 2.`

			`so, dshow and divx4 already implements all requirements of method 2.`
			`libmpeg2 implements method 1, and it's easy to add to libavcodec.`

			`anyway, in the ideal world, we need all codecs support both methods.`
			`anyway 2: in ideal world, there are no libvo drivers having buffer in system`
			`ram and memcpy to video ram...`
			`anyway 3: in our really ideal world, all libvo driver has its buffers in`
			`fast sytem ram and does blitting with DMA... :)`

			`============================================================================`
			`MPlayer NOW! -- The libmpcodecs way.`

			`libmpcodecs replaced old draw callbacks with mpi (mplayer image) struct.`
			`steps of decoding with libmpcodecs:`
			`1. codec requests an mpi from libmpcodecs core (vd.c)`
			`2. vd creates an mpi struct filled by codec's requirements (size, stride,`
			`colorspace, flags, type)`
			`3. vd asks libvo (control(VOCTRL_GET_IMAGE)), if it can provide such buffer:`
			`- if it can -> do direct rendering`
			`- it it can not -> allocate system ram area with memalign()/malloc()`
			`Note: codec may request EXPORT buffer, it means buffer allocation is`
			`done inside the codec, so we cannot do DR :(`
			`4. codec decodes frame to the mpi struct (system ram or direct rendering)`
			`5. if it isn't DR, we call libvo's draw functions to blit image to video ram`

			`current possible buffer setups:`
			`- EXPORT - codec handles buffer allocation and it exports its buffer pointers`
			`used for opendivx, xanim and libavcodec`
			`- STATIC - codec requires a single static buffer with constant preserved content`
			`used by codecs which do partial updating of image, but doesn't require reading`
			`of previous frame. most rle-based codecs, like cvid, rle8, qtrle, qtsmc etc.`
			`- TEMP - codec requires a buffer, but it doesn't depend on previous frame at all`
			`used for I-only codecs (like mjpeg) and for codecs supporting method-2 direct`
			`rendering with variable buffer address (vfw, dshow, divx4).`
			`- IP - codec requires 2 (or more) read/write buffers. it's for codecs supporting`
			`method-1 direct rendering but using motion compensation (ie. reading from`
			`previous frame buffer). could be used for libavcodec (divx3/4,h263) later.`
			`IP buffer stays from 2 (or more) STATIC buffers.`
			`- IPB - similar to IP, but also have one (or more) TEMP buffers for B frames.`
			`it will be used for libmpeg2 and libavcodec (mpeg1/2/4).`
			`IPB buffer stays from 2 (or more) STATIC buffers and 1 (or more) TEMP buffer.`