Hardware decoding interops often need access to additional handles from
the windowing system, such as the X11 or Wayland display when using
vaapi. The opengl-cb API had nothing dedicated for this, and used the
weird GL_MP_MPGetNativeDisplay pseudo GL extension (which was
mpv-specific and not officially registered with OpenGL).
This was awkward, and a pain due to having to emulate GL context
behavior (like needing a TLS variable to store context for the pseudo
GL extension function). In addition (and not inherently due to this),
we could pass only one resource from mpv's builtin context backends to
hwdecs. It was also all GL-specific.
Replace this with a newer mechanism. It works for all RA backends, not
just GL. The API user can explicitly pass the objects at init time via
mpv_render_context_create(). Multiple resources are naturally possible.
The API uses MPV_RENDER_PARAM_* defines, but internally we use strings.
This is done for 2 reasons: 1. trying to leave libmpv and internal
mechanisms decoupled, 2. not having to add public API for some of the
internal resource types (especially D3D/GL interop stuff).
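As a rough illustration, an API user on X11 could pass the display at
init time along these lines (a sketch using names from the libmpv
render API headers; not code from this commit, and the exact set of
params may differ):
    #include <mpv/render_gl.h>
    // 'mpv' is the mpv_handle; the loader and display come from the user.
    mpv_opengl_init_params gl_init = {
        .get_proc_address = my_get_proc_address,
    };
    mpv_render_param params[] = {
        {MPV_RENDER_PARAM_API_TYPE, MPV_RENDER_API_TYPE_OPENGL},
        {MPV_RENDER_PARAM_OPENGL_INIT_PARAMS, &gl_init},
        {MPV_RENDER_PARAM_X11_DISPLAY, x11_display}, // extra native resource
        {0}
    };
    mpv_render_context *ctx;
    int err = mpv_render_context_create(&ctx, mpv, params);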
To remain sane, drop support for obscure half-working opengl-cb things,
like the DRM interop (was missing necessary things), the RPI window
thing (nobody used it), and obscure D3D interop things (not needed with
ANGLE, others were undocumented). In order not to break ABI and the C
API, we don't remove the associated structs from opengl_cb.h.
The parts which are still needed (in particular DRM interop) need to be
ported to the render API.
This is done in several steps:
1. refactor MPGLContext -> struct ra_ctx
2. move GL-specific stuff in vo_opengl into opengl/context.c
3. generalize context creation to support other APIs, and add --gpu-api
4. rename all of the --opengl- options that are no longer opengl-specific
5. move all of the stuff from opengl/* that isn't GL-specific into gpu/
(note: opengl/gl_utils.h became opengl/utils.h)
6. rename vo_opengl to vo_gpu
7. to handle window screenshots, the short-term approach was to just add
it to ra_swapchain_fns. Long term (and for vulkan) this has to be moved
to ra itself (and vo_gpu altered to compensate), but this was a stop-gap
measure to prevent this commit from getting too big
8. move ra->fns->flush to ra_gl_ctx instead
9. some other minor changes that I've probably already forgotten
Note: This is one half of a major refactor, the other half of which is
provided by rossy's following commit. This commit enables support for
all Linux platforms, while his version enables support for all
non-Linux platforms.
Note 2: vo_opengl_cb.c also re-uses ra_gl_ctx so it benefits from the
--opengl- options like --opengl-early-flush, --opengl-finish etc. Should
be a strict superset of the old functionality.
Disclaimer: Since I have no way of compiling mpv on all platforms, some
of these ports were done blindly. Specifically, the blind ports included
context_mali_fbdev.c and context_rpi.c. Since they're both based on
egl_helpers, the port should have gone smoothly without any major
changes required. But if somebody complains about a compile error on
those platforms (assuming anybody actually uses them), you know where to
complain.
- tex_upload args are moved to a struct (a rough sketch follows this list)
- the ability to directly upload texture data without going through a
buffer is made explicit
- the concept of buffer updates and buffer polling is made more explicit
and generalized to buf_update as well (not just mapped buffers)
- the ability to call tex_upload/buf_update on a tex/buf is made
explicit during tex/buf creation
- uploading from buffers now uses an explicit offset instead of
implicitly comparing *src against buf->data, because not all buffers
may actually be persistently mapped
- the initial_data = immutable requirement is dropped. (May be re-added
later for D3D11 if that ever becomes a thing)
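Roughly, the upload parameters now look like this (a sketch; field
names approximate the ones in ra.h):
    struct ra_tex_upload_params {
        struct ra_tex *tex;  // texture to upload to
        bool invalidate;     // discard data outside the uploaded region
        // Uploading from a buffer:
        struct ra_buf *buf;  // buffer to upload from (or NULL)
        size_t buf_offset;   // start of the data within the buffer (bytes)
        // Uploading directly (without a buffer), if supported:
        const void *src;
        // 2D textures only:
        struct mp_rect *rc;  // region to upload; NULL means the whole image
        ptrdiff_t stride;    // size of one image line in bytes (not texels)
    };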
This change helps the vulkan abstraction immensely and also helps move
common code (like the PBO pooling) out of ra_gl and into
opengl/utils.c.
This also technically has the side-benefit / side-constraint of using
PBOs for OSD texture uploads as well, which actually seems to help
performance on machines where --opengl-pbo is faster than the naive code
path. Because of this, I decided to hook up the OSD code to the
opengl-pbo option as well.
One drawback of this refactor is that the GL_STREAM_COPY hack for
texture uploads "got lost", but I think I'm happy with that going away
anyway since DR almost fully deprecates it, and it's not the "right
thing" anyway - but instead an nvidia-only hack to make this stuff work
somewhat better on NUMA systems with discrete GPUs.
Another change is that due to the way fencing works with ra_buf (we get
one fence per ra_buf per upload) we have to use multiple ra_bufs instead
of offsets into a shared buffer. But for OpenGL this is probably better
anyway. It's possible that in future, we could support having
independent “buffer slices” (each with their own fence/sync object), but
this would be an optimization more than anything. I also think that we
could address the underlying problem (memory closeness) differently by
making the ra_vk memory allocator smart enough to chunk together
allocations under the hood.
Move multiple GL-specific things from the renderer to other places like
vo_opengl.c, vo_opengl_cb.c, and ra_gl.c.
The vp_w/vp_h parameters to gl_video_resize() make no sense anymore, and
are implicitly part of struct fbodst.
Checking the main framebuffer depth is moved to vo_opengl.c. For
vo_opengl_cb.c it always assumes 8. The API user now has to override
this manually. The previous heuristic didn't make much sense anyway.
The only remaining dependency on GL is the hwdec stuff, which is harder
to change.
The radius check was not strict enough, especially not for all
platforms. To fix this, actually check the hardware capabilities instead
of relying on a hard-coded maximum radius.
Mesa 17.1 supports compute shaders, but not the full OpenGL 4.3 spec.
Change the code to detect the OpenGL extension "GL_ARB_compute_shader"
rather than OpenGL version 4.3.
HDR peak detection requires SSBOs, and the polar scaler requires the
2D texture array extension. Add these extensions as requirements as
well.
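The detection then becomes something like this (a simplified sketch;
have_ext() stands in for whatever extension check the GL wrapper
provides, and the exact extension names in the last two checks are
approximations):
    bool have_compute = gl->version >= 430 ||
                        have_ext(gl, "GL_ARB_compute_shader");
    // SSBOs, needed for HDR peak detection:
    bool have_ssbo = have_ext(gl, "GL_ARB_shader_storage_buffer_object");
    // 2D texture arrays, needed by the polar scalers:
    bool have_tex_array = have_ext(gl, "GL_EXT_texture_array");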
This is done via compute shaders. As a consequence, the tone mapping
algorithms had to be rewritten to compute their known constants in GLSL
(ahead of time), instead of doing it once. Didn't affect performance.
Using shmem/SSBO atomics in this way is extremely fast on nvidia, but it
might be slow on other platforms. Needs testing.
Unfortunately, setting up the SSBO still requires OpenGL calls, which
means I can't have it in video_shaders.c, where it belongs. But I'll
defer worrying about that until the backend refactor, since then I'll be
breaking up the video/video_shaders structure anyway.
These can either be invoked as dispatch_compute to do a single
computation, or finish_pass_fbo (after setting compute_size_minimum) to
render to a new texture using a compute shader. To make this stuff all
work transparently, we try really, really hard to make compute shaders
as identical to fragment shaders as possible in their behavior.
Can be enabled via --vd-lavc-dr=yes. See manpage additions for what it
does.
This is reminiscent of the MPlayer -dr flag, but the implementation is
completely different. It's the same basic concept: letting the decoder
render into a GPU buffer to avoid a copy. Unlike MPlayer, this doesn't
try to go through filters (libavfilter doesn't support this anyway).
Unless a filter can work in-place, DR will be silently disabled. MPlayer
had very complex semantics about buffer types and management (which
apparently nobody ever understood) and weird restrictions that mostly
limited it to mpeg2 style codecs. The mpv code does not do any of this,
and just lets the decoder allocate an arbitrary number of untyped
images. (No MPlayer code was used.)
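Conceptually, the hook is libavcodec's get_buffer2 callback: instead
of letting the decoder allocate plain system memory, we hand it
VO-provided (GPU-visible) memory. A minimal sketch of the idea, not
mpv's actual code (vo_try_alloc_dr_buffer is a hypothetical helper):
    static int dr_get_buffer2(AVCodecContext *avctx, AVFrame *frame, int flags)
    {
        struct dec_ctx *ctx = avctx->opaque;
        // Hypothetical helper: ask the VO for GPU-visible memory (e.g. a
        // persistently mapped PBO) of suitable size and alignment.
        AVBufferRef *buf = vo_try_alloc_dr_buffer(ctx, frame->width,
                                                  frame->height, frame->format);
        if (!buf) // DR not possible; fall back to normal allocation
            return avcodec_default_get_buffer2(avctx, frame, flags);
        frame->buf[0] = buf;
        // ...point frame->data[] / frame->linesize[] into buf->data...
        return 0;
    }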
Parts of the code based on work by atomnuker (starting point for the
generic code) and haasn (some GL definitions, some basic PBO code, and
correct fencing).
Performance seems pretty much unchanged but I no longer get nasty spikes
on NUMA systems, probably because glBufferSubData runs in the driver or
something.
As a simplification of the code, we also just size the PBO to always
have the full size, even for cropped textures. This seems slower but not
by relevant amounts, and only affects e.g. --vf=crop. It also slightly
increases VRAM usage for textures with big strides.
This new code path is especially nice because it no longer depends on
GL_ARB_map_buffer_range, and no longer uses any functions that can
possibly fail, thus simplifying control flow and seemingly deprecating
the manpage's claim about possible image corruption.
In theory we could also reduce NUM_PBO_BUFFERS since it doesn't seem
like we're streaming uploads anyway, but leave it in there just in
case some drivers disagree...
TLS is a headache. We should avoid it if we can.
The mechanism involved is unfortunately entangled with the awkward
libmpv API for returning pointers to host API objects. This has to be
kept until we change the API somehow.
Practically untested out of pure laziness. I'm sure I'll get a bunch of
reports if it's broken.
Mostly because of ANGLE (sadly).
The implementation became unpleasantly big, but at least it's relatively
self-contained.
I'm not sure to what degree shaders from different drivers are
compatible, i.e. whether a driver would randomly misbehave if it's fed
a binary created by another driver. The useless binaryFormat parameter
won't help, as the formats can probably easily clash. As usual, OpenGL
is pretty shit here.
gl_headers.h is basically header_fixes.h done consistently. It
contains all OpenGL defines (and some typedefs) we need. We don't
include GL headers provided by the system anymore.
Some care has to be taken because certain windowing APIs include all
of gl.h anyway, and then the definitions could clash. Fortunately,
redefining preprocessor symbols to the same content is allowed and
ignored. Also, redefining typedefs to the same thing is allowed in
C11. Apparently the latter is not allowed in C99, so there is an
imperfect attempt to avoid the typedefs if the required API symbols
appear to be present already.
The riskiest part about this is the standard typedefs and GLAPIENTRY.
The latter is different only on win32 (and at least consistently so).
The typedefs are mostly based on stdint.h typedefs, which khrplatform.h
clumsily emulates on platforms which don't have it. The biggest
difference is that we define GLsizeiptr directly to ptrdiff_t, instead
of checking for the _WIN64 symbol and defining it to long or long long.
This also typedefs GLsync to __GLsync, just like the khronos headers.
Although symbols prefixed with __ are implementation reserved, khronos
also violates this rule, and having the same definition as khronos will
avoid problems on duplicate definitions.
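The typedef part is handled roughly like this (a heavily simplified
sketch of the approach, not the actual gl_headers.h; the guard macro
in particular is only an illustration):
    #include <stddef.h>
    #ifndef GLAPIENTRY
    #ifdef _WIN32
    #define GLAPIENTRY __stdcall
    #else
    #define GLAPIENTRY
    #endif
    #endif
    // Skip our typedefs if a system GL header seems to be present already
    // (C99 does not allow redefining a typedef, even to the same type).
    #ifndef GL_VERSION_1_1
    typedef ptrdiff_t GLsizeiptr;    // instead of khrplatform.h's _WIN64 check
    typedef struct __GLsync *GLsync; // same spelling as the Khronos headers
    #endif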
We can simplify the build scripts too. The ios-gl check seems a bit
wrong now (what we really want to test for is EAGLContext), but I can't
test and thus can't improve it.
cuda_dynamic.h redefined two GL symbols; just include the new headers
directly instead.
With the recent GLES3 header detection, and if ANGLE is in the search
path, the ANGLE headers will be used over the desktop GL ones. It
appears the ANGLE headers do not include <windows.h>, which causes
the dxinterop code to fail to build. Oops.
Fix this by including <windows.h> if dxinterop is compiled in.
In some cases, such as when using the libmpv opengl-cb API, or with
certain vo_opengl backends, the main framebuffer is never accessed.
Instead, rendering is done to a FBO that acts as back buffer. This meant
an incorrect/broken bit depth could be used for dithering.
Change it to read the framebuffer depth lazily on the first render call.
Also move the main FBO field out of the GL struct to MPGLContext,
because the renderer's init function does not need to access it anymore.
This overlay support specifically skips the OpenGL rendering chain, and
uses GL rendering only for OSD/subtitles. This is for devices which
don't have performant GL support.
hwdec_rpi.c contains code ported from vo_rpi.c. vo_rpi.c is going to be
deprecated. I left in the code for uploading sw surfaces (as it might
be slightly more efficient for rendering sw decoded video), although
it's dead code for now.
Until now, we've always converted vdpau video surfaces to RGB, and then
mapped the resulting RGB texture. Change this so that the surface is
mapped as NV12 plane textures.
The reason this wasn't done until now is because vdpau surfaces are
mapped in an "interlaced" way as separate fields, even for progressive
video. This requires messy reinterleaving. It turns out that even
though it's an extra processing step, the result can be faster than
going through the video mixer for RGB conversion.
Other than some potential speed-gain, doing this has multiple other
advantages. We can apply our own color conversion, which is important in
more complex cases. We can correctly apply debanding and potentially
other processing that requires chroma-specific or in-YUV handling.
If deinterlacing is enabled, this switches back to the old RGB
conversion method. Until we have at least a primitive deinterlacer in
vo_opengl, this will stay this way. The d3d11 and vaapi code paths are
similar. (Of course these don't require any crazy field reinterleaving.)
Until now, we've used system-specific API (GLX, EGL, etc.) to retrieve
the depth of the default framebuffer. (We equate this with the
display depth and use the determined depth for dithering.)
We can actually retrieve this value through standard GL API, and it
works everywhere (except GLES 2 of course). This simplifies everything a
great deal.
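For reference, the query boils down to something like this (a sketch;
GL 3.0+, with GL_BACK instead of GL_BACK_LEFT on GLES):
    GLint r = 0, g = 0, b = 0;
    gl->BindFramebuffer(GL_FRAMEBUFFER, 0); // the default framebuffer
    gl->GetFramebufferAttachmentParameteriv(GL_FRAMEBUFFER, GL_BACK_LEFT,
            GL_FRAMEBUFFER_ATTACHMENT_RED_SIZE, &r);
    gl->GetFramebufferAttachmentParameteriv(GL_FRAMEBUFFER, GL_BACK_LEFT,
            GL_FRAMEBUFFER_ATTACHMENT_GREEN_SIZE, &g);
    gl->GetFramebufferAttachmentParameteriv(GL_FRAMEBUFFER, GL_BACK_LEFT,
            GL_FRAMEBUFFER_ATTACHMENT_BLUE_SIZE, &b);
    // One component is enough for picking the dither depth.
    int fb_depth = g;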
egl_helpers.c is empty now. But I expect that some EGL boilerplate will
be moved to it, so don't remove it yet.
To avoid blocking the CPU, we use 8 timer query objects and rotate
through them, only blocking until the last possible moment (before we need
access to them on the next iteration through the ring buffer). I tested
it out on my machine and 4 query objects were enough to guarantee
block-free querying, but the extra margin shouldn't hurt.
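The rotation works roughly like this (a simplified sketch with made-up
names, using GL_TIME_ELAPSED timer queries):
    #define NUM_QUERIES 8
    struct query_ring {
        GLuint q[NUM_QUERIES];
        bool used[NUM_QUERIES];
        int idx;
    };
    static void ring_begin(GL *gl, struct query_ring *r)
    {
        // Reusing the oldest query: by now its result is normally ready,
        // so this read rarely blocks.
        if (r->used[r->idx]) {
            GLuint64 ns = 0;
            gl->GetQueryObjectui64v(r->q[r->idx], GL_QUERY_RESULT, &ns);
            // ...record 'ns' as the render time of that earlier frame...
        }
        gl->BeginQuery(GL_TIME_ELAPSED, r->q[r->idx]);
    }
    static void ring_end(GL *gl, struct query_ring *r)
    {
        gl->EndQuery(GL_TIME_ELAPSED);
        r->used[r->idx] = true;
        r->idx = (r->idx + 1) % NUM_QUERIES;
    }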
Frame render times are just output at the end of each frame, via MP_DBG.
This might be improved in the future. (In particular, I want to expose
these numbers as properties so that users get some more visible
feedback about render times.)
Currently, we measure pass_render_frame and pass_draw_to_screen
separately because the former might be called multiple times due to
interpolation. Doing it this way gives more faithful numbers. Same goes
for frame upload times.
Some of these checks became pointless after dropping ES 2.0 support for
extended filtering.
GL_EXT_texture_rg is part of core in ES 3.0, and we already check for
this version, so testing for the extension is redundant.
GL_OES_texture_half_float_linear is also always available, at least as
far as our needs go.
The functionality we need from GL_EXT_color_buffer_half_float is always
available in ES 3.2, and we explicitly check for ES 3.2, so reject this
extension if the ES version is new enough.
For some reason, GLES has no glMapBuffer, only glMapBufferRange.
GLES 2 has no buffer mapping at all, and GL 2.1 does not always have
glMapBufferRange. On those, PBOs remain unsupported (there's no
reason to care about GL 2.1 without the extension).
This doesn't actually work on ANGLE, and I have no idea why. (There are
artifacts on OSD, as if parts of the OSD data weren't copied.) It works
on desktop OpenGL and at least 1 other ES 3 implementation. Don't enable
it on ANGLE, I guess.
Not sure how much can be gained with this, as we can't use it properly
yet. For now, this is used only before rendering, which probably does
overwhelmingly nothing.
In the future, this should be used after temporary passes, which could
possibly reduce memory usage and even memory bandwidth usage, depending
on the drivers.
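The hint itself is a single call before rendering into an FBO
(a sketch; GL 4.3 / GLES 3.0 / GL_ARB_invalidate_subdata):
    // Tell the driver we don't care about the FBO's previous contents,
    // so it can avoid restoring them.
    const GLenum att[] = {GL_COLOR_ATTACHMENT0};
    gl->BindFramebuffer(GL_FRAMEBUFFER, fbo);
    gl->InvalidateFramebuffer(GL_FRAMEBUFFER, 1, att);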
This merges all knowledge about texture format into a central table.
Most of the work done here is actually identifying which formats exactly
are supported by OpenGL(ES) under which circumstances, and keeping this
information in the format table in a somewhat declarative way. (Although
only to the extent needed by mpv.) In particular, ES and float formats
are a horrible mess.
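Each table entry essentially pairs a format name with the GL enums and
a set of capability flags, along these lines (an illustrative sketch
only, not the literal entries or flag names):
    // name,    internal fmt, format, type,            where it works
    {"r8",      GL_R8,      GL_RED,  GL_UNSIGNED_BYTE, F_CF | F_GL3 | F_ES3},
    {"rg8",     GL_RG8,     GL_RG,   GL_UNSIGNED_BYTE, F_CF | F_GL3 | F_ES3},
    {"rgba16f", GL_RGBA16F, GL_RGBA, GL_FLOAT,         F_CF | F_F16 | F_GL3},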
Again, this is a big refactor that might cause regressions on
"obscure" configurations.
This gives us 16 bit fixed-point integer texture formats, including
the ability to sample from them with linear filtering, and to use
them as FBO attachments.
The integer texture format path is still there for the sake of ANGLE,
which does not support GL_EXT_texture_norm16 yet.
The change to pass_dither() is needed, because the code path using
GL_R16 for the dither texture relies on glTexImage2D being able to
convert from GL_FLOAT to GL_R16. GLES does not allow this. This could be
trivially fixed by doing the conversion ourselves, but I'm too lazy to
do this now.
Until now, we've let the windowing backend decide. But since they
usually require premultiplied alpha, and premultiplied alpha is easier
to handle, hardcode it.
Do this to make the license situation less confusing.
This change should be of no consequence, since LGPL is compatible with
GPL anyway, and making it LGPL-only does not restrict the use with GPL
code.
Additionally, the wording implies that this is allowed, and that we can
just remove the GPL part.
Now common.c only contains the code for the function loader, while
context.c contains the backend loader/dispatcher.
Not calling it "backend.c", because the central struct is called
MPGLContext.
Store the determined framebuffer depth in struct GL instead of
MPGLContext. This means gl_video_set_output_depth() can be removed, and
also justifies adding new fields describing framebuffer/backend
properties to struct GL instead of having to add more functions just to
shovel the information around.
Keep in mind that mpgl_load_functions() will wipe struct GL, so the
new fields must be set before calling it.
WGL_NV_DX_interop is widely supported by Nvidia and AMD drivers. It
allows a texture to be shared between Direct3D and WGL, so that
rendering can be done with WGL and presentation can be done with
Direct3D. This should allow us to work around some persistent WGL
issues, such as dropped frames with some driver/OS combos, drivers that
buffer frames to increase performance at the cost of latency, and the
inability to disable exclusive fullscreen mode when using WGL to render
to a fullscreen window.
The addition of a DX_interop backend might also enable some cool
Direct3D-specific enhancements in the future, such as using the
GetPresentStatistics API to get accurate frame presentation timestamps.
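The core of the interop is fairly small; roughly (a sketch with error
handling omitted and variable names invented):
    // Share a D3D9Ex render target with GL: render with GL, present with D3D.
    HANDLE device = wglDXOpenDeviceNV(d3d9_device);
    HANDLE handle = wglDXRegisterObjectNV(device, d3d9_rendertarget,
                                          gl_renderbuffer, GL_RENDERBUFFER,
                                          WGL_ACCESS_WRITE_DISCARD_NV);
    // Per frame:
    wglDXLockObjectsNV(device, 1, &handle);   // GL now owns the surface
    // ...render with GL into an FBO backed by gl_renderbuffer...
    wglDXUnlockObjectsNV(device, 1, &handle); // hand it back to D3D
    IDirect3DDevice9Ex_PresentEx(d3d9_device, NULL, NULL, NULL, NULL, 0);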
Note that due to a driver bug, this backend is currently broken on
Intel. It will appear to work as long as the window is not resized too
often, but after a few changes of size it will be unable to share the
newly created renderbuffer with GL. See:
https://software.intel.com/en-us/forums/graphics-driver-bug-reporting/topic/562051
Running mpv with default config will now pick up ANGLE by default. Since
some think ANGLE is still not good enough for hq features, extend the
"es" option to reject GLES backends, and add to to the opengl-hq preset.
One consequence is that mpv will by default use libswscale to convert
10 bit video to 8 bit, before it reaches the VO.
In the display-sync, non-interpolation case, and if the display refresh
rate is higher than the video framerate, we duplicate display frames by
rendering exactly the same screen again. The redrawing is cached with a
FBO to speed up the repeat.
Use glBlitFramebuffer() instead of another shader pass. It should be
faster.
For some reason, post-processing was run again on each display refresh.
Stop doing this, which should also be slightly faster. The only
disadvantage is that temporal dithering will be run only once per video
frame, but I can live with this.
One aspect is messy: clearing the background is done at the start on the
target framebuffer, so to avoid clearing twice and duplicating the code,
only copy the part of the framebuffer that contains the rendered video.
(Which also gets slightly messy - needs to compensate for coordinate
system flipping.)
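The copy itself boils down to something like this (a sketch with
invented variable names), where the flip is handled by swapping the
destination y coordinates:
    gl->BindFramebuffer(GL_READ_FRAMEBUFFER, cache_fbo);
    gl->BindFramebuffer(GL_DRAW_FRAMEBUFFER, target_fbo);
    int y0 = flip ? dst.y1 : dst.y0;
    int y1 = flip ? dst.y0 : dst.y1;
    // Copy only the video rectangle; the borders were already cleared.
    gl->BlitFramebuffer(src.x0, src.y0, src.x1, src.y1,
                        dst.x0, y0, dst.x1, y1,
                        GL_COLOR_BUFFER_BIT, GL_NEAREST);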
Implement NNEDI3, a neural network based deinterlacer.
The shader is reimplemented in GLSL and now supports both 8x4 and 8x6
sampling windows. This allows the shader to be licensed under LGPL2.1
so that it can be used in mpv.
The current implementation supports uploading the NN weights (up to
51kb with the placebo setting) in two different ways: via a uniform
buffer object, or by hard-coding them into the shader source. UBOs
require OpenGL 3.1, which only guarantees 16kb per block. But 64kb
seems to be the default for recent cards/drivers (which nnedi3 is
targeting), so I think we're fine here (with the default nnedi3
settings the weights are 9kb). Hard-coding them into the shader
requires OpenGL 3.3, for the "intBitsToFloat()" built-in function.
This is necessary to precisely represent these weights in GLSL. I
tried several human-readable floating point formats (with precision
high enough for single precision floats), but for some reason they
did not work reliably; bad pixels (with NaN values) could be produced
with some weight sets.
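The hard-coded variant embeds each weight by printing its bit pattern
into the generated GLSL, roughly like this (a sketch, not the actual
generator code):
    // Format one weight so its exact bit pattern survives the trip
    // through the GLSL compiler.
    static void append_weight(char *out, size_t size, float weight)
    {
        union { float f; int32_t i; } w = { .f = weight };
        snprintf(out, size, "intBitsToFloat(%d)", w.i);
    }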
We could also add support for uploading these weights with a texture,
just for compatibility reasons (e.g. upscaling a still image with a
low-end graphics card). But as I tested, it's rather slow even with a
1D texture (we would probably have to use a 2D texture due to
dimension size limitations). Since there is always a better choice for
NNEDI3 upscaling of still images (the vapoursynth plugin), it's not
implemented in this commit. If this turns out to be in popular demand,
it should be easy to add later.
For those who want to optimize the performance a bit further, the
bottlenecks seem to be:
1. the overhead of uploading and accessing these weights (in
particular, the shader code is regenerated for each frame; that part
runs on the CPU though).
2. "dot()" performance in the main loop.
3. "exp()" performance in the main loop; there are various fast
implementations using bit tricks (probably with the help of the
intBitsToFloat function).
The code was tested with an nvidia card and driver (355.11), on Linux.
Closes #2230
Yet another relatively useless option that tries to make OpenGL's sync
behavior somewhat sane. The results are not too encouraging. With a
value of 1, vsync jitter is gone on nVidia, but there are frame drops
(less than with glfinish). With 2, I get the usual vsync jitter _and_
frame drops.
There's still some hope that it might prevent too deep queuing with some
GPUs, I guess.
The timeout for the wait call is 1 second. The value is pretty
arbitrary; it should just not be so high that it can freeze the
process for long (if the GPU misbehaves), and not so low that it
triggers the timeout in normal cases, even if the GPU load is very
high. So I guess 1 second is OK as a timeout.
The idea to use fences this way to control the queue depth was stolen
from RetroArch:
df01279cf3/gfx/drivers/gl.c (L1856)
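The mechanism is essentially a FIFO of GL fences (a sketch; the FIFO
handling is paraphrased, and 'max_fences' corresponds to the option
value):
    // After queuing a frame for display, insert a fence:
    fences[num_fences++] = gl->FenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    // Before rendering the next frame, if too many frames are still in
    // flight, block on the oldest fence (1 second safety timeout, in ns):
    while (num_fences > max_fences) {
        gl->ClientWaitSync(fences[0], GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000);
        gl->DeleteSync(fences[0]);
        num_fences--;
        memmove(&fences[0], &fences[1], num_fences * sizeof(fences[0]));
    }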
This parameter has been unused for years (the last flag was removed in
commit d658b115). Get rid of it.
This affects the general VO API, as well as the vo_opengl backend API,
so it touches a lot of files.
The VOFLAGs are still used to control OpenGL context creation, so move
them to the OpenGL backend code.