On AMD devices, we only get one graphics pipe but several compute pipes
which can (in theory) run independently. As such, we should prefer
compute shaders over fragment shaders in scenarios where we expect them
to be better for parallelism.
This is amusingly trivial to do, and actually improves performance even
in a single-queue scenario.
This is especially interesting for vulkan since it allows completely
skipping the layout transition as part of the renderpass. Unfortunately,
that also means it needs to be put into renderpass_params, as opposed to
renderpass_run_params (unlike #4777).
Closes#4777.
I've decided that MP_TRACE means “noisy spam per frame”, whereas
MP_DBG just means “more verbose debugging messages than MSGL_V”.
Basically, MSGL_DBG shouldn't create spam per frame like it currently
does, and MSGL_V should make sense to the end-user and provide mostly
additional informational output.
MP_DBG is basically what I want to make the new default for --log-file,
so the cut-off point for MP_DBG is if we probably want to know if for
debugging purposes but the user most likely doesn't care about on the
terminal.
Also, the debug callbacks for libass and ffmpeg got bumped in their
verbosity levels slightly, because being external components they're a
bit less relevant to mpv debugging, and a bit too over-eager in what
they consider to be relevant information.
I exclusively used the "try it on my machine and remove messages from
MSGL_* until it does what I want it to" approach of refactoring, so
YMMV.
The testing_only field is not referenced anymore with vaglx removed and
the previous commit dropping all uses.
The ra_hwdec_driver.api field became unused with the previous commit,
but all hwdec interop drivers still initialized it.
Since this touches highly OS-specific code, build regressions are
possible (plus the previous commit might break hw decoding at runtime).
At least hwdec_cuda.c still used the .api field, other than initializing
it.
Make the VO<->decoder interface capable of supporting multiple hwdec
APIs at once. The main gain is that this simplifies autoprobing a lot.
Before this change, it could happen that the VO loaded the "wrong" hwdec
API, and the decoder was stuck with the choice (breaking hw decoding).
With the change applied, the VO simply loads all available APIs, so
autoprobing trickery is left entirely to the decoder.
In the past, we were quite careful about not accidentally loading the
wrong interop drivers. This was in part to make sure autoprobing works,
but also because libva had this obnoxious bug of dumping garbage to
stderr when using the API. libva was fixed, so this is not a problem
anymore.
The --opengl-hwdec-interop option is changed in various ways (again...),
and renamed to --gpu-hwdec-interop. It does not have much use anymore,
other than debugging. It's notable that the order in the hwdec interop
array ra_hwdec_drivers[] still matters if multiple drivers support the
same image formats, so the option can explicitly force one, if that
should ever be necessary, or more likely, for debugging. One example are
the ra_hwdec_d3d11egl and ra_hwdec_d3d11eglrgb drivers, which both
support d3d11 input.
vo_gpu now always loads the interop lazily by default, but when it does,
it loads them all. vo_opengl_cb now always loads them when the GL
context handle is initialized. I don't expect that this causes any
problems.
It's now possible to do things like changing between vdpau and nvdec
decoding at runtime.
This is also preparation for cleaning up vd_lavc.c hwdec autoprobing.
It's another reason why hwdec_devices_request_all() does not take a
hwdec type anymore.
nvdec aka cuvid aka cuda should work much better than vdpau, and support
newer codecs (such as vp9), and more advanced surface formats (like 10
bit).
This requires moving the d3d hwaccels in the autoprobe order, since on
Windows, d3d decoding should be preferred over nvidia proprietary stuff.
Users of older drivers will need to force --hwdec=vdpau, since it could
happen that the vo_gpu cuda hwdec interop loads (so the vdpau interop is
not loaded), but the hwdec itself doesn't work.
I expect this does not break AMD (which still needs vdpau for vo_gpu
interop, until libva is fixed so it can fully support AMD).
This has stopped being useful a long time ago, and it's the only GPL
source file in the vo_gpu source directories. Recently it wasn't even
loaded at all, unless you forced loading it.
The D3D11_CREATE_DEVICE_BGRA_SUPPORT flag doesn't enable support for
BGRA textures. BGRA textures will be supported whether or not the flag
is passed. The flag just fails device creation if they are not supported
as an API convenience for programs that need BGRA textures, such as
programs that use D2D or D3D9 interop. We can handle devices without
BGRA support fine, so don't bother with the flag.
Seems like the last refactor to this code broke playing flipped images,
at least with --opengl-pbo --gpu-api=opengl.
Flipping is rather a shitmess. The main problem is that OpenGL does not
support flipped uploading. The original vo_gl implementation considered
it important to handle the flipped case efficiently, so instead of
uploading the image line by line backwards, it uploaded it flipped, and
then flipped it in the renderer (basically the upload path ignored the
flipping). The ra code and backends probably have an insane and
inconsistent mix of semantics, so fix this by never passing it flipped
images in the first place.
In the future, the backends should probably support flipped images
directly.
Fixes#5097.
ra_d3d11 uses the SPIR-V compiler to translate GLSL to SPIR-V, which is
then translated to HLSL. This means it always exposes the same GLSL
version that the SPIR-V compiler supports (4.50 for shaderc/glslang.)
Despite claiming to support GLSL 4.50, some features that are tied to
the GLSL version in OpenGL are not supported by ra_d3d11 when targeting
legacy Direct3D feature levels.
This includes two features that mpv relies on:
- Reading from gl_FragCoord in the fragment shader (requires FL 10_0)
- textureGather from any texture component (requires FL 11_0)
These features have been exposed as new RA caps.
This is a new RA/vo_gpu backend that uses Direct3D 11. The GLSL
generated by vo_gpu is cross-compiled to HLSL with SPIRV-Cross.
What works:
- All of mpv's internal shaders should work, including compute shaders.
- Some external shaders have been tested and work, including RAVU and
adaptive-sharpen.
- Non-dumb mode works, even on very old hardware. Most features work at
feature level 9_3 and all features work at feature level 10_0. Some
features also work at feature level 9_1 and 9_2, but without high-bit-
depth FBOs, it's not very useful. (Hardware this old is probably not
fast enough for advanced features anyway.)
Note: This is more compatible than ANGLE, which requires 9_3 to work
at all (GLES 2.0,) and 10_1 for non-dumb-mode (GLES 3.0.)
- Hardware decoding with D3D11VA, including decoding of 10-bit formats
without truncation to 8-bit.
What doesn't work / can be improved:
- PBO upload and direct rendering does not work yet. Direct rendering
requires persistent-mapped PBOs because the decoder needs to be able
to read data from images that have already been decoded and uploaded.
Unfortunately, it seems like persistent-mapped PBOs are fundamentally
incompatible with D3D11, which requires all resources to use driver-
managed memory and requires memory to be unmapped (and hence pointers
to be invalidated) when a resource is used in a draw or copy
operation.
However it might be possible to use D3D11's limited multithreading
capabilities to emulate some features of PBOs, like asynchronous
texture uploading.
- The blit() and clear() operations don't have equivalents in the D3D11
API that handle all cases, so in most cases, they have to be emulated
with a shader. This is currently done inside ra_d3d11, but ideally it
would be done in generic code, so it can take advantage of mpv's
shader generation utilities.
- SPIRV-Cross is used through a NIH C-compatible wrapper library, since
it does not expose a C interface itself.
The library is available here: https://github.com/rossy/crossc
- The D3D11 context could be made to support more modern DXGI features
in future. For example, it should be possible to add support for
high-bit-depth and HDR output with DXGI 1.5/1.6.
Backported from @haasn's change to libplacebo, except in the current RA,
there's nothing to indicate an ra_format can be bound as a storage
image, so there's no way to force all of these formats to have a
glsl_format. Instead, the layout qualifier will be removed if
glsl_format is NULL.
This is needed for the upcoming ra_d3d11 backend. In Direct3D 11, while
loading float values from unorm images often works as expected, it's
technically undefined behaviour, and in Windows 10, it will cause the
debug layer to spam the log with error messages. Also, apparently in
GLSL, the format name must match the image's format exactly (but in
Direct3D, it just has to have the same component type.)
Backported from @haasn's change to libplacebo. More flexible than the
previous "shared || non-shared" distinction. The extra flexibility is
needed for Direct3D 11, but it also doesn't hurt code-wise.
Repeating frames (for display-sync) is not supposed to render the entire
frame again. When using hardware decoding, it unfortunately did: the
renderer uses the frame ID to check whether the frame data changed, and
unmapping the hwdec frame clears it.
Essentially reverts commit 761eeacf54. Back then I probably
thought it would be a good idea to release the hwdec image quickly in
order to return it to the decoder, but they're referenced anyway.
This should increase the performance and reduce GPU work.
Normally such code is didsabled by have_mglsl==false in
check_gl_features(), but apparently not this one.
Just fix it. Seems also more readable.
Fixes#5069.
This is just a dumb consequence of HWDEC_ types somehow being part of
both decoder and VO. Obviously, the VO should only care about supporting
specific hardware surface types or providing specific device types, but
until they are separated, stupid unintuitive mismatches will occur.
See manpage additions.
(In ffmpeg-mpv and Libav, this is still called "cuvid". Libav won't work
yet, because it has no frame params support yet, but this could get
fixed soon.)
params->rc was ignored in the calculation for the buffer size. I fucking
hate this stupid ra_tex_upload signature where *rc is randomly relevant
or not.
Coverity complains about this, but it's probably a false positive.
Anyway, rewrite it in a slightly more readable way. Now it's more
obvious that it is correct.
Comparing mpv's implementation against the ACES ODR reference samples
and algorithms, it seems like they're happy desaturating highlights
_way_ more aggressively than mpv currently does. And indeed, looking at
some example clips like The Redwoods (which is actually well-mastered),
the current desaturation produces unnatural-looking brightness fringes
where the sky meets the treeline.
Adjust the algorithm to make it apply to a much larger, more gradual
brightness region; and change the interpretation of the parameter. As a
bonus, the new parameter is actually sanely scaled (higher values = more
desaturation). Also, make it scale based on the signal level instead of
the luminance, to avoid under-desaturating bright blues.
This commit allows to use the AV_PIX_FMT_DRM_PRIME newly introduced
format in ffmpeg that allows decoders to provide an AVDRMFrameDescriptor
struct.
That struct holds dmabuf fds and information allowing zerocopy rendering
using KMS / DRM Atomic.
This has been tested on RockChip ROCK64 device.
Regression since ec6e8a31e0. Removal of the explicit else case
always applies the conversion to premultiplied alpha in the else branch.
We want to scale with multiplied alpha, but we don't want to multiply
with alpha again on top of it.
Fixes#4983, hopefully.
This should be functionally identical to rgba16f, since the formats only
differ in their representation on the CPU, but it could be useful for RA
backends that don't expose rgba16f, like Vulkan. It's definitely useful
for the WIP D3D11 backend.
With video paused, changing the brightness controls (or similar) would
sometimes not rerender the video frame. So the OSD would redraw, but the
video wouldn't change. This is caused by output caching, and a redraw
request is free to return the cached frame. Change it such to invalidate
the cached frame if any of the options or the equalizer change.
In theory, gl_video_reset_surfaces() could be called if the equalizer
changes - this would apparently force interpolatzion to redraw all
frames. But this looks kind of crappy when changing the equalizer during
playback. It'll "eventually" use the correct settings anyway, and when
paused interpolation is off.
This was confusing at best. Change it to output the actual choices.
(Seems like in the end it's always me who has to clean up other people's
bullshit.)
Context names were not unique - but they should be, so fix it. The whole
point of the original --opengl-backend option was to side-step the
tricky auto-detection, so you know exactly what you get. The goal of
this commit is to make --gpu-context work the same way. Fix the
non-unique names by appending "vk" to the names.
Keep in mind that this was not suitable for slecting the "UI" backend
anyway, since "x11" would force GLX, whereas people on not-NVIDIA
actually want "x11egl". Users trying to use --gpu-context=x11 to force
the X11 backend would always end up with GLX, which would at least break
VAAPI hardware decoding for them. Basically the idea that this option
could select the "UI" type is completely broken - it selects an
implementation, which implies a UI. Selecting the UI type This would
require a separate mechanism. (Although in theory this separate
mechanism could be part of the --gpu-context option - in any case,
someone would have to implement it.)
To achieve help output that can actually be understood, just duplicate
the code. Most of that code is duplicated anyway, and trying to share
just the list code with the result of making the output unreadable
doesn't make too much sense. If we wanted to save code/effort, we could
just remove the help output altogether.
--gpu-api has non-unique entries, and it would be nice to group them
(e.g. list all OpenGL capable contexts with "opengl"), but C makes this
simple idea too much of a pain, so don't do it.
Also remove a stray tab from the android entry on the manpage.
This adds symbol information to the generated SPIR-V, which shows up in
the SPIR-V assembly dump. It's also useful for potential RA backends
that use SPIRV-Cross, since the symbol information is used in the
generated shader source.
Unless FBOs are unsupported, this works. In particular, it's required to
get ICC profiles working in voluntary dumb mode. So instead of
blanket-disabling it, only disable it in the !have_fbo false case.
Seems to be fixed upstream in the nvidia driver, so it's probably a good
idea to 1. force the layout and 2. remove the warning, as it now
actually works. Users with older drivers would run into errors, but they
can still use shaderc as a replacement. (And it's not like the old
status quo was any better)
This was always set to the length of the VAO, but it should have been
set to the number of vertex attribs actually in use for this frame. No
idea how that managed to survive the test framework on nvidia/linux, but
ANGLE caught it.
This has several advantages:
1. no more redundant texcoords when we don't need them
2. no more arbitrary limit on how many textures we can bind
3. (that extends to user shaders as well)
4. no more arbitrary limits on tscale radius
To realize this, the VAO was moved from a hacky stateful approach
(gl_sc_set_vertex_attribs) - which always bothered me since it was
required for compute shaders as well even though they ignored it - to be
a proper parameter of gl_sc_dispatch_draw, and internally plumbed into
gl_sc_generate, which will make a (properly mangled) deep copy into
params.vertex_attribs.
This is apparently required to get storage images working on
windows/vulkan, and probably good practice either way. Not entirely sure
if it's the best idea to be always storing the value as 32-bit float,
but it should hardly matter in practice (since we're only writing one
sample per thread).
(Leaving them implicit requires the shaderStorageImageWriteWithoutFormat
feature to be enabled, which the windows nvidia vulkan driver doesn't
support, at least not for a GTX 670)
This makes the radeon driver shut up about frequently updating
STATIC_DRAW UBOs (--opengl-debug), and also reduces the amount of
synchronization necessary for vulkan uniform buffers.
Also add some extra debugging/tracing code paths. I went with a
flags-based approach in case we ever want to extend this.
In addition to the built-in nvidia compiler, we now also support a
backend based on libshaderc. shaderc is sort of like glslang except it
has a C API and is available as a dynamic library.
The generated SPIR-V is now cached alongside the VkPipeline in the
cached_program. We use a special cache header to ensure validity of this
cache before passing it blindly to the vulkan implementation, since
passing invalid SPIR-V can cause all sorts of nasty things. It's also
designed to self-invalidate if the compiler gets better, by offering a
catch-all `int compiler_version` that implementations can use as a cache
invalidation marker.
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.