mpv/video/out/vulkan/common.h

#pragma once

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

#include "config.h"

#include "common/common.h"
#include "common/msg.h"

// We need to define all platforms we want to support. Since we have
// our own mechanism for checking this, we re-define the right symbols
#if HAVE_WAYLAND
#define VK_USE_PLATFORM_WAYLAND_KHR
#endif
#if HAVE_X11
#define VK_USE_PLATFORM_XLIB_KHR
#endif
#if HAVE_WIN32_DESKTOP
#define VK_USE_PLATFORM_WIN32_KHR
#endif

#include <vulkan/vulkan.h>

// Vulkan allows the optional use of a custom allocator. We don't need one but
// mark this parameter with a better name in case we ever decide to change this
// in the future. (And to make the code more readable)
#define MPVK_ALLOCATOR NULL

// A lot of things depend on streaming resources across frames. Depending on
// how many frames we render ahead of time, we need to pick enough to avoid
// any conflicts, so make all of these tunable relative to this constant in
// order to centralize them.
#define MPVK_MAX_STREAMING_DEPTH 8

// Shared struct used to hold vulkan context information
struct mpvk_ctx {
    struct mp_log *log;
    VkInstance inst;
    VkPhysicalDevice physd;
    VkDebugReportCallbackEXT dbg;
    VkDevice dev;

    // Surface, must be initialized fter the context itself
    VkSurfaceKHR surf;
    VkSurfaceFormatKHR surf_format; // picked at surface initialization time

    struct vk_malloc *alloc;      // memory allocator for this device
    struct spirv_compiler *spirv; // GLSL -> SPIR-V compiler
    struct vk_cmdpool **pools;    // command pools (one per queue family)
    int num_pools;
    struct vk_cmd *last_cmd;      // most recently submitted command

    // Queued/pending commands. These are shared for the entire mpvk_ctx to
    // ensure submission and callbacks are FIFO
    struct vk_cmd **cmds_queued;  // recorded but not yet submitted
    struct vk_cmd **cmds_pending; // submitted but not completed
    int num_cmds_queued;
    int num_cmds_pending;

    // Pointers into *pools
    struct vk_cmdpool *pool_graphics; // required
    struct vk_cmdpool *pool_compute;  // optional
    struct vk_cmdpool *pool_transfer; // optional

    // Common pool of signals, to avoid having to re-create these objects often
    struct vk_signal **signals;
    int num_signals;

    // Cached capabilities
    VkPhysicalDeviceLimits limits;
    VkPhysicalDeviceFeatures features;
};
vo_gpu: vulkan: initial implementation This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop! Current problems / limitations / improvement opportunities: 1. The swapchain/flipping code violates the vulkan spec, by assuming that the presentation queue will be bounded (in cases where rendering is significantly faster than vsync). But apparently, there's simply no better way to do this right now, to the point where even the stupid cube.c examples from LunarG etc. do it wrong. (cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370) 2. The memory allocator could be improved. (This is a universal constant) 3. Could explore using push descriptors instead of descriptor sets, especially since we expect to switch descriptors semi-often for some passes (like interpolation). Probably won't make a difference, but the synchronization overhead might be a factor. Who knows. 4. Parallelism across frames / async transfer is not well-defined, we either need to use a better semaphore / command buffer strategy or a resource pooling layer to safely handle cross-frame parallelism. (That said, I gave resource pooling a try and was not happy with the result at all - so I'm still exploring the semaphore strategy) 5. We aggressively use pipeline barriers where events would offer a much more fine-grained synchronization mechanism. As a result of this, we might be suffering from GPU bubbles due to too-short dependencies on objects. (That said, I'm also exploring the use of semaphores as a an ordering tactic which would allow cross-frame time slicing in theory) Some minor changes to the vo_gpu and infrastructure, but nothing consequential. NOTE: For safety, all use of asynchronous commands / multiple command pools is currently disabled completely. There are some left-over relics of this in the code (e.g. the distinction between dev_poll and pool_poll), but that is kept in place mostly because this will be re-extended in the future (vulkan rev 2). The queue count is also currently capped to 1, because of the lack of cross-frame semaphores means we need the implicit synchronization from the same-queue semantics to guarantee a correct result. 2016-09-14 18:54:18 +00:00			`#pragma once`

			`#include <stdlib.h>`
			`#include <stdio.h>`
			`#include <stdint.h>`
			`#include <stdbool.h>`
			`#include <assert.h>`

			`#include "config.h"`

			`#include "common/common.h"`
			`#include "common/msg.h"`

			`// We need to define all platforms we want to support. Since we have`
			`// our own mechanism for checking this, we re-define the right symbols`
vo_gpu: vulkan: add support for wayland 2017-09-16 02:46:38 +00:00			`#if HAVE_WAYLAND`
			`#define VK_USE_PLATFORM_WAYLAND_KHR`
			`#endif`
vo_gpu: vulkan: initial implementation This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop! Current problems / limitations / improvement opportunities: 1. The swapchain/flipping code violates the vulkan spec, by assuming that the presentation queue will be bounded (in cases where rendering is significantly faster than vsync). But apparently, there's simply no better way to do this right now, to the point where even the stupid cube.c examples from LunarG etc. do it wrong. (cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370) 2. The memory allocator could be improved. (This is a universal constant) 3. Could explore using push descriptors instead of descriptor sets, especially since we expect to switch descriptors semi-often for some passes (like interpolation). Probably won't make a difference, but the synchronization overhead might be a factor. Who knows. 4. Parallelism across frames / async transfer is not well-defined, we either need to use a better semaphore / command buffer strategy or a resource pooling layer to safely handle cross-frame parallelism. (That said, I gave resource pooling a try and was not happy with the result at all - so I'm still exploring the semaphore strategy) 5. We aggressively use pipeline barriers where events would offer a much more fine-grained synchronization mechanism. As a result of this, we might be suffering from GPU bubbles due to too-short dependencies on objects. (That said, I'm also exploring the use of semaphores as a an ordering tactic which would allow cross-frame time slicing in theory) Some minor changes to the vo_gpu and infrastructure, but nothing consequential. NOTE: For safety, all use of asynchronous commands / multiple command pools is currently disabled completely. There are some left-over relics of this in the code (e.g. the distinction between dev_poll and pool_poll), but that is kept in place mostly because this will be re-extended in the future (vulkan rev 2). The queue count is also currently capped to 1, because of the lack of cross-frame semaphores means we need the implicit synchronization from the same-queue semantics to guarantee a correct result. 2016-09-14 18:54:18 +00:00			`#if HAVE_X11`
			`#define VK_USE_PLATFORM_XLIB_KHR`
			`#endif`
vo_gpu: vulkan: add support for Windows 2017-09-24 06:45:56 +00:00			`#if HAVE_WIN32_DESKTOP`
			`#define VK_USE_PLATFORM_WIN32_KHR`
			`#endif`
vo_gpu: vulkan: initial implementation This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop! Current problems / limitations / improvement opportunities: 1. The swapchain/flipping code violates the vulkan spec, by assuming that the presentation queue will be bounded (in cases where rendering is significantly faster than vsync). But apparently, there's simply no better way to do this right now, to the point where even the stupid cube.c examples from LunarG etc. do it wrong. (cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370) 2. The memory allocator could be improved. (This is a universal constant) 3. Could explore using push descriptors instead of descriptor sets, especially since we expect to switch descriptors semi-often for some passes (like interpolation). Probably won't make a difference, but the synchronization overhead might be a factor. Who knows. 4. Parallelism across frames / async transfer is not well-defined, we either need to use a better semaphore / command buffer strategy or a resource pooling layer to safely handle cross-frame parallelism. (That said, I gave resource pooling a try and was not happy with the result at all - so I'm still exploring the semaphore strategy) 5. We aggressively use pipeline barriers where events would offer a much more fine-grained synchronization mechanism. As a result of this, we might be suffering from GPU bubbles due to too-short dependencies on objects. (That said, I'm also exploring the use of semaphores as a an ordering tactic which would allow cross-frame time slicing in theory) Some minor changes to the vo_gpu and infrastructure, but nothing consequential. NOTE: For safety, all use of asynchronous commands / multiple command pools is currently disabled completely. There are some left-over relics of this in the code (e.g. the distinction between dev_poll and pool_poll), but that is kept in place mostly because this will be re-extended in the future (vulkan rev 2). The queue count is also currently capped to 1, because of the lack of cross-frame semaphores means we need the implicit synchronization from the same-queue semantics to guarantee a correct result. 2016-09-14 18:54:18 +00:00
			`#include <vulkan/vulkan.h>`

			`// Vulkan allows the optional use of a custom allocator. We don't need one but`
			`// mark this parameter with a better name in case we ever decide to change this`
			`// in the future. (And to make the code more readable)`
			`#define MPVK_ALLOCATOR NULL`

			`// A lot of things depend on streaming resources across frames. Depending on`
			`// how many frames we render ahead of time, we need to pick enough to avoid`
			`// any conflicts, so make all of these tunable relative to this constant in`
			`// order to centralize them.`
			`#define MPVK_MAX_STREAMING_DEPTH 8`

			`// Shared struct used to hold vulkan context information`
			`struct mpvk_ctx {`
			`struct mp_log *log;`
			`VkInstance inst;`
			`VkPhysicalDevice physd;`
			`VkDebugReportCallbackEXT dbg;`
			`VkDevice dev;`

			`// Surface, must be initialized fter the context itself`
			`VkSurfaceKHR surf;`
			`VkSurfaceFormatKHR surf_format; // picked at surface initialization time`

vo_gpu: vulkan: support split command pools Instead of using a single primary queue, we generate multiple vk_cmdpools and pick the right one dynamically based on the intent. This has a number of immediate benefits: 1. We can use async texture uploads 2. We can use the DMA engine for buffer updates 3. We can benefit from async compute on AMD GPUs Unfortunately, the major downside is that due to the lack of QF ownership tracking, we need to use CONCURRENT sharing for all resources (buffers and images!). In theory, we could try figuring out a way to get rid of the concurrent sharing for buffers (which is only needed for compute shader UBOs), but even so, the concurrent sharing mode doesn't really seem to have a significant impact over here (nvidia). It's possible that other platforms may disagree. Our deadlock-avoidance strategy is stupidly simple: Just flush the command every time we need to switch queues, and make sure all submission and callbacks happen in FIFO order. This required lifting the cmds_pending and cmds_queued out from vk_cmdpool to mpvk_ctx, and some functions died/got moved as a result, but that's a relatively minor change. On my hardware this is a fairly significant performance boost, mainly due to async transfers. (Nvidia doesn't expose separate compute queues anyway). On AMD, this should be a performance boost as well due to async compute. 2017-09-24 13:05:24 +00:00			`struct vk_malloc *alloc; // memory allocator for this device`
vo_gpu: vulkan: generalize SPIR-V compiler In addition to the built-in nvidia compiler, we now also support a backend based on libshaderc. shaderc is sort of like glslang except it has a C API and is available as a dynamic library. The generated SPIR-V is now cached alongside the VkPipeline in the cached_program. We use a special cache header to ensure validity of this cache before passing it blindly to the vulkan implementation, since passing invalid SPIR-V can cause all sorts of nasty things. It's also designed to self-invalidate if the compiler gets better, by offering a catch-all `int compiler_version` that implementations can use as a cache invalidation marker. 2017-09-13 01:09:48 +00:00			`struct spirv_compiler *spirv; // GLSL -> SPIR-V compiler`
vo_gpu: vulkan: support split command pools Instead of using a single primary queue, we generate multiple vk_cmdpools and pick the right one dynamically based on the intent. This has a number of immediate benefits: 1. We can use async texture uploads 2. We can use the DMA engine for buffer updates 3. We can benefit from async compute on AMD GPUs Unfortunately, the major downside is that due to the lack of QF ownership tracking, we need to use CONCURRENT sharing for all resources (buffers and images!). In theory, we could try figuring out a way to get rid of the concurrent sharing for buffers (which is only needed for compute shader UBOs), but even so, the concurrent sharing mode doesn't really seem to have a significant impact over here (nvidia). It's possible that other platforms may disagree. Our deadlock-avoidance strategy is stupidly simple: Just flush the command every time we need to switch queues, and make sure all submission and callbacks happen in FIFO order. This required lifting the cmds_pending and cmds_queued out from vk_cmdpool to mpvk_ctx, and some functions died/got moved as a result, but that's a relatively minor change. On my hardware this is a fairly significant performance boost, mainly due to async transfers. (Nvidia doesn't expose separate compute queues anyway). On AMD, this should be a performance boost as well due to async compute. 2017-09-24 13:05:24 +00:00			`struct vk_cmdpool **pools; // command pools (one per queue family)`
			`int num_pools;`
			`struct vk_cmd *last_cmd; // most recently submitted command`

			`// Queued/pending commands. These are shared for the entire mpvk_ctx to`
			`// ensure submission and callbacks are FIFO`
			`struct vk_cmd **cmds_queued; // recorded but not yet submitted`
			`struct vk_cmd **cmds_pending; // submitted but not completed`
			`int num_cmds_queued;`
			`int num_cmds_pending;`

			`// Pointers into *pools`
			`struct vk_cmdpool *pool_graphics; // required`
			`struct vk_cmdpool *pool_compute; // optional`
			`struct vk_cmdpool *pool_transfer; // optional`
vo_gpu: vulkan: initial implementation This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop! Current problems / limitations / improvement opportunities: 1. The swapchain/flipping code violates the vulkan spec, by assuming that the presentation queue will be bounded (in cases where rendering is significantly faster than vsync). But apparently, there's simply no better way to do this right now, to the point where even the stupid cube.c examples from LunarG etc. do it wrong. (cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370) 2. The memory allocator could be improved. (This is a universal constant) 3. Could explore using push descriptors instead of descriptor sets, especially since we expect to switch descriptors semi-often for some passes (like interpolation). Probably won't make a difference, but the synchronization overhead might be a factor. Who knows. 4. Parallelism across frames / async transfer is not well-defined, we either need to use a better semaphore / command buffer strategy or a resource pooling layer to safely handle cross-frame parallelism. (That said, I gave resource pooling a try and was not happy with the result at all - so I'm still exploring the semaphore strategy) 5. We aggressively use pipeline barriers where events would offer a much more fine-grained synchronization mechanism. As a result of this, we might be suffering from GPU bubbles due to too-short dependencies on objects. (That said, I'm also exploring the use of semaphores as a an ordering tactic which would allow cross-frame time slicing in theory) Some minor changes to the vo_gpu and infrastructure, but nothing consequential. NOTE: For safety, all use of asynchronous commands / multiple command pools is currently disabled completely. There are some left-over relics of this in the code (e.g. the distinction between dev_poll and pool_poll), but that is kept in place mostly because this will be re-extended in the future (vulkan rev 2). The queue count is also currently capped to 1, because of the lack of cross-frame semaphores means we need the implicit synchronization from the same-queue semantics to guarantee a correct result. 2016-09-14 18:54:18 +00:00
vo_gpu: vulkan: add a vk_signal abstraction This combines VkSemaphores and VkEvents into a common umbrella abstraction which can resolve to either. We aggressively try to prefer VkEvents over VkSemaphores whenever the conditions are met (1. we can unsignal the semaphore, i.e. it comes from the same frame; and 2. it comes from the same queue). 2017-09-28 22:36:46 +00:00			`// Common pool of signals, to avoid having to re-create these objects often`
			`struct vk_signal **signals;`
			`int num_signals;`

vo_gpu: vulkan: initial implementation This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop! Current problems / limitations / improvement opportunities: 1. The swapchain/flipping code violates the vulkan spec, by assuming that the presentation queue will be bounded (in cases where rendering is significantly faster than vsync). But apparently, there's simply no better way to do this right now, to the point where even the stupid cube.c examples from LunarG etc. do it wrong. (cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370) 2. The memory allocator could be improved. (This is a universal constant) 3. Could explore using push descriptors instead of descriptor sets, especially since we expect to switch descriptors semi-often for some passes (like interpolation). Probably won't make a difference, but the synchronization overhead might be a factor. Who knows. 4. Parallelism across frames / async transfer is not well-defined, we either need to use a better semaphore / command buffer strategy or a resource pooling layer to safely handle cross-frame parallelism. (That said, I gave resource pooling a try and was not happy with the result at all - so I'm still exploring the semaphore strategy) 5. We aggressively use pipeline barriers where events would offer a much more fine-grained synchronization mechanism. As a result of this, we might be suffering from GPU bubbles due to too-short dependencies on objects. (That said, I'm also exploring the use of semaphores as a an ordering tactic which would allow cross-frame time slicing in theory) Some minor changes to the vo_gpu and infrastructure, but nothing consequential. NOTE: For safety, all use of asynchronous commands / multiple command pools is currently disabled completely. There are some left-over relics of this in the code (e.g. the distinction between dev_poll and pool_poll), but that is kept in place mostly because this will be re-extended in the future (vulkan rev 2). The queue count is also currently capped to 1, because of the lack of cross-frame semaphores means we need the implicit synchronization from the same-queue semantics to guarantee a correct result. 2016-09-14 18:54:18 +00:00			`// Cached capabilities`
			`VkPhysicalDeviceLimits limits;`
vo_gpu: vulkan: try enabling required features Instead of enabling every feature under the sun, make an effort to just whitelist the ones we actually might use. Turns out the extended storage format support is needed for some of the storage formats we use, in particular rgba16. 2018-02-04 22:08:49 +00:00			`VkPhysicalDeviceFeatures features;`
vo_gpu: vulkan: initial implementation This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop! Current problems / limitations / improvement opportunities: 1. The swapchain/flipping code violates the vulkan spec, by assuming that the presentation queue will be bounded (in cases where rendering is significantly faster than vsync). But apparently, there's simply no better way to do this right now, to the point where even the stupid cube.c examples from LunarG etc. do it wrong. (cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370) 2. The memory allocator could be improved. (This is a universal constant) 3. Could explore using push descriptors instead of descriptor sets, especially since we expect to switch descriptors semi-often for some passes (like interpolation). Probably won't make a difference, but the synchronization overhead might be a factor. Who knows. 4. Parallelism across frames / async transfer is not well-defined, we either need to use a better semaphore / command buffer strategy or a resource pooling layer to safely handle cross-frame parallelism. (That said, I gave resource pooling a try and was not happy with the result at all - so I'm still exploring the semaphore strategy) 5. We aggressively use pipeline barriers where events would offer a much more fine-grained synchronization mechanism. As a result of this, we might be suffering from GPU bubbles due to too-short dependencies on objects. (That said, I'm also exploring the use of semaphores as a an ordering tactic which would allow cross-frame time slicing in theory) Some minor changes to the vo_gpu and infrastructure, but nothing consequential. NOTE: For safety, all use of asynchronous commands / multiple command pools is currently disabled completely. There are some left-over relics of this in the code (e.g. the distinction between dev_poll and pool_poll), but that is kept in place mostly because this will be re-extended in the future (vulkan rev 2). The queue count is also currently capped to 1, because of the lack of cross-frame semaphores means we need the implicit synchronization from the same-queue semantics to guarantee a correct result. 2016-09-14 18:54:18 +00:00			`};`