vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
#pragma once
|
|
|
|
|
|
|
|
#include "video/out/vo.h"
|
|
|
|
#include "video/out/gpu/context.h"
|
|
|
|
#include "video/mp_image.h"
|
|
|
|
|
|
|
|
#include "common.h"
|
|
|
|
#include "formats.h"
|
|
|
|
|
|
|
|
#define VK_LOAD_PFN(name) PFN_##name pfn_##name = (PFN_##name) \
|
|
|
|
vkGetInstanceProcAddr(vk->inst, #name);
|
|
|
|
|
2018-09-30 00:56:07 +00:00
|
|
|
#if HAVE_WIN32_DESKTOP
|
|
|
|
#define MP_VK_EXTERNAL_MEMORY_EXPORT_EXTENSION_NAME VK_KHR_EXTERNAL_MEMORY_WIN32_EXTENSION_NAME
|
|
|
|
#else
|
|
|
|
#define MP_VK_EXTERNAL_MEMORY_EXPORT_EXTENSION_NAME VK_KHR_EXTERNAL_MEMORY_FD_EXTENSION_NAME
|
|
|
|
#endif
|
|
|
|
|
vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
// Return a human-readable name for various struct mpvk_ctx enums
|
|
|
|
const char* vk_err(VkResult res);
|
|
|
|
|
|
|
|
// Convenience macros to simplify a lot of common boilerplate
|
|
|
|
#define VK_ASSERT(res, str) \
|
|
|
|
do { \
|
|
|
|
if (res != VK_SUCCESS) { \
|
|
|
|
MP_ERR(vk, str ": %s\n", vk_err(res)); \
|
|
|
|
goto error; \
|
|
|
|
} \
|
|
|
|
} while (0)
|
|
|
|
|
|
|
|
#define VK(cmd) \
|
|
|
|
do { \
|
|
|
|
MP_TRACE(vk, #cmd "\n"); \
|
|
|
|
VkResult res ## __LINE__ = (cmd); \
|
|
|
|
VK_ASSERT(res ## __LINE__, #cmd); \
|
|
|
|
} while (0)
|
|
|
|
|
|
|
|
// Uninits everything in the correct order
|
|
|
|
void mpvk_uninit(struct mpvk_ctx *vk);
|
|
|
|
|
|
|
|
// Initialization functions: As a rule of thumb, these need to be called in
|
|
|
|
// this order, followed by vk_malloc_init, followed by RA initialization, and
|
|
|
|
// finally followed by vk_swchain initialization.
|
|
|
|
|
|
|
|
// Create a vulkan instance. Returns VK_NULL_HANDLE on failure
|
2017-09-16 02:46:38 +00:00
|
|
|
bool mpvk_instance_init(struct mpvk_ctx *vk, struct mp_log *log,
|
|
|
|
const char *surf_ext_name, bool debug);
|
vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
|
|
|
|
// Generate a VkSurfaceKHR usable for video output. Returns VK_NULL_HANDLE on
|
|
|
|
// failure. Must be called after mpvk_instance_init.
|
|
|
|
bool mpvk_surface_init(struct vo *vo, struct mpvk_ctx *vk);
|
|
|
|
|
|
|
|
// Find a suitable physical device for use with rendering and which supports
|
|
|
|
// the surface.
|
|
|
|
// name: only match a device with this name
|
|
|
|
// sw: also allow software/virtual devices
|
|
|
|
bool mpvk_find_phys_device(struct mpvk_ctx *vk, const char *name, bool sw);
|
|
|
|
|
2018-10-20 04:48:15 +00:00
|
|
|
// Get the UUID for the selected physical device
|
|
|
|
bool mpvk_get_phys_device_uuid(struct mpvk_ctx *vk, uint8_t uuid_out[VK_UUID_SIZE]);
|
|
|
|
|
vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
// Pick a suitable surface format that's supported by this physical device.
|
|
|
|
bool mpvk_pick_surface_format(struct mpvk_ctx *vk);
|
|
|
|
|
|
|
|
struct mpvk_device_opts {
|
|
|
|
int queue_count; // number of queues to use
|
2017-10-07 19:36:16 +00:00
|
|
|
int async_transfer; // enable async transfer
|
|
|
|
int async_compute; // enable async compute
|
vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
// Create a logical device and initialize the vk_cmdpools
|
|
|
|
bool mpvk_device_init(struct mpvk_ctx *vk, struct mpvk_device_opts opts);
|
|
|
|
|
2017-09-27 15:08:06 +00:00
|
|
|
// Wait for all currently pending commands to have completed. This is the only
|
|
|
|
// function that actually processes the callbacks. Will wait at most `timeout`
|
|
|
|
// nanoseconds for the completion of each command. Using it with a value of
|
|
|
|
// UINT64_MAX effectively means waiting until the pool/device is idle. The
|
|
|
|
// timeout may also be passed as 0, in which case this function will not block,
|
|
|
|
// but only poll for completed commands.
|
vo_gpu: vulkan: support split command pools
Instead of using a single primary queue, we generate multiple
vk_cmdpools and pick the right one dynamically based on the intent.
This has a number of immediate benefits:
1. We can use async texture uploads
2. We can use the DMA engine for buffer updates
3. We can benefit from async compute on AMD GPUs
Unfortunately, the major downside is that due to the lack of QF
ownership tracking, we need to use CONCURRENT sharing for all resources
(buffers *and* images!). In theory, we could try figuring out a way to
get rid of the concurrent sharing for buffers (which is only needed for
compute shader UBOs), but even so, the concurrent sharing mode doesn't
really seem to have a significant impact over here (nvidia). It's
possible that other platforms may disagree.
Our deadlock-avoidance strategy is stupidly simple: Just flush the
command every time we need to switch queues, and make sure all
submission and callbacks happen in FIFO order. This required lifting the
cmds_pending and cmds_queued out from vk_cmdpool to mpvk_ctx, and some
functions died/got moved as a result, but that's a relatively minor
change.
On my hardware this is a fairly significant performance boost, mainly
due to async transfers. (Nvidia doesn't expose separate compute queues
anyway). On AMD, this should be a performance boost as well due to async
compute.
2017-09-24 13:05:24 +00:00
|
|
|
void mpvk_poll_commands(struct mpvk_ctx *vk, uint64_t timeout);
|
|
|
|
|
|
|
|
// Flush all currently queued commands. Call this once per frame, after
|
|
|
|
// submitting all of the command buffers for that frame. Calling this more
|
|
|
|
// often than that is possible but bad for performance.
|
|
|
|
// Returns whether successful. Failed commands will be implicitly dropped.
|
|
|
|
bool mpvk_flush_commands(struct mpvk_ctx *vk);
|
vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
|
|
|
|
// Since lots of vulkan operations need to be done lazily once the affected
|
|
|
|
// resources are no longer in use, provide an abstraction for tracking these.
|
|
|
|
// In practice, these are only checked and run when submitting new commands, so
|
|
|
|
// the actual execution may be delayed by a frame.
|
|
|
|
typedef void (*vk_cb)(void *priv, void *arg);
|
|
|
|
|
|
|
|
struct vk_callback {
|
|
|
|
vk_cb run;
|
|
|
|
void *priv;
|
|
|
|
void *arg; // as a convenience, you also get to pass an arg for "free"
|
|
|
|
};
|
|
|
|
|
|
|
|
// Associate a callback with the completion of all currently pending commands.
|
|
|
|
// This will essentially run once the device is completely idle.
|
|
|
|
void vk_dev_callback(struct mpvk_ctx *vk, vk_cb callback, void *p, void *arg);
|
|
|
|
|
|
|
|
// Helper wrapper around command buffers that also track dependencies,
|
|
|
|
// callbacks and synchronization primitives
|
|
|
|
struct vk_cmd {
|
|
|
|
struct vk_cmdpool *pool; // pool it was allocated from
|
2017-09-27 15:08:06 +00:00
|
|
|
VkQueue queue; // the submission queue (for recording/pending)
|
|
|
|
VkCommandBuffer buf; // the command buffer itself
|
|
|
|
VkFence fence; // the fence guards cmd buffer reuse
|
vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
// The semaphores represent dependencies that need to complete before
|
|
|
|
// this command can be executed. These are *not* owned by the vk_cmd
|
2017-09-27 15:08:06 +00:00
|
|
|
VkSemaphore *deps;
|
|
|
|
VkPipelineStageFlags *depstages;
|
vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
int num_deps;
|
2017-09-28 20:33:31 +00:00
|
|
|
// The signals represent semaphores that fire once the command finishes
|
|
|
|
// executing. These are also not owned by the vk_cmd
|
|
|
|
VkSemaphore *sigs;
|
|
|
|
int num_sigs;
|
vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
// Since VkFences are useless, we have to manually track "callbacks"
|
|
|
|
// to fire once the VkFence completes. These are used for multiple purposes,
|
|
|
|
// ranging from garbage collection (resource deallocation) to fencing.
|
|
|
|
struct vk_callback *callbacks;
|
|
|
|
int num_callbacks;
|
|
|
|
};
|
|
|
|
|
|
|
|
// Associate a callback with the completion of the current command. This
|
|
|
|
// bool will be set to `true` once the command completes, or shortly thereafter.
|
|
|
|
void vk_cmd_callback(struct vk_cmd *cmd, vk_cb callback, void *p, void *arg);
|
|
|
|
|
2017-09-28 20:33:31 +00:00
|
|
|
// Associate a raw dependency for the current command. This semaphore must
|
|
|
|
// signal by the corresponding stage before the command may execute.
|
|
|
|
void vk_cmd_dep(struct vk_cmd *cmd, VkSemaphore dep, VkPipelineStageFlags stage);
|
|
|
|
|
|
|
|
// Associate a raw signal with the current command. This semaphore will signal
|
|
|
|
// after the command completes.
|
|
|
|
void vk_cmd_sig(struct vk_cmd *cmd, VkSemaphore sig);
|
vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
|
2017-09-28 22:36:46 +00:00
|
|
|
// Signal abstraction: represents an abstract synchronization mechanism.
|
|
|
|
// Internally, this may either resolve as a semaphore or an event depending
|
|
|
|
// on whether the appropriate conditions are met.
|
|
|
|
struct vk_signal {
|
|
|
|
VkSemaphore semaphore;
|
|
|
|
VkEvent event;
|
|
|
|
VkQueue event_source;
|
|
|
|
};
|
|
|
|
|
|
|
|
// Generates a signal after the execution of all previous commands matching the
|
|
|
|
// given the pipeline stage. The signal is owned by the caller, and must be
|
|
|
|
// consumed eith vk_cmd_wait or released with vk_signal_cancel in order to
|
|
|
|
// free the resources.
|
|
|
|
struct vk_signal *vk_cmd_signal(struct mpvk_ctx *vk, struct vk_cmd *cmd,
|
|
|
|
VkPipelineStageFlags stage);
|
|
|
|
|
|
|
|
// Consumes a previously generated signal. This signal must fire by the
|
|
|
|
// indicated stage before the command can run. If *event is not NULL, then it
|
|
|
|
// MAY be set to a VkEvent which the caller MUST manually wait on in the most
|
|
|
|
// appropriate way. This function takes over ownership of the signal (and the
|
|
|
|
// signal will be released/reused automatically)
|
|
|
|
void vk_cmd_wait(struct mpvk_ctx *vk, struct vk_cmd *cmd,
|
|
|
|
struct vk_signal **sigptr, VkPipelineStageFlags stage,
|
|
|
|
VkEvent *out_event);
|
|
|
|
|
|
|
|
// Destroys a currently pending signal, for example if the resource is no
|
|
|
|
// longer relevant.
|
|
|
|
void vk_signal_destroy(struct mpvk_ctx *vk, struct vk_signal **sig);
|
|
|
|
|
vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
// Command pool / queue family hybrid abstraction
|
|
|
|
struct vk_cmdpool {
|
|
|
|
VkQueueFamilyProperties props;
|
2017-09-27 15:08:06 +00:00
|
|
|
int qf; // queue family index
|
vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
VkCommandPool pool;
|
2017-09-27 15:08:06 +00:00
|
|
|
VkQueue *queues;
|
|
|
|
int num_queues;
|
|
|
|
int idx_queues;
|
vo_gpu: vulkan: support split command pools
Instead of using a single primary queue, we generate multiple
vk_cmdpools and pick the right one dynamically based on the intent.
This has a number of immediate benefits:
1. We can use async texture uploads
2. We can use the DMA engine for buffer updates
3. We can benefit from async compute on AMD GPUs
Unfortunately, the major downside is that due to the lack of QF
ownership tracking, we need to use CONCURRENT sharing for all resources
(buffers *and* images!). In theory, we could try figuring out a way to
get rid of the concurrent sharing for buffers (which is only needed for
compute shader UBOs), but even so, the concurrent sharing mode doesn't
really seem to have a significant impact over here (nvidia). It's
possible that other platforms may disagree.
Our deadlock-avoidance strategy is stupidly simple: Just flush the
command every time we need to switch queues, and make sure all
submission and callbacks happen in FIFO order. This required lifting the
cmds_pending and cmds_queued out from vk_cmdpool to mpvk_ctx, and some
functions died/got moved as a result, but that's a relatively minor
change.
On my hardware this is a fairly significant performance boost, mainly
due to async transfers. (Nvidia doesn't expose separate compute queues
anyway). On AMD, this should be a performance boost as well due to async
compute.
2017-09-24 13:05:24 +00:00
|
|
|
// Command buffers associated with this queue. These are available for
|
|
|
|
// re-recording
|
|
|
|
struct vk_cmd **cmds;
|
|
|
|
int num_cmds;
|
vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
};
|
|
|
|
|
2017-09-27 15:08:06 +00:00
|
|
|
// Fetch a command buffer from a command pool and begin recording to it.
|
vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
// Returns NULL on failure.
|
|
|
|
struct vk_cmd *vk_cmd_begin(struct mpvk_ctx *vk, struct vk_cmdpool *pool);
|
|
|
|
|
2017-09-28 21:06:56 +00:00
|
|
|
// Finish recording a command buffer and queue it for execution. This function
|
2017-09-27 15:08:06 +00:00
|
|
|
// takes over ownership of *cmd, i.e. the caller should not touch it again.
|
2017-09-28 21:06:56 +00:00
|
|
|
void vk_cmd_queue(struct mpvk_ctx *vk, struct vk_cmd *cmd);
|
vo_gpu: vulkan: initial implementation
This time based on ra/vo_gpu. 2017 is the year of the vulkan desktop!
Current problems / limitations / improvement opportunities:
1. The swapchain/flipping code violates the vulkan spec, by assuming
that the presentation queue will be bounded (in cases where rendering
is significantly faster than vsync). But apparently, there's simply
no better way to do this right now, to the point where even the
stupid cube.c examples from LunarG etc. do it wrong.
(cf. https://github.com/KhronosGroup/Vulkan-Docs/issues/370)
2. The memory allocator could be improved. (This is a universal
constant)
3. Could explore using push descriptors instead of descriptor sets,
especially since we expect to switch descriptors semi-often for some
passes (like interpolation). Probably won't make a difference, but
the synchronization overhead might be a factor. Who knows.
4. Parallelism across frames / async transfer is not well-defined, we
either need to use a better semaphore / command buffer strategy or a
resource pooling layer to safely handle cross-frame parallelism.
(That said, I gave resource pooling a try and was not happy with the
result at all - so I'm still exploring the semaphore strategy)
5. We aggressively use pipeline barriers where events would offer a much
more fine-grained synchronization mechanism. As a result of this, we
might be suffering from GPU bubbles due to too-short dependencies on
objects. (That said, I'm also exploring the use of semaphores as a an
ordering tactic which would allow cross-frame time slicing in theory)
Some minor changes to the vo_gpu and infrastructure, but nothing
consequential.
NOTE: For safety, all use of asynchronous commands / multiple command
pools is currently disabled completely. There are some left-over relics
of this in the code (e.g. the distinction between dev_poll and
pool_poll), but that is kept in place mostly because this will be
re-extended in the future (vulkan rev 2).
The queue count is also currently capped to 1, because of the lack of
cross-frame semaphores means we need the implicit synchronization from
the same-queue semantics to guarantee a correct result.
2016-09-14 18:54:18 +00:00
|
|
|
|
|
|
|
// Predefined structs for a simple non-layered, non-mipped image
|
|
|
|
extern const VkImageSubresourceRange vk_range;
|
|
|
|
extern const VkImageSubresourceLayers vk_layers;
|