17 Commits

Author SHA1 Message Date
wm4 c6fafbffac vo_opengl: separate hwdec context and mapping, port it to use ra
This does two separate rather intrusive things:

 1. Make the hwdec context (which does initialization, provides the
    device to the decoder, and other basic state) and frame mapping
    (getting textures from a mp_image) separate. This is more
    flexible, and you could map multiple images at once. It will
    help removing some hwdec special-casing from video.c.
 2. Switch all hwdec API use to ra. Of course all code is still
    GL specific, but in theory it would be possible to support other
    backends. The most important change is that the hwdec interop
    returns ra objects, instead of anything GL specific. This removes
    the last dependency on GL-specific header files from video.c.
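
As a rough illustration of point 1, the sketch below splits a long-lived context from per-frame mappers. All type and function names here (hw_ctx, hw_mapper, frame, mapper_*) are hypothetical and do not come from mpv, and the "textures" are plain integers standing in for ra texture objects. It only shows why the split allows several frames to be mapped at once.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only; none of these names are mpv's. */
struct hw_ctx {
    int device_id;    /* state shared with the decoder */
    int num_mappers;  /* how many mappers were created from this context */
};

struct frame {
    int surface_id;   /* stand-in for a hardware-backed mp_image */
};

struct hw_mapper {
    struct hw_ctx *ctx;
    int tex[2];       /* stand-in for per-plane texture handles */
    int mapped;       /* non-zero while a frame is mapped */
};

/* Create a mapper that borrows (does not own) the context. */
static void mapper_init(struct hw_mapper *m, struct hw_ctx *ctx)
{
    assert(m && ctx);
    m->ctx = ctx;
    m->tex[0] = m->tex[1] = 0;
    m->mapped = 0;
    ctx->num_mappers++;
}

/* Map one frame into this mapper's textures; fails if already in use. */
static int mapper_map(struct hw_mapper *m, const struct frame *f)
{
    assert(m && f);
    if (m->mapped)
        return -1;
    m->tex[0] = f->surface_id * 2;      /* fake "luma" texture handle */
    m->tex[1] = f->surface_id * 2 + 1;  /* fake "chroma" texture handle */
    m->mapped = 1;
    return 0;
}

/* Release the mapping so the mapper can be reused for another frame. */
static void mapper_unmap(struct hw_mapper *m)
{
    assert(m);
    m->mapped = 0;
}
```

With this shape, two mappers created from the same context can each hold a different frame at the same time, which a single combined context/mapping object cannot do.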

I'm mixing these separate changes because both require essentially
rewriting all the glue code, so better do them at once. For the same
reason, this change isn't done incrementally.

hwdec_ios.m is untested, since I can't test it. Apart from superficial
mistakes, this also requires dealing with Apple's texture format
fuckups: they force you to use GL_LUMINANCE[_ALPHA] instead of GL_RED
and GL_RG. We also need to report the correct format via ra_tex to
the renderer, which is done by find_la_variant(). It's unknown whether
this works correctly.

hwdec_rpi.c as well as vo_rpi.c are still broken. (I need to pull my
RPI out of a dusty pile of devices and cables, so, later.)
2017-08-10 21:24:31 +02:00
wm4 b2fb3f1340 vo_opengl: hwdec_cuda: fix filtering mode
Probably explains quality issues in some cases.
2017-08-09 21:02:23 +02:00
Philip Langdale 7424651b96 vo_opengl: hwdec_cuda: Support separate decode and display devices
In a multi GPU scenario, it may be desirable to use different GPUs
for decode and display responsibilities. For example, if a secondary
GPU has better video decoding capabilities.

In such a scenario, we need to initialise a separate context for each
GPU, and use the display context in hwdec_cuda, while passing the
decode context to avcodec.

Once that's done, the actual hand-off between the two GPUs is
transparent to us (it happens during the cuMemcpy2D operation, which
copies the decoded frame from a cuda buffer to the OpenGL texture).

In the end, the bulk of the work is around introducing a new
configuration option to specify the decode device.
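
For readers unfamiliar with the cuMemcpy2D operation mentioned above: it is a pitched 2D copy, where source and destination rows may have different strides. The plain C function below is a CPU-side stand-in for that idea, not CUDA code and not what mpv runs. The function name copy_plane_2d is made up for this sketch. Its parameters loosely correspond to the srcPitch, dstPitch, WidthInBytes and Height fields of the CUDA_MEMCPY2D descriptor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `height` rows of `width_bytes` bytes each. Rows start every
 * `src_pitch` bytes in the source and every `dst_pitch` bytes in the
 * destination; padding bytes beyond `width_bytes` are left untouched. */
static void copy_plane_2d(uint8_t *dst, size_t dst_pitch,
                          const uint8_t *src, size_t src_pitch,
                          size_t width_bytes, size_t height)
{
    assert(dst && src);
    assert(width_bytes <= dst_pitch && width_bytes <= src_pitch);
    for (size_t y = 0; y < height; y++)
        memcpy(dst + y * dst_pitch, src + y * src_pitch, width_bytes);
}
```

In the real path the source is a CUDA device buffer and the destination is the mapped OpenGL texture, so the driver performs the copy, including any transfer between GPUs.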
2017-06-03 16:41:03 +02:00
wm4 91a2ddfdd7 vo_opengl: hwdec_cuda: use new format setup function
Automatically gives us support for all formats vo_opengl supports.
2017-02-17 16:39:10 +01:00
wm4 8a23892d04 vo_opengl: hwdec_cuda: add yuv420p support
Because it allows easier testing of filters + hwdec.

Make the texture setup code a bit more generic so it doesn't get too
much of a mess. We also use the GL renderer utility function
gl_find_unorm_format(), which saves us additional work with OpenGL's
semi-redundant format specifiers.
2017-01-16 16:10:39 +01:00
wm4 6b00663755 vo_opengl: hwdec_cuda: export AVHWDeviceContext
So we can use it for filtering later.
2017-01-16 16:10:39 +01:00
wm4 fb4ae3c06c cuda: use libavutil functions for copying hw surfaces to memory
mp_image_hw_download() is a libavutil wrapper added in the previous
commit. We drop our own code completely, as everything is provided by
libavutil and our helper wrapper.

This breaks the screenshot code, so that has to be adjusted as well.
2017-01-12 13:59:35 +01:00
Philip Langdale 83c5f704e7 vo_opengl: hwdec_cuda: Don't include hwcontext headers
After various simplifications, these includes simply aren't needed
now.
2016-12-04 20:41:42 +01:00
wm4 5aab17f833 vo_opengl: hwdec_cuda: make some init errors verbose
Improves autoprobe behavior. This is equivalent to other hwdec interop
wrappers. If CUDA is just not available, it should remain silent.
2016-11-24 18:14:55 +01:00
pavelxdd e89382f5f6 vo_opengl: hwdec_cuda: fix crash when trying to use hwdec=cuda if cuda SDK is not present
If the CUDA SDK wasn't installed, mpv crashed immediately with the message "Failed to load CUDA symbols".
2016-11-24 14:42:30 +01:00
Philip Langdale f5e82d5ed3 vo_opengl: hwdec_cuda: Use dynamic loading for cuda functions
This change applies the pattern used in ffmpeg to dynamically load
cuda, to avoid requiring the CUDA SDK at build time.
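
A minimal sketch of the dynamic loading pattern, assuming a POSIX system with dlopen(). This is not the code from ffmpeg or mpv: the struct cuda_fns and the cuda_load()/cuda_unload() helpers are hypothetical, only a single driver entry point (cuInit) is resolved, and its return type is declared as int rather than CUresult so that no CUDA header is needed. On Linux the driver library is normally named libcuda.so.1, which the caller would pass in.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* cuInit takes an unsigned flags argument; int stands in for CUresult. */
typedef int (*cu_init_fn)(unsigned int flags);

/* Hypothetical table of dynamically resolved functions. */
struct cuda_fns {
    void *lib;          /* handle from dlopen(), NULL if not loaded */
    cu_init_fn cuInit;  /* resolved symbol, NULL if not loaded */
};

/* Returns 0 on success, -1 if the library or the symbol is missing.
 * On failure the table is left zeroed so callers cannot call through it. */
static int cuda_load(struct cuda_fns *fns, const char *libname)
{
    assert(fns && libname);
    fns->lib = NULL;
    fns->cuInit = NULL;

    void *lib = dlopen(libname, RTLD_NOW);
    if (!lib)
        return -1;

    void *sym = dlsym(lib, "cuInit");
    if (!sym) {
        dlclose(lib);
        return -1;
    }

    /* Usual POSIX idiom for storing a dlsym() result in a function pointer. */
    *(void **)&fns->cuInit = sym;
    fns->lib = lib;
    return 0;
}

/* Safe to call even if cuda_load() failed. */
static void cuda_unload(struct cuda_fns *fns)
{
    assert(fns);
    if (fns->lib)
        dlclose(fns->lib);
    fns->lib = NULL;
    fns->cuInit = NULL;
}
```

The important property is that a missing library is reported as an ordinary error the caller can handle, so the build needs no CUDA SDK and a machine without CUDA simply fails the probe.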
2016-11-23 01:07:26 +01:00
Philip Langdale 585c5c34f1 vo_opengl: hwdec_cuda: Support P016 output surfaces
The latest 375.xx nvidia drivers add support for P016 output
surfaces. In combination with an ffmpeg change to return those
surfaces, we can display them.

The bulk of the work is related to knowing which format you're
dealing with at the right time. Once you know, it's straightforward.
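
To make the format difference concrete, here is a small sketch of how plane geometry differs between NV12 and P016. Both are 4:2:0 semi-planar layouts (a luma plane plus an interleaved chroma plane at half resolution), and P016 stores each component in 16 bits instead of 8. The enum, struct and function names are hypothetical and are not taken from mpv or ffmpeg.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration. */
enum surf_fmt { SURF_NV12, SURF_P016 };

struct plane_desc {
    size_t width, height;        /* in samples */
    size_t components;           /* 1 for luma, 2 for interleaved chroma */
    size_t bytes_per_component;  /* 1 for NV12, 2 for P016 */
};

/* Describe plane 0 (luma) or plane 1 (interleaved chroma) of a w x h surface. */
static void surf_plane(enum surf_fmt fmt, size_t w, size_t h, int plane,
                       struct plane_desc *out)
{
    assert(out);
    assert(plane == 0 || plane == 1);
    assert(w % 2 == 0 && h % 2 == 0);  /* keep 4:2:0 subsampling exact */

    out->bytes_per_component = (fmt == SURF_P016) ? 2 : 1;
    if (plane == 0) {
        out->width = w;
        out->height = h;
        out->components = 1;
    } else {
        out->width = w / 2;
        out->height = h / 2;
        out->components = 2;
    }
}

/* Number of payload bytes in one row of the plane (excluding padding). */
static size_t plane_row_bytes(const struct plane_desc *p)
{
    assert(p);
    return p->width * p->components * p->bytes_per_component;
}
```

For a 1920x1080 surface this gives 1920 payload bytes per row for both NV12 planes and 3840 for both P016 planes, which is the kind of value the texture setup and copy code need to agree on.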
2016-11-22 20:19:58 +01:00
Philip Langdale 343da8d73d vo_opengl: hwdec_cuda: get the cuda device from the GL context
Obviously, in the vast majority of cases, there's only one device
in the system, but doing this means we're more likely to get a
usable device in the multi-device case.

cuda would support decoding on one device and displaying on another,
but the peer memory handling is not transparent and I have no way
to test it, so I can't really write it.
2016-09-24 17:11:09 +02:00
Philip Langdale 441febfcba vo_opengl: hwdec_cuda: directly map GL textures and skip using PBOs
The documentation around this stuff is poor, but I found an nvidia
sample that demonstrates how to use the interop API most efficiently.

(https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st)

Key lessons are:
1) you can register the texture itself and have cuda write to it,
   thereby skipping an additional copy through the PBO.
2) You don't have to be mapped when you do the copy - once you get a
   mapped pointer, it remains valid. Magic!

This lets us throw out the PBOs as well as much of the explicit
alignment and stride handling.

CPU usage is slightly (~3%) lower for 4K content in one test case,
so it makes a detectable difference, and presumably saves memory.
2016-09-24 17:11:06 +02:00
Philip Langdale 440e0a98ab hwdec_cuda: Implement download_image for screenshots
Data has to be copied to system memory for screenshots.
2016-09-10 23:17:47 +02:00
Philip Langdale 76818e3dc7 hwdec_cuda: Use the non-deprecated CUDA-GL interop API
The nvidia examples use the old (as in CUDA 3.x) interop API which
is deprecated, and I think not even functional on recent versions
of CUDA for Windows. As I was following the examples, I used this
old API.

So, let's update to the new API, and hopefully, it'll start working
on Windows too.
2016-09-10 13:15:27 +02:00
Philip Langdale 2048ad2b8a hwdec/opengl: Add support for CUDA and cuvid/NvDecode
Nvidia's "NvDecode" API (up until recently called "cuvid") is a
cross-platform but nvidia-proprietary API that exposes their hardware
video decoding capabilities. It is analogous to their DXVA or VDPAU
support on Windows or Linux respectively, but without using
platform-specific API calls.

As a rule, you'd rather use DXVA or VDPAU as these are more mature
and well supported APIs, but on Linux, VDPAU is falling behind the
hardware capabilities, and there's no sign that nvidia are making
the investments to update it.

Most concretely, this means that there is no VP8/9 or HEVC Main10
support in VDPAU. On the other hand, NvDecode does export vp8/9 and
partial support for HEVC Main10 (more on that below).

ffmpeg already has support in the form of the "cuvid" family of
decoders. Due to the design of the API, it is best exposed as a full
decoder rather than an hwaccel. As such, there are decoders like
h264_cuvid, hevc_cuvid, etc.

These decoders support two output paths today - in both cases, NV12
frames are returned, either in CUDA device memory or regular system
memory.

In the case of the system memory path, the decoders can be used
as-is in mpv today with a command line like:

mpv --vd=lavc:h264_cuvid foobar.mp4

Doing this will take advantage of hardware decoding, but the cost
of the memcpy to system memory adds up, especially for high
resolution video (4K etc).

To avoid that, we need an hwdec that takes advantage of CUDA's
OpenGL interop to copy from device memory into OpenGL textures.

That is what this change implements.

The process is relatively simple as only basic device context
acquisition needs to be done by us - the CUDA buffer pool is managed
by the decoder - thankfully.

The hwdec looks a bit like the vdpau interop one - the hwdec
maintains a single set of plane textures and each output frame
is repeatedly mapped into these textures to pass on.

The frames are always in NV12 format, at least until 10bit output
support emerges.

The only slightly interesting part of the copying process is that
CUDA works by associating PBOs, so we need to define these for
each of the textures.

TODO Items:
* I need to add a download_image function for screenshots. This
  would do the same copy to system memory that the decoder's
  system memory output does.
* There are items to investigate on the ffmpeg side. There appears
  to be a problem with timestamps for some content.

Final note: I mentioned HEVC Main10. While there is no 10bit output
support, NvDecode can return dithered 8bit NV12 so you can take
advantage of the hardware acceleration.

This particular mode requires compiling ffmpeg with a modified
header (or possibly the CUDA 8 RC) and is not upstream in ffmpeg
yet.

Usage:

You will need to specify vo=opengl and hwdec=cuda.

Note that hwdec=auto will probably not work as it will try to use
vdpau first.

mpv --hwdec=cuda --vo=opengl foobar.mp4

If you want to use filters that require frames in system memory,
just use the decoder directly without the hwdec, as documented
above.
2016-09-08 16:06:12 +02:00