mpv/misc/charset_conv.h

#ifndef MP_CHARSET_CONV_H
#define MP_CHARSET_CONV_H

#include <stdbool.h>
#include "misc/bstr.h"

struct mp_log;

enum {
    MP_ICONV_VERBOSE = 1,       // print errors instead of failing silently
    MP_ICONV_ALLOW_CUTOFF = 2,  // allow partial input data
    MP_STRICT_UTF8 = 4,         // don't fall back to UTF-8-BROKEN when guessing
};

bool mp_charset_is_utf8(const char *user_cp);
bool mp_charset_is_utf16(const char *user_cp);
bool mp_charset_requires_guess(const char *user_cp);
const char *mp_charset_guess(void *talloc_ctx, struct mp_log *log, bstr buf,
                             const char *user_cp, int flags);
bstr mp_iconv_to_utf8(struct mp_log *log, bstr buf, const char *cp, int flags);

#endif
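
The header only declares `mp_iconv_to_utf8`, so what follows is a minimal sketch of what a per-call conversion to UTF-8 could look like with POSIX `iconv`. It is not mpv's implementation. The name `sketch_to_utf8` and its `allow_cutoff` parameter are hypothetical. Treating `EINVAL` (a truncated multibyte sequence at the end of the input) as acceptable is one plausible reading of the "allow partial input data" comment on `MP_ICONV_ALLOW_CUTOFF`, not something the header confirms.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper (assumption, not mpv code): convert `len` bytes of
// `in` from charset `cp` to UTF-8. Returns a malloc'ed, NUL-terminated
// string, or NULL on failure. If allow_cutoff is nonzero, an incomplete
// multibyte sequence at the end of the input is dropped instead of being
// treated as an error.
char *sketch_to_utf8(const char *in, size_t len, const char *cp,
                     int allow_cutoff)
{
    // The descriptor is opened and closed on every call, so no conversion
    // state carries over from one buffer to the next.
    iconv_t icdsc = iconv_open("UTF-8", cp);
    if (icdsc == (iconv_t)-1)
        return NULL;

    size_t cap = len * 2 + 16;      // initial guess; grown on E2BIG
    char *out = malloc(cap + 1);    // +1 for the terminating NUL
    if (!out) {
        iconv_close(icdsc);
        return NULL;
    }

    char *ip = (char *)in;          // iconv() takes a non-const pointer
    size_t ileft = len;
    char *op = out;
    size_t oleft = cap;
    int ok = 1;

    while (ileft > 0) {
        if (iconv(icdsc, &ip, &ileft, &op, &oleft) != (size_t)-1)
            break;                  // all input consumed
        if (errno == E2BIG) {
            // Output buffer full: double it and continue where we stopped.
            size_t used = (size_t)(op - out);
            cap *= 2;
            char *grown = realloc(out, cap + 1);
            if (!grown) {
                ok = 0;
                break;
            }
            out = grown;
            op = out + used;
            oleft = cap - used;
        } else if (errno == EINVAL && allow_cutoff) {
            break;                  // truncated tail: keep what we have
        } else {
            ok = 0;                 // EILSEQ or another hard error
            break;
        }
    }

    iconv_close(icdsc);
    if (!ok) {
        free(out);
        return NULL;
    }
    *op = '\0';
    return out;
}
```

A caller would pass the raw packet bytes and the codepage name, then `free()` the result. The real function instead returns a `bstr` and takes a `struct mp_log *` for error reporting, as declared above.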
sub: add subtitle charset conversion This code was once part of subreader.c, then traveled to libass, and now made its way back to the fork of the fork of the original code, MPlayer. It works pretty much the same as subreader.c, except that we have to concatenate some packets to do auto-detection. This is rather annoying, but for all we know the actual source file could be a binary format. Unlike subreader.c, the iconv context is reopened on each packet. This is simpler, and with respect to multibyte encodings, more robust. Reopening is probably not very fast, but I suspect subtitle charset conversion is not an operation that happens often or has to be fast. Also, this auto-detection is disabled for microdvd - this is the only format we know that has binary data in its packets, but is actually decoded to text. FFmpeg doesn't really allow us to solve this properly, because a) the input packets can be binary, and b) the output is checked for being valid UTF-8, and if it's not, the output is thrown away and an error message is printed. We could just recode the decoded subtitles before sd_ass if it weren't for that. 2013-06-23 20:15:04 +00:00			`#ifndef MP_CHARSET_CONV_H`
			`#define MP_CHARSET_CONV_H`

			`#include <stdbool.h>`
Move compat/ and bstr/ directory contents somewhere else bstr.c doesn't really deserve its own directory, and compat had just a few files, most of which may as well be in osdep. There isn't really any justification for these extra directories, so get rid of them. The compat/libav.h was empty - just delete it. We changed our approach to API compatibility, and will likely not need it anymore. 2014-08-29 10:09:04 +00:00			`#include "misc/bstr.h"`
charset_conv: mp_msg conversions 2013-12-21 19:37:16 +00:00			`struct mp_log;`

sub: add subtitle charset conversion 2013-06-23 20:15:04 +00:00			`enum {`
			`MP_ICONV_VERBOSE = 1, // print errors instead of failing silently`
			`MP_ICONV_ALLOW_CUTOFF = 2, // allow partial input data`
sub: if charset detection fails, treat it as broken UTF-8 Broken UTF-8 in this context means we treat it as UTF-8, but we also interpret broken UTF-8 sequences as Latin1. Also, run our own UTF-8 check function before the charset detectors. This prevents ENCA's UTF-8 check from possibly messing up (like detecting 7-bit clean UTF-8 as ASCII, or other things). It also takes care of UTF-8 detection if no charset detector (ENCA, libguess) is compiled in, and it lets us deal better with cut-off UTF-8 sequences. 2013-08-15 17:29:42 +00:00			`MP_STRICT_UTF8 = 4, // don't fall back to UTF-8-BROKEN when guessing`
sub: add subtitle charset conversion 2013-06-23 20:15:04 +00:00			`};`

sub: don't print detected charset if it's UTF-8 Too noisy. This also fixes that iconv() was called if "utf8" was used as codepage. 2013-08-15 21:13:10 +00:00			`bool mp_charset_is_utf8(const char *user_cp);`
sub: detect charset in demuxer Slightly simpler, and removes the need to pre-read all subtitle packets. This still does the subtitle charset conversion on the packet level (instead of converting when parsing the file), so in theory this still could provide a way to change the charset at runtime. But maybe even this should be removed, as FFmpeg is somewhat likely to get its own charset detection and conversion mechanism in the future. (Would have to keep the subtitle file in memory to allow changing the charset on the fly, I guess.) 2015-12-16 22:54:25 +00:00			`bool mp_charset_is_utf16(const char *user_cp);`
sub: add subtitle charset conversion 2013-06-23 20:15:04 +00:00			`bool mp_charset_requires_guess(const char *user_cp);`
charset_conv: make it possible to return an allocated string as guess uchardet is written in C++, and thus doesn't appreciate the value of using static strings, and internally stores the guessed charset as allocated std::string. Add a minimal hack to deal with this. (I don't appreciate that the code is potentially harder to understand by returning either a static or allocated string, but I do appreciate not having to litter the existing code with strdups.) 2015-08-01 21:25:50 +00:00			`const char *mp_charset_guess(void *talloc_ctx, struct mp_log *log, bstr buf,`
			`const char *user_cp, int flags);`
charset_conv: mp_msg conversions 2013-12-21 19:37:16 +00:00			`bstr mp_iconv_to_utf8(struct mp_log *log, bstr buf, const char *cp, int flags);`
sub: add subtitle charset conversion 2013-06-23 20:15:04 +00:00
			`#endif`
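
The commit message for `MP_STRICT_UTF8` describes the "broken UTF-8" fallback as treating the input as UTF-8 while interpreting broken sequences as Latin1. The following is a minimal, self-contained sketch of that idea, not the code mpv uses. The function names `utf8_seq_len` and `sketch_fix_broken_utf8` are hypothetical, and handling invalid input one byte at a time is an assumption.

```c
#include <stddef.h>
#include <stdlib.h>

// Length of the well-formed UTF-8 sequence starting at s (with n bytes
// available), or 0 if the bytes there do not form one. Rejects overlong
// forms, surrogates, and code points above U+10FFFF.
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    if (n == 0)
        return 0;
    unsigned char c = s[0];
    size_t len;
    unsigned long cp, min;
    if (c < 0x80)
        return 1;
    else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
    else
        return 0;                   // stray continuation or invalid lead byte
    if (n < len)
        return 0;                   // sequence is cut off
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               // not a continuation byte
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

// Hypothetical helper: copy valid UTF-8 sequences unchanged and re-encode
// every other byte as the Latin-1 character with the same value. Returns a
// malloc'ed, NUL-terminated string, or NULL if allocation fails.
char *sketch_fix_broken_utf8(const char *in, size_t len)
{
    const unsigned char *s = (const unsigned char *)in;
    char *out = malloc(len * 2 + 1);    // worst case: each byte becomes two
    if (!out)
        return NULL;
    size_t o = 0;
    for (size_t i = 0; i < len; ) {
        size_t n = utf8_seq_len(s + i, len - i);
        if (n) {
            for (size_t k = 0; k < n; k++)
                out[o++] = (char)s[i + k];
            i += n;
        } else {
            // An invalid byte is always >= 0x80 here, so the Latin-1
            // character U+0080..U+00FF needs a two-byte UTF-8 encoding.
            out[o++] = (char)(0xC0 | (s[i] >> 6));
            out[o++] = (char)(0x80 | (s[i] & 0x3F));
            i++;
        }
    }
    out[o] = '\0';
    return out;
}
```

With this approach, input that is already valid UTF-8 passes through byte for byte, which is why such a fallback is safe to use when detection fails. Under `MP_STRICT_UTF8` the guesser would not fall back to it, per the enum comment.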