mpv/misc/charset_conv.h

#ifndef MP_CHARSET_CONV_H
#define MP_CHARSET_CONV_H

#include <stdbool.h>
#include "bstr/bstr.h"

enum {
    MP_ICONV_VERBOSE = 1,       // print errors instead of failing silently
    MP_ICONV_ALLOW_CUTOFF = 2,  // allow partial input data
    MP_STRICT_UTF8 = 4,         // don't fall back to UTF-8-BROKEN when guessing
};

bool mp_charset_is_utf8(const char *user_cp);
bool mp_charset_requires_guess(const char *user_cp);
const char *mp_charset_guess(bstr buf, const char *user_cp, int flags);
bstr mp_charset_guess_and_conv_to_utf8(bstr buf, const char *user_cp, int flags);
bstr mp_iconv_to_utf8(bstr buf, const char *cp, int flags);

#endif
sub: add subtitle charset conversion This code was once part of subreader.c, then traveled to libass, and now made its way back to the fork of the fork of the original code, MPlayer. It works pretty much the same as subreader.c, except that we have to concatenate some packets to do auto-detection. This is rather annoying, but for all we know the actual source file could be a binary format. Unlike subreader.c, the iconv context is reopened on each packet. This is simpler, and with respect to multibyte encodings, more robust. Reopening is probably not a very fast, but I suspect subtitle charset conversion is not an operation that happens often or has to be fast. Also, this auto-detection is disabled for microdvd - this is the only format we know that has binary data in its packets, but is actually decoded to text. FFmpeg doesn't really allow us to solve this properly, because a) the input packets can be binary, and b) the output will be checked whether it's UTF-8, and if it's not, the output is thrown away and an error message is printed. We could just recode the decoded subtitles before sd_ass if it weren't for that. 2013-06-23 20:15:04 +00:00			`#ifndef MP_CHARSET_CONV_H`
			`#define MP_CHARSET_CONV_H`

			`#include <stdbool.h>`
Split mpvcore/ into common/, misc/, bstr/ 2013-12-17 01:39:45 +00:00			`#include "bstr/bstr.h"`
sub: add subtitle charset conversion This code was once part of subreader.c, then traveled to libass, and now made its way back to the fork of the fork of the original code, MPlayer. It works pretty much the same as subreader.c, except that we have to concatenate some packets to do auto-detection. This is rather annoying, but for all we know the actual source file could be a binary format. Unlike subreader.c, the iconv context is reopened on each packet. This is simpler, and with respect to multibyte encodings, more robust. Reopening is probably not a very fast, but I suspect subtitle charset conversion is not an operation that happens often or has to be fast. Also, this auto-detection is disabled for microdvd - this is the only format we know that has binary data in its packets, but is actually decoded to text. FFmpeg doesn't really allow us to solve this properly, because a) the input packets can be binary, and b) the output will be checked whether it's UTF-8, and if it's not, the output is thrown away and an error message is printed. We could just recode the decoded subtitles before sd_ass if it weren't for that. 2013-06-23 20:15:04 +00:00
			`enum {`
			`MP_ICONV_VERBOSE = 1, // print errors instead of failing silently`
			`MP_ICONV_ALLOW_CUTOFF = 2, // allow partial input data`
sub: if charset detection fails, treat it as broken UTF-8 Broken UTF-8 in this context means we treat it as UTF-8, but we also interpret broken UTF-8 sequences as Latin1. Also, run our own UTF-8 check function before the charset detectors. This prevents from ENCA's UTF-8 check possibly messing up (like detecting 7-bit clean UTF-8 as ASCII, or other things). It also takes care of UTF-8 detection if no charset detector (ENCA, libguess) is compiled in, and it lets us deal better with cut-off UTF-8 sequences. 2013-08-15 17:29:42 +00:00			`MP_STRICT_UTF8 = 4, // don't fall back to UTF-8-BROKEN when guessing`
sub: add subtitle charset conversion This code was once part of subreader.c, then traveled to libass, and now made its way back to the fork of the fork of the original code, MPlayer. It works pretty much the same as subreader.c, except that we have to concatenate some packets to do auto-detection. This is rather annoying, but for all we know the actual source file could be a binary format. Unlike subreader.c, the iconv context is reopened on each packet. This is simpler, and with respect to multibyte encodings, more robust. Reopening is probably not a very fast, but I suspect subtitle charset conversion is not an operation that happens often or has to be fast. Also, this auto-detection is disabled for microdvd - this is the only format we know that has binary data in its packets, but is actually decoded to text. FFmpeg doesn't really allow us to solve this properly, because a) the input packets can be binary, and b) the output will be checked whether it's UTF-8, and if it's not, the output is thrown away and an error message is printed. We could just recode the decoded subtitles before sd_ass if it weren't for that. 2013-06-23 20:15:04 +00:00			`};`

sub: don't print detected charset if it's UTF-8 Too noisy. This also fixes that iconv() was called if "utf8" was used as codepage. 2013-08-15 21:13:10 +00:00			`bool mp_charset_is_utf8(const char *user_cp);`
sub: add subtitle charset conversion This code was once part of subreader.c, then traveled to libass, and now made its way back to the fork of the fork of the original code, MPlayer. It works pretty much the same as subreader.c, except that we have to concatenate some packets to do auto-detection. This is rather annoying, but for all we know the actual source file could be a binary format. Unlike subreader.c, the iconv context is reopened on each packet. This is simpler, and with respect to multibyte encodings, more robust. Reopening is probably not a very fast, but I suspect subtitle charset conversion is not an operation that happens often or has to be fast. Also, this auto-detection is disabled for microdvd - this is the only format we know that has binary data in its packets, but is actually decoded to text. FFmpeg doesn't really allow us to solve this properly, because a) the input packets can be binary, and b) the output will be checked whether it's UTF-8, and if it's not, the output is thrown away and an error message is printed. We could just recode the decoded subtitles before sd_ass if it weren't for that. 2013-06-23 20:15:04 +00:00			`bool mp_charset_requires_guess(const char *user_cp);`
sub: if charset detection fails, treat it as broken UTF-8 Broken UTF-8 in this context means we treat it as UTF-8, but we also interpret broken UTF-8 sequences as Latin1. Also, run our own UTF-8 check function before the charset detectors. This prevents from ENCA's UTF-8 check possibly messing up (like detecting 7-bit clean UTF-8 as ASCII, or other things). It also takes care of UTF-8 detection if no charset detector (ENCA, libguess) is compiled in, and it lets us deal better with cut-off UTF-8 sequences. 2013-08-15 17:29:42 +00:00			`const char mp_charset_guess(bstr buf, const char user_cp, int flags);`
sub: add subtitle charset conversion This code was once part of subreader.c, then traveled to libass, and now made its way back to the fork of the fork of the original code, MPlayer. It works pretty much the same as subreader.c, except that we have to concatenate some packets to do auto-detection. This is rather annoying, but for all we know the actual source file could be a binary format. Unlike subreader.c, the iconv context is reopened on each packet. This is simpler, and with respect to multibyte encodings, more robust. Reopening is probably not a very fast, but I suspect subtitle charset conversion is not an operation that happens often or has to be fast. Also, this auto-detection is disabled for microdvd - this is the only format we know that has binary data in its packets, but is actually decoded to text. FFmpeg doesn't really allow us to solve this properly, because a) the input packets can be binary, and b) the output will be checked whether it's UTF-8, and if it's not, the output is thrown away and an error message is printed. We could just recode the decoded subtitles before sd_ass if it weren't for that. 2013-06-23 20:15:04 +00:00			`bstr mp_charset_guess_and_conv_to_utf8(bstr buf, const char *user_cp, int flags);`
			`bstr mp_iconv_to_utf8(bstr buf, const char *cp, int flags);`

			`#endif`