Commit Graph

4 Commits

Author SHA1 Message Date
wm4 00f735d5cb bstr: make UTF-8 check stricter
Don't accept overlong sequences. Don't accept codepoints past the
maximum unicode codepoint. Don't accept the UTF-16 surrogate codepoints.

I'm not sure if there are more codepoints that are defined to be
invalid, but we just want to make libavcodec happy, so this is enough.

(libavcodec's subtitle converter checks for valid UTF-8 and throws up
and dies if it's not - now we want to use bstr_sanitize_utf8_latin1() to
force valid UTF-8, so the strictness of our UTF-8 parser has to match at
least that of the libavcodec's check.)

I'm not sure whether the min test is actually 100% correct.

Note that libavcodec also treats BOM codepoints as invalid. This is
definitely a bug: the BOM is really just "zero-width non-breaking space"
redefined by Microsoft, but it is perfectly valid to appear in the
middle of a string. Official Unicode has merely deprecated the old
usage of the BOM codepoint, and didn't make it illegal. Besides, the
string could be from the start of a file, so even this check doesn't
make sense even with libavcodec's insane logic. We don't copy this bug.
2013-08-15 23:40:03 +02:00
wm4 380fa71fc7 bstr: add UTF-8 validation and sanitation functions 2013-08-15 23:40:02 +02:00
Stefano Pigozzi 406241005e core: move contents to mpvcore (2/2)
Followup commit. Fixes all the files references.
2013-08-06 22:52:31 +02:00
Stefano Pigozzi bc27f946c2 core: move contents to mpvcore (1/2)
core is used in many unix systems for core dumps. For that reason some tools
work under the assumption that the file is indeed a core dump (for example
autoconf does this).

This commit just renames the files. The following one will change all the
includes to fix compilation. This is done this way because git has a easier
time tracing file changes if there is a pure rename commit.
2013-08-06 22:48:47 +02:00