[whatwg] Video with MIME type application/octet-stream from Philip Jägenstedt on 2010-09-07 (public-whatwg-archive@w3.org from September 2010)

From: Philip Jägenstedt <philipj@opera.com>
Date: Tue, 07 Sep 2010 10:11:14 +0200
Message-ID: <op.vink0yuhsr6mfa@kirk>
On Tue, 07 Sep 2010 03:56:54 +0200, Boris Zbarsky <bzbarsky at mit.edu> wrote:

> On 9/6/10 3:19 PM, Aryeh Gregor wrote:
>> On Mon, Sep 6, 2010 at 4:14 AM, Philip Jägenstedt <philipj at opera.com>
>> wrote:
>>> The Ogg page begins with the 4 bytes "OggS", which is what Opera
>>> (GStreamer) checks for. For additional safety, one could also check
>>> for the trailing version indicator, which ought to be a NULL byte for
>>> current Ogg. [1] [2]
>>
>> "OggS\0" as the first five bytes seems safe to check for.  It's rather
>> short, I guess because it's repeated on every page, but five bytes is
>> long enough that it should occur by random only negligibly often, in
>> either text or binary files.
>
> So if a text file starts with U+4F67 U+6753 (both CJK ideographs) and  
> any ASCII character (can this happen in the real world?) you're OK with  
> treating it as Ogg?  Same for files starting with U+674F U+5367 (both CJK  
> ideographs) and any plane-0 character whose Unicode codepoint is 0 mod  
> 2^8 (plenty of CJK stuff like that)?  Is your CJK good enough that you  
> know text files would never start like this, or are you just assuming  
> that people who are silly enough to use UTF-16 for their text files and  
> aren't in Europe don't matter?  Or that you don't care about people who  
> happen to not use a BOM?

Thanks for pointing out these cases. I hadn't thought about it, but my CJK  
is good enough to say something about them:

'佧杓A' encoded in UTF-16BE is 'OggS\x00A'. However, 佧杓 is nonsensical
in Chinese at least: neither character is among the 3000 most common
characters [1], and the search results on Google (4) and Baidu (3) are
nonsense too. I don't know if things are any different for Japanese, but
given the Google results I doubt it.

'杏卧' encoded in UTF-16LE is 'OggS', and both of these characters are in
the top 3000, but together they're nonsense: "apricot crouch". (That's the
same "crouch" as in Crouching Tiger, Hidden Dragon, but the order is wrong,
so it doesn't mean "Crouching Apricot".) In the Google and Baidu results,
the only occurrence of the string seems to be in "???????", which
appears to be a theme of an apricot tree by a pavilion that appears in
several paintings [2] [3] [4].
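
Both collisions can be checked mechanically. The following is a minimal
Python sketch that uses only the code points named in the quoted message;
the variable names are illustrative, and U+4E00 is just one example of a
character whose low byte is zero:

```python
# Code points taken from the quoted message; no BOM is written.
be_text = "\u4f67\u6753A"      # two CJK ideographs followed by ASCII "A"
le_text = "\u674f\u5367"       # "apricot" + "crouch"

be_bytes = be_text.encode("utf-16-be")
le_bytes = le_text.encode("utf-16-le")

# A third character whose code point has a zero low byte (here U+4E00)
# supplies the 0x00 version byte in the little-endian case.
le_five = (le_text + "\u4e00").encode("utf-16-le")

print(be_bytes)  # b'OggS\x00A'
print(le_bytes)  # b'OggS'
print(le_five)   # b'OggS\x00N'
```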

All in all, I wouldn't be more worried about this than about the risk of
random binary data matching. Also, UTF-16 isn't a very common encoding for
simplified Chinese (卧 is a simplified character); GBK is dominant.

We could also add checking of the 6th byte, which should normally be 0x02  
for "first page of logical bitstream (bos)".
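
As a rough illustration of the checks discussed so far, the Python sketch
below tests for the capture pattern, the version byte and, optionally, the
bos flag. The function name looks_like_ogg and the require_bos parameter
are made up for this example and do not reflect what Opera or GStreamer
actually do; the byte offsets follow the Ogg page header layout.

```python
def looks_like_ogg(data: bytes, require_bos: bool = False) -> bool:
    """Heuristic check of the first bytes of a resource (hypothetical helper)."""
    # Bytes 0-3: capture pattern "OggS"; byte 4: stream structure version (0).
    if data[:5] != b"OggS\x00":
        return False
    if require_bos:
        # Byte 5 (the 6th byte) is the header type flag; 0x02 marks the
        # first page of a logical bitstream (bos).
        return len(data) >= 6 and data[5] == 0x02
    return True
```

With require_bos=True, the UTF-16BE text that encodes to 'OggS\x00A' is
rejected, because its 6th byte is 0x41 rather than 0x02.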

>> It looks like you could check for 0x1a 0x45 0xdf 0xa3 as the first
>> four bytes
>
> U+1A45 is Thai, looks like.  DFA3 is a surrogate, so you're ok there.
>
> U+451A is CJK.  U+A3DF looks like a Yi syllable, so you're more or less  
> ok there too.  I'm assuming you've already checked this byte sequence  
> out in UTF-8 and some other common encodings?

It's garbage in at least UTF-8, Big5 and GBK.
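
One way to spot-check this is to decode the four bytes under each encoding
and look at the resulting code points. This is a small Python sketch,
assuming the standard-library codec names utf-8, big5 and gbk;
errors="replace" is used so that an unmappable sequence does not raise:

```python
magic = bytes([0x1A, 0x45, 0xDF, 0xA3])

decoded = {}
for codec in ("utf-8", "big5", "gbk"):
    text = magic.decode(codec, errors="replace")
    decoded[codec] = text
    # In all three encodings the first byte decodes to the C0 control
    # character SUB (U+001A), which does not belong at the start of text.
    print(codec, [f"U+{ord(ch):04X}" for ch in text])
```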

I'm not sure what infrastructure is in place, but perhaps one could *not*  
sniff if Content-Type also indicates an encoding? That way there's a  
solution for those who really want to display the hypothetical false  
positives as text.
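
That rule could look roughly like the following Python sketch. The
function name may_sniff is hypothetical, and the header parsing relies on
email.message.Message.get_param from the Python standard library rather
than on any browser's actual logic:

```python
from email.message import Message


def may_sniff(content_type: str) -> bool:
    """Return False when the Content-Type header carries a charset parameter."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param returns None when the parameter is absent.
    return msg.get_param("charset") is None
```

Under this sketch, "application/octet-stream" would still be sniffed,
while "text/plain; charset=UTF-16" would not.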

[1] http://www.zein.se/patrick/3000char.html
[2] http://hi.baidu.com/%BC%C5%D5%AB/blog/item/f0ee8a4c5a5d0c02b3de05aa.html
[3] http://blog.sina.com.cn/s/blog_475be8240100ew5q.html
[4] http://www.zgddhj.cn/zj/bh/zhouhongyi/201007/32053.html

-- 
Philip Jägenstedt
Core Developer
Opera Software
Received on Tuesday, 7 September 2010 01:11:14 UTC