- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Sat, 31 Dec 2011 07:34:14 +0100
Anne van Kesteren Fri, 30 Dec 2011 11:54:34 +0100:
> On Fri, 30 Dec 2011 05:51:16 +0100, Leif Halvard Silli:
>> The Trident cache behaviour is a symptom of its overall UTF-16
>> behaviour: apart from reading the BOM, it doesn't do any UTF-16
>> sniffing. I suspect that you want Opera/Firefox to become "as bad" at
>> 'getting' the UTF-16 encoding as Webkit/IE are? (Note that Webkit is
>> worse than IE - just to, once again, emphasize how difficult it is to
>> replicate IE.)
>
> How is WebKit worse than IE?

For HTML: if HTTP says 'WINDOWS-1252' but the page is little-endian UTF-16 without the BOM, then IE will render the page as WINDOWS-1252, and this will actually work - at least in some circumstances ... Check: <http://www.acsd.k12.sc.us/wwes/>. (There could be other pages that IE handles, but which don't fall into this category.)

For XHTML: for the 'nude' tests <http://malform.no/testing/utf/#xml-table-1>, Webkit is worse than Trident <http://malform.no/testing/utf/#xml-table-1-results>. (Trident performs a variant of the sniffing described in XML 1.0, whereas Webkit does not sniff at all unless there is an XML prolog.)

> And why should there be UTF-16 sniffing?

FIRST: what is 'UTF-16 sniffing'? The BOM is a form of sniffing. The HTML5 character encoding *sniffing* algorithm covers UTF-16 as well. Should we single out UTF-16 as something that should not be sniffed? What do browser vendors think?

Based on the tests at <http://malform.no/testing/utf/>, it seems like IE performs no UTF-16 detection/sniffing beyond using HTTP, using the BOM and - as a last resort - reading the META element (including the MS 'unicode' and MS 'unicodeFFFE' values, which Webkit also reads). But for HTML, Trident - unlike Webkit - does not make use of the XML encoding declaration for detecting the encoding: <http://malform.no/testing/utf/#html-table-4>. And for HTML, Trident - unlike Webkit - does not make use of the XML prolog (no, not the encoding declaration) for sniffing the endianness of UTF-16 files: <http://malform.no/testing/utf/#html-table-9>.

Aligning with IE would mean that Opera, Mozilla and Webkit must 'degenerate' their heuristics. Why would a vendor want to become less compatible with the Web?

>> But is the little-endian defaulting really important?
>> Overall, proper UTF-16 treatment (read: sniffing) on IE/Webkit's part
>> would probably improve the situation more.
>
> You mean there are sites that only work in Gecko/Presto?

'Sites' is perhaps a big word - 'UTF-16' pages are often lone pages, it seems. But yes, obviously. E.g. pages labelled as big-endian UTF-16 without a BOM. But, oops: it seems like Firefox does not use the META element anymore. It used to use the META element, in Firefox 3, but apparently stopped doing that - maybe they misread the HTML5 algorithm ... Nevertheless, I have come across pages that work in Firefox/Opera but not in Trident. MS Word, which often makes these pages, can save both big- and little-endian.

>> I know ... And it is precisely therefore that it would have been an
>> advantage, for the Web, to focus on *requiring* the BOM for UTF-16.
>
> It seems simpler to focus on promoting only UTF-8.

It seems simple enough to say that the BOM must be used. Saying something like that is no different from saying that a certain range of WINDOWS-1252 must not be used, is it?
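To make concrete what such sniffing amounts to at the byte level, here is a minimal Python sketch - my own illustration, not any browser's actual code. The BOM branch follows the usual FE FF / FF FE rule; the BOM-less branch is a simplified stand-in for the kind of '<'-based endianness guessing discussed above.

# Illustrative only: rough UTF-16 detection on the first bytes of a stream.
def sniff_utf16(data: bytes) -> str:
    if data[:2] == b'\xfe\xff':
        return 'UTF-16BE (BOM)'
    if data[:2] == b'\xff\xfe':
        return 'UTF-16LE (BOM)'
    # BOM-less heuristic: '<' (0x3C) followed by 0x00 suggests little-endian,
    # 0x00 followed by '<' suggests big-endian.
    if data[:2] == b'<\x00':
        return 'UTF-16LE (guessed from a leading "<")'
    if data[:2] == b'\x00<':
        return 'UTF-16BE (guessed from a leading "<")'
    return 'not recognisably UTF-16'

print(sniff_utf16('<!DOCTYPE html>'.encode('utf-16-le')))  # UTF-16LE (guessed ...)
print(sniff_utf16('<!DOCTYPE html>'.encode('utf-16')))     # BOM case; byte order follows the platform here

A browser that stops at the first two branches is, roughly, the IE behaviour described above; one that also takes the BOM-less branches is doing the extra detection that Opera/Firefox would have to 'degenerate' away.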
>>> Yeah, I'm going to file a new bug so we can reconsider although the
>>> octet sequence the various BOMs represent can have legitimate meanings
>>> in certain encodings,
>>
>> You mean: in addition to the BOM meaning, I suppose.
>
> No. In e.g. windows-1258 there is no BOM and FF FE simply means U+00FF
> U+20AB.

I think we have the same thing in mind. And btw, Google Search displays many such letters in UTF-16-encoded pages ... instead of displaying the content. Apparently, Google *fails* to consider the BOM octets magic ... Or maybe it is UTF-16-negative ...

>>> it seems in practice people use them for Unicode.
>>> (Helped by the fact that Trident/WebKit behave this way of course.)
>>
>> Don't forget the fact that Presto/Gecko do not move the BOM into the
>> <body> when you use UTF-16LE/BE, like they - per the spec of those
>> encodings - should do. See:
>> <http://bugzilla.validator.nu/show_bug.cgi?id=890>
>
> Well yes, that's why I'm planning to define utf-16 more in line with
> implementations (and render the current text obsolete, I suppose).

You don't need, for that reason, to follow a strategy that nullifies UTF-16LE/UTF-16BE. I outlined another strategy: say that all HTML pages are interpreted as being 'UTF-16', even if they are mis-labelled with the BOM-less UTF-16LE/UTF-16BE labels.

-- 
Leif H Silli
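As a byte-level footnote to the FF FE point above, a small Python sketch - my own illustration; Python's codec names and BOM handling do not necessarily match what any given browser does with these labels:

# Illustrative only: the same two octets mean different things depending on
# the declared encoding.
data = b'\xff\xfe'

print(repr(data.decode('cp1258')))     # 'ÿ₫' - U+00FF U+20AB; windows-1258 has no BOM concept
print(repr(data.decode('utf-16')))     # ''   - the octets are treated as a BOM and stripped
print(repr(data.decode('utf-16-le')))  # '\ufeff' - with an explicit LE label, the octets
                                       # decode to a ZWNBSP character, not a byte order mark

Per the specs of the UTF-16LE/UTF-16BE encodings, the last case is what a page labelled UTF-16LE should get - the octets end up as a ZWNBSP in the content - but, as the validator.nu bug above notes, browsers instead treat them as a BOM, as in the middle case.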
Received on Friday, 30 December 2011 22:34:14 UTC