- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Wed, 27 May 2009 10:51:15 +0300
- To: Travis Leithead <Travis.Leithead@microsoft.com>, Jonathan Rosenne <rosennej@qsm.co.il>
- Cc: HTML WG <public-html@w3.org>, www-international@w3.org, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Chris Wilson <Chris.Wilson@microsoft.com>, Harley Rosnow <Harley.Rosnow@microsoft.com>, Yair Shmuel <yshmuel@microsoft.com>, Addison Phillips <addison@amazon.com>
On May 27, 2009, at 02:45, Travis Leithead wrote:

> The proposal is straight-forward. Only in pages with the HTML5
> doctype:

Scoping new behavior to the HTML5 doctype would be contrary to the goal of specifying HTML processing in such a way that the same processing rules work for both legacy content and new HTML5 content.

> 1. Forbid the use of auto-detect heuristics for HTML encodings.

IIRC, Firefox ships with the heuristic detector set to off. I don't have enough first-hand experience of browsing content in the affected languages (mainly CJK and Cyrillic) to be competent to say what the user experience impact of removing the detector altogether would be.

With the HTML5 parser, though, I've changed things so that the detector, when enabled, only runs for the first 512 bytes (the same 512 bytes as the <meta> prescan). The HTML5 parser-enabled Gecko builds haven't had enough testing globally to tell yet whether this is enough; my cursory testing of CJK sites suggests 512 bytes is sufficient. (Previously in Gecko, the heuristic detector continued to inspect the data much further into the stream, possibly triggering a reparse later on.)

I don't want to make the use of the detector conditional on the doctype, because the doctype hasn't been parsed yet when the detector runs.

> 2. Forbid the use problematic encodings such as UTF7 and EBCDIC.
>
> Essentially, get rid of the classes of encodings in which
> Jscript and tags do not correspond to simple ASCII characters
> in the raw byte stream.

I support this change (for all of text/html).

My understanding is that Firefox and Opera have never had EBCDIC decoders, so getting away with not having them thus far suggests that IE and Safari could get rid of those decoders without ill effects (at least outside intranets).

The security issues with UTF-7 are well-documented, so it seems like a good idea to ban it. (HTML email declared as UTF-7 might be an exception. Are there popular MUAs in the real world that send HTML email as UTF-7?)

> 3. Only handling the encoding in the first META tag within the
> HEAD and

I think requiring the meta to be in HEAD for the purposes of encoding sniffing would complicate things, so I don't support such a requirement on consumer implementations. The spec already makes the containment of the encoding meta in head an authoring conformance requirement.

> requiring that the HEAD and META tags to appear within
> a well-defined, fixed byte distance into the file to take effect.

I support making the number of bytes that the prescan applies to a fixed number. I think the number should not be smaller than 512 bytes and not larger than 1024 bytes. Could someone working on WebKit please comment on the experience with choosing the number? (I have a vague recollection that WebKit increased the number from 512 to 1024 for some reason.)

However, due to existing content, I don't think we can remove the tree builder detecting later encoding <meta>s and causing renavigation to the document. The renavigation is unfortunate, but detecting the situation in the tree builder seems to have a negligible cost on the load times of pages that don't cause the renavigation.
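To make the fixed prescan bound concrete, here is a rough sketch of what a bounded charset prescan could look like. This is my own simplification for illustration only, not the spec's prescan algorithm and not Gecko code; among other things, the real prescan has to tokenize attributes, handle the content="text/html; charset=..." form and skip comments.

```python
import re

PRESCAN_LIMIT = 1024  # fixed bound; 512 would work the same way

# Grossly simplified: just look for a charset=... token in the raw bytes.
CHARSET_PATTERN = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)',
                             re.IGNORECASE)

def prescan_for_charset(first_bytes):
    """Return a charset name declared within the first PRESCAN_LIMIT bytes,
    or None if no declaration is found there."""
    window = first_bytes[:PRESCAN_LIMIT]
    match = CHARSET_PATTERN.search(window)
    if match:
        return match.group(1).decode("ascii", "replace").lower()
    return None  # caller falls back to HTTP info, heuristics or a default
```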
> 4. Require the default HTML encoding to be UTF8.

This isn't feasible for text/html in general. Given existing content, you need to either allow a locale-dependent default or default to windows-1252 if you want a single global default.

It doesn't make sense to change this for HTML5 doctype documents only, because the HTML5 doctype is an author opt-in mechanism and authors already have three widely supported mechanisms for opting into UTF-8: charset=utf-8 in the HTTP header, charset=utf-8 in <meta>, and the BOM.
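As a rough illustration of my own (not spec text, and with no claim about precedence between the mechanisms), the three opt-ins boil down to checks along these lines:

```python
import re

UTF8_BOM = b"\xef\xbb\xbf"
META_UTF8 = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?\s*utf-?8',
                       re.IGNORECASE)

def author_opted_into_utf8(content_type_header, first_bytes):
    """Return True if the author declared UTF-8 via the HTTP header,
    a byte order mark, or an early <meta> declaration (simplified)."""
    # 1. charset=utf-8 on the HTTP layer, e.g.
    #    Content-Type: text/html; charset=utf-8
    if content_type_header and "charset=utf-8" in content_type_header.lower():
        return True
    # 2. A UTF-8 byte order mark at the very start of the byte stream.
    if first_bytes.startswith(UTF8_BOM):
        return True
    # 3. A <meta> charset declaration early in the file (simplified check).
    return META_UTF8.search(first_bytes[:1024]) is not None
```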
> But, if we could get substantial agreement from the various user
> agents to tighten up the behavior covering this handling, we can
> greatly improve the Internet in the following regards:

Would Microsoft be willing to tighten the IE 5.5, IE7, IE8 Almost Standards and IE8 Standards modes likewise in IE9? What about tightening them in a point release of IE8?

> A. HTML5 would no longer be vulnerable to script injection from
> encodings such as UTF7 and EBCDIC which then tricks the
> auto-detection code to reinterpret the entire page and run the
> injected script.

It makes sense to eliminate this attack vector, but it doesn't make sense to eliminate it for documents with the HTML5 doctype only, because doing so would still leave browsers vulnerable when the attacker targets a system that emits legacy doctypes.

> B. HTML5 would be able to process markup more efficiently by
> reducing the scanning and computation required to merely
> determine the encoding of the file.

Authors can already opt in to more efficient computation by declaring on the HTTP layer that they use UTF-8. Adding different rules for HTML5 would increase code complexity.

> C. Since sometimes the heuristics or default encoding uses
> information about the user’s environment, we often see pages
> that display quite differently from one region to another.
> As much as possible, browsing from across the globe should
> give a consistent experience for a given page. (Basically, I
> want my children to one day stop seeing garbage when they
> browse Japanese web sites from the US.)

This is indeed a problem. However, I don't see a way for browsers to force CJK and Cyrillic sites into making the experience for out-of-locale readers better without losing market share within the locale when doing so.

On May 27, 2009, at 07:39, Jonathan Rosenne wrote:

> EBCDIC and its national language variants, including visual encoding
> of bidi languages, are in use and will continue to be in use as long
> as mainframes are in use. A large quantity of data is stored in
> mainframes in EBCDIC and its variants, and the easiest way of
> interfacing this data to an HTML UI is by using the encoding
> features of HTML.

It doesn't follow that mainframes should use EBCDIC variants for interchange with other systems. I think the burden of performing conversion should be on the mainframes, and other systems shouldn't take on the security risk of supporting encodings that aren't rough ASCII supersets.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Wednesday, 27 May 2009 07:52:13 UTC