- From: Phillips, Addison <addison@amazon.com>
- Date: Tue, 26 May 2009 16:54:27 -0700
- To: Travis Leithead <Travis.Leithead@microsoft.com>, "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>
- CC: Chris Wilson <Chris.Wilson@microsoft.com>, Harley Rosnow <Harley.Rosnow@microsoft.com>
- Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA01A0ADBEFA@EX-SEA5-D.ant.amazon.com>
Hello Travis,

The Internationalization WG is, of course, quite interested in the problem of encoding management and detection in HTML5. I have added your note to the Internationalization WG’s agenda for our upcoming teleconference.

Regards,

Addison

Addison Phillips
Globalization Architect -- Lab126
Chair -- W3C Internationalization WG

Internationalization is not a feature.
It is an architecture.

From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of Travis Leithead
Sent: Tuesday, May 26, 2009 4:46 PM
To: public-html@w3.org; www-international@w3.org; Richard Ishida; Ian Hickson
Cc: Chris Wilson; Harley Rosnow
Subject: Auto-detect and encodings in HTML5

Ian, UA vendors, and HTML/I18n mailing list folks:

I'd like to present the following feedback from one of our lead Trident developers on the IE team. He and I work on a number of parts of the web platform; the encoding and auto-detect subsystem being the one most relevant to this mail. I'd really like to generate some discussion from the other browser UAs on this topic.

The basic idea is that we feel there are a few places where the HTML5 spec could make assertions to improve the web's international support and future ease of interoperability regarding encodings and auto-detect. We recognize the need to be as compatible as possible with currently deployed web sites, and the technique proposed to maintain compatibility is to leverage the "HTML5 doctype". I don't want to focus too much on that particular aspect of the proposal (though it's important), but also to consider the implications and scenarios.

The proposal is straightforward. Only in pages with the HTML5 doctype:

1. Forbid the use of auto-detect heuristics for HTML encodings.

2. Forbid the use of problematic encodings such as UTF7 and EBCDIC. Essentially, get rid of the classes of encodings in which Jscript and tags do not correspond to simple ASCII characters in the raw byte stream.

3. Handle only the encoding in the first META tag within the HEAD, and require that the HEAD and META tags appear within a well-defined, fixed byte distance into the file to take effect.

4. Require the default HTML encoding to be UTF8.

I realize these changes depart somewhat from current practice and may seem constraining. But I was very pleased to see UTF7 already excluded and EBCDIC discouraged in the HTML5 draft. The META tag is supposed to be the first after the HEAD according to the draft. And if we could get substantial agreement from the various user agents to tighten up the behavior covering this handling, we could greatly improve the Internet in the following regards:

A. HTML5 would no longer be vulnerable to script injection via encodings such as UTF7 and EBCDIC, which then tricks the auto-detection code into reinterpreting the entire page and running the injected script. (Harley: I’ve had to fix a number of issues related to these security vulnerabilities, but the problem is systemic in the products and the standard doesn’t help.)

B. HTML5 would be able to process markup more efficiently by reducing the scanning and computation required merely to determine the encoding of the file.

C. Since the heuristics or default encoding sometimes use information about the user’s environment, we often see pages that display quite differently from one region to another. As much as possible, browsing from across the globe should give a consistent experience for a given page. (Basically, I want my children to one day stop seeing garbage when they browse Japanese web sites from the US.)

D. We’d greatly increase the consistency of implementation of markup handling by the various user agents. These openings for UA-specific heuristics and decisions undermine the benefits of standards and standardization.

Thanks,
Travis and Harley
Internet Explorer Program Management/Development
Microsoft Corporation
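
For illustration only, the four rules proposed above could be sketched roughly as follows. This is not Trident's or any other browser's code. The names `choose_encoding`, `is_forbidden`, `META_SCAN_LIMIT` and `FORBIDDEN_LABELS` are hypothetical. The 1024-byte limit is a placeholder, since the proposal asks for a fixed byte distance without naming one, and the deny list is a small sample rather than a complete list of UTF7 or EBCDIC labels. The sketch also assumes the page has already been identified as carrying the HTML5 doctype.

```python
import re

# Assumption: the proposal names no number for the "fixed byte distance";
# 1024 is a placeholder chosen for this sketch.
META_SCAN_LIMIT = 1024

# Rule 4: the default encoding is UTF-8.
DEFAULT_ENCODING = "utf-8"

# Rule 2: an illustrative (not exhaustive) deny list of labels.
FORBIDDEN_LABELS = {"utf-7", "utf7", "cp037", "ibm037", "cp500", "ibm500"}

# Matches the first <meta ...> tag, whatever its attributes are.
_FIRST_META = re.compile(rb"<meta\b[^>]*>", re.IGNORECASE)
# Matches charset=... both as an attribute and inside a content="..." value.
_CHARSET = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE)


def is_forbidden(label: str) -> bool:
    """Return True for labels on the sample deny list or naming EBCDIC."""
    name = label.lower()
    return name in FORBIDDEN_LABELS or "ebcdic" in name


def choose_encoding(raw: bytes) -> str:
    """Pick an encoding label from the first META tag only (rule 3).

    No content sniffing is performed (rule 1): if the first META tag within
    the scan limit does not declare an allowed charset, the default is used.
    """
    head = raw[:META_SCAN_LIMIT]
    meta = _FIRST_META.search(head)
    if meta is None:
        return DEFAULT_ENCODING
    charset = _CHARSET.search(meta.group(0))
    if charset is None:
        return DEFAULT_ENCODING
    label = charset.group(1).decode("ascii").lower()
    if is_forbidden(label):
        return DEFAULT_ENCODING
    return label
```

With this sketch, a declaration that appears only in a second META tag, or past the scan limit, is ignored and the page falls back to UTF-8 rather than triggering any guessing.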
Received on Tuesday, 26 May 2009 23:55:06 UTC