- From: M.T. Carrasco Benitez <mtcarrascob@yahoo.com>
- Date: Sun, 31 May 2009 01:18:15 -0700 (PDT)
- To: Travis Leithead <Travis.Leithead@microsoft.com>, Erik van der Poel <erikv@google.com>
- Cc: "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Chris Wilson <Chris.Wilson@microsoft.com>, Harley Rosnow <Harley.Rosnow@microsoft.com>
Near to Erik, but UTF8 in the worst case:

1) Best: HTTP charset; unambiguous and "external"
2) Agree on ONE public detection algorithm
3) Mandatory declaration as near to the top as possible; if in META,
   the first in HEAD; within a certain range of bytes (e.g., 512)
4) Default UTF8 could be part of the algorithm; perhaps the last option
5) No BOM or similar

Regards
Tomas

--- On Wed, 27/5/09, Erik van der Poel <erikv@google.com> wrote:

> From: Erik van der Poel <erikv@google.com>
> Subject: Re: Auto-detect and encodings in HTML5
> To: "Travis Leithead" <Travis.Leithead@microsoft.com>
> Cc: "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, "Richard Ishida" <ishida@w3.org>, "Ian Hickson" <ian@hixie.ch>, "Chris Wilson" <Chris.Wilson@microsoft.com>, "Harley Rosnow" <Harley.Rosnow@microsoft.com>
> Date: Wednesday, 27 May, 2009, 7:30 PM
>
> Hi Travis,
>
> First of all, I am really happy to see a browser vendor offer to get
> stricter. :-)
>
> I wonder whether the doctype is a very clean way to move forward in
> this area, given that the HTTP charset ought to disable the
> auto-detector, but if many authors prefer the META charset, then the
> doctype might be a reasonable compromise. I am still thinking about
> this part.
>
> However, I object quite strongly to the UTF-8 default. If an HTML5
> document includes the doctype but excludes the charset, old clients
> might use their auto-detector and get it wrong. So I'd prefer to make
> the charset mandatory with HTML5 doctype, and keep the rule that the
> HTTP charset overrides the META charset for compatibility with old
> clients.
>
> Erik
>
> On Tue, May 26, 2009 at 4:45 PM, Travis Leithead
> <Travis.Leithead@microsoft.com> wrote:
> > Ian, UA vendors, and HTML/I18n mailing list folks:
> >
> > I'd like to present the following feedback from one of our lead
> > Trident developers on the IE team.
> > He and I work on a number of
> > parts of the web platform; the encoding and auto-detect subsystem
> > being the one most relevant to this mail. I'd really like to
> > generate some discussion from the other browser UAs on this
> > topic.
> >
> > The basic idea is that we feel like there are a few places that
> > the HTML5 spec could make assertions to improve the web's
> > international support and future ease of interoperability
> > regarding encodings and auto-detect. We recognize the need to be
> > as compatible as possible with currently deployed web sites, and
> > the technique proposed to maintain compatibility is to leverage
> > the "HTML5 doctype". I don't want to focus too much on that
> > particular aspect of the proposal (though it's important), but
> > also to consider the implications and scenarios.
> >
> > The proposal is straightforward. Only in pages with the HTML5 doctype:
> >
> > 1. Forbid the use of auto-detect heuristics for HTML encodings.
> >
> > 2. Forbid the use of problematic encodings such as UTF7 and EBCDIC.
> >
> >    Essentially, get rid of the classes of encodings in which
> >    Jscript and tags do not correspond to simple ASCII characters
> >    in the raw byte stream.
> >
> > 3. Only handle the encoding in the first META tag within the
> >    HEAD, and require that the HEAD and META tags appear within
> >    a well-defined, fixed byte distance into the file to take effect.
> >
> > 4. Require the default HTML encoding to be UTF8.
> >
> > I realize these changes depart somewhat from current practice and
> > may seem constraining. But, I was very pleased to see UTF7 already
> > excluded and EBCDIC discouraged in the HTML5 draft. The META tag
> > is supposed to be the first after the HEAD according to the draft.
> > But, if we could get substantial agreement from the various user
> > agents to tighten up the behavior covering this handling, we can
> > greatly improve the Internet in the following regards:
> >
> > A. HTML5 would no longer be vulnerable to script injection from
> >    encodings such as UTF7 and EBCDIC which then tricks the auto-
> >    detection code to reinterpret the entire page and run the
> >    injected script.
> >
> >    (Harley: I’ve had to fix a number of issues related to these
> >    security vulnerabilities but the problem is systemic in the
> >    products and the standard doesn’t help.)
> >
> > B. HTML5 would be able to process markup more efficiently by
> >    reducing the scanning and computation required to merely
> >    determine the encoding of the file.
> >
> > C. Since sometimes the heuristics or default encoding uses
> >    information about the user’s environment, we often see pages
> >    that display quite differently from one region to another.
> >    As much as possible, browsing from across the globe should
> >    give a consistent experience for a given page. (Basically, I
> >    want my children to one day stop seeing garbage when they
> >    browse Japanese web sites from the US.)
> >
> > D. We’d greatly increase the consistency of implementation of
> >    markup handling by the various user agents. These openings
> >    for UA-specific heuristics and decisions undermine the
> >    benefits of standards and standardization.
> >
> > Thanks,
> >
> > Travis and Harley
> >
> > Internet Explorer Program Management/Development
> > Microsoft Corporation
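
For illustration only, the precedence discussed in this thread (HTTP charset first, then the first META declaration within a fixed byte range, then UTF8 as the last option, with no heuristic auto-detection and no BOM sniffing) could be sketched roughly as below. This is not a proposed algorithm text: the function name `pick_encoding`, the regular expression, the set of rejected labels, and the decision to skip a forbidden label and fall through to the next source are all assumptions made for the sketch; the 512-byte window is the example figure from point 3.

```python
import re

# Assumption: 512 is only the example window mentioned in the thread.
META_SCAN_LIMIT = 512

# Illustrative labels for the "problematic" encodings (UTF7, EBCDIC families).
FORBIDDEN = {"utf-7", "utf7", "ebcdic", "cp037", "cp500"}

# Simplified pattern: finds charset=... inside the first <meta ...> that has one.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def pick_encoding(http_charset, body):
    """Return a lower-cased encoding label using a fixed precedence.

    1. HTTP charset, if present and not forbidden.
    2. META charset found within the first META_SCAN_LIMIT bytes.
    3. UTF-8 as the last option. No heuristics, no BOM sniffing.
    """
    candidates = []
    if http_charset:
        candidates.append(http_charset)
    match = _META_CHARSET.search(body[:META_SCAN_LIMIT])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    for label in candidates:
        label = label.strip().lower()
        # Assumption: a forbidden label is ignored rather than treated as an error.
        if label not in FORBIDDEN:
            return label
    return "utf-8"
```

Called as `pick_encoding(None, page_bytes)`, the sketch never inspects anything beyond the fixed window, which is the property points B and C of the quoted proposal are after: the result depends only on the declaration, not on the user's locale or on content heuristics.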
Received on Sunday, 31 May 2009 08:18:54 UTC