- From: Erik van der Poel <erikv@google.com>
- Date: Sun, 31 May 2009 10:37:45 -0700
- To: Larry Masinter <masinter@adobe.com>
- Cc: "M.T. Carrasco Benitez" <mtcarrascob@yahoo.com>, Travis Leithead <Travis.Leithead@microsoft.com>, "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Chris Wilson <Chris.Wilson@microsoft.com>, Harley Rosnow <Harley.Rosnow@microsoft.com>
I agree that it would be interesting if major HTML5 implementers and
(the) HTML5 spec writer(s) would agree on a UTF-8 default charset.

Just to make the HTML5 "version indicator" a bit more explicit, might
this be something like the following HTTP response header?

Content-Type: text/html; version=5; charset=gb2312

Erik

On Sun, May 31, 2009 at 8:05 AM, Larry Masinter <masinter@adobe.com> wrote:
> I believe the stance of most of the participants in the
> HTML working group is that no "version indicator" for
> HTML5 is necessary, and there is no specific
> "HTML5 doctype", against which newer, or stricter,
> behavior can be keyed.
>
> If charset defaulting is a reason for having a specific
> HTML5 version indicator, in order to trigger a stricter
> interpretation, say, of the default charset, that would
> be interesting.
>
> Larry
> --
> http://larry.masinter.net
>
>
> -----Original Message-----
> From: public-html-request@w3.org [mailto:public-html-request@w3.org] On Behalf Of M.T. Carrasco Benitez
> Sent: Sunday, May 31, 2009 1:18 AM
> To: Travis Leithead; Erik van der Poel
> Cc: public-html@w3.org; www-international@w3.org; Richard Ishida; Ian Hickson; Chris Wilson; Harley Rosnow
> Subject: Re: Auto-detect and encodings in HTML5
>
>
> Near to Erik, but UTF8 in the worst case:
>
> 1) Best: HTTP charset; unambiguous and "external"
> 2) Agree on ONE public detection algorithm
> 3) Mandatory declaration as near to the top as possible; if in META, the first in HEAD; within a certain range of bytes (e.g., 512)
> 4) Default UTF8 could be part of the algorithm; perhaps the last option
> 5) No BOM or similar
>
> Regards
> Tomas
>
>
> --- On Wed, 27/5/09, Erik van der Poel <erikv@google.com> wrote:
>
>> From: Erik van der Poel <erikv@google.com>
>> Subject: Re: Auto-detect and encodings in HTML5
>> To: "Travis Leithead" <Travis.Leithead@microsoft.com>
>> Cc: "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, "Richard Ishida" <ishida@w3.org>, "Ian Hickson" <ian@hixie.ch>, "Chris Wilson" <Chris.Wilson@microsoft.com>, "Harley Rosnow" <Harley.Rosnow@microsoft.com>
>> Date: Wednesday, 27 May, 2009, 7:30 PM
>> Hi Travis,
>>
>> First of all, I am really happy to see a browser vendor offer to get
>> stricter. :-)
>>
>> I wonder whether the doctype is a very clean way to move forward in
>> this area, given that the HTTP charset ought to disable the
>> auto-detector, but if many authors prefer the META charset, then the
>> doctype might be a reasonable compromise. I am still thinking about
>> this part.
>>
>> However, I object quite strongly to the UTF-8 default. If an HTML5
>> document includes the doctype but excludes the charset, old clients
>> might use their auto-detector and get it wrong. So I'd prefer to make
>> the charset mandatory with HTML5 doctype, and keep the rule that the
>> HTTP charset overrides the META charset for compatibility with old
>> clients.
>>
>> Erik
>>
>> On Tue, May 26, 2009 at 4:45 PM, Travis Leithead
>> <Travis.Leithead@microsoft.com> wrote:
>> > Ian, UA vendors, and HTML/I18n mailing list folks:
>> >
>> > I'd like to present the following feedback from one of our lead
>> > Trident developers on the IE team. He and I work on a number of
>> > parts of the web platform; the encoding and auto-detect subsystem
>> > being the one most relevant to this mail. I'd really like to
>> > generate some discussion from the other browser UAs on this
>> > topic.
>> >
>> > The basic idea is that we feel like there are a few places that
>> > the HTML5 spec could make assertions to improve the web's
>> > international support and future ease of interoperability
>> > regarding encodings and auto-detect. We recognize the need to be
>> > as compatible as possible with currently deployed web sites, and
>> > the technique proposed to maintain compatibility is by leveraging
>> > the "HTML5 doctype". I don't want to focus too much on that
>> > particular aspect of the proposal (though it's important), but to
>> > also consider the implications and scenarios as well.
>> >
>> > The proposal is straightforward. Only in pages with the HTML5 doctype:
>> >
>> > 1. Forbid the use of auto-detect heuristics for HTML encodings.
>> >
>> > 2. Forbid the use of problematic encodings such as UTF7 and EBCDIC.
>> >
>> >    Essentially, get rid of the classes of encodings in which
>> >    Jscript and tags do not correspond to simple ASCII characters
>> >    in the raw byte stream.
>> >
>> > 3. Only handle the encoding in the first META tag within the
>> >    HEAD, and require the HEAD and META tags to appear within
>> >    a well-defined, fixed byte distance into the file to take effect.
>> >
>> > 4. Require the default HTML encoding to be UTF8.
>> >
>> > I realize these changes depart somewhat from current practice and
>> > may seem constraining. But, I was very pleased to see UTF7 already
>> > excluded and EBCDIC discouraged in the HTML5 draft. The META tag
>> > is supposed to be the first after the HEAD according to the draft.
>> > But, if we could get substantial agreement from the various user
>> > agents to tighten up the behavior covering this handling, we can
>> > greatly improve the Internet in the following regards:
>> >
>> > A. HTML5 would no longer be vulnerable to script injection from
>> >    encodings such as UTF7 and EBCDIC which then trick the
>> >    auto-detection code to reinterpret the entire page and run the
>> >    injected script.
>> >
>> >    (Harley: I've had to fix a number of issues related to these
>> >    security vulnerabilities but the problem is systemic in the
>> >    products and the standard doesn't help.)
>> >
>> > B. HTML5 would be able to process markup more efficiently by
>> >    reducing the scanning and computation required to merely
>> >    determine the encoding of the file.
>> >
>> > C. Since sometimes the heuristics or default encoding uses
>> >    information about the user's environment, we often see pages
>> >    that display quite differently from one region to another.
>> >    As much as possible, browsing from across the globe should
>> >    give a consistent experience for a given page. (Basically, I
>> >    want my children to one day stop seeing garbage when they
>> >    browse Japanese web sites from the US.)
>> >
>> > D. We'd greatly increase the consistency of implementation of
>> >    markup handling by the various user agents. These openings
>> >    for UA-specific heuristics and decisions undermine the
>> >    benefits of standards and standardization.
>> >
>> > Thanks,
>> >
>> > Travis and Harley
>> >
>> > Internet Explorer Program Management/Development
>> > Microsoft Corporation
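
For readers following the thread, here is a minimal Python sketch of one possible reading of the precedence under discussion: the HTTP charset wins, then a META declaration found within a fixed byte window, then a UTF-8 default, with UTF7 and EBCDIC-style labels rejected and no auto-detection step. It is not an agreed algorithm, and the thread itself disagrees on the UTF-8 default. The names `pick_encoding`, `FORBIDDEN` and `META_SCAN_LIMIT` are hypothetical, the 512-byte window is only the "e.g." figure from Tomas's list, and the deny-list labels are illustrative rather than exhaustive.

```python
import re
from email.message import Message

# Hypothetical deny-list: labels for UTF7 and some EBCDIC code pages.
FORBIDDEN = {"utf-7", "utf7", "ebcdic", "cp037", "cp500"}

# Assumed scan window, taken from the "e.g., 512" in the thread.
META_SCAN_LIMIT = 512

# Simplification: finds the first META tag that carries a charset,
# rather than strictly inspecting only the first META tag in HEAD.
META_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def pick_encoding(content_type, body):
    """Choose an encoding label without any auto-detection heuristics."""
    label = charset_from_content_type(content_type) if content_type else None
    if label is None:
        # Only look at the first META_SCAN_LIMIT bytes of the document.
        match = META_RE.search(body[:META_SCAN_LIMIT])
        if match:
            label = match.group(1).decode("ascii").lower()
    if label is None:
        label = "utf-8"  # the contested default from the proposal
    if label in FORBIDDEN:
        raise ValueError(f"encoding {label!r} is not allowed")
    return label


if __name__ == "__main__":
    # The header from the message above; "version" is simply ignored here.
    print(pick_encoding("text/html; version=5; charset=gb2312", b""))
    print(pick_encoding("text/html", b'<head><meta charset="Shift_JIS">'))
    print(pick_encoding(None, b"<p>no declaration</p>"))
```

Because the HTTP header is consulted first, a META declaration is never reached when the server sends a charset, which matches the rule that the HTTP charset overrides the META charset.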
Received on Sunday, 31 May 2009 17:39:24 UTC