Re: Auto-detect and encodings in HTML5

From: Erik van der Poel <erikv@google.com>
Date: Wed, 27 May 2009 10:30:34 -0700
Message-ID: <c07a32650905271030r2407ac99hd62f309cc4766577@mail.gmail.com>
To: Travis Leithead <Travis.Leithead@microsoft.com>
Cc: "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Chris Wilson <Chris.Wilson@microsoft.com>, Harley Rosnow <Harley.Rosnow@microsoft.com>
Hi Travis,

First of all, I am really happy to see a browser vendor offer to get
stricter. :-)

I wonder whether the doctype is a very clean way to move forward in
this area, given that the HTTP charset ought to disable the
auto-detector, but if many authors prefer the META charset, then the
doctype might be a reasonable compromise. I am still thinking about
this part.

However, I object quite strongly to the UTF-8 default. If an HTML5
document includes the doctype but excludes the charset, old clients
might use their auto-detector and get it wrong. So I'd prefer to make
the charset mandatory with HTML5 doctype, and keep the rule that the
HTTP charset overrides the META charset for compatibility with old
clients.
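
As a rough illustration of that rule (HTTP charset first, then META, and no silent default for HTML5 pages), here is a minimal Python sketch. It is not spec text: the function name `choose_encoding`, its parameters, and the "auto-detect" marker string are all made up for the example.

```python
from typing import Optional


def choose_encoding(http_charset: Optional[str],
                    meta_charset: Optional[str],
                    html5_doctype: bool) -> str:
    """Pick a charset label for a page (hypothetical helper)."""
    # The HTTP charset overrides META, matching what old clients expect.
    if http_charset:
        return http_charset.lower()
    if meta_charset:
        return meta_charset.lower()
    # No declaration at all: with the HTML5 doctype this is treated as an
    # authoring error here, instead of silently assuming UTF-8.
    if html5_doctype:
        raise ValueError("HTML5 doctype requires an explicit charset")
    # Legacy pages keep today's behaviour (marker value is an assumption).
    return "auto-detect"
```

With this shape, a page served with an HTTP charset decodes the same way in old and new clients, and an HTML5 page with no charset is flagged rather than guessed at.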

Erik

On Tue, May 26, 2009 at 4:45 PM, Travis Leithead
<Travis.Leithead@microsoft.com> wrote:
> Ian, UA vendors, and HTML/I18n mailing list folks:
>
> I'd like to present the following feedback from one of our lead
> Trident developers on the IE team. He and I work on a number of
> parts of the web platform, the encoding and auto-detect subsystem
> being the one most relevant to this mail. I'd really like to
> generate some discussion from the other browser UAs on this
> topic.
>
> The basic idea is that we feel there are a few places where
> the HTML5 spec could make assertions to improve the web's
> international support and future ease of interoperability
> regarding encodings and auto-detect. We recognize the need to be
> as compatible as possible with currently deployed web sites, and
> the technique proposed to maintain compatibility is to leverage
> the "HTML5 doctype". I don't want to focus too much on that
> particular aspect of the proposal (though it's important), but
> also to consider the implications and scenarios.
>
> The proposal is straightforward. Only in pages with the HTML5 doctype:
>
> 1.  Forbid the use of auto-detect heuristics for HTML encodings.
>
> 2.  Forbid the use of problematic encodings such as UTF-7 and EBCDIC.
>
>     Essentially, get rid of the classes of encodings in which
>     Jscript and tags do not correspond to simple ASCII characters
>     in the raw byte stream.
>
> 3.  Handle only the encoding in the first META tag within the
>     HEAD, and require that the HEAD and META tags appear within
>     a well-defined, fixed byte distance into the file to take effect.
>
> 4.  Require the default HTML encoding to be UTF-8.
>
> I realize these changes depart somewhat from current practice and
> may seem constraining. But I was very pleased to see UTF-7 already
> excluded and EBCDIC discouraged in the HTML5 draft. The META tag
> is supposed to be the first after the HEAD according to the draft.
> If we could get substantial agreement from the various user
> agents to tighten up the behavior covering this handling, we can
> greatly improve the Internet in the following regards:
>
> A.  HTML5 would no longer be vulnerable to script injection via
>     encodings such as UTF-7 and EBCDIC, which trick the
>     auto-detection code into reinterpreting the entire page and
>     running the injected script.
>
>     (Harley: I’ve had to fix a number of issues related to these
>     security vulnerabilities, but the problem is systemic in the
>     products and the standard doesn’t help.)
>
> B.  HTML5 would be able to process markup more efficiently by
>     reducing the scanning and computation required to merely
>     determine the encoding of the file.
>
> C.  Since the heuristics or default encoding sometimes use
>     information about the user’s environment, we often see pages
>     that display quite differently from one region to another.
>     As much as possible, browsing from across the globe should
>     give a consistent experience for a given page.  (Basically, I
>     want my children to one day stop seeing garbage when they
>     browse Japanese web sites from the US.)
>
> D.  We’d greatly increase the consistency of implementation of
>     markup handling by the various user agents. These openings
>     for UA-specific heuristics and decisions undermine the
>     benefits of standards and standardization.
>
> Thanks,
>
> Travis and Harley
>
> Internet Explorer Program Management/Development
> Microsoft Corporation
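
To make the concern in item A of the quoted proposal concrete, the following minimal Python sketch shows a script tag written in UTF-7 and in one EBCDIC variant (Python's `cp037` codec) so that no ASCII "<" or ">" byte appears in the raw stream. The helper `contains_ascii_markup` is a hypothetical, deliberately naive filter and is not taken from any browser or from the proposal.

```python
# The same script element, encoded so the raw bytes contain no ASCII "<" or ">".
payload_utf7 = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
payload_ebcdic = "<script>alert(1)</script>".encode("cp037")


def contains_ascii_markup(raw: bytes) -> bool:
    """Naive check (hypothetical): looks only for literal '<' or '>' bytes."""
    return b"<" in raw or b">" in raw


# Neither payload trips the byte-level check...
looks_safe = not (contains_ascii_markup(payload_utf7)
                  or contains_ascii_markup(payload_ebcdic))

# ...but once a decoder is steered to UTF-7 or EBCDIC, both become markup.
decoded_utf7 = payload_utf7.decode("utf-7")
decoded_ebcdic = payload_ebcdic.decode("cp037")
```

Both decoded strings are the same script element, which is why a detector that can be nudged toward either encoding reopens a hole that byte-level filtering appeared to close.
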
Received on Wednesday, 27 May 2009 17:31:18 UTC
