
Re: Auto-detect and encodings in HTML5

From: 신정식 <jshin1987+w3@gmail.com>
Date: Wed, 27 May 2009 11:37:43 -0700
Message-ID: <180832fc0905271137s7af6763bt77949120c7a6bb0c@mail.gmail.com>
To: Erik van der Poel <erikv@google.com>
Cc: Travis Leithead <Travis.Leithead@microsoft.com>, "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Chris Wilson <Chris.Wilson@microsoft.com>, Harley Rosnow <Harley.Rosnow@microsoft.com>, Simon Montagu <smontagu@smontagu.org>, ap@webkit.org
Hi,



2009/5/27 Erik van der Poel <erikv@google.com>

> Hi Travis,
>
> First of all, I am really happy to see a browser vendor offer to get
> stricter. :-)
>

So am I :-)


>
> I wonder whether the doctype is a very clean way to move forward in
> this area, given that the HTTP charset ought to disable the
> auto-detector, but if many authors prefer the META charset, then the
> doctype might be a reasonable compromise. I am still thinking about
> this part.
>

My responses inlined below are also contingent on this issue. I'm on the
fence about it.



>
> However, I object quite strongly to the UTF-8 default. If an HTML5
> document includes the doctype but excludes the charset, old clients
> might use their auto-detector and get it wrong. So I'd prefer to make
> the charset mandatory with HTML5 doctype, and keep the rule that the
> HTTP charset overrides the META charset for compatibility with old
> clients.
>




>
> Erik
>
> On Tue, May 26, 2009 at 4:45 PM, Travis Leithead
> <Travis.Leithead@microsoft.com> wrote:
> > Ian, UA vendors, and HTML/I18n mailing list folks:
> >
> >
> >
> > I'd like to present the following feedback from one of our lead
> >
> > Trident developers on the IE team. He and I work on a number of
> >
> > parts of the web platform; the encoding and auto-detect subsystem
> >
> > being the one most relevant to this mail. I'd really like to
> >
> > generate some discussion from the other browser UAs on this
> >
> > topic.
> >
> >
> >
> > The basic idea is that we feel like there are a few places that
> >
> > the HTML5 spec could make assertions to improve the web's
> >
> > international support and future ease of interoperability
> >
> > regarding encodings and auto-detect. We recognize the need to be
> >
> > as compatible as possible with currently deployed web sites, and
> >
> > the technique proposed to maintain compatibility is by leveraging
> >
> > the "HTML5 doctype". I don't want to focus too much on that
> >
> > particular aspect of the proposal (though it's important), but to
> >
> > also consider the implications and scenarios as well.
> >
> >
> >
> > The proposal is straight-forward. Only in pages with the HTML5 doctype:
> >
> >
> >
> > 1.  Forbid the use of auto-detect heuristics for HTML encodings.
> >
> >


As far as I know (Simon will correct me if I'm not up-to-date), Firefox's
charset auto-detector kicks in only when both of the following two conditions
are satisfied:

1) Auto-detection is turned on explicitly by a user. It's OFF by default
2) No charset is specified anywhere.

Even if it's turned ON, Firefox does honor the explicitly specified charset
(http or meta).

Webkit does the same, except that it also tries to detect the encoding when one
of the Japanese encodings is specified (which I think has to be removed). Chrome
2.0 removed this in its copy of Webkit, so Chrome 2.0's behavior is identical to
Firefox's.
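
The gating rule described above can be sketched as a small predicate. This is
only an illustration of the behavior as described in this message, not code
taken from Firefox or Chrome; the function and parameter names are hypothetical.

```python
from typing import Optional


def detector_should_run(user_enabled: bool,
                        http_charset: Optional[str],
                        meta_charset: Optional[str]) -> bool:
    """Hypothetical helper: True only when both conditions above hold."""
    # Condition 1: the user explicitly turned auto-detection on (off by default).
    # Condition 2: no charset was declared, neither over HTTP nor in a meta tag.
    return user_enabled and http_charset is None and meta_charset is None
```

Under this sketch an explicitly declared charset always wins, even when the
user has enabled the detector.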

IE's behavior seems to be different, but I haven't managed to figure out
when its auto-detector kicks in. Could you tell us what IE does with
auto-detection?





>
> >
> > 2.  Forbid the use of problematic encodings such as UTF7 and EBCDIC.
> >
> >
> >
> >     Essentially, get rid of the classes of encodings in which
> >
> >     Jscript and tags do not correspond to simple ASCII characters
> >
> >     in the raw byte stream.
> >


I wholeheartedly support this. Firefox never supported EBCDIC encodings.

I'm tempted to go a step further and forbid ISO-2022-XX and GB-HZ as well,
but there might be a compatibility concern here. However, if that
prohibition is triggered by HTML5 doctype, it should be ok.
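
To illustrate why these encodings are a concern, here is a minimal sketch using
Python's built-in codecs. It assumes cp037 as a representative EBCDIC code
page; the variable names are made up for this example. In both cases the raw
bytes contain no ASCII "<", yet they decode to a script tag.

```python
# A UTF-7 byte sequence with no ASCII '<' or '>' anywhere in it.
utf7_payload = b"+ADw-script+AD4-"

# If the page is (mis)interpreted as UTF-7, the same bytes become markup.
decoded_utf7 = utf7_payload.decode("utf-7")

# EBCDIC maps markup characters to byte values other than their ASCII ones.
# cp037 is one EBCDIC code page that ships with Python (an assumed example).
ebcdic_bytes = "<script>".encode("cp037")
ascii_lt_in_ebcdic = b"<" in ebcdic_bytes
decoded_ebcdic = ebcdic_bytes.decode("cp037")

print(decoded_utf7, ascii_lt_in_ebcdic, decoded_ebcdic)
```

A filter that looks for "<script" in the raw byte stream would miss both
inputs, which is the class of problem item 2 tries to eliminate.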



>
> >
> >
> > 3.  Only handling the encoding in the first META tag within the
> >
> >     HEAD and requiring the HEAD and META tags to appear within
> >
> >     a well-defined, fixed byte distance into the file to take effect.
> >
> >



There are some web sites with meta tags deeply buried ( > 512 bytes from the
beginning). Webkit even has a layout test for this (currently, it scans the
first 1024 bytes).

I'm by no means happy with those web pages, so I agree with you on this,
except that I'm not sure about requiring the meta charset declaration to be
inside <head>.
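
A fixed-distance scan of the kind discussed here might look roughly like the
sketch below. It assumes a 1024-byte limit (the figure mentioned above for
Webkit) and uses a deliberately simplified regular expression; it is not the
HTML5 prescan algorithm, and the names are hypothetical.

```python
import re
from typing import Optional

# Assumed limit: 1024 bytes, the figure mentioned above for Webkit's scan.
PRESCAN_LIMIT = 1024

# Simplified pattern: finds charset=... inside a <meta ...> tag.
# A real prescan has to treat attributes, quoting, and comments more carefully.
META_CHARSET = re.compile(
    rb"<meta\b[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_meta_charset(raw: bytes, limit: int = PRESCAN_LIMIT) -> Optional[str]:
    """Return the first meta charset found within the first `limit` bytes."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii") if match else None
```

With this sketch, a declaration buried beyond the limit is simply ignored, and
the sketch does not require the declaration to sit inside <head>.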





> >
> > 4.  Require the default HTML encoding to be UTF8.
>

Although I wish every web page were in UTF-8, I think I'm with Erik
(mandating meta charset with http taking a higher priority).

Aha, you may have had something else in mind. Even if HTML5 mandates meta
charset with http taking a higher priority, some HTML5 pages are likely to
be noncompliant with the standard. In that case, we have to define the UA
behavior, and you want UAs to always assume UTF-8 instead of the
user-configurable default encoding, which is the current practice (in
Firefox and Webkit) when the auto-detector is OFF.
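
The precedence and fallback being discussed can be sketched as follows. This is
a rough illustration under stated assumptions, not any browser's actual logic:
the function name is hypothetical, and "windows-1252" merely stands in for
whatever default the user has configured.

```python
from typing import Optional


def choose_encoding(http_charset: Optional[str],
                    meta_charset: Optional[str],
                    html5_doctype: bool,
                    user_default: str = "windows-1252") -> str:
    """Hypothetical resolution order for a page with the auto-detector off."""
    # The HTTP charset takes priority over the meta charset.
    if http_charset:
        return http_charset
    if meta_charset:
        return meta_charset
    # No declaration at all. Proposal item 4 would fix the fallback to UTF-8
    # for pages with the HTML5 doctype; other pages keep current practice,
    # which is the user-configurable default (assumed value here).
    return "UTF-8" if html5_doctype else user_default
```

The only behavioral change in this sketch is the last line: the fallback for
undeclared HTML5 pages no longer depends on the user's settings.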




>
> >
> >
> > I realize these changes depart somewhat from current practice and
> >
> > may seem constraining.  But, I was very pleased to see UTF7 already
> >
> > excluded and EBCDIC discouraged in the HTML5 draft.  The META tag
> >
> > is supposed to be the first after the HEAD according to the draft.
> >
> > But, if we could get substantial agreement from the various user
> >
> > agents to tighten up the behavior covering this handling, we can
> >
> > greatly improve the Internet in the following regards:
> >
> >
> >
> >
> >
> > A.  HTML5 would no longer be vulnerable to script injection from
> >
> >     encodings such as UTF7 and EBCDIC which then tricks the auto-
> >
> >     detection code to reinterpret the entire page and run the
> >
> >     injected script.
> >
> >
> >
> >     (Harley: I’ve had to fix a number of issues related to these
> >
> >     security vulnerabilities but the problem is systemic in the
> >
> >     products and the standard doesn’t help.)
> >
> >
> >
> > B.  HTML5 would be able to process markup more efficiently by
> >
> >     reducing the scanning and computation required to merely
> >
> >     determine the encoding of the file.
> >
> >
> >
> > C.  Since sometimes the heuristics or default encoding uses
> >
> >     information about the user’s environment, we often see pages
> >
> >     that display quite differently from one region to another.
> >
> >     As much as possible, browsing from across the globe should
> >
> >     give a consistent experience for a given page.  (Basically, I
> >
> >     want my children to one day stop seeing garbage when they
> >
> >     browse Japanese web sites from the US.)




>
> >
> >
> >
> > D.  We’d greatly increase the consistency of implementation of
> >
> >     markup handling by the various user agents. These openings
> >
> >     for UA-specific heuristics and decisions undermine the
> >
> >     benefits of standards and standardization.
> >
> >
> >
> > Thanks,
> >
> >
> >
> > Travis and Harley
> >
> >
> >
> > Internet Explorer Program Management/Development
> >
> > Microsoft Corporation
>

Jungshik
Received on Thursday, 28 May 2009 06:44:11 GMT
