
Re: Auto-detect and encodings in HTML5

From: Erik van der Poel <erikv@google.com>
Date: Sun, 31 May 2009 10:37:45 -0700
Message-ID: <c07a32650905311037m2d02bbdekc454800fbc0f988b@mail.gmail.com>
To: Larry Masinter <masinter@adobe.com>
Cc: "M.T. Carrasco Benitez" <mtcarrascob@yahoo.com>, Travis Leithead <Travis.Leithead@microsoft.com>, "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Chris Wilson <Chris.Wilson@microsoft.com>, Harley Rosnow <Harley.Rosnow@microsoft.com>
I agree that it would be interesting if major HTML5 implementers and
(the) HTML5 spec writer(s) would agree on a UTF-8 default charset.

Just to make the HTML5 "version indicator" a bit more explicit, might
this be something like the following HTTP response header?

Content-Type: text/html; version=5; charset=gb2312

Erik
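
To make the header floated above concrete, here is a minimal Python sketch of how a client might read it. It assumes the hypothetical "version" parameter from the example; no such parameter is registered for text/html, and the name parse_content_type is invented for illustration. It relies only on the standard library's email.message.Message to split the parameters.

```python
from email.message import Message


def parse_content_type(value):
    """Split a Content-Type header value into media type, version and charset.

    "version" is the hypothetical HTML5 version indicator discussed in this
    thread, not a registered text/html parameter.
    """
    msg = Message()
    msg["Content-Type"] = value
    return {
        "media_type": msg.get_content_type(),   # e.g. "text/html"
        "version": msg.get_param("version"),    # None when absent
        "charset": msg.get_content_charset(),   # lower-cased, None when absent
    }


if __name__ == "__main__":
    print(parse_content_type("text/html; version=5; charset=gb2312"))
```

A client keyed on such an indicator could apply the stricter charset rules only when "version" is present and fall back to today's behavior otherwise.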

On Sun, May 31, 2009 at 8:05 AM, Larry Masinter <masinter@adobe.com> wrote:
> I believe the stance of most of the participants in the
> HTML working group is that no "version indicator" for
> HTML5 is necessary, and there is no specific
> "HTML5 doctype", against which newer, or stricter,
> behavior can be keyed.
>
> If charset defaulting is a reason for having a specific
> HTML5 version indicator, in order to trigger a stricter
> interpretation, say, of the default charset, that would
> be interesting.
>
> Larry
> --
> http://larry.masinter.net
>
>
> -----Original Message-----
> From: public-html-request@w3.org [mailto:public-html-request@w3.org] On Behalf Of M.T. Carrasco Benitez
> Sent: Sunday, May 31, 2009 1:18 AM
> To: Travis Leithead; Erik van der Poel
> Cc: public-html@w3.org; www-international@w3.org; Richard Ishida; Ian Hickson; Chris Wilson; Harley Rosnow
> Subject: Re: Auto-detect and encodings in HTML5
>
>
> Close to Erik's position, but with UTF8 as the worst case:
>
> 1) Best: HTTP charset; unambiguous and "external"
> 2) Agree on ONE public detection algorithm
> 3) Mandatory declaration as near to the top as possible; if in META, the first in HEAD; within a certain range of bytes (e.g., 512)
> 4) Default UTF8 could be part of the algorithm; perhaps the last option
> 5) No BOM or similar
>
> Regards
> Tomas
>
>
> --- On Wed, 27/5/09, Erik van der Poel <erikv@google.com> wrote:
>
>> From: Erik van der Poel <erikv@google.com>
>> Subject: Re: Auto-detect and encodings in HTML5
>> To: "Travis Leithead" <Travis.Leithead@microsoft.com>
>> Cc: "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, "Richard Ishida" <ishida@w3.org>, "Ian Hickson" <ian@hixie.ch>, "Chris Wilson" <Chris.Wilson@microsoft.com>, "Harley Rosnow" <Harley.Rosnow@microsoft.com>
>> Date: Wednesday, 27 May, 2009, 7:30 PM
>> Hi Travis,
>>
>> First of all, I am really happy to see a browser vendor offer to get
>> stricter. :-)
>>
>> I wonder whether the doctype is a very clean way to move forward in
>> this area, given that the HTTP charset ought to disable the
>> auto-detector, but if many authors prefer the META charset, then the
>> doctype might be a reasonable compromise. I am still thinking about
>> this part.
>>
>> However, I object quite strongly to the UTF-8 default. If an HTML5
>> document includes the doctype but excludes the charset, old clients
>> might use their auto-detector and get it wrong. So I'd prefer to make
>> the charset mandatory with HTML5 doctype, and keep the rule that the
>> HTTP charset overrides the META charset for compatibility with old
>> clients.
>>
>> Erik
>>
>> On Tue, May 26, 2009 at 4:45 PM, Travis Leithead <Travis.Leithead@microsoft.com> wrote:
>> > Ian, UA vendors, and HTML/I18n mailing list folks:
>> >
>> > I'd like to present the following feedback from one of our lead
>> > Trident developers on the IE team. He and I work on a number of
>> > parts of the web platform; the encoding and auto-detect subsystem
>> > is the one most relevant to this mail. I'd really like to
>> > generate some discussion from the other browser UAs on this
>> > topic.
>> >
>> > The basic idea is that we feel there are a few places where
>> > the HTML5 spec could make assertions to improve the web's
>> > international support and future ease of interoperability
>> > regarding encodings and auto-detect. We recognize the need to be
>> > as compatible as possible with currently deployed web sites, and
>> > the technique proposed to maintain compatibility is to leverage
>> > the "HTML5 doctype". I don't want to focus too much on that
>> > particular aspect of the proposal (though it's important), but
>> > also to consider the implications and scenarios.
>> >
>> > The proposal is straightforward. Only in pages with the HTML5 doctype:
>> >
>> > 1.  Forbid the use of auto-detect heuristics for HTML encodings.
>> >
>> > 2.  Forbid the use of problematic encodings such as UTF7 and EBCDIC.
>> >
>> >     Essentially, get rid of the classes of encodings in which
>> >     Jscript and tags do not correspond to simple ASCII characters
>> >     in the raw byte stream.
>> >
>> > 3.  Only handle the encoding in the first META tag within the
>> >     HEAD, and require that the HEAD and META tags appear within
>> >     a well-defined, fixed byte distance into the file to take effect.
>> >
>> > 4.  Require the default HTML encoding to be UTF8.
>> >
>> > I realize these changes depart somewhat from current practice and
>> > may seem constraining.  But, I was very pleased to see UTF7 already
>> > excluded and EBCDIC discouraged in the HTML5 draft. The META tag
>> > is supposed to be the first after the HEAD according to the draft.
>> > But, if we could get substantial agreement from the various user
>> > agents to tighten up the behavior covering this handling, we can
>> > greatly improve the Internet in the following regards:
>> >
>> > A.  HTML5 would no longer be vulnerable to script injection from
>> >     encodings such as UTF7 and EBCDIC, which then tricks the
>> >     auto-detection code into reinterpreting the entire page and
>> >     running the injected script.
>> >
>> >     (Harley: I’ve had to fix a number of issues related to these
>> >     security vulnerabilities but the problem is systemic in the
>> >     products and the standard doesn’t help.)
>> >
>> > B.  HTML5 would be able to process markup more efficiently by
>> >     reducing the scanning and computation required to merely
>> >     determine the encoding of the file.
>> >
>> > C.  Since sometimes the heuristics or default encoding uses
>> >     information about the user’s environment, we often see pages
>> >     that display quite differently from one region to another.
>> >     As much as possible, browsing from across the globe should
>> >     give a consistent experience for a given page.  (Basically, I
>> >     want my children to one day stop seeing garbage when they
>> >     browse Japanese web sites from the US.)
>> >
>> > D.  We’d greatly increase the consistency of implementation of
>> >     markup handling by the various user agents. These openings
>> >     for UA-specific heuristics and decisions undermine the
>> >     benefits of standards and standardization.
>> >
>> > Thanks,
>> >
>> > Travis and Harley
>> >
>> > Internet Explorer Program Management/Development
>> > Microsoft Corporation
>> >
>
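
The precedence discussed in this thread (HTTP charset first, then the first META tag within a fixed byte range, then a UTF-8 default, with UTF-7 and EBCDIC rejected) can be sketched roughly as follows. This is an illustration under stated assumptions, not the algorithm of any browser or of the HTML5 draft: the 512-byte limit is only the example figure from Tomas's list, the forbidden-label set is a small, non-exhaustive sample, the regular expressions are a crude stand-in for a real prescan (they do not check that the META sits inside HEAD), and choose_encoding is a hypothetical name.

```python
import re

# Assumption: the "e.g., 512" byte range mentioned in the thread.
META_SCAN_LIMIT = 512

# Assumption: a small illustrative sample of UTF-7 and EBCDIC labels.
FORBIDDEN_LABELS = {"utf-7", "utf7", "cp037", "ibm037", "cp500", "ibm500"}

_FIRST_META = re.compile(rb"<meta\b[^>]*>", re.IGNORECASE)
_CHARSET = re.compile(rb"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def choose_encoding(http_charset, body):
    """Pick an encoding label with no content sniffing.

    1. An HTTP charset, when present, overrides everything else.
    2. Otherwise use the first META tag, only if it starts and ends
       within the first META_SCAN_LIMIT bytes and declares a charset.
    3. Otherwise default to UTF-8.
    Labels in FORBIDDEN_LABELS are rejected in every case.
    """
    label = None
    if http_charset:
        label = http_charset
    else:
        first_meta = _FIRST_META.search(body[:META_SCAN_LIMIT])
        if first_meta:
            declared = _CHARSET.search(first_meta.group(0))
            if declared:
                label = declared.group(1).decode("ascii")
    if label is None:
        label = "utf-8"
    label = label.strip().lower()
    if label in FORBIDDEN_LABELS or label.startswith("ebcdic"):
        raise ValueError("encoding not allowed: " + label)
    return label


if __name__ == "__main__":
    page = b'<html><head><meta charset="Shift_JIS"><title>t</title>'
    print(choose_encoding(None, page))
    print(choose_encoding("gb2312", page))
```

Raising an error on a forbidden label is just one possible policy; a client could equally ignore the declaration and fall through to the default.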
Received on Sunday, 31 May 2009 17:39:24 GMT
