RE: Auto-detect and encodings in HTML5

I believe the stance of most participants in the
HTML working group is that no "version indicator" for
HTML5 is necessary, and that there is no specific
"HTML5 doctype" against which newer or stricter
behavior can be keyed.

If charset defaulting is a reason for having a specific
HTML5 version indicator, in order to trigger a stricter 
interpretation, say, of the default charset, that would
be interesting.

Larry
--
http://larry.masinter.net



-----Original Message-----
From: public-html-request@w3.org [mailto:public-html-request@w3.org] On Behalf Of M.T. Carrasco Benitez
Sent: Sunday, May 31, 2009 1:18 AM
To: Travis Leithead; Erik van der Poel
Cc: public-html@w3.org; www-international@w3.org; Richard Ishida; Ian Hickson; Chris Wilson; Harley Rosnow
Subject: Re: Auto-detect and encodings in HTML5


Close to Erik's position, but with UTF8 in the worst case:

1) Best: HTTP charset; unambiguous and "external"
2) Agree on ONE public detection algorithm
3) Mandatory declaration as near to the top as possible; if in META, the first in HEAD; within a certain range of bytes (e.g., 512) 
4) Default UTF8 could be part of the algorithm; perhaps the last option
5) No BOM or similar
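
As a rough illustration only, the ordering above might be sketched in Python as follows. The name pick_encoding, the regular expression, and the (encoding, source) return shape are assumptions made for this sketch, not anything agreed on the list; the 512-byte window is just the example figure from item 3. The sketch takes a shortcut by using the first META that carries a charset rather than strictly the first META in HEAD, leaves out the shared detection algorithm of item 2 (none has been agreed), and never looks at a BOM.

```python
import re

# Example figure from item 3; the real limit would have to be agreed.
META_WINDOW = 512

# Hypothetical, deliberately loose pattern: the first <meta ...> tag that
# carries a charset, in either the short or the http-equiv form.
_META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9._:-]+)',
    re.IGNORECASE,
)


def pick_encoding(http_charset, body, window=META_WINDOW):
    """Return (encoding, source) using the priority order in the list.

    http_charset: charset parameter from the HTTP Content-Type, or None.
    body: raw bytes of the document.
    """
    # 1) The HTTP charset is external and unambiguous, so it wins.
    if http_charset:
        return http_charset.strip().lower(), "http"

    # 3) Otherwise accept a META declaration only near the top of the file.
    match = _META_CHARSET.search(body[:window])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"

    # 4) Last option: default to UTF-8. No BOM sniffing anywhere (item 5).
    return "utf-8", "default"
```

For example, pick_encoding(None, b'<html><head><meta charset="Shift_JIS">') yields ("shift_jis", "meta"), while the same declaration pushed past the window falls through to ("utf-8", "default").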

Regards
Tomas


--- On Wed, 27/5/09, Erik van der Poel <erikv@google.com> wrote:

> From: Erik van der Poel <erikv@google.com>
> Subject: Re: Auto-detect and encodings in HTML5
> To: "Travis Leithead" <Travis.Leithead@microsoft.com>
> Cc: "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, "Richard Ishida" <ishida@w3.org>, "Ian Hickson" <ian@hixie.ch>, "Chris Wilson" <Chris.Wilson@microsoft.com>, "Harley Rosnow" <Harley.Rosnow@microsoft.com>
> Date: Wednesday, 27 May, 2009, 7:30 PM
> Hi Travis,
> 
> First of all, I am really happy to see a browser vendor offer to get
> stricter. :-)
> 
> I wonder whether the doctype is a very clean way to move forward in
> this area, given that the HTTP charset ought to disable the
> auto-detector, but if many authors prefer the META charset, then the
> doctype might be a reasonable compromise. I am still thinking about
> this part.
> 
> However, I object quite strongly to the UTF-8 default. If an HTML5
> document includes the doctype but excludes the charset, old clients
> might use their auto-detector and get it wrong. So I'd prefer to make
> the charset mandatory with HTML5 doctype, and keep the rule that the
> HTTP charset overrides the META charset for compatibility with old
> clients.
> 
> Erik
> 
> On Tue, May 26, 2009 at 4:45 PM, Travis Leithead <Travis.Leithead@microsoft.com> wrote:
> > Ian, UA vendors, and HTML/I18n mailing list folks:
> >
> > I'd like to present the following feedback from one of our lead
> > Trident developers on the IE team. He and I work on a number of
> > parts of the web platform; the encoding and auto-detect subsystem
> > being the one most relevant to this mail. I'd really like to
> > generate some discussion from the other browser UAs on this
> > topic.
> >
> > The basic idea is that we feel like there are a few places that
> > the HTML5 spec could make assertions to improve the web's
> > international support and future ease of interoperability
> > regarding encodings and auto-detect. We recognize the need to be
> > as compatible as possible with currently deployed web sites, and
> > the technique proposed to maintain compatibility is by leveraging
> > the "HTML5 doctype". I don't want to focus too much on that
> > particular aspect of the proposal (though it's important), but to
> > also consider the implications and scenarios as well.
> >
> > The proposal is straightforward. Only in pages with the HTML5 doctype:
> >
> > 1.  Forbid the use of auto-detect heuristics for HTML encodings.
> >
> > 2.  Forbid the use of problematic encodings such as UTF7 and EBCDIC.
> >
> >     Essentially, get rid of the classes of encodings in which
> >     Jscript and tags do not correspond to simple ASCII characters
> >     in the raw byte stream.
> >
> > 3.  Only handle the encoding in the first META tag within the
> >     HEAD, and require that the HEAD and META tags appear within
> >     a well-defined, fixed byte distance into the file to take effect.
> >
> > 4.  Require the default HTML encoding to be UTF8.
> >
> > I realize these changes depart somewhat from current practice and
> > may seem constraining.  But, I was very pleased to see UTF7 already
> > excluded and EBCDIC discouraged in the HTML5 draft.  The META tag
> > is supposed to be the first after the HEAD according to the draft.
> > But, if we could get substantial agreement from the various user
> > agents to tighten up the behavior covering this handling, we can
> > greatly improve the Internet in the following regards:
> >
> > A.  HTML5 would no longer be vulnerable to script injection from
> >     encodings such as UTF7 and EBCDIC which then tricks the
> >     auto-detection code into reinterpreting the entire page and
> >     running the injected script.
> >
> >     (Harley: I’ve had to fix a number of issues related to these
> >     security vulnerabilities but the problem is systemic in the
> >     products and the standard doesn’t help.)
> >
> > B.  HTML5 would be able to process markup more efficiently by
> >     reducing the scanning and computation required to merely
> >     determine the encoding of the file.
> >
> > C.  Since sometimes the heuristics or default encoding uses
> >     information about the user’s environment, we often see pages
> >     that display quite differently from one region to another.
> >     As much as possible, browsing from across the globe should
> >     give a consistent experience for a given page.  (Basically, I
> >     want my children to one day stop seeing garbage when they
> >     browse Japanese web sites from the US.)
> >
> > D.  We’d greatly increase the consistency of implementation of
> >     markup handling by the various user agents. These openings
> >     for UA-specific heuristics and decisions undermine the
> >     benefits of standards and standardization.
> >
> > Thanks,
> >
> > Travis and Harley
> >
> > Internet Explorer Program Management/Development
> > Microsoft Corporation
> >
> 
> 
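
To make point A of the quoted proposal concrete, here is a small Python sketch of why encodings such as UTF7 and EBCDIC are a problem for anything that inspects the raw byte stream. The payload and the looks_inert filter are hypothetical, made up for this illustration; the only library features used are Python's built-in "utf-7" codec and "cp037", which is one EBCDIC code page among several.

```python
# Hypothetical attacker-supplied bytes echoed into a page. There is no
# ASCII '<' or '>' anywhere in them.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"


def looks_inert(raw):
    """Naive byte-level filter: accept anything without ASCII angle brackets."""
    return b"<" not in raw and b">" not in raw


# Read as UTF-8 the payload is just odd-looking text.
as_utf8 = payload.decode("utf-8")

# If an auto-detector guesses UTF-7, the same bytes become live markup.
as_utf7 = payload.decode("utf-7")

# EBCDIC has the mirror-image problem: a real tag contains no ASCII
# angle-bracket bytes at all, so the same filter lets it through.
ebcdic_tag = "<script>".encode("cp037")

print(looks_inert(payload))      # the filter sees nothing suspicious
print(repr(as_utf8))             # unchanged text under UTF-8
print(repr(as_utf7))             # a script element under UTF-7
print(looks_inert(ebcdic_tag))   # the EBCDIC tag also passes the filter
```

The byte-level check passes in both cases, yet the page's meaning depends entirely on which encoding the client ends up choosing, which is the situation items 1 and 2 of the proposal are meant to rule out.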


      

Received on Sunday, 31 May 2009 15:06:13 UTC