Re: Auto-detect and encodings in HTML5

My position is close to Erik's, but with UTF-8 as the fallback in the worst case:

1) Best: HTTP charset; unambiguous and "external"
2) Agree on ONE public detection algorithm
3) A mandatory declaration as near to the top as possible: if in META, the first element in HEAD, and within a fixed number of bytes (e.g., 512)
4) A UTF-8 default could be part of the algorithm, perhaps as the last option
5) No BOM or similar
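
To make the order in points 1, 3 and 4 concrete, here is a rough sketch in Python. It is not an agreed algorithm: the function name choose_encoding, the regular expressions and the 512-byte limit are my own illustrative assumptions, and the META matching is deliberately simplistic.

```python
import re

# Hypothetical limit, taken from the "e.g., 512" in point 3; not a standardized value.
PRESCAN_LIMIT = 512

# Simplistic matchers (assumptions for illustration, not a real HTML parser).
_HTTP_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9_.:-]+)', re.IGNORECASE)
_META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def choose_encoding(content_type, body):
    """Return (encoding, source) using the priority order from the list above."""
    # 1) The HTTP charset wins: it is unambiguous and external to the document.
    if content_type:
        match = _HTTP_CHARSET.search(content_type)
        if match:
            return match.group(1).lower(), "http"
    # 3) Otherwise accept only a META declaration within the first bytes of the file.
    match = _META_CHARSET.search(body[:PRESCAN_LIMIT])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"
    # 4) Last option: default to UTF-8, with no heuristic detection at all.
    return "utf-8", "default"
```

For example, choose_encoding("text/html; charset=ISO-8859-1", page_bytes) reports the HTTP value regardless of what the page declares, while a page whose META tag starts after the first 512 bytes falls through to the UTF-8 default.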

Regards
Tomas


--- On Wed, 27/5/09, Erik van der Poel <erikv@google.com> wrote:

> From: Erik van der Poel <erikv@google.com>
> Subject: Re: Auto-detect and encodings in HTML5
> To: "Travis Leithead" <Travis.Leithead@microsoft.com>
> Cc: "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, "Richard Ishida" <ishida@w3.org>, "Ian Hickson" <ian@hixie.ch>, "Chris Wilson" <Chris.Wilson@microsoft.com>, "Harley Rosnow" <Harley.Rosnow@microsoft.com>
> Date: Wednesday, 27 May, 2009, 7:30 PM
> Hi Travis,
> 
> First of all, I am really happy to see a browser vendor offer to get
> stricter. :-)
> 
> I wonder whether the doctype is a very clean way to move forward in
> this area, given that the HTTP charset ought to disable the
> auto-detector, but if many authors prefer the META charset, then the
> doctype might be a reasonable compromise. I am still thinking about
> this part.
> 
> However, I object quite strongly to the UTF-8 default. If an HTML5
> document includes the doctype but excludes the charset, old clients
> might use their auto-detector and get it wrong. So I'd prefer to make
> the charset mandatory with HTML5 doctype, and keep the rule that the
> HTTP charset overrides the META charset for compatibility with old
> clients.
> 
> Erik
> 
> On Tue, May 26, 2009 at 4:45 PM, Travis Leithead <Travis.Leithead@microsoft.com> wrote:
> > Ian, UA vendors, and HTML/I18n mailing list folks:
> >
> > I'd like to present the following feedback from one of our lead
> > Trident developers on the IE team. He and I work on a number of
> > parts of the web platform, the encoding and auto-detect subsystem
> > being the one most relevant to this mail. I'd really like to
> > generate some discussion from the other browser UAs on this
> > topic.
> >
> > The basic idea is that we feel there are a few places where
> > the HTML5 spec could make assertions to improve the web's
> > international support and future ease of interoperability
> > regarding encodings and auto-detect. We recognize the need to be
> > as compatible as possible with currently deployed web sites, and
> > the technique proposed to maintain compatibility is to leverage
> > the "HTML5 doctype". I don't want to focus too much on that
> > particular aspect of the proposal (though it's important), but
> > rather to consider the implications and scenarios as well.
> >
> > The proposal is straightforward. Only in pages with the HTML5 doctype:
> >
> > 1.  Forbid the use of auto-detect heuristics for HTML encodings.
> >
> > 2.  Forbid the use of problematic encodings such as UTF7 and EBCDIC.
> >
> >     Essentially, get rid of the classes of encodings in which
> >     JScript and tags do not correspond to simple ASCII characters
> >     in the raw byte stream.
> >
> > 3.  Handle only the encoding in the first META tag within the
> >     HEAD, and require the HEAD and META tags to appear within
> >     a well-defined, fixed byte distance into the file to take effect.
> >
> > 4.  Require the default HTML encoding to be UTF8.
> >
> > I realize these changes depart somewhat from current practice and
> > may seem constraining.  But I was very pleased to see UTF7 already
> > excluded and EBCDIC discouraged in the HTML5 draft.  The META tag
> > is supposed to be the first after the HEAD according to the draft.
> > But if we could get substantial agreement from the various user
> > agents to tighten up the behavior covering this handling, we could
> > greatly improve the Internet in the following regards:
> >
> > A.  HTML5 would no longer be vulnerable to script injection from
> >     encodings such as UTF7 and EBCDIC which then tricks the
> >     auto-detection code to reinterpret the entire page and run the
> >     injected script.
> >
> >     (Harley: I’ve had to fix a number of issues related to these
> >     security vulnerabilities but the problem is systemic in the
> >     products and the standard doesn’t help.)
> >
> > B.  HTML5 would be able to process markup more efficiently by
> >     reducing the scanning and computation required to merely
> >     determine the encoding of the file.
> >
> > C.  Since sometimes the heuristics or default encoding use
> >     information about the user’s environment, we often see pages
> >     that display quite differently from one region to another.
> >     As much as possible, browsing from across the globe should
> >     give a consistent experience for a given page.  (Basically, I
> >     want my children to one day stop seeing garbage when they
> >     browse Japanese web sites from the US.)
> >
> > D.  We’d greatly increase the consistency of implementation of
> >     markup handling by the various user agents. These openings
> >     for UA-specific heuristics and decisions undermine the
> >     benefits of standards and standardization.
> >
> > Thanks,
> >
> > Travis and Harley
> >
> > Internet Explorer Program Management/Development
> > Microsoft Corporation

Received on Sunday, 31 May 2009 08:18:54 UTC