Re: Auto-detect and encodings in HTML5 from Anne van Kesteren on 2009-05-27 (public-html@w3.org from May 2009)

From: Anne van Kesteren <annevk@opera.com>
Date: Wed, 27 May 2009 11:32:48 +0200
To: "Travis Leithead" <Travis.Leithead@microsoft.com>, "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, "Richard Ishida" <ishida@w3.org>, "Ian Hickson" <ian@hixie.ch>
Cc: "Chris Wilson" <Chris.Wilson@microsoft.com>, "Harley Rosnow" <Harley.Rosnow@microsoft.com>
Message-ID: <op.uuk0syzn64w2qv@annevk-t60>

On Wed, 27 May 2009 01:45:53 +0200, Travis Leithead <Travis.Leithead@microsoft.com> wrote:
> A.  HTML5 would no longer be vulnerable to script injection from
>     encodings such as UTF7 and EBCDIC which then tricks the auto-
>     detection code to reinterpret the entire page and run the
>     injected script.

Opera 10 does not support UTF-7, UTF-32, or EBCDIC for Web pages, regardless of rendering mode. So far we haven't run into issues. (I'm not sure EBCDIC was ever supported, and UTF-32 support might have been removed earlier on.)
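
For readers unfamiliar with the attack described in point A, here is a minimal sketch in Python of why UTF-7 is a problem. The payload string and the `decode_as` helper are illustrative assumptions, not code from any browser. The same bytes contain no literal `<` or `>` when read as ASCII, but become a script element once a decoder treats them as UTF-7.

```python
# Bytes that contain no literal '<' or '>' and look harmless as ASCII.
# (Illustrative payload; "+ADw-" and "+AD4-" are the UTF-7 forms of '<' and '>'.)
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"


def decode_as(data: bytes, encoding: str) -> str:
    """Hypothetical helper: decode a byte stream with the given encoding."""
    return data.decode(encoding)


# What a filter that assumes ASCII sees: plain text with no markup characters.
as_ascii = decode_as(payload, "ascii")

# What results if auto-detection decides the page is UTF-7 instead.
as_utf7 = decode_as(payload, "utf-7")

print(as_ascii)
print(as_utf7)
```

A user agent that never decodes Web content as UTF-7, as described above for Opera 10, never produces the second interpretation.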

> B.  HTML5 would be able to process markup more efficiently by
>     reducing the scanning and computation required to merely
>     determine the encoding of the file.

As Henri indicates, this might be possible for all pages.

> C.  Since sometimes the heuristics or default encoding uses
>     information about the user's environment, we often see pages
>     that display quite differently from one region to another.
>     As much as possible, browsing from across the globe should
>     give a consistent experience for a given page.  (Basically, I
>     want my children to one day stop seeing garbage when they
>     browse Japanese web sites from the US.)

This is something I'd like to see solved as well, but I'd really like it solved in a way that also works for the pages already deployed.

> D.  We'd greatly increase the consistency of implementation of
>     markup handling by the various user agents. These openings
>     for UA-specific heuristics and decisions, undermines the
>     benefits of standards and standardization.

Yeah, ideally we would document the exact algorithms used, have a fixed set of encodings that user agents must support, and forbid any other encodings. We would also define exactly how a byte stream labeled with an encoding maps to Unicode, etc. Unfortunately, I haven't found much time to look into this more.
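
As a rough illustration of the "fixed set of encodings" idea, the following Python sketch resolves an encoding label against a closed table and refuses anything else. The table contents, the names `SUPPORTED`, `resolve_label`, `decode_labeled`, and `UnsupportedEncoding`, and the choice to decode `iso-8859-1` as windows-1252 are all assumptions made for the example; they are not a proposed list and not any user agent's actual behavior.

```python
# Hypothetical closed set: normalized label -> Python codec name.
# The entries are examples only, not a proposed list of required encodings.
SUPPORTED = {
    "utf-8": "utf-8",
    "utf8": "utf-8",
    "windows-1252": "cp1252",
    "iso-8859-1": "cp1252",  # assumption: label treated as windows-1252
    "shift_jis": "shift_jis",
    "euc-jp": "euc_jp",
}


class UnsupportedEncoding(ValueError):
    """Raised when a label is outside the fixed set."""


def resolve_label(label: str) -> str:
    """Map a declared label to a codec, rejecting anything not in the table."""
    key = label.strip().lower()
    if key not in SUPPORTED:
        raise UnsupportedEncoding(label)
    return SUPPORTED[key]


def decode_labeled(data: bytes, label: str) -> str:
    """Decode a labeled byte stream; malformed bytes become U+FFFD."""
    return data.decode(resolve_label(label), errors="replace")
```

With this table, `decode_labeled(b"caf\xe9", "ISO-8859-1")` yields `café`, while a label such as `utf-7` raises `UnsupportedEncoding` instead of being decoded.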

-- 
Anne van Kesteren
http://annevankesteren.nl/

Received on Wednesday, 27 May 2009 09:33:30 UTC