Re: Auto-detect and encodings in HTML5

On Wed, 27 May 2009 01:45:53 +0200, Travis Leithead <Travis.Leithead@microsoft.com> wrote:
> A.  HTML5 would no longer be vulnerable to script injection from
>     encodings such as UTF7 and EBCDIC which then tricks the auto-
>     detection code to reinterpret the entire page and run the
>     injected script.

Opera 10 does not support UTF-7, UTF-32, and EBCDIC for Web pages, regardless of rendering mode. So far we haven't run into issues. (I'm not sure EBCDIC was ever supported and UTF-32 support might have been removed earlier on.)


> B.  HTML5 would be able to process markup more efficiently by
>     reducing the scanning and computation required to merely
>     determine the encoding of the file.

As Henri indicates this might be possible for all pages.


> C.  Since sometimes the heuristics or default encoding uses
>     information about the user's environment, we often see pages
>     that display quite differently from one region to another.
>     As much as possible, browsing from across the globe should
>     give a consistent experience for a given page.  (Basically, I
>     want my children to one day stop seeing garbage when they
>     browse Japanese web sites from the US.)

This is something I'd like to see solved as well, but I'd really like it solved in a way that also works for the pages already deployed.


> D.  We'd greatly increase the consistency of implementation of
>     markup handling by the various user agents. These openings
>     for UA-specific heuristics and decisions, undermines the
>     benefits of standards and standardization.

Yeah, ideally we document the exact algorithms used and have a fixed set of encodings user agents must support and also forbid any other encodings. Define exactly how a byte stream labeled with an encoding maps to Unicode, etc. Unfortunately I haven't found much time to look into this more.


-- 
Anne van Kesteren
http://annevankesteren.nl/

Received on Wednesday, 27 May 2009 09:33:33 UTC