Auto-detect and encodings in HTML5

Ian, UA vendors, and HTML/I18n mailing list folks:



I'd like to present the following feedback from one of our lead
Trident developers on the IE team. He and I work on a number of
parts of the web platform, the encoding and auto-detect subsystem
being the one most relevant to this mail. I'd really like to
generate some discussion from the other browser UAs on this
topic.



The basic idea is that we feel there are a few places where the
HTML5 spec could make assertions to improve the web's
international support and future ease of interoperability
regarding encodings and auto-detect. We recognize the need to be
as compatible as possible with currently deployed web sites, and
the proposed technique for maintaining compatibility is to key the
new behavior off the "HTML5 doctype". I don't want to focus too
much on that particular aspect of the proposal (though it's
important); I'd also like us to consider the implications and
scenarios.



The proposal is straightforward. Only in pages with the HTML5 doctype:



1.  Forbid the use of auto-detect heuristics for HTML encodings.



2.  Forbid the use of problematic encodings such as UTF7 and EBCDIC.

    Essentially, get rid of the classes of encodings in which
    JScript and tags do not correspond to simple ASCII characters
    in the raw byte stream.



3.  Handle only the encoding declared in the first META tag within
    the HEAD, and require that the HEAD and META tags appear within
    a well-defined, fixed byte distance into the file to take effect.



4.  Require the default HTML encoding to be UTF8.
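
To make the four rules above concrete, here is a rough sketch in
Python of how a UA might pick the encoding for an HTML5-doctype page.
It is only an illustration, not a proposed algorithm: the 1024-byte
limit, the list of forbidden labels, the names (PRESCAN_LIMIT,
FORBIDDEN, sniff_encoding), and the choice to fall back to UTF8 when a
forbidden encoding is declared are all assumptions made for this
example, and the regular expressions are far cruder than a real
prescan would be.

```python
import re

# All values below are assumptions for illustration; the proposal does not
# fix a byte distance or an exact list of forbidden encodings.
PRESCAN_LIMIT = 1024                      # hypothetical "fixed byte distance"
FORBIDDEN = {"utf-7", "utf7", "ebcdic", "cp037", "ibm037"}  # hypothetical list
DEFAULT_ENCODING = "utf-8"                # rule 4

META_TAG = re.compile(rb"<meta\b[^>]*>", re.IGNORECASE)
CHARSET = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE)
HEAD_END = re.compile(rb"</head\s*>|<body\b", re.IGNORECASE)


def sniff_encoding(raw: bytes) -> str:
    """Pick an encoding label without any content-based heuristics (rule 1)."""
    # Rule 3: only the first PRESCAN_LIMIT bytes are ever examined. A META
    # tag that is not complete inside this window is ignored.
    window = raw[:PRESCAN_LIMIT]

    # Rule 3: ignore anything after the end of the HEAD.
    head_end = HEAD_END.search(window)
    if head_end:
        window = window[:head_end.start()]

    # Rule 3: only the first META tag counts, even if it declares no charset.
    meta = META_TAG.search(window)
    if meta is None:
        return DEFAULT_ENCODING
    declared = CHARSET.search(meta.group(0))
    if declared is None:
        return DEFAULT_ENCODING

    label = declared.group(1).decode("ascii").lower()
    # Rule 2: a forbidden label is never honored. Falling back to the default
    # is an assumption here; a UA could equally treat it as an error.
    if label in FORBIDDEN:
        return DEFAULT_ENCODING
    return label


if __name__ == "__main__":
    print(sniff_encoding(b'<html><head><meta charset="shift_jis"></head>'))
    print(sniff_encoding(b"<html><head><title>no declaration</title></head>"))
```

Note that nothing in the sketch looks at the user's locale or at byte
patterns in the body, so the same bytes yield the same answer
everywhere.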



I realize these changes depart somewhat from current practice and
may seem constraining. But I was very pleased to see UTF7 already
excluded and EBCDIC discouraged in the HTML5 draft. According to
the draft, the META tag is also supposed to come first after the
HEAD. If we could get substantial agreement from the various user
agents to tighten up the behavior covering this handling, we could
greatly improve the Internet in the following regards:





A.  HTML5 would no longer be vulnerable to script injection through
    encodings such as UTF7 and EBCDIC, in which injected content
    tricks the auto-detection code into reinterpreting the entire
    page and running the injected script.



    (Harley: I've had to fix a number of issues related to these
    security vulnerabilities, but the problem is systemic in the
    products and the standard doesn't help.)



B.  HTML5 would be able to process markup more efficiently by
    reducing the scanning and computation required to merely
    determine the encoding of the file.



C.  Because the heuristics or default encoding sometimes use
    information about the user's environment, we often see pages
    that display quite differently from one region to another.
    As much as possible, browsing from across the globe should
    give a consistent experience for a given page.  (Basically, I
    want my children to one day stop seeing garbage when they
    browse Japanese web sites from the US.)



D.  We'd greatly increase the consistency of implementation of
    markup handling by the various user agents. These openings
    for UA-specific heuristics and decisions undermine the
    benefits of standards and standardization.
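
For anyone who has not seen the problem in item A first-hand, the
following small Python sketch shows why these encodings are dangerous.
The payload is a made-up example, and html.escape merely stands in for
whatever server-side filter a site might use; the point is only that
the raw bytes contain no ASCII "<" or ">" for a filter to catch, yet
they become a script tag once a UA decides the page is UTF7.

```python
import html

# Hypothetical attacker-supplied bytes. Every byte is plain ASCII and
# none of them is "<" or ">".
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# A filter that reads the bytes as ASCII finds nothing to escape.
as_ascii = payload.decode("ascii")
escaped = html.escape(as_ascii)

# If auto-detection later reinterprets the page as UTF7, the same bytes
# decode to markup.
as_utf7 = payload.decode("utf-7")

# EBCDIC has the mirror-image problem: markup encoded in it (code page
# 037 is used here) contains no ASCII "<" byte at all.
ebcdic_bytes = "<script>".encode("cp037")

if __name__ == "__main__":
    print("filter changed anything:", escaped != as_ascii)
    print("decoded as UTF7:", as_utf7)
    print("ASCII '<' present in EBCDIC bytes:", b"<" in ebcdic_bytes)
```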



Thanks,



Travis and Harley



Internet Explorer Program Management/Development

Microsoft Corporation

Received on Tuesday, 26 May 2009 23:46:35 UTC