RE: Auto-detect and encodings in HTML5

EBCDIC and its national language variants, including visual encoding of bidi languages, are in use and will continue to be in use as long as mainframes are in use. A large quantity of data is stored in mainframes in EBCDIC and its variants, and the easiest way of interfacing this data to an HTML UI is by using the encoding features of HTML.

 

I have no objection to banning auto-detection and to making the default HTML encoding UTF8.

 

Jony Rosenne

 

 

From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of Phillips, Addison
Sent: Wednesday, May 27, 2009 2:54 AM
To: Travis Leithead; public-html@w3.org; www-international@w3.org; Richard Ishida; Ian Hickson
Cc: Chris Wilson; Harley Rosnow
Subject: RE: Auto-detect and encodings in HTML5

 

Hello Travis,

 

The Internationalization WG is, of course, quite interested in the problem of encoding management and detection in HTML5. 

 

I have added your note to the Internationalization WG’s agenda for our upcoming teleconference.

 

Regards,

 

Addison

 

Addison Phillips

Globalization Architect -- Lab126

Chair -- W3C Internationalization WG

 

Internationalization is not a feature.

It is an architecture.

 

From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of Travis Leithead
Sent: Tuesday, May 26, 2009 4:46 PM
To: public-html@w3.org; www-international@w3.org; Richard Ishida; Ian Hickson
Cc: Chris Wilson; Harley Rosnow
Subject: Auto-detect and encodings in HTML5

 

Ian, UA vendors, and HTML/I18n mailing list folks:

 

I'd like to present the following feedback from one of our lead Trident developers on the IE team. He and I work on a number of parts of the web platform; the encoding and auto-detect subsystem is the one most relevant to this mail. I'd really like to generate some discussion from the other browser UAs on this topic.

 

The basic idea is that there are a few places where the HTML5 spec could make assertions to improve the web's international support and future ease of interoperability regarding encodings and auto-detect. We recognize the need to be as compatible as possible with currently deployed web sites, and the technique proposed to maintain compatibility is to leverage the "HTML5 doctype". I don't want to focus too much on that particular aspect of the proposal (though it's important); I'd also like us to consider the implications and scenarios.

 

The proposal is straightforward. Only in pages with the HTML5 doctype:

 

1.  Forbid the use of auto-detect heuristics for HTML encodings.

 

2.  Forbid the use of problematic encodings such as UTF7 and EBCDIC.

 

    Essentially, get rid of the classes of encodings in which JScript and tags do not correspond to simple ASCII characters in the raw byte stream.

 

3.  Handle only the encoding in the first META tag within the HEAD, and require that the HEAD and META tags appear within a well-defined, fixed byte distance into the file to take effect.

 

4.  Require the default HTML encoding to be UTF8.
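
The four rules above could be sketched roughly as follows. This is only an illustration, not code from any shipping user agent: the 1024-byte window, the function names, the explicit UTF-7 ban list, and the choice to fall back to UTF-8 when a forbidden or unknown encoding is declared are all assumptions, since the proposal itself fixes none of them.

```python
import codecs
import re

PRESCAN_BYTES = 1024      # assumed value for the "fixed byte distance" in rule 3
DEFAULT_ENCODING = "utf-8"  # rule 4
BANNED = {"utf-7"}        # assumed explicit ban list for rule 2

FIRST_META = re.compile(rb"<meta\b[^>]*>", re.IGNORECASE)
CHARSET = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def tags_are_ascii(name):
    """True if markup encodes to the same bytes it would have in ASCII."""
    try:
        return "<script>".encode(name) == b"<script>"
    except (LookupError, UnicodeError):
        return False


def choose_encoding(raw):
    """Pick an encoding without any content-sniffing heuristics (rule 1)."""
    # Rule 3: look only at the first META tag, and only inside the window.
    meta = FIRST_META.search(raw[:PRESCAN_BYTES])
    declared = CHARSET.search(meta.group(0)) if meta else None
    if declared is None:
        return DEFAULT_ENCODING
    try:
        name = codecs.lookup(declared.group(1).decode("ascii")).name
    except LookupError:
        return DEFAULT_ENCODING
    # Rule 2: refuse encodings in which tags are not plain ASCII bytes
    # (this catches the EBCDIC code pages), plus the explicit ban list.
    if name in BANNED or not tags_are_ascii(name):
        return DEFAULT_ENCODING  # assumed fallback; the proposal does not say
    return name
```

Note that the byte-level probe also turns away other encodings that are not ASCII-transparent, such as UTF-16; whether that is desirable is a separate question from the one raised here.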

 

I realize these changes depart somewhat from current practice and may seem constraining. But I was very pleased to see UTF7 already excluded and EBCDIC discouraged in the HTML5 draft, and according to the draft the META tag is supposed to be the first element after the HEAD. If we could get substantial agreement from the various user agents to tighten up the behavior covering this handling, we could greatly improve the Internet in the following regards:

 

 

A.  HTML5 would no longer be vulnerable to script injection that uses encodings such as UTF7 and EBCDIC to trick the auto-detection code into reinterpreting the entire page and running the injected script.

 

    (Harley: I’ve had to fix a number of issues related to these security vulnerabilities, but the problem is systemic in the products and the standard doesn’t help.)

 

B.  HTML5 user agents would be able to process markup more efficiently by reducing the scanning and computation required merely to determine the encoding of the file.

 

C.  Because the heuristics or default encoding sometimes use information about the user’s environment, we often see pages that display quite differently from one region to another. As much as possible, browsing from anywhere on the globe should give a consistent experience for a given page. (Basically, I want my children to one day stop seeing garbage when they browse Japanese web sites from the US.)

 

D.  We’d greatly increase the consistency of implementation of markup handling by the various user agents. These openings for UA-specific heuristics and decisions undermine the benefits of standards and standardization.
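
To make point A concrete, here is a small illustration of why these encodings are a problem for anything that inspects the raw bytes. The payload string is a made-up example, and code page 037 is just one of the many EBCDIC variants, chosen because Python ships a codec for it.

```python
# ASCII-only bytes: a filter that scans the raw stream for "<" sees nothing.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
print(b"<" in payload)  # False

# If auto-detection decides the page is UTF-7, the same bytes become markup.
as_utf7 = payload.decode("utf-7")
print(as_utf7)  # <script>alert(1)</script>

# EBCDIC has the mirror-image problem: a real tag contains no ASCII "<" byte.
# In code page 037 the "<" character is byte 0x4C, which reads as "L" in ASCII.
ebcdic = "<script>".encode("cp037")
print(b"<" in ebcdic)          # False
print(ebcdic.decode("cp037"))  # <script>
```

In both cases the bytes a sanitizer checked and the characters the parser finally sees are different things, which is the gap that items 1 and 2 of the proposal are meant to close.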

 

Thanks,

 

Travis and Harley

 

Internet Explorer Program Management/Development

Microsoft Corporation

 

Received on Wednesday, 27 May 2009 04:40:39 UTC