
Re: Auto-detect and encodings in HTML5

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 27 May 2009 10:51:15 +0300
Cc: HTML WG <public-html@w3.org>, www-international@w3.org, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Chris Wilson <Chris.Wilson@microsoft.com>, Harley Rosnow <Harley.Rosnow@microsoft.com>, Yair Shmuel <yshmuel@microsoft.com>, Addison Phillips <addison@amazon.com>
Message-Id: <2F772D67-0E2B-4E1C-8E06-E9850825BA9C@iki.fi>
To: Travis Leithead <Travis.Leithead@microsoft.com>, Jonathan Rosenne <rosennej@qsm.co.il>
On May 27, 2009, at 02:45, Travis Leithead wrote:

> The proposal is straight-forward. Only in pages with the HTML5  
> doctype:

Scoping new behavior to the HTML5 doctype would be contrary to the goal of specifying HTML processing in such a way that the same processing rules work for both legacy content and new HTML5 content.

> 1.  Forbid the use of auto-detect heuristics for HTML encodings.

IIRC, Firefox ships with the heuristic detector set to off. I don't  
have enough first-hand experience of browsing content in the affected  
languages (mainly CJK and Cyrillic) to be competent to say what the  
user experience impact of removing the detector altogether would be.

With the HTML5 parser, though, I've changed things so that the  
detector only runs for the first 512 bytes when enabled (the same 512  
bytes as the <meta> prescan). The HTML5 parser-enabled Gecko builds  
haven't had enough testing globally to tell yet whether this is  
enough. My cursory testing of CJK sites suggests 512 bytes is  
sufficient. (Previously in Gecko, the heuristic detector continued to inspect the data much further into the stream, possibly triggering a reparse later on.)
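To make the behavior concrete, here is a minimal sketch of a prescan limited to the first 512 bytes. The helper name and the regular expression are my own simplifications, not Gecko's actual implementation; the real prescan is a byte-level state machine, not a regex.

```python
import re

PRESCAN_LIMIT = 512  # only these leading bytes are ever inspected

def prescan_for_charset(head):
    """Simplified sketch: look for a charset declaration in the first
    512 bytes; return it lowercased, or None if none is found there."""
    window = head[:PRESCAN_LIMIT]
    # Matches both <meta charset="..."> and the older
    # <meta http-equiv="Content-Type" content="text/html; charset=...">
    m = re.search(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', window)
    return m.group(1).decode('ascii').lower() if m else None
```

A declaration that begins past byte 512 is simply not seen by this pass, which is why the fixed window keeps the cost of encoding determination bounded.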

I don't want to make the use of the detector conditional on doctype,  
because the doctype hasn't been parsed yet when the detector runs.

> 2.  Forbid the use of problematic encodings such as UTF7 and EBCDIC.
>     Essentially, get rid of the classes of encodings in which
>     Jscript and tags do not correspond to simple ASCII characters
>     in the raw byte stream.

I support this change (for all of text/html).

My understanding is that Firefox and Opera have never had EBCDIC  
decoders, so getting away with not having them thus far suggests that  
IE and Safari could get rid of those decoders without ill effects (at  
least outside intranets).

The security issues with UTF-7 are well-documented, so it seems like a  
good idea to ban it. (HTML email declared as UTF-7 might be an  
exception. Are there popular MUAs in the real world that send HTML  
email as UTF-7?)
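For readers unfamiliar with the attack: under UTF-7, bytes that look like inert ASCII punctuation decode to live markup, so a detector that can be tricked into reinterpreting a page as UTF-7 turns "plain text" into script. A two-line demonstration using Python's built-in UTF-7 codec:

```python
# Bytes that pass most ASCII-only filters...
payload = b'+ADw-script+AD4-alert(1)+ADw-/script+AD4-'
# ...decode to an executable script element under UTF-7.
print(payload.decode('utf-7'))  # <script>alert(1)</script>
```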

> 3.  Only handling the encoding in the first META tag within the
>     HEAD and

I think requiring the meta to be in HEAD for the purposes of the encoding sniffing would complicate things, so I don't support such a consumer implementation requirement. The spec already makes the containment of the encoding meta in head an authoring conformance requirement.

> requiring that the HEAD and META tags to appear within
>     a well-defined, fixed byte distance into the file to take effect.

I support making the number of bytes that the prescan applies to a  
fixed number. I think the number should not be smaller than 512 bytes  
and not be larger than 1024 bytes.

Could someone working on WebKit please comment on the experiences on  
choosing the number? (I have a vague recollection that WebKit  
increased the number from 512 to 1024 for some reason.)

However, due to existing content, I don't think we can remove the tree  
builder detecting later encoding <meta>s and causing renavigation to  
the document. The renavigation is unfortunate, but detecting the  
situation in the tree builder seems to have a negligible cost on the  
load times of pages that don't cause the renavigation.
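A minimal sketch of that fallback path (the function and parameter names are hypothetical, not from any browser's code): the common case is a cheap comparison, and only a genuinely conflicting late declaration pays the renavigation cost.

```python
def on_late_meta(current_encoding, declared_encoding, renavigate):
    """Called when the tree builder sees a charset <meta> past the
    prescan window. Renavigates only if it contradicts the encoding
    already in use; returns whether a renavigation was triggered."""
    if declared_encoding and declared_encoding.lower() != current_encoding.lower():
        renavigate(declared_encoding)  # the costly, but rare, path
        return True
    return False  # common path: declaration agrees, nothing to do
```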

> 4.  Require the default HTML encoding to be UTF8.

This isn't feasible for text/html in general. Given existing content,  
you need to either allow a locale-dependent default or default to  
windows-1252 if you want a single global default.

It doesn't make sense to change this for HTML5 doctype documents only, because the HTML5 doctype is an author opt-in mechanism and authors already have three widely supported mechanisms for opting into UTF-8: charset=utf-8 in the HTTP header, in <meta>, and the BOM.
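For reference, the three opt-in mechanisms take roughly these forms (shown as comments), and the BOM check in particular is a trivial byte comparison that browsers can run before looking at any in-document declaration:

```python
import codecs

#   HTTP:  Content-Type: text/html; charset=utf-8
#   HTML:  <meta charset="utf-8">
#   BOM:   the file starts with the bytes EF BB BF

def has_utf8_bom(data):
    """Sketch of the BOM check: true if the stream opens with the
    UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)  # b'\xef\xbb\xbf'

print(has_utf8_bom(b'\xef\xbb\xbf<!DOCTYPE html>'))  # True
```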

> But, if we could get substantial agreement from the various user
> agents to tighten up the behavior covering this handling, we can
> greatly improve the Internet in the following regards:

Would Microsoft be willing to tighten the IE 5.5, IE7, IE8 Almost  
Standards and IE8 Standards modes likewise in IE9? What about  
tightening them in a point release of IE8?

> A.  HTML5 would no longer be vulnerable to script injection from
>     encodings such as UTF7 and EBCDIC which then tricks the auto-
>     detection code to reinterpret the entire page and run the
>     injected script.

It makes sense to eliminate this attack vector, but it doesn't make  
sense to eliminate it for documents with the HTML5 doctype only,  
because doing so would still leave browsers vulnerable when the  
attacker targets a system that emits legacy doctypes.

> B.  HTML5 would be able to process markup more efficiently by
>     reducing the scanning and computation required to merely
>     determine the encoding of the file.

Authors can already opt in to more efficient computation by declaring on the HTTP layer that they use UTF-8.

Adding different rules for HTML5 would increase code complexity.

> C.  Since sometimes the heuristics or default encoding uses
>     information about the user's environment, we often see pages
>     that display quite differently from one region to another.
>     As much as possible, browsing from across the globe should
>     give a consistent experience for a given page.  (Basically, I
>     want my children to one day stop seeing garbage when they
>     browse Japanese web sites from the US.)

This is indeed a problem. However, I don't see a way for browsers to force CJK and Cyrillic sites into making the experience better for out-of-locale readers without losing market share within the locale when doing so.

On May 27, 2009, at 07:39, Jonathan Rosenne wrote:

> EBCDIC and its national language variants, including visual encoding  
> of bidi languages, are in use and will continue to be in use as long  
> as mainframes are in use. A large quantity of data is stored in  
> mainframes in EBCDIC and its variants, and the easiest way of  
> interfacing this data to an HTML UI is by using the encoding  
> features of HTML.

It doesn't follow that mainframes should use EBCDIC variants for interchange with other systems. I think the burden to perform conversion should be on mainframes, and other systems shouldn't take on the security risk of supporting encodings that aren't rough ASCII supersets.

Henri Sivonen
Received on Wednesday, 27 May 2009 07:52:13 UTC