Re: [XHR2] Avoiding charset dependencies on user settings

On Wed, Sep 28, 2011 at 2:54 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
> On Wed, Sep 28, 2011 at 4:16 AM, Jonas Sicking <jonas@sicking.cc> wrote:
>> So it sounds like your argument is that we should do <meta> prescan
>> because we can do it without breaking any new ground. Not because it's
>> better or was inherently safer before webkit tried it out.
>
> The outcome I am suggesting is that character encoding determination
> for text/html in XHR should be:
>  1) HTTP charset
>  2) BOM
>  3) <meta> prescan
>  4) UTF-8
>
> My rationale is:
>  * Restarting the parser sucks. Full heuristic detection and
> non-prescan <meta> require restarting.
>  * Supporting HTTP charset, BOM and <meta> prescan means supporting
> all the cases where the author is declaring the encoding in a
> conforming way.
>  * Supporting <meta> prescan even for responseText is safe to the
> extent content is not already broken in WebKit.
>  * Not doing even heuristic detection on the first 1024 bytes allows
> us to avoid one of the unpredictability and
> non-interoperability-inducing legacy flaws that encumber HTML when
> loading it into a browsing context.
>  * Using a clamped last resort encoding instead of a user setting or
> locale-dependent encoding allows us to avoid one of the
> unpredictability and non-interoperability-inducing legacy flaws that
> encumber HTML when loading it into a browsing context.
>  * Using UTF-8 as opposed to Windows-1252 or a user setting or
> locale-dependent encoding as the last resort encoding allows the same
> encoding to be used in the responseXML and responseText cases without
> breaking existing responseText usage that expects UTF-8 (UTF-8 is the
> responseText default in Gecko).
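
For concreteness, a rough Python sketch of that decision order (the
helper names are made up for illustration: Python's codec lookup stands
in for a real encoding-label table, and a regex stands in for the actual
<meta> prescan algorithm; this is not any engine's real code):

import codecs
import re

_META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=\s*['\"]?\s*([A-Za-z0-9._-]+)",
                           re.IGNORECASE)

def is_supported(label):
    # Stand-in for a real encoding-label table.
    try:
        codecs.lookup(label)
        return True
    except LookupError:
        return False

def bom_encoding(body):
    # Byte order mark sniffing on the first bytes of the response.
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if body.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if body.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return None

def meta_prescan(prefix):
    # Crude stand-in for the HTML <meta> prescan; callers pass at most
    # the first 1024 bytes.
    m = _META_CHARSET.search(prefix)
    if m and is_supported(m.group(1).decode("ascii")):
        return m.group(1).decode("ascii").lower()
    return None

def determine_encoding(http_charset, body):
    if http_charset and is_supported(http_charset):
        return http_charset                  # 1) HTTP charset
    bom = bom_encoding(body)
    if bom:
        return bom                           # 2) BOM
    declared = meta_prescan(body[:1024])
    if declared:
        return declared                      # 3) <meta> prescan
    return "utf-8"                           # 4) clamped last resort

The point being that every input to this is available up front, so
nothing ever has to be decoded twice.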

Do we have any guesses or data as to what percentage of existing pages
would parse correctly with the above suggestion? If we only have
guesses, what are those guesses based on?

My concern is that the above algorithm would leave large chunks of the
web decoded incorrectly. My perception is that a very large number of
pages don't declare a charset in any of the three locations proposed
above (steps 1-3), and yet aren't encoded in UTF-8.

The article below is over a year old at this point, but it shows that
less than 50% of the web was in UTF-8 when it was written.

http://googland.blogspot.com/2010/01/g-unicode-nearing-50-of-web.html

> What outcome do you suggest and why? It seems you aren't suggesting
> doing stuff that involves a parser restart? Are you just arguing
> against UTF-8 as the last resort?

I'm suggesting that we do the same thing for XHR loading as we do for
<iframe> loading, with the exception that we never restart the parser
(a rough sketch follows after the goals below).
The goals are:

* Parse as much of the HTML on the web as we can.
* Don't ever restart a network operation, as that significantly
complicates progress reporting and can have bad side effects, since XHR
allows arbitrary headers and HTTP methods.
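
A rough sketch of that, reusing the helpers from the sketch above; the
detector and the locale default here are placeholders I made up, not
what any engine actually does:

LOCALE_DEFAULT = "windows-1252"   # in a real browser this varies per locale

def detect_heuristically(prefix):
    # Placeholder for a real detector; this one only notices bytes that
    # happen to decode cleanly as UTF-8.
    try:
        prefix.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None

def determine_encoding_like_iframe(http_charset, body):
    # Same steps a browsing-context load would take, but the result is
    # committed before parsing starts, so nothing ever restarts.
    for candidate in (
        http_charset if http_charset and is_supported(http_charset) else None,
        bom_encoding(body),
        meta_prescan(body[:1024]),
        detect_heuristically(body[:1024]),
        LOCALE_DEFAULT,
    ):
        if candidate:
            return candidate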

/ Jonas

Received on Thursday, 29 September 2011 00:31:45 UTC