Re: [XHR2] Avoiding charset dependencies on user settings from Henri Sivonen on 2011-09-28 (public-webapps@w3.org from July to September 2011)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 28 Sep 2011 12:54:44 +0300
To: public-webapps@w3.org
Message-ID: <CAJQvAucWdTJCPWRTiUfagunmuLiQpy6Qcfbd4S0VL9vQQUhebg@mail.gmail.com>

On Wed, Sep 28, 2011 at 4:16 AM, Jonas Sicking <jonas@sicking.cc> wrote:
> So it sounds like your argument is that we should do <meta> prescan
> because we can do it without breaking any new ground. Not because it's
> better or was inherently safer before WebKit tried it out.

The outcome I am suggesting is that character encoding determination
for text/html in XHR should be:
 1) HTTP charset
 2) BOM
 3) <meta> prescan
 4) UTF-8
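
To make the order above concrete, here is a rough Python sketch of the
decision chain. It is an illustration only, not any browser's
implementation. All function names are hypothetical. The regular
expression is a simplified stand-in for the real <meta> prescan
algorithm, and the 1024-byte prescan window is an assumption taken from
the HTML parsing algorithm.

```python
import re

# Assumption: the prescan inspects only the first 1024 bytes.
PRESCAN_LIMIT = 1024

# Simplified stand-in for the real <meta> prescan: finds a charset
# declaration inside a <meta ...> tag (both the charset= attribute form
# and the http-equiv/content form).
META_CHARSET_RE = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)

# Byte order marks checked in step 2.
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16le"),
    (b"\xfe\xff", "utf-16be"),
)


def http_charset(content_type):
    """Step 1: charset parameter of the Content-Type header, or None."""
    if not content_type:
        return None
    m = re.search(r"charset\s*=\s*\"?([^\s;\"]+)", content_type, re.IGNORECASE)
    return m.group(1).lower() if m else None


def sniff_bom(body):
    """Step 2: encoding implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    return None


def meta_prescan(body):
    """Step 3: charset declared in a <meta> within the prescan window."""
    m = META_CHARSET_RE.search(body[:PRESCAN_LIMIT])
    return m.group(1).decode("ascii").lower() if m else None


def determine_encoding(content_type, body):
    """Apply the four steps in order; no heuristics, no user settings."""
    return (
        http_charset(content_type)
        or sniff_bom(body)
        or meta_prescan(body)
        or "utf-8"  # Step 4: fixed last resort, never locale-dependent
    )
```

Every step works on bytes that are available before tokenization
starts, so nothing in this chain can force a parser restart, and the
result never depends on the user's locale or preferences.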

My rationale is:
 * Restarting the parser sucks. Full heuristic detection and
non-prescan <meta> require restarting.
 * Supporting HTTP charset, BOM and <meta> prescan means supporting
all the cases where the author is declaring the encoding in a
conforming way.
 * Supporting <meta> prescan even for responseText is safe to the
extent that content is not already broken in WebKit, which has
already tried this.
 * Not doing heuristic detection, even on just the first 1024 bytes,
allows us to avoid one of the unpredictability- and
non-interoperability-inducing legacy flaws that encumber HTML when
loading it into a browsing context.
 * Using a fixed last-resort encoding instead of a user setting or
locale-dependent encoding allows us to avoid another of the
unpredictability- and non-interoperability-inducing legacy flaws that
encumber HTML when loading it into a browsing context.
 * Using UTF-8 as the last-resort encoding, as opposed to Windows-1252
or a user setting or locale-dependent encoding, allows the same
encoding to be used in the responseXML and responseText cases without
breaking existing responseText usage that expects UTF-8 (UTF-8 is the
responseText default in Gecko).

What outcome do you suggest and why? It seems you aren't suggesting
doing stuff that involves a parser restart? Are you just arguing
against UTF-8 as the last resort?

> And in any case, it's easy to figure out where the
> data was loaded from after the fact, so debugging doesn't seem any
> harder.

If that counts as "not harder", I concede this point.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Wednesday, 28 September 2011 09:55:13 UTC