Re: [XHR2] Avoiding charset dependencies on user settings from Jonas Sicking on 2011-09-22 (public-webapps@w3.org from July to September 2011)

From: Jonas Sicking <jonas@sicking.cc>
Date: Thu, 22 Sep 2011 11:54:30 -0700
To: Henri Sivonen <hsivonen@iki.fi>
Cc: public-webapps@w3.org
Message-ID: <CA+c2ei92xuen8XttCsRr+npdZX5wgpgDWC7psKUy-BfM=BgP5A@mail.gmail.com>

On Thu, Sep 22, 2011 at 6:33 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
> http://dev.w3.org/2006/webapi/XMLHttpRequest-2/#document-response-entity-body
> says:
> "If final MIME type is text/html let document be Document object that
> represents the response entity body parsed following the rules set
> forth in the HTML specification for an HTML parser with scripting
> disabled. [HTML]"
>
> Since there's presumably no legacy content using XHR to read
> responseXML for text/html (and expecting HTML parsing) and since (in
> Gecko at least) responseText for non-XML tries HTTP charset and falls
> back on UTF-8, it seems it doesn't make sense to implement full-blown
> legacy charset craziness for text/html in XHR.
>
> Specifically, it seems that it makes sense to skip heuristic detection
> and to use UTF-8 (as opposed to Windows-1252 or a locale-dependent
> value) as the fallback encoding if there's neither <meta> nor HTTP
> charset, since UTF-8 is the pre-existing fallback for responseText and
> responseText is already used with text/html.
>
> As it stands, the XHR2 spec defers to a part of HTML that has
> legacy-oriented optional features. It seems that it makes sense to
> clamp down those options for XHR.
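
For reference, the fallback order proposed above (HTTP charset, then
<meta>, then UTF-8, with no heuristic detection) could be sketched
roughly as below. This is only an illustration under simplifying
assumptions: pick_encoding and the _META_CHARSET pattern are
hypothetical names, the <meta> scan is a crude regular expression
rather than the HTML specification's prescan algorithm, and BOM
handling is ignored entirely.

```python
import re
from email.message import Message

# Hypothetical, heavily simplified <meta> scan. The real HTML prescan
# algorithm handles attributes, comments, and quoting far more carefully.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_:.-]+)""",
    re.IGNORECASE,
)


def pick_encoding(content_type, body):
    """Return an encoding label for a text/html response body.

    Order (an assumption mirroring the proposal quoted above):
    1. charset parameter of the HTTP Content-Type header,
    2. a <meta> charset declaration near the start of the body,
    3. UTF-8, with no heuristic or locale-dependent guessing.
    """
    if content_type:
        # Reuse the stdlib header parser to extract the charset parameter.
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset.lower()

    # Only look at the first 1024 bytes, as a stand-in for a real prescan.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()

    # Neither source declared a charset: fall back to UTF-8.
    return "utf-8"
```

The point of the sketch is the last line: an undeclared page gets the
same result for every user, whereas a locale-dependent fallback would
make it depend on browser settings.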

I agree that there are no legacy requirements on XHR here. However, I
don't think that is the only thing we should look at. We should also
look at what makes the feature most useful. An extreme
counter-example would be to let XHR refuse to parse any
HTML page that didn't pass a validator. While this wouldn't break any
existing content, it would make HTML-in-XHR significantly less useful.

It makes sense to me that XHR can load any HTML resource that you
could load through navigation.

The one argument I could see for diverging from the normal HTML
loading algorithm is if doing so breaks few enough pages that it doesn't
severely limit the usefulness of HTML-in-XHR (in any locale), while
still adding enough pressure on sites to start using explicit charsets
that we accomplish real change.

Unfortunately I don't know how to measure those things though.

/ Jonas

Received on Thursday, 22 September 2011 18:55:31 UTC