Re: [XHR2] Avoiding charset dependencies on user settings from Jonas Sicking on 2011-09-29 (public-webapps@w3.org from July to September 2011)

From: Jonas Sicking <jonas@sicking.cc>
Date: Thu, 29 Sep 2011 13:27:06 -0700
To: Henri Sivonen <hsivonen@iki.fi>
Cc: public-webapps@w3.org
Message-ID: <CA+c2ei9gMnJv0zeo5PiPpA+cytcjKnMdy=zQBG4_ah1qnLpirg@mail.gmail.com>
On Thu, Sep 29, 2011 at 12:03 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
> On Thu, Sep 29, 2011 at 3:30 AM, Jonas Sicking <jonas@sicking.cc> wrote:
>> Do we have any guesses or data as to what percentage of existing pages
>> would parse correctly with the above suggestion?
>
> I don't have guesses or data, because I think the question is irrelevant.
>
> When XHR is used for retrieving responseXML for legacy text/html, I'm
> not expecting legacy data that doesn't have encoding declarations to be
> UTF-8 encoded. I want to use UTF-8 for consistency with legacy
> responseText and for well-defined behavior. (In the HTML parsing
> algorithm at least, we value well-defined behavior over guessing the
> author's intent correctly.) When people add responseXML usage for
> text/html, I expect them to add encoding declarations (if they are
> missing) when they add XHR code that uses responseXML for text/html.
>
> We assume for security purposes that an origin is under the control of
> one authority--i.e. that authority can change stuff within the origin.
> I'm suggesting that when XHR is used to retrieve text/html data from
> the same origin, if the text/html data doesn't already have its
> encoding declared, the person exercising the origin's authority to add
> XHR should take care of exercising the origin's authority to modify
> the text/html resources to add encoding declarations.
>
> XHR can't be used for retrieving different-origin legacy data without
> the other origin opting in using CORS. I posit that it's less onerous
> for the other origin to declare its encoding than to add CORS support.
> Since the other origin needs to participate anyway, I think it's
> reasonable to require declaring the encoding to be part of the
> participation.

While I agree that it's generally theoretically possible for a site
administrator to change anything about the site, in reality it's often
pretty hard to do. We hear time and again how simply adding
headers to resources in a directory is a complex task, for example in
situations where a website is hosted by a third party.

Adding a charset-indicating header is probably generally easier to do
as it can be done by simply reconfiguring the server. However, I'm not
sure that it's safe to do so in all instances. Adding a
charset-indicating header requires knowing what the charset is for all
documents. If you have a large body of documents served without a
charset-indicating header today, you take advantage of the automatic
detection in browsers. If you add a charset-indicating header, that
will stop happening and so you risk breaking all documents which
aren't using that encoding.

So consider for example a website which has traditionally been GB2312
for years, but has recently started transitioning to UTF-8. If such a
website were to add a header which indicates that all documents are
encoded in GB2312, then all of a sudden all UTF-8 documents break.
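
To make the failure concrete, here is a minimal Python sketch of what a
blanket charset header does to a mixed-encoding site. The sample string
is made up for illustration, and Python's codecs stand in for a
browser's decoder, so treat this as an approximation rather than a
description of what any particular browser does.

```python
# Hypothetical sample text; any non-ASCII Chinese string would do.
text = "中文"

utf8_bytes = text.encode("utf-8")    # a document from the newer, UTF-8 part of the site
gb_bytes = text.encode("gb2312")     # a document from the older, GB2312 part of the site

# A blanket "charset=GB2312" header pushes the UTF-8 document through
# the wrong decoder, so the text no longer round-trips.
forced = utf8_bytes.decode("gb2312", errors="replace")
print(forced == text)  # False

# The opposite mistake (treating GB2312 bytes as UTF-8) is not even
# valid input for a strict UTF-8 decoder.
try:
    gb_bytes.decode("utf-8")
    reverse_ok = True
except UnicodeDecodeError:
    reverse_ok = False
print(reverse_ok)  # False
```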

To do this properly, the website would have to analyze all documents
and either keep a separate database which indicates which documents
have which encoding, or automatically rewrite the documents such that
they all have in-document <meta>s which indicate the correct charset.
The former seems technically very hard to do, the latter seems very
risky since it requires parsing HTML and rewriting HTML.
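
As a rough illustration of the "separate database" option, the sketch
below classifies each document by checking whether its bytes are valid
UTF-8 and otherwise assumes a single legacy encoding. The function
names, the paths and the two-encoding assumption are all hypothetical,
and the heuristic is not foolproof: pure-ASCII files look the same
either way, and a site with more than one legacy encoding would need
something smarter. It also says nothing about the harder part, which is
getting the server to use such a map.

```python
from typing import Dict


def guess_encoding(data: bytes, legacy: str = "gb2312") -> str:
    """Return "utf-8" if the bytes decode as strict UTF-8, else the assumed legacy encoding."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return legacy


def build_encoding_map(documents: Dict[str, bytes], legacy: str = "gb2312") -> Dict[str, str]:
    """Map each (hypothetical) document path to its guessed encoding."""
    return {path: guess_encoding(body, legacy) for path, body in documents.items()}


# Hypothetical site with one old and one new document.
docs = {
    "/old.html": "中文".encode("gb2312"),
    "/new.html": "中文".encode("utf-8"),
}
encoding_map = build_encoding_map(docs)
```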

> Finally, XHR allows the programmer using XHR to override the MIME
> type, including the charset parameter, so if the person adding new XHR
> code can't change the encoding declarations on legacy data, (s)he can
> override the UTF-8 last resort from JS (and a given repository of
> legacy data pretty often has a self-consistent encoding that the XHR
> programmer can discover ahead of time). I think requiring the person
> adding XHR code to write that line is much better than adding more
> locale and/or user setting-dependent behavior to the Web platform.

This is certainly a good point, and is likely generally the easiest
solution for someone rolling out a new AJAX version of a website
rather than requiring webserver configuration changes. However it
still doesn't solve the case where a website uses different encodings
for different documents as described above.

>>> What outcome do you suggest and why? It seems you aren't suggesting
>>> doing stuff that involves a parser restart? Are you just arguing
>>> against UTF-8 as the last resort?
>>
>> I'm suggesting that we do the same thing for XHR loading as we do for
>> <iframe> loading. With the exception of never restarting the parser.
>> The goals are:
>>
>> * Parse as much of the HTML on the web as we can.
>> * Don't ever restart a network operation as that significantly
>> complicates the progress reporting as well as can have bad side
>> effects since XHR allows arbitrary headers and HTTP methods.
>
> So you suggest scanning the first 1024 bytes heuristically and suggest
> varying the last resort encoding.
>
> Would you decode responseText using the same encoding that's used for
> responseXML? If yes, that would mean changing the way responseText
> decodes in Gecko when there's no declaration.

Yes.

One way to make this less of a risk for backwards compatibility is to
only enable HTML parsing when .responseType is set to "document".
Skipping the parse in all other cases uses less memory and performs
better anyway.
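
For reference, the no-restart decision order discussed above (HTTP
charset first, then a declaration found in the first 1024 bytes, then a
last resort) could be sketched in Python roughly as follows. The
function name, the parameters and the regular expression are my own
simplifications for illustration; the real HTML prescan handles many
more cases, and which last resort to use is exactly the open question
in this thread.

```python
import re
from typing import Optional

# Simplified pattern: matches both <meta charset=...> and the
# http-equiv form. A real prescan is considerably more careful.
_META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def pick_encoding(
    body: bytes,
    header_charset: Optional[str] = None,
    last_resort: str = "utf-8",
) -> str:
    """Choose an encoding once, up front, so nothing is ever re-fetched or re-parsed."""
    if header_charset:
        return header_charset.lower()
    # Only the first 1024 bytes are inspected, so a late <meta> is ignored.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # Hypothetical knob: fixed UTF-8, or a locale-dependent default.
    return last_resort
```

Calling pick_encoding(body) gives the fixed UTF-8 last resort, while
passing a different last_resort value models a locale-dependent
default like the one used for <iframe> loading.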

Ultimately I'll let others make the decision here. But I do want to
make sure that people keep in mind that we might be leaving behind a
large body of the HTML that exists on the web today if we only allow
explicitly charset-marked documents and UTF-8 documents to be decoded
using XHR.

While we can say that people should follow good practices and do one
of these things, I think practical concerns will make that not an
option for a great many websites.

I'm particularly keen to hear how this will affect locales which do
not use ASCII by default. Most of the content I personally consume is
written in English or Swedish, and most of it is generally legible
even if decoded using the wrong encoding. I'm under the impression
that that is not the case for, say, Chinese or Hindi documents. I
think it would be sad if we went with any particular solution here
without consulting people from those locales.

/ Jonas
Received on Thursday, 29 September 2011 20:28:12 UTC