Re: [XHR2] Avoiding charset dependencies on user settings

On Thu, Sep 29, 2011 at 3:30 AM, Jonas Sicking <jonas@sicking.cc> wrote:
> Do we have any guesses or data as to what percentage of existing pages
> would parse correctly with the above suggestion?

I don't have guesses or data, because I think the question is irrelevant.

When XHR is used for retrieving responseXML for legacy text/html, I'm
not expecting legacy data that doesn't have encoding declarations to be
UTF-8 encoded. I want to use UTF-8 for consistency with legacy
responseText and for well-defined behavior. (In the HTML parsing
algorithm at least, we value well-defined behavior over guessing the
author's intent correctly.) When people add responseXML usage for
text/html, I expect them to add encoding declarations (if they are
missing) when they add XHR code that uses responseXML for text/html.

We assume for security purposes that an origin is under the control of
one authority--i.e. that authority can change stuff within the origin.
I'm suggesting that when XHR is used to retrieve text/html data from
the same origin, if the text/html data doesn't already have its
encoding declared, the person exercising the origin's authority to add
XHR should take care of exercising the origin's authority to modify
the text/html resources to add encoding declarations.

XHR can't be used for retrieving different-origin legacy data without
the other origin opting in using CORS. I posit that it's less onerous
for the other origin to declare its encoding than to add CORS support.
Since the other origin needs to participate anyway, I think it's
reasonable to require declaring the encoding to be part of the
participation.

Finally, XHR allows the programmer using XHR to override the MIME
type, including the charset parameter, so if the person adding new XHR
code can't change the encoding declarations on legacy data, (s)he can
override the UTF-8 last resort from JS (and a given repository of
legacy data pretty often has a self-consistent encoding that the XHR
programmer can discover ahead of time). I think requiring the person
adding XHR code to write that line is much better than adding more
locale and/or user setting-dependent behavior to the Web platform.
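
To illustrate "that line", here is a minimal sketch of overriding the
charset from script. The function name, the XhrLike interface, the URL
and the windows-1252 value in the usage note are assumptions made for
illustration; only overrideMimeType(), responseType and the usual
open()/send() calls are real XHR API.

```typescript
// Minimal subset of XMLHttpRequest used by this sketch (hypothetical
// interface, so the function can also be exercised with a stand-in).
interface XhrLike {
  responseType: string;
  open(method: string, url: string): void;
  overrideMimeType(mime: string): void;
  send(): void;
}

// Hypothetical helper: fetch legacy text/html whose encoding the XHR
// programmer has discovered ahead of time, instead of relying on the
// last resort encoding.
function requestLegacyHtml(xhr: XhrLike, url: string, knownCharset: string): void {
  xhr.open("GET", url);
  // The override has to be set before send() is called.
  xhr.overrideMimeType("text/html; charset=" + knownCharset);
  // Ask for a parsed document so that responseXML is populated.
  xhr.responseType = "document";
  xhr.send();
}
```

In a browser this could be called as
requestLegacyHtml(new XMLHttpRequest(), "/legacy/page.html", "windows-1252"),
assuming the legacy repository is known to be windows-1252 throughout.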

>> What outcome do you suggest and why? It seems you aren't suggesting
>> doing stuff that involves a parser restart? Are you just arguing
>> against UTF-8 as the last resort?
>
> I'm suggesting that we do the same thing for XHR loading as we do for
> <iframe> loading. With exception of not ever restarting the parser.
> The goals are:
>
> * Parse as much of the HTML on the web as we can.
> * Don't ever restart a network operation as that significantly
> complicates the progress reporting as well as can have bad side
> effects since XHR allows arbitrary headers and HTTP methods.

So you suggest scanning the first 1024 bytes heuristically and suggest
varying the last resort encoding.
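
As a rough sketch of what such a no-restart scan could look like: the
code below looks for a charset declaration in the first 1024 bytes and
otherwise falls back to a last resort encoding. It is a deliberately
simplified regular-expression scan, not the prescan from the HTML
parsing algorithm, and the function name and lastResort parameter are
hypothetical. The two positions in this thread differ only in what is
passed as lastResort: always "UTF-8", or a locale/user-setting
dependent value.

```typescript
// Hypothetical helper: choose an encoding from the first 1024 bytes
// without ever restarting the parser or the network operation.
function sniffCharset(bytes: Uint8Array, lastResort: string = "UTF-8"): string {
  const limit = Math.min(bytes.length, 1024);
  // Map each byte to one character so ASCII markup can be matched
  // regardless of the document's real encoding.
  let head = "";
  for (let i = 0; i < limit; i++) {
    head += String.fromCharCode(bytes[i]);
  }
  // Simplified: accepts charset=... anywhere inside a <meta ...> tag.
  const match = /<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_\-:.]+)/i.exec(head);
  return match ? match[1] : lastResort;
}
```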

Would you decode responseText using the same encoding that's used for
responseXML? If yes, that would mean changing the way responseText
decodes in Gecko when there's no declaration.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Thursday, 29 September 2011 07:03:42 UTC