- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Thu, 29 Sep 2011 10:03:02 +0300
- To: public-webapps@w3.org
On Thu, Sep 29, 2011 at 3:30 AM, Jonas Sicking <jonas@sicking.cc> wrote:
> Do we have any guesses or data as to what percentage of existing pages
> would parse correctly with the above suggestion?

I don't have guesses or data, because I think the question is irrelevant. When XHR is used for retrieving responseXML for legacy text/html, I'm not expecting legacy data that doesn't have encoding declarations to be UTF-8 encoded. I want to use UTF-8 for consistency with legacy responseText and for well-defined behavior. (In the HTML parsing algorithm at least, we value well-defined behavior over guessing the author's intent correctly.)

When people add XHR code that uses responseXML for text/html, I expect them to add encoding declarations to the legacy data where they are missing.

We assume for security purposes that an origin is under the control of one authority--i.e. that authority can change stuff within the origin. I'm suggesting that when XHR is used to retrieve text/html data from the same origin, if the text/html data doesn't already have its encoding declared, the person exercising the origin's authority to add XHR code should also exercise the origin's authority to modify the text/html resources and add encoding declarations.

XHR can't be used for retrieving different-origin legacy data without the other origin opting in using CORS. I posit that it's less onerous for the other origin to declare its encoding than to add CORS support. Since the other origin needs to participate anyway, I think it's reasonable to require declaring the encoding as part of that participation.
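For legacy HTML, the declaration in question is the usual meta declaration near the top of the resource; a minimal sketch (the file contents and the windows-1252 charset are made-up placeholders for whatever encoding the legacy repository actually uses):

```html
<!DOCTYPE html>
<html>
  <head>
    <!-- The declaration must fall within the first 1024 bytes of the
         resource so the byte prescan can find it. -->
    <meta charset="windows-1252">
    <title>Legacy page</title>
  </head>
  <body>…</body>
</html>
```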
Finally, XHR allows the programmer using XHR to override the MIME type, including the charset parameter, so if the person adding new XHR code can't change the encoding declarations on legacy data, (s)he can override the UTF-8 last resort from JS (and a given repository of legacy data pretty often has a self-consistent encoding that the XHR programmer can discover ahead of time). I think requiring the person adding XHR code to write that line is much better than adding more locale- and/or user-setting-dependent behavior to the Web platform.

>> What outcome do you suggest and why? It seems you aren't suggesting
>> doing stuff that involves a parser restart? Are you just arguing
>> against UTF-8 as the last resort?
>
> I'm suggesting that we do the same thing for XHR loading as we do for
> <iframe> loading. With exception of not ever restarting the parser.
> The goals are:
>
> * Parse as much of the HTML on the web as we can.
> * Don't ever restart a network operation as that significantly
> complicates the progress reporting as well as can have bad side
> effects since XHR allows arbitrary headers and HTTP methods.

So you suggest scanning the first 1024 bytes heuristically and varying the last-resort encoding. Would you decode responseText using the same encoding that's used for responseXML? If yes, that would mean changing the way responseText decodes in Gecko when there's no encoding declaration.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
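The "write that line" override mentioned above can be sketched as follows. This is a minimal sketch, not a normative pattern: the URL, the windows-1252 charset, and the mimeWithCharset helper are made-up placeholders for a same-origin repository whose encoding the XHR programmer has discovered ahead of time.

```javascript
// Hypothetical helper: append a charset parameter to a MIME type.
function mimeWithCharset(mime, charset) {
  return mime + "; charset=" + charset;
}

// Browser-only usage, guarded so the sketch also loads outside a browser.
if (typeof XMLHttpRequest !== "undefined") {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/legacy/page.html"); // placeholder same-origin URL
  // Override the server's Content-Type so the parser uses the
  // repository's known encoding instead of the UTF-8 last resort.
  xhr.overrideMimeType(mimeWithCharset("text/html", "windows-1252"));
  xhr.responseType = "document"; // ask for a parsed document in responseXML
  xhr.onload = function () {
    var doc = xhr.responseXML; // parsed with the overridden charset
  };
  xhr.send();
}
```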
Received on Thursday, 29 September 2011 07:03:42 UTC