Re: [XHR2] Avoiding charset dependencies on user settings

On Mon, Sep 26, 2011 at 7:50 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
> On Mon, Sep 26, 2011 at 12:46 PM, Jonas Sicking <jonas@sicking.cc> wrote:
>> On Fri, Sep 23, 2011 at 1:26 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
>>> On Thu, Sep 22, 2011 at 9:54 PM, Jonas Sicking <jonas@sicking.cc> wrote:
>>>> I agree that there are no legacy requirements on XHR here, however I
>>>> don't think that that is the only thing that we should look at. We
>>>> should also look at what makes the feature the most useful. An extreme
>>>> counter-example would be that we could let XHR refuse to parse any
>>>> HTML page that didn't pass a validator. While this wouldn't break any
>>>> existing content, it would make HTML-in-XHR significantly less useful.
>>>
>>> Applying all the legacy text/html craziness to XHR could break current
>>> use of XHR to retrieve responseText of text/html resources (assuming
>>> that we want responseText for text/html to work like responseText for XML
>>> in the sense that the same character encoding is used for responseText
>>> and responseXML).
>>
>> This doesn't seem to only be a problem when using "crazy" parts of
>> text/html charset detection. Simply looking for <meta charset> in the
>> first 1024 characters will change behavior and could cause page
>> breakage.
>>
>> Or am I missing something?
>
> Yes: WebKit already performs the <meta> prescan for text/html when
> retrieving responseText via XHR even though it doesn't support full
> HTML parsing in XHR (so responseXML is still null).
> http://hsivonen.iki.fi/test/moz/xhr/charset-xhr.html
>
> Thus, apps broken by the meta prescan would already be broken in
> WebKit (unless, of course, they browser sniff in a very strange way).
>
> And apps that wouldn't be OK with using UTF-8 as the fallback encoding
> when there's no HTTP-level charset, no BOM and no <meta> in the first
> 1024 bytes would already be broken in Gecko.
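
As a rough illustration of the decoding order described in the quoted
text (HTTP-level charset, then BOM, then a <meta> prescan of the first
1024 bytes, then UTF-8 as the fallback), here is a minimal TypeScript
sketch. The function names are hypothetical, the regex is a crude
stand-in for a real prescan algorithm, and the relative precedence of
the HTTP charset and the BOM is an assumption rather than something
stated in this thread.

```typescript
// Hypothetical helpers for illustration only; not taken from any browser or spec.

// Detect a byte order mark at the start of the response body.
function sniffBom(bytes: Uint8Array): string | null {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "utf-8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) return "utf-16be";
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) return "utf-16le";
  return null;
}

// Simplified <meta> prescan: only the first 1024 bytes are inspected.
function prescanMeta(bytes: Uint8Array): string | null {
  const limit = Math.min(bytes.length, 1024);
  let head = "";
  for (let i = 0; i < limit; i++) {
    head += String.fromCharCode(bytes[i]); // treat each byte as one character
  }
  const match = /<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_\-:.]+)/i.exec(head);
  return match ? match[1].toLowerCase() : null;
}

// Assumed precedence: HTTP charset, then BOM, then <meta>, then UTF-8.
function pickEncoding(httpCharset: string | null, bytes: Uint8Array): string {
  if (httpCharset) return httpCharset.toLowerCase();
  return sniffBom(bytes) ?? prescanMeta(bytes) ?? "utf-8";
}
```

Note that nothing in this sketch consults the browser locale or user
settings, so the same bytes and headers always yield the same encoding.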

So it sounds like your argument is that we should do <meta> prescan
because we can do it without breaking any new ground, not because it's
better or was inherently safer before WebKit tried it out.

I'd much rather first debate what behavior we want, and then find out
whether that behavior is safe.

And we always have the option of only doing HTML parsing when
.responseType is set to "document". That is unlikely to break a lot of
content. It also saves resources for users, since it uses less memory.
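
To make the opt-in concrete, a minimal TypeScript sketch of what a
caller would do might look like this. The XhrLike interface and the
requestAsDocument name are hypothetical, introduced only so the sketch
does not depend on DOM typings; the assumption is that HTML parsing
happens only when the caller sets responseType to "document" before
calling send().

```typescript
// Hypothetical structural type covering just the members this sketch uses.
interface XhrLike {
  responseType: string;
  open(method: string, url: string): void;
  send(): void;
}

// Explicitly opt in to HTML parsing for a single request.
function requestAsDocument(xhr: XhrLike, url: string): void {
  xhr.open("GET", url);
  // Without this line, the assumed behavior is that no document is built.
  xhr.responseType = "document";
  xhr.send();
}
```

Existing callers that never touch responseType would be unaffected
under this assumption, which is why it seems unlikely to break content.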

>>> Applying all the legacy text/html craziness to XHR would make data
>>> loading in programs fail in subtle and hard-to-debug ways depending on
>>> the browser localization and user settings. At least when loading into
>>> a browsing context, there's visual feedback of character misdecoding
>>> and the feedback can be attributed back to a given file. If
>>> setting-dependent misdecoding happens in the XHR data loading
>>> machinery of an app, it's much harder to figure out what part of the
>>> system the problem should be attributed to.
>>
>> Could you provide more detail here? How are you imagining this data
>> being used such that it's not being displayed to the user?
>>
>> I.e. can you describe an application that would break in a non-visual
>> way and where it would be harder to detect where the data originated
>> from compared to, for example, <iframe> usage?
>
> If a piece of text came from XHR and got injected into a visible DOM,
> it's not immediately obvious which HTTP response it came from.

But what type of web app would that be? Consider for example a webmail
client. While it might originally show emails in a collapsed state in
a mail-thread view, the data is likely still going to be shown
eventually when the user expands the individual messages. Also, if the
user doesn't expand to see the data, does it really matter that it was
wrongly decoded? And in any case, it's easy to figure out where the
data was loaded from after the fact, so debugging doesn't seem any
harder.

So can you provide a counterexample of an app where this wouldn't be the case?

/ Jonas

Received on Wednesday, 28 September 2011 01:17:43 UTC