Re: [XHR2] HTML in XHR implementation feedback from Henri Sivonen on 2011-11-24 (public-webapps@w3.org from October to December 2011)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Thu, 24 Nov 2011 15:15:09 +0200
To: public-webapps WG <public-webapps@w3.org>
Message-ID: <CAJQvAuemvS_FwkGYkTg3WyNGZq06k5HwixfqSHKJPJPXwbxPng@mail.gmail.com>

On Mon, Nov 21, 2011 at 8:26 PM, Jonas Sicking <jonas@sicking.cc> wrote:
> On Wed, Nov 16, 2011 at 2:40 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
>>  * For text/html responses for response type "" and "document", the
>> character encoding is established by taking the first match from this
>> list in this order:
>>   - HTTP charset parameter
>>   - BOM
>>   - HTML-compliant <meta> prescan up to 1024 bytes.
>>   - UTF-8
>
> I still think that we are putting large parts of the world at a
> significant disadvantage here since they would not be able to use this
> feature together with existing content, which I would imagine is a
> large argument for this feature at all.
>
> Here is what I propose. How about we add a .defaultCharset property.
> When not set we use the list as described above. If set, the contents
> of .defaultCharset are used in place of UTF-8.

I think that makes sense as a solution if it turns out that one is
needed. Adding the feature now would be a premature addition of
complexity, though, especially considering that responseText has
existed for this long with a UTF-8 default and without a
.defaultCharset property.
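
For illustration, the lookup order quoted above can be sketched in
Python as follows. This is a minimal sketch under stated assumptions,
not Gecko's code: pick_encoding, BOMS and META_CHARSET are hypothetical
names, the regular expression is a crude stand-in for the real HTML
<meta> prescan, and the default_charset argument only models the
proposed .defaultCharset property, which does not exist.

```python
import codecs
import re

PRESCAN_LIMIT = 1024  # bytes examined by the <meta> prescan, per the list above

# Byte order marks checked in this sketch (assumed set, kept small).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Simplified stand-in for the HTML prescan; the real algorithm is stricter.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE
)


def pick_encoding(http_charset, body, default_charset=None):
    """Return the first match: HTTP charset, BOM, <meta> prescan, fallback."""
    if http_charset:
        return http_charset.lower()
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    # Only the first PRESCAN_LIMIT bytes are searched for a charset <meta>.
    match = META_CHARSET.search(body[:PRESCAN_LIMIT])
    if match:
        return match.group(1).decode("ascii").lower()
    # Hypothetical: default_charset models the proposed .defaultCharset.
    return (default_charset or "utf-8").lower()
```

With default_charset left unset the last step is plain UTF-8, which is
the behavior argued for above; passing a value shows what the proposed
property would change and nothing more.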

>>  * When there is no HTTP-level charset parameter, progress events are
>> stalled and responseText made null until the parser has found a BOM or
>> a charset <meta> or has seen 1024 bytes or the EOF without finding
>> either BOM or charset <meta>.
>
> Why? I wrote the gecko code specifically so that we can adjust
> .responseText once we know the document charset. Given that we're only
> scanning 1024 bytes, this shouldn't ever require more than 1024 bytes
> of extra memory (though the current implementation doesn't take
> advantage of that).

I meant that stalling stops at EOF if the file is shorter than 1024
bytes. However, this point will become moot, because supporting HTML
parsing per spec in the default mode broke Wolfram Alpha and caused
wasteful parsing on Gmail, so per IRC discussion with Anne and Olli,
I'm preparing to limit HTML parsing to responseType == "document"
only.
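
For reference, the stalling rule under discussion amounts to a small
predicate. The following Python is only a sketch of that rule as quoted
above; encoding_decision_ready and its parameters are hypothetical
names and are not taken from any implementation.

```python
PRESCAN_LIMIT = 1024  # prescan window in bytes, as quoted above


def encoding_decision_ready(has_http_charset, bytes_seen,
                            found_bom_or_meta, at_eof):
    """True once progress events and responseText no longer need to stall."""
    if has_http_charset:
        # An HTTP-level charset settles the encoding immediately.
        return True
    # Otherwise wait for a BOM or charset <meta>, 1024 bytes, or EOF.
    return found_bom_or_meta or bytes_seen >= PRESCAN_LIMIT or at_eof
```

The at_eof term covers the case clarified above: a file shorter than
1024 bytes stops stalling when it ends.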

>>  * Making responseType == "" not support HTML parsing at all and to
>> treat text/html as an unknown type for the purpose of character
>> encoding.
>
> I don't understand what the part after the "and" means. But the part
> before it sounds quite interesting to me. It would also resolve any
> concerns about breaking existing content.

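The part after "and" means keeping the old behavior, in which
text/html gets no HTML-specific treatment when the character encoding
is determined. This is now the plan.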
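
As a rough sketch of that plan in Python (plan_for and its return shape
are hypothetical, chosen only for illustration): HTML parsing, and with
it the <meta> prescan, would apply only to text/html responses
requested with responseType "document".

```python
def plan_for(response_type, mime_type):
    """Describe how a response would be handled under the plan above."""
    is_html = mime_type.strip().lower() == "text/html"
    if is_html and response_type == "document":
        # Only this combination gets the HTML parser and its <meta> prescan.
        return {"parse_html": True, "meta_prescan": True}
    # Any other response type, including "", treats text/html like an
    # unknown type: no HTML parsing and no <meta> prescan.
    return {"parse_html": False, "meta_prescan": False}
```

For example, plan_for("", "text/html") reports no parsing and no
prescan, which is the case that previously affected existing content.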

On Mon, Nov 21, 2011 at 8:28 PM, Jonas Sicking <jonas@sicking.cc> wrote:
>> The side effect is that <meta> prescan doesn't happen in the
>> synchronous mode for text/html resources. This is displeasingly
>> inconsistent but makes sense if the sync mode is treated as an evil
>> legacy feature rather than as an evolving part of the platform.
>
> I'm not sure what this means. Aren't we only doing <meta> prescan when
> parsing an HTML document?

The <meta> prescan is done only when parsing HTML.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Thursday, 24 November 2011 13:15:37 UTC