Re: [XHR2] HTML in XHR implementation feedback from Jonas Sicking on 2011-11-27 (public-webapps@w3.org from October to December 2011)

From: Jonas Sicking <jonas@sicking.cc>
Date: Sat, 26 Nov 2011 17:32:24 -0800
To: Henri Sivonen <hsivonen@iki.fi>
Cc: public-webapps WG <public-webapps@w3.org>
Message-ID: <CA+c2ei_ez4ak++HG6spVHC37XhaOof9jPtxrq_VghyLQ-8rNQg@mail.gmail.com>

On Thursday, November 24, 2011, Henri Sivonen <hsivonen@iki.fi> wrote:
> On Mon, Nov 21, 2011 at 8:26 PM, Jonas Sicking <jonas@sicking.cc> wrote:
>> On Wed, Nov 16, 2011 at 2:40 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
>>>  * For text/html responses for response type "" and "document", the
>>> character encoding is established by taking the first match from this
>>> list in this order:
>>>   - HTTP charset parameter
>>>   - BOM
>>>   - HTML-compliant <meta> prescan up to 1024 bytes.
>>>   - UTF-8
>>
>> I still think that we are putting large parts of the world at a
>> significant disadvantage here since they would not be able to use this
>> feature together with existing content, which I would imagine is a
>> large argument for this feature at all.
>>
>> Here is what I propose. How about we add a .defaultCharset property.
>> When not set we use the list as described above. If set, the contents
>> of .defaultCharset are used in place of UTF-8.
>
> I think that makes sense as a solution if it turns out that a solution
> is needed. I think adding that feature now would be a premature
> addition of complexity--especially considering that responseText has
> existed for this long with a UTF-8 default without a .defaultCharset
> property.

We have had a notoriously hard time getting feedback from the affected
community here, i.e. the non-English-speaking community (and in particular
the one outside North America and Europe). Just look at the market share of
WebKit and Gecko in China and Japan for a result of the "wait until we hear
if the rest of the world might be different from us" approach. A prime
example of this was how hard it was to get marquee added to Gecko, despite
its wide use in, for example, Japan.

So I strongly oppose a wait-and-see approach here. At the very least we
should reach out to the communities in CJK and other countries which are
likely to have large bodies of documents in non-ASCII compatible encodings.

Generally we do wait until we have use cases, which I've already presented
here, but not until authors come banging on our door, which is still all
too rare.
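
For concreteness, the fallback order quoted above, with the proposed
.defaultCharset standing in for the final UTF-8 step, could be sketched
roughly as follows. This is only an illustration in Python, not the Gecko
code: the function name pick_encoding and the default_charset parameter are
made up for this sketch, and the <meta> step is a simplified regex rather
than the HTML-compliant prescan.

```python
import re

# Byte order marks checked in step 2, paired with the encoding they imply.
_BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16le"),
    (b"\xfe\xff", "utf-16be"),
)

# Simplified patterns (assumptions of this sketch, not the spec algorithms).
_HTTP_CHARSET = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
_META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def pick_encoding(content_type, body, default_charset=None):
    """Return the first encoding found, trying the sources in list order."""
    # 1. HTTP charset parameter from the Content-Type header, if any.
    if content_type:
        match = _HTTP_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # 2. BOM at the very start of the response body.
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    # 3. <meta> declaration within the first 1024 bytes only.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Hypothetical .defaultCharset if set, otherwise UTF-8.
    return (default_charset or "utf-8").lower()
```

With default_charset left unset, the last step behaves exactly like the
list as proposed; setting it only changes what happens when the HTTP
header, the BOM and the <meta> check have all come up empty.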

>>>  * When there is no HTTP-level charset parameter, progress events are
>>> stalled and responseText made null until the parser has found a BOM or
>>> a charset <meta> or has seen 1024 bytes or the EOF without finding
>>> either BOM or charset <meta>.
>>
>> Why? I wrote the Gecko code specifically so that we can adjust
>> .responseText once we know the document charset. Given that we're only
>> scanning 1024 bytes, this shouldn't ever require more than 1024 bytes
>> of extra memory (though the current implementation doesn't take
>> advantage of that).
>
> I meant that stalling stops at EOF if the file is shorter than 1024
> bytes. However, this point will become moot, because supporting HTML
> parsing per spec in the default mode broke Wolfram Alpha and caused
> wasteful parsing on Gmail, so per IRC discussion with Anne and Olli,
> I'm preparing to limit HTML parsing to responseType == "document"
> only.

Sounds good.

/ Jonas

Received on Sunday, 27 November 2011 01:32:53 UTC