
Re: [XHR2] HTML in XHR implementation feedback

From: Jonas Sicking <jonas@sicking.cc>
Date: Mon, 21 Nov 2011 10:26:57 -0800
Message-ID: <CA+c2ei_i3xghThYfXo5t=byKtMj-BSA=0SKqP+8gpoAmhVt0TA@mail.gmail.com>
To: Henri Sivonen <hsivonen@iki.fi>
Cc: public-webapps WG <public-webapps@w3.org>
On Wed, Nov 16, 2011 at 2:40 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
> I landed support for HTML parsing in XHR in Gecko today. It has not
> yet propagated to the Nightly channel.
> Here's how it behaves:
>  * Contrary to the spec, for response types other than "" and
> "document", character encoding determination for text/html happens the
> same way as for unknown types.

This is great IMO!

>  * For text/html responses for response type "" and "document", the
> character encoding is established by taking the first match from this
> list in this order:
>   - HTTP charset parameter
>   - BOM
>   - HTML-compliant <meta> prescan up to 1024 bytes.
>   - UTF-8
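The list above is a first-match-wins fallback chain. A minimal sketch of that resolution order, as a plain function (the function name and the `hints` object are illustrative only, not any real API):

```javascript
// Illustrative sketch of the charset resolution order described above.
// Each hint is a string or undefined; the first available one wins,
// with UTF-8 as the final default.
function resolveCharset(hints) {
  return hints.httpCharset   // 1. HTTP charset parameter
      || hints.bomCharset    // 2. BOM
      || hints.metaCharset   // 3. <meta> prescan (first 1024 bytes)
      || "utf-8";            // 4. default
}
```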

I still think that we are putting large parts of the world at a
significant disadvantage here, since they would not be able to use
this feature with existing content, which I would imagine is a large
part of the argument for having this feature at all.

Here is what I propose: how about we add a .defaultCharset property?
When it is not set, we use the list as described above. When it is
set, the contents of .defaultCharset are used in place of UTF-8.
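To make the proposal concrete, here is a sketch of how the fallback chain would change. Note that .defaultCharset is only a proposal on this list, not a shipped API; the function name and `hints` object are illustrative:

```javascript
// Sketch of the proposed behavior: .defaultCharset, when set, replaces
// UTF-8 as the final fallback. The earlier steps are unchanged.
function resolveCharsetWithDefault(hints, defaultCharset) {
  return hints.httpCharset        // 1. HTTP charset parameter
      || hints.bomCharset         // 2. BOM
      || hints.metaCharset        // 3. <meta> prescan (first 1024 bytes)
      || defaultCharset           // 4. the proposed .defaultCharset
      || "utf-8";                 // 5. default when .defaultCharset is unset
}
```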

This way, websites that have a large body of existing documents using
an encoding other than UTF-8 can use this feature without any
server-side changes (which we keep hearing, time and again, is a big
hurdle for many people). Further, they can also use HTML-in-XHR when
different documents use different charsets, as long as each document
either uses the default charset or has a <meta> element declaring its
charset.

This way we should be able to support the majority of existing content
without resorting to browser-locale-specific defaults or the encoding
of the loading page.

>  * When there is no HTTP-level charset parameter, progress events are
> stalled and responseText made null until the parser has found a BOM or
> a charset <meta> or has seen 1024 bytes or the EOF without finding
> either BOM or charset <meta>.

Why? I wrote the Gecko code specifically so that we can adjust
.responseText once we know the document charset. Given that we're only
scanning 1024 bytes, this should never require more than 1024 bytes
of extra memory (though the current implementation doesn't take
advantage of that).
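A sketch of the buffering Jonas is describing, under the assumption that only the undecoded prescan window (at most 1024 bytes) needs to be retained until the charset is decided. The class and method names are illustrative, not Gecko internals:

```javascript
// Illustrative sketch: hold raw bytes only while the charset is still
// unknown, then decode the buffered bytes once and stream thereafter.
class PrescanBuffer {
  constructor() {
    this.pending = [];   // raw Uint8Array chunks; bounded by the
                         // 1024-byte prescan window in practice
    this.decoder = null; // set once the charset is decided
    this.text = "";
  }
  push(bytes) {          // bytes: Uint8Array from the network
    if (this.decoder) {
      this.text += this.decoder.decode(bytes, { stream: true });
    } else {
      this.pending.push(bytes);
    }
  }
  setCharset(charset) {  // called once BOM/<meta>/1024-byte limit decides it
    this.decoder = new TextDecoder(charset);
    for (const chunk of this.pending) {
      this.text += this.decoder.decode(chunk, { stream: true });
    }
    this.pending = [];
  }
  get responseText() {   // null while the charset is still unknown
    return this.decoder ? this.text : null;
  }
}
```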

>  * I suggest making responseType modes other than "" and "document"
> not consider the internal character encoding declarations in HTML (or
> XML).


> Spec change proposals that I'm not making yet but might make in near future:
>  * Making responseType == "" not support HTML parsing at all and to
> treat text/html as an unknown type for the purpose of character
> encoding.

I don't understand what the part after the "and" means. But the part
before it sounds quite interesting to me. It would also resolve any
concerns about breaking existing content.

>  * Making XHR not support HTML parsing in the synchronous mode.

This sounds fine to me.

/ Jonas
Received on Monday, 21 November 2011 18:27:59 UTC
