- From: Jonas Sicking <jonas@sicking.cc>
- Date: Mon, 21 Nov 2011 10:26:57 -0800
- To: Henri Sivonen <hsivonen@iki.fi>
- Cc: public-webapps WG <public-webapps@w3.org>
On Wed, Nov 16, 2011 at 2:40 AM, Henri Sivonen <hsivonen@iki.fi> wrote:

> I landed support for HTML parsing in XHR in Gecko today. It has not yet propagated to the Nightly channel.
>
> Here's how it behaves:
>
> * Contrary to the spec, for response types other than "" and "document", character encoding determination for text/html happens the same way as for unknown types.

This is great IMO!

> * For text/html responses for response type "" and "document", the character encoding is established by taking the first match from this list in this order:
> - HTTP charset parameter
> - BOM
> - HTML-compliant <meta> prescan up to 1024 bytes
> - UTF-8

I still think that we are putting large parts of the world at a significant disadvantage here, since they would not be able to use this feature with existing content, and supporting existing content is, I would imagine, a large part of the argument for this feature in the first place.

Here is what I propose: how about we add a .defaultCharset property? When it is not set, we use the list as described above. If it is set, the contents of .defaultCharset are used in place of UTF-8.

This way, websites that have a large body of existing documents in an encoding other than UTF-8 can use this feature without any server-side changes (which we keep hearing, time and again, is a big hurdle for many people). Further, they can also use HTML-in-XHR when different documents use different charsets, as long as each document either uses the default charset or has a <meta> element declaring its charset.

This way we should be able to support the majority of existing content without resorting to browser-locale-specific defaults or the encoding of the loading page.

> * When there is no HTTP-level charset parameter, progress events are stalled and responseText made null until the parser has found a BOM or a charset <meta> or has seen 1024 bytes or the EOF without finding either BOM or charset <meta>.

Why? I wrote the Gecko code specifically so that we can adjust .responseText once we know the document charset. Given that we're only scanning 1024 bytes, this shouldn't ever require more than 1024 bytes of extra memory (though the current implementation doesn't take advantage of that).

> * I suggest making responseType modes other than "" and "document" not consider the internal character encoding declarations in HTML (or XML).

Agreed.

> Spec change proposals that I'm not making yet but might make in near future:
>
> * Making responseType == "" not support HTML parsing at all and to treat text/html as an unknown type for the purpose of character encoding.

I don't understand what the part after the "and" means, but the part before it sounds quite interesting to me. It would also resolve any concerns about breaking existing content.

> * Making XHR not support HTML parsing in the synchronous mode.

This sounds fine to me.

/ Jonas
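
A minimal sketch of the decision order Henri describes, with the proposed .defaultCharset substituted for the hard-coded UTF-8 fallback. The helper logic here is a deliberately simplified stand-in for the parser internals (the real <meta> prescan is the full algorithm from the HTML spec), and .defaultCharset is a proposal in this thread, not a shipped API:

```js
// Sketch of the encoding decision for text/html responses with
// responseType "" or "document". "bytes" is a Uint8Array holding the
// first bytes of the response body.
function pickEncoding(contentType, bytes, defaultCharset) {
  // 1. HTTP charset parameter, e.g. "text/html;charset=windows-1251"
  var http = /;\s*charset=["']?([^"';\s]+)/i.exec(contentType || "");
  if (http) return http[1];

  // 2. Byte order mark
  if (bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) return "UTF-8";
  if (bytes[0] === 0xFF && bytes[1] === 0xFE) return "UTF-16LE";
  if (bytes[0] === 0xFE && bytes[1] === 0xFF) return "UTF-16BE";

  // 3. <meta> prescan over at most the first 1024 bytes (the regex is
  //    a crude approximation of the spec's prescan)
  var head = String.fromCharCode.apply(null, bytes.subarray(0, 1024));
  var meta = /<meta[^>]*charset\s*=\s*["']?([^"'\s\/>]+)/i.exec(head);
  if (meta) return meta[1];

  // 4. Fallback: the proposed .defaultCharset if set, else UTF-8
  return defaultCharset || "UTF-8";
}
```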
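
And a hypothetical usage sketch of the proposal, for a site whose legacy documents are Windows-1252 while newer ones declare their own charset:

```js
var xhr = new XMLHttpRequest();
xhr.open("GET", "/legacy/article.html");
xhr.responseType = "document";
// Proposed, not shipped: responses carrying an HTTP charset, a BOM or
// a charset <meta> keep their declared encoding; everything else is
// decoded as Windows-1252 instead of falling back to UTF-8.
xhr.defaultCharset = "windows-1252";
xhr.onload = function () {
  var doc = xhr.response; // the parsed HTML Document
  // ... use doc ...
};
xhr.send();
```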
Received on Monday, 21 November 2011 18:27:59 UTC