XHR: responseText encoding detection from Bjoern Hoehrmann on 2007-02-22 (public-webapi@w3.org from February 2007)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Thu, 22 Feb 2007 10:26:01 +0100
To: public-webapi@w3.org
Message-ID: <8onqt211878th88hlrrqsd5t0vh8ujjkgc@hive.bjoern.hoehrmann.de>

Hi,

  For responseText the current draft says:

  If the response includes a Content-Type understood by the user agent
  the characters are encoded following the relevant media type
  specification, with the exception that the rule in the final paragraph
  of section 3.7.1 of [RFC2616], and the rules in section 4.1.2 of
  [RFC2046] must be treated as if they specified the default character
  encoding as being UTF-8. Invalid bytes must be converted to U+FFFD
  REPLACEMENT CHARACTER.

There are many problems with this text. First, it seems "encoded" should
be "decoded". When using HTTP, RFC 2046 does not apply, so I am not sure
why it is being overridden here. I note that the text does not override,
e.g., RFC 3023, so if you have a text/xml document, you have four
competing defaults:

  * RFC 3023              -> us-ascii
  * RFC 2616              -> iso-8859-1
  * XMLHttpRequest        -> utf-8
  * XML ignoring the type -> check the bom and xml declaration
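
To make the divergence concrete, here is a minimal Python sketch that
decodes one response body under each of the four defaults listed above.
The payload, the sniff_xml_encoding helper, and the use of U+FFFD for
undecodable bytes are assumptions made for illustration only; the helper
is a rough stand-in for XML's detection rules, not an implementation of
them.

```python
import re

# Hypothetical text/xml body served without a charset parameter.
BODY = '<?xml version="1.0" encoding="iso-8859-1"?><p>caf\u00e9</p>'.encode("iso-8859-1")

XML_DECL = re.compile(rb"""<\?xml[^>]*encoding=["']([A-Za-z0-9._-]+)["']""")


def sniff_xml_encoding(data: bytes) -> str:
    """Rough stand-in for XML's own detection: BOM first, then the declaration."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    match = XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else "utf-8"


def decode_under_each_default(data: bytes) -> dict:
    """Decode the same bytes with the default each rule would pick."""
    defaults = {
        "RFC 3023": "us-ascii",
        "RFC 2616": "iso-8859-1",
        "XMLHttpRequest": "utf-8",
        "XML ignoring the type": sniff_xml_encoding(data),
    }
    # errors="replace" is an assumption; it substitutes U+FFFD for bad input.
    return {rule: data.decode(enc, errors="replace") for rule, enc in defaults.items()}


RESULTS = decode_under_each_default(BODY)

if __name__ == "__main__":
    for rule, text in RESULTS.items():
        print(f"{rule:22} -> {text!r}")
```

With this payload only the iso-8859-1 default and the XML declaration
recover the original text; the us-ascii and utf-8 defaults both lose the
non-ASCII character.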

You have a similar situation e.g. for text/css and text/html, where the
specifications similarly override their underlying protocols. It does
not strike me as very likely that implementers will implement the draft
in a way where including an external document through some markup and
fetching the same document through XHR give widely different results.

The rule on handling "invalid bytes" is unclear and problematic, e.g.,
if some character encoding scheme defines detailed error handling rules
it is much more likely that the implementation will follow those and
not whatever XHR might say, even though the current text prohibits that.
I also note that it is very unclear what exactly an invalid byte is, or
what the rule implies if you have only valid bytes but an invalid
sequence, e.g. a text/plain document containing only 0xC3, which is a
valid byte in UTF-8 but, taken as a whole, an incomplete sequence.
Browsers would in this case simply ignore the byte, not replace it by
something.
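
The 0xC3 case can be sketched in a few lines of Python. This only shows
how three generic error-handling policies treat the same input using
Python's built-in codec; it makes no claim about what any particular
browser does, and the function names are hypothetical.

```python
# 0xC3 starts a two-byte UTF-8 sequence, so on its own it is incomplete.
LONE_LEAD_BYTE = b"\xc3"


def decode_with(policy: str, data: bytes = LONE_LEAD_BYTE) -> str:
    """Decode as UTF-8 using one of Python's codec error handlers."""
    return data.decode("utf-8", errors=policy)


def is_rejected(data: bytes = LONE_LEAD_BYTE) -> bool:
    """True if a strict UTF-8 decoder refuses the input outright."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return False
    except UnicodeDecodeError:
        return True


REPLACED = decode_with("replace")  # what the draft text appears to require
IGNORED = decode_with("ignore")    # the "simply ignore the byte" behaviour

if __name__ == "__main__":
    print(repr(REPLACED), repr(IGNORED), is_rejected())
```

The three policies give three different answers for a single byte: one
U+FFFD, an empty string, or an error. That is why the draft needs to say
what "invalid" means for sequences and not only for individual bytes.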

I could probably propose alternate text that has fewer of these problems,
but I am not entirely sure what the text is actually supposed to say.

regards,
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Received on Thursday, 22 February 2007 09:26:06 UTC