Re: XMLHttpRequest: BOM detection for responseText from Alexey Proskuryakov on 2007-05-18 (public-webapi@w3.org from May 2007)

From: Alexey Proskuryakov <ap-carbon@rambler.ru>
Date: Fri, 18 May 2007 12:10:23 +0400
To: Anne van Kesteren <annevk@opera.com>, "Web API WG (public)" <public-webapi@w3.org>
Message-ID: <C273482F.382FD%ap-carbon@rambler.ru>

On 5/17/07 8:09 PM, "Anne van Kesteren" <annevk@opera.com> wrote:

> Based on feedback from Microsoft the algorithm used by responseText now
> takes the potential BOM of the entity body into account. Please let me
> know if you spot any issues with this:

  I'm not sure we need two separate variables, "charset" and
"charset-http". If I'm not mistaken, the algorithm can be streamlined by
using only one of them:

-----------------------
1. If the response entity body is "null" return null and terminate these
steps.

2. Let charset be "null".

3. If there is no Content-Type header or there is a Content-Type header
which contains a MIME type that is text/xml, application/xml, text/xsl or
ends in +xml (ignoring any parameters) use the rules set forth in the XML
specification to determine the character encoding. Let charset be the
determined character encoding ***and terminate these steps***.

4. If charset is "null" and the Content-Type MIME type contains a charset
parameter let charset be the value of that parameter.

5. If charset is "null" <do the BOM detection>.

6. If charset is "null" let charset be "UTF-8".

7. Return the result of decoding the response entity body using charset or,
if that fails, return null.
-----------------------
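
  To make the control flow concrete, here is a rough Python sketch of the
seven steps above. It is only an illustration, not proposed spec text. The
names response_text, parse_content_type, xml_encoding and detect_bom are
hypothetical: xml_encoding stands in for the XML specification's rules and
detect_bom for step 5, and both are passed in as callbacks. I also assume
that "terminate these steps" in step 3 means skipping steps 4-6 and going
straight to decoding.

```python
from typing import Callable, Optional, Tuple

# MIME types that step 3 treats as XML (anything ending in "+xml" also counts).
XML_TYPES = {"text/xml", "application/xml", "text/xsl"}


def parse_content_type(header: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type value into (MIME type, charset parameter or None).

    Simplified for illustration: quoted ";" characters are not handled.
    """
    parts = [part.strip() for part in header.split(";")]
    charset = None
    for param in parts[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            charset = value.strip().strip('"') or None
    return parts[0].lower(), charset


def response_text(
    body: Optional[bytes],
    content_type: Optional[str],
    xml_encoding: Callable[[bytes], str],          # hypothetical: XML spec rules
    detect_bom: Callable[[bytes], Optional[str]],  # hypothetical: step 5
) -> Optional[str]:
    # Step 1: no entity body means no text.
    if body is None:
        return None
    # Step 2: a single charset variable, initially "null".
    charset = None
    if content_type is None:
        mime, param = None, None
    else:
        mime, param = parse_content_type(content_type)
    if mime is None or mime in XML_TYPES or mime.endswith("+xml"):
        # Step 3: the XML rules decide; steps 4-6 are skipped (assumed reading).
        charset = xml_encoding(body)
    else:
        # Step 4: charset parameter of the Content-Type header, if any.
        charset = param
        # Step 5: BOM detection, only if nothing was found so far.
        if charset is None:
            charset = detect_bom(body)
        # Step 6: fall back to UTF-8.
        if charset is None:
            charset = "UTF-8"
    # Step 7: decode, or return null if the charset is unknown or decoding fails.
    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        return None
```

  With only one variable, each later step is just guarded by "charset is
still null", which is the point of the simplification.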

  I think step 5 (BOM detection) could be written declaratively, similar to
how it is defined in CSS <http://www.w3.org/TR/CSS21/syndata.html#q23>. The
current algorithm may be slightly misleading in that it misses some edge
cases (what to do if the reply is shorter than 4 bytes?) that should only be
of interest to implementors anyway.
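
  As a rough illustration of what "declarative" could mean here, the
detection can be expressed as a table of byte prefixes, sketched below in
Python. The set of encodings in the table is my assumption (UTF-8, UTF-16
and UTF-32 marks) and is not taken from the draft; detect_bom and BOM_TABLE
are hypothetical names. A prefix match simply fails when the reply is
shorter than the mark, so the "shorter than 4 bytes" case needs no special
handling.

```python
import codecs
from typing import Optional

# Byte-order marks and the encoding each one implies. Order matters: the
# 4-byte UTF-32LE mark begins with the 2-byte UTF-16LE mark, so the UTF-32
# entries are listed first. The choice of encodings is an assumption.
BOM_TABLE = (
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
)


def detect_bom(body: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for mark, encoding in BOM_TABLE:
        # startswith() is False for bodies shorter than the mark.
        if body.startswith(mark):
            return encoding
    return None
```

  Note that with this ordering the bytes FF FE 00 00 are read as UTF-32LE
rather than as a UTF-16LE mark followed by U+0000; a spec would have to say
which reading is intended.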

- WBR, Alexey Proskuryakov

Received on Friday, 18 May 2007 08:10:35 UTC