Re: [XHR] responseType "json" from Glenn Adams on 2011-12-05 (public-webapps@w3.org from October to December 2011)

From: Glenn Adams <glenn@skynav.com>
Date: Mon, 5 Dec 2011 13:15:09 -0700
To: Glenn Maynard <glenn@zewt.org>
Cc: Bjoern Hoehrmann <derhoermi@gmx.net>, WebApps WG <public-webapps@w3.org>
Message-ID: <CACQ=j+ea7T_rF4eHBSje9tu3SQ3ddNsJeLMauCULLgte106RcA@mail.gmail.com>

Let me choose my words more carefully.

A browser may recognize UTF-32 (e.g., in a sniffer) without supporting it
(either internally or for transcoding into a different internal encoding).

If the browser supports UTF-32, then step (2) of [1] applies.

[1]
http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding

But, if the browser does not support UTF-32, then the table in step (4) of
[1] is supposed to apply, which would interpret the initial two bytes FF FE
as UTF-16LE according to the current language of [1], and further, return a
confidence level of "certain".

I see the problem now. It seems that the table in step (4) should be
changed to interpret an initial FF FE as UTF-16LE only if the following two
bytes are not both 00, since FF FE 00 00 is the UTF-32LE byte order mark.
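
To make the suggestion concrete, here is a minimal sketch in Python of the
two behaviors. It is only an illustration under my own assumptions, not text
from [1]: the function names are hypothetical, the three-entry table is my
reading of the BOM table in step (4), and returning None stands in for
"no match, fall through to the later steps".

```python
import codecs
from typing import Optional


def sniff_bom_current(data: bytes) -> Optional[str]:
    """BOM table as I read step (4): FF FE always means UTF-16LE."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16be"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16le"
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8"
    return None  # no BOM matched; later steps would decide


def sniff_bom_proposed(data: bytes) -> Optional[str]:
    """Same table, but FF FE 00 00 (the UTF-32LE BOM) is not claimed as UTF-16LE."""
    if data.startswith(codecs.BOM_UTF32_LE):   # FF FE 00 00
        return None  # assumption: treat as unrecognized rather than guess
    return sniff_bom_current(data)


utf32_doc = codecs.BOM_UTF32_LE + "hi".encode("utf-32-le")
utf16_doc = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

print(sniff_bom_current(utf32_doc))   # utf-16le (the misdetection at issue)
print(sniff_bom_proposed(utf32_doc))  # None
print(sniff_bom_proposed(utf16_doc))  # utf-16le
```

One cost of this check is that a genuine UTF-16LE document whose first
character after the BOM is U+0000 would no longer match the table.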

On Mon, Dec 5, 2011 at 11:45 AM, Glenn Maynard <glenn@zewt.org> wrote:

> On Mon, Dec 5, 2011 at 1:00 PM, Glenn Adams <glenn@skynav.com> wrote:
>
>> > [2] http://www.w3.org/TR/charmod/#C030
>>
>>>
>>> No, it wouldn't.  That doesn't say that UTF-32 must be recognized.
>>
>>
>> You misread me. I am not saying or supporting that UTF-32 must be
>> recognized. I am saying that MIS-recognizing UTF-32 as UTF-16 violates [2].
>>
>
> It's impossible to violate that rule if the encoding isn't recognized.
> "When an IANA-registered charset name *is recognized*"; UTF-32 isn't
> recognized, so this is irrelevant.
>
>> If a browser doesn't support UTF-32 as an incoming interchange format,
>> then it should treat it as any other character encoding it does not
>> recognize. It must not pretend it is another encoding.
>>
>
> When an encoding is not recognized by the browser, the browser has full
> discretion in guessing the encoding.  (See step 7 of
> http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding.)
> It's perfectly reasonable for UTF-32 data to be detected as UTF-16.  For
> example, UTF-32 data is likely to contain null bytes when scanned bytewise,
> and UTF-16 is the only supported encoding where that's likely to happen.
> Steps 7 and 8 give browsers unrestricted freedom in selecting the encoding
> when the previous steps are unable to do so; if they choose to include "if
> the charset is declared as UTF-32, return UTF-16" as one of their
> autodetection rules, the spec allows it.
>
> --
> Glenn Maynard
>
>
>

Received on Monday, 5 December 2011 20:15:58 UTC