- From: Glenn Adams <glenn@skynav.com>
- Date: Mon, 5 Dec 2011 13:15:09 -0700
- To: Glenn Maynard <glenn@zewt.org>
- Cc: Bjoern Hoehrmann <derhoermi@gmx.net>, WebApps WG <public-webapps@w3.org>
- Message-ID: <CACQ=j+ea7T_rF4eHBSje9tu3SQ3ddNsJeLMauCULLgte106RcA@mail.gmail.com>
Let me choose my words more carefully. A browser may recognize UTF-32 (e.g., in a sniffer) without supporting it (either internally or for transcoding into a different internal encoding). If the browser supports UTF-32, then step (2) of [1] applies.

[1] http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding

But if the browser does not support UTF-32, then the table in step (4) of [1] is supposed to apply, which would interpret the initial two bytes FF FE as UTF-16LE according to the current language of [1], and further, return a confidence level of "certain".

I see the problem now. It seems that the table in step (4) should be changed to interpret an initial FF FE as UTF-16LE only if the following two bytes are not both 00.

On Mon, Dec 5, 2011 at 11:45 AM, Glenn Maynard <glenn@zewt.org> wrote:

> On Mon, Dec 5, 2011 at 1:00 PM, Glenn Adams <glenn@skynav.com> wrote:
>
>>> [2] http://www.w3.org/TR/charmod/#C030
>>>
>>> No, it wouldn't. That doesn't say that UTF-32 must be recognized.
>>
>> You misread me. I am not saying or supporting that UTF-32 must be
>> recognized. I am saying that MIS-recognizing UTF-32 as UTF-16 violates [2].
>
> It's impossible to violate that rule if the encoding isn't recognized.
> "When an IANA-registered charset name *is recognized*"; UTF-32 isn't
> recognized, so this is irrelevant.
>
>> If a browser doesn't support UTF-32 as an incoming interchange format,
>> then it should treat it as any other character encoding it does not
>> recognize. It must not pretend it is another encoding.
>
> When an encoding is not recognized by the browser, the browser has full
> discretion in guessing the encoding. (See step 7 of
> http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding.)
> It's perfectly reasonable for UTF-32 data to be detected as UTF-16. For
> example, UTF-32 data is likely to contain null bytes when scanned bytewise,
> and UTF-16 is the only supported encoding where that's likely to happen.
> Steps 7 and 8 give browsers unrestricted freedom in selecting the encoding
> when the previous steps are unable to do so; if they choose to include "if
> the charset is declared as UTF-32, return UTF-16" as one of their
> autodetection rules, the spec allows it.
>
> --
> Glenn Maynard
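[Editor's note: the following is a minimal sketch, not taken from the thread or from the spec's actual algorithm, of the BOM check proposed above: an initial FF FE is reported as UTF-16LE only when the next two bytes are not both 00, since FF FE 00 00 is the UTF-32LE BOM. The function name and the choice to return nothing for unsupported UTF-32 are illustrative assumptions; a real browser would fall through to its later detection steps.]

```python
from typing import Optional

def sniff_bom(prefix: bytes) -> Optional[str]:
    """Guess an encoding from the first bytes of a document, or None if undetected."""
    if prefix.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    # Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 BOMs, so that
    # FF FE 00 00 is not mistaken for the UTF-16LE BOM FF FE.
    if prefix.startswith(b"\x00\x00\xfe\xff") or prefix.startswith(b"\xff\xfe\x00\x00"):
        return None            # UTF-32 BOM: unsupported here, leave undetected
    if prefix.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if prefix.startswith(b"\xff\xfe"):
        return "UTF-16LE"      # only reached when bytes 2-3 are not 00 00
    return None

# Example: a UTF-32LE stream ("A" after its BOM) is no longer misdetected as UTF-16LE.
assert sniff_bom(b"\xff\xfe\x00\x00A\x00\x00\x00") is None
assert sniff_bom(b"\xff\xfeA\x00") == "UTF-16LE"
```

The only design point is ordering: the longer UTF-32 BOM prefixes must be tested before the UTF-16 ones, which is exactly the distinction the proposed change to the step (4) table would encode.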
Received on Monday, 5 December 2011 20:15:58 UTC