The problem as I see it is that the current spec text for charset detection
effectively *requires* a browser that does not "support" UTF-32 to
explicitly ignore content metadata that may be correct (if it specifies
UTF-32 in the charset parameter), and further, to explicitly mis-label such
content as UTF-16LE when the first four bytes are FF FE 00 00. Indeed,
the current algorithm requires mis-labelling such content as UTF-16LE with
a confidence of "certain".
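
To make the mis-labelling concrete, here is a minimal Python sketch. It is
not the spec's algorithm text: sniff_bom_current is a hypothetical function
modelling the BOM table as I read it (UTF-8 and UTF-16 entries only), and
sniff_bom_utf32_aware is an equally hypothetical variant that checks the
four-byte UTF-32 BOMs before falling back to that table.

```python
from typing import Optional


def sniff_bom_current(data: bytes) -> Optional[str]:
    """Model of a BOM table with no UTF-32 entries (assumed reading of the spec)."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if data.startswith(b"\xff\xfe"):
        # FF FE 00 00 also lands here, so UTF-32LE content is labelled UTF-16LE.
        return "UTF-16LE"
    return None


def sniff_bom_utf32_aware(data: bytes) -> Optional[str]:
    """Hypothetical variant: test the four-byte UTF-32 BOMs first."""
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "UTF-32LE"
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "UTF-32BE"
    return sniff_bom_current(data)


# A UTF-32LE document that starts with a BOM: bytes FF FE 00 00 3C 00 00 00 ...
sample = "\ufeff<p>".encode("utf-32-le")

print(sniff_bom_current(sample))
print(sniff_bom_utf32_aware(sample))
print(repr(sample.decode("utf-16-le")))
```

Under the first table the sample is reported as UTF-16LE, and decoding it
that way yields a U+0000 after every character, even if the charset
parameter correctly said UTF-32.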

The current text is also ambiguous with respect to what "support" means in
step (2) of Section 8.2.2.1 of [1]. Which of the following is meant by
"support"?
- recognize with sniffer
- be capable of using directly as internal coding
- be capable of transcoding to internal coding

[1] http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding

On Mon, Dec 5, 2011 at 3:10 PM, Ian Hickson <ian@hixie.ch> wrote:
> On Mon, 5 Dec 2011, Glenn Adams wrote:
> >
> > I see the problem now. It seems that the table in step (4) should be
> > changed to interpret an initial FF FE as UTF-16BE only if the following
> > two bytes are not 00.
>
> The current text is intentional. UTF-32 is explicitly not supported by the
> HTML standard.