[Bug 15142] Define "UNICODE" as a defacto alias for "UTF-16"

https://www.w3.org/Bugs/Public/show_bug.cgi?id=15142

--- Comment #10 from Glenn Adams <glenn@skynav.com> 2011-12-12 00:46:51 UTC ---
(In reply to comment #8)
> (1) The main proposal is to require the HTML5 parser to, when it see
> charset="UNICODE" (upper- or lowercase), replace it with charset="UTF-16"
> (which in turns gets replaced with "UTF-8" it occurs inside a HTML document).
> This in order to a) be compatible with "the Web", b) to support the shift to
> Unicode in particular and UTF-8 especially by c) making sure that content that
> is intended to be unicode, is treated as unicode by all HTML5 user agents.

If an HTML representation of an HTML5 document (not an XML representation)
specifies either

<meta charset="UTF-16">

or

<meta charset="UNICODE">

it is effectively in violation of 4.2.5.5 [1]:

"If an HTML document contains a meta element with a charset attribute or a meta
element with an http-equiv attribute in the Encoding declaration state, then
the character encoding used must be an ASCII-compatible character encoding."

[1] http://dev.w3.org/html5/spec/Overview.html#character-encoding-declaration

This is because "a UTF-16 encoding" [2], whether it is labeled explicitly as
"UTF-16" or labeled with a hypothetical alias "UNICODE" is not an
"ASCII-compatible character encoding" [3].

[2] http://dev.w3.org/html5/spec/Overview.html#a-utf-16-encoding
[3] http://dev.w3.org/html5/spec/Overview.html#ascii-compatible-character-encoding
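
To make the ASCII-compatibility point concrete, here is a minimal Python sketch. The variable names are mine and nothing here comes from the spec text; it only compares the bytes of the same declaration in UTF-8 and in UTF-16LE.

```python
# Illustrative only: compare the byte sequences of the same markup
# in an ASCII-compatible encoding (UTF-8) and in UTF-16LE.
markup = '<meta charset="UTF-16">'

ascii_bytes = markup.encode("ascii")
utf8_bytes = markup.encode("utf-8")
utf16_bytes = markup.encode("utf-16-le")

# UTF-8 keeps the ASCII byte values unchanged for these characters.
utf8_is_ascii_compatible = utf8_bytes == ascii_bytes

# UTF-16LE follows every ASCII byte with a 0x00 byte, so the byte
# sequences differ and a byte-level search for "<meta" finds nothing.
utf16_is_ascii_compatible = utf16_bytes == ascii_bytes
meta_found_in_utf16 = b"<meta" in utf16_bytes
```

In other words, a byte-oriented scan that looks for an ASCII "<meta" cannot see the declaration in a document that really is encoded as UTF-16.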

So, what you appear to be describing is parser behavior when processing an HTML
representation of an HTML5 document that violates the constraint cited above in
[1]. Is that correct?

If that is the case, then are you suggesting a change in the semantics or
language of the "encoding sniffing algorithm" [4]?

[4] http://dev.w3.org/html5/spec/Overview.html#encoding-sniffing-algorithm

Even if you are suggesting a change in [4], it does not appear any change would
be necessary in the first case, since any use of <meta charset="UTF-16"> or any
logical equivalent would only come into play in step 5, sub-step 13:

"If charset is a UTF-16 encoding, change the value of charset to UTF-8."

However, since this language does not define what is meant by "if charset is a
UTF-16 encoding", an implementation could interpret this flexibly.

That is, sub-step 13 does not say something like:

"If the value of the charset attribute is an ASCII case-insensitive match of an
IANA-registered name or alias of a UTF-16 encoding, ..."

rather, the language of sub-step 13 simply says:

"If charset is a UTF-16 encoding..."

leaving it to the imagination of the reader (and the vagaries of the
implementation) to interpret this as desired, including an interpretation that
permits recognizing aliases that are not IANA-registered.
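
The two readings of sub-step 13 can be sketched in Python as follows. This is only an illustration: the label sets and function names are hypothetical, the strict set lists just the three names "UTF-16", "UTF-16BE" and "UTF-16LE" rather than a complete registry lookup, and neither function is claimed to match any shipping implementation.

```python
# Hypothetical label sets; neither is taken from the spec text.
STRICT_UTF16_LABELS = {"utf-16", "utf-16be", "utf-16le"}
LENIENT_UTF16_LABELS = STRICT_UTF16_LABELS | {"unicode"}


def ascii_lower(label: str) -> str:
    """Lowercase only A-Z, i.e. an ASCII case-insensitive fold."""
    return "".join(
        chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in label
    )


def is_utf16_strict(charset: str) -> bool:
    """Reading 1: only the listed registered names count."""
    return ascii_lower(charset) in STRICT_UTF16_LABELS


def is_utf16_lenient(charset: str) -> bool:
    """Reading 2: an implementation also accepts an unregistered alias."""
    return ascii_lower(charset) in LENIENT_UTF16_LABELS


def apply_substep_13(charset: str, is_utf16) -> str:
    """'If charset is a UTF-16 encoding, change the value of charset to UTF-8.'"""
    return "UTF-8" if is_utf16(charset) else charset
```

Under the strict reading, apply_substep_13("UNICODE", is_utf16_strict) leaves the label untouched; under the lenient reading it becomes "UTF-8". The current wording of sub-step 13 does not rule out either.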

Note that step 5, sub-step 13 is reached only when (1) there is no
user-specified encoding override, (2) there is no character encoding metadata
supplied by the transport layer, and (3) there is no BOM.
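
As a rough sketch of that ordering (a simplification of [4], not the algorithm itself), the following Python assumes the meta charset value has already been extracted by some prescan. The function name, its parameters and the hard-coded set of UTF-16 labels are all hypothetical.

```python
import codecs
from typing import Optional

# Hypothetical set of labels treated as "a UTF-16 encoding".
UTF16_LABELS = {"utf-16", "utf-16be", "utf-16le"}


def choose_encoding(
    user_override: Optional[str],
    transport_charset: Optional[str],
    head_bytes: bytes,
    meta_charset: Optional[str],
) -> Optional[str]:
    """Return the first encoding source that applies, in priority order."""
    # (1) An explicit user override wins.
    if user_override:
        return user_override
    # (2) Then transport-layer metadata, e.g. an HTTP Content-Type charset.
    if transport_charset:
        return transport_charset
    # (3) Then a byte order mark at the start of the stream.
    if head_bytes.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if head_bytes.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    if head_bytes.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    # Only now does the meta declaration matter; this is where the
    # UTF-16 to UTF-8 rewrite of sub-step 13 would apply.
    if meta_charset is not None:
        if meta_charset.lower() in UTF16_LABELS:
            return "UTF-8"
        return meta_charset
    return None  # Fall through to later steps not modeled here.
```

For example, choose_encoding(None, "ISO-8859-1", b"<html>", "UTF-16") returns "ISO-8859-1" without ever consulting the meta value, which is why the rewrite in sub-step 13 affects relatively few documents.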

Overall, I have to wonder at the utility of your proposal, whether or not such
an alias exists de facto or de jure.

If there is a bug here, it is probably that sub-step 13 does not refer to the
language in 8.2.2.2 [5], especially the 3rd and 4th paragraphs.

[5] http://dev.w3.org/html5/spec/Overview.html#character-encodings-0

In general, I oppose your proposal on the grounds that it is already
inconsistent with the spirit of 4.2.5.5 [1] cited above.

As for registering an alias independently of whether HTML5 makes use of it, the
Unicode Consortium would be the appropriate party to take up that issue, not
the HTML WG. I have forwarded a link to this thread to the Unicode Consortium
in case they wish to address this matter further. I can't comment on their
possible position on the issue of registering "UNICODE" as an alias for
"UTF-16", but I would speculate that they may not support the idea.

Regards,
Glenn

Received on Monday, 12 December 2011 00:48:55 UTC