[Bug 15142] Define "UNICODE" as a defacto alias for "UTF-16" from bugzilla@jessica.w3.org on 2011-12-12 (public-html-bugzilla@w3.org from December 2011)

From: <bugzilla@jessica.w3.org>
Date: Mon, 12 Dec 2011 05:31:50 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1RZyUQ-0005iy-Gd@jessica.w3.org>
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15142

--- Comment #11 from Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no> 2011-12-12 05:31:48 UTC ---

(In reply to comment #10)

> So, what you appear to be describing is parser behavior when processing an HTML
> representation of an HTML5 document that violates the constraint cited above in
> [1]. Is that correct?

Yes. That is my primary concern.

> If that is the case, then are you suggesting a change in the semantics or
> language of the "encoding sniffing algorithm" [4]?
> 
> [4] http://dev.w3.org/html5/spec/Overview.html#encoding-sniffing-algorithm

Yes, either to sub-step 13 or to the link in sub-step 13 - see below.

> Even if you are suggesting a change in [4], it does not appear any change would
> be necessary in the first case, since any use of <meta charset="UTF=16"> or any
> logical equivalent would only come to play in step 5. sub-step 13.
> 
> "If charset is a UTF-16 encoding, change the value of charset to UTF-8."
> 
> However, since this language does not define what is meant by "if charset is a
> UTF-16 encoding", an implementation could interpret this flexibly.

There is a link, on the wording "UTF-16 encoding", to the following text:

   "The term a UTF-16 encoding refers to any variant of UTF-16: self-describing
UTF-16 with a BOM, ambiguous UTF-16 without a BOM, raw UTF-16LE, and raw
UTF-16BE. [RFC2781]"

Because of the phrase "value of charset", it is natural to think that it
*does* refer to valid encoding names, such as "UTF-16", "UTF-16LE" or
"UTF-16BE". It does not seem to naturally include "UNICODE" unless
something explicitly says that one should link it.

It is the charset value that is supposed to be - or represent - "a UTF-16
encoding". And unless one knows and acknowledges that "UNICODE" represents a
"UTF-16 encoding", we can only hope that UAs will treat it as such ...

It has been said about HTML5 that it should be specific enough that it is
possible to build a Web compatible parser based on it. And it could seem as if
charset="UNICODE" is necessary to mention for that reason.
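
A minimal sketch, in Python, of what such an explicit mention could amount to.
The names UTF16_LABELS and meta_charset_to_encoding are hypothetical, and
treating "unicode" as a UTF-16 label is what this bug proposes, not something
the current spec text states:

```python
# Hypothetical label set: the UTF-16 names plus the proposed "unicode" alias.
UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be", "unicode"}


def meta_charset_to_encoding(charset: str) -> str:
    """Sketch of sub-step 13: a UTF-16 label found in <meta> becomes UTF-8."""
    # lower() is sufficient here because encoding labels are ASCII.
    label = charset.strip().lower()
    if label in UTF16_LABELS:
        return "utf-8"
    return label
```

With such a table, charset="UNICODE" and charset="UTF-16" would take the same
path through sub-step 13.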

> That is, sub-step 13 does not say something like:
> 
> "If the value of the charset attribute is an ASCII case-insensitive match of an
> IANA-registered name or alias of a UTF-16 encoding, ..."
> 
> rather, the language of sub-step 13 simply says:
> 
> "If charset is a UTF-16 encoding..."
> 
> leaving it to the imagination of the reader (and the vagaries of the
> implementation) to interpret this as desired, including an interpretation that
> permits recognizing aliases that are not IANA-registered.

That's a possibility. But see my last point above.

> Note that any use of step 5 sub-step 13 occurs only when (1) there is no user
> specified encoding override, (2) there is no transport layer supplied character
> encoding metadata, and (3) there is no BOM.

If you download such a page and open it from the hard disk in Firefox, it will
default to the locale encoding instead of to UTF-8.

W.r.t. the BOM: hm, yes, it could seem as if MSHTML tends to add the BOM whenever
the "UNICODE" charset is used. So that's a thing that perhaps diminishes the
problem compared to the alternative - that MSHTML did not add the BOM.
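
To illustrate why a BOM diminishes the problem: a rough Python sketch of BOM
detection, which happens before any <meta> declaration is consulted. The
function name sniff_bom is hypothetical; the byte sequences are the standard
BOMs for UTF-8 and UTF-16:

```python
def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16le"
    return None
```

A page saved with a BOM is decoded according to the BOM, so its
charset="UNICODE" declaration never reaches sub-step 13.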

Btw, it seems like e.g. BBEdit/TextWrangler (the famous Macintosh text editor)
recognizes "UNICODE" to mean "UTF-16".

For UTF-8 encoded pages, an understanding of what "UNICODE" means allows
e.g. Validator.nu to give specific advice, like "Replace UNICODE with UTF-8"
instead of only "replace UNICODE with a valid name".
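
As a sketch only (this is not Validator.nu's actual code or output; the
function charset_advice and its parameters are hypothetical), the difference
could look like this in Python:

```python
def charset_advice(declared: str, actual_encoding: str) -> str:
    """Pick a message for a page whose declared charset is not a valid name."""
    # Specific advice is possible only if "UNICODE" is a known label
    # and the page was in fact detected as UTF-8.
    if declared.strip().lower() == "unicode" and actual_encoding.lower() == "utf-8":
        return "Replace UNICODE with UTF-8"
    # Otherwise only generic advice can be given.
    return f"Replace {declared.strip()} with a valid name"
```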

> Overall, I have to wonder at the utility of your proposal, whether or not such
> an alias exists de facto or de jure.

You will find quite a lot of author confusion around "UNICODE" as an
encoding name. But the ultimate proof is of course a page that gets interpreted
in WebKit and IE but not in Firefox and Opera. I suppose, seek and you shall find.

> If there is a bug here, it is probably that sub-step 13 does not refer to the
> language in 8.2.2.2 [5], especially the 3rd and 4th paragraphs.
> 
> [5] http://dev.w3.org/html5/spec/Overview.html#character-encodings-0

Perhaps "UNICODE" should be added to that Character Encoding Overrides table
there ...

> In general, I oppose your proposal on the grounds that it is already
> inconsistent with the spirit of 4.2.5.5 [1] cited above.

It is clear that it is already illegal to use charset=UNICODE - we don't need to
change anything for that to be clear. But my proposal does not make it any more
legal. It instead helps us to have an authoritative answer w.r.t. how to help
authors who mistakenly use charset=UNICODE.

> As for registering an alias independently of what HTML5 makes use of it, the
> Unicode Consortium would be the appropriate party to take up that issue, not
> the HTML WG. I have forwarded a link to this thread to the Unicode Consortium 
> in case they wish to address this matter further. I can't comment on their
> possible position on the issue of registering "UNICODE" as an alias for
> "UTF-16", but I would speculate that they may not support the idea.

Thank you for notifying them!

Received on Monday, 12 December 2011 05:31:52 UTC