- From: <bugzilla@jessica.w3.org>
- Date: Sun, 11 Dec 2011 06:33:12 +0000
- To: public-html-bugzilla@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15142
Summary: Define "UNICODE" as a de facto alias for "UTF-16"
Product: HTML WG
Version: unspecified
Platform: PC
URL: http://dev.w3.org/html5/spec/parsing#encoding-sniffing-algorithm
OS/Version: All
Status: NEW
Severity: normal
Priority: P3
Component: HTML5 spec (editor: Ian Hickson)
AssignedTo: ian@hixie.ch
ReportedBy: xn--mlform-iua@xn--mlform-iua.no
QAContact: public-html-bugzilla@w3.org
CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
public-html@w3.org
PROPOSAL:
Define "UNICODE" as a de facto alias for "UTF-16".
This has 3 implications for encoding determination:
(1) When "UNICODE" is found inside a data: URI or inside an HTTP
Content-Type: header, then parse the resource as UTF-16 encoded.
(2) When "UNICODE" occurs inside <meta charset=*> or
the Content-Type pragma, then treat it as "UTF-8", as
specced in the Encoding Sniffing Algorithm's following step:
]] 13. If charset is a UTF-16 encoding, change the value of charset to UTF-8.
[[
(3) When "UNICODE" occurs inside an XML file, then treat it as
a legal encoding name that nevertheless gets ignored
(meaning that parsers instead default to UTF-16 or UTF-8).
Of course, HTML5 doesn't say how XML parsing should
work, but I mention it for completeness.
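
The three implications above can be sketched as a single lookup. This is only an illustration of the proposal, not spec text: the function name resolve_label, the context names "transport", "meta" and "xml", and the alias table are all hypothetical, and returning None is just one way to model "legal but ignored".

```python
from typing import Optional

# Hypothetical alias table; only the "unicode" entry comes from the proposal.
ALIASES = {"unicode": "utf-16"}

# Labels treated as "a UTF-16 encoding" for the purpose of step 13.
UTF16_FAMILY = {"utf-16", "utf-16le", "utf-16be"}


def resolve_label(label: str, context: str) -> Optional[str]:
    """Return the encoding to use for `label`, or None if it is ignored.

    context is one of (names are assumptions made for this sketch):
      "transport" - HTTP Content-Type: header or data: URI
      "meta"      - <meta charset> or the Content-Type pragma
      "xml"       - encoding declaration inside an XML file
    """
    name = label.strip().lower()
    is_alias = name in ALIASES
    name = ALIASES.get(name, name)

    if context == "transport":
        # (1) "UNICODE" simply means UTF-16.
        return name
    if context == "meta":
        # (2) Step 13 of the sniffing algorithm: UTF-16 becomes UTF-8.
        return "utf-8" if name in UTF16_FAMILY else name
    if context == "xml":
        # (3) Legal but ignored; the parser falls back to its own default.
        return None if is_alias else name
    raise ValueError("unknown context: " + context)
```

For example, resolve_label("UNICODE", "meta") gives "utf-8", while the same label in the "transport" context gives "utf-16".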
JUSTIFICATION:
JUSTIFICATION for HTML:
==============================================================
* <meta charset=UNICODE > works like <meta charset=UTF-8>
in Webkit (Chrome/Safari) and IE (I checked IE6-IE9).
This makes sense, once one realises that they see it as a synonym
for UTF-16. (Opera/Firefox do not yet behave this way.)
* IE (MSHTML) may save pages with the following charset declaration,
either by default or via the user's interaction:
<META content="text/html; charset=unicode" http-equiv=Content-Type>
(taken from:
<http://lists.whatwg.org/pipermail/help-whatwg.org/attachments/20091203/e117921b/attachment.htm>)
In the Save menu of IE8, there are two menu items with the value
"UNICODE" - probably one is UTF-8 and one is UTF-16.
* There are (thus) numerous pages on the Web which use "charset=UNICODE".
- Opera's MAMA project lists 'UNICODE' as the 29th most used value
http://devfiles.myopera.com/articles/575/metacenc-url.htm
- 150.000 Google hits: http://tinyurl.com/charset-unicode
- scraping the Web would find many, many more
JUSTIFICATION for Higher protocols (HTTP), XML and data URIs:
==============================================================
The value "UNICODE" is treated
* as UTF-16 - for HTTP, in XML and in data: URIs - by IE
* as UTF-16 - for HTTP & in data: URIs - by Webkit
* as a legal but ignored encoding name - for XML - by Firefox/Webkit/Opera
* as unknown, causing the locale default - for HTTP & data: URIs - by Firefox/Opera
NOTE: For HTTP Content-Type: and for data: URIs, "UNICODE"
is treated (by IE and Webkit) as "UTF-16" regardless of
whether the document serialisation is HTML or XML.
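
For reference, the observations above can be written down as a small lookup table. This is merely a transcription of the list in this report, not the result of new tests; the names OBSERVED and browsers_matching, and the values "ignored" and "unknown", are made up for the sketch.

```python
# (browser, context) -> observed treatment of the label "UNICODE".
# "ignored": accepted as a legal name but has no effect.
# "unknown": not recognised, so the locale default encoding is used.
OBSERVED = {
    ("IE", "http"): "utf-16",
    ("IE", "xml"): "utf-16",
    ("IE", "data"): "utf-16",
    ("Webkit", "http"): "utf-16",
    ("Webkit", "data"): "utf-16",
    ("Webkit", "xml"): "ignored",
    ("Firefox", "xml"): "ignored",
    ("Opera", "xml"): "ignored",
    ("Firefox", "http"): "unknown",
    ("Firefox", "data"): "unknown",
    ("Opera", "http"): "unknown",
    ("Opera", "data"): "unknown",
}


def browsers_matching(context, behaviour):
    """List the browsers showing `behaviour` in `context`, sorted by name."""
    return sorted(
        browser
        for (browser, ctx), seen in OBSERVED.items()
        if ctx == context and seen == behaviour
    )
```

Calling browsers_matching("http", "utf-16") lists the engines that already treat "UNICODE" as UTF-16 over HTTP.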
IANA registration?
"UNICODE" should probably be registered as an official alias for "UTF-16":
http://www.iana.org/assignments/character-sets
Received on Sunday, 11 December 2011 06:35:29 UTC