- From: <bugzilla@jessica.w3.org>
- Date: Sun, 11 Dec 2011 06:35:25 +0000
- To: public-html@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15142

           Summary: Define "UNICODE" as a defacto alias for "UTF-16"
           Product: HTML WG
           Version: unspecified
          Platform: PC
               URL: http://dev.w3.org/html5/spec/parsing#encoding-sniffing-algorithm
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P3
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: xn--mlform-iua@xn--mlform-iua.no
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org

PROPOSAL: Define "UNICODE" as a de facto alias for "UTF-16".

This has 3 implications for encoding determination:

(1) When "UNICODE" is found inside a data: URI or inside an HTTP
    Content-Type: header, parse the resource as UTF-16 encoded.

(2) When "UNICODE" occurs inside <meta charset=*> or the Content-Type
    pragma, treat it as "UTF-8", as specced in the following step of the
    Encoding Sniffing Algorithm:

    ]] 13. If charset is a UTF-16 encoding, change the value of charset
           to UTF-8. [[

(3) When "UNICODE" occurs inside an XML file, treat it as a legal
    encoding name that nevertheless gets ignored (meaning that the parser
    falls back to its default instead: UTF-16 or UTF-8). Of course, HTML5
    doesn't tell how XML parsing should work, but I say it for
    completeness.

JUSTIFICATION:

JUSTIFICATION for HTML:
==============================================================

* <meta charset=UNICODE> works like <meta charset=UTF-8> in WebKit
  (Chrome/Safari) and IE (I checked IE6-IE9). This makes sense once one
  realises that they see it as a synonym for UTF-16. (Opera/Firefox do
  not yet behave this way.)
* IE (MSHTML) may save pages with the following charset declaration,
  either by default or via the user's interaction:

  <META content="text/html; charset=unicode" http-equiv=Content-Type>

  (taken from:
  <http://lists.whatwg.org/pipermail/help-whatwg.org/attachments/20091203/e117921b/attachment.htm>)

  In the Save menu of IE8, there are two menu items with the value
  "UNICODE" - probably one is UTF-8 and one is UTF-16.

* There are (thus) numerous pages on the Web which use
  "charset=UNICODE".
  - Opera's MAMA project lists 'UNICODE' as the 29th most used value:
    http://devfiles.myopera.com/articles/575/metacenc-url.htm
  - 150,000 Google hits: http://tinyurl.com/charset-unicode
  - scraping the Web would find many, many more

JUSTIFICATION for Higher protocols (HTTP), XML and data URIs:
==============================================================

The value "UNICODE" is treated

* as UTF-16 - for HTTP, in XML and in data: URIs - by IE
* as UTF-16 - for HTTP & in data: URIs - by WebKit
* as - for XML - a legal but ignored encoding name: Firefox/WebKit/Opera
* (for HTTP & data:) as unknown, causing locale default: Firefox/Opera

NOTE: For HTTP Content-Type: and for data: URIs, "UNICODE" is treated
(by IE and WebKit) as "UTF-16" regardless of whether the document
serialisation is HTML or XML.

IANA registration?

"UNICODE" should probably be registered as an official alias for
"UTF-16": http://www.iana.org/assignments/character-sets

-- 
Configure bugmail: https://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
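
For illustration only, the three implications of the proposal could be
sketched in Python roughly as follows. This is a sketch under stated
assumptions, not spec text and not any browser's code: the function name
effective_encoding, the context names ("http", "data", "meta", "xml") and
the convention of returning None for "ignored" are all hypothetical.

```python
from typing import Optional

# Labels treated as "a UTF-16 encoding" for the purposes of step 13.
# (Assumption: this small set is enough for the illustration.)
UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be"}

# Hypothetical context names; they are not taken from the spec.
CONTEXTS = {"http", "data", "meta", "xml"}


def effective_encoding(label: str, context: str) -> Optional[str]:
    """Return the encoding a parser would use for `label` in `context`.

    None means "label accepted but ignored; the parser falls back to its
    own default", which is implication (3) of the proposal.
    """
    if context not in CONTEXTS:
        raise ValueError(f"unknown context: {context!r}")

    name = label.strip().lower()

    if name == "unicode":
        if context == "xml":
            # (3) legal but ignored inside an XML file
            return None
        # The proposal itself: "UNICODE" is an alias for "UTF-16".
        name = "utf-16"

    if context == "meta" and name in UTF16_LABELS:
        # (2) step 13: a UTF-16 encoding found in <meta> becomes UTF-8
        return "utf-8"

    # (1) HTTP Content-Type and data: URIs use the label as resolved
    return name
```

With this sketch, effective_encoding("UNICODE", "http") gives "utf-16",
while effective_encoding("UNICODE", "meta") gives "utf-8", which matches
the IE/WebKit behaviour described above.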
Received on Sunday, 11 December 2011 09:48:25 UTC