
[Bug 15142] New: Define "UNICODE" as a defacto alias for "UTF-16"

From: <bugzilla@jessica.w3.org>
Date: Sun, 11 Dec 2011 06:35:25 +0000
To: public-html@w3.org
Message-ID: <bug-15142-2495@http.www.w3.org/Bugs/Public/>
https://www.w3.org/Bugs/Public/show_bug.cgi?id=15142

           Summary: Define "UNICODE" as a defacto alias for "UTF-16"
           Product: HTML WG
           Version: unspecified
          Platform: PC
               URL: http://dev.w3.org/html5/spec/parsing#encoding-sniffing
                    -algorithm
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P3
         Component: HTML5 spec (editor: Ian Hickson)
        AssignedTo: ian@hixie.ch
        ReportedBy: xn--mlform-iua@xn--mlform-iua.no
         QAContact: public-html-bugzilla@w3.org
                CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
                    public-html@w3.org


PROPOSAL:

   Define "UNICODE" as a de facto alias for "UTF-16".

   This has 3 implications for encoding determination:

   (1) When "UNICODE" is found inside a data: URI or inside an HTTP
        Content-Type: header, parse the resource as UTF-16 encoded.


   (2) When "UNICODE" occurs inside <meta charset=*> or
        the Content-Type pragma, treat it as "UTF-8", as
        specified in the following step of the Encoding Sniffing Algorithm:

[[ 13. If charset is a UTF-16 encoding, change the value of charset to UTF-8.
]]


   (3) When "UNICODE" occurs inside an XML file, treat it as
        a legal encoding name that nevertheless gets ignored
        (meaning that parsers fall back to their default instead,
        UTF-16 or UTF-8).
        Of course, HTML5 doesn't say how XML parsing should
        work, but I mention it for completeness.
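
The three cases above can be sketched as a small lookup. This is only an illustration of the proposal, not text from the HTML5 spec: the names resolve_label, Source and UTF16_LABELS are made up for this sketch, and returning None is an assumed way of saying "ignore the label and let the parser use its own default".

```python
from enum import Enum
from typing import Optional


class Source(Enum):
    """Where the encoding label was found (hypothetical categories)."""
    HTTP_HEADER = "http"   # HTTP Content-Type: header
    DATA_URI = "data"      # data: URI
    META = "meta"          # <meta charset> or the Content-Type pragma
    XML_DECL = "xml"       # encoding declaration inside an XML file


# Assumed set of labels counted as "a UTF-16 encoding" for step 13.
UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be"}


def resolve_label(label: str, source: Source) -> Optional[str]:
    """Return the encoding to use, or None to fall back to the parser default."""
    name = label.strip().lower()
    if name == "unicode":
        if source is Source.XML_DECL:
            return None        # case (3): legal name, but ignored
        name = "utf-16"        # proposed de facto alias
    if source is Source.META and name in UTF16_LABELS:
        return "utf-8"         # case (2): step 13 of the sniffing algorithm
    return name                # case (1) for "unicode"; other labels unchanged
```

With this sketch, resolve_label("UNICODE", Source.HTTP_HEADER) gives "utf-16", while the same label from Source.META gives "utf-8".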



JUSTIFICATION:

   JUSTIFICATION for HTML:
   ==============================================================

*  <meta charset=UNICODE > works like <meta charset=UTF-8>
    in Webkit (Chrome/Safari) and IE (I checked IE6-IE9).
    This makes sense once one realises that they see it as a synonym
    for UTF-16. (Opera/Firefox do not yet behave this way.)

*  IE (MSHTML) may save pages with the following charset declaration,
    either by default or through user interaction:
    <META content="text/html; charset=unicode" http-equiv=Content-Type>
   (taken from:
<http://lists.whatwg.org/pipermail/help-whatwg.org/attachments/20091203/e117921b/attachment.htm>)
    In the Save menu of IE8, there are two menu items with the value
    "UNICODE" - probably one is UTF-8 and one is UTF-16.

*  There are (thus) numerous pages on the Web which use "charset=UNICODE".
    - Opera's MAMA project lists 'UNICODE' as the 29th most used value:
       http://devfiles.myopera.com/articles/575/metacenc-url.htm
    - 150,000 Google hits: http://tinyurl.com/charset-unicode
    - scraping the Web would find many, many more


   JUSTIFICATION for higher-level protocols (HTTP), XML and data URIs:
   ==============================================================

   The value "UNICODE" is treated

 * as UTF-16 - for HTTP, in XML and in data: URIs - by IE
 * as UTF-16 - for HTTP and in data: URIs - by Webkit
 * as a legal but ignored encoding name - in XML - by Firefox/Webkit/Opera
 * as unknown, causing the locale default - for HTTP and in data: URIs - by Firefox/Opera

    NOTE: For HTTP Content-Type: and for data: URIs, "UNICODE"
               is treated (by IE and Webkit) as "UTF-16" regardless of
               whether the document serialisation is HTML or XML.
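
For easier comparison, the behaviour listed above can be written down as a table. This is just a transcription of the observations in this report into a Python dict; the names OBSERVED and engines_treating_as_utf16, and the strings "ignored" and "locale default", are labels chosen for this sketch.

```python
from typing import Dict, List

# How each engine treats the label "UNICODE" outside <meta>,
# as reported in the list above (not independently re-tested here).
OBSERVED: Dict[str, Dict[str, str]] = {
    "IE":      {"http": "utf-16", "data": "utf-16", "xml": "utf-16"},
    "Webkit":  {"http": "utf-16", "data": "utf-16", "xml": "ignored"},
    "Firefox": {"http": "locale default", "data": "locale default", "xml": "ignored"},
    "Opera":   {"http": "locale default", "data": "locale default", "xml": "ignored"},
}


def engines_treating_as_utf16(context: str) -> List[str]:
    """List the engines that read "UNICODE" as UTF-16 in the given context."""
    return sorted(
        engine for engine, row in OBSERVED.items() if row[context] == "utf-16"
    )
```

For example, engines_treating_as_utf16("http") lists IE and Webkit, while for "xml" only IE remains.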


IANA registration?

    "UNICODE" should probably be registered as an official alias for "UTF-16":
http://www.iana.org/assignments/character-sets

-- 
Configure bugmail: https://www.w3.org/Bugs/Public/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
Received on Sunday, 11 December 2011 09:48:25 GMT
