[whatwg] Unicode as alias for UTF-16 (was Re: Default encoding to UTF-8?) from Leif Halvard Silli on 2011-12-22 (public-whatwg-archive@w3.org from December 2011)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Thu, 22 Dec 2011 09:59:43 +0100
Message-ID: <20111222095943412823.f421a7c2@xn--mlform-iua.no>
Henri Sivonen on Tue Dec 20 01:13:45 PST 2011:
> On Mon, Dec 19, 2011 at 9:44 PM, L. David Baron wrote:

>>> > I discovered that "UNICODE" is
>>> > used as alias for "UTF-16" in IE and Webkit.
>>> ...
>>> > Seemingly, this has not affected Firefox users too much.
>>>
>>> It surprises me greatly that Gecko doesn't treat "unicode" as an alias
>>> for "utf-16".
>>
>> Why?
> 
> From playing with IE, I thought it was known that "unicode" is an
> alias for "utf-16" and it had never occurred to me to check if that
> was true in Gecko.

MS 'unicode' is only to a 50% degree (sic) an alias for 'utf-16', 
namely for the *little-endian* "half" of *UTF-16*. (Thus: It is not 
UTF-16LE, since MS 'unicode' usually includes the BOM.)  There is also 
MS 'unicodeFFFE' that represents big-endian UTF-16. See: 
http://mail.apps.ietf.org/ietf/charsets/msg02030.html
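
To make the distinction concrete, here is a minimal Python sketch of how the two MS labels could be mapped to byte orders. The table `MS_LABELS` and the helper `decode_ms_label` are hypothetical names of mine; they encode only the behaviour described above and are not a verified list of what IE or Webkit accept.

```python
import codecs

# Hypothetical alias table: it encodes only the behaviour described above,
# not a verified list of what any browser accepts.
MS_LABELS = {
    "unicode": ("utf-16-le", codecs.BOM_UTF16_LE),      # little-endian, BOM usually present
    "unicodefffe": ("utf-16-be", codecs.BOM_UTF16_BE),  # big-endian
}

def decode_ms_label(data: bytes, label: str) -> str:
    """Decode data according to an MS-style label, dropping a leading BOM if present."""
    codec, bom = MS_LABELS[label.strip().lower()]
    if data.startswith(bom):
        data = data[len(bom):]
    return data.decode(codec)

little = codecs.BOM_UTF16_LE + "hei".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "hei".encode("utf-16-be")
print(decode_ms_label(little, "unicode"))      # hei
print(decode_ms_label(big, "unicodeFFFE"))     # hei
```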

>> If it's not needed, why shouldn't WebKit and IE drop it?

Actually, UTF-16 fails in Webkit much, much more often than in any 
other browser. E.g. this page (not that it is related, though) is labelled 
as MS 'unicode': http://sacredheartbayhead.com/. Firefox, Opera and IE 
all display it. But Chrome/Safari fail to detect the encoding.

So although Webkit aligns with IE by understanding MS 'unicode' and 
MS 'unicodeFFFE', it does other things wrong when it comes to UTF-16. 
So, you should only look at Webkit if you want to see how well a 
browser can do in the market when it has below average UTF-16 support 
... (Chrome may be a bit better than Safari, though - Chrome at least 
allows me to *select* UTF-16, whereas Safari does not offer UTF-16 in 
its encoding menu. Chrome also uses character set detection more 
actively.)

> Needed is relative. So far, I haven't seen data about how much
> existing content there is out there that depends on this. It could be
> that some users somewhere have rejected Firefox or Opera for this and
> there just isn't enough of a feedback loop.

Feedback loop for you: look at UTF-16LE or UTF-16BE pages without any other 
encoding info. (The HTML5 encoding sniffing tells UAs to *do* read the 
meta @charset *if* all other tests fail.) And, voilà, I just now found 
one such page: <http://www.hughesrenier.be/actualites.html>. This page 
works fine in IE - and IE only. (That it fails in Webkit is because of 
some bug in its encoding sniffing - see below.) Offline, on my 
computer, when I switched the value of the meta @charset for that page 
to 'UTF-16', then Firefox and Opera would also pick up the encoding. 
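
The difference can be illustrated with a toy sniffer in Python. This is a rough sketch, not the HTML5 algorithm and not any browser's code: the function `sniff`, the `known_labels` tables, and in particular the step of dropping NUL bytes so that the meta element of a BOM-less UTF-16 page becomes searchable are all assumptions of mine, made only for illustration.

```python
import re

# Matches charset=<label> inside markup once the bytes are ASCII-searchable.
META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", re.I)

def sniff(data: bytes, known_labels: dict):
    """Return a codec name: BOM first, then the meta charset label, else None."""
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    # Assumption for illustration: strip NUL bytes so ASCII-range markup
    # in a BOM-less UTF-16 page can be searched for a meta charset.
    match = META_CHARSET.search(data.replace(b"\x00", b""))
    if match:
        label = match.group(1).decode("ascii").lower()
        return known_labels.get(label)  # None if the label is not recognised
    return None

page = '<meta content="text/html; charset=unicode">'.encode("utf-16-le")

# Hypothetical label tables: one without the 'unicode' alias, one with it.
without_alias = {"utf-16": "utf-16-le"}
with_alias = {"utf-16": "utf-16-le", "unicode": "utf-16-le"}

print(sniff(page, without_alias))  # None
print(sniff(page, with_alias))     # utf-16-le
```

With the label table that lacks 'unicode', the sniffer finds the declaration but cannot act on it, which is the kind of outcome described above for Firefox and Opera.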

   Other pages of the same kind: 
<http://www.sunsetridgebusinesspark.com/BusinessListing.html>
<http://www.rpmcmillen.com/taxes.html>
<http://www.hughesrenier.be/illustration.html>
<http://memphismitchellathletics.com/pages/2010football.html>

   There are also pages like these, which work fine in IE, but which 
in Firefox, if I manually select UTF-16, display 
broken-character signs - I don't know if the UTF-16 code is buggy?:
<http://www.casamobile.org/BoardMembersStaff.html>
<http://comfortablerentals.com/Our%20Services.html>
<http://lergp.cce.cornell.edu/IPM/Home.htm>
<http://www.belpaese2000.narod.ru/Teca/Nove/Deledda/nov/regina.htm>
<http://www.belpaese2000.narod.ru/Teca/Nove/Deledda/nov/macchie.htm>
<http://web.tiscali.it/marcokiller/Mappa_del_sito.htm>
<http://familienlundorff.dk/familienLundorff.dk/genealogi/Andreas_1769/Niels_1813_Johanne_1854.html>
<http://www.prcflow.com/orifice_meter_runs_plates.htm>
<http://healthactioncenter.com/aboutus.htm>
<http://www.belpaese2000.narod.ru/Teca/Nove/Deledda/nov/mago.htm>
<http://www.trascaucristian.3x.ro/> (shows BOM sign)
<http://www.casamobile.org/history.html>
<http://www.hawkpages.com/> (See 'embedded' code on right page side)

I found them via Google, which for certain UTF-16 pages renders the 
source code as search result (which makes Google Search very similar to 
how Webkit handles UTF-16, btw):
<http://www.google.com/search?q=%22%3Cmeta+content%3D%27text/html%3B+charset%3Dunicode%27%22>

Not the same thing, but speaking about necessity: This page declares 
"UTF-8" 3 times and also includes the BOM. However, the HTTP 
charset says ISO-8859-1, and hence ... the page fails in Firefox and 
Opera, but not in Webkit and IE: <http://www.bozze.1.vg/>.
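
One way to read that result is as a difference in precedence between the BOM and the HTTP charset. The Python sketch below shows the two orderings side by side. That Webkit and IE let the BOM win is my inference from the observed behaviour, not something confirmed; the function `pick_encoding`, its `bom_wins` flag and the 'windows-1252' fallback are likewise assumptions for illustration.

```python
import codecs

def pick_encoding(data: bytes, http_charset, bom_wins: bool) -> str:
    """Choose an encoding from a UTF-8 BOM and the HTTP charset, in either order."""
    bom_encoding = "utf-8" if data.startswith(codecs.BOM_UTF8) else None
    if bom_wins and bom_encoding:
        return bom_encoding
    # Assumed last-resort default; real browsers pick theirs by locale.
    return http_charset or bom_encoding or "windows-1252"

page = codecs.BOM_UTF8 + "blåbær".encode("utf-8")

print(pick_encoding(page, "iso-8859-1", bom_wins=True))   # utf-8
print(pick_encoding(page, "iso-8859-1", bom_wins=False))  # iso-8859-1
```

Decoding the same bytes as ISO-8859-1 turns every non-ASCII letter into two wrong characters, which is what "fails" looks like on such a page.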

> Maybe it isn't needed, but it seems that from the WebKit or IE point
> of view, the potential upside from dropping this alias is about
> non-existent while there could be a downside. I'd expect it to be hard
> to get IE and WebKit to drop the alias.

Btw, one thing: A big source of Google findings for the search string 
"<meta content='text/html; charset=unicode'" seems to be HTML 
attachments (from MS Word users) in e-mail messages to mailing lists. 
Example: 
http://stsk.no/pipermail/drill-aspiranter_stsk.no/attachments/20101230/8335fbe4/attachment-0001.html
-- 
Leif Halvard Silli
Received on Thursday, 22 December 2011 00:59:43 UTC