RE: Unicode encoding for web pages from McDonald, Ira on 2005-03-30 (www-international@w3.org from January to March 2005)

From: McDonald, Ira <imcdonald@sharplabs.com>
Date: Wed, 30 Mar 2005 08:07:10 -0800
To: "'Chris Lilley'" <chris@w3.org>, Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>
Cc: www-international@w3.org
Message-ID: <CFEE79A465B35C4385389BA5866BEDF00C7B11@mailsrvnt02.enet.sharplabs.com>

Hi,

And for what it's worth, the IETF formally requires that UTF-8
must be supported in transferring human-readable text over any
Internet protocol (including HTTP/1.1) and has done so for a
_long_ time.  See RFC 2277 (January 1998), which specifically
prohibits (for example) UTF-16-only support (without UTF-8).

If you encode a page in UTF-16, there's a fair chance that an
intermediary is going to convert it into UTF-8 before delivery
anyway.  The "benefits" of UTF-16 disappeared once Plane 0
stopped being the only plane with useful, assigned Unicode
code points (for example, all the interesting math and musical
notation is outside Plane 0).
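
To illustrate the point about characters outside Plane 0, here is a
minimal Python sketch. The choice of U+1D11E (MUSICAL SYMBOL G CLEF) as
the sample character is mine, not something from the thread. It shows
that such a character takes a surrogate pair in UTF-16, so it costs four
bytes in either encoding.

```python
# Sample character chosen for illustration: U+1D11E MUSICAL SYMBOL G CLEF,
# which lies in Plane 1, outside Plane 0 (the Basic Multilingual Plane).
clef = "\U0001D11E"

utf8_bytes = clef.encode("utf-8")
utf16_bytes = clef.encode("utf-16-be")  # big-endian, no byte order mark

# Both encodings need four bytes for this one character.
utf8_len = len(utf8_bytes)
utf16_len = len(utf16_bytes)

# In UTF-16 the four bytes are two 16-bit code units: a surrogate pair.
surrogate_pair_hex = utf16_bytes.hex()

print(utf8_len, utf16_len, surrogate_pair_hex)
```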

Cheers,
- Ira

Ira McDonald (Musician / Software Architect)
Blue Roof Music / High North Inc
PO Box 221  Grand Marais, MI  49839
phone: +1-906-494-2434
email: imcdonald@sharplabs.com

-----Original Message-----
From: www-international-request@w3.org
[mailto:www-international-request@w3.org]On Behalf Of Chris Lilley
Sent: Wednesday, March 30, 2005 9:29 AM
To: Deborah Cawkwell
Cc: www-international@w3.org
Subject: Re: Unicode encoding for web pages



On Wednesday, March 30, 2005, 2:45:27 PM, Deborah wrote:

DC> For web pages, would you consider using a Unicode encoding
DC> other than UTF-8, eg UTF-16? If so, why? or why not?
 
I used to consider that UTF-16 would provide a space saving benefit for
those languages where a single character runs to three or four bytes in
UTF-8. It turns out that if there is a fairly small amount of markup,
this space saving is not seen in practice.
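
A rough way to check this for a given page is to encode the same text
both ways and compare byte counts. The sketch below is only an
illustration: the helper name `encoded_sizes` and the sample strings are
assumptions of mine, and real pages will vary with their ratio of markup
to text.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Pure CJK text: each character is 3 bytes in UTF-8 but 2 bytes in UTF-16.
text_only = "日本語" * 10

# The same kind of text wrapped in a little ASCII markup: every ASCII
# character is 1 byte in UTF-8 but 2 bytes in UTF-16.
page = '<p class="note">' + "日本語" + "</p>"

text_sizes = encoded_sizes(text_only)
page_sizes = encoded_sizes(page)

print("text only (utf-8, utf-16):", text_sizes)
print("with markup (utf-8, utf-16):", page_sizes)
```

In this small sample UTF-16 wins for the bare text, but the markup is
enough to tip the balance back to UTF-8.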

I understand that in well-optimised Web Services applications with high
throughput, profiling shows that UTF-8 to UTF-16 conversion (eg, to
construct a DOM) can become significant, so one would imagine shipping
content in UTF-16 might help there also.

I could not see any particular reason to use UTF-7.

For material where a) random access is a high priority and b) there is
significant usage of characters that would require surrogates, using
UCS-4 might be a benefit.
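
The random access argument can be sketched as follows. This is an
illustration under assumptions: it uses Python's "utf-32-be" codec as a
stand-in for UCS-4, and the helper `char_at` and the sample string are
hypothetical names of mine.

```python
def char_at(buf, index):
    """Return the character at position index in a UTF-32-BE byte buffer.

    Every character occupies exactly 4 bytes, so the byte offset is
    simply 4 * index and no scanning is needed.
    """
    return buf[4 * index:4 * index + 4].decode("utf-32-be")

# Sample text with one character (U+1D11E) that needs surrogates in UTF-16.
sample = "a\U0001D11Eb"

utf32_buf = sample.encode("utf-32-be")
utf16_buf = sample.encode("utf-16-be")

# Fixed width: character 2 is found directly at byte offset 8.
third_char = char_at(utf32_buf, 2)

# UTF-16 uses 4 code units for these 3 characters, so code unit index
# and character index no longer line up after the surrogate pair.
utf16_code_units = len(utf16_buf) // 2

print(third_char, utf16_code_units)
```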

So in general, and particularly for XML where a parser is not required
to understand encodings other than UTF-8 and UTF-16, I see less and less
reason to use anything other than UTF-8.


-- 
 Chris Lilley                    mailto:chris@w3.org
 Chair, W3C SVG Working Group
 W3C Graphics Activity Lead

Received on Wednesday, 30 March 2005 16:07:31 UTC