- From: Tex Texin <tex@i18nguy.com>
- Date: Wed, 30 Mar 2005 12:01:43 -0800
- To: "McDonald, Ira" <imcdonald@sharplabs.com>
- CC: 'Chris Lilley' <chris@w3.org>, Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>, www-international@w3.org
Hi,

There are several tradeoffs between utf-8 and utf-16.

Size has been mentioned, and that depends on the nature of the data: the distribution of languages or characters used. This is colored by the markup, metadata, scripts, media or other information that can be enclosed with the text of the page. Markup of course tends towards ASCII.

The cost of conversion between 8 and 16 is very small, and for many situations you can get a significantly bigger performance improvement by optimizing other aspects of the application than by either eliminating the conversion or changing your base encoding.

However, it pays to consider the nature of the application and what is actually done with the data. Many applications primarily move data back and forth from screen, data and other buffers to databases and back, and don't do much actual modification or linguistic operations (search, etc.) with it. For those applications, since they are just moving bytes back and forth, the conversion is needless and there is no benefit. May as well leave the data as-is.

On the other hand, applications that intensively linguistically process text will benefit in terms of CPU cycles from using utf-16. The CPU benefit can be outweighed, though, if the data access is slowed by the growth in size from utf-8 to utf-16, for example if more disk reads are needed.

Then there is transmission cost. Sending more bytes over the net can be prohibitive.

So:

For small pages, or pages that are dominated by non-textual data, or pages that are dominated by ideographic languages, UTF-16 is fine and can be an improvement if the data tends to be more compressed in utf-16.

For data that is intensively linguistically processed, utf-16 is better and can benefit even if there is some conversion overhead. So you might use utf-16 internally or on a backend, even if the pages are utf-8 and you have to convert.

For data that is only moved around and not processed, you might look at language usage and choose the more compressed form of UTF, so that i/o (disk reads/writes) doesn't impact performance. Net transmission cost is often the biggest performance impediment, so again size is the biggest consideration.

hth
tex

"McDonald, Ira" wrote:
>
> Hi,
>
> And for what it's worth, the IETF formally requires that UTF-8
> must be supported in transferring human-readable text over any
> Internet protocol (including HTTP/1.1) and has done so for a
> _long_ time. See RFC 2277 (January 1998) which specifically
> prohibits (for example) UTF-16 only support (without UTF-8).
>
> If you encode a page in UTF-16, there's a fair chance that an
> intermediary is going to convert it into UTF-8 before delivery
> anyway. The "benefits" of UTF-16 disappeared after Plane 0
> stopped being the only useful and assigned Unicode codepoints
> (for example, all the interesting math and musical notation
> is not in Plane 0).
>
> Cheers,
> - Ira
>
> Ira McDonald (Musician / Software Architect)
> Blue Roof Music / High North Inc
> PO Box 221 Grand Marais, MI 49839
> phone: +1-906-494-2434
> email: imcdonald@sharplabs.com
>
> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of Chris Lilley
> Sent: Wednesday, March 30, 2005 9:29 AM
> To: Deborah Cawkwell
> Cc: www-international@w3.org
> Subject: Re: Unicode encoding for web pages
>
> On Wednesday, March 30, 2005, 2:45:27 PM, Deborah wrote:
>
> DC> For web pages, would you consider using a Unicode encoding
> DC> other than UTF-8, eg UTF-16? If so, why? or why not?
>
> I used to consider that UTF-16 would provide a space saving benefit for
> those languages where a single character runs to three or four bytes in
> UTF-8. It turns out that if there is a fairly small amount of markup,
> this space saving is not seen in practice.
>
> I understand that in well optimised Web Services applications with high
> throughput, profiling shows that UTF-8 to UTF-16 conversion (eg, to
> construct a DOM) can become significant so one would imagine shipping
> content in UTF-16 might help there also.
>
> I could not see any particular reason to use UTF-7.
>
> Material where a) random access was a high priority and b) there was
> significant usage of characters that would require surrogates, might
> indicate that using UCS-4 would be a benefit.
>
> So in general, and particularly for XML where a parser is not required
> to understand encodings other than UTF-8 and UTF-16, I see less and less
> reason to use anything other than UTF-8.
>
> --
> Chris Lilley mailto:chris@w3.org
> Chair, W3C SVG Working Group
> W3C Graphics Activity Lead

--
-------------------------------------------------------------
Tex Texin cell: +1 781 789 1898 mailto:Tex@XenCraft.com
Xen Master http://www.i18nGuy.com
XenCraft http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------
Received on Wednesday, 30 March 2005 20:03:01 UTC
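
The size tradeoff discussed in this thread can be checked directly by counting encoded bytes. The Python sketch below is only an illustration: the function name `encoded_sizes` and the sample strings are assumptions made here, not anything taken from the messages, and UTF-16 is measured without a byte order mark.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length of `text` in bytes for UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is used so that no byte order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
    }


# Hypothetical sample strings, chosen only to illustrate the thread's points.
samples = {
    "ASCII markup": '<p class="x">',
    "ideographic text": "日本語",
    "ideographic text in markup": "<p>日本語</p>",
    "supplementary-plane character": "\U0001D11E",  # MUSICAL SYMBOL G CLEF
}

for label, text in samples.items():
    sizes = encoded_sizes(text)
    print(f"{label}: utf-8={sizes['utf-8']} bytes, utf-16={sizes['utf-16']} bytes")
```

For these samples, ASCII markup doubles in UTF-16 (13 vs 26 bytes), the bare ideographic string shrinks (9 vs 6 bytes), the same string wrapped in even a little markup is already larger in UTF-16 (16 vs 20 bytes), and a character outside Plane 0 takes 4 bytes in both encodings. Real pages will vary with their mix of text and markup, so measuring representative content is more reliable than reasoning from the script alone.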