Re: Unicode encoding for web pages

Hi,

There are several tradeoffs between UTF-8 and UTF-16.

Size has been mentioned, and that depends on the nature of the data: the
distribution of languages or characters used.
This is colored by the markup, metadata, scripts, media or other
information that can be enclosed with the text of the page. Markup, of
course, tends towards ASCII.
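
As a rough illustration of how markup shifts the balance, here is a
small Python sketch. The sample strings and the helper name
encoded_sizes are made up for this example; real pages will have
their own mix of text and markup.

```python
# Compare the encoded size of the same content in UTF-8 and UTF-16.
# All sample strings below are made up for illustration.

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, not counting a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

ideographs = "統一碼編碼"      # 5 ideographs, all in Plane 0
plain_text = ideographs * 20   # text with no markup at all
# Each short run of text is wrapped in 20 ASCII characters of markup.
marked_up = ('<p class="body">' + ideographs + "</p>") * 20

for label, sample in [("text only", plain_text), ("with markup", marked_up)]:
    utf8_bytes, utf16_bytes = encoded_sizes(sample)
    print(f"{label}: UTF-8 {utf8_bytes} bytes, UTF-16 {utf16_bytes} bytes")
```

With these samples the text-only case is smaller in UTF-16, but once
the ASCII markup outnumbers the ideographs the UTF-8 form is the
smaller one.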

The cost of conversion between UTF-8 and UTF-16 is very small, and in
many situations you can get a significantly bigger performance
improvement by optimizing other aspects of the application than by
either eliminating the conversion or changing your base encoding.
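
If you want to check the conversion cost for your own data, a minimal
Python sketch along these lines can give a rough number. The page
content, the run count and the function names are assumptions for the
example, and the timing will vary by machine, so treat the result as
indicative only.

```python
import timeit

def utf8_to_utf16(data):
    """Transcode UTF-8 bytes to UTF-16 (little-endian, no BOM)."""
    return data.decode("utf-8").encode("utf-16-le")

def utf16_to_utf8(data):
    """Transcode UTF-16 (little-endian, no BOM) bytes back to UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")

# A made-up, ASCII-only page of 32,000 bytes.
page = ('<p class="body">Hello, world</p>' * 1000).encode("utf-8")

runs = 200  # arbitrary repeat count to smooth out timer noise
seconds = timeit.timeit(lambda: utf8_to_utf16(page), number=runs)
print(f"{len(page)} bytes: {seconds / runs * 1e6:.1f} microseconds per conversion")
```

Comparing that figure with the time spent elsewhere in a request
(database access, template rendering, network) is one way to judge
whether the conversion is worth optimizing at all.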

However, it pays to consider the nature of the application and what is
actually done with the data.
Many applications primarily move data between the screen, data and
other buffers, and databases, and don't do much actual modification of
it or linguistic operations (search, etc.) on it.
For those applications, since they are just moving bytes back and forth,
the conversion is needless and there is no benefit. You may as well
leave the data as-is.

On the other hand, applications that do intensive linguistic processing
of text will benefit in terms of CPU cycles from using UTF-16.

The CPU benefit can be outweighed, though, if data access is slowed by
the growth in size from UTF-8 to UTF-16, for example if more disk reads
are needed.

Then there is transmission cost. Sending more bytes over the net can be
prohibitive.

So:

For small pages, or pages that are dominated by non-textual data, or
pages that are dominated by ideographic languages, UTF-16 is fine and
can be an improvement if the data tends to be more compact in UTF-16.

For data that is intensively linguistically processed, UTF-16 is
better and can pay off even if there is some conversion overhead. So you
might use UTF-16 internally or on a backend, even if the pages are UTF-8
and you have to convert.

For data that is only moved around and not processed, you might
look at language usage and choose the more compact form of UTF, so
that I/O (disk reads/writes) doesn't impact performance.

Net transmission cost is often the biggest performance impediment, so
again size is the biggest consideration.

hth
tex


"McDonald, Ira" wrote:
> 
> Hi,
> 
> And for what it's worth, the IETF formally requires that UTF-8
> must be supported in transferring human-readable text over any
> Internet protocol (including HTTP/1.1) and has done so for a
> _long_ time.  See RFC 2277 (January 1998) which specifically
> prohibits (for example) UTF-16 only support (without UTF-8).
> 
> If you encode a page in UTF-16, there's a fair chance that an
> intermediary is going to convert it into UTF-8 before delivery
> anyway.  The "benefits" of UTF-16 disappeared after Plane 0
> stopped being the only useful and assigned Unicode codepoints
> (for example, all the interesting math and musical notation
> is not in Plane 0).
> 
> Cheers,
> - Ira
> 
> Ira McDonald (Musician / Software Architect)
> Blue Roof Music / High North Inc
> PO Box 221  Grand Marais, MI  49839
> phone: +1-906-494-2434
> email: imcdonald@sharplabs.com
> 
> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of Chris Lilley
> Sent: Wednesday, March 30, 2005 9:29 AM
> To: Deborah Cawkwell
> Cc: www-international@w3.org
> Subject: Re: Unicode encoding for web pages
> 
> On Wednesday, March 30, 2005, 2:45:27 PM, Deborah wrote:
> 
> DC> For web pages, would you consider using a Unicode encoding
> DC> other than UTF-8, eg UTF-16? If so, why? or why not?
> 
> I used to consider that UTF-16 would provide a space saving benefit for
> those languages where a single character runs to three or four bytes in
> UTF-8. It turns out that if there is a fairly small amount of markup,
> this space saving is not seen in practice.
> 
> I understand that in well optimised Web Services applications with high
> throughput, profiling shows that UTF-8 to UTF-16 conversion (eg, to
> construct a DOM) can become significant so one would imagine shipping
> content in UTF-16 might help there also.
> 
> I could not see any particular reason to use UTF-7.
> 
> Material where a) random access was a high priority and b) there was
> significant usage of characters that would require surrogates, might
> indicate that using UCS-4 would be a benefit.
> 
> So in general, and particularly for XML where a parser is not required
> to understand encodings other than UTF-8 and UTF-16, I see less and less
> reason to use anything other than UTF-8.
> 
> --
>  Chris Lilley                    mailto:chris@w3.org
>  Chair, W3C SVG Working Group
>  W3C Graphics Activity Lead

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------

Received on Wednesday, 30 March 2005 20:03:01 UTC