Re: International business communications and Unicode

From: A. Vine <avine@eng.sun.com>
Date: Thu, 23 Aug 2001 10:16:38 -0700
To: www-international@w3.org
Message-id: <3B853A76.F56EF882@eng.sun.com>

To clarify some terminology, here is some text that I sent to one of our
marketing folks explaining how characters are handled by the computer system. 
He seemed to understand it :-)

Computers see text as streams of 8-bit bytes.  Individual characters are
represented by a byte or a sequence of bytes, depending on:  1) what
character it is, 2) what character encoding it's in.  Essentially, characters
have to be classified in some way, so the display knows that a particular
sequence of bytes needs to be displayed as, e.g., A (Latin capital A).  In
order for computers to be able to do this, each character in a character set
is associated with its own integer value.  These integer values are then
converted by some algorithm into byte sequences, a unique one for each
character.  The resulting mapping of the character set to byte sequences is
called a character encoding scheme.  The name given to a particular character
encoding scheme is a charset.
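
As a rough illustration of the two steps above (character to integer value,
then integer value to byte sequence), here is a minimal Python sketch.  The
helper name describe() is made up for this example.  Note that Python's ord()
always returns the Unicode value of a character, which is not necessarily the
integer that some other character set assigns to it.

```python
# describe() is a hypothetical helper, not part of any library.
def describe(ch, charset):
    """Return (integer value, byte sequence in hex) for one character."""
    code_point = ord(ch)            # integer value Unicode assigns to ch
    encoded = ch.encode(charset)    # byte sequence under the named charset
    return code_point, encoded.hex()

print(describe("A", "ascii"))        # (65, '41')
print(describe("é", "iso-8859-1"))   # (233, 'e9')
print(describe("é", "utf-8"))        # (233, 'c3a9')
```

The same character and the same integer value can come out as different byte
sequences, depending on the charset.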

So for example, take A.  It's a member of lots of different character sets, and
lots of charsets, usually with the exact same byte sequence.  When a display
program sees its byte sequence, 1 byte with a value of hex 41 (binary
01000001), it
goes and gets the glyph (visual representation of a character) from a font using
that value.  The glyph comes out as "A", or maybe something that looks somewhat
different because it's bold or italic or in another font.
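
To see the "usually" in practice, the sketch below encodes A in a handful of
charsets.  The selection is mine, for illustration only, and the names are
Python codec names.  The last two are there as counterexamples where A does
not come out as the single byte hex 41.

```python
# Charsets picked for illustration; the names are Python codec names.
ASCII_COMPATIBLE = ["ascii", "iso-8859-1", "windows-1252", "shift_jis", "utf-8"]
NOT_ASCII_COMPATIBLE = ["utf-16-be", "cp500"]  # cp500 is an EBCDIC variant

def hex_bytes(ch, charset):
    """Byte sequence for ch under charset, as a hex string."""
    return ch.encode(charset).hex()

for name in ASCII_COMPATIBLE + NOT_ASCII_COMPATIBLE:
    print(f"{name:14} {hex_bytes('A', name)}")

# The same value written out in binary.
print(format(0x41, "08b"))   # 01000001
```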

Now, if you want access to all the major characters of the world, you can use
a charset called UTF-8.  UTF-8 is a character encoding scheme for Unicode (there
are other character encoding schemes for Unicode, e.g. UCS-2, UTF-16, etc.,
but UTF-8 tends to work better in older software).  That is, Unicode is a set of
characters associated with integer values (called a coded character set), and
UTF-8 defines the byte sequences for those Unicode values, so programs can
"understand" the text well enough to figure out how to pick out the appropriate
glyphs from the appropriate font.  Let's not bother with fonts - they are their
own discipline.
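
Here is a small sketch of the split between the coded character set (Unicode
values) and two of its encoding schemes.  The function name encodings() and
the sample characters are my own choices; utf-16-be is used so that no byte
order mark gets in the way.

```python
# encodings() is a hypothetical helper for this example.
def encodings(text):
    """For each character: (Unicode value, UTF-8 bytes, UTF-16BE bytes)."""
    rows = []
    for ch in text:
        rows.append((
            f"U+{ord(ch):04X}",            # value in the coded character set
            ch.encode("utf-8").hex(),      # byte sequence under UTF-8
            ch.encode("utf-16-be").hex(),  # byte sequence under UTF-16 (big-endian)
        ))
    return rows

for row in encodings("Aé€"):
    print(row)
# ('U+0041', '41', '0041')
# ('U+00E9', 'c3a9', '00e9')
# ('U+20AC', 'e282ac', '20ac')
```

The Unicode value of each character stays the same; only the byte sequences
differ between the two schemes.  A is still the single byte hex 41 in UTF-8,
which is one reason it gets along with older software.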

If you want to understand why Unicode is only a tool to help folks develop
multilingual applications and Web sites, let me know your email address and I'll
send you a presentation on internationalization which will clarify how complex a
task it is to put this sort of product together.

Andrea Vine
iPlanet i18n architect

P.S.  Internationalization is abbreviated "i18n", because there are 18 letters
between the "i" and the "n".  Since it's an abbreviation, there is no need to
capitalize it.
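
For the curious, the counting rule can be sketched in a few lines of Python.
The function name numeronym() is made up for this example, and lowercasing
the result is my reading of the note about capitalization.

```python
# numeronym() is a hypothetical helper illustrating the counting rule.
def numeronym(word):
    """First letter, count of letters in between, last letter."""
    word = word.lower()
    if len(word) < 4:
        return word              # too short to be worth abbreviating
    return f"{word[0]}{len(word) - 2}{word[-1]}"

print(numeronym("Internationalization"))   # i18n
print(numeronym("localization"))           # l10n
```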
Received on Thursday, 23 August 2001 13:17:32 GMT