- From: A. Vine <avine@eng.sun.com>
- Date: Thu, 23 Aug 2001 10:16:38 -0700
- To: www-international@w3.org
Emma,

To clarify some terminology, here is some text that I sent to one of our marketing folks explaining how characters are handled by the computer system. He seemed to understand it :-)

Computers see text as streams of 8-bit bytes. Individual characters are represented by a byte or a sequence of bytes, depending on: 1) what character it is, and 2) what character encoding it's in. Essentially, characters have to be classified in some way, so that the display knows that a particular sequence of bytes needs to be displayed as, e.g., A (Latin capital A).

In order for computers to be able to do this, all the characters in a character set are associated with a set of integer values, one per character. These integer values are then converted by some algorithm into byte sequences, a unique one for each character. The resulting mapping from the character set to byte sequences is called a character encoding scheme. The name given to a particular character encoding scheme is called a charset.

So for example, take A. It's a member of lots of different character sets, and lots of charsets, usually with the exact same byte sequence. When a display program sees its byte sequence, 1 byte with a value of hex 41 (binary 01000001), it goes and gets the glyph (the visual representation of a character) from a font using that value. The glyph comes out as "A", or maybe something that looks somewhat different because it's bold or italic or in another font.

Now, if you want access to all the major characters of the world, you can use a charset called UTF-8. UTF-8 is a character encoding scheme for Unicode (there are many other character encoding schemes for Unicode, e.g. UCS-2, UTF-16, etc., but UTF-8 tends to work better in older software).
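To make the A example concrete, here is a minimal Python sketch of the chain "character, integer value, byte sequence". It is only an illustration: the helper name describe and the particular list of charsets are my own choices for the example, not part of any product or standard API.

```python
# Sketch: character -> integer value -> byte sequence in several charsets.
# The helper name `describe` and the default charset list are illustrative
# assumptions, not anything defined by Unicode or by a particular product.

def describe(char, charsets=("us-ascii", "iso-8859-1", "utf-8", "utf-16-be")):
    """Return (integer value, {charset: hex byte sequence or None})."""
    encodings = {}
    for name in charsets:
        try:
            # Apply the charset's algorithm to get the byte sequence.
            encodings[name] = char.encode(name).hex()
        except UnicodeEncodeError:
            # The character is not a member of this charset's character set.
            encodings[name] = None
    return ord(char), encodings


value, table = describe("A")
print(hex(value), format(value, "08b"))  # 0x41 01000001
print(table)  # one byte, hex 41, in the first three; 0041 in utf-16-be

value, table = describe("\u00e9")        # Latin small letter e with acute
print(hex(value))                        # 0xe9
print(table)  # not in us-ascii; e9 in iso-8859-1; c3a9 in utf-8
```

A has the same single byte in US-ASCII, ISO-8859-1 and UTF-8, which is the "usually with the exact same byte sequence" case; the accented letter shows where the charsets start to differ.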
That is, Unicode is a set of characters associated with integer values (called a coded character set), and UTF-8 is the byte sequences for those Unicode values, so programs can "understand" the text well enough to figure out how to pick out the appropriate glyphs from the appropriate font. Let's not bother with fonts - they are their own discipline.

If you want to understand why Unicode is only a tool to help folks develop multilingual applications and Web sites, let me know your email address and I'll send you a presentation on internationalization which will clarify how complex a task it is to put this sort of product together.

Regards,
Andrea Vine
iPlanet i18n architect

P.S. Internationalization is abbreviated "i18n", because there are 18 letters between the "i" and the "n". Since it's an abbreviation, there is no need to capitalize it.
Received on Thursday, 23 August 2001 13:17:32 UTC