- From: Deborah Cawkwell <deborah.cawkwell@bbc.co.uk>
- Date: Wed, 24 Mar 2004 12:54:00 -0000
- To: <public-i18n-geo@w3.org>
- Message-ID: <418B7E44473AC34488C9E730D09FF3CF02BEED22@bbcxue204.national.core.bbc.co.uk>
This is rough at the moment. Below is initial UTF-8 BOM material, that was reduced to have more of a focus on the question. I'd like to re-use the byte stuff (mostly nearer the bottom). I think this would be a useful article, because I don't think it's explained simply in one place. Also, it's crucial to the I18N and as such, should be GEO-ed. --------------------------- NEW ARTICLE: CHARACTER ENCODING AND BYTES AREAS TO COVER - How it works without Unicode in terms of bytes - Unicode encoding's use of multiple bytes and how that varies across different bytes. - Implications, eg, 'heavier' character representation. --------------------------- WHAT IS THE UNICODE BOM (BYTE ORDER MARK)? The encoded Unicode Scalar Value (unique reference to a character defined within the Unicode repertoire) corresponding to the Unicode character "Zero Width Non-Breaking Space", whose sequence of byte-values is judged to be unlikely to occur at the beginning of any 'normal' text file. A Unicode BOM has two purposes: - for UTF-16/32, it indicates the order in which bytes should be interpreted (big Endian/little Endian), since the 'unit of encoding' is greater than one byte for both of these encodings, unlike UTF-8; this is the primary purpose of a BOM. - to identify a plain text file as containing Unicode text; this is a secondary 'purpose'. WHERE DOES IT OCCUR WITHIN BBC WS NEW MEDIA? In Notepad files used to create HTML files/Apache includes, when saving as UTF-8. In BBC World Service New Media, we came across the BOM when working with languages, which, for the web, required to be encoded in UTF-8 (we currently use UTF-8 only when there is no other alternative); those languages are Hindi, Pashto, Persian, Urdu, Vietnamese). For those languages, our current process is to request translations of 'furniture' vocabulary, such as navigation words, 'Latest News', etc. These translations are supplied from the BBC Language Services in MS Word. - We create HTML navigational includes using MS Notepad. - We create language service configuration files for use in XSL processing. PROBLEMS In MS Notepad, the Unicode BOM does not appear on-screen when editing the file; however, once uploaded and viewed via the web, the BOM appears as a space above the included text. In the case of UTF-8, it appears to serve no purpose. In order to remove the extraneous space, the file is opened in Macromedia Homesite, where the BOM appears as 3 characters, which are deleted (corresponding the values EF BB BF, which is the UTF-8 encoding for the U+FFEF Zero Width Non-Breaking Space Unicode character used for the BOM). It would seem that the second above-identified purpose of the Unicode BOM is superfluous; the first above-identified purpose is irrelevant to UTF-8 anyway. MISUNDERSTANDINGS I had thought that the different encodings (UTF-8, UTF-16, UTF-32) of the Unicode character repertoire, represented different (sub)sets of that repertoire. I had thought that any UTF-8 encoding carried double the weight of the presented text (as opposed to the mark-up of that presented text). To understand the Unicode BOM I had to understand the following... SO WHY ARE THERE THREE DIFFERENT UNICODE ENCODINGS, IF THEY CAN ALL REPRESENT THE SAME (UNIVERSAL) CHARACTER REPERTOIRE? The 3 different Unicode encodings are UTF-8, UTF-16 and UTF-32. Each encoding can refer to each and every value in the Unicode repertoire. The choice of Unicode encoding depends on language and efficiency. HOW DOES THE UNICODE BOM EFFECT EACH ENCODING? The Unicode BOM primarily indicates the order in which byte(s) should be 'translated' into the corresponding Unicode Scalar Value. To understand, it is important to understand that: In regard to the the Unicode encodings (UTF-8, UTF-16, UTF-32): - UTF-8 is encoded using 1-4 bytes. In UTF-8, the smallest indivisable unit of meaning is 1 byte. That byte either represents a given Scalar Value in the Unicode repertoire or indicates that a Scalar Value cannot be obtained with only 1 byte and the system needs to look at subsequent bytes. Unicode character UTF-8 byte 1 UTF-8 byte 2 UTF-8 byte 3 UTF-8 byte 4 0000 to 007F (ASCII) 01xxxxxx 0080 to 07FF 110xxxxx 10xxxxxx 0800 to FFFF 1110xxxx 10xxxxxx 10xxxxxx 10000 to 10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx , where the first byte indicates how many further bytes have been used to encode the given Scalar Value in the Unicode repertoire for the character being encoded. The bytes can be seen as discrete in that the first byte indicates how many (if any) further bytes are required to identify the Unicode Scalar Value. The values in the Unicode repertoire from 0-127 represent the ASCII (non-extended set). As 'low' values, these can be represented in a single byte. - UTF-16 is encoded using 2 or 4 bytes (mostly 2). Where 4 bytes are required to represent a value in the Unicode repertoire at the higher end, the first 2 bytes use a reserved value to indicate that a 2nd 2 bytes are required to identify the Unicode Scalar Value. - UTF-32 is always encoded using 4 bytes. Where there are encodings which require more than 1 byte (eg, UTF-16, UTF-32), a system may read those bytes from 'least significant numerical value to most significant numerical value' or 'most significant numerical value to least significant numerical value' depending on whether it is 'little endian' or 'big endian' respectively. To illustrate this, take the decimal number 'twenty-five'. Written conventionally in decimal format, this becomes '25', that is to say, the digit '2' followed by the digit '5'. In this example, the '2' is most significant because it represents the value 'twenty' or 'two times ten', and '5' is the least significant because it represents the value 'five' or 'five ones'. We could write a representation of the same value as '52' if we adopted a convention whereby the 'least significant' preceded 'more significant' values when written out ('serialized'), but if we allow this alternative representation, we need some means of specifying which representation has been used. In the case of Unicode, the receiving system uses the Unicode BOM and its encoding in UTF-8/16/32 to understand 'direction'. Thus, the encoded representation of the BOM indicates the direction in which the remaining bytes should be interpreted. Little endian serializes from the least significant (number), whereas big endian serializes from the most significant number. - big or little endian is dependent on the processor architecure: - Motorola, PowerPC = big endian - Intel = little endian - little endian - reads byte - big endian - reads bytes right-to-left UTF-8 is a byte-stream (code-size = one byte), hence byte-ordering is irrelevant; each byte is 'atomic' in itself, and does not require to be paired with another to derive a 'meaningful' unit of information (this does not mean that a single byte can encode any character, merely that relevant information can be derived from a single byte, whereas a single byte of data whose 'code-size' is 2 bytes is essentially 'meaningless'). However, as the 'code size' of UTF-16/32 is greater than one byte (eg 2 or 4 bytes), given the fact that different architectures store/serialize such sequences in memory/on disk either 'most-significant-byte-first' or 'least-significant-byte-first', byte-order is significant (eg little-endian vs big-endian), and some means of identifying the byte-order is essential if byte-sequences serialized on one machine are to be interpreted correctly on another. The Unicode standard does not impose a specific byte-ordering sequence, on the basis that machines function most efficiently when able to operate using their 'native' serialization order, hence some means of identifying same is required. This is the primary function of the BOM. Unicode uses the U+FFEF Zero Width Non-Breaking Space (ZWNBSP) Unicode character at the beginning of a file as its byte-order mark; thus when interpreting UTF-16/32, if an application executing on a 'big endian' machine 'sees' the value FFEF in the first two/four bytes of a file, it is able to infer that the incoming data is UTF-16/UTF-32 'big endian, whereas if it 'sees' the value FEFF (which is not a valid Unicode Scalar Value) it is able to infer that that the incoming data is UTF-16/32 'little endian'. UTF-8 is a variable length encoding - characters corresponding to the ASCII range are encoded using one byte, the Unicode Scalar Values corresponding exactly the equivalent ASCII values. Characters whose Unicode Scalar Value is > 127 (decimal) are encoded using two to four bytes. The most significant bits of the first byte of such characters' byte sequences will be 110xxxxx, for characters encoded with two bytes, 1110xxxx for characters encoded with three byte, and 11110xxx for those encoded with four bytes. The most significant bits of the second, third and fourth bytes will always be 10xxxxxx. These bit 'patterns' allow a 'UTF-8-savvy' application to distinguish characters encoded with a single byte from those encoded with multiple bytes, since the the most significant bits of the first byte indicates how many bytes have been used to encode any character. A side effect of this use of bit positions to signify encoding size, is that while 4 bytes can theoretically encode 4,294,967,296 different characters, only 2,164,863 variations are possible in UTF-8, most of which are not defined by the Unicode repetoire anyway. Examples: Unicode character UTF-8 byte 1 UTF-8 byte 2 UTF-8 byte 3 UTF-8 byte 4 0000 to 007F (ASCII) 01xxxxxx 0080 to 07FF 110xxxxx 10xxxxxx 0800 to FFFF 1110xxxx 10xxxxxx 10xxxxxx 10000 to 10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx REFERENCES http://www.dpawson.co.uk/xsl/sect2/N7702.html#d8771e16 http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/un icode_42jv.asp Unicode: A Primer, Tony Graham, M&T Books (IDG Books Worldwide), 2000 (http://www.mulberrytech.com/unicode/primer/) OTHER FAQS Which languages fall at the end of the Unicode repertoire requiring 4 bytes? Does Unicode mean heavier pages? What a user needs to receive Unicode? Using applications & OS with languages. Deborah Cawkwell Senior Software Engineer BBC World Service New Media 707 NE Bush House London WC2 4PH Tel: 0207 557 3763 Fax: 0207 836 4332 deborah.cawkwell@bbc.co.uk http://www.bbc.co.uk/worldservice BBCi at http://www.bbc.co.uk/ This e-mail (and any attachments) is confidential and may contain personal views which are not the views of the BBC unless specifically stated. If you have received it in error, please delete it from your system. Do not use, copy or disclose the information in any way nor act in reliance on it and notify the sender immediately. Please note that the BBC monitors e-mails sent or received. Further communication will signify your consent to this.
Received on Wednesday, 24 March 2004 07:58:23 UTC