Statement of intent: article about character encoding and bytes (with some initial material)

This is rough at the moment.
Below is initial UTF-8 BOM material, that was reduced to have more of a
focus on the question. I'd like to re-use the byte stuff (mostly nearer
the bottom).
I think this would be a useful article, because I don't think it's
explained simply in one place. Also, it's crucial to the I18N and as
such, should be 
GEO-ed.
 
---------------------------
 
NEW ARTICLE: CHARACTER ENCODING AND BYTES
 
AREAS TO COVER
 
- How it works without Unicode in terms of bytes
- Unicode encoding's use of multiple bytes and how that varies across
different bytes.
- Implications, eg, 'heavier' character representation.
 
---------------------------
 
WHAT IS THE UNICODE BOM (BYTE ORDER MARK)?
 
The encoded Unicode Scalar Value (unique reference to a character
defined within the Unicode repertoire) corresponding to the Unicode
character "Zero Width 
Non-Breaking Space", whose sequence of byte-values is judged to be
unlikely to occur at the beginning of any 'normal' text file. 
 
A Unicode BOM has two purposes:
            - for UTF-16/32, it indicates the order in which bytes
should be interpreted (big Endian/little Endian), since the 'unit of
encoding' is greater than 
one byte for both of these encodings, unlike UTF-8; this is the primary
purpose of a BOM.
            - to identify a plain text file as containing Unicode text;
this is a secondary 'purpose'.
            
            
WHERE DOES IT OCCUR WITHIN BBC WS NEW MEDIA?
 
In Notepad files used to create HTML files/Apache includes, when saving
as UTF-8.
 
In BBC World Service New Media, we came across the BOM when working with
languages, which, for the web, required to be encoded in UTF-8 (we
currently use 
UTF-8 only when there is no other alternative); those languages are
Hindi, Pashto, Persian, Urdu, Vietnamese). For those languages, our
current process is to 
request translations of 'furniture' vocabulary, such as navigation
words, 'Latest News', etc. These translations are supplied from the BBC
Language Services 
in MS Word. 
            - We create HTML navigational includes using MS Notepad.
            - We create language service configuration files for use in
XSL processing.
            
            
PROBLEMS
 
In MS Notepad, the Unicode BOM does not appear on-screen when editing
the file; however, once uploaded and viewed via the web, the BOM appears
as a space 
above the included text. In the case of UTF-8, it appears to serve no
purpose. In order to remove the extraneous space, the file is opened in
Macromedia 
Homesite, where the BOM appears as 3 characters, which are deleted
(corresponding the values EF BB BF, which is the UTF-8 encoding for the
U+FFEF Zero Width 
Non-Breaking Space Unicode character used for the BOM). It would seem
that the second above-identified purpose of the Unicode BOM is
superfluous; the first 
above-identified purpose is irrelevant to UTF-8 anyway.
 
 
MISUNDERSTANDINGS
 
I had thought that the different encodings (UTF-8, UTF-16, UTF-32) of
the Unicode character repertoire, represented different (sub)sets of
that repertoire. I 
had thought that any UTF-8 encoding carried double the weight of the
presented text (as opposed to the mark-up of that presented text).
 
To understand the Unicode BOM I had to understand the following...
 
 
SO WHY ARE THERE THREE DIFFERENT UNICODE ENCODINGS, IF THEY CAN ALL
REPRESENT THE SAME (UNIVERSAL) CHARACTER REPERTOIRE?
 
The 3 different Unicode encodings are UTF-8, UTF-16 and UTF-32. Each
encoding can refer to each and every value in the Unicode repertoire.
The choice of 
Unicode encoding depends on language and efficiency. 
 
 
HOW DOES THE UNICODE BOM EFFECT EACH ENCODING?
 
The Unicode BOM primarily indicates the order in which byte(s) should be
'translated' into the corresponding Unicode Scalar Value. 
 
To understand, it is important to understand that:
 
            In regard to the the Unicode encodings (UTF-8, UTF-16,
UTF-32):
 
                        - UTF-8 is encoded using 1-4 bytes. In UTF-8,
the smallest indivisable unit of meaning is 1 byte. That byte either
represents a given Scalar 
Value in the Unicode repertoire or indicates that a Scalar Value cannot
be obtained with only 1 byte and the system needs to look at subsequent
bytes.
                        
                        Unicode character
UTF-8 byte 1     UTF-8 byte 2     UTF-8 byte 3     UTF-8 byte 4
                        
                        0000 to 007F (ASCII)
01xxxxxx
                        0080 to 07FF
110xxxxx                                  10xxxxxx
                        0800 to FFFF
1110xxxx                                  10xxxxxx
10xxxxxx
                        10000 to 10FFFF
11110xxx                                  10xxxxxx
10xxxxxx                                  
10xxxxxx
                        
                        
                        
                        , where the first byte indicates how many
further bytes have been used to encode the given Scalar Value in the
Unicode repertoire for the 
character being encoded. 
                        
                        The bytes can be seen as discrete in that the
first byte indicates how many (if any) further bytes are required to
identify the Unicode 
Scalar Value. 
                        
                        
                        The values in the Unicode repertoire from 0-127
represent the ASCII (non-extended set). As 'low' values, these can be
represented in a single 
byte.
 
                        - UTF-16 is encoded using 2 or 4 bytes (mostly
2). Where 4 bytes are required to represent a value in the Unicode
repertoire at the higher 
end, the first 2 bytes use a reserved value to indicate that a 2nd 2
bytes are required to identify the Unicode Scalar Value.
 
                        - UTF-32 is always encoded using 4 bytes.
            
            Where there are encodings which require more than 1 byte
(eg, UTF-16, UTF-32), a system may read those bytes from 'least
significant numerical value 
to most significant numerical value' or 'most significant numerical
value to least significant numerical value' depending on whether it is
'little endian' or 
'big endian' respectively.
            
            To illustrate this, take the decimal number 'twenty-five'.
Written conventionally in decimal format, this becomes '25', that is to
say, the digit '2' 
followed by the digit '5'.  In this example, the '2' is most significant
because it represents the value 'twenty' or 'two times ten', and '5' is
the least 
significant because it represents the value 'five' or 'five ones'. We
could write a representation of the same value as '52' if we adopted a
convention 
whereby the 'least significant' preceded 'more significant' values when
written out ('serialized'), but if we allow this alternative
representation, we need 
some means of specifying which representation has been used.
            
            In the case of Unicode, the receiving system uses the
Unicode BOM and its encoding in UTF-8/16/32 to understand 'direction'.
Thus, the encoded 
representation of the BOM indicates the direction in which the remaining
bytes should be interpreted. Little endian serializes from the least
significant 
(number), whereas big endian serializes from the most significant
number.
            
                        - big or little endian is dependent on the
processor architecure:
                                    - Motorola, PowerPC = big endian
                                    - Intel = little endian
            
                        - little endian - reads byte
            
                        - big endian - reads bytes right-to-left
            
            
 
 
 
UTF-8 is a byte-stream (code-size = one byte), hence byte-ordering is
irrelevant; each byte is 'atomic' in itself, and does not require to be
paired with 
another to derive a 'meaningful' unit of information (this does not mean
that a single byte can encode any character, merely that relevant
information can be 
derived from a single byte, whereas a single byte of data whose
'code-size' is 2 bytes is essentially 'meaningless').
 
However, as the 'code size' of UTF-16/32 is greater than one byte (eg 2
or 4 bytes), given the fact that different architectures store/serialize
such 
sequences in memory/on disk either 'most-significant-byte-first' or
'least-significant-byte-first', byte-order is significant (eg
little-endian vs 
big-endian), and some means of identifying the byte-order is essential
if byte-sequences serialized on one machine are to be interpreted
correctly on 
another.
 
The Unicode standard does not impose a specific byte-ordering sequence,
on the basis that machines function most efficiently when able to
operate using their 
'native' serialization order, hence some means of identifying same is
required.
 
This is the primary function of the BOM.  Unicode uses the U+FFEF Zero
Width Non-Breaking Space (ZWNBSP) Unicode character at the beginning of
a file as its 
byte-order mark; thus when interpreting UTF-16/32, if an application
executing on a 'big endian' machine 'sees' the value FFEF in the first
two/four bytes of 
a file, it is able to infer that the incoming data is UTF-16/UTF-32 'big
endian, whereas if it 'sees' the value FEFF (which is not a valid
Unicode Scalar 
Value) it is able to infer that that the incoming data is UTF-16/32
'little endian'.
 
UTF-8 is a variable length encoding - characters corresponding to the
ASCII range are encoded using one byte, the Unicode Scalar Values
corresponding exactly 
the equivalent ASCII values.
 
Characters whose Unicode Scalar Value is > 127 (decimal) are encoded
using two to four bytes.  The most significant bits of the first byte of
such 
characters' byte sequences will be 110xxxxx, for characters encoded with
two bytes, 1110xxxx for characters encoded with three byte, and 11110xxx
for those 
encoded with four bytes.  The most significant bits of the second, third
and fourth bytes will always be 10xxxxxx.
 
These bit 'patterns' allow a 'UTF-8-savvy' application to distinguish
characters encoded with a single byte from those encoded with multiple
bytes, since the 
the most significant bits of the first byte indicates how many bytes
have been used to encode any character.  A side effect of this use of
bit positions to 
signify encoding size, is that while 4 bytes can theoretically encode
4,294,967,296 different characters, only 2,164,863 variations are
possible in UTF-8, 
most of which are not defined by the Unicode repetoire anyway.
 
Examples:
 
Unicode character                                              UTF-8
byte 1     UTF-8 byte 2     UTF-8 byte 3     UTF-8 byte 4
 
0000 to 007F (ASCII)                              01xxxxxx
0080 to 07FF
110xxxxx                                  10xxxxxx
0800 to FFFF                                                    1110xxxx
10xxxxxx                                  10xxxxxx
10000 to 10FFFF                                               11110xxx
10xxxxxx                                  10xxxxxx
10xxxxxx
 
 
REFERENCES
 
http://www.dpawson.co.uk/xsl/sect2/N7702.html#d8771e16
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/un
icode_42jv.asp
Unicode: A Primer, Tony Graham, M&T Books (IDG Books Worldwide), 2000
(http://www.mulberrytech.com/unicode/primer/)
 
 
OTHER FAQS
 
Which languages fall at the end of the Unicode repertoire requiring 4
bytes?
Does Unicode mean heavier pages?
What a user needs to receive Unicode?
Using applications & OS with languages.
 
Deborah Cawkwell
Senior Software Engineer 
BBC World Service New Media
707 NE Bush House
London
WC2 4PH
Tel: 0207 557 3763
Fax: 0207 836 4332
deborah.cawkwell@bbc.co.uk
http://www.bbc.co.uk/worldservice 
 

BBCi at http://www.bbc.co.uk/

This e-mail (and any attachments) is confidential and may contain
personal views which are not the views of the BBC unless specifically
stated.
If you have received it in error, please delete it from your system. 
Do not use, copy or disclose the information in any way nor act in
reliance on it and notify the sender immediately. Please note that the
BBC monitors e-mails sent or received. 
Further communication will signify your consent to this.

Received on Wednesday, 24 March 2004 07:58:23 UTC