Re: BOM (several messages about handling encodings in HTML)

Geoffrey Sneddon wrote:

>> "In particular, whenever a data stream is declared to be UTF-16BE,  
>> UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used."

>> If somebody wants to include a zero-width non-breaking space  
>> (ZWNBSP) at the beginning of a stream, they have to use U+2060 WORD  
>> JOINER instead.
 
> Could you possibly give me a pointer to something in the Unicode  
> standard that requires that? I've never seen such a requirement.

TUS 5.0 chapter 3.10, D96: 
"In UTF-16BE an initial byte sequence <FE FF> is interpreted as
 U+FEFF ZERO WIDTH NO-BREAK SPACE."

D97 is the corresponding <FF FE> definition for UTF-16LE. 
D98 explains that an initial <FE FF> or <FF FE> is a BOM.
D99, D100, and D101 are for UTF-32BE, UTF-32LE, and UTF-32.
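
To illustrate the difference between D96 and D98, here is a minimal
sketch. It assumes Python's built-in codecs, which are only my choice
for the example and are not mentioned in TUS:

```python
# An initial <FE FF> followed by "A" in big-endian UTF-16.
data = b"\xfe\xff\x00A"

# Declared as UTF-16BE: <FE FF> stays in the text as the character U+FEFF.
as_utf16be = data.decode("utf-16-be")

# Declared as plain UTF-16: <FE FF> is taken as a BOM and is not
# part of the decoded text.
as_utf16 = data.decode("utf-16")

print(repr(as_utf16be))
print(repr(as_utf16))
```

The same bytes give two different strings, depending only on the
declared encoding scheme.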

Chapter 16.8 notes that WORD JOINER should be used for what
the name says instead of ZWNBSP.
Chapter 16.2 states that WORD JOINER is strongly preferred
in comparison with ZWNBSP.
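
A rough sketch of that preference, again in Python. The helper name
prefer_word_joiner is hypothetical, and it assumes the text is already
decoded and any real BOM has been stripped:

```python
import unicodedata

ZWNBSP = "\ufeff"       # U+FEFF, doubles as the BOM
WORD_JOINER = "\u2060"  # U+2060, the preferred replacement

def prefer_word_joiner(text):
    """Replace every U+FEFF in already-decoded text with U+2060.

    Assumption: a genuine BOM was removed by the decoder beforehand,
    so any remaining U+FEFF was meant as a word joiner.
    """
    return text.replace(ZWNBSP, WORD_JOINER)

print(unicodedata.name(ZWNBSP))
print(unicodedata.name(WORD_JOINER))
```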

For a summary see table 2.4 in chapter 2.6: it says "BOM
allowed: yes" for UTF-8, UTF-16, and UTF-32, and it says "no"
for UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE.
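
The "BOM allowed" column as summarized above could be written down as
a small lookup. This is only a sketch: the names BOM_ALLOWED and
bom_allowed are made up for the example.

```python
# "BOM allowed" per encoding scheme, as summarized from table 2.4.
# The dict and the helper below are hypothetical, for illustration only.
BOM_ALLOWED = {
    "UTF-8": True,
    "UTF-16": True,
    "UTF-32": True,
    "UTF-16BE": False,
    "UTF-16LE": False,
    "UTF-32BE": False,
    "UTF-32LE": False,
}

def bom_allowed(label):
    """Return True if an initial BOM is allowed for the declared scheme.

    Raises KeyError for labels that are not in the summary above.
    """
    return BOM_ALLOWED[label.upper()]
```

For example, bom_allowed("utf-16") is True, while bom_allowed("utf-16be")
is False.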

See also C11 in chapter 3, though it is not exactly clear from my POV.

For better definitions with MUST and MUST NOT see RFC 2781;
this RFC is the normative text for the IANA registrations.

 Frank

Received on Saturday, 1 March 2008 10:38:41 UTC