RE: BOM (several messages about handling encodings in HTML)

Geoffrey Sneddon wrote:
> On 29 Feb 2008, at 13:38, Brian Smith wrote:
> > If somebody wants to include a zero-width non-breaking space
> > (ZWNBSP) at the beginning of a stream, they have to use U+2060 WORD 
> > JOINER instead.
> 
> Could you possibly give me a pointer to something in the 
> Unicode standard that requires that? I've never seen such a 
> requirement.

See section 16.8, Specials, of the Unicode Standard:

"For compatibility with versions of the Unicode Standard prior to
Version 3.2, the code point U+FEFF has the word-joining semantics of
zero width no-break space when it is not used as a BOM. In new text,
these semantics should be encoded by U+2060 word joiner."

But if you do want to use U+FEFF anyway, and you are not using
UTF-16BE or UTF-16LE, then:

"To represent an initial U+FEFF zero width no-break space in a UTF-16
file, use U+FEFF twice in a row. The first one is a byte order mark;
the second one is the initial zero width no-break space. See Table 16-4
for a summary of encoding scheme signatures."

But:

"Where the byte order is explicitly specified, such as in UTF-16BE or
UTF-16LE, then all U+FEFF characters - even at the very beginning of
the text - are to be interpreted as zero width no-break spaces."
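
To make the difference concrete, here is a minimal Python sketch. It
assumes Python's built-in "utf-16" and "utf-16-be" codecs, which
consume a leading BOM and keep it as a character, respectively. The
helper names are made up for illustration and come from neither the
Unicode Standard nor any HTML spec.

```python
import codecs

ZWNBSP = "\ufeff"  # U+FEFF, doubles as the byte order mark

# Bytes FE FF FE FF 00 68 00 69: U+FEFF twice, then "hi", big-endian.
payload = codecs.BOM_UTF16_BE + (ZWNBSP + "hi").encode("utf-16-be")


def decode_unmarked(data: bytes) -> str:
    """Decode as plain UTF-16: the first U+FEFF is a BOM and is dropped."""
    return data.decode("utf-16")


def decode_explicit_be(data: bytes) -> str:
    """Decode as UTF-16BE: byte order is explicit, so every U+FEFF is text."""
    return data.decode("utf-16-be")


if __name__ == "__main__":
    print([hex(ord(c)) for c in decode_unmarked(payload)])
    print([hex(ord(c)) for c in decode_explicit_be(payload)])
```

The same bytes give one leading U+FEFF when read as plain UTF-16 and
two when read as UTF-16BE.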

So an initial U+FEFF is never an error, even in the -BE and -LE
variants. In -BE and -LE, though, it is not a BOM but a ZWNBSP. Also,
producers of documents should never use U+FEFF anywhere in the document
except as a BOM, and a BOM by definition can't exist in a -BE/-LE
document.
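
On the producer side, one rough way to follow that advice is to swap
every U+FEFF for U+2060 before encoding. This is only a sketch: the
function names are hypothetical, and it assumes every U+FEFF in the
input string is meant as a word joiner rather than as a BOM.

```python
ZWNBSP = "\ufeff"       # U+FEFF ZERO WIDTH NO-BREAK SPACE / BOM
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER


def prepare_new_text(text: str) -> str:
    """Replace every U+FEFF with U+2060, as recommended for new text."""
    return text.replace(ZWNBSP, WORD_JOINER)


def encode_utf16_be(text: str) -> bytes:
    """Encode as UTF-16BE; no BOM is written and no U+FEFF remains."""
    return prepare_new_text(text).encode("utf-16-be")
```

With this, a UTF-16BE stream never starts with the bytes FE FF, so a
consumer can't mistake the first character for a BOM.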

- Brian

Received on Friday, 29 February 2008 16:55:08 UTC