
RE: BOM (several messages about handling encodings in HTML)

From: Brian Smith <brian@briansmith.org>
Date: Fri, 29 Feb 2008 08:54:55 -0800
To: "'HTML WG'" <public-html@w3.org>
Message-ID: <005201c87af3$d01492f0$6401a8c0@T60>

Geoffrey Sneddon wrote:
> On 29 Feb 2008, at 13:38, Brian Smith wrote:
> > If somebody wants to include a zero-width non-breaking space
> > (ZWNBSP) at the beginning of a stream, they have to use U+2060 WORD 
> > JOINER instead.
> 
> Could you possibly give me a pointer to something in the 
> Unicode standard that requires that? I've never seen such a 
> requirement.

See section 16.8, Specials, of the Unicode Standard:

"For compatibility with versions of the Unicode Standard prior to
Version 3.2, the code point U+FEFF has the word-joining semantics of
zero width no-break space when it is not used as a BOM. In new text,
these semantics should be encoded by U+2060 word joiner."

But, if you do want to use U+FEFF anyway, and you are not using -BE or
-LE, then:

"To represent an initial U+FEFF zero width no-break space in a UTF-16
file, use U+FEFF twice in a row. The first one is a byte order mark;
the second one is the initial zero width no-break space. See Table 16-4
for a summary of encoding scheme signatures."

But:

"Where the byte order is explicitly specified, such as in UTF-16BE or
UTF-16LE, then all U+FEFF characters-even at the very beginning of the
text-are to be interpreted as zero width no-break spaces."

So an initial U+FEFF is never an error, even in the -BE and -LE
variants. In -BE and -LE, though, it is not a BOM but a ZWNBSP. Producers
of new documents should also avoid U+FEFF anywhere in the document
except as a BOM, and a BOM by definition cannot exist in a -BE/-LE
document.
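
To make the three cases concrete, here is a rough sketch using Python's
built-in codecs, which as far as I can tell follow the rules quoted
above. The helper name decode_utf16 and the sample byte strings are my
own, made up for illustration; nothing here comes from the standard
itself.

```python
import codecs

ZWNBSP = "\ufeff"        # U+FEFF, also used as the byte order mark
WORD_JOINER = "\u2060"   # U+2060, the recommended replacement in new text


def decode_utf16(data: bytes, label: str) -> str:
    """Hypothetical helper: decode with the codec named by `label`."""
    return data.decode(label)


# Bytes FE FF followed by "A" in big-endian order.
single = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
# The same, but with U+FEFF twice in a row at the start.
double = codecs.BOM_UTF16_BE + single

# "UTF-16" (byte order not given): the first U+FEFF is a BOM and is consumed.
bom_stripped = decode_utf16(single, "utf-16")

# "UTF-16" with U+FEFF twice: the second one survives as an initial ZWNBSP.
bom_then_zwnbsp = decode_utf16(double, "utf-16")

# "UTF-16BE" (byte order explicit): the initial U+FEFF is a character, not a BOM.
explicit_order = decode_utf16(single, "utf-16-be")

# In new text, U+2060 carries the word-joining semantics and is never a BOM.
word_joined = decode_utf16((WORD_JOINER + "A").encode("utf-16"), "utf-16")
```

Note that the same leading bytes (FE FF) give a different character count
depending only on the encoding label, which is why the label matters so
much here.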

- Brian
Received on Friday, 29 February 2008 16:55:08 UTC