RE: UTF-8 and BOM

At 00/08/25 13:08 -0700, John Boyer wrote:
><TomParaphrase>
>Must we strip all zero width no break spaces (U+FEFF = EF, BB, BF in UTF-8)
>from our data
></TomParaphrase>

No, definitely not. It is a special case only at the start.


><john>
>For starters, the BOM and its UTF-8 encoding are considered to be at the
>very beginning of the file (see http://www.cl.cam.ac.uk/~mgk25/unicode.html
>by Marcus Kuhn).
>
>Moreover, according to XML 1.0, the UTF-16 BOM is considered to be *outside
>of* the data

Outside of the document, but in the octet stream.

>and used to qualify how to take the UTF-16 data and convert it
>to UCS.  It is 'metadata' that XML does not retain.
>
>In other words, the UTF-16 BOM is not U+FEFF or U+FFFE.  It is not intended
>to represent a Unicode character of data, so it's just a 16-bit binary value
>of FEFF or FFFE, the latter of which is illegal under Unicode anyway.

Well, it's exactly the Unicode character designed for that purpose,
so saying it's not a character is a bit strange, but that's not
very important.


>What I think we are saying is that we will not encode the UTF-16 BOM when
>converting to UTF-8 because the BOM is not part of the data. However, I
>think we are not clear in saying what will happen to our UTF-8 data if
>U+FEFF appeared in the UTF-16 data stream *after* the BOM.

Then it's a zero width no-break space, a perfectly legal character in
XML data, and all you have to do is transcode it to UTF-8, as for
any other character. There is no reason for special wording.


>When translating from UTF-16 to UTF-8, I would think that the UTF-16 BOM
>would be used solely to convert to UCS, but then if U+FEFF appears in the
>actual data, then the corresponding UTF-8 sequence would appear in our UTF-8
>data stream.

Yes, exactly.
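
To make the conversion concrete, here is a minimal Python sketch. The
function name utf16_to_utf8 and the big-endian default for input
without a BOM are assumptions made for illustration, not text from any
specification draft. Only a BOM at the very start is consumed; a later
U+FEFF is transcoded like any other character.

```python
import codecs

def utf16_to_utf8(data: bytes) -> bytes:
    """Transcode UTF-16 bytes to UTF-8, consuming only a leading BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):      # bytes FF FE
        text = data[2:].decode("utf-16-le")
    elif data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        text = data[2:].decode("utf-16-be")
    else:
        # Assumption for this sketch: no BOM means big-endian.
        text = data.decode("utf-16-be")
    # Any U+FEFF still in `text` is ordinary data and is kept.
    return text.encode("utf-8")

# A big-endian stream: BOM, then "a", U+FEFF, "b" as data.
sample = codecs.BOM_UTF16_BE + "a\ufeffb".encode("utf-16-be")
result = utf16_to_utf8(sample)
print(result)  # b'a\xef\xbb\xbfb'
```

The output does not start with EF BB BF, because the UTF-16 BOM was
not carried over, but the interior U+FEFF shows up as EF BB BF.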


>Indeed, the rationale for prepending of the UTF-8 for U+FEFF is at best
>confusingly stated by the Unicode 3.0 standard on p.324 (this reference was
>provided by Phillip H. Griffin of Griffin Consulting (http://asn-1.com)):
>
>         "In UTF-8 the BOM corresponds to the byte sequence EF(16)BB(16)BF(16).
>         Although there are never any questions of byte order with UTF-8 text,
>         this sequence can serve as a signature for UTF-8 encoded text where
>         the character set is unmarked."
>
>How is this a 'signature' for UTF-8?  Isn't this also a valid string prefix
>for a stream of ISO-8859-1 characters?

Valid, but extremely rare, if not nonexistent, in practice. Similar
arguments apply to most other byte sequences used as magic numbers.
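
As a rough illustration of the signature idea, the Python sketch below
checks for the three bytes and shows what the same bytes mean when read
as ISO-8859-1. The helper name has_utf8_signature is hypothetical; the
codecs constants and the "utf-8-sig" codec are from the standard
library.

```python
import codecs

def has_utf8_signature(data: bytes) -> bool:
    """Return True if the data starts with EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

signed = codecs.BOM_UTF8 + "text".encode("utf-8")

# Read as ISO-8859-1, the signature becomes three visible characters:
# U+00EF, U+00BB, U+00BF.
as_latin1 = signed.decode("iso-8859-1")

# The "utf-8-sig" codec drops a leading signature when decoding.
as_utf8 = signed.decode("utf-8-sig")

print(has_utf8_signature(signed))  # True
print(repr(as_latin1))             # 'ï»¿text'
print(repr(as_utf8))               # 'text'
```

An ISO-8859-1 text would have to begin with exactly those three
characters to be mistaken for signed UTF-8, which is why the check
works as a heuristic even though it is not a proof.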


>As well, what if the UTF-16 BOM is FFFE, which is not a valid UCS character?
>Are we to conclude that UTF-8 encoded text is unidentifiable on machines
>that would start a UTF-16 encoding with a BOM of FFFE?

You are confusing characters and bytes. The BOM as a code value is
always FEFF. It just looks like FFFE on little-endian machines, because
these store the bytes of every 16-bit value in reverse order.
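
A small Python sketch of the distinction, using only standard codecs
(the variable names are arbitrary): the character is U+FEFF in every
case, and only the UTF-16 serialization depends on byte order.

```python
BOM = "\ufeff"  # the code value is always FEFF

big = BOM.encode("utf-16-be")     # bytes FE FF
little = BOM.encode("utf-16-le")  # bytes FF FE

# Reading the little-endian bytes with the wrong byte order gives FFFE;
# reading them with the right one gives FEFF again.
misread = int.from_bytes(little, "big")
correct = int.from_bytes(little, "little")

# UTF-8 has no byte-order variants, so there is one encoding only.
utf8 = BOM.encode("utf-8")        # bytes EF BB BF

print(big.hex(), little.hex())    # feff fffe
print(hex(misread), hex(correct)) # 0xfffe 0xfeff
print(utf8.hex())                 # efbbbf
```

So a machine that writes FF FE at the start of its UTF-16 output still
means U+FEFF, and the UTF-8 form of that character is EF BB BF on any
machine.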


>I believe this is why we don't understand the application of a BOM to UTF-8
>data, neverminding actually putting U+FEFF *inside* the UTF-8 encoded data.

You don't have to understand it. You just have to understand that there
are people who may do it, and who may claim that it's okay because
nobody says it's forbidden (unless we do).

The current text is fine, if we can just leave it at that without
any more discussion. If you want, I can have the I18N WG confirm
it next week (we have a F2F in Seattle).


Regards,   Martin.

Received on Saturday, 26 August 2000 21:27:03 UTC