RE: UTF-8 and BOM

     I have only one disagreement with this.  Of course, BOM is not a legal
UTF-8 byte sequence - and neither is the UCS-2 representation of any
character in the Latin-1 supplemental set such as u-umlaut, which appears
in Martin's last name.  However, if you convert a UCS-2 string containing a
BOM into UTF-8, the BOM will be converted into the triple-byte sequence
xEF,xBB,xBF which is legal UTF-8, and converting it back to UCS-2 yields a
BOM or a zero-width no break space.  So, are we supposed to strip all
occurrences of "zero-width no break space" or not?
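
The round trip described above can be checked with a short Python sketch. It relies only on Python's built-in codecs; the helper name is_valid_utf8 is made up for this illustration and is not taken from any spec under discussion.

```python
# U+FEFF is the BOM / "zero width no break space" character.
BOM = "\ufeff"

# UCS-2 big-endian serialization of the character: bytes FE FF.
ucs2_be = BOM.encode("utf-16-be")

# UTF-8 serialization of the same character: bytes EF BB BF.
utf8 = BOM.encode("utf-8")

# Strict UTF-8 decoding gives back the same code point, not an error.
round_trip = utf8.decode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The raw UCS-2 bytes FE FF are not a UTF-8 sequence, but EF BB BF is.
raw_bom_is_utf8 = is_valid_utf8(ucs2_be)
converted_bom_is_utf8 = is_valid_utf8(utf8)
```

Note that Python's plain "utf-8" codec keeps a leading U+FEFF on decode. Whether a canonicalizer should strip it is exactly the open question here.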

          Tom Gindin

"John Boyer" <> on 08/25/2000 02:45:41 PM

To:   Tom Gindin/Watson/IBM@IBMUS
cc:   "Martin J. Duerst" <>, "Joseph M. Reagle Jr."
      <>, <>
Subject:  RE: UTF-8 and BOM

     Actually, John, a UCS-2 BOM in big-endian order (U+FEFF) is a valid
UCS character known as "zero width no break space".  One in little-endian
order is not.  This is documented, among other places, in section 3.8
(Special Character Properties) of version 2.0 of the Unicode standard.
     If you follow the table at the top of page 3 of RFC 2279, you will
always encode a character in the minimal number of bytes.  This table
appears on the same page as the surrogate pair warning, so if there is no
need to warn implementors about surrogate pairs in our spec there is also
no need to warn them about overlong character encodings.
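
As a rough illustration of why following the table always gives the shortest form, here is a minimal Python sketch of an encoder driven by the RFC 2279 ranges. The names RANGES and encode_rfc2279 are invented for this example, and the sketch deliberately ignores the surrogate question.

```python
# Each entry: (upper bound of the UCS-4 range, byte count, lead-byte marker),
# following the ranges in the RFC 2279 table.
RANGES = [
    (0x7F, 1, 0x00),
    (0x7FF, 2, 0xC0),
    (0xFFFF, 3, 0xE0),
    (0x1FFFFF, 4, 0xF0),
    (0x3FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]


def encode_rfc2279(cp: int) -> bytes:
    """Hypothetical encoder: picks the first (smallest) range holding cp."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point outside the UCS-4 range")
    for upper, length, lead in RANGES:
        if cp <= upper:
            break
    if length == 1:
        return bytes([cp])
    out = []
    # Fill the trailing 10xxxxxx bytes from the low-order bits upward.
    for _ in range(length - 1):
        out.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever bits remain go into the lead byte.
    out.append(lead | cp)
    return bytes(reversed(out))
```

Because the first matching range is always chosen, the encoder cannot produce an overlong form such as C0 80 for U+0000.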

          Tom Gindin

Obviously I agree about the surrogate pair and overlong character warnings.
1) Martin's message didn't specify the overlong stuff, so I'm curious if he
has other warnings that *don't* come across in RFC2279.  Like you, I feel
that we have referred to RFC2279, so things that are clear in that RFC need
not be repeated.
2) The BOM may be a valid UCS character, but it is not a valid UTF-8 byte
sequence.  Martin's message asserted that it was valid UTF-8, which is what
I disagreed with.  Indeed, everything between 0 and 7FFFFFFF is a valid UCS
character (even if most have not been assigned to symbols yet).

John Boyer
Development Team Leader,
Distributed Processing and XML
PureEdge Solutions Inc.
Creating Binding E-Commerce
v: 250-479-8334, ext. 143  f: 250-479-3772
1-888-517-2675 <>

"John Boyer" <> on 08/25/2000 02:13:37 PM

To:   "Martin J. Duerst" <>, Tom Gindin/Watson/IBM@IBMUS,
      "Joseph M. Reagle Jr." <>
cc:   <>
Subject:  RE: UTF-8 and BOM

Hi Martin,

At the last teleconference, Joseph decided we would leave it as is (comment
about BOM but not surrogates) because there was not consensus to change it
either way.

Although I am still not convinced of its necessity, I really don't mind one
way or the other, and will most likely be adding the BOM comment to the
C14N draft just because it doesn't seem to hurt anything.

Could you point me to some source I can cite which implies that a BOM at
front of UTF-8 is actually legal UTF-8? (e.g. an URL to ISO 10646 Annex R,
and a paragraph would help).  I ask because, according to RFC2279, a BOM
does not legally decode into a UCS character, so a conformant implementation
would seemingly never generate a BOM, and it would throw an exception if
asked to decode one.  This is why I believe, in the absence of evidence to
the contrary, that the BOM comment is superfluous (but also why I don't
think it hurts to keep it).

On your last point about using too many bytes, are you referring to cases
such as the eleven x bits of a two-byte sequence containing a value
less than 80 hex (e.g. C0 80 in place of 00)?  That type of error came across
pretty clearly in RFC2279, so I don't think another RFC would be necessary,
but are there other cases that don't come across in RFC2279?
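
For concreteness, the C0 80 case can be sketched in Python as follows. The function decode_two_byte is a hypothetical helper written for this illustration; the second check uses Python's built-in strict UTF-8 decoder.

```python
def decode_two_byte(data: bytes) -> int:
    """Hypothetical decoder for one 110xxxxx 10xxxxxx sequence."""
    b0, b1 = data
    if b0 & 0xE0 != 0xC0 or b1 & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Reassemble the eleven x bits.
    cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
    # Values below 0x80 fit in one byte, so this form is overlong.
    if cp < 0x80:
        raise ValueError("overlong two-byte encoding")
    return cp


def builtin_rejects(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


u_umlaut = decode_two_byte(b"\xc3\xbc")       # legal: U+00FC
overlong_rejected = builtin_rejects(b"\xc0\x80")  # overlong NUL
```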

John Boyer
Development Team Leader,
Distributed Processing and XML
PureEdge Solutions Inc.
Creating Binding E-Commerce
v: 250-479-8334, ext. 143  f: 250-479-3772
1-888-517-2675 <>

-----Original Message-----
[]On Behalf Of Martin J. Duerst
Sent: Friday, August 25, 2000 2:09 AM
To: John Boyer;; Joseph M. Reagle Jr.
Subject: RE: UTF-8 and BOM

At 00/08/23 11:47 -0700, John Boyer wrote:
>I actually think we need to remove the comment about BOM *and* not put in
>a comment about surrogate pairs.

No. You have to keep the comment about the BOM, because both
with and without a BOM are legal UTF-8.

You better remove the comment about surrogates, because encoding
individual surrogates in UTF-8 is illegal. There are other things
that are illegal and still are sometimes done (e.g. using more
than the necessary number of bytes), and if we wanted to list
all of them, we would write another RFC for UTF-8, I guess.
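
The surrogate point can be illustrated with a small Python sketch. The arithmetic follows the usual UTF-16 pairing rule (as in RFC 2781); combine_surrogates is a name invented for this example.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Hypothetical helper: fold a UTF-16 surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair D800 DC00 stands for U+10000, which encodes as four UTF-8 bytes.
cp = combine_surrogates(0xD800, 0xDC00)
encoded = chr(cp).encode("utf-8")

# Encoding a single surrogate directly is refused by Python's strict codec.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
```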

Regards,    Martin.

>There does not seem to be any such thing as a need for a BOM for UTF-8.
>As for surrogate pairs...  RFC2279 [1] clearly states that
>A) The only correct way to convert from UTF-16 to UTF-8 is through UCS-4
>B) The only correct way to convert from UTF-16 to UCS-4 is to fix the
>surrogate pairs.
>Moreover, RFC2781 [2] clearly states how to fix the surrogate pairs.  It
>does not seem necessary to add more text that tells the implementer how to
>transcode.  This job has been done by these other RFCs [1,2], both of
>which are referenced in the XML Dsig WD.
>John Boyer
>Development Team Leader,
>Distributed Processing and XML
>PureEdge Solutions Inc.
>Creating Binding E-Commerce
>v: 250-479-8334, ext. 143  f: 250-479-3772
>1-888-517-2675 <>
>-----Original Message-----
>[]On Behalf Of
>Sent: Wednesday, August 23, 2000 10:39 AM
>To: Joseph M. Reagle Jr.
>Subject: Re: UTF-8 and BOM
>      If we retain wording excluding BOM's from UTF-8, as we currently
>have it, I think that we should exclude surrogates as well.
>      The current text in section 6.5.1 reads "converts the character
>encoding to UTF-8 (without any byte order mark (BOM)) ", and corresponding
>text in section 7 reads "that coded character set is UTF-8 (without a byte
>order mark (BOM))"  The new text should probably read "... UTF-8 (without
>a byte order mark (BOM) and with surrogate pairs converted to UCS-4 before
>conversion to UTF-8)" in both of these places.  I realize that RFC 2279
>(not 2379) explicitly requires surrogate conversion while it fails to
>mention BOM's for some reason, but the two issues are similar and many
>implementors do not understand the surrogate issue.  The wording about
>surrogates in version 2.0 of the Unicode standard is actually somewhat
>similar to the wording about the "reversed byte order mark" U+FFFE.
>           Tom Gindin

Received on Friday, 25 August 2000 15:26:07 UTC