RE: UTF-8 and BOM

<TomParaphrase>
Must we strip all zero width no break spaces (U+FEFF = EF, BB, BF in UTF-8)
from our data
</TomParaphrase>

<john>
At last we see where our understated comment gets us into some
interoperability trouble.

For starters, the BOM and its UTF-8 encoding are considered to be at the
very beginning of the file (see http://www.cl.cam.ac.uk/~mgk25/unicode.html
by Marcus Kuhn).

Moreover, according to XML 1.0, the UTF-16 BOM is considered to be *outside
of* the data and used to qualify how to take the UTF-16 data and convert it
to UCS.  It is 'metadata' that XML does not retain.

In other words, the UTF-16 BOM is not U+FEFF or U+FFFE.  It is not intended
to represent a Unicode character of data, so it's just a 16-bit binary value
of FEFF or FFFE, the latter of which is illegal under Unicode anyway.

What I think we are saying is that we will not encode the UTF-16 BOM when
converting to UTF-8 because the BOM is not part of the data. However, I
think we are not clear in saying what will happen to our UTF-8 data if
U+FEFF appeared in the UTF-16 data stream *after* the BOM.

When translating from UTF-16 to UTF-8, I would think that the UTF-16 BOM
would be used solely to convert to UCS, but then if U+FEFF appears in the
actual data, then the corresponding UTF-8 sequence would appear in our UTF-8
data stream.

Indeed, the rationale for prepending of the UTF-8 for U+FEFF is at best
confusingly stated by the Unicode 3.0 standard on p.324 (this reference was
provided by Phillip H. Griffin of Griffin Consulting (http://asn-1.com)):

	"In UTF-8 the BOM corresponds to the byte sequence EF(16)BB(16)BF(16).
	Although there are never any questions of byte order with UTF-8 text,
	this sequence can serve as a signature for UTF-8 encoded text where
	the character set is unmarked."

How is this a 'signature' for UTF-8?  Isn't this also a valid string prefix
for a stream of ISO-8859-1 characters?

As well, what if the UTF-16 BOM is FFFE, which is not a valid UCS character?
Are we to conclude that UTF-8 encoded text is unidentifiable on machines
that would start a UTF-16 encoding with a BOM of FFFE?

Finally, in trying to use a unicode character normally associated with a BOM
as an ineffective signature, it is clearly not being used as a byte order
mark.  This is at least in part because it does not make sense to impose a
BOM on the actual UTF-8 data; a machine will represent the UCS characters
corresponding to the UTF-8 in the byte order of the machine, not the byte
order dictated by the UTF-8 encoding.

I believe this is why we don't understand the application of a BOM to UTF-8
data, neverminding actually putting U+FEFF *inside* the UTF-8 encoded data.

John Boyer
Development Team Leader,
Distributed Processing and XML
PureEdge Solutions Inc.
Creating Binding E-Commerce
v: 250-479-8334, ext. 143  f: 250-479-3772
1-888-517-2675   http://www.PureEdge.com <http://www.pureedge.com/>

</john>


-----Original Message-----
From: tgindin@us.ibm.com [mailto:tgindin@us.ibm.com]
Sent: Friday, August 25, 2000 12:26 PM
To: John Boyer
Cc: Martin J. Duerst; Joseph M. Reagle Jr.; w3c-ietf-xmldsig@w3.org
Subject: RE: UTF-8 and BOM


     I have only one disagreement with this.  Of course, BOM is not a legal
UTF-8 byte sequence - and neither is the UCS-2 representation of any
character in the Latin-1 supplemental set such as u-umlaut, which appears
in Martin's last name.  However, if you convert a UCS-2 string containing a
BOM into UTF-8, the BOM will be converted into the triple-byte sequence
xEF,xBB,xBF which is legal UTF-8, and converting it back to UCS-2 yields a
BOM or a zero-width no break space.  So, are we supposed to strip all
occurrences of "zero-width no break space" or not?

          Tom Gindin

"John Boyer" <jboyer@PureEdge.com> on 08/25/2000 02:45:41 PM

To:   Tom Gindin/Watson/IBM@IBMUS
cc:   "Martin J. Duerst" <duerst@w3.org>, "Joseph M. Reagle Jr."
      <reagle@w3.org>, <w3c-ietf-xmldsig@w3.org>
Subject:  RE: UTF-8 and BOM



<Tom>
     Actually, John, a UCS-2 BOM in big-endian order (U+FEFF) is a valid
UCS character known as "zero width no break space".  One in little-endian
order is not.  This is documented, among other places, in section 3.8
(Special Character Properties) of version 2.0 of the Unicode standard.
     If you follow the table at the top of page 3 of RFC 2279, you will
always encode a character in the minimal number of bytes.  This table
appears on the same page as the surrogate pair warning, so if there is no
need to warn implementors about surrogate pairs in our spec there is also
no need to warn them about overlong character encodings.

          Tom Gindin
</Tom>
<john>
Obviously I agree about the surrogate pair and overlong character warnings.
1) Martin's message didn't specify the overlong stuff, so I'm curious if he
has other warnings that *don't* come across in RFC2279.  Like you, I feel
that we have referred to RFC2279, so things that are clear in that RFC need
not be repeated.
2) The BOM may be a valid UCS character, but it is not a valid UTF-8 byte
sequence.  Martin's message asserted that it was valid UTF-8, which is what
I disagreed with.  Indeed, everything between 0 and 7FFFFFFF is a valid UCS
character (even if most have not been assigned to symbols yet).

John Boyer
Development Team Leader,
Distributed Processing and XML
PureEdge Solutions Inc.
Creating Binding E-Commerce
v: 250-479-8334, ext. 143  f: 250-479-3772
1-888-517-2675   http://www.PureEdge.com <http://www.pureedge.com/>
</john>

"John Boyer" <jboyer@PureEdge.com> on 08/25/2000 02:13:37 PM

To:   "Martin J. Duerst" <duerst@w3.org>, Tom Gindin/Watson/IBM@IBMUS,
      "Joseph M. Reagle Jr." <reagle@w3.org>
cc:   <w3c-ietf-xmldsig@w3.org>
Subject:  RE: UTF-8 and BOM



Hi Martin,

At the last teleconference, Joseph decided we would leave it as is (comment
about BOM but not surrogates) because there was not consensus to change it
either way.

Although I am still not convinced of its necessity, I really don't mind one
way or the other, and will most likely be adding the BOM comment to the
next
C14N draft just because it doesn't seem to hurt anything.

Could you point me to some source I can cite which implies that a BOM at
the
front of UTF-8 is actually legal UTF-8? (e.g. an URL to ISO 10646 Annex R,
and a paragraph would help).  I ask because, according to RFC2279, a BOM
does not legally decode into a UCS character, so a conformant
implementation
would seemingly never generate a BOM, and it would throw an exception if
asked to decode one.  This is why I believe, in the absence of evidence to
the contrary, that the BOM comment is superfluous (but also why I don't
think it hurts to keep it).

On your last point about using too many bytes, are you referring to cases
such as having the eleven x bits of a two byte sequence containing
something
less than 80 (e.g. C0 80 in place of 00)?  That type of error came across
pretty clearly in RFC2279, so I don't think another RFC would be necessary,
but are there other cases that don't come across in RFC2279?

Thanks,
John Boyer
Development Team Leader,
Distributed Processing and XML
PureEdge Solutions Inc.
Creating Binding E-Commerce
v: 250-479-8334, ext. 143  f: 250-479-3772
1-888-517-2675   http://www.PureEdge.com <http://www.pureedge.com/>



-----Original Message-----
From: w3c-ietf-xmldsig-request@w3.org
[mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of Martin J. Duerst
Sent: Friday, August 25, 2000 2:09 AM
To: John Boyer; tgindin@us.ibm.com; Joseph M. Reagle Jr.
Cc: w3c-ietf-xmldsig@w3.org
Subject: RE: UTF-8 and BOM


At 00/08/23 11:47 -0700, John Boyer wrote:
>I actually think we need to remove the comment about BOM *and* not put in
a
>comment about surrogate pairs.

No. You have to keep the comment about the BOM, because both
with and without a bom is legal UTF-8.

You better remove the comment about surrogates, because encoding
individual surrogates in UTF-8 is illegal. There are other things
that are illegal and still are sometimes done (e.g. using more
than the necessary number of bytes), and if we wanted to list
all of them, we would write another RFC for UTF-8, I guess.


Regards,    Martin.





>There does not seem to be any such thing as a need for a BOM for UTF-8.
As
>for surrogate pairs...  RFC2279 [1] clearly states that
>
>A) The only correct way to convert from UTF-16 to UTF-8 is through UCS-4
>B) The only correct way to convert from UTF-16 to UCS-4 is to fix the
>surrogate pairs.
>
>Moreover, RFC2781 [2] clearly states how to fix the surrogate pairs.  It
>does not seem necessary to add more text that tells the implementer how to
>transcode.  This job has been done by these other RFCs [1,2], both of
which
>are referenced in the XML Dsig WD.
>
>[1] www.ietf.org/rfc/rfc2279.txt
>[2] www.ietf.org/rfc/rfc2781.txt
>
>John Boyer
>Development Team Leader,
>Distributed Processing and XML
>PureEdge Solutions Inc.
>Creating Binding E-Commerce
>v: 250-479-8334, ext. 143  f: 250-479-3772
>1-888-517-2675   http://www.PureEdge.com <http://www.pureedge.com/>
>
>
>
>
>-----Original Message-----
>From: w3c-ietf-xmldsig-request@w3.org
>[mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of tgindin@us.ibm.com
>Sent: Wednesday, August 23, 2000 10:39 AM
>To: Joseph M. Reagle Jr.
>Cc: w3c-ietf-xmldsig@w3.org; duerst@w3.org
>Subject: Re: UTF-8 and BOM
>
>
>      If we retain wording excluding BOM's from UTF-8, as we currently
have
>it, I think that we should exclude surrogates as well.
>      The current text in section 6.5.1 reads "converts the character
>encoding to UTF-8 (without any byte order mark (BOM)) ", and corresponding
>text in section 7 reads "that coded character set is UTF-8 (without a byte
>order mark (BOM))"  The new text should probably read "... UTF-8 (without
a
>byte order mark (BOM) and with surrogate pairs converted to UCS-4 before
>conversion to UTF-8)" in both of these places.  I realize that RFC 2279
>(not 2379) explicitly requires surrogate conversion while it fails to
>mention BOM's for some reason, but the two issues are similar and many
>implementors do not understand the surrogate issue.  The wording about
>surrogates in versions 2.0 of the Unicode standard is actually somewhat
>similar to the wording about the "reversed byte order mark" U+FFFE.
>
>           Tom Gindin

Received on Friday, 25 August 2000 16:09:11 UTC