RE: UTF-8 and BOM from John Boyer on 2000-08-25 (w3c-ietf-xmldsig@w3.org from July to September 2000)

From: John Boyer <jboyer@PureEdge.com>
Date: Fri, 25 Aug 2000 11:13:37 -0700
To: "Martin J. Duerst" <duerst@w3.org>, <tgindin@us.ibm.com>, "Joseph M. Reagle Jr." <reagle@w3.org>
Cc: <w3c-ietf-xmldsig@w3.org>
Message-ID: <BFEDKCINEPLBDLODCODKKEJNCEAA.jboyer@PureEdge.com>
Hi Martin,

At the last teleconference, Joseph decided we would leave it as is (comment
about BOM but not surrogates) because there was not consensus to change it
either way.

Although I am still not convinced of its necessity, I really don't mind one
way or the other, and will most likely be adding the BOM comment to the next
C14N draft just because it doesn't seem to hurt anything.

Could you point me to some source I can cite which implies that a BOM at the
front of UTF-8 is actually legal UTF-8? (e.g. an URL to ISO 10646 Annex R,
and a paragraph would help).  I ask because, according to RFC2279, a BOM
does not legally decode into a UCS character, so a conformant implementation
would seemingly never generate a BOM, and it would throw an exception if
asked to decode one.  This is why I believe, in the absence of evidence to
the contrary, that the BOM comment is superfluous (but also why I don't
think it hurts to keep it).

On your last point about using too many bytes, are you referring to cases
such as having the eleven x bits of a two byte sequence containing something
less than 80 (e.g. C0 80 in place of 00)?  That type of error came across
pretty clearly in RFC2279, so I don't think another RFC would be necessary,
but are there other cases that don't come across in RFC2279?

Thanks,
John Boyer
Development Team Leader,
Distributed Processing and XML
PureEdge Solutions Inc.
Creating Binding E-Commerce
v: 250-479-8334, ext. 143  f: 250-479-3772
1-888-517-2675   http://www.PureEdge.com <http://www.pureedge.com/>



-----Original Message-----
From: w3c-ietf-xmldsig-request@w3.org
[mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of Martin J. Duerst
Sent: Friday, August 25, 2000 2:09 AM
To: John Boyer; tgindin@us.ibm.com; Joseph M. Reagle Jr.
Cc: w3c-ietf-xmldsig@w3.org
Subject: RE: UTF-8 and BOM


At 00/08/23 11:47 -0700, John Boyer wrote:
>I actually think we need to remove the comment about BOM *and* not put in a
>comment about surrogate pairs.

No. You have to keep the comment about the BOM, because both
with and without a bom is legal UTF-8.

You better remove the comment about surrogates, because encoding
individual surrogates in UTF-8 is illegal. There are other things
that are illegal and still are sometimes done (e.g. using more
than the necessary number of bytes), and if we wanted to list
all of them, we would write another RFC for UTF-8, I guess.


Regards,    Martin.





>There does not seem to be any such thing as a need for a BOM for UTF-8.  As
>for surrogate pairs...  RFC2279 [1] clearly states that
>
>A) The only correct way to convert from UTF-16 to UTF-8 is through UCS-4
>B) The only correct way to convert from UTF-16 to UCS-4 is to fix the
>surrogate pairs.
>
>Moreover, RFC2781 [2] clearly states how to fix the surrogate pairs.  It
>does not seem necessary to add more text that tells the implementer how to
>transcode.  This job has been done by these other RFCs [1,2], both of which
>are referenced in the XML Dsig WD.
>
>[1] www.ietf.org/rfc/rfc2279.txt
>[2] www.ietf.org/rfc/rfc2781.txt
>
>John Boyer
>Development Team Leader,
>Distributed Processing and XML
>PureEdge Solutions Inc.
>Creating Binding E-Commerce
>v: 250-479-8334, ext. 143  f: 250-479-3772
>1-888-517-2675   http://www.PureEdge.com <http://www.pureedge.com/>
>
>
>
>
>-----Original Message-----
>From: w3c-ietf-xmldsig-request@w3.org
>[mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of tgindin@us.ibm.com
>Sent: Wednesday, August 23, 2000 10:39 AM
>To: Joseph M. Reagle Jr.
>Cc: w3c-ietf-xmldsig@w3.org; duerst@w3.org
>Subject: Re: UTF-8 and BOM
>
>
>      If we retain wording excluding BOM's from UTF-8, as we currently have
>it, I think that we should exclude surrogates as well.
>      The current text in section 6.5.1 reads "converts the character
>encoding to UTF-8 (without any byte order mark (BOM)) ", and corresponding
>text in section 7 reads "that coded character set is UTF-8 (without a byte
>order mark (BOM))"  The new text should probably read "... UTF-8 (without a
>byte order mark (BOM) and with surrogate pairs converted to UCS-4 before
>conversion to UTF-8)" in both of these places.  I realize that RFC 2279
>(not 2379) explicitly requires surrogate conversion while it fails to
>mention BOM's for some reason, but the two issues are similar and many
>implementors do not understand the surrogate issue.  The wording about
>surrogates in versions 2.0 of the Unicode standard is actually somewhat
>similar to the wording about the "reversed byte order mark" U+FFFE.
>
>           Tom Gindin
>
Received on Friday, 25 August 2000 14:13:46 UTC