RE: UTF-8 and BOM from Martin J. Duerst on 2000-08-25 (w3c-ietf-xmldsig@w3.org from July to September 2000)

From: Martin J. Duerst <duerst@w3.org>
Date: Fri, 25 Aug 2000 18:09:13 +0900
To: "John Boyer" <jboyer@PureEdge.com>, <tgindin@us.ibm.com>, "Joseph M. Reagle Jr." <reagle@w3.org>
Cc: <w3c-ietf-xmldsig@w3.org>
Message-Id: <4.2.0.58.J.20000825180654.036e5b70@sh.w3.mag.keio.ac.jp>

At 00/08/23 11:47 -0700, John Boyer wrote:
>I actually think we need to remove the comment about BOM *and* not put in a
>comment about surrogate pairs.

No. You have to keep the comment about the BOM, because both
with and without a bom is legal UTF-8.

You better remove the comment about surrogates, because encoding
individual surrogates in UTF-8 is illegal. There are other things
that are illegal and still are sometimes done (e.g. using more
than the necessary number of bytes), and if we wanted to list
all of them, we would write another RFC for UTF-8, I guess.


Regards,    Martin.





>There does not seem to be any such thing as a need for a BOM for UTF-8.  As
>for surrogate pairs...  RFC2279 [1] clearly states that
>
>A) The only correct way to convert from UTF-16 to UTF-8 is through UCS-4
>B) The only correct way to convert from UTF-16 to UCS-4 is to fix the
>surrogate pairs.
>
>Moreover, RFC2781 [2] clearly states how to fix the surrogate pairs.  It
>does not seem necessary to add more text that tells the implementer how to
>transcode.  This job has been done by these other RFCs [1,2], both of which
>are referenced in the XML Dsig WD.
>
>[1] www.ietf.org/rfc/rfc2279.txt
>[2] www.ietf.org/rfc/rfc2781.txt
>
>John Boyer
>Development Team Leader,
>Distributed Processing and XML
>PureEdge Solutions Inc.
>Creating Binding E-Commerce
>v: 250-479-8334, ext. 143  f: 250-479-3772
>1-888-517-2675   http://www.PureEdge.com <http://www.pureedge.com/>
>
>
>
>
>-----Original Message-----
>From: w3c-ietf-xmldsig-request@w3.org
>[mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of tgindin@us.ibm.com
>Sent: Wednesday, August 23, 2000 10:39 AM
>To: Joseph M. Reagle Jr.
>Cc: w3c-ietf-xmldsig@w3.org; duerst@w3.org
>Subject: Re: UTF-8 and BOM
>
>
>      If we retain wording excluding BOM's from UTF-8, as we currently have
>it, I think that we should exclude surrogates as well.
>      The current text in section 6.5.1 reads "converts the character
>encoding to UTF-8 (without any byte order mark (BOM)) ", and corresponding
>text in section 7 reads "that coded character set is UTF-8 (without a byte
>order mark (BOM))"  The new text should probably read "... UTF-8 (without a
>byte order mark (BOM) and with surrogate pairs converted to UCS-4 before
>conversion to UTF-8)" in both of these places.  I realize that RFC 2279
>(not 2379) explicitly requires surrogate conversion while it fails to
>mention BOM's for some reason, but the two issues are similar and many
>implementors do not understand the surrogate issue.  The wording about
>surrogates in versions 2.0 of the Unicode standard is actually somewhat
>similar to the wording about the "reversed byte order mark" U+FFFE.
>
>           Tom Gindin
>

Received on Friday, 25 August 2000 05:20:49 UTC