RE: UTF-8 and BOM from John Boyer on 2000-08-23 (w3c-ietf-xmldsig@w3.org from July to September 2000)

From: John Boyer <jboyer@PureEdge.com>
Date: Wed, 23 Aug 2000 11:47:10 -0700
To: <tgindin@us.ibm.com>, "Joseph M. Reagle Jr." <reagle@w3.org>
Cc: <w3c-ietf-xmldsig@w3.org>, <duerst@w3.org>
Message-ID: <BFEDKCINEPLBDLODCODKKEIPCEAA.jboyer@PureEdge.com>

I actually think we need to remove the comment about BOM *and* not put in a
comment about surrogate pairs.

There does not seem to be any such thing as a need for a BOM for UTF-8.  As
for surrogate pairs...  RFC2279 [1] clearly states that

A) The only correct way to convert from UTF-16 to UTF-8 is through UCS-4
B) The only correct way to convert from UTF-16 to UCS-4 is to fix the
surrogate pairs.

Moreover, RFC2781 [2] clearly states how to fix the surrogate pairs.  It
does not seem necessary to add more text that tells the implementer how to
transcode.  This job has been done by these other RFCs [1,2], both of which
are referenced in the XML Dsig WD.

[1] www.ietf.org/rfc/rfc2279.txt
[2] www.ietf.org/rfc/rfc2781.txt

John Boyer
Development Team Leader,
Distributed Processing and XML
PureEdge Solutions Inc.
Creating Binding E-Commerce
v: 250-479-8334, ext. 143  f: 250-479-3772
1-888-517-2675   http://www.PureEdge.com <http://www.pureedge.com/>




-----Original Message-----
From: w3c-ietf-xmldsig-request@w3.org
[mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of tgindin@us.ibm.com
Sent: Wednesday, August 23, 2000 10:39 AM
To: Joseph M. Reagle Jr.
Cc: w3c-ietf-xmldsig@w3.org; duerst@w3.org
Subject: Re: UTF-8 and BOM


     If we retain wording excluding BOM's from UTF-8, as we currently have
it, I think that we should exclude surrogates as well.
     The current text in section 6.5.1 reads "converts the character
encoding to UTF-8 (without any byte order mark (BOM)) ", and corresponding
text in section 7 reads "that coded character set is UTF-8 (without a byte
order mark (BOM))"  The new text should probably read "... UTF-8 (without a
byte order mark (BOM) and with surrogate pairs converted to UCS-4 before
conversion to UTF-8)" in both of these places.  I realize that RFC 2279
(not 2379) explicitly requires surrogate conversion while it fails to
mention BOM's for some reason, but the two issues are similar and many
implementors do not understand the surrogate issue.  The wording about
surrogates in versions 2.0 of the Unicode standard is actually somewhat
similar to the wording about the "reversed byte order mark" U+FFFE.

          Tom Gindin

Received on Wednesday, 23 August 2000 14:47:17 UTC