- From: John Boyer <jboyer@PureEdge.com>
- Date: Wed, 23 Aug 2000 11:47:10 -0700
- To: <tgindin@us.ibm.com>, "Joseph M. Reagle Jr." <reagle@w3.org>
- Cc: <w3c-ietf-xmldsig@w3.org>, <duerst@w3.org>
I actually think we need to remove the comment about BOM *and* not put in a comment about surrogate pairs. There does not seem to be any such thing as a need for a BOM for UTF-8. As for surrogate pairs... RFC2279 [1] clearly states that A) The only correct way to convert from UTF-16 to UTF-8 is through UCS-4 B) The only correct way to convert from UTF-16 to UCS-4 is to fix the surrogate pairs. Moreover, RFC2781 [2] clearly states how to fix the surrogate pairs. It does not seem necessary to add more text that tells the implementer how to transcode. This job has been done by these other RFCs [1,2], both of which are referenced in the XML Dsig WD. [1] www.ietf.org/rfc/rfc2279.txt [2] www.ietf.org/rfc/rfc2781.txt John Boyer Development Team Leader, Distributed Processing and XML PureEdge Solutions Inc. Creating Binding E-Commerce v: 250-479-8334, ext. 143 f: 250-479-3772 1-888-517-2675 http://www.PureEdge.com <http://www.pureedge.com/> -----Original Message----- From: w3c-ietf-xmldsig-request@w3.org [mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of tgindin@us.ibm.com Sent: Wednesday, August 23, 2000 10:39 AM To: Joseph M. Reagle Jr. Cc: w3c-ietf-xmldsig@w3.org; duerst@w3.org Subject: Re: UTF-8 and BOM If we retain wording excluding BOM's from UTF-8, as we currently have it, I think that we should exclude surrogates as well. The current text in section 6.5.1 reads "converts the character encoding to UTF-8 (without any byte order mark (BOM)) ", and corresponding text in section 7 reads "that coded character set is UTF-8 (without a byte order mark (BOM))" The new text should probably read "... UTF-8 (without a byte order mark (BOM) and with surrogate pairs converted to UCS-4 before conversion to UTF-8)" in both of these places. I realize that RFC 2279 (not 2379) explicitly requires surrogate conversion while it fails to mention BOM's for some reason, but the two issues are similar and many implementors do not understand the surrogate issue. The wording about surrogates in versions 2.0 of the Unicode standard is actually somewhat similar to the wording about the "reversed byte order mark" U+FFFE. Tom Gindin
Received on Wednesday, 23 August 2000 14:47:17 UTC