- From: Donald E. Eastlake 3rd <dee3@torque.pothole.com>
- Date: Fri, 25 Aug 2000 13:43:40 -0400
- To: <w3c-ietf-xmldsig@w3.org>
I do not see any consensus for changing from the wording which says
UTF-8 without BOM and does not mention surrogate pairs. Unless such a
consensus develops, we will stay with the current wording.

Thanks,
Donald

From: "Martin J. Duerst" <duerst@w3.org>
Resent-Date: Fri, 25 Aug 2000 05:21:04 -0400 (EDT)
Resent-Message-Id: <200008250921.e7P9L4v18419@www19.w3.org>
Message-Id: <4.2.0.58.J.20000825180654.036e5b70@sh.w3.mag.keio.ac.jp>
Date: Fri, 25 Aug 2000 18:09:13 +0900
To: "John Boyer" <jboyer@PureEdge.com>, <tgindin@us.ibm.com>, "Joseph M. Reagle Jr." <reagle@w3.org>
Cc: <w3c-ietf-xmldsig@w3.org>
In-Reply-To: <BFEDKCINEPLBDLODCODKKEIPCEAA.jboyer@PureEdge.com>
References: <85256944.0060FDBD.00@D51MTA04.pok.ibm.com>

>At 00/08/23 11:47 -0700, John Boyer wrote:
>>I actually think we need to remove the comment about BOM *and* not put in a
>>comment about surrogate pairs.
>
>No. You have to keep the comment about the BOM, because both
>with and without a bom is legal UTF-8.
>
>You better remove the comment about surrogates, because encoding
>individual surrogates in UTF-8 is illegal. There are other things
>that are illegal and still are sometimes done (e.g. using more
>than the necessary number of bytes), and if we wanted to list
>all of them, we would write another RFC for UTF-8, I guess.
>
>
>Regards, Martin.
>
>
>>There does not seem to be any such thing as a need for a BOM for UTF-8. As
>>for surrogate pairs... RFC2279 [1] clearly states that
>>
>>A) The only correct way to convert from UTF-16 to UTF-8 is through UCS-4
>>B) The only correct way to convert from UTF-16 to UCS-4 is to fix the
>>surrogate pairs.
>>
>>Moreover, RFC2781 [2] clearly states how to fix the surrogate pairs. It
>>does not seem necessary to add more text that tells the implementer how to
>>transcode. This job has been done by these other RFCs [1,2], both of which
>>are referenced in the XML Dsig WD.
>>
>>[1] www.ietf.org/rfc/rfc2279.txt
>>[2] www.ietf.org/rfc/rfc2781.txt
>>
>>John Boyer
>>Development Team Leader,
>>Distributed Processing and XML
>>PureEdge Solutions Inc.
>>Creating Binding E-Commerce
>>v: 250-479-8334, ext. 143  f: 250-479-3772
>>1-888-517-2675   http://www.PureEdge.com <http://www.pureedge.com/>
>>
>>
>>-----Original Message-----
>>From: w3c-ietf-xmldsig-request@w3.org
>>[mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of tgindin@us.ibm.com
>>Sent: Wednesday, August 23, 2000 10:39 AM
>>To: Joseph M. Reagle Jr.
>>Cc: w3c-ietf-xmldsig@w3.org; duerst@w3.org
>>Subject: Re: UTF-8 and BOM
>>
>>
>>     If we retain wording excluding BOM's from UTF-8, as we currently have
>>it, I think that we should exclude surrogates as well.
>>     The current text in section 6.5.1 reads "converts the character
>>encoding to UTF-8 (without any byte order mark (BOM)) ", and corresponding
>>text in section 7 reads "that coded character set is UTF-8 (without a byte
>>order mark (BOM))"  The new text should probably read "... UTF-8 (without a
>>byte order mark (BOM) and with surrogate pairs converted to UCS-4 before
>>conversion to UTF-8)" in both of these places.  I realize that RFC 2279
>>(not 2379) explicitly requires surrogate conversion while it fails to
>>mention BOM's for some reason, but the two issues are similar and many
>>implementors do not understand the surrogate issue.  The wording about
>>surrogates in versions 2.0 of the Unicode standard is actually somewhat
>>similar to the wording about the "reversed byte order mark" U+FFFE.
>>
>>          Tom Gindin
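
For readers who want the transcoding steps discussed above in concrete form, here is a minimal Python sketch, not text from the working draft. It assumes the input is a list of UTF-16 code units; the names combine_surrogates and utf16_units_to_utf8 are hypothetical, chosen only for this illustration. It combines each surrogate pair into a single code point (the UCS-4 value) before encoding to UTF-8, rejects unpaired surrogates, and drops a leading BOM from the output.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one code point (UCS-4 value)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_units_to_utf8(units: list[int]) -> bytes:
    """Convert UTF-16 code units to UTF-8 via code points, without a BOM."""
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= len(units):
                raise ValueError("unpaired high surrogate")
            code_point = combine_surrogates(unit, units[i + 1])
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate may not appear on its own.
            raise ValueError("unpaired low surrogate")
        else:
            code_point = unit
            i += 1
        chars.append(chr(code_point))
    data = "".join(chars).encode("utf-8")
    # Drop a leading BOM so the result is "UTF-8 without BOM".
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    return data
```

For example, the code units FEFF 0041 D834 DD1E (a BOM, "A", and the surrogate pair for U+1D11E) come out as the four-byte UTF-8 sequence for U+1D11E preceded by the single byte for "A", with no BOM bytes at the front.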
Received on Friday, 25 August 2000 13:40:47 UTC