Re: UTF-8 and BOM

I do not see any consensus for changing from the wording which says
UTF-8 without BOM and does not mention surrogate pairs.  Unless such
a consensus develops, we will stay with the current wording.

Thanks,
Donald

From:  "Martin J. Duerst" <duerst@w3.org>
Resent-Date:  Fri, 25 Aug 2000 05:21:04 -0400 (EDT)
Resent-Message-Id:  <200008250921.e7P9L4v18419@www19.w3.org>
Message-Id:  <4.2.0.58.J.20000825180654.036e5b70@sh.w3.mag.keio.ac.jp>
Date:  Fri, 25 Aug 2000 18:09:13 +0900
To:  "John Boyer" <jboyer@PureEdge.com>, <tgindin@us.ibm.com>,
            "Joseph M. Reagle Jr." <reagle@w3.org>
Cc:  <w3c-ietf-xmldsig@w3.org>
In-Reply-To:  <BFEDKCINEPLBDLODCODKKEIPCEAA.jboyer@PureEdge.com>
References:  <85256944.0060FDBD.00@D51MTA04.pok.ibm.com>

>At 00/08/23 11:47 -0700, John Boyer wrote:
>>I actually think we need to remove the comment about BOM *and* not put in a
>>comment about surrogate pairs.
>
>No. You have to keep the comment about the BOM, because UTF-8
>is legal both with and without a BOM.
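A minimal sketch of what "UTF-8 without a BOM" means at the byte level, assuming Python 3. The helper name `strip_utf8_bom` is hypothetical and is not taken from the draft; it only illustrates that a decoder may see either form, so a rule that wants the BOM-less form has to say so.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# U+FEFF at the start of a text encodes to the three-byte BOM in UTF-8.
with_bom = "\ufeffabc".encode("utf-8")
without_bom = strip_utf8_bom(with_bom)
```

Both `with_bom` and `without_bom` decode as UTF-8 without error, which is why the wording has to choose one of them explicitly.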
>
>You had better remove the comment about surrogates, because encoding
>individual surrogates in UTF-8 is illegal. There are other things
>that are illegal and still are sometimes done (e.g. using more
>than the necessary number of bytes), and if we wanted to list
>all of them, we would write another RFC for UTF-8, I guess.
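As an illustration of the two illegal forms mentioned above, here is a small sketch assuming Python 3, whose strict UTF-8 decoder rejects both individually encoded surrogates and over-long byte sequences. The function name `is_well_formed_utf8` is hypothetical.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The single surrogate U+D800 pushed through the three-byte UTF-8 pattern.
lone_surrogate = b"\xed\xa0\x80"

# "/" (U+002F) written with two bytes instead of the necessary one.
overlong_slash = b"\xc0\xaf"

# A character outside the BMP (U+1D11E), correctly encoded in four bytes.
four_byte_char = b"\xf0\x9d\x84\x9e"
```

A decoder that already rejects these inputs needs no extra wording in the specification to exclude them.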
>
>
>Regards,    Martin.
>
>>There does not seem to be any such thing as a need for a BOM for UTF-8.  As
>>for surrogate pairs...  RFC2279 [1] clearly states that
>>
>>A) The only correct way to convert from UTF-16 to UTF-8 is through UCS-4
>>B) The only correct way to convert from UTF-16 to UCS-4 is to fix the
>>surrogate pairs.
>>
>>Moreover, RFC2781 [2] clearly states how to fix the surrogate pairs.  It
>>does not seem necessary to add more text that tells the implementer how to
>>transcode.  This job has been done by these other RFCs [1,2], both of which
>>are referenced in the XML Dsig WD.
>>
>>[1] www.ietf.org/rfc/rfc2279.txt
>>[2] www.ietf.org/rfc/rfc2781.txt
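The conversion path described in points A and B can be sketched as follows, assuming Python 3 and input given as a list of 16-bit code units. The function names are hypothetical; the arithmetic follows the surrogate-pair decoding rule in RFC 2781, and the final step relies on Python's built-in UTF-8 encoder.

```python
from typing import List


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one UCS-4 code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_units_to_utf8(units: List[int]) -> bytes:
    """Convert UTF-16 code units to UTF-8 by way of UCS-4 code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units):
                raise ValueError("unpaired high surrogate")
            code_points.append(combine_surrogates(unit, units[i + 1]))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            code_points.append(unit)
            i += 1
    # Encode whole code points, never individual surrogates.
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


# U+1D11E is represented in UTF-16 as the pair D834 DD1E.
clef_utf8 = utf16_units_to_utf8([0xD834, 0xDD1E])
```

Here `clef_utf8` is the four-byte sequence F0 9D 84 9E, rather than two three-byte sequences produced by encoding each surrogate on its own.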
>>
>>John Boyer
>>Development Team Leader,
>>Distributed Processing and XML
>>PureEdge Solutions Inc.
>>Creating Binding E-Commerce
>>v: 250-479-8334, ext. 143  f: 250-479-3772
>>1-888-517-2675   http://www.PureEdge.com <http://www.pureedge.com/>
>>
>>
>>
>>
>>-----Original Message-----
>>From: w3c-ietf-xmldsig-request@w3.org
>>[mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of tgindin@us.ibm.com
>>Sent: Wednesday, August 23, 2000 10:39 AM
>>To: Joseph M. Reagle Jr.
>>Cc: w3c-ietf-xmldsig@w3.org; duerst@w3.org
>>Subject: Re: UTF-8 and BOM
>>
>>
>>      If we retain wording excluding BOMs from UTF-8, as we currently have
>>it, I think that we should exclude surrogates as well.
>>      The current text in section 6.5.1 reads "converts the character
>>encoding to UTF-8 (without any byte order mark (BOM)) ", and corresponding
>>text in section 7 reads "that coded character set is UTF-8 (without a byte
>>order mark (BOM))".  The new text should probably read "... UTF-8 (without a
>>byte order mark (BOM) and with surrogate pairs converted to UCS-4 before
>>conversion to UTF-8)" in both of these places.  I realize that RFC 2279
>>(not 2379) explicitly requires surrogate conversion while it fails to
>>mention BOMs for some reason, but the two issues are similar and many
>>implementors do not understand the surrogate issue.  The wording about
>>surrogates in version 2.0 of the Unicode standard is actually somewhat
>>similar to the wording about the "reversed byte order mark" U+FFFE.
>>
>>           Tom Gindin
>>
>

Received on Friday, 25 August 2000 13:40:47 UTC