- From: John Boyer <jboyer@PureEdge.com>
- Date: Fri, 25 Aug 2000 13:08:55 -0700
- To: <tgindin@us.ibm.com>, <phil.griffin@asn-1.com>
- Cc: "Martin J. Duerst" <duerst@w3.org>, "Joseph M. Reagle Jr." <reagle@w3.org>, <w3c-ietf-xmldsig@w3.org>
<TomParaphrase> Must we strip all zero width no break spaces (U+FEFF = EF, BB, BF in UTF-8) from our data </TomParaphrase> <john> At last we see where our understated comment gets us into some interoperability trouble. For starters, the BOM and its UTF-8 encoding are considered to be at the very beginning of the file (see http://www.cl.cam.ac.uk/~mgk25/unicode.html by Marcus Kuhn). Moreover, according to XML 1.0, the UTF-16 BOM is considered to be *outside of* the data and used to qualify how to take the UTF-16 data and convert it to UCS. It is 'metadata' that XML does not retain. In other words, the UTF-16 BOM is not U+FEFF or U+FFFE. It is not intended to represent a Unicode character of data, so it's just a 16-bit binary value of FEFF or FFFE, the latter of which is illegal under Unicode anyway. What I think we are saying is that we will not encode the UTF-16 BOM when converting to UTF-8 because the BOM is not part of the data. However, I think we are not clear in saying what will happen to our UTF-8 data if U+FEFF appeared in the UTF-16 data stream *after* the BOM. When translating from UTF-16 to UTF-8, I would think that the UTF-16 BOM would be used solely to convert to UCS, but then if U+FEFF appears in the actual data, then the corresponding UTF-8 sequence would appear in our UTF-8 data stream. Indeed, the rationale for prepending of the UTF-8 for U+FEFF is at best confusingly stated by the Unicode 3.0 standard on p.324 (this reference was provided by Phillip H. Griffin of Griffin Consulting (http://asn-1.com)): "In UTF-8 the BOM corresponds to the byte sequence EF(16)BB(16)BF(16). Although there are never any questions of byte order with UTF-8 text, this sequence can serve as a signature for UTF-8 encoded text where the character set is unmarked." How is this a 'signature' for UTF-8? Isn't this also a valid string prefix for a stream of ISO-8859-1 characters? As well, what if the UTF-16 BOM is FFFE, which is not a valid UCS character? Are we to conclude that UTF-8 encoded text is unidentifiable on machines that would start a UTF-16 encoding with a BOM of FFFE? Finally, in trying to use a unicode character normally associated with a BOM as an ineffective signature, it is clearly not being used as a byte order mark. This is at least in part because it does not make sense to impose a BOM on the actual UTF-8 data; a machine will represent the UCS characters corresponding to the UTF-8 in the byte order of the machine, not the byte order dictated by the UTF-8 encoding. I believe this is why we don't understand the application of a BOM to UTF-8 data, neverminding actually putting U+FEFF *inside* the UTF-8 encoded data. John Boyer Development Team Leader, Distributed Processing and XML PureEdge Solutions Inc. Creating Binding E-Commerce v: 250-479-8334, ext. 143 f: 250-479-3772 1-888-517-2675 http://www.PureEdge.com <http://www.pureedge.com/> </john> -----Original Message----- From: tgindin@us.ibm.com [mailto:tgindin@us.ibm.com] Sent: Friday, August 25, 2000 12:26 PM To: John Boyer Cc: Martin J. Duerst; Joseph M. Reagle Jr.; w3c-ietf-xmldsig@w3.org Subject: RE: UTF-8 and BOM I have only one disagreement with this. Of course, BOM is not a legal UTF-8 byte sequence - and neither is the UCS-2 representation of any character in the Latin-1 supplemental set such as u-umlaut, which appears in Martin's last name. However, if you convert a UCS-2 string containing a BOM into UTF-8, the BOM will be converted into the triple-byte sequence xEF,xBB,xBF which is legal UTF-8, and converting it back to UCS-2 yields a BOM or a zero-width no break space. So, are we supposed to strip all occurrences of "zero-width no break space" or not? Tom Gindin "John Boyer" <jboyer@PureEdge.com> on 08/25/2000 02:45:41 PM To: Tom Gindin/Watson/IBM@IBMUS cc: "Martin J. Duerst" <duerst@w3.org>, "Joseph M. Reagle Jr." <reagle@w3.org>, <w3c-ietf-xmldsig@w3.org> Subject: RE: UTF-8 and BOM <Tom> Actually, John, a UCS-2 BOM in big-endian order (U+FEFF) is a valid UCS character known as "zero width no break space". One in little-endian order is not. This is documented, among other places, in section 3.8 (Special Character Properties) of version 2.0 of the Unicode standard. If you follow the table at the top of page 3 of RFC 2279, you will always encode a character in the minimal number of bytes. This table appears on the same page as the surrogate pair warning, so if there is no need to warn implementors about surrogate pairs in our spec there is also no need to warn them about overlong character encodings. Tom Gindin </Tom> <john> Obviously I agree about the surrogate pair and overlong character warnings. 1) Martin's message didn't specify the overlong stuff, so I'm curious if he has other warnings that *don't* come across in RFC2279. Like you, I feel that we have referred to RFC2279, so things that are clear in that RFC need not be repeated. 2) The BOM may be a valid UCS character, but it is not a valid UTF-8 byte sequence. Martin's message asserted that it was valid UTF-8, which is what I disagreed with. Indeed, everything between 0 and 7FFFFFFF is a valid UCS character (even if most have not been assigned to symbols yet). John Boyer Development Team Leader, Distributed Processing and XML PureEdge Solutions Inc. Creating Binding E-Commerce v: 250-479-8334, ext. 143 f: 250-479-3772 1-888-517-2675 http://www.PureEdge.com <http://www.pureedge.com/> </john> "John Boyer" <jboyer@PureEdge.com> on 08/25/2000 02:13:37 PM To: "Martin J. Duerst" <duerst@w3.org>, Tom Gindin/Watson/IBM@IBMUS, "Joseph M. Reagle Jr." <reagle@w3.org> cc: <w3c-ietf-xmldsig@w3.org> Subject: RE: UTF-8 and BOM Hi Martin, At the last teleconference, Joseph decided we would leave it as is (comment about BOM but not surrogates) because there was not consensus to change it either way. Although I am still not convinced of its necessity, I really don't mind one way or the other, and will most likely be adding the BOM comment to the next C14N draft just because it doesn't seem to hurt anything. Could you point me to some source I can cite which implies that a BOM at the front of UTF-8 is actually legal UTF-8? (e.g. an URL to ISO 10646 Annex R, and a paragraph would help). I ask because, according to RFC2279, a BOM does not legally decode into a UCS character, so a conformant implementation would seemingly never generate a BOM, and it would throw an exception if asked to decode one. This is why I believe, in the absence of evidence to the contrary, that the BOM comment is superfluous (but also why I don't think it hurts to keep it). On your last point about using too many bytes, are you referring to cases such as having the eleven x bits of a two byte sequence containing something less than 80 (e.g. C0 80 in place of 00)? That type of error came across pretty clearly in RFC2279, so I don't think another RFC would be necessary, but are there other cases that don't come across in RFC2279? Thanks, John Boyer Development Team Leader, Distributed Processing and XML PureEdge Solutions Inc. Creating Binding E-Commerce v: 250-479-8334, ext. 143 f: 250-479-3772 1-888-517-2675 http://www.PureEdge.com <http://www.pureedge.com/> -----Original Message----- From: w3c-ietf-xmldsig-request@w3.org [mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of Martin J. Duerst Sent: Friday, August 25, 2000 2:09 AM To: John Boyer; tgindin@us.ibm.com; Joseph M. Reagle Jr. Cc: w3c-ietf-xmldsig@w3.org Subject: RE: UTF-8 and BOM At 00/08/23 11:47 -0700, John Boyer wrote: >I actually think we need to remove the comment about BOM *and* not put in a >comment about surrogate pairs. No. You have to keep the comment about the BOM, because both with and without a bom is legal UTF-8. You better remove the comment about surrogates, because encoding individual surrogates in UTF-8 is illegal. There are other things that are illegal and still are sometimes done (e.g. using more than the necessary number of bytes), and if we wanted to list all of them, we would write another RFC for UTF-8, I guess. Regards, Martin. >There does not seem to be any such thing as a need for a BOM for UTF-8. As >for surrogate pairs... RFC2279 [1] clearly states that > >A) The only correct way to convert from UTF-16 to UTF-8 is through UCS-4 >B) The only correct way to convert from UTF-16 to UCS-4 is to fix the >surrogate pairs. > >Moreover, RFC2781 [2] clearly states how to fix the surrogate pairs. It >does not seem necessary to add more text that tells the implementer how to >transcode. This job has been done by these other RFCs [1,2], both of which >are referenced in the XML Dsig WD. > >[1] www.ietf.org/rfc/rfc2279.txt >[2] www.ietf.org/rfc/rfc2781.txt > >John Boyer >Development Team Leader, >Distributed Processing and XML >PureEdge Solutions Inc. >Creating Binding E-Commerce >v: 250-479-8334, ext. 143 f: 250-479-3772 >1-888-517-2675 http://www.PureEdge.com <http://www.pureedge.com/> > > > > >-----Original Message----- >From: w3c-ietf-xmldsig-request@w3.org >[mailto:w3c-ietf-xmldsig-request@w3.org]On Behalf Of tgindin@us.ibm.com >Sent: Wednesday, August 23, 2000 10:39 AM >To: Joseph M. Reagle Jr. >Cc: w3c-ietf-xmldsig@w3.org; duerst@w3.org >Subject: Re: UTF-8 and BOM > > > If we retain wording excluding BOM's from UTF-8, as we currently have >it, I think that we should exclude surrogates as well. > The current text in section 6.5.1 reads "converts the character >encoding to UTF-8 (without any byte order mark (BOM)) ", and corresponding >text in section 7 reads "that coded character set is UTF-8 (without a byte >order mark (BOM))" The new text should probably read "... UTF-8 (without a >byte order mark (BOM) and with surrogate pairs converted to UCS-4 before >conversion to UTF-8)" in both of these places. I realize that RFC 2279 >(not 2379) explicitly requires surrogate conversion while it fails to >mention BOM's for some reason, but the two issues are similar and many >implementors do not understand the surrogate issue. The wording about >surrogates in versions 2.0 of the Unicode standard is actually somewhat >similar to the wording about the "reversed byte order mark" U+FFFE. > > Tom Gindin
Received on Friday, 25 August 2000 16:09:11 UTC