- From: Mark Davis <mark.davis@us.ibm.com>
- Date: Thu, 17 Oct 2002 19:45:34 -0700
- To: Kenneth Whistler <kenw@sybase.com>
- Cc: duerst@w3.org, ietf-charsets@iana.org
I think what Markus was referring to was the following.

In the case of UTF-16, we clearly distinguish the tag "UTF-16" from the tag "UTF-16BE". With "UTF-16", an initial FE FF means a BOM, and is not logically part of the content, and can be stripped without affecting that content. With "UTF-16BE", an initial FE FF is a ZWNBSP, *is* logically part of the content, and *cannot* be stripped without affecting that content.

Very unfortunately, there is no tag difference that lets us know which variety of UTF-8 we are dealing with. In well-formed text tagged with "UTF-8", an initial EF BB BF could stand for a BOM, if the system is interpreting it that way, or it could simply stand for the legitimate character ZWNBSP. You don't know. You can, of course, guess. And your guess is pretty likely to be correct. But you don't know for certain. And if you simply strip a character willy-nilly, and yet purport not to be changing the text, you are violating Unicode conformance clause C10.

The real way to solve this is to have two distinct tags, one where an initial EF BB BF stands for a BOM, and one where it doesn't. That way any protocol can be absolutely clear as to the meaning.

Of course, this is just a guess as to what Markus meant ;-)

Mark
___
mark.davis@us.ibm.com
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

Kenneth Whistler <kenw@sybase.com>
To: duerst@w3.org
cc: ietf-charsets@iana.org
2002.10.17 19:13
Subject: Re: Comments on draft-yergeau-rfc2279bis-00.txt
Please respond to Kenneth Whistler

Martin asked:

> At 09:12 02/10/17 -0700, Markus Scherer wrote:
> >Patrik Fältström wrote:
> >
> >>What I hear on this list is that the consensus is that BOM SHOULD NOT be
> >>used. I would like it to be MUST NOT be used in Internet protocols, which
> >>leads to tagged UTF-8 text be illegal if the BOM exists in the text.
> >
> >
> >That would violate the Unicode standard.
>
> Hello Markus,
>
> Can you give the details of why and how (in terms e.g. of conformance
> clauses in the Unicode Standard)?

I think Markus may have overstated the case.

It is certainly possible for an Internet protocol specification to make BOM-initial UTF-8 text illegal *for that protocol*, if the relevant protocol definers deem it such. That does not violate the Unicode Standard, any more than if such a protocol made the use of the backslash character illegal *for that protocol*.

The Unicode Standard allows BOM-initial UTF-8. The reason it does so is because encoding conversions should not drop data if converting between UTF-16 (or UTF-32), which might have an initial BOM, and UTF-8. However, the Unicode Standard does not require or recommend the use of a BOM with UTF-8, since its use as a signature is superfluous in that encoding form, and as y'all have discussed, if anything, it is harmful in that context for many protocols and for the ASCII compatibility of UTF-8 data streams.

The exact wording being considered for the Unicode 4.0 revision is:

  "When represented in UTF-8, the byte order mark turns into the byte
  sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream
  is neither required nor recommended by the Unicode Standard, but its
  presence is not considered non-conformant for the UTF-8 encoding scheme."

And then there is a bunch more language a bit later about the care that is necessary when handling BOM's when converting between encoding schemes.

--Ken
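
As an illustration of the tag-dependent handling discussed in this thread, here is a minimal Python sketch. It is not taken from either message: the function name `decode_tagged` and the flag `utf8_bom_is_signature` are hypothetical, and the flag stands in for the out-of-band knowledge (or second tag) that Mark says is missing for UTF-8. Treating BOM-less "UTF-16" as big-endian is likewise an assumption of the sketch.

```python
import codecs

ZWNBSP = "\ufeff"  # U+FEFF: a BOM or ZERO WIDTH NO-BREAK SPACE, depending on the tag


def decode_tagged(data: bytes, label: str, utf8_bom_is_signature: bool = False) -> str:
    """Decode `data` according to its charset label (hypothetical helper)."""
    label = label.upper()
    if label == "UTF-16":
        # An initial FE FF or FF FE is a BOM: it selects the byte order
        # and is not part of the content, so it is dropped.
        if data.startswith(codecs.BOM_UTF16_BE):
            return data[2:].decode("utf-16-be")
        if data.startswith(codecs.BOM_UTF16_LE):
            return data[2:].decode("utf-16-le")
        # Assumption: with no BOM, treat the text as big-endian.
        return data.decode("utf-16-be")
    if label == "UTF-16BE":
        # An initial FE FF is a ZWNBSP and part of the content: keep it.
        return data.decode("utf-16-be")
    if label == "UTF-8":
        text = data.decode("utf-8")  # this codec keeps a leading U+FEFF
        # The label alone cannot say whether EF BB BF is a signature,
        # so the caller has to state its interpretation explicitly.
        if utf8_bom_is_signature and text.startswith(ZWNBSP):
            return text[1:]
        return text
    raise ValueError(f"unsupported label: {label}")
```

With this sketch, the bytes FE FF 00 41 decode to "A" under the label "UTF-16" but to U+FEFF followed by "A" under "UTF-16BE", while EF BB BF 41 under "UTF-8" yields either result depending solely on the caller's flag.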
Received on Thursday, 17 October 2002 22:46:56 UTC