- From: Kenneth Whistler <kenw@sybase.com>
- Date: Thu, 17 Oct 2002 19:13:43 -0700 (PDT)
- To: duerst@w3.org
- Cc: ietf-charsets@iana.org
Martin asked:

> At 09:12 02/10/17 -0700, Markus Scherer wrote:
> >Patrik Fältström wrote:
> >
> >>What I hear on this list is that the consensus is that BOM SHOULD NOT be
> >>used. I would like it to be MUST NOT be used in Internet protocols, which
> >>leads to tagged UTF-8 text be illegal if the BOM exists in the text.
> >
> >
> >That would violate the Unicode standard.
>
> Hello Markus,
>
> Can you give the details of why and how (in terms e.g. of conformance
> clauses in the Unicode Standard)?

I think Markus may have overstated the case.

It is certainly possible for an Internet protocol specification to make BOM-initial UTF-8 text illegal *for that protocol*, if the relevant protocol definers deem it such. That does not violate the Unicode Standard, any more than if such a protocol made the use of the backslash character illegal *for that protocol*.

The Unicode Standard allows BOM-initial UTF-8. The reason it does so is because encoding conversions should not drop data if converting between UTF-16 (or UTF-32), which might have an initial BOM, and UTF-8.

However, the Unicode Standard does not require or recommend the use of a BOM with UTF-8, since its use as a signature is superfluous in that encoding form, and as y'all have discussed, if anything, it is harmful in that context for many protocols and for the ASCII compatibility of UTF-8 data streams.

The exact wording being considered for the Unicode 4.0 revision is:

  "When represented in UTF-8, the byte order mark turns into the byte sequence <EF BB BF>. Its usage at the beginning of a UTF-8 data stream is neither required nor recommended by the Unicode Standard, but its presence is not considered non-conformant for the UTF-8 encoding scheme."

And then there is a bunch more language a bit later about the care that is necessary when handling BOM's when converting between encoding schemes.

--Ken
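For illustration only, the two points above (a conversion from UTF-16 should not drop an initial BOM, while a protocol remains free to reject BOM-initial UTF-8) might be sketched in Python roughly as follows. The function names `utf16be_to_utf8` and `protocol_accepts` are hypothetical and not taken from any standard or protocol; the "reject" rule is an assumed example of a protocol-level restriction, not something the Unicode Standard mandates.

```python
import codecs

# U+FEFF (the byte order mark) encoded as UTF-8 is the three bytes EF BB BF.
UTF8_BOM = "\ufeff".encode("utf-8")


def utf16be_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16BE bytes to UTF-8 without dropping data.

    The "utf-16-be" codec treats an initial U+FEFF as an ordinary
    character, so it survives the conversion as EF BB BF.
    """
    return data.decode("utf-16-be").encode("utf-8")


def protocol_accepts(data: bytes) -> bool:
    """Hypothetical protocol rule: BOM-initial UTF-8 text is illegal here."""
    return not data.startswith(codecs.BOM_UTF8)


if __name__ == "__main__":
    converted = utf16be_to_utf8(b"\xfe\xff\x00A")  # BOM followed by "A"
    print(converted.hex(" "))
    print(protocol_accepts(converted))
    print(protocol_accepts(b"A"))
```

Under these assumptions the converter stays faithful to its input, and it is the protocol layer, not the encoding conversion, that decides whether the leading `EF BB BF` is acceptable.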
Received on Thursday, 17 October 2002 22:14:42 UTC