- From: Dan Oscarsson <Dan.Oscarsson@trab.se>
- Date: Wed, 17 Apr 2002 13:44:07 +0200
- To: ietf-charsets@iana.org, duerst@w3.org, FYergeau@alis.com
Martin Duerst wrote:

>Here are my comments on
>http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-00.txt.
>
>5. Byte order mark (BOM)
>
>This section needs more work. The 'change log' says that it's
>mostly taken from the UTF-16 RFC. But the BOM for UTF-8 is
>much less necessary, and much more of a problem, than for UTF-16.
>We should clearly say that with IETF protocols, character encodings
>are always either labeled or fixed, and therefore the BOM SHOULD
>(and MUST at least for small segments) never be used for UTF-8.
>And we should clearly give the main argument, namely that it
>breaks US-ASCII compatibility (US-ASCII encoded as UTF-8
>(without a BOM) stays exactly the same, but US-ASCII encoded
>as UTF-8 with a BOM is different).

Just what I have been thinking about. While it might have been nice to
require all UTF-8 encoded files to begin with a "magic" marker to
distinguish them from other 8-bit character sets, we are already past
the stage where that could be introduced. So I recommend that a BOM MUST
never be used in UTF-8. Then you never have to expect it when handling
UTF-8.

I would also very much like UTF-8 to require that Unicode normalisation
form C has been applied to the UCS text being encoded. Otherwise the
same character sequence can have different UTF-8 encodings.
Overlong UTF-8 sequences are technically possible, but they are
forbidden in the document. This makes it impossible to encode the same
ASCII character sequence in several ways. The same should apply to all
characters in UCS: only one form should be allowed. As form C does not
destroy any data and is the most compact, it is the best choice.
So UTF-8 should REQUIRE the characters to be normalised using form C.
(Note: text normalised using form KC will also work; if it is then
normalised using form C, the result is the same text.)

Having both the BOM removed and form C required will make handling of
UTF-8 in software much simpler, as well as less prone to errors and
security problems.

   Dan
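
P.S. A rough illustration of the two points above, as a minimal Python
sketch using only the standard library. It is not text from the draft;
the helper name to_utf8_nfc and the sample strings are made up for this
example.

```python
import codecs
import unicodedata

# Point 1: a BOM breaks US-ASCII compatibility.
ascii_text = "hello"
plain = ascii_text.encode("utf-8")       # UTF-8 without a BOM
with_bom = codecs.BOM_UTF8 + plain       # UTF-8 with the EF BB BF prefix

bom_free_is_ascii = plain == ascii_text.encode("ascii")
bom_breaks_ascii = with_bom != ascii_text.encode("ascii")

# Point 2: without normalisation, one character has several encodings.
composed = "\u00e9"       # e-acute as a single code point
decomposed = "e\u0301"    # "e" followed by a combining acute accent
differ_raw = composed.encode("utf-8") != decomposed.encode("utf-8")


def to_utf8_nfc(text):
    """Hypothetical helper: normalise to form C, then encode as UTF-8."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


same_after_nfc = to_utf8_nfc(composed) == to_utf8_nfc(decomposed)

# Text already in form KC is unchanged by form C normalisation.
nfkc_text = unicodedata.normalize("NFKC", "\ufb01n")  # "fi" ligature + "n"
kc_is_already_c = unicodedata.normalize("NFC", nfkc_text) == nfkc_text

print(bom_free_is_ascii, bom_breaks_ascii, differ_raw,
      same_after_nfc, kc_is_already_c)
```

With form C applied before encoding, a byte-wise comparison of two
UTF-8 strings is enough to tell whether they hold the same text, which
is the simplification argued for above.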
Received on Monday, 29 April 2002 05:02:34 UTC