Re: Comments on draft-yergeau-rfc2279bis-00.txt from Dan Oscarsson on 2002-04-17 (ietf-charsets@w3.org from April to June 2002)

From: Dan Oscarsson <Dan.Oscarsson@trab.se>
Date: Wed, 17 Apr 2002 13:44:07 +0200
To: ietf-charsets@iana.org, duerst@w3.org, FYergeau@alis.com
Message-id: <3CBD6007.653586A@trab.se>

Martin Duerst wrote:

>Here are my comments on
>http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-00.txt.
>

>5. Byte order mark (BOM)
>
>This section needs more work. The 'change log' says that it's
>mostly taken from the UTF-16 RFC. But the BOM for UTF-8 is
>much less necessary, and much more of a problem, than for UTF-16.
>We should clearly say that with IETF protocols, character encodings
>are always either labeled or fixed, and therefore the BOM SHOULD
>(and MUST at least for small segments) never be used for UTF-8.
>And we should clearly give the main argument, namely that it
>breaks US-ASCII compatibility (US-ASCII encoded as UTF-8
>(without a BOM) stays exactly the same, but US-ASCII encoded
>as UTF-8 with a BOM is different).

Just what I have been thinking about. While it could have been nice to
require all UTF-8 encoded files to have a "magic" marker in the
beginning
to separate between other 8-bit character sets, we are already
past the stage where that could be introduced. So I recommend that
BOM MUST never be used in UTF-8. So you never have to expect it
when handling UTF-8.

I would also very much like UTF-8 to require that Unicode
normalisation form C has been used on the UCS encoded.
Otherwise can the same character sequence have
different UTF-8 codings.
While it is no problem to use overlong UTF-8 sequences, they
are forbidden in the document. This makes it impossible to
encode the same ASCII character sequence in several ways.
The same should be applied to all characters in UCS - only
one form should be allowed.
As form C do not destroy any data and is most compact, it is
the best choice.
So UTF-8 should REQUIRE the characters to be normalised
using form C. (note: text normalised using from KC will
work also, it it is normalised using form C it will result
in the same text).

Having both BOM removed and form C required will make handling
of UTF-8 in software much simpler as well as less error and security
prone.

    Dan

Received on Monday, 29 April 2002 05:02:34 UTC