Re: Comments on draft-yergeau-rfc2279bis-00.txt

Martin Duerst <duerst@w3.org> writes:

> IETF protocols are based on the principle that the character
> encoding is either fixed (i.e. a certain field in a certain
> protocol is always in the same encoding) or labeled (e.g.
> with the 'charset' parameter in the Content-Type header of
> a MIME entity). In both cases, the BOM as an encoding identifier
> is unnecessary. Also, the compatibility of UTF-8 with US-ASCII
> is important in many IETF protocols. However, while UTF-8 without
> the BOM is compatible with US-ASCII (encoding an US-ASCII string
> in UTF-8 results in exactly the same bytes), UTF-8 with a BOM is
> not compatible with US-ASCII, because the BOM is not allowed in
> US-ASCII. Also, while a BOM was always allowed for UTF-8 in
> ISO 10646, there are many implementations that do not process
> a BOM appropriately.
>
> Therefore, senders SHOULD NOT use the BOM in larger, usually
> labeled, pieces of text (e.g. MIME entities), and MUST NOT
> use it in smaller protocol elements (usually with a fixed
> encoding). Receivers SHOULD recognize and remove the BOM
> in larger, usually labeled, pieces of text (e.g. MIME entities).

This still says that implementations should support the BOM, which I
think is bad practice.  If I were to follow this practice in a MUA, I
would break digital signatures on UTF-8 data.  How about changing the
last sentence to the following:

  Receivers MAY recognize and remove the BOM in larger, usually
  labeled, pieces of text (e.g. MIME entities), if they require
  compatibility with software that generates it.  Care should be
  taken not to remove the BOM in data that must be preserved
  correctly (such as digitally signed data).
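To make the tradeoff concrete, here is a minimal sketch (the function
name is illustrative, not from the draft) showing why a BOM breaks
byte-level compatibility with US-ASCII, and how a receiver that opts
in to BOM removal might strip it for display while keeping the
original bytes intact for signature verification:

```python
# UTF-8 encoding of U+FEFF (the BOM) is the three bytes EF BB BF.
UTF8_BOM = b"\xef\xbb\xbf"

def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, for display only.

    Signature verification must still run over the original bytes,
    since stripping alters the signed content.
    """
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data

# Encoding a US-ASCII string in UTF-8 yields the identical bytes...
plain = "Hello".encode("utf-8")      # b'Hello'
# ...but prepending a BOM makes the byte sequences differ, which is
# exactly what invalidates a digital signature computed over the text.
with_bom = UTF8_BOM + plain          # b'\xef\xbb\xbfHello'
assert plain != with_bom
assert strip_bom(with_bom) == plain  # display form matches after stripping
```

A MUA following the MAY wording above would call something like
strip_bom() only on the copy it renders, never on the bytes it hands
to the signature verifier.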

What do you think?

Received on Thursday, 3 October 2002 11:21:13 UTC