- From: Simon Josefsson <simon+ietf-charsets@josefsson.org>
- Date: Thu, 03 Oct 2002 15:35:51 +0200
- To: Martin Duerst <duerst@w3.org>
- Cc: ietf-charsets@iana.org
Martin Duerst <duerst@w3.org> writes:

> IETF protocols are based on the principle that the character
> encoding is either fixed (i.e. a certain field in a certain
> protocol is always in the same encoding) or labeled (e.g.
> with the 'charset' parameter in the Content-Type header of
> a MIME entity). In both cases, the BOM as an encoding identifier
> is unnecessary. Also, the compatibility of UTF-8 with US-ASCII
> is important in many IETF protocols. However, while UTF-8 without
> the BOM is compatible with US-ASCII (encoding an US-ASCII string
> in UTF-8 results in exactly the same bytes), UTF-8 with a BOM is
> not compatible with US-ASCII, because the BOM is not allowed in
> US-ASCII. Also, while a BOM was always allowed for UTF-8 in
> ISO 10646, there are many implementations that do not process
> a BOM appropriately.
>
> Therefore, senders SHOULD NOT use the BOM in larger, usually
> labeled, pieces of text (e.g. MIME entities), and MUST NOT
> use it in smaller protocol elements (usually with a fixed
> encoding). Receivers SHOULD recognize and remove the BOM
> in larger, usually labeled, pieces of text (e.g. MIME entities).

This still says that implementations should support the BOM, which I
think is bad practice. If I were to follow this practice in a MUA, I
would break digital signatures on UTF-8 data.

How about changing the last sentence into the following:

   Receivers MAY recognize and remove the BOM in larger, usually
   labeled, pieces of text (e.g. MIME entities), if compatibility
   with software that generates it is required. Care should be
   taken not to remove the BOM in data that must be preserved
   correctly (such as digitally signed data).

What do you think?
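[For illustration, a minimal Python sketch of the tradeoff discussed
above; the strip_bom helper is hypothetical, not from any draft text.
It checks the ASCII-compatibility point and shows why a receiver that
removes the BOM changes the bytes a signature was computed over (a
SHA-256 digest stands in for the signature here).]

import hashlib

UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded in UTF-8

def strip_bom(data: bytes) -> bytes:
    # Remove one leading UTF-8 BOM, if present; leave other data alone.
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data

ascii_text = b"Hello"
# Encoding a US-ASCII string in UTF-8 yields exactly the same bytes...
assert ascii_text.decode("ascii").encode("utf-8") == ascii_text
# ...but prefixing a BOM makes the byte stream differ from US-ASCII.
assert UTF8_BOM + ascii_text != ascii_text

signed = UTF8_BOM + ascii_text    # bytes as signed by the sender
received = strip_bom(signed)      # bytes after the receiver removes the BOM
# The digest over the received bytes no longer matches the signed bytes.
assert hashlib.sha256(signed).digest() != hashlib.sha256(received).digest()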