
Re: Strange advice re BOM and UTF-8

From: Asmus Freytag <asmusf@ix.netcom.com>
Date: Wed, 06 Dec 2006 10:23:58 -0800
Message-ID: <45770ABE.3030402@ix.netcom.com>
To: "McDonald, Ira" <imcdonald@sharplabs.com>
CC: 'Richard Ishida' <ishida@w3.org>, 'Chris Lilley' <chris@w3.org>, www-validator@w3.org, www-international@w3.org

Some comments:

On 12/6/2006 9:00 AM, McDonald, Ira wrote:
> Hi,
> FWIW - the IETF's formal definition of UTF-8 (RFC 3629)
> recommends very strongly AGAINST the use of BOM in UTF-8
> in all IETF protocols because:
> (a) it's useless as a signature (a small fragment of 
>     UTF-8 can be reliably auto-detected without BOM);
It's not at all useless. A fairly common situation is a UTF-8 encoded
text in English where the only character outside ASCII is the copyright
sign, occurring at the very end. Such texts (and many similar ones)
cannot be reliably autodetected without reading the entire text. Here, a
signature is useful.
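
To make the point concrete, here is a minimal Python sketch. The
function name sniff_prefix, the 64-byte limit and the sample text are
assumptions chosen for illustration, not taken from any real detector.
It only shows that a detector looking at a bounded prefix learns nothing
from pure ASCII, while a signature answers within three bytes.

```python
import codecs

def sniff_prefix(data: bytes, limit: int) -> str:
    """Guess the encoding from at most `limit` leading bytes (illustrative only)."""
    head = data[:limit]
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8 (signature)"
    if all(b < 0x80 for b in head):
        # Pure ASCII is valid in UTF-8, Latin-1 and many other encodings.
        return "undetermined (ASCII so far)"
    try:
        # final=False tolerates a multi-byte sequence cut off by `limit`.
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
        return "utf-8 (heuristic)"
    except UnicodeDecodeError:
        return "not utf-8"

# English text whose only non-ASCII character is a trailing copyright sign.
text = "Plain English body text. " * 40 + "\u00a9 2006"
plain = text.encode("utf-8")
signed = codecs.BOM_UTF8 + plain

print(sniff_prefix(plain, 64))          # undetermined (ASCII so far)
print(sniff_prefix(signed, 64))         # utf-8 (signature)
print(sniff_prefix(plain, len(plain)))  # utf-8 (heuristic)
```

Only the third call, which reads the whole text, can decide without the
signature.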

An alternative would be the use of an adaptive converter, that is, a 
converter that starts out in ASCII and autodetects along the way, 
settling into any one of several modes once the first instance of a 
particular encoding has been encountered. Such a converter is superior 
to looking at the start of a text and deciding the encoding based on 
limited evidence. Does anybody use such a scheme?
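
A rough Python sketch of what such an adaptive converter might look
like follows. Everything here is hypothetical: the class name
AdaptiveDecoder, the choice of only two candidate modes (UTF-8, else a
Latin-1 fallback), and the rule of judging by the first non-ASCII
sequence are assumptions made for the example, not a description of any
existing converter.

```python
import codecs

class AdaptiveDecoder:
    """Starts in ASCII; settles on a mode at the first non-ASCII byte."""

    def __init__(self, fallback: str = "latin-1"):
        self.fallback = fallback
        self.mode = "ascii"
        self._decoder = None   # set once a mode has been chosen
        self._pending = b""    # non-ASCII bytes held back until we can judge

    def feed(self, chunk: bytes, final: bool = False) -> str:
        if self._decoder is not None:
            return self._decoder.decode(chunk, final)
        data = self._pending + chunk
        # Everything before the first byte >= 0x80 is plain ASCII.
        cut = next((i for i, b in enumerate(data) if b >= 0x80), len(data))
        out, rest = data[:cut].decode("ascii"), data[cut:]
        self._pending = b""
        if not rest:
            return out
        if len(rest) < 4 and not final:
            # A UTF-8 sequence is at most 4 bytes; wait for more evidence.
            self._pending = rest
            return out
        probe = codecs.getincrementaldecoder("utf-8")()
        try:
            probe.decode(rest[:4], final and len(rest) <= 4)
            self.mode = "utf-8"
        except UnicodeDecodeError:
            self.mode = self.fallback
        self._decoder = codecs.getincrementaldecoder(self.mode)()
        return out + self._decoder.decode(rest, final)

d = AdaptiveDecoder()
# The copyright sign (C2 A9) is split across two chunks on purpose.
result = d.feed(b"Hello world \xc2") + d.feed(b"\xa9 2006", final=True)
print(d.mode, result)
```

A real converter would need more candidate encodings and a way to
revise an early decision; this sketch commits at the first non-ASCII
sequence and never looks back.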
> (b) it's dangerous because it breaks string concatenation.
The point of preferring U+2060 for use as Word Joiner is just that: to
make it easier to ignore and/or filter non-initial FEFF as redundant.
However, I tend to believe that Jukka is correct, and FEFF is more
widely supported as an invisible non-breaking character than U+2060 is.
Presumably, that's a relatively temporary situation: it takes several
versions before new characters get fully implemented.

It seems to me that blind string concatenation is risky in UTF-8 because
it relies on the well-formed nature of both strings. But it's also
tricky with any encoding form, because if the second string starts with
a combining mark, the effect may not be as desired. (A combining mark at
the start of a string would normally be displayed 'in isolation', but
following concatenation it would form a combining character sequence
with the end of the previous string whenever that string ends with a
base or combining character.) Safe concatenation would ensure that the
first string buffer correctly ends on a character boundary and would be
able to handle BOM disposal from the second string as well. In an
editor, I would expect even more sophisticated support of concatenation,
in line with the cut-and-paste support for spacers surrounding words,
and similar features.
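
As a minimal sketch of the two checks just described, assuming Python
and byte strings that are meant to hold UTF-8: the helper names
safe_concat and starts_with_combining_mark are made up for this example.
What to do when the second string does start with a combining mark is
left to the caller, since the right answer depends on the application.

```python
import codecs
import unicodedata

def safe_concat(first: bytes, second: bytes) -> bytes:
    """Concatenate two UTF-8 byte strings, dropping a BOM on the second."""
    # A strict decode raises UnicodeDecodeError if `first` is ill-formed,
    # for example if it stops in the middle of a multi-byte sequence.
    first.decode("utf-8")
    if second.startswith(codecs.BOM_UTF8):
        second = second[len(codecs.BOM_UTF8):]
    second.decode("utf-8")
    return first + second

def starts_with_combining_mark(text: bytes) -> bool:
    """True if the first character (after any BOM) is a combining mark."""
    s = text.decode("utf-8-sig")  # this codec skips a leading BOM
    return bool(s) and unicodedata.category(s[0]).startswith("M")

joined = safe_concat(b"abc", codecs.BOM_UTF8 + b"def")
print(joined)                                                # b'abcdef'
print(starts_with_combining_mark("\u0301x".encode("utf-8")))  # True
```

Calling safe_concat(b"caf\xc3", b"\xa9") raises UnicodeDecodeError,
because the first buffer does not end on a character boundary, even
though blindly joining the bytes would happen to give valid UTF-8.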

Received on Wednesday, 6 December 2006 18:24:19 UTC
