RE: Comments on draft-yergeau-rfc2279bis-00.txt

On 17/04/2002 21:51:19 Francois Yergeau wrote:
> Martin Duerst wrote:
[...]

> > 5. Byte order mark (BOM)
> >
> > This section needs more work. The 'change log' says that it's
> > mostly taken from the UTF-16 RFC. But the BOM for UTF-8 is
> > much less necessary, and much more of a problem, than for UTF-16.
> > We should clearly say that with IETF protocols, character encodings
> > are always either labeled or fixed, and therefore the BOM SHOULD
> > (and MUST at least for small segments) never be used for UTF-8.
> > And we should clearly give the main argument, namely that it
> > breaks US-ASCII compatibility (US-ASCII encoded as UTF-8
> > (without a BOM) stays exactly the same, but US-ASCII encoded
> > as UTF-8 with a BOM is different).
>
> I don't quite see your point.  A US-ASCII string, with or without a BOM, is
> always a valid UTF-8 string, I don't see where compatibility is broken.  I
> can see that protocols shouldn't *require* a BOM, because then a strict
> (BOM-less) ASCII string wouldn't meet the requirement.  But that's not what
> you're saying, right?

The point Martin may be making is that some tools insert a BOM
at the start of a resource they consider to be encoded using
UTF-8, but do not do so for a resource they consider to be
encoded using US-ASCII.

I have just carried out the following test.  I opened Notepad
under Win2K and typed the letter "a".  I then saved the file,
leaving the default encoding of "ANSI".  I then saved the file
again, under a different name, specifying "UTF-8" as the
encoding.  I then checked the file sizes using Properties.
The first file is 1 byte long; the second is 4 bytes, which is
consistent with a 3-byte UTF-8 BOM (EF BB BF) preceding the "a".
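
The same size difference can be reproduced without Notepad. What
follows is only a minimal sketch in Python: it assumes that Python's
"utf-8-sig" codec (which prepends a BOM on encoding) is a fair
stand-in for what Notepad writes, which the test above did not
verify. The variable names are illustrative.

```python
import codecs

text = "a"

# Plain US-ASCII: a single byte, 0x61.
ascii_bytes = text.encode("ascii")

# UTF-8 without a BOM: byte-for-byte identical to the US-ASCII form.
utf8_bytes = text.encode("utf-8")

# UTF-8 with a leading BOM (assumption: this mirrors Notepad's "UTF-8" save).
utf8_bom_bytes = text.encode("utf-8-sig")

print(len(ascii_bytes), len(utf8_bytes), len(utf8_bom_bytes))

# The extra bytes are exactly the UTF-8 BOM, EF BB BF.
print(utf8_bom_bytes.startswith(codecs.BOM_UTF8))

# A strict US-ASCII consumer accepts the BOM-less form but not the other.
try:
    utf8_bom_bytes.decode("ascii")
    bom_is_ascii = True
except UnicodeDecodeError:
    bom_is_ascii = False
print(bom_is_ascii)
```

The BOM-less UTF-8 bytes equal the US-ASCII bytes, while the
BOM-prefixed form is three bytes longer and is rejected by a strict
US-ASCII decoder.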

Misha

[...]

Received on Wednesday, 17 April 2002 17:14:15 UTC