RE: XHTML and charset's [was: Re: XHTML questions] from Christian Smith on 2000-06-30 (www-html@w3.org from June 2000)

From: Christian Smith <csmith@barebones.com>
Date: Fri, 30 Jun 2000 10:26:14 -0400
To: Ian Graham <ian.graham@utoronto.ca>
cc: www-html@w3.org, Chris Croome <chris@webarchitects.co.uk>
Message-ID: <20000630102614-f01010601-bf1191ff@204.107.232.107>

On Friday, June 30, 2000 at 09:29, igraham@smaug.java.utoronto.ca (Ian
Graham) wrote:

> I think you mean UTF-16 (the two-byte encoding). UTF-8 doesn't use /
> require a byte order mark, as all characters are encoded as a stream of
> one, two, or more bytes, and the encoding rules uniquely define the
> ordering of the bytes (a byte stream). 

No, I do mean UTF-8. While UTF-8 does not require a BOM (neither does
UTF-16), there is a defined BOM for UTF-* and it is convenient to have one.
Otherwise it can be difficult to determine that a file is UTF-8 (as opposed
to some other binary format) absent some other specific designator.
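
As a rough illustration of that convenience, here is a minimal Python
sketch of checking for the UTF-8 signature (the bytes EF BB BF) at the
start of a buffer. The function names are hypothetical, made up for this
example; codecs.BOM_UTF8 and the "utf-8-sig" codec are from the Python
standard library. The absence of the signature proves nothing, since the
BOM is optional.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the buffer begins with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present.

    The "utf-8-sig" codec accepts input both with and without the BOM.
    """
    return data.decode("utf-8-sig")
```

A caller might test starts_with_utf8_bom() first and fall back to some
other designator (a charset parameter, an XML declaration) when it
returns False.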

That UTF-8 doesn't have a BOM seems to be a common misconception, but the
Unicode FAQ is pretty clear on this.

http://www.unicode.org/unicode/faq/#BOM

Part of the problem is that the RFC for Unicode is almost (but not
quite)[1] completely useless and the ISO specification is no better.
Neither of these documents can be read and understood by us mere mortals.

Of course it is perhaps a bit misleading to call this a BOM (yet that is
what it is called) as UTF-8 doesn't have little/big-endian forms so there
is no "order" to mark.
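
To make the point concrete, this small Python sketch encodes the BOM
character (U+FEFF) in three encoding forms. Only the UTF-16 results differ
by byte order; the UTF-8 result is always the same three bytes. The
variable names are just for illustration.

```python
# U+FEFF is the character used as the byte order mark.
BOM = "\ufeff"

utf8_bom = BOM.encode("utf-8")          # EF BB BF, one fixed sequence
utf16_be_bom = BOM.encode("utf-16-be")  # FE FF, big-endian
utf16_le_bom = BOM.encode("utf-16-le")  # FF FE, little-endian

for name, raw in [("UTF-8", utf8_bom),
                  ("UTF-16BE", utf16_be_bom),
                  ("UTF-16LE", utf16_le_bom)]:
    # Print each form as space-separated hex bytes.
    print(name, " ".join(f"{b:02X}" for b in raw))
```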

[1] Is this a TLM? BNQ = "but not quite" or should we have ABNQ = "almost
but not quite" ;-?

-- 
Christian Smith  |  csmith@barebones.com  |  http://web.barebones.com

He who dies with the most friends... Is still dead!

Received on Friday, 30 June 2000 10:26:15 UTC