Re: UTF-8 signature / BOM in CSS from Tex Texin on 2003-12-03 (www-international@w3.org from October to December 2003)

From: Tex Texin <tex@i18nguy.com>
Date: Tue, 02 Dec 2003 23:26:49 -0500
To: Etan Wexler <ewexler@stickdog.com>
Cc: Richard Ishida <ishida@w3.org>, www-international@w3.org, w3c-css-wg@w3.org, w3c-i18n-ig@w3.org, www-style@w3.org
Message-ID: <3FCD6609.7C5A8F4F@i18nguy.com>

Etan,

I am not sure I would agree with stripping noncharacters. I would
rather reject documents with junk in them than silently clean them up.
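
As a rough sketch of what "reject rather than clean up" could look like
(the function names are hypothetical, not taken from any CSS
specification or implementation), a preparation phase might test each
code point against the 66 Unicode noncharacters and fail on the first
one it finds:

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF, plus the last two code points of each of the
    # 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def check_no_noncharacters(text: str) -> None:
    """Raise ValueError on the first noncharacter instead of removing it."""
    for offset, ch in enumerate(text):
        if is_noncharacter(ord(ch)):
            raise ValueError(
                f"noncharacter U+{ord(ch):04X} at offset {offset}"
            )
```

Note that U+FEFF itself is not a noncharacter, so a check like this
leaves the BOM question entirely separate.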

In the case of the UTF-8 BOM, I would not object to simply stripping it,
but it does seem odd not to use what it tells us about the document's
encoding, and odder still not to use the byte-order information in a
UTF-16 encoded document. Stripping it from a UTF-16 CSS document would
also discard useful information.

To answer your question about other BOMs: they are all based on U+FEFF,
but besides UTF-8 they exist for UTF-16, UTF-32, and SCSU (the Standard
Compression Scheme for Unicode).

I have a list with more detail here:
http://www.i18nguy.com/unicode/c-unicode.html#BOM

and the Unicode Consortium has a FAQ on UTF-8 and the BOM at:
http://www.unicode.org/faq/utf_bom.html
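
To make the signatures concrete, here is a minimal sketch of sniffing
them from the first bytes of a style sheet, so that the encoding and
byte order can be used rather than thrown away. The names SIGNATURES and
sniff_signature are made up for illustration; the byte sequences are the
encoded forms of U+FEFF in each scheme.

```python
import codecs

# Checked in order. UTF-32LE (FF FE 00 00) must come before UTF-16LE
# (FF FE) because the shorter signature is a prefix of the longer one.
# Assumption: FF FE 00 00 is read as UTF-32LE, although it could also be
# a UTF-16LE BOM followed by U+0000.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (b"\x0e\xfe\xff", "SCSU"),          # 0E FE FF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def sniff_signature(data: bytes):
    """Return (encoding name, signature length), or (None, 0) if absent."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0
```

For example, sniff_signature(b"\xef\xbb\xbfbody{}") returns
("UTF-8", 3), and the caller can then skip those three bytes before
handing the rest to the decoder.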

tex


Etan Wexler wrote:
> 
> Richard Ishida wrote to <mailto:www-international@w3.org>,
> <mailto:w3c-css-wg@w3.org>, and <mailto:w3c-i18n-ig@w3.org> on 2
> December 2003 in "RE: UTF-8 signature / BOM in CSS"
> (<mid:005301c3b8e4$1d862250$6501a8c0@w3c40upc3ma3j2>):
> 
> > I wonder whether CSS can introduce a change to CSS2.1 at this stage to
> > clarify that the BOM - particularly any UTF-8 signature - should not be
> > considered part of the following text.
> 
> I'd like to see such a revision made.
> 
> CSS specifications should mandate a preparation phase for CSS
> consumption. In this phase, a CSS engine would strip an initial BOM, if
> present, and strip all noncharacters. After this phase, a clean stream
> of Unicode characters gets passed to the tokenizer; parsing proceeds as
> specified in the grammar.
> 
> By the way, what UTF-8 signatures exist besides U+FEFF?
> 
> --
> Etan Wexler.
> (Sorry about the character munging in my original message. And sorry
> about using my unsubscribed address, thus splitting the thread. I'm
> reconnecting with www-style.)

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------

Received on Tuesday, 2 December 2003 23:28:33 UTC