- From: Tex Texin <tex@i18nguy.com>
- Date: Tue, 02 Dec 2003 23:26:49 -0500
- To: Etan Wexler <ewexler@stickdog.com>
- Cc: Richard Ishida <ishida@w3.org>, www-international@w3.org, w3c-css-wg@w3.org, w3c-i18n-ig@w3.org, www-style@w3.org
Etan,

I am not sure I would agree with stripping non-characters. I would rather
reject documents with junk in them than silently clean them up.

In the case of the UTF-8 BOM, I would not object to simply stripping it, but
it does seem odd to not make use of the information about the document's
encoding and odder still to not use the information about endian-ness in a
UTF-16 encoded document. Also stripping it in the case of UTF-16 would
eliminate useful information from a CSS document.

To answer your question about other BOMs, they are all based on U+FEFF, but
they exist for UTF-16, UTF-32, and SCSU (Unicode compression).

I have a list with more detail here:
http://www.i18nguy.com/unicode/c-unicode.html#BOM

and the Unicode Consortium has a FAQ on UTF-8 and the BOM at:
http://www.unicode.org/faq/utf_bom.html

tex

Etan Wexler wrote:
>
> Richard Ishida wrote to <mailto:www-international@w3.org>,
> <mailto:w3c-css-wg@w3.org>, and <mailto:w3c-i18n-ig@w3.org> on 2
> December 2003 in "RE: UTF-8 signature / BOM in CSS"
> (<mid:005301c3b8e4$1d862250$6501a8c0@w3c40upc3ma3j2>):
>
> > I wonder whether CSS can introduce a change to CSS2.1 at this stage to
> > clarify that the BOM - particularly any UTF-8 signature - should not be
> > considered part of the following text.
>
> I'd like to see such a revision made.
>
> CSS specifications should mandate a preparation phase for CSS
> consumption. In this phase, a CSS engine would strip an initial BOM, if
> present, and strip all noncharacters. After this phase, a clean stream
> of Unicode characters gets passed to the tokenizer; parsing proceeds as
> specified in the grammar.
>
> By the way, what UTF-8 signatures exist besides U+FEFF?
>
> --
> Etan Wexler.
> (Sorry about the character munging in my original message. And sorry
> about using my unsubscribed address, thus splitting the thread. I'm
> reconnecting with www-style.)
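
To make the position in this message concrete, here is a minimal Python sketch of a preparation phase that uses the BOM as encoding and byte-order information before dropping it, and that rejects noncharacters instead of silently stripping them. It is an illustration only, not anything the CSS specifications define. The names sniff_bom, is_noncharacter and decode_stylesheet, and the UTF-8 fallback, are assumptions made for this example. The byte signatures are the encoded forms of U+FEFF; SCSU is recognized but not decoded, because Python's standard library has no SCSU codec.

```python
import codecs

# Encoded forms of U+FEFF, checked in order. UTF-32 LE (FF FE 00 00) must
# come before UTF-16 LE (FF FE), because the former starts with the latter.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (b"\x0e\xfe\xff", "scsu"),  # SCSU signature: SQU followed by U+FEFF
]


def sniff_bom(data):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0


def is_noncharacter(code_point):
    """True for U+FDD0..U+FDEF and for U+xxFFFE / U+xxFFFF in every plane."""
    return 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE


def decode_stylesheet(data, fallback="utf-8"):
    """Decode style sheet bytes, using the BOM (if any) to pick the codec.

    The BOM is consumed as encoding information and is not part of the
    returned text. Noncharacters cause a rejection, not a silent cleanup.
    The fallback encoding is an assumption of this sketch.
    """
    name, skip = sniff_bom(data)
    if name == "scsu":
        raise ValueError("SCSU signature found, but no SCSU decoder is available")
    text = data[skip:].decode(name or fallback)
    for ch in text:
        if is_noncharacter(ord(ch)):
            raise ValueError("noncharacter U+%04X in style sheet" % ord(ch))
    return text
```

For example, decode_stylesheet(codecs.BOM_UTF16_LE + "a{}".encode("utf-16-le")) returns "a{}" with the byte order taken from the BOM. One caveat: a UTF-16 LE document whose first character after the BOM is U+0000 is indistinguishable from UTF-32 LE by signature alone, and this sketch resolves that case in favor of UTF-32.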
--
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com

XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------
Received on Tuesday, 2 December 2003 23:28:34 UTC