Re: UTF-8 signature / BOM in CSS from Tex Texin on 2003-12-06 (www-international@w3.org from October to December 2003)

From: Tex Texin <tex@i18nguy.com>
Date: Fri, 05 Dec 2003 21:55:57 -0500
To: Etan Wexler <ewexler@stickdog.com>
Cc: Richard Ishida <ishida@w3.org>, www-international@w3.org, w3c-css-wg@w3.org, w3c-i18n-ig@w3.org, www-style@w3.org
Message-ID: <3FD1453D.3252B143@i18nguy.com>

Etan,

I would be happy with either the brain-altering meds or some
justification.

I went the other way. I used to favor being tolerant when reading and
strict when writing; now I would prefer being strict everywhere. Being
tolerant disguises and perpetuates problems, introduces security risks,
and leads to unpredictable behavior. It also causes users to think the
technology is mysterious and unpredictable rather than being
decipherable and manageable. And the benefit? I can't think of one.
I'd be happy to understand how being accepting is beneficial.

Regards,

Tex

Etan Wexler wrote:
> 
> Tex Texin wrote to <mailto:www-international@w3.org>,
> <mailto:w3c-css-wg@w3.org>, <mailto:w3c-i18n-ig@w3.org>, and
> <mailto:www-style@w3.org> on 2 December 2003 in "Re: UTF-8 signature /
> BOM in CSS" (<mid:3FCD6609.7C5A8F4F@i18nguy.com>):
> 
> > I am not sure I would agree with stripping non-characters. I would
> > rather reject documents with junk in them than silently clean them up.
> 
> I used to be of the junk-rejection mentality. Ian Hickson, time, and
> probably some brain-altering medication have convinced me of the case
> for parsing at all costs.
> 
> > In the case of the UTF-8 BOM, I would not object to simply stripping
> > it, but it does seem odd to not make use of the information about the
> > document's encoding and odder still to not use the information about
> > endian-ness in a UTF-16 encoded document.
> 
> I assumed that the CSS engine would make use of out-of-band information
> to indicate the detected encoding scheme. Or the CSS engine would
> internally convert style sheets' encodings to a single chosen encoding
> (say, UTF-16BE). Or the CSS engine would parse bytes into
> encoding-independent character objects. The CSS engine would then pass
> these character objects to the tokenizer, with the original encoding
> scheme becoming irrelevant to understanding the CSS.
> 
> > Also stripping it in the case of UTF-16 would eliminate useful
> > information from a CSS document.
> 
> I intend to strip the BOM only for internal purposes. That is, the CSS
> engine strips the BOM from the style sheet in order to normalize the
> input fed to the tokenizer. As stated, the encoding scheme should
> either be flagged out of band or be made irrelevant by the particulars
> of the implementation. Interfacing with the outside world has separate
> concerns, one of which is to preserve any encoding signature. Again,
> the BOM is not necessary internally. A CSS editor could note the
> encoding scheme of a style sheet, work with the style sheet's
> constructs, and add appropriate encoding signatures upon serialization.
> 
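
The internal handling described above (detect the signature, decode to
encoding-independent characters, keep the encoding scheme out of band,
and restore the signature on serialization) might be sketched as
follows. This is a minimal illustration in Python, not any real CSS
engine's code: the names load_style_sheet and save_style_sheet are
hypothetical, and the sketch assumes UTF-8 when no signature is present,
ignoring @charset rules and HTTP headers entirely.

```python
import codecs

# Byte signatures, UTF-32 first so that UTF-32LE (FF FE 00 00) is not
# mistaken for UTF-16LE (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def load_style_sheet(data, default="utf-8"):
    """Return (text, encoding, had_signature) for raw style sheet bytes.

    Hypothetical helper: the default encoding is an assumption.
    """
    for signature, encoding in _SIGNATURES:
        if data.startswith(signature):
            # Drop the signature bytes so the tokenizer never sees U+FEFF;
            # the encoding is reported out of band in the return value.
            return data[len(signature):].decode(encoding), encoding, True
    return data.decode(default), default, False


def save_style_sheet(text, encoding, had_signature):
    """Serialize text, restoring the signature if the input carried one."""
    body = text.encode(encoding)
    if had_signature:
        body = "\ufeff".encode(encoding) + body
    return body
```

The tokenizer would work only on the returned text; an editor would keep
the encoding and the had_signature flag alongside it and pass them back
to save_style_sheet, so a round trip reproduces the original bytes. Note
that decode() is strict by default here, so malformed input raises an
error rather than being silently cleaned up.
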
> > To answer your question about other BOMs, they are all based on U+FEFF,
> > but they exist for UTF-16, UTF-32, and SCSU (Unicode compression).
> >
> > I have a list with more detail here:
> > http://www.i18nguy.com/unicode/c-unicode.html#BOM
> >
> > and the Unicode Consortium has a FAQ on UTF-8 and the BOM at:
> > http://www.unicode.org/faq/utf_bom.html
> 
> They're not just based on U+FEFF, they are U+FEFF. There are various
> byte sequences, yes, but each encodes the same character.
> 
> --
> Etan Wexler.
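
The point that the various signatures are all the one character U+FEFF,
differing only in their byte sequences, can be checked with a short
Python sketch. The function name signature_bytes is made up for this
illustration, and SCSU is left out because Python's standard library has
no codec for it.

```python
BOM = "\ufeff"  # the single character behind every signature

ENCODINGS = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")


def signature_bytes(encoding):
    """Return the byte sequence that encodes U+FEFF in the given encoding."""
    return BOM.encode(encoding)


for name in ENCODINGS:
    encoded = signature_bytes(name)
    # Decoding any of the byte sequences gives back the same character.
    assert encoded.decode(name) == BOM
    print(name.ljust(10), " ".join("%02X" % b for b in encoded))
```

Each line of output pairs an encoding name with the bytes of its
signature, for example EF BB BF for UTF-8 and FE FF for UTF-16BE.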

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------

Received on Friday, 5 December 2003 21:56:14 UTC