
Re: UTF-8 signature / BOM in CSS

From: Etan Wexler <ewexler@stickdog.com>
Date: Sun, 7 Dec 2003 17:34:55 -0800
To: www-international@w3.org, www-style@w3.org
Message-Id: <B9F6D1F2-291E-11D8-9135-000502CB1B77@stickdog.com>

Mark Davis wrote to <mailto:www-international@w3.org>, 
<mailto:w3c-css-wg@w3.org>, and <mailto:www-style@w3.org> on 7 December 
2003 in "Re: UTF-8 signature / BOM in CSS" 
(<mid:001f01c3bd02$3d81d910$7900a8c0@DAVIS1>):

> The string "UTF-16" names both an encoding form and an encoding 
> scheme. Somewhat
> unfortunate, but it is that way for historical reasons. As an encoding 
> form, it
> is independent of byte ordering. As an encoding scheme, it is defined 
> such that
> an (optional) BOM determines the interpretation of the rest of the 
> bytes, as
> either pairs of big-endian or little-endian bytes.
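The distinction Davis draws can be made concrete. A minimal sketch in Python (my choice of language for illustration; none of this code is from the thread): the "utf-16-be" and "utf-16-le" codecs correspond to the encoding form with a fixed byte order, while the plain "utf-16" codec corresponds to the encoding scheme, in which a leading BOM selects the byte order and is not part of the text.

```python
text = "a"

# Encoding form with an externally specified byte order: no BOM.
be = text.encode("utf-16-be")   # b'\x00a'
le = text.encode("utf-16-le")   # b'a\x00'

# Encoding scheme: the BOM at the head of the stream determines how
# the remaining byte pairs are interpreted, and is stripped on decode.
assert (b"\xfe\xff" + be).decode("utf-16") == text  # big-endian stream
assert (b"\xff\xfe" + le).decode("utf-16") == text  # little-endian stream
```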

I stand corrected. UTF-16 is an unambiguous encoding scheme. That 
given, does my processing model make sense? I'll repeat for convenience:

The encoding scheme has been detected and noted, including any 
significant endianness. No BOM is necessary for the tokenizer. The BOM 
is stripped from the internal representation of the style sheet. The 
remaining byte stream moves along to the tokenizer. The tokenizer 
consults the noted encoding scheme in order to properly interpret the 
bytes.
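The model above could be sketched as follows (a hypothetical Python illustration; the function name and the fallback behavior are my assumptions, not part of any CSS implementation):

```python
import codecs

def detect_and_strip(stylesheet: bytes):
    """Detect the encoding scheme from any leading BOM, note it,
    and strip the BOM so the tokenizer never sees the codepoint.
    Returns (noted scheme, remaining byte stream)."""
    if stylesheet.startswith(codecs.BOM_UTF8):
        return "utf-8", stylesheet[len(codecs.BOM_UTF8):]
    if stylesheet.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", stylesheet[len(codecs.BOM_UTF16_BE):]
    if stylesheet.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", stylesheet[len(codecs.BOM_UTF16_LE):]
    # No BOM: assumed fallback; real detection has further steps
    # (@charset, higher-level protocol information, and so on).
    return "utf-8", stylesheet

scheme, body = detect_and_strip(b"\xff\xfeb\x00o\x00d\x00y\x00")
# The tokenizer then consults `scheme` to interpret `body`.
assert scheme == "utf-16-le"
assert body.decode(scheme) == "body"
```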

If (because of existing software libraries or what have you) it's just 
as easy and fast to tell the tokenizer, "This is in UTF-16; check the 
BOM", then, okay, there's no need for the extra processing. But I still 
like the cleanliness of eliminating this thorn of a codepoint so that 
the tokenizer doesn't ever see it and so that the grammar doesn't have 
to account for it.

-- 
Etan Wexler.
Received on Sunday, 7 December 2003 20:34:11 GMT
