- From: Etan Wexler <ewexler@stickdog.com>
- Date: Sun, 7 Dec 2003 17:34:55 -0800
- To: www-international@w3.org, www-style@w3.org
Mark Davis wrote to <mailto:www-international@w3.org>,
<mailto:w3c-css-wg@w3.org>, and <mailto:www-style@w3.org> on
7 December 2003 in "Re: UTF-8 signature / BOM in CSS"
(<mid:001f01c3bd02$3d81d910$7900a8c0@DAVIS1>):

> The string "UTF-16" names both an encoding form and an encoding
> scheme. Somewhat unfortunate, but it is that way for historical
> reasons. As an encoding form, it is independent of byte ordering.
> As an encoding scheme, it is defined such that an (optional) BOM
> determines the interpretation of the rest of the bytes, as either
> pairs of big-endian or little-endian bytes.

I stand corrected. UTF-16 is an unambiguous encoding scheme.

That given, does my processing model make sense? I'll repeat it for
convenience: the encoding scheme has been detected and noted,
including any significant endianness. No BOM is necessary for the
tokenizer. The BOM is stripped from the internal representation of
the style sheet. The remaining byte stream moves along to the
tokenizer, which consults the noted encoding scheme in order to
interpret the bytes properly.

If (because of existing software libraries or what have you) it's
just as easy and fast to tell the tokenizer, "This is in UTF-16;
check the BOM", then, okay, there's no need for the extra
processing. But I still like the cleanliness of eliminating this
thorn of a code point so that the tokenizer never sees it and so
that the grammar doesn't have to account for it.

--
Etan Wexler.
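[For concreteness, a minimal sketch of the processing model described
above, in Python. The `decode_style_sheet` helper and the `tokenize`
entry point are hypothetical names, not part of any actual CSS
implementation; only the BOM handling is the point.]

```python
# Hypothetical sketch: note the encoding scheme, strip the BOM, and
# hand the tokenizer a stream that never contains the BOM code point.
import codecs

# BOM byte sequences, longest first so a UTF-32 little-endian BOM is
# not mistaken for a UTF-16 little-endian one.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def decode_style_sheet(raw: bytes, fallback: str = "utf-8") -> str:
    """Detect and note the encoding scheme, strip any BOM, and decode.

    The returned character stream never starts with U+FEFF, so the
    tokenizer and the grammar need not account for it.
    """
    for bom, scheme in BOMS:
        if raw.startswith(bom):
            # The BOM has done its job: record the scheme (with its
            # endianness) and drop the bytes from the internal
            # representation of the style sheet.
            return raw[len(bom):].decode(scheme)
    # No BOM: use an externally declared or default encoding scheme.
    return raw.decode(fallback)

# "html {}" as UTF-16 little-endian, BOM included.
css = decode_style_sheet(b"\xff\xfe" + "html {}".encode("utf-16-le"))
assert css == "html {}"
# tokens = tokenize(css)  # hypothetical tokenizer entry point
```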
Received on Sunday, 7 December 2003 20:34:11 UTC