Re: UTF-8 signature / BOM in CSS from Chris Haynes on 2003-12-08 (www-international@w3.org from October to December 2003)

From: Chris Haynes <chris@harvington.org.uk>
Date: Mon, 8 Dec 2003 10:05:13 -0000
To: <www-international@w3.org>, <www-style@w3.org>
Message-ID: <004e01c3bd72$c58c6620$0200000a@ringo>

 "Etan Wexler"  wrote:


>
> Mark Davis wrote to <mailto:www-international@w3.org>,
> <mailto:w3c-css-wg@w3.org>, and <mailto:www-style@w3.org> on 7 December
> 2003 in "Re: UTF-8 signature / BOM in CSS"
> (<mid:001f01c3bd02$3d81d910$7900a8c0@DAVIS1>):
>
> > The string "UTF-16" names both an encoding form and an encoding
> > scheme. Somewhat unfortunate, but it is that way for historical
> > reasons. As an encoding form, it is independent of byte ordering.
> > As an encoding scheme, it is defined such that an (optional) BOM
> > determines the interpretation of the rest of the bytes, as either
> > pairs of big-endian or little-endian bytes.
>
> I stand corrected. UTF-16 is an unambiguous encoding scheme. That
> given, does my processing model make sense? I'll repeat for convenience:
>
> The encoding scheme has been detected and noted, including any
> significant endian-ness. No BOM is necessary for the tokenizer. The BOM
> is stripped from the internal representation of the style sheet. The
> remaining byte stream moves along to the tokenizer. The tokenizer
> consults the noted encoding scheme in order to properly interpret the
> bytes.
>
> If (because of existing software libraries or what have you) it's just
> as easy and fast to tell the tokenizer, "This is in UTF-16; check the
> BOM", then, okay, there's no need for the extra processing. But I still
> like the cleanliness of eliminating this thorn of a codepoint so that
> the tokenizer doesn't ever see it and so that the grammar doesn't have
> to account for it.
>
> --
> Etan Wexler.


There is a different, architecturally layered way of looking at this:

1) The file layer contains a sequence of bytes / octets
2) The Style Sheet layer consists of a sequence of characters

The first layer reconstructs the character sequence from the byte sequence. The BOM and other UTF-16-related artifacts belong to
this layer and never leave it. This layer emits a sequence of characters (represented internally as Unicode code points) for use in
the next layer up.

The second layer, which contains the tokenizer etc., sees only characters. It does not need to know what encoding scheme was used at
the 'file' layer. The BOM etc. never reach this character layer; they are consumed by the lower layer.
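
As a rough illustration of the two layers, here is a minimal Python sketch. The names file_layer and style_sheet_layer, the small BOM table, and the UTF-8 fallback are all assumptions made for this example; a real CSS implementation would also consult HTTP headers and @charset, which are not modelled here. The "tokenizer" is only a whitespace splitter standing in for the real thing.

```python
import codecs

# Illustrative BOM table (an assumption of this sketch, not a complete list).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def file_layer(data: bytes, default: str = "utf-8") -> str:
    """Byte layer: detect and consume any BOM, then emit characters."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # The BOM is dropped here and never reaches the layer above.
            return data[len(bom):].decode(encoding)
    # No signature found: fall back to the assumed default encoding.
    return data.decode(default)


def style_sheet_layer(text: str) -> list:
    """Character layer: a stand-in tokenizer that only ever sees characters."""
    return text.split()


if __name__ == "__main__":
    raw = codecs.BOM_UTF16_LE + "h1 { color: red }".encode("utf-16-le")
    print(style_sheet_layer(file_layer(raw)))
```

The point of the split is that style_sheet_layer takes a str and has no encoding parameter at all, so neither it nor the grammar it implements needs to mention the BOM.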


HTH

Chris Haynes

Received on Monday, 8 December 2003 05:18:03 UTC