- From: Mark Davis <mark.davis@jtcsv.com>
- Date: Sun, 7 Dec 2003 12:39:40 -0800
- To: "Etan Wexler" <ewexler@stickdog.com>, "Chris Lilley" <chris@w3.org>, <www-international@w3.org>, <w3c-css-wg@w3.org>, <www-style@w3.org>

One correction.

> There is no encoding scheme UTF-16.

The string "UTF-16" names both an encoding form and an encoding scheme. Somewhat unfortunate, but it is that way for historical reasons. As an encoding form, it is independent of byte ordering. As an encoding scheme, it is defined such that an (optional) BOM determines the interpretation of the rest of the bytes, as either pairs of big-endian or little-endian bytes.

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

----- Original Message -----
From: "Etan Wexler" <ewexler@stickdog.com>
To: "Chris Lilley" <chris@w3.org>; <www-international@w3.org>; <w3c-css-wg@w3.org>; <www-style@w3.org>
Sent: Sat, 2003 Dec 06 19:08
Subject: Re: UTF-8 signature / BOM in CSS

Chris Lilley wrote to <mailto:www-international@w3.org>, <mailto:w3c-css-wg@w3.org>, <mailto:w3c-i18n-ig@w3.org>, and <mailto:www-style@w3.org> on 6 December 2003 in "Re: UTF-8 signature / BOM in CSS" (<mid:862788409.20031206164822@w3.org>):

> EW> I assumed that the CSS engine would make use of out-of-band information
> EW> to indicate the detected encoding scheme.
>
> Please check the definition of that out of band information [in]
> particular what it says about when a BOM must be present.

Perhaps I was unclear. I did not mean that the CSS engine would propagate the "charset" parameter's value unmodified. What I had in mind is as follows.

The CSS engine retrieves a style sheet. It could be from HTTP, the local file system, FTP, SMTP + MIME, a database, or any source, really. The CSS engine detects an encoding scheme according to the prescribed or accepted best practice. Factors that could determine the detection include a "charset" parameter, a byte-order mark (U+FEFF), a database schema, a file name extension, and the native byte order of the local machine.

Once the encoding scheme is detected, it is noted for further use. The encoding scheme will never be noted as UTF-16 or UTF-32.

There is no encoding scheme UTF-16. There is a "charset" value in the IANA registry called "UTF-16", but UTF-16 is an encoding form. Any serialized UTF-16 document is either big-endian or little-endian. Nevertheless, the "UTF-16" label is allowed and in use, so we resort to the BOM to disambiguate the label's meaning. The situation with UTF-32 is analogous.

We return to our CSS scenario. The encoding scheme has been detected and noted, including any significant endian-ness. No BOM is necessary for the tokenizer. The BOM is stripped from the internal representation of the style sheet. The remaining byte stream moves along to the tokenizer. The tokenizer consults the noted encoding scheme in order to properly interpret the bytes. Parsing proceeds on course.

What did I miss or misunderstand?

-- 
Etan Wexler.
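
For readers following the thread, the detect, note, strip, then tokenize flow described above can be sketched roughly as follows. This is only an illustration, not code from either author or from any CSS engine: the helper name `detect_scheme` and the UTF-8 fallback for unlabeled input are assumptions made for the example. The big-endian default for a BOM-less "UTF-16" or "UTF-32" label follows the Unicode Standard's definition of those encoding schemes when no higher-level protocol says otherwise.

```python
import codecs

# (BOM bytes, label family, noted scheme). UTF-32 is listed before UTF-16
# because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32", "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32", "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8", "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16", "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16", "utf-16-le"),
]


def detect_scheme(data: bytes, label: str = "") -> tuple:
    """Hypothetical helper: return (noted scheme, bytes with the BOM stripped)."""
    name = label.strip().lower()
    for bom, family, scheme in _BOMS:
        # Sniff the BOM only when the label is absent or names the same
        # family without fixing a byte order.
        if name in ("", family) and data.startswith(bom):
            return scheme, data[len(bom):]
    if name in ("utf-16", "utf-32"):
        # Assumption: no BOM and no other protocol means big-endian.
        return name + "-be", data
    # Assumption: an unlabeled, BOM-less sheet is treated as UTF-8.
    # Any other label (for example "utf-16le") is trusted as given.
    return (name or "utf-8"), data


# A style sheet labeled "UTF-16" whose BOM says little-endian.
scheme, body = detect_scheme(b"\xff\xfep\x00{\x00}\x00", "UTF-16")
text = body.decode(scheme)  # the tokenizer decodes with the noted scheme
```

In this sketch the noted scheme always carries an explicit byte order for UTF-16 and UTF-32, so the tokenizer never needs to see the BOM itself.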
Received on Sunday, 7 December 2003 15:39:43 UTC