
Re: [CSS21] response to issue 115 (and 44)

From: L. David Baron <dbaron@dbaron.org>
Date: Fri, 20 Feb 2004 17:48:31 -0800
To: www-style@w3.org
Message-ID: <20040221014831.GA5493@darby.dbaron.org>

On Friday 2004-02-20 23:26 +0100, Bert Bos wrote:
>  1) Trust the HTTP header (or similar out-of-band information in other
>     protocols). If the file then appears to start with a U+FEFF
>     character, ignore it. If there is a @charset at the start or after
>     that U+FEFF, ignore it. Otherwise, start parsing at the first
>     character.

I don't like putting these "ignore" rules here.

Encoding determination and parsing are separate processes -- the latter
happens after the encoding has been determined and the byte stream has
been converted into a character stream.  The parsing should be described
as starting from the beginning of the stylesheet no matter what encoding
determination path is followed.  So the "ignore BOM" rule (and the
@charset parsing rules) should be stated as part of the parsing rules,
not as part of the encoding determination rules.
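To illustrate the separation (a minimal Python sketch; the function names here are mine, not anything from the draft): the decoder's only job is the byte-to-character conversion, and the U+FEFF/@charset consumption happens in the parser, uniformly, no matter which encoding determination path was followed:

```python
import re

def decode_stylesheet(raw: bytes, encoding: str) -> str:
    # Phase 1 (encoding determination) has already produced `encoding`;
    # this step only converts the byte stream into a character stream.
    return raw.decode(encoding)

def parse_prelude(text: str) -> str:
    # Phase 2: parsing always starts at the beginning of the character
    # stream. A leading U+FEFF and an @charset rule are consumed here,
    # as parsing rules, not as encoding-determination rules.
    if text.startswith("\ufeff"):
        text = text[1:]
    return re.sub(r'^@charset "[^"]*";', "", text)

sheet = parse_prelude(decode_stylesheet(
    b'\xef\xbb\xbf@charset "utf-8"; body { color: red }', "utf-8"))
```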

>  2) If the header gives no encoding, try to recognize a U+FEFF
>     and/or @charset in various encodings (see algorithm below). Then
>     use the encoding that worked to parse the remainder of the file.
> 
>  3) If neither the header nor looking for U+FEFF or @charset yield an
>     encoding, but this style sheet was loaded because a document
>     linked to it (or linked to a style sheet that in turn linked to
>     it, recursively), then use the encoding of the document (or style
>     sheet) that linked to this one.

I'd like to see this step removed in css3-syntax, but it should probably
stay for CSS 2.1.

>  4) If all else fails, assume UTF-8.

> I also omitted the CHARSET parameter of the LINK element in HTML. Is
> that a problem?

I think it should stay.  It was part of CSS2, and it belongs between
steps (2) and (3) above.  As with step (3), though, I'd like to see it
removed in css3-syntax.
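The resulting order of precedence, with the LINK CHARSET kept between steps (2) and (3), might be sketched like this (Python; all four parameter names are hypothetical labels for signals that may each be absent):

```python
def determine_encoding(http_charset, sniffed, link_charset, referrer_encoding):
    # Precedence as discussed: out-of-band information (HTTP header)
    # first, then the BOM/@charset sniff of step (2), then the CHARSET
    # attribute of the HTML LINK element (between steps (2) and (3)),
    # then the encoding of the referring document or style sheet, and
    # finally the UTF-8 fallback of step (4). Each argument is None
    # when that signal is absent.
    for candidate in (http_charset, sniffed, link_charset, referrer_encoding):
        if candidate is not None:
            return candidate
    return "utf-8"
```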

> The algorithm for (2) would be as follows:
> 
>   2a) If the first bytes are 00 00 FE FF, use UCS-4 (1234 order).
>       Remove those bytes. If they are followed by "@charset
>       <anything>;" remove that as well.
> 
>   2b) If the first bytes are FF FE 00 00, use UCS-4 (4321 order).
>       Remove those bytes. If they are followed by "@charset
>       <anything>;" remove that as well.
> 
>   2c) If the first bytes are 00 00 FF FE, use UCS-4 (2143 order).
>       Remove those bytes. If they are followed by "@charset
>       <anything>;" remove that as well.
> 
>   2d) If the first bytes are FE FF 00 00, use UCS-4 (3412 order).
>       Remove those bytes. If they are followed by "@charset
>       <anything>;" remove that as well.
> 
>   2e) If the first bytes are FE FF xx, where xx is not 00, use UTF-16-BE.
>       Remove the first two bytes. If they are followed by "@charset
>       <anything>;", remove that as well.
> 
>   2f) If the first bytes are FF FE xx, where xx is not 00, use UTF-16-LE.
>       Remove the first two bytes. If they are followed by "@charset
>       <anything>;", remove that as well.

Why does xx need to be something other than 00?  A 00 there could be
perfectly valid in either case -- for example, in UTF-16-BE any ASCII
character (say, the start of an identifier) is encoded with a 00 high
byte -- and the UCS-4 cases are already handled by rules (2b) and (2d).
(It might be worth noting that these two tests must occur after the
(2b) and (2d) tests.)
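To make the 00-byte point concrete: an ordinary UTF-16-BE style sheet that begins with a BOM has 00 as its third byte whenever its first character is in the ASCII range, '@' included. A quick check in Python:

```python
import codecs

# A BOM followed by an @charset rule in UTF-16-BE: the byte after the
# FE FF signature is 00, since every ASCII character gets a null high
# byte in UTF-16-BE.
raw = codecs.BOM_UTF16_BE + '@charset "utf-16-be";'.encode("utf-16-be")
assert raw[:2] == b"\xfe\xff"  # the UTF-16-BE signature of rule (2e)
assert raw[2] == 0x00          # xx == 00, yet this is valid UTF-16-BE
assert raw[3] == ord("@")      # the low byte carries the '@'
```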

>   2g) If the first bytes are EF BB BF, use UTF-8.
>       Remove those bytes. If they are followed by "@charset
>       <anything>;" remove that as well.

Again, I think the mention of removing "@charset" in this part of the
process should itself be removed.
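Taken together, the signature tests of (2a)-(2g) amount to a longest-match lookup in a small table -- checking the four-byte UCS-4 signatures before the two-byte UTF-16 ones, per the ordering point above. A sketch (the encoding labels follow the quoted text, and the xx != 00 restriction is dropped, as argued above):

```python
# Byte signatures of rules (2a)-(2g), longest first so that the
# four-byte UCS-4 signatures win over the two-byte UTF-16 ones.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "ucs-4 (1234 order)"),
    (b"\xff\xfe\x00\x00", "ucs-4 (4321 order)"),
    (b"\x00\x00\xff\xfe", "ucs-4 (2143 order)"),
    (b"\xfe\xff\x00\x00", "ucs-4 (3412 order)"),
    (b"\xef\xbb\xbf",     "utf-8"),
    (b"\xfe\xff",         "utf-16-be"),
    (b"\xff\xfe",         "utf-16-le"),
]

def sniff_bom(raw: bytes):
    # Return the encoding whose signature starts the byte stream, or
    # None if no signature matches.
    for sig, enc in SIGNATURES:
        if raw.startswith(sig):
            return enc
    return None
```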

>   2h) For all encodings X that the UA knows, starting with UTF-8,
>       UTF-16-BE and UTF-16-LE, if the first bytes correspond to
>       '@charset "X";' (case-insensitive) in encoding X, use that
>       encoding X and remove those bytes.

I don't think this should be a MUST requirement for all encodings --
rather only for those encodings in which ASCII characters are encoded
as in ASCII or padded with null bytes to a character size of 2 or 4
bytes.

That prevents the conformance requirements from becoming much more
difficult for UAs that support obscure encodings.
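A sketch of (2h) restricted as suggested (Python; the candidate list and the exact label comparison are simplifications -- real encoding labels have aliases like "UTF-16BE" that this does not handle):

```python
def sniff_charset_rule(raw, candidates=("utf-8", "utf-16-be", "utf-16-le")):
    # Rule (2h), limited as proposed: only try encodings in which ASCII
    # characters come out as ASCII bytes or as ASCII padded with null
    # bytes -- which the three default candidates all satisfy. For each
    # candidate X, test whether the stream begins with '@charset "X";'
    # encoded in X.
    for enc in candidates:
        prefix = '@charset "'.encode(enc)
        if raw.startswith(prefix):
            end = raw.find('";'.encode(enc), len(prefix))
            if end != -1:
                label = raw[len(prefix):end].decode(enc)
                if label.lower() == enc:  # the rule must name X itself
                    return enc
    return None
```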

-David

-- 
L. David Baron                                <URL: http://dbaron.org/ >
