- From: L. David Baron <dbaron@dbaron.org>
- Date: Fri, 20 Feb 2004 17:48:31 -0800
- To: www-style@w3.org
On Friday 2004-02-20 23:26 +0100, Bert Bos wrote:
> 1) Trust the HTTP header (or similar out-of-band information in other
>    protocols). If the file then appears to start with a U+FEFF
>    character, ignore it. If there is a @charset at the start or after
>    that U+FEFF, ignore it. Otherwise, start parsing at the first
>    character.

I don't like putting these "ignore" rules here. Encoding determination
and parsing are separate processes -- the latter happens after the
encoding has been determined and the byte stream has been converted
into a character stream. The parsing should be described as starting
from the beginning of the stylesheet no matter what encoding
determination path is followed. So the "ignore BOM" rule (and the
@charset parsing rules) should be stated as part of the parsing rules,
not as part of the encoding determination rules.

> 2) If the header gives no encoding, try to recognize a U+FEFF
>    and/or @charset in various encodings (see algorithm below). Then
>    use the encoding that worked to parse the remainder of the file.
>
> 3) If neither the header nor looking for U+FEFF or @charset yield an
>    encoding, but this style sheet was loaded because a document
>    linked to it (or linked to a style sheet that in turn linked to
>    it, recursively), then use the encoding of the document (or
>    style sheet) that linked to this one.

I'd like to see this step removed in css3-syntax, but it should
probably stay for CSS 2.1.

> 4) If all else fails, assume UTF-8.

> I also omitted the CHARSET parameter of the LINK element in HTML. Is
> that a problem?

I think it should stay. It was part of CSS2, and should remain,
between steps (2) and (3) above. I'd also like to see it removed for
css3-syntax, though.

> The algorithm for (2) would be as follows:
>
> 2a) If the first bytes are 00 00 FE FF, use UCS-4 (1234 order).
>     Remove those bytes. If they are followed by "@charset
>     <anything>;" remove that as well.
>
> 2b) If the first bytes are FF FE 00 00, use UCS-4 (4321 order).
>     Remove those bytes. If they are followed by "@charset
>     <anything>;" remove that as well.
>
> 2c) If the first bytes are 00 00 FF FE, use UCS-4 (2143 order).
>     Remove those bytes. If they are followed by "@charset
>     <anything>;" remove that as well.
>
> 2d) If the first bytes are FE FF 00 00, use UCS-4 (3412 order).
>     Remove those bytes. If they are followed by "@charset
>     <anything>;" remove that as well.
>
> 2e) If the first bytes are FE FF xx, where xx is not 00, use UTF-16-BE.
>     Remove the first two bytes. If they are followed by "@charset
>     <anything>;", remove that as well.
>
> 2f) If the first bytes are FF FE xx, where xx is not 00, use UTF-16-LE.
>     Remove the first two bytes. If they are followed by "@charset
>     <anything>;", remove that as well.

Why does xx need to be something other than 00? That could be
perfectly valid in either case -- as an identifier, and the UCS-4
cases are handled by rules (2b) and (2d). (It might be worth noting
that these two tests must occur after the (2b) and (2d) tests.)

> 2g) If the first bytes are EF BB BF, use UTF-8.
>     Remove those bytes. If they are followed by "@charset
>     <anything>;" remove that as well.

Again, I think the mention of removing "@charset" in this part of the
process should be removed.

> 2h) For all encodings X that the UA knows, starting with UTF-8,
>     UTF-16-BE and UTF-16-LE, if the first bytes correspond to
>     '@charset "X";' (case-insensitive) in encoding X, use that
>     encoding X and remove those bytes.

I don't think this should be a MUST requirement for all encodings --
rather only those encodings where ASCII characters are encoded as in
ASCII or padded with null bytes to a character size of 2 or 4 bytes.
That prevents the conformance requirements from becoming much more
difficult for UAs that support obscure encodings.

-David

-- 
L. David Baron <URL: http://dbaron.org/ >
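
As a rough illustration of the ordering discussed above, the following Python sketch tests the four-byte UCS-4 signatures before the two-byte UTF-16 ones, so the "xx is not 00" condition is not needed. It also keeps encoding determination separate from parsing: no bytes are removed here. The names (BOM_TABLE, sniff_bom, determine_encoding), the parameter names, and the returned labels are hypothetical, and the @charset sniffing of step (2h) is deliberately left out.

```python
# Hypothetical sketch; names and labels are illustrative, not taken from any spec.
# Order matters: the 4-byte UCS-4 signatures come first, so FF FE 00 00 and
# FE FF 00 00 are never mistaken for a UTF-16 BOM followed by a 00 byte.
BOM_TABLE = [
    (b"\x00\x00\xfe\xff", "UCS-4 (1234 order)"),  # 2a
    (b"\xff\xfe\x00\x00", "UCS-4 (4321 order)"),  # 2b
    (b"\x00\x00\xff\xfe", "UCS-4 (2143 order)"),  # 2c
    (b"\xfe\xff\x00\x00", "UCS-4 (3412 order)"),  # 2d
    (b"\xfe\xff", "UTF-16-BE"),                   # 2e, no "xx is not 00" test
    (b"\xff\xfe", "UTF-16-LE"),                   # 2f, no "xx is not 00" test
    (b"\xef\xbb\xbf", "UTF-8"),                   # 2g
]


def sniff_bom(data):
    """Return an encoding label if data starts with a known BOM, else None.

    The input is only inspected; skipping the BOM is left to the parser.
    """
    for signature, label in BOM_TABLE:
        if data.startswith(signature):
            return label
    return None


def determine_encoding(data, http_charset=None, link_charset=None,
                       referrer_encoding=None):
    """Pick an encoding using the precedence discussed in the message."""
    if http_charset:           # step 1: out-of-band information wins
        return http_charset
    sniffed = sniff_bom(data)  # step 2: BOM only (@charset sniffing omitted)
    if sniffed:
        return sniffed
    if link_charset:           # CHARSET on LINK, between steps 2 and 3
        return link_charset
    if referrer_encoding:      # step 3: encoding of the linking document
        return referrer_encoding
    return "UTF-8"             # step 4: final fallback
```

With this ordering, the bytes FE FF 00 40 (a UTF-16-BE BOM followed by "@") are classified as UTF-16-BE, while FE FF 00 00 still matches the UCS-4 rule first.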
Received on Friday, 20 February 2004 20:48:34 UTC