- From: Ernest Cline <ernestcline@mindspring.com>
- Date: Fri, 20 Feb 2004 21:59:13 -0500
- To: "Bert Bos" <bert@w3.org>, "WWW Style" <www-style@w3.org>
> [Original Message]
> From: Bert Bos <bert@w3.org>

> I also omitted the CHARSET parameter of the LINK element in HTML.
> Is that a problem?

No. Based on section 5.2.2 of the HTML 4.01 standard, it is fairly clear
that the charset attribute should be considered a source of out-of-band
information as mentioned in step (1) of your algorithm, and as such, should
be handled in accordance with how the standard says to choose between the
multiple possible sources of out-of-band info.

> The algorithm for (2) would be as follows:

2a-2d) Detect one of the UTF-32 BOM variants.

> 2e) If the first bytes are FE FF xx, where xx is not 00, use UTF-16-BE.
> Remove the first two bytes. If they are followed by "@charset
> <anything>;", remove that as well.
>
> 2f) If the first bytes are FF FE xx, where xx is not 00, use UTF-16-LE.
> Remove the first two bytes. If they are followed by "@charset
> <anything>;", remove that as well.

And what if the third byte is 00, as in FE FF 00 40? You've already
eliminated the possibility of UTF-32 by the first four steps. Taken
literally, your algorithm could cause a UTF-16 stylesheet to be taken as
a different encoding because of step (3), although I doubt that was your
intention.

> 2g) If the first bytes are EF BB BF, use UTF-8.
> Remove those bytes. If they are followed by "@charset
> <anything>;" remove that as well.

So is CESU-8 to be implicitly prohibited from using a BOM, unless
identified as such by out-of-band info, since a BOM would cause it to be
treated as UTF-8? (I could live with that, as CESU-8 isn't really intended
for transmission of data.)

> 3) If neither the header nor looking for U+FEFF or @charset
> yield an encoding, but this style sheet was loaded because
> a document linked to it (or linked to a style sheet that in turn
> linked to it, recursively), then use the encoding of the
> document (or style sheet) that linked to this one.
>
> 4) If all else fails, assume UTF-8.

How could step (3) fail to determine a character encoding?
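
To make the FE FF 00 40 case concrete, here is a minimal Python sketch of the BOM checks in steps 2a-2g. It is only an illustration, not the proposed algorithm itself: the function name `sniff_bom` and the `literal` flag are hypothetical, and I am assuming that steps 2a-2d amount to the two standard UTF-32 BOMs (00 00 FE FF and FF FE 00 00), since the other variants are not spelled out here. The "@charset" removal is left out.

```python
def sniff_bom(data: bytes, literal: bool = True):
    """Return (encoding, bom_length), or (None, 0) if no BOM rule matches.

    literal=True applies the "xx is not 00" condition of 2e/2f as written;
    literal=False drops it (a hypothetical variant, not part of the proposal).
    """
    # Stand-in for steps 2a-2d (assumption: only the two standard UTF-32 BOMs).
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be", 4
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le", 4

    # Steps 2e/2f: UTF-16 BOM followed by a third byte that is not 00.
    # With fewer than three bytes there is no xx to test, so nothing matches.
    third_byte_ok = len(data) > 2 and data[2] != 0
    if data.startswith(b"\xfe\xff") and (third_byte_ok or not literal):
        return "utf-16-be", 2
    if data.startswith(b"\xff\xfe") and (third_byte_ok or not literal):
        return "utf-16-le", 2

    # Step 2g: UTF-8 BOM.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8", 3

    # Nothing matched: the caller falls through to steps (3) and (4).
    return None, 0


# A UTF-16-BE sheet that begins with "@" (U+0040) right after the BOM.
sheet = b"\xfe\xff\x00\x40"
print(sniff_bom(sheet))                 # literal reading of 2e
print(sniff_bom(sheet, literal=False))  # reading without the "not 00" test
```

Under the literal reading the sheet matches none of the rules and falls through to step (3), even though the two bytes after the BOM decode to "@" as UTF-16-BE. Dropping the "not 00" test for UTF-16-BE looks safe once the UTF-32 checks have already run, because no UTF-32 BOM begins with FE FF 00 under the assumption above.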
Received on Friday, 20 February 2004 21:59:16 UTC