- From: Ernest Cline <ernestcline@mindspring.com>
- Date: Wed, 25 Feb 2004 13:08:27 -0500
- To: "W3C CSS List" <www-style@w3.org>
Just thought of some complications for what has already been given.

The following 16 code points: U+1FEFF, U+2FEFF, ... , U+10FEFF, can, if one of them is the first character of a UTF-32 stylesheet that uses the 4321 or 3412 byte ordering, be misinterpreted as the UTF-16 BOM followed by one of the characters U+0001 to U+0010. This is largely theoretical at this point: with the exception of the private use characters U+FFEFF and U+10FEFF, none of these have been assigned as of yet, and stylesheets that use private use characters are no doubt rare.

If the first character of a UTF-16BE stylesheet is U+EFBB and the second is U+BFxx, then the start of the stylesheet would be misinterpreted as the UTF-8 BOM. However, it seems extremely unlikely that these two characters would occur next to each other, given the two blocks the characters would come from.

Not so theoretical is the possible interaction between UTF-16LE and UTF-8: U+BBEF followed by U+xxBF does not seem to be a totally improbable combination for a Korean stylesheet. (U+BBEF is in the Hangul Syllables block.)

However, I don't think this is a major problem. In all of these cases the alternative would be to fall back to Step (3), the linking document's charset. Still, it might be wise to add explicitly to the algorithm that if the stylesheet proves to be unparsable or undecodable given the presumed encoding, then a UA (MAY/SHOULD/MUST) try again starting at the next step. [Not certain which of the three modifiers ought to be used here.]

Example: A UTF-16LE stylesheet starting off with the characters U+BBEF U+A7BF U+C0D9 U+B325 is accessed from an HTML document also encoded in UTF-16LE, but with out-of-band info that declares the stylesheet to be UTF-16BE.

Result of Step 1) The UA tries to parse it as UTF-16BE, but discovers unpaired surrogates and other things that keep it from being parsable, so it tries again, starting at Step 2).

Result of Step 2) It discovers what looks like a UTF-8 BOM, but then discovers that the stream is not valid UTF-8, so it tries again using Step 3).

Result of Step 3) Here it discovers that it can understand the stream as a UTF-16LE stylesheet, and so it uses that interpretation.
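
The retry behaviour in the example above could be sketched roughly as follows. This is only an illustration in Python, not proposed spec text: the names candidates and decode_stylesheet are made up for the sketch, it treats "undecodable" as a strict decoding failure, and it does not attempt the "unparsable" check, which would need an actual CSS parser.

```python
import codecs

# BOMs are checked longest-first, since the UTF-32LE BOM begins with the
# UTF-16LE BOM.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def candidates(data, out_of_band=None, linking_charset=None):
    """Yield (step, encoding, payload) guesses in step order (hypothetical helper)."""
    if out_of_band:                      # Step 1: out-of-band declaration
        yield 1, out_of_band, data
    for bom, name in BOMS:               # Step 2: something that looks like a BOM
        if data.startswith(bom):
            yield 2, name, data[len(bom):]
            break
    if linking_charset:                  # Step 3: linking document's charset
        yield 3, linking_charset, data


def decode_stylesheet(data, out_of_band=None, linking_charset=None):
    """Return (step, encoding, text) for the first guess that decodes strictly."""
    for step, encoding, payload in candidates(data, out_of_band, linking_charset):
        try:
            return step, encoding, payload.decode(encoding, errors="strict")
        except (UnicodeDecodeError, LookupError):
            continue  # undecodable under this guess: try again at the next step
    raise ValueError("no candidate encoding could decode the stylesheet")


# The example from this message: U+BBEF U+A7BF U+C0D9 U+B325 stored as UTF-16LE,
# declared out-of-band as UTF-16BE, linked from a UTF-16LE document.
sheet = "\ubbef\ua7bf\uc0d9\ub325".encode("utf-16-le")
step, encoding, text = decode_stylesheet(
    sheet, out_of_band="utf-16-be", linking_charset="utf-16-le"
)
print(step, encoding)  # 3 utf-16-le
```

Here Step 1 fails on the unpaired surrogate U+D9C0, Step 2 fails because the byte after the apparent UTF-8 BOM is a stray continuation byte, and Step 3 succeeds.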
Received on Wednesday, 25 February 2004 13:08:24 UTC