Re: [CSS21] response to issue 115 (and 44) from Ernest Cline on 2004-02-25 (www-style@w3.org from February 2004)

From: Ernest Cline <ernestcline@mindspring.com>
Date: Wed, 25 Feb 2004 13:08:27 -0500
To: "W3C CSS List" <www-style@w3.org>
Message-ID: <410-2200423251882762@mindspring.com>

Just thought of some complications for what has already been
given.

The following 16 code points, U+1FEFF, U+2FEFF, ..., U+10FEFF,
can, if one of them is the first character of a UTF-32 stylesheet
that uses the 4321 or 3412 byte ordering, be misinterpreted as the
UTF-16 BOM followed by one of the characters U+0001 to U+0010.
This is largely theoretical at this point: with the exception
of the private use characters U+FFEFF and U+10FEFF, none
of these has been assigned yet, and stylesheets that
use private use characters are no doubt rare.
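
To make the byte-level collision concrete, here is a small Python
sketch. The helper name utf32_bytes and the way the byte orders are
spelled ("4321", "3412") are my own choices for illustration, not
anything taken from the CSS 2.1 draft.

```python
import struct

UTF16_BOM_BE = b"\xfe\xff"
UTF16_BOM_LE = b"\xff\xfe"

def utf32_bytes(cp, order):
    """Serialize code point cp as 4 bytes in the given byte order.

    order is a string such as "1234" (big-endian), "4321"
    (little-endian) or "3412"; this naming is an assumption made
    for this sketch.
    """
    big_endian = struct.pack(">I", cp)  # bytes in 1234 order
    return bytes(big_endian[int(d) - 1] for d in order)

collisions = []
for plane in range(1, 17):            # planes 1..16
    cp = (plane << 16) | 0xFEFF       # U+1FEFF ... U+10FEFF
    as_4321 = utf32_bytes(cp, "4321")
    as_3412 = utf32_bytes(cp, "3412")
    # The first two bytes are indistinguishable from a UTF-16 BOM.
    assert as_4321[:2] == UTF16_BOM_LE
    assert as_3412[:2] == UTF16_BOM_BE
    # The remaining two bytes then read as one UTF-16 character.
    after_le = ord(as_4321[2:].decode("utf-16-le"))
    after_be = ord(as_3412[2:].decode("utf-16-be"))
    collisions.append((cp, after_le, after_be))

for cp, after_le, after_be in collisions:
    print(f"U+{cp:04X} -> BOM + U+{after_le:04X} / BOM + U+{after_be:04X}")
```

Each of the 16 code points ends up looking like a UTF-16 BOM
followed by a control character whose value is the plane number.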

If the first character of a UTF-16BE stylesheet is U+EFBB and
the second is U+BFxx, then the first three bytes (EF BB BF) would
be misinterpreted as the UTF-8 BOM.  However, it seems extremely
unlikely that these two characters would occur next to each other,
given the two blocks they come from (Private Use Area and
Hangul Syllables).

Not so theoretical is the possible interaction between UTF-16LE
and UTF-8: U+BBEF followed by U+xxBF does not seem to be
a totally improbable combination for a Korean stylesheet.
(U+BBEF is in the Hangul Syllables block.)
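
The two UTF-16 cases above can be checked with a few lines of
Python. The particular second characters (U+BF00 and U+A7BF) are
just sample values standing in for the "xx" placeholders, and the
function name is made up for this sketch.

```python
import codecs

def looks_like_utf8_bom(data):
    """True if the byte stream starts with EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

# UTF-16BE: U+EFBB (private use) followed by U+BFxx (here U+BF00).
be_case = "\uefbb\ubf00".encode("utf-16-be")

# UTF-16LE: U+BBEF (Hangul) followed by U+xxBF (here U+A7BF).
le_case = "\ubbef\ua7bf".encode("utf-16-le")

print(be_case.hex(" "), looks_like_utf8_bom(be_case))
print(le_case.hex(" "), looks_like_utf8_bom(le_case))
```

Both streams begin with the bytes EF BB BF even though neither
contains a BOM, which is why BOM sniffing alone can guess wrong.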

However, I don't think this is a major problem.

In all of these cases the alternative would be to fall back
to Step (3), the linking document's charset.  However, it might
be wise to add explicitly to the algorithm that if the stylesheet
proves to be unparsable or undecodable given the presumed
encoding, then a UA (MAY/SHOULD/MUST) try again starting at the
next step.  [Not certain which of the three modifiers ought to be
used here.]

Example: A UTF-16LE stylesheet starting off with the characters
U+BBEF U+A7BF U+C0D9 U+B325  is accessed from an HTML
document also encoded in UTF-16LE but with out-of-band info
that declares the stylesheet to be UTF-16BE.

Result of Step 1) It tries to parse it as UTF-16BE, but discovers
unpaired surrogates and other things that keep it from being
parsable, so it then tries again, starting at Step 2).

Result of Step 2) It discovers what looks like a UTF-8 BOM, but
again discovers that the stream is not UTF-8, so it tries again
using Step 3).

Result of Step 3) Here it discovers that it can understand it as
a UTF-16LE stylesheet, and so it uses that interpretation.
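
The walk-through above can be sketched in Python as follows. This
is only an illustration of the proposed "try again at the next
step" rule, not the algorithm text from the draft: the function
names, the three-step list, and the use of a strict decode as the
test for "undecodable" are all assumptions of mine, and Step 2 is
reduced to BOM sniffing (no @charset handling).

```python
import codecs

def sniff_bom(data):
    """Hypothetical Step 2: guess an encoding from the leading bytes."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return None

def decode_stylesheet(data, out_of_band=None, linking_charset=None):
    """Try each step in order, moving on when decoding fails."""
    candidates = [
        ("step 1", out_of_band),       # e.g. HTTP charset parameter
        ("step 2", sniff_bom(data)),   # BOM in the stylesheet itself
        ("step 3", linking_charset),   # charset of the linking document
    ]
    for step, encoding in candidates:
        if encoding is None:
            continue
        try:
            return step, encoding, data.decode(encoding)  # strict decode
        except (UnicodeDecodeError, LookupError):
            continue  # undecodable or unknown encoding: next step
    raise ValueError("no step produced a usable encoding")

# The stylesheet from the example: four characters stored as UTF-16LE.
data = "\ubbef\ua7bf\uc0d9\ub325".encode("utf-16-le")

step, encoding, text = decode_stylesheet(
    data, out_of_band="utf-16-be", linking_charset="utf-16-le")
print(step, encoding)
```

Read as UTF-16BE, the bytes D9 C0 form a high surrogate that is not
followed by a low surrogate, so Step 1 fails; read as UTF-8, the
byte A7 after the apparent BOM is a stray continuation byte, so
Step 2 fails; Step 3 then succeeds with UTF-16LE.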

Received on Wednesday, 25 February 2004 13:08:24 UTC