Re: [CSS21] response to issue 115 (and 44) from Henri Sivonen on 2004-02-23 (www-style@w3.org from February 2004)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Mon, 23 Feb 2004 19:57:40 +0200
To: WWW Style <www-style@w3.org>
Message-Id: <C5FCFB2E-6629-11D8-B2C2-003065B8CF0E@iki.fi>

On Feb 23, 2004, at 16:09, Jukka K. Korpela wrote:

> On Mon, 23 Feb 2004, Henri Sivonen wrote:
>
>> On Feb 21, 2004, at 00:26, Bert Bos wrote:
>>
>>>  4) If all else fails, assume UTF-8.
>>
>> Why not windows-1252 (with the few undefined bytes mapped to
>> *something* so that all byte streams can be converted to some
>> "characters")?
>
> Either guess is bound to be wrong in some cases. And if the guess turns
> out to result in something containing undefined octets, I think we can
> relatively safely guess that the guess was wrong.

Yes, but restarting the parser at that point is expensive. If browsers 
weren't interactive applications that parse data from a network stream, 
they could first check whether the byte stream happens to be a valid 
UTF-8 byte stream, because valid UTF-8 streams don't tend to occur 
accidentally.
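
A minimal sketch of that up-front check, assuming the whole byte stream is 
already in memory (which is exactly what a browser reading from the network 
does not have). The function name is made up for illustration:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the complete byte string is well-formed UTF-8."""
    try:
        # Strict decoding rejects any malformed or truncated sequence.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Pure ASCII is trivially valid UTF-8.
ascii_sheet = b"p { color: red }"
# The same non-ASCII comment in two different encodings.
utf8_sheet = "/* v\u00e4ri */ p { color: red }".encode("utf-8")
legacy_sheet = "/* v\u00e4ri */ p { color: red }".encode("windows-1252")
```

Here the windows-1252 bytes fail the check because a lone 0xE4 followed by 
an ASCII letter is not a well-formed UTF-8 sequence.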

>> Anyway, it's just
>> plain stupid to use non-ASCII outside comments in a style sheet that
>> doesn't have a character encoding label and doesn't have a BOM, so in
>> the relatively rare cases where this heuristic fails, the author would
>> have only him/herself to blame.
>
> Indeed. And currently most style sheets contain Ascii only.

Except non-ASCII occurs in comments--especially in comments that are 
not in English. In order to be useful in practice, the last resort 
needs to handle the case where the declarations are in ASCII but the 
comments contain non-ASCII gremlins. Assuming UTF-8 and using a 
draconian UTF-8 decoder would cause perceived breakage.
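
To illustrate the breakage, here is a rough sketch with a made-up style 
sheet whose only non-ASCII byte sits inside a comment. The helper name and 
the sample bytes are assumptions for the example, not taken from any 
browser:

```python
from typing import Optional

# ASCII declarations, one windows-1252 byte (0xE4) inside the comment.
SHEET = b"/* v\xe4ri */ p { color: red }"


def first_bad_byte(data: bytes) -> Optional[int]:
    """Return the offset where strict UTF-8 decoding fails, or None."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first offending byte.
        return exc.start
    return None


bad_offset = first_bad_byte(SHEET)
comment_end = SHEET.index(b"*/")
```

The offending byte lies before the end of the comment, so a decoder that 
gives up at the first error would throw away declarations that were 
perfectly fine ASCII.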

> This is all about error processing, unless I'm missing something.

Not exactly, if the guessing is made part of an official sniffing 
algorithm. (In XML, for example, the UTF-8 default is not about error 
processing but about defaulting.)

> And it seems that it's about a small minority of cases (_within_ the
> current minority of style sheets for which this is relevant at all).
> I think it would be best to simply state that if the encoding cannot
> be determined in the three given steps, browsers
> a) may apply whatever error processing they find suitable

But, as Ian Hickson pointed out, then the spec would be less useful and 
everyone would have to just reverse engineer the market leader.

> b) should assume Ascii, if the style sheet
> contains only octets with most significant bit set to zero.

Why would assuming ASCII be more useful than assuming windows-1252? The 
windows-1252 assumption works for ASCII, ISO-8859-1 and windows-1252. 
It also covers cases where the encoding is an arbitrary superset of 
ASCII and the non-ASCII characters only occur in comments (including 
UTF-8 comments).
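
As a sketch of such a last resort, assuming Python's codec for 
windows-1252 and choosing U+FFFD as the "something" for the few undefined 
bytes (that choice is mine for the example, not anything a spec says):

```python
def decode_fallback(data: bytes) -> str:
    """Decode as windows-1252 without ever raising.

    Bytes that windows-1252 leaves undefined (such as 0x81) are
    mapped to U+FFFD by the "replace" error handler.
    """
    return data.decode("windows-1252", errors="replace")


RULE = "p { color: red }"

# The same sheet stored as ASCII, ISO-8859-1 and UTF-8.
ascii_text = decode_fallback(RULE.encode("ascii"))
latin1_text = decode_fallback(("/* v\u00e4ri */ " + RULE).encode("iso-8859-1"))
utf8_text = decode_fallback(("/* v\u00e4ri */ " + RULE).encode("utf-8"))
```

With the UTF-8 input the comment comes out garbled (the two bytes of the 
"ä" decode as two separate characters), but the ASCII declarations after 
it survive unchanged, which is the part that matters to the parser.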

-- 
Henri Sivonen
hsivonen@iki.fi
http://iki.fi/hsivonen/

Received on Monday, 23 February 2004 12:58:47 UTC