- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Mon, 23 Feb 2004 19:57:40 +0200
- To: WWW Style <www-style@w3.org>
On Feb 23, 2004, at 16:09, Jukka K. Korpela wrote:

> On Mon, 23 Feb 2004, Henri Sivonen wrote:
>
>> On Feb 21, 2004, at 00:26, Bert Bos wrote:
>>
>>> 4) If all else fails, assume UTF-8.
>>
>> Why not windows-1252 (with the few undefined bytes mapped to
>> *something* so that all byte streams can be converted to some
>> "characters")?
>
> Either guess is bound to be wrong in some cases. And if the guess turns
> out to result in something containing undefined octets, I think we can
> relatively safely guess that the guess was wrong.

Yes, but restarting the parser at that point is expensive. If browsers weren't interactive applications that parse data from a network stream, they could first check whether the byte stream happens to be a valid UTF-8 byte stream, because valid UTF-8 streams don't tend to occur accidentally.

>> Anyway, it's just
>> plain stupid to use non-ASCII outside comments in a style sheet that
>> doesn't have a character encoding label and doesn't have a BOM, so in
>> the relatively rare cases where this heuristic fails, the author would
>> have only him/herself to blame.
>
> Indeed. And currently most style sheets contain Ascii only.

Except that non-ASCII occurs in comments, especially in comments that are not in English. In order to be useful in practice, the last resort needs to handle the case where the declarations are in ASCII but the comments contain non-ASCII gremlins. Assuming UTF-8 and using a draconian UTF-8 decoder would cause perceived breakage.

> This is all about error processing, unless I'm missing something.

Not exactly, if the guessing is made part of an official sniffing algorithm. (In XML, for example, the UTF-8 default is not about error processing but about defaulting.)

> And it seems that it's about a small minority of cases (_within_ the
> current minority of style sheets for which this is relevant at all).
>
> I think it would be best to simply state that if the encoding cannot
> be determined in the three given steps, browsers
> a) may apply whatever error processing they find suitable

But, as Ian Hickson pointed out, then the spec would be less useful and everyone would have to just reverse engineer the market leader.

> b) should assume Ascii, if the style sheet
> contains only octets with most significant bit set to zero.

Why would assuming ASCII be more useful than assuming windows-1252? The windows-1252 assumption works for ASCII, ISO-8859-1 and windows-1252. It also covers cases where the encoding is an arbitrary superset of ASCII and the non-ASCII characters only occur in comments (including UTF-8 comments).

-- 
Henri Sivonen
hsivonen@iki.fi
http://iki.fi/hsivonen/
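
A minimal Python sketch of the windows-1252 last-resort argument made in this message. It is an illustration under stated assumptions, not part of any spec or browser: the function names are hypothetical, and mapping the five bytes that windows-1252 leaves undefined to U+FFFD (via Python's `cp1252` codec with `errors="replace"`) is just one possible choice of "something".

```python
import re


def decode_last_resort(data: bytes) -> str:
    """Hypothetical last-resort decoder: treat the bytes as windows-1252.

    Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined;
    errors="replace" maps each of them to U+FFFD, so every byte stream
    decodes to some characters and the parser never has to restart.
    """
    return data.decode("cp1252", errors="replace")


def is_valid_utf8(data: bytes) -> bool:
    """Strict ("draconian") UTF-8 check; it needs the whole stream up front."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def strip_comments(css: str) -> str:
    """Remove /* ... */ comments (simplified; ignores strings and escapes)."""
    return re.sub(r"/\*.*?\*/", "", css, flags=re.DOTALL)


# The same style sheet saved in two encodings; only the comment is non-ASCII.
SHEET = "/* väri */ p { color: red }"
as_utf8 = SHEET.encode("utf-8")
as_latin1 = SHEET.encode("latin-1")

# The comment text may come out garbled, but the declarations survive intact.
rules_from_utf8 = strip_comments(decode_last_resort(as_utf8))
rules_from_latin1 = strip_comments(decode_last_resort(as_latin1))
```

With these inputs, both decoded sheets yield the same declarations once comments are dropped, whereas a strict UTF-8 decoder would reject the ISO-8859-1 copy outright because of a single byte inside a comment.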
Received on Monday, 23 February 2004 12:58:47 UTC