Re: Guessing "correct" character set (was: [CSS21] response to issue 115 (and 44)) from Mikko Rantalainen on 2004-02-21 (www-style@w3.org from February 2004)

From: Mikko Rantalainen <mira@cc.jyu.fi>
Date: Sat, 21 Feb 2004 08:11:02 -0500 (EST)
To: WWW Style <www-style@w3.org>
Message-ID: <4037590A.9000804@cc.jyu.fi>

Bert Bos / 2004-02-21 00:26:
> This problem of finding the encoding of a file is complicated, not
> just because it is so hard to imagine for spec writers and programmers
> what a program actually sees when the encoding is wrong, but also for
> other reasons:
> 
>   - Most HTTP servers don't send the charset param, we're not going to
>     change that overnight.

> So, if we assume that we can change the browsers in time, what do we
> want in CSS3? I'd say this:

I would suggest the following:

1) If HTTP header defines character set, then use it
2) If HTTP header doesn't define character set, use UTF-8.
(no more rules)
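
A rough sketch of rules 1) and 2) in Python, for illustration only.
The function name pick_charset is made up here, and the sketch
assumes the caller passes in the raw value of the HTTP Content-Type
header (or None when the server sent none). It leans on the standard
library's email.message.Message to parse the charset parameter.

```python
from email.message import Message


def pick_charset(content_type_header):
    """Hypothetical helper: apply rules 1) and 2) to a Content-Type value."""
    if content_type_header:
        msg = Message()
        msg["Content-Type"] = content_type_header
        # get_content_charset() returns the charset parameter in lower
        # case, or None when the header carries no charset parameter.
        charset = msg.get_content_charset()
        if charset:
            return charset  # rule 1: the HTTP header defines it
    return "utf-8"          # rule 2: no charset in the header


print(pick_charset("text/css; charset=ISO-8859-1"))
print(pick_charset("text/css"))
```

Nothing in the document body is inspected, which is the point of
"(no more rules)".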

However, for historical documents (that is, the majority of the
documents on the web today) I think the recommended behaviour of
the user agent should be to ask the user what to do whenever the
"character set is UTF-8 unless explicitly told otherwise" assumption
results in invalid byte sequences. Perhaps recommend displaying a
dialog that lets the user change the character set of *all
documents* (html, css, javascript) that are missing an explicit
character set in the HTTP headers. Make the user agent explain that
the problem exists because the page author doesn't follow standards,
and that the user agent needs advice to be able to represent the
content correctly. This should be the default behavior; some user
agents may allow the user to opt in to an automagic guessing
mechanism, which may or may not work.
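
The fallback described above could look roughly like this in Python.
This is only a sketch: decode_document and the ask_user callback are
hypothetical names, and the callback stands in for whatever dialog a
real user agent would show. It is assumed to return the name of the
character set the user picked.

```python
def decode_document(raw, declared_charset=None, ask_user=None):
    """Decode bytes per the two rules; ask the user if UTF-8 fails.

    declared_charset: charset from the HTTP header, or None if absent.
    ask_user: hypothetical callback that receives the decode error and
              returns the name of the charset chosen by the user.
    """
    charset = declared_charset or "utf-8"
    try:
        return raw.decode(charset)
    except UnicodeDecodeError as err:
        # Only the "UTF-8 unless told otherwise" assumption is open to
        # user override; an explicitly declared charset is not guessed
        # around in this sketch.
        if declared_charset is not None or ask_user is None:
            raise
        return raw.decode(ask_user(err))


# b"caf\xe9" is "café" in ISO-8859-1 and is not valid UTF-8.
text = decode_document(b"caf\xe9", ask_user=lambda err: "iso-8859-1")
print(text)
```

Valid UTF-8 input never triggers the callback, so authors who follow
the rules cost the user nothing.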

UTF-8 can represent every character anybody needs, so this rule
doesn't cause problems for you as a document author even if you
cannot fix the HTTP header. Just transcode from your current
character set to UTF-8. That shouldn't be a problem when authoring
NEW documents.
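
Transcoding is a decode followed by an encode. A minimal Python
sketch, assuming the author knows the document's current character
set (transcode_to_utf8 is a made-up name):

```python
def transcode_to_utf8(raw, source_charset):
    """Re-encode bytes from a known source charset as UTF-8."""
    # Decoding raises UnicodeDecodeError if raw is not valid in
    # source_charset, so a wrong guess is less likely to pass silently
    # for multi-byte encodings.
    return raw.decode(source_charset).encode("utf-8")


# The ISO-8859-1 byte 0xE4 ("ä") becomes two bytes in UTF-8.
print(transcode_to_utf8(b"\xe4", "iso-8859-1"))
```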

As for the historical documents, I think the spec could include an
informal section explaining some common problems found in old
documents. Supporting automagic charset selection that overrides
rules 1) and 2) above should be optional. Recommend reporting the
problem to the user and asking for advice instead. The more we
can make the document author feel that he gets all the blame, the
faster he'll fix the document. If he doesn't care, it might be that
the document isn't worth reading anyway.

Blame the author of the broken document, not the user.

Changing everything to UTF-8 is going to be a painful process, no
matter how you do it. I'd rather take more pain for a short time
than the other way around.

-- 
Mikko

Received on Saturday, 21 February 2004 09:16:10 UTC