
Re: guessing character encoding (was HTML WG)

From: Sander Tekelenburg <st@isoc.nl>
Date: Wed, 18 Jul 2007 18:56:28 +0200
Message-Id: <p0624065cc2c3ed2d030c@[192.168.0.102]>
To: public-html@w3.org

At 08:20 +0300 UTC, on 2007-07-18, Dmitry Turin wrote:

[<http://html60.chat.ru/site/html60/ru/index_ru.htm>]

> ST> Is there any particular reason why you're relying on UAs to guess what
> ST> character repertoire the document is in? [...]
> RB> Servers rarely include a charset
> RB> header and that might be a good thing, because those would likely be
> RB> often wrong too.

The default server config should indeed not claim a character repertoire. But
the author should configure one.
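For concreteness, an author-configured server would declare the repertoire in its response header along these lines (UTF-8 is just an example value):

```
Content-Type: text/html; charset=UTF-8
```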

> AF> It is an author's error to publish document without
> AF> providing information of what encoding is used in it.
>
>   Guessing is not in deal. Purpose is to give possibility to user
> to change encoding manually in browser menu and follow along anchors.

AFAIK every browser always allows the user to override the character
repertoire, even when it is claimed by the server/document. (I don't know
whether that's just common wisdom, or spec-required.) But even so, if you
serve your documents with the correct charset info, users don't *need* to
change anything: the UA will apply the claimed character repertoire.

>   Let's define terms:
> 'falling of encoding', which means that the browser shows the document as if
> written in a different encoding than the one the document is actually in;
> 'anchor falling', which means that 'falling of encoding' occurs in a new
> document, after the user has followed an <a href> in the previous document.

If the page pointed to by the anchor is served with the proper charset info,
I don't see why it should fail -- even if it uses a different character
repertoire than the previous page.

>   I met three cases of anchor falling:
> (1) when surfing documents on a server
> (1.1) the new document does not contain frames, i.e. is a single document
> (1.2) the anchor falling occurs in a frame

Documents that are loaded in an iframe have their own HTTP headers too, so I
don't see why iframes would be a special case.

I'll grant you I haven't experimented with serving mixed character repertoires
though -- a main document with one charset, and a document with a different
charset embedded in an iframe. Do UAs get that wrong?
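A minimal test case for that experiment might look like this (filenames and charset values are hypothetical; each file would be served with its own Content-Type header):

```
<!-- main.html, served as: Content-Type: text/html; charset=UTF-8 -->
<!DOCTYPE html>
<html><body>
  <p>Outer document in UTF-8.</p>
  <!-- legacy.html would be served as: text/html; charset=windows-1251 -->
  <iframe src="legacy.html"></iframe>
</body></html>
```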

> (2) when surfing documents on the local file system
> after downloading a site -
> anchor falling occurs because <meta content="text/html; charset="> and the
> real encoding differ from each other.

If you use meta http-equiv to provide the charset, you must (of course)
ensure that it is exactly the same as the HTTP header (or, in HTML5, that the
HTTP header claims no charset).
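For example, a matching pair would look like this (assuming the server sends UTF-8; the value itself is only illustrative):

```
<!-- HTTP header sent by the server:
     Content-Type: text/html; charset=UTF-8 -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
```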

The only situation in which I can imagine you'd set a meta charset that is
different from the http charset is when [1] the server is misconfigured to
serve some default charset value and [2] you cannot change that. But in that
case you should simply change to a better server.

Btw, for shared hosts, people seem to simply assume that they cannot generate
proper HTTP headers. But for instance Apache allows each user to configure
their own area of the server. So unless the admin crippled that, you can
generate a proper HTTP Content-Type header through .htaccess. If that's
crippled, and you have something like PHP available, you can use that to
generate proper HTTP headers.
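Assuming the host runs Apache and AllowOverride permits it, a one-line .htaccess does the job (UTF-8 as an example value):

```
# .htaccess -- declare the charset for text/html served from this directory
AddDefaultCharset UTF-8
```

Failing that, with PHP available, calling header('Content-Type: text/html; charset=UTF-8'); before any output has the same effect.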

[...]

>   What about a guessing algorithm to improve today's browsers,

HTML5 already defines that algorithm:
<http://www.whatwg.org/specs/web-apps/current-work/multipage/section-parsing.html#determining0>

But that's error recovery -- no reason for authors to rely on that. No matter
how well the algorithm is thought out, and even assuming all UAs implement it
flawlessly (unlikely), you're still relying on the UA to understand what you
mean and discard what you say. Even if an author understands the algorithm
well enough to rely on it, doing so excludes pre-HTML5 UAs from presenting
the document reliably.
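For illustration, the first step of that algorithm -- the byte-order-mark check -- can be sketched in Python (a simplified sketch, not the full spec; the real algorithm falls through to a <meta> prescan and further heuristics when no BOM is present):

```python
def sniff_bom(data: bytes):
    # Return the encoding implied by a leading byte-order mark, or None.
    # This models only the first step of HTML5 encoding sniffing; later
    # steps (meta prescan, frequency analysis) are not modelled here.
    if data.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    if data.startswith(b'\xfe\xff'):
        return 'utf-16be'
    if data.startswith(b'\xff\xfe'):
        return 'utf-16le'
    return None  # no BOM: fall through to the later sniffing steps
```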

> maybe there is reason to borrow it from Russian text editors,
> which auto-detect the encoding.

Obviously Opera and IE already do ;) (Or well, at least in this case they
happen to do what you hoped for.) As for text editors, BBEdit does and I'd
expect many others do too. I don't know if 'WYSIWYG' editors like GoLive,
Dreamweaver, Freeway, etc. do. Nvu probably does?


-- 
Sander Tekelenburg
The Web Repair Initiative: <http://webrepair.org/>
Received on Wednesday, 18 July 2007 17:06:47 GMT
