
Re: guessing character encoding (was HTML WG)

From: Sander Tekelenburg <st@isoc.nl>
Date: Fri, 13 Jul 2007 18:22:20 +0200
Message-Id: <p06240628c2bd50709aff@[192.168.0.102]>
To: public-html@w3.org

At 08:19 +0300 UTC, on 2007-07-13, Dmitry Turin wrote:

> Good day, Robert.
>
> RB> I was wondering what character encoding you use to serve up this page:
> RB> <http://html60.chat.ru/site/html60/ru/index_ru.htm>
> RB> We're trying to conduct some tests on current UAs and this page might
> RB> be helpful. Do you know what charset it uses?
>
> All pages in the Russian language are encoded in Windows-1251.
> These documents are displayed correctly in both IE and Opera.

Only because those browsers happen to guess what you intend. The pages are
not presented as you intend in iCab 3.0.3, Firefox 2.0.0.4, or Safari 2.0.4
(because neither the server nor the document itself says what character
repertoire the document is in).

Is there any particular reason why you're relying on UAs to guess what
character repertoire the document is in? (I believe HTML5 aims to define a
perfect guessing algorithm, but AFAIK the idea is 'just' to unify UA
behaviour. I don't believe the intention is that authors rely on that --
they're still expected to provide the proper Content-Type header, or a <meta
charset="value">:
<http://www.whatwg.org/specs/web-apps/current-work/multipage/section-document.html#charset0>)
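For illustration, here is a minimal sketch of declaring the encoding in the
document itself (the charset value windows-1251 is assumed from the page
discussed above; adjust to whatever the document actually uses):

```html
<!-- HTML 4.01 pragma form: -->
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">

<!-- Proposed HTML5 shorthand, per the spec section linked above: -->
<meta charset="windows-1251">
```

Alternatively, the server can send it directly in the HTTP response header:
Content-Type: text/html; charset=windows-1251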

Now I'm aware that apparently there is a practical problem with authoring
Cyrillic, in that four or five different encodings are commonly used. Russian
Apache deals with that through content negotiation:
<http://apache.lexa.ru/english/>. But I see no reason for authors to rely on
UAs to just magically guess the correct character repertoire. Or is there?
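A hypothetical sketch (in Python, not from this thread) of why guessing is
inherently unreliable: the very same bytes decode without any error under
more than one common Cyrillic encoding, so a UA has no hard signal that its
guess is wrong -- only a declared charset tells it which reading was meant.

```python
# Hypothetical example: encode a Russian word in Windows-1251, then
# decode the same bytes as KOI8-R. Both decodes succeed silently.
data = "Привет".encode("windows-1251")

as_cp1251 = data.decode("windows-1251")  # the intended text
as_koi8r = data.decode("koi8_r")         # also decodes without error

print(as_cp1251)              # Привет
print(as_cp1251 != as_koi8r)  # True: same bytes, different text
```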


-- 
Sander Tekelenburg
The Web Repair Initiative: <http://webrepair.org/>
Received on Friday, 13 July 2007 16:46:46 GMT
