Re: guessing character encoding (was HTML WG) from Dmitry Turin on 2007-07-18 (public-html@w3.org from July 2007)

From: Dmitry Turin <html60@narod.ru>
Date: Wed, 18 Jul 2007 08:20:08 +0300
To: public-html@w3.org
Message-ID: <1562278813.20070718082008@narod.ru>

ST> Is there any particular reason why you're relying on UAs to guess what
ST> character repertoire the document is in?
ST> But I see no reason for authors to rely on
ST> UAs to just magically guess the correct character repertoire.
RB> Servers rarely include a charset  
RB> header and that might be a good thing, because those would likely be  
RB> often wrong too.
AF> It is an author's error to publish document without 
AF> providing information of what encoding is used in it.

  Guessing is not the point. The purpose is to let the user
change the encoding manually in the browser menu and then follow links.

  Let's introduce two terms:
'falling of encoding', which means that the browser shows a document as if
it were written in an encoding other than its actual one;
'anchor falling', which means that falling of encoding occurs in a new document
after the user has followed an <a href> from the previous document.
  I have met three cases of anchor falling:
(1) when surfing documents on a server:
(1.1) the new document does not contain frames, i.e. it is a single document;
(1.2) anchor falling occurs in a frame;
(2) when surfing documents on the local file system
after downloading a site:
anchor falling occurs because <meta content="text/html; charset="> and
the real encoding differ from each other.
  In case #1.1 the user is forced to use the browser menu in each next document;
in case #1.2 he cannot change the encoding in the frame
(except by saving the frame's document to the local file system and opening the saved file);
in case #2 he is forced to convert the files in the directory and
its subdirectories recursively with an additional program.
  This shows that the problem should exist for every alphabet
whose letters have codes 128-255.
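
  The recursive conversion mentioned for case #2 could look roughly like
the sketch below. It is only an illustration under stated assumptions: the
function name convert_tree, the default encodings and the list of file
suffixes are hypothetical, and the source encoding must already be known.
It rewrites the bytes only and does not touch any <meta> declaration
inside the files.

```python
from pathlib import Path


def convert_tree(root, src="koi8_r", dst="cp1251", suffixes=(".html", ".htm")):
    """Re-encode every matching file under root (recursively) from src to dst.

    Hypothetical helper: names and defaults are assumptions for illustration.
    Returns the number of files converted.
    """
    converted = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in suffixes:
            # Decode with the known source encoding, then write back
            # the same text in the target encoding.
            text = path.read_bytes().decode(src)
            path.write_bytes(text.encode(dst))
            converted += 1
    return converted
```

For example, convert_tree("saved_site") would convert a downloaded copy in
place, so it is safer to run it on a backup of the directory.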

  As for my site, I don't use frames (#1.2), and #2 is prevented by
the availability of an archive of the site. Thus only #1.1
(a conflict between the actual encoding, 'Content-Type' and <meta content="text/html; charset=">)
can threaten me.
  I decided that free hosting would be enough to show documents for discussion.
This means that my papers have an increased probability of anchor falling.
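
  The conflict in case #1.1 can be partly detected by comparing the two
declarations. The sketch below is a rough illustration, not a browser
algorithm: the function names are hypothetical, the <meta> search is a
simple regular expression over the first 1024 bytes (an arbitrary
assumption), and it compares only the declared charsets, not the actual
encoding of the bytes.

```python
import codecs
import re
from typing import Optional, Tuple

_CHARSET = r"charset\s*=\s*[\"']?([\w.:-]+)"


def charset_from_content_type(value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    m = re.search(_CHARSET, value, re.IGNORECASE)
    return m.group(1).lower() if m else None


def charset_from_meta(body: bytes) -> Optional[str]:
    """Extract the charset declared in a <meta> tag near the start of the body."""
    head = body[:1024].decode("ascii", errors="replace")
    m = re.search(r"<meta[^>]*" + _CHARSET, head, re.IGNORECASE)
    return m.group(1).lower() if m else None


def normalize(name: str) -> str:
    """Map aliases (e.g. windows-1251 and cp1251) to one codec name."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name.lower()


def find_conflict(content_type: str, body: bytes) -> Optional[Tuple[str, str]]:
    """Return (header_charset, meta_charset) if both exist and disagree."""
    header = charset_from_content_type(content_type)
    meta = charset_from_meta(body)
    if header and meta and normalize(header) != normalize(meta):
        return (header, meta)
    return None
```

A result of None means only that the two declarations do not contradict
each other; the document itself may still be in a third encoding.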

---

  As for a guessing algorithm to improve today's browsers,
maybe there is reason to borrow one from Russian text editors,
which auto-detect the encoding.
Statistically the task is made easier by the fact
that only two of the five encodings are used in practice
('windows' and 'koi-8' are used; 'dos', 'iso', 'unicode' are not used).
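
  One simple way to choose between just these two encodings is sketched
below. It is not taken from any particular text editor; it is a common
heuristic, assumed here for illustration. In windows-1251 and koi8-r the
lowercase and uppercase Cyrillic letters occupy swapped byte ranges, so
ordinary text decoded with the wrong one of the two comes out mostly in
capitals. The names guess_encoding and score are hypothetical.

```python
CANDIDATES = ("cp1251", "koi8_r")  # Python codec names for 'windows' and 'koi-8'


def score(text: str) -> int:
    """Lowercase Cyrillic letters minus uppercase ones (basic range only)."""
    lower = sum(1 for ch in text if "\u0430" <= ch <= "\u044f")  # a..ya, lowercase
    upper = sum(1 for ch in text if "\u0410" <= ch <= "\u042f")  # A..YA, uppercase
    return lower - upper


def guess_encoding(data: bytes) -> str:
    """Pick the candidate whose decoding looks most like normal mixed-case text."""
    best, best_score = CANDIDATES[0], None
    for enc in CANDIDATES:
        s = score(data.decode(enc, errors="replace"))
        if best_score is None or s > best_score:
            best, best_score = enc, s
    return best
```

The heuristic fails on text written entirely in capitals and says nothing
useful about pure ASCII input, where it simply returns the first candidate.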

Received on Wednesday, 18 July 2007 12:46:17 UTC