Re: guessing character encoding (was HTML WG) from Robert Burns on 2007-07-13 (public-html@w3.org from July 2007)

From: Robert Burns <rob@robburns.com>
Date: Fri, 13 Jul 2007 15:22:56 -0500
To: Andrew Fedoniouk <news@terrainformatica.com>
Cc: <public-html@w3.org>, "Sander Tekelenburg" <st@isoc.nl>
Message-Id: <5A96FE01-C238-48EC-8B90-6FBB03CB543C@robburns.com>

On Jul 13, 2007, at 2:00 PM, Andrew Fedoniouk wrote:

>
>
> ----- Original Message ----- From: "Sander Tekelenburg" <st@isoc.nl>
>
>> At 08:19 +0300 UTC, on 2007-07-13, Dmitry Turin wrote:
>>
>>> Good day, Robert.
>>>
>>> RB> I was wondering what character encoding you use to serve up
>>> RB> this page:
>>> RB> <http://html60.chat.ru/site/html60/ru/index_ru.htm>
>>> RB> We're trying to conduct some tests on current UAs and this
>>> RB> page might be helpful. Do you know what charset it uses?
>>>
>>> All pages in the Russian language are encoded in WIN-1251.
>>> These documents are displayed correctly in both IE and Opera.
>>
>> Only because they happen to guess what you intend. They're not
>> presented as you intend in iCab3.0.3, Firefox2.0.0.4, Safari2.0.4
>> (because neither the server nor the document itself say what
>> character repertoire the document is in).
>
> Sander, that is just a bug.

I couldn't tell what you were referring to as "a bug" in the above  
text. Could you elaborate?

> HTML documents in Russian must indicate encoding.
> This particular page will work in IE and only on Russian version
> of Windows OS as in case of unknown encoding IE uses current
> system encoding settings (So called "current ANSI code page").

This is just one anecdotal example of why I think HTML5 should
provide much more author guidance (and probably UA guidance) on
character encodings. It's something many authors do not understand. I
also think it's the kind of detail that most authors probably
shouldn't need to understand if it were handled properly in existing
tools and existing standards. Since so much authoring goes on by
simply copying code, authors end up copying meta tags that express
completely incorrect encodings. Servers rarely include a charset
header, and that might be a good thing, because those would often be
wrong too.
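
To make the failure concrete, here is a minimal sketch in Python of
what happens when nothing declares the encoding and the UA falls back
on a guess. The helper name decode_page and its parameters are my own
invention for illustration; this is not any browser's actual
algorithm, only the general shape of "use the declared charset,
otherwise use a default".

```python
from typing import Optional

# Bytes of a Russian word as a Windows-1251 editor would save it.
RAW = "Привет".encode("cp1251")


def decode_page(data: bytes, declared: Optional[str], fallback: str) -> str:
    """Decode with the declared charset if there is one, else the fallback.

    `fallback` stands in for whatever the UA guesses, for example the
    system code page. Hypothetical helper, for illustration only.
    """
    return data.decode(declared or fallback)


# No declaration, Russian system default: the guess happens to be right.
right = decode_page(RAW, None, "cp1251")

# No declaration, Western European default: same bytes, wrong text.
wrong = decode_page(RAW, None, "cp1252")

# An explicit declaration removes the dependence on the guess.
declared = decode_page(RAW, "cp1251", "cp1252")
```

Here `right` and `declared` are the original word, while `wrong` is a
run of accented Latin letters, because the same six bytes map to
different characters in each code page.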

Given that it's not really handled well, I think we should do
something. I think BOMs are the best way to go, but obviously they
don't work with everything (and not all tools support them either).
Even better would be a byte sequence registry or something like that,
but that's way outside the scope of our WG.
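
As a rough illustration of the BOM approach, here is a minimal sketch
in Python that checks the first bytes of a document against the
Unicode byte order marks. The function name sniff_bom and the table
_BOMS are hypothetical names of mine; the BOM constants come from
Python's standard codecs module. It also shows the limitation: a
legacy encoding such as Windows-1251 has no BOM, so the check simply
reports nothing.

```python
import codecs
from typing import Optional

# Longer signatures first: the UTF-32 LE BOM begins with the same two
# bytes as the UTF-16 LE BOM, so it has to be tested before it.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

For example, sniff_bom(codecs.BOM_UTF8 + b"<html>") gives "utf-8",
while the bytes of a Windows-1251 page give None, and the UA is back
to guessing.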

Anyway, it's worth further testing and it's worth considering ways
HTML5 might address the problem. Perhaps all we can do is push
authors to use the Unicode encodings more (and that means authoring
tools need to have proper support too). We don't really help with
encoding-related security issues by ignoring the problem.

Take care,
Rob

Received on Friday, 13 July 2007 20:23:04 UTC