Re: The problem with the encoding koi8-r

2013-12-06 17:00, Michael[tm] Smith wrote:

> "Jukka K. Korpela" <jkorpela@cs.tut.fi>, 2013-12-03 00:42 +0200:
>
>> 2013-12-01 14:10, NIKA TOUR Co. Ltd. wrote:
>>> Here is a link - /http://nika-tour.org/Excursions/ysupov_palace_ru.html
>>
>> It is a windows-1251 encoded page, properly declared as windows-1251 when
>> viewed in a browser.
>>
>> But it seems that the server has been (mis)configured to declare koi8-r when
>> requested by the validator. This is something that you need to take to your
>> server admin.
>
> I get "Content-Type: text/html; charset=koi8-r" for it in Firefox and
> Chromium. If you're seeing windows-1251 in the header, I wonder whether it
> might be trying to send something different based on user locale setting or
> IP address. Or something.

Indeed; the HTTP headers also say
Vary: accept-charset, user-agent
Using http://web-sniffer.net with different settings for "User agent", I 
get both windows-1251 and koi8-r results.
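
For anyone who wants to repeat this check without a web form, here is a 
minimal sketch in Python. The helper names (declared_charset, probe) are 
my own, and the User-Agent strings in the usage note are arbitrary 
examples. I have not run it against the server in question, so I am not 
claiming any particular result.

```python
import urllib.request
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def probe(url, user_agent):
    """Fetch url with the given User-Agent (hypothetical helper).

    Returns (declared charset, value of the Vary header).
    """
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        content_type = resp.headers.get("Content-Type", "text/html")
        return declared_charset(content_type), resp.headers.get("Vary")
```

Calling probe(url, "Mozilla/5.0") and probe(url, "W3C_Validator/1.3") for 
the same url and comparing the two results would show whether the 
declared charset depends on the User-Agent string.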

I suppose the problem has been mostly fixed now; the original poster 
sent me personal mail, saying "I've already solved that problem." I 
guess the problem was that the server did not consistently send the data 
in the declared encoding.

It isn't quite consistent even now, since when the response has koi8-r 
declared and used in content, the content still has <meta 
charset="windows-1251">. The reason why this mostly works is that the 
<meta ...> tag is ignored when the encoding is declared in an HTTP 
header. However, if a user saves a page locally and later opens it from 
disk, it will be displayed wrongly, because there won't be any HTTP 
headers and the <meta ...> tag then takes effect.
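
The effect can be sketched in Python. This is a simplified model, not 
what any browser actually implements: the function effective_charset is 
my own invention, it ignores byte order marks and encoding sniffing, and 
the sample text is just an illustration.

```python
def effective_charset(http_charset, meta_charset):
    """Simplified precedence: the HTTP header wins over <meta> if present."""
    return http_charset or meta_charset


text = "Юсуповский дворец"
raw = text.encode("koi8-r")  # bytes as sent by the server in koi8-r

# Viewed over HTTP: the header says koi8-r, so <meta> is ignored.
served = raw.decode(effective_charset("koi8-r", "windows-1251"))

# Opened from disk: no header, so the wrong <meta> declaration is used.
saved = raw.decode(effective_charset(None, "windows-1251"))

print(served)  # readable Russian text
print(saved)   # same number of characters, but the wrong letters
```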

Yucca

P.S. The validator's warning "Legacy encoding koi8-r used. Documents 
should use UTF-8" might generally be regarded as excessive UTF-8 
evangelism, but here it might be useful. It's difficult to imagine a 
reason to alternate between windows-1251 and koi8-r (is there a client 
that knows one of them but not the other?), apparently based on the 
User-Agent string rather than Accept-Charset. Using just one of them in 
all responses should work fine. The benefits of UTF-8 are not very 
tangible here, and there is the obvious problem of data size increase 
(two bytes per Cyrillic letter vs. one byte).
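
The size difference is easy to check. A small sketch in Python, with a 
helper name (encoded_sizes) and sample string of my own choosing; note 
that the space, like all ASCII markup in a real page, takes one byte in 
all three encodings, so a whole page grows by less than a factor of two.

```python
def encoded_sizes(text):
    """Return the byte length of text in each of the three encodings."""
    return {enc: len(text.encode(enc))
            for enc in ("windows-1251", "koi8-r", "utf-8")}


# 16 Cyrillic letters and one space
sizes = encoded_sizes("Юсуповский дворец")
print(sizes)  # {'windows-1251': 17, 'koi8-r': 17, 'utf-8': 33}
```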

Received on Friday, 6 December 2013 15:50:43 UTC