Re: Dangers of non-UTF-8 Re: Details on internal encoding declarations from Alexey Proskuryakov on 2008-05-23 (public-html@w3.org from May 2008)

From: Alexey Proskuryakov <ap@webkit.org>
Date: Fri, 23 May 2008 15:22:35 +0400
To: Henri Sivonen <hsivonen@iki.fi>
Cc: Ian Hickson <ian@hixie.ch>, HTML WG <public-html@w3.org>
Message-Id: <47F8EDAA-5CFA-41AE-8517-868248E65EC8@webkit.org>

On May 23, 2008, at 3:02 PM, Henri Sivonen wrote:

> I am aware of this. The server cannot know if the user typed a  
> character or a string that looks like an NCR, so I think that is  
> dataloss in the strict sense.

That's true, but this data loss happens with UTF-8 documents, too:
entering "т" and "&#1090;" in the Google search field results in identical
requests, despite the Google start page being UTF-8.
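
To make the ambiguity from the quoted text concrete, here is a minimal
Python sketch. It assumes the submitting client replaces any character the
form encoding cannot represent with a numeric character reference; the
helper name encode_form_value and the choice of ISO-8859-1 are illustrative
assumptions, not anything taken from a particular browser.

```python
def encode_form_value(text: str, encoding: str) -> bytes:
    """Encode a form value, substituting NCRs for unrepresentable characters.

    Hypothetical helper: it relies on Python's built-in "xmlcharrefreplace"
    error handler to stand in for the substitution a client might perform.
    """
    return text.encode(encoding, errors="xmlcharrefreplace")


# The user typed the Cyrillic letter itself (U+0442, decimal 1090).
typed_char = encode_form_value("\u0442", "iso-8859-1")

# The user literally typed the seven characters "&#1090;".
typed_ncr = encode_form_value("&#1090;", "iso-8859-1")

# Both values arrive as the same bytes, so the server cannot tell them apart.
print(typed_char, typed_ncr, typed_char == typed_ncr)
```

Under these assumptions the two inputs are byte-for-byte identical once
encoded, which is the loss of information being discussed.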

As such, I'm not sure if it's a problem worth highlighting. While
UTF-8 is a nice general-purpose solution, it has its downsides, and
switching a Russian page from windows-1251 to UTF-8 often makes
roughly as much sense as switching an English one to UTF-16.
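
A rough illustration of the size comparison behind that analogy, as a Python
sketch. The sample strings and the helper name encoded_size are made up for
the example, and "utf-16-le" is used so that no byte order mark is counted.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


russian = "Привет"  # six Cyrillic letters
english = "Hello"   # five ASCII letters

# Cyrillic letters take one byte each in windows-1251, two in UTF-8.
russian_cp1251 = encoded_size(russian, "cp1251")
russian_utf8 = encoded_size(russian, "utf-8")

# ASCII letters take one byte each in UTF-8, two in UTF-16.
english_utf8 = encoded_size(english, "utf-8")
english_utf16 = encoded_size(english, "utf-16-le")

print(russian_cp1251, russian_utf8)
print(english_utf8, english_utf16)
```

This only counts the text itself; markup in a real page is mostly ASCII, so
the overall growth of a Russian page in UTF-8 would be smaller than a plain
doubling.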

- WBR, Alexey Proskuryakov

Received on Friday, 23 May 2008 11:23:22 UTC