
Re: Dangers of non-UTF-8 Re: Details on internal encoding declarations

From: Alexey Proskuryakov <ap@webkit.org>
Date: Fri, 23 May 2008 15:22:35 +0400
Cc: Ian Hickson <ian@hixie.ch>, HTML WG <public-html@w3.org>
Message-Id: <47F8EDAA-5CFA-41AE-8517-868248E65EC8@webkit.org>
To: Henri Sivonen <hsivonen@iki.fi>


On May 23, 2008, at 3:02 PM, Henri Sivonen wrote:

> I am aware of this. The server cannot know if the user typed a  
> character or a string that looks like an NCR, so I think that is  
> dataloss in the strict sense.


That's true, but this data loss happens with UTF-8 documents, too -  
entering "т" and "&#1090;" in the Google search field results in identical  
requests, despite the Google start page being UTF-8.
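
To make the ambiguity concrete, here is a minimal sketch of the substitution being discussed. It assumes that a browser replaces characters the form's encoding cannot represent with numeric character references, and it models that with Python's "xmlcharrefreplace" error handler. The function name and the choice of ISO-8859-1 as the page encoding are illustrative assumptions, not taken from any browser's code.

```python
def encode_form_value(text: str, page_encoding: str) -> bytes:
    """Hypothetical model of form-value encoding.

    Characters that page_encoding cannot represent are replaced with
    decimal numeric character references (&#NNNN;).
    """
    return text.encode(page_encoding, errors="xmlcharrefreplace")


# The user types the Cyrillic letter U+0442, which ISO-8859-1 cannot encode.
typed_character = encode_form_value("\u0442", "iso-8859-1")

# The user literally types the seven ASCII characters "&#1090;".
typed_ncr_text = encode_form_value("&#1090;", "iso-8859-1")

# Both inputs produce the same bytes, so the server cannot tell them apart.
print(typed_character, typed_ncr_text)
```

Under these assumptions the two inputs are indistinguishable once they reach the server, which is the data loss in the strict sense mentioned above.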

As such, I'm not sure if it's a problem worth highlighting. While  
UTF-8 is a nice general-purpose solution, it has its downsides, and  
switching a Russian page from windows-1251 to UTF-8 often makes  
roughly as much sense as switching an English one to UTF-16.
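
A rough illustration of the size comparison, as a sketch only: the sample strings are arbitrary, and real pages contain a lot of ASCII markup, so the actual overhead depends on the content. Python's "cp1251" codec stands in for windows-1251, and "utf-16-le" is used so that no byte order mark is counted.

```python
russian = "Привет, мир"   # 9 Cyrillic letters plus a comma and a space
english = "Hello, world"  # 12 ASCII characters

# Russian text: one byte per character in windows-1251,
# two bytes per Cyrillic letter in UTF-8.
russian_legacy_size = len(russian.encode("cp1251"))
russian_utf8_size = len(russian.encode("utf-8"))

# English text: one byte per character in ASCII,
# two bytes per character in UTF-16.
english_ascii_size = len(english.encode("ascii"))
english_utf16_size = len(english.encode("utf-16-le"))

print(russian_legacy_size, russian_utf8_size)   # 11 20
print(english_ascii_size, english_utf16_size)   # 12 24
```

For these particular strings the byte count nearly doubles in both cases.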

- WBR, Alexey Proskuryakov
Received on Friday, 23 May 2008 11:23:22 GMT
