- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Thu, 23 Nov 2006 10:17:19 +0200 (EET)
- To: 'Unicode' <unicode@unicode.org>
- cc: www-international@w3.org
- Message-ID: <Pine.GSO.4.64.0611231008260.1204@mustatilhi.cs.tut.fi>
On Wed, 22 Nov 2006, Martin Duerst wrote:

>> Text encoded as UTF-8, then reinterpreted using an 8-bit encoding (often
>> Latin-1 or Windows-1252), and then re-encoded incorrectly as UTF-8 for
>> a second time.
>
> Yes. The W3C site has quite a lot of these, too, even if they are
> fortunately usually limited to single characters such as the copyright
> sign. Here's an example:
> http://www.w3.org/2001/Annotea/User/Papers.html

That page is a somewhat different case. There's more than the copyright
sign that is wrong there, namely the registered sign and two occurrences
of e with acute (in the name "José"), too. Moreover, the page says

<?xml version="1.0" encoding="iso-8859-1"?>

_and_

<meta http-equiv="content-type" content="application/xhtml+xml; charset=UTF-8" />

but what really matters is the HTTP header

Content-Type: text/html; charset=iso-8859-1

If you manually change the encoding used by a browser to UTF-8, the é's
become right and the two other non-ASCII characters become a little less
obscured by extra characters before them. There _is_ a "double UTF-8"
involved, too, but the primary problem is that the declared encoding is
not the one actually used on the page.

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
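
As a rough illustration of the two effects discussed above (UTF-8 bytes read under a wrongly declared 8-bit encoding, and "double UTF-8"), here is a minimal Python sketch. It is not taken from the page in question: the helper names misread_utf8 and undo_misread are hypothetical, and Latin-1 is assumed as the 8-bit encoding (Windows-1252 would behave similarly for these characters).

```python
def misread_utf8(text: str, wrong_encoding: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode the bytes as an 8-bit encoding.

    This is what a browser effectively does when UTF-8 data is served
    with a header such as "charset=iso-8859-1".
    """
    return text.encode("utf-8").decode(wrong_encoding)


def undo_misread(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Reverse one round of misreading (works only if no bytes were lost)."""
    return garbled.encode(wrong_encoding).decode("utf-8")


original = "José ©"

# One round: UTF-8 bytes shown as Latin-1.
once = misread_utf8(original)
print(once)  # JosÃ© Â©

# "Double UTF-8": the garbled string is saved as UTF-8 again, so each
# non-ASCII character of the original now occupies four bytes.
double_bytes = once.encode("utf-8")
print(len("©".encode("utf-8")), len(misread_utf8("©").encode("utf-8")))  # 2 4

# Reading the doubly encoded bytes as UTF-8 still leaves one layer of
# extra characters; undoing that layer recovers the original text.
assert double_bytes.decode("utf-8") == once
assert undo_misread(once) == original
```

Latin-1 is convenient for such a sketch because every byte value maps to a character, so the round trip never fails; with Windows-1252 a few byte values are undefined and decoding may raise an error.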
Received on Thursday, 23 November 2006 08:17:36 UTC