- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Fri, 27 Jun 2008 17:06:07 +0100
- To: Julian Reschke <julian.reschke@gmx.de>
- CC: "public-html@w3.org WG" <public-html@w3.org>
Julian Reschke wrote:
> We really should try to define a way that yields UTF-8 based encoding
> independently of the document's encoding.

We also really shouldn't break existing sites that work perfectly well in current web browsers. E.g. http://www.yildizburo.com.tr/ says

  <a href="urunlist.php?tur=FAX MAKİNALARI&kategori=Laser Fax" class="textmenu">Laser Fax</a>

encoded in Windows-1254. Clicking that link, Firefox/Opera/Safari go to

  urunlist.php?tur=FAX%20MAK%DDNALARI&kategori=Laser%20Fax

while IE goes to

  urunlist.php?tur=FAX%20MAKİNALARI&kategori=Laser%20Fax

where the İ is a raw 0xDD byte. Both variations load the correct page. Using UTF-8, i.e.

  urunlist.php?tur=FAX%20MAK%C4%B0NALARI&kategori=Laser%20Fax

returns a page with no data, which is bad.

Looking at random pages listed in dmoz.org (which seems quite biased towards English sites), something like 0.5% have non-ASCII characters in <a href> query strings, and (judging by eye) maybe half of those are not UTF-8, so it's a widespread issue and there's no chance of fixing all those sites. That imposes some constraints on any proposed solution, and means "queries are always converted to percent-encoded UTF-8" is inadequate.

It seems there's still some flexibility (e.g. IE converting unmappable characters to "?", vs FF converting unmappable strings to UTF-8), though I have no idea how nice a solution is possible within the limits; a byte-level sketch of these behaviours follows below.

--
Philip Taylor
pjt47@cam.ac.uk
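To make the byte-level difference concrete, here is a minimal sketch using Python's standard codecs. The query string comes from the yildizburo.com.tr example above; the script is only an illustration of the behaviours described, not anything the browsers actually run, and the unmappable-character strings are made-up inputs.

    # Illustrative only: the two query-encoding behaviours discussed above.
    from urllib.parse import quote

    query = "FAX MAKİNALARI"  # from the Windows-1254 page in the example

    # Firefox/Opera/Safari: percent-encode the bytes of the document's
    # encoding, so İ (U+0130, byte 0xDD in Windows-1254) becomes %DD.
    print(quote(query.encode("windows-1254")))  # FAX%20MAK%DDNALARI

    # "Always UTF-8": İ becomes the two UTF-8 bytes %C4%B0, which the
    # server in the example does not understand.
    print(quote(query.encode("utf-8")))  # FAX%20MAK%C4%B0NALARI

    # For characters with no mapping in the document's encoding at all,
    # behaviours diverge further (both lines below are approximations of
    # the browser behaviours mentioned above, not exact reproductions):
    unmappable = "漢字"  # not representable in Windows-1254

    # IE-style: replace each unmappable character with "?"
    print(quote(unmappable.encode("windows-1254", errors="replace")))  # %3F%3F

    # FF-style: fall back to UTF-8 for the whole string
    print(quote(unmappable.encode("utf-8")))  # %E6%BC%A2%E5%AD%97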
Received on Friday, 27 June 2008 16:06:49 UTC