Re: expected results for URI encoding tests? from Philip Taylor on 2008-06-27 (public-html@w3.org from June 2008)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Fri, 27 Jun 2008 17:06:07 +0100
To: Julian Reschke <julian.reschke@gmx.de>
CC: "public-html@w3.org WG" <public-html@w3.org>
Message-ID: <48650FEF.3080907@cam.ac.uk>

Julian Reschke wrote:
> We really should try to define a way that yields UTF-8 based encoding 
> independently of the document's encoding.

We also really shouldn't break existing sites that work perfectly well 
in current web browsers. E.g. http://www.yildizburo.com.tr/ says

   <a href="urunlist.php?tur=FAX MAKİNALARI&kategori=Laser Fax" class="textmenu">Laser Fax</a>

encoded in Windows-1254. Clicking that link, Firefox/Opera/Safari go to

   urunlist.php?tur=FAX%20MAK%DDNALARI&kategori=Laser%20Fax

while IE goes to

   urunlist.php?tur=FAX%20MAKİNALARI&kategori=Laser%20Fax

where the İ is a raw 0xDD byte. Both variations load the correct page.

Using UTF-8, i.e.

   urunlist.php?tur=FAX%20MAK%C4%B0NALARI&kategori=Laser%20Fax

returns a page with no data, which is bad.
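
For illustration, the two percent-encoded forms above can be reproduced with a short Python sketch. This is not how any browser implements it; it only assumes that Python's "cp1254" codec matches Windows-1254 for this string, and the variable names are made up for the example.

```python
from urllib.parse import quote

# The query value from the link, as decoded text.
value = "FAX MAKİNALARI"

# Percent-encode using the document's encoding (Windows-1254),
# which is what Firefox/Opera/Safari send for this link.
legacy = quote(value, encoding="cp1254")

# Percent-encode as UTF-8, regardless of the document's encoding.
utf8 = quote(value, encoding="utf-8")

print(legacy)  # FAX%20MAK%DDNALARI
print(utf8)    # FAX%20MAK%C4%B0NALARI
```

The server-side script presumably decodes the query as Windows-1254, so only the first form matches the value it expects.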

Looking at random pages listed in dmoz.org (which seems quite biased 
towards English sites), something like 0.5% have non-ASCII characters in 
<a href> query strings, and (judging by eye) maybe half of those are not 
UTF-8, so it's a widespread issue and there's no chance of fixing all 
those sites.

That imposes some constraints on any proposed solution, and means 
"queries are always converted to percent-encoded UTF-8" is inadequate. 
It seems there's still some flexibility (e.g. IE converting unmappable 
characters to "?", vs Firefox converting strings that contain unmappable 
characters to UTF-8), though I have no idea how nice a solution is 
possible within the limits.
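
As a rough sketch of the two fallback behaviours mentioned above (an approximation for discussion, not a description of what IE or Firefox actually do internally), assuming a hypothetical helper `encode_query_value` and made-up fallback names:

```python
from urllib.parse import quote

def encode_query_value(value: str, doc_encoding: str, fallback: str) -> str:
    """Percent-encode a query value using the document's encoding.

    `fallback` is a hypothetical switch for this sketch:
      "question-mark": replace each unmappable character with "?"
      "utf-8":         re-encode the whole string as UTF-8
    """
    try:
        raw = value.encode(doc_encoding)
    except UnicodeEncodeError:
        if fallback == "question-mark":
            raw = value.encode(doc_encoding, errors="replace")
        elif fallback == "utf-8":
            raw = value.encode("utf-8")
        else:
            raise ValueError("unknown fallback: " + fallback)
    # Keep "?" literal so the replacement character stays visible.
    return quote(raw, safe="?")

# Mappable in Windows-1254: the fallback is never used.
print(encode_query_value("FAX MAKİNALARI", "cp1254", "utf-8"))
# Not mappable in Windows-1254: the two fallbacks diverge.
print(encode_query_value("日本 Fax", "cp1254", "question-mark"))
print(encode_query_value("日本 Fax", "cp1254", "utf-8"))
```

Either way, strings that are mappable in the document's encoding keep their legacy encoding, so existing links like the one above would continue to work.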

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Friday, 27 June 2008 16:06:49 UTC