How browsers display IRIs with mixed encodings from Chris Weber on 2011-07-22 (public-iri@w3.org from July 2011)

From: Chris Weber <chris@lookout.net>
Date: Thu, 21 Jul 2011 17:02:13 -0700
To: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <4E28BE05.1050008@lookout.net>

I'm going on a tangent from Martin's intent in the previous email, but 
it seems in the same vein overall.  I was including some mixed-encoding 
tests - iso-8859-1 mixed with UTF-8 in a hyperlink on a transitional 
HTML page served with an "iso-8859-1" charset in the Content-Type.  The 
results are similar to Martin's test in that bytes representing UTF-8 
are (most often) treated as such, even in an iso-8859-1 page encoding.

From the test page at <http://lookout.net/test/iri/mixenc.php>, Test 3 
mixes the raw bytes which would represent U+FF21 FULLWIDTH LATIN CAPITAL 
LETTER A in UTF-8 with the iso-8859-1 raw byte for the "ü" in 
"Dürst".  The following hyperlink represents the test case, where <0xNN> 
is a raw byte.

http://www.example.com/D<0xFC>rst/?<0xEF 0xBC 0xA1>
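To make the byte sequence concrete, here is a minimal Python sketch of the same hyperlink and of how it reads under each of the two encodings. The variable names are mine and purely illustrative; this only shows what the bytes mean, not what any browser does internally.

```python
# The test hyperlink as raw bytes: 0xFC is "ü" in iso-8859-1,
# and 0xEF 0xBC 0xA1 is U+FF21 (fullwidth "A") in UTF-8.
raw = b"http://www.example.com/D\xfcrst/?\xef\xbc\xa1"

# Read every byte as iso-8859-1: the path is fine, the query becomes three
# Latin-1 characters (U+00EF, U+00BC, U+00A1).
as_latin1 = raw.decode("iso-8859-1")

# Read the bytes as UTF-8: the lone 0xFC is invalid and becomes U+FFFD,
# while the query decodes to U+FF21.
as_utf8 = raw.decode("utf-8", errors="replace")

print(ascii(as_latin1))
print(ascii(as_utf8))
```

Neither single decoding gives both "ü" and the fullwidth A, which is why the browsers have to pick a strategy per component.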

The results of the display are as follows.

Opera (11.50, Win7):
   http://www.example.com/DÃ¼rst/?%EF%BC%A1

Firefox (5.0, Win7):
   http://www.example.com/Dürst/?���

IE (8.0.7601.17514, Win7):
   http://www.example.com/Dürst/?ï¼��

Chrome (12.0.742.122, Win7):
   http://www.example.com/Dürst/?���

Safari (5.0.4 (7533.20.27)):
   http://www.example.com/Dürst/?���
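As a side note on the Opera line, the string "DÃ¼rst" is exactly what results when the UTF-8 bytes for "Dürst" are read back as iso-8859-1. A minimal Python sketch of that round trip follows; it is an illustration of the byte arithmetic only, not a claim about Opera's internals.

```python
# "Dürst" with U+00FC, encoded as UTF-8: the "ü" becomes 0xC3 0xBC.
utf8_bytes = "D\u00fcrst".encode("utf-8")

# Reading those bytes as iso-8859-1 turns 0xC3 0xBC into two characters,
# U+00C3 and U+00BC, which is the "DÃ¼rst" form shown for Opera.
mojibake = utf8_bytes.decode("iso-8859-1")

print(utf8_bytes)
print(ascii(mojibake))
```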

With the exception of IE, all of the above generated the following HTTP 
request:

   GET /D%C3%BCrst/?%EF%BC%A1

IE, of course, does not percent-encode the bytes in the query string:

   GET /D%C3%BCrst/?Ａ
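The two request lines above can be reproduced with a short Python sketch. It assumes the path byte is interpreted in the page encoding (iso-8859-1), converted to UTF-8, and then percent-encoded, while the query bytes are passed through without transcoding. That is a model that matches the observed output, not a description of any browser's actual code, and the variable names are hypothetical.

```python
from urllib.parse import quote

raw_path = b"/D\xfcrst/"       # 0xFC is "ü" in iso-8859-1
raw_query = b"\xef\xbc\xa1"    # UTF-8 bytes for U+FF21

# Path: decode with the page encoding, re-encode as UTF-8, percent-encode.
path = quote(raw_path.decode("iso-8859-1").encode("utf-8"), safe="/")

# Query (Opera, Firefox, Chrome, Safari): percent-encode the bytes as-is.
escaped_query = quote(raw_query, safe="")
request_line = f"GET {path}?{escaped_query}"

# Query (IE): the raw bytes go on the wire unescaped.
ie_request_bytes = b"GET " + path.encode("ascii") + b"?" + raw_query

print(request_line)
print(ie_request_bytes)
```

Viewed as UTF-8, the IE bytes show the fullwidth A in the query, which is how that request line appears above.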

I tried to capture some of these test results in table form at:
<https://spreadsheets0.google.com/spreadsheet/ccc?key=0AifoWoA0trUndEZSTlRRNnd5MzE3N3RYOVlIVFFMREE&hl=en_US#gid=5>

A question for browser implementers - in some cases it's obvious (Opera 
and MSIE) and in others not so much: do you know whether the status bar 
display is using the page encoding, or has the URI been converted to 
UTF-8 for display?

Best regards,
Chris

Received on Friday, 22 July 2011 00:02:43 UTC