How browsers display IRIs with mixed encodings

From: Chris Weber <chris@lookout.net>
Date: Thu, 21 Jul 2011 17:02:13 -0700
Message-ID: <4E28BE05.1050008@lookout.net>
I'm going on a tangent from Martin's intent in the previous email, but 
it seems to be in the same vein overall.  I was including some mixed-encoding 
tests - iso-8859-1 mixed with UTF-8 in a hyperlink on a transitional 
HTML page served with the "iso-8859-1" Content-Type.  The results are 
similar to Martin's test in that bytes representing UTF-8 are 
treated as such (most often), even in a page encoded as iso-8859-1.

From the test page at <http://lookout.net/test/iri/mixenc.php>, Test 3 
mixes the raw bytes that represent U+FF21 FULLWIDTH LATIN CAPITAL 
LETTER A in UTF-8 with the iso-8859-1 raw byte for the "ü" in 
"Dürst".  The following hyperlink represents the test case, where <0xNN> 
is a raw byte.

http://www.example.com/D<0xFC>rst/?<0xEF 0xBC 0xA1>
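
As a rough illustration (not part of the test page itself), the short 
Python sketch below shows why this link is ambiguous: the three query 
bytes are valid in both encodings but mean different things, while the 
path byte is only valid as iso-8859-1.  The variable names are made up 
for this example.

```python
import unicodedata

query_bytes = b"\xef\xbc\xa1"   # raw bytes in the query of Test 3
path_byte = b"\xfc"             # raw byte for the "ü" in the path

# The same three bytes read under the two encodings in play.
as_utf8 = query_bytes.decode("utf-8")
as_latin1 = query_bytes.decode("iso-8859-1")

print(unicodedata.name(as_utf8))                   # one character
print([unicodedata.name(ch) for ch in as_latin1])  # three characters
print(path_byte.decode("iso-8859-1"))              # the "ü" of "Dürst"

# 0xFC on its own is not a valid UTF-8 sequence.
try:
    path_byte.decode("utf-8")
    path_is_valid_utf8 = True
except UnicodeDecodeError:
    path_is_valid_utf8 = False
print(path_is_valid_utf8)
```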

The results of the display are as follows.

Opera (11.50, Win7):

Firefox (5.0, Win7):

IE (8.0.7601.17514, Win7):

Chrome (12.0.742.122, Win7):

Safari (5.0.4 (7533.20.27)):

With the exception of IE, all of the above generated the following HTTP 
request:

   GET /D%C3%BCrst/?%EF%BC%A1

IE of course does not escape the bytes in the query string.

   GET /D%C3%BCrst/?A
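
The request line sent by the non-IE browsers can be reproduced with a 
small Python sketch.  It rests on an assumption about what those 
browsers do, which the test does not confirm directly: the href is 
decoded using the page encoding, the path is then percent-encoded as 
UTF-8, and the query is percent-encoded in the page encoding.  The 
function name request_target is hypothetical.

```python
from urllib.parse import quote

# Raw bytes of the href from Test 3, as served in the iso-8859-1 page.
raw_href = b"http://www.example.com/D\xfcrst/?\xef\xbc\xa1"

def request_target(raw: bytes, page_encoding: str = "iso-8859-1") -> str:
    """Sketch of one plausible path/query escaping, under the assumptions above."""
    # Assumption: the parser first decodes the href using the page encoding.
    text = raw.decode(page_encoding)
    # Drop the scheme and host, keeping everything from the first "/".
    rest = text.split("://", 1)[1]
    path_and_query = rest[rest.index("/"):]
    path, _, query = path_and_query.partition("?")
    # Assumption: path characters are escaped as UTF-8 bytes.
    escaped_path = quote(path.encode("utf-8"), safe="/")
    # Assumption: query characters are escaped as page-encoding bytes.
    escaped_query = quote(query.encode(page_encoding), safe="=&")
    return f"{escaped_path}?{escaped_query}"

print("GET " + request_target(raw_href))  # GET /D%C3%BCrst/?%EF%BC%A1
```

Under this reading the query bytes are never interpreted as UTF-8 at 
all; they pass through the iso-8859-1 decode and encode unchanged, 
which would produce the same request line as the one observed.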

I tried to capture some of these test results in table form at:

A question for browser implementers - In some cases it's obvious (Opera 
and MSIE) and others not so much: Do you know if the status bar display 
is using the page encoding or has converted the URI to UTF-8 for 

Best regards,
Received on Friday, 22 July 2011 00:02:43 UTC