Re: How browsers display IRI's with mixed encodings

Jungshik SHIN (신정식), Fri, 22 Jul 2011 00:46:18 -0700:

>>>> The point of this was to test the display as Martin had, but using
>>>> unescaped bytes.  From the results of Test 3 it looks like Firefox,
>>>> Chrome, and Safari all check for "UTF8ness" in the query component
>>>> when displaying the IRI in spite of the page encoding, hence you can
>>>> visually see the U+FF21 FULLWIDTH LATIN CAPITAL LETTER A.
>>> 
>>> Which I consider to be a serious bug in handling an IRI. In theory, 
>>> the characters in the HTML document are converted to a sequence of 
>>> Unicode characters when the page is parsed. I should have three 
>>> Unicode code points in the query portion of the IRI in the above 
>>> href. They happen to be encoded using three ISO-8859-1 bytes. But 
>>> they just as well could be encoded asA
>> 
>> Martin's test showed that even the path component containing 
>> "%C3%BC" would be percent-decoded and displayed as UTF-8... even 
>> when the page encoding was declared as iso-8859-1.

Opera and IE8 are the exceptions when it comes to display. However, 
even Opera and IE8 execute it as UTF-8.

> That's the correct and expected behavior. The path part is always 
> assumed to be in UTF-8 regardless of the referrer page encoding. The 
> query part is a different story. 
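
To make the path case concrete, here is a minimal Python sketch. The 
path string "D%C3%BCrst" is my assumption (the thread only says the 
path contained "%C3%BC"), and Python's urllib only stands in for the 
decoding step; it says nothing about how any browser is implemented.

```python
from urllib.parse import unquote

# Hypothetical path: the thread only states that it contained "%C3%BC".
path = "D%C3%BCrst"

# Reading the percent-encoded bytes as UTF-8, regardless of page encoding.
as_utf8 = unquote(path, encoding="utf-8")

# Reading the same two bytes as ISO-8859-1 yields two characters instead.
as_latin1 = unquote(path, encoding="iso-8859-1")

print(as_utf8)    # Dürst
print(as_latin1)  # DÃ¼rst
```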

I thought that the issue under debate was how %FC should be displayed 
and handled. 

However, while it would give the best user experience to *display* and 
*handle* the %FC in Martin's test as %C3%BC, it might also be 
considered a feature that href="D%FCrst" in Martin's test does not 
work. E.g. if Martin's page were converted from a legacy encoding to 
UTF-8, then href="D%FCrst" would stop working even if it *had* worked 
in the legacy-encoded page.
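
A rough sketch of why %FC is fragile, again in Python and only as an 
illustration: the single byte 0xFC is "ü" in ISO-8859-1 but is not a 
valid UTF-8 sequence, so its meaning depends on which encoding the 
reader assumes. The variable names are mine.

```python
from urllib.parse import quote, unquote_to_bytes

# The href from Martin's test, reduced to its raw bytes.
raw = unquote_to_bytes("D%FCrst")  # b'D\xfcrst'

# Read as ISO-8859-1, the byte 0xFC is U+00FC (ü).
legacy_reading = raw.decode("iso-8859-1")

# Read as UTF-8, the lone byte 0xFC is invalid and cannot be decoded.
try:
    utf8_reading = raw.decode("utf-8")
except UnicodeDecodeError:
    utf8_reading = None

# The same name percent-encoded on top of each character encoding.
utf8_href = quote(legacy_reading, encoding="utf-8")
latin1_href = quote(legacy_reading, encoding="iso-8859-1")

print(legacy_reading)  # Dürst
print(utf8_reading)    # None
print(utf8_href)       # D%C3%BCrst
print(latin1_href)     # D%FCrst
```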

Chris, maybe you could show an example of when it would be a problem 
if validators warned against percent-encodings that are not UTF-8-based?
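
For the sake of discussion, the kind of check I have in mind could be 
sketched like this. This is not the logic of any existing validator; 
the function name non_utf8_escapes is hypothetical, and the rule (flag 
every run of percent-escapes whose bytes do not decode as UTF-8) is 
just one possible reading of "not UTF-8-based".

```python
import re
from urllib.parse import unquote_to_bytes

# One or more consecutive %XX escapes, treated as a single byte run.
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")

def non_utf8_escapes(reference):
    """Return the percent-escape runs whose bytes are not valid UTF-8.

    Hypothetical helper for illustration only.
    """
    flagged = []
    for match in _ESCAPE_RUN.finditer(reference):
        raw = unquote_to_bytes(match.group(0))
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            flagged.append(match.group(0))
    return flagged

print(non_utf8_escapes("D%FCrst"))     # ['%FC']
print(non_utf8_escapes("D%C3%BCrst"))  # []
```
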
-- 
Leif Halvard Silli

Received on Friday, 22 July 2011 11:43:45 UTC