Re: How browsers display URIs with %-encoding (Opera/Firefox FAIL)

Hello Chris, others,

On 2011/07/22 7:44, Chris Weber wrote:
> On 7/21/2011 1:15 PM, Leif Halvard Silli wrote:
>> The page in question uses Windows-1252/ISO-8859-1. Question: Would it
>> have made a difference if instead of using ISO-8859-1 based percent
>> encoding, Martin had typed the letter 'ü' directly?
>
> Yes it would, see "test 2" at <http://lookout.net/test/iri/mixenc.php>.
> Using the same browser builds Martin did, but a slightly different test
> setup.

Great test!

> Test 2 from my set maps to Martin's Test 1 in that the "Dürst" is a part
> of the path component and encoded in iso-8859-1 - he percent-encoded %FC
> and I used the raw byte 0xFC. The test case is represented below, where
> <0xHH> represents a raw byte.
>
> http://www.example.com/D<0xFC>rst/
>
> The results of display are below.
>
> Opera (11.50, Win7):
> http://www.example.com/DÃ¼rst/
>
> Note here that the raw byte <0xFC> was visibly converted to UTF-8
> <0xC3 0xBC> and then (presumably) interpreted as iso-8859-1 for display.

Double-encoding! What a mess! I hope this can be fixed soon. I have put 
some people from Opera into the cc. Anne should be on the list anyway, 
but I hope he notices it more quickly.
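
The suspected double-encoding can be reproduced outside a browser. The 
following is only a Python sketch of the effect described above, not a 
claim about Opera's internals; the function name double_encode_display 
is hypothetical:

```python
def double_encode_display(raw: bytes, page_charset: str = "iso-8859-1") -> str:
    """Simulate the suspected bug: transcode to UTF-8, then misread as ISO-8859-1."""
    text = raw.decode(page_charset)         # b"D\xfcrst" -> "Dürst"
    utf8_bytes = text.encode("utf-8")       # "ü" becomes the two bytes C3 BC
    return utf8_bytes.decode("iso-8859-1")  # each UTF-8 byte shown as one character


shown = double_encode_display(b"D\xfcrst")
print(shown)
```

The single byte <0xFC> ends up as two visible characters, which is the 
typical symptom of UTF-8 bytes being read as ISO-8859-1.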

> Firefox (5.0, Win7):
> http://www.example.com/Dürst/
>
> IE (8.0.7601.17514, Win7):
> http://www.example.com/Dürst/
>
> Chrome (12.0.742.122, Win7):
> http://www.example.com/Dürst/
>
> Safari (5.0.4 (7533.20.27)):
> http://www.example.com/Dürst/
>
> In all of the above cases, the <0xFC> was transcoded to UTF-8 and
> percent-encoded for the generated HTTP request.
>
> http://www.example.com/D%C3%BCrst/

All these just work as expected, according to RFC 3987. If Opera got 
fixed, we would be perfect :-!.
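
For reference, the conversion the other browsers perform for the HTTP 
request can be sketched in Python as follows. This is a minimal 
illustration that assumes the page charset is known to be ISO-8859-1; 
path_for_request is a hypothetical helper, not part of any browser or 
of RFC 3987:

```python
from urllib.parse import quote


def path_for_request(raw_segment: bytes, page_charset: str = "iso-8859-1") -> str:
    """Decode a raw path segment using the page charset, then percent-encode as UTF-8."""
    text = raw_segment.decode(page_charset)  # <0xFC> -> U+00FC
    return quote(text, safe="")              # quote() encodes str as UTF-8 by default


request_path = path_for_request(b"D\xfcrst")
print("http://www.example.com/" + request_path + "/")
```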


>> Because, if, in an ISO-8859-1 encoded page, href="D%FCrst" does not work
>> as well as href="Dürst", then I think HTML5 validators in fact should
>> warn against use of percent encoding that isn't UTF-8 based.
>
> That would probably be ideal but would not provide for raw data that
> might need to be passed in the IRI, especially the query component.

The query component is a separate issue, and I think there should be 
separate tests (including browser display) for it.

The other issue is that there might be a server that e.g. has resource 
names encoded in ISO-8859-1, or where a resource name otherwise contains 
a byte such as <0xFC>, and for such a server, changing from 
href="D%FCrst" to href="Dürst" would be a bad idea.
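
To illustrate why, here is a small Python sketch. It assumes a 
hypothetical server that compares the percent-decoded request bytes 
directly with a resource name stored in ISO-8859-1; the variable names 
are made up for the example:

```python
from urllib.parse import unquote_to_bytes

# Resource name as stored on the hypothetical server (ISO-8859-1 bytes).
stored_name = "Dürst".encode("iso-8859-1")        # b"D\xfcrst"

# href="D%FCrst" is sent as is; the server sees the byte <0xFC>.
from_percent_fc = unquote_to_bytes("D%FCrst")

# href="Dürst" is sent as D%C3%BCrst; the server sees <0xC3 0xBC>.
from_utf8_form = unquote_to_bytes("D%C3%BCrst")

matches_fc = from_percent_fc == stored_name
matches_utf8 = from_utf8_form == stored_name
print(matches_fc, matches_utf8)
```

Only the %FC form yields the bytes such a server has stored.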


Regards,    Martin.

Received on Monday, 25 July 2011 06:52:48 UTC