Re: How browsers display IRI's with mixed encodings from Martin J. Dürst on 2011-07-25 (public-iri@w3.org from July 2011)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Mon, 25 Jul 2011 20:21:27 +0900
To: Chris Weber <chris@lookout.net>
CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <4E2D51B7.1070508@it.aoyama.ac.jp>
Hello Chris, others,

On 2011/07/22 9:02, Chris Weber wrote:
> I'm going on a tangent from Martin's intent in the previous email, but
> it seems in the same vein overall. I was including some mixed encoding
> tests - iso-8859-1 mixed with UTF-8

UTF-8 -> what may also look like UTF-8

> in a hyperlink on a transitional HTML page

How much does transitional vs. strict or whatever affect IRIs? I never 
thought of that, but obviously, that doesn't mean it couldn't make a 
difference.

> served with the "iso-8859-1" Content-Type. The results are
> similar to Martin's test in the way bytes representing UTF-8 will be
> treated as such (most often) even in an iso-8859-1 page encoding.

In my test, these were %-escaped. %-escaping is a pure URI/IRI thing, 
not related (on paper, at least) to page encoding. What you are using 
are raw bytes, which obviously should be interpreted as characters based 
on the page encoding.
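
A minimal Python sketch of that distinction, assuming the standard urllib.parse module (the variable names are illustrative only): a raw byte only becomes a character once the page encoding is applied, while a %-escape names its bytes independently of the page.

```python
from urllib.parse import unquote_to_bytes

# Raw byte 0xFC as it sits in the page source.
raw = b"D\xfcrst"

# Interpreted with the declared page encoding, it is the intended "ü".
as_latin1 = raw.decode("iso-8859-1")

# Interpreted as UTF-8, 0xFC is not a valid sequence and is replaced by U+FFFD.
as_utf8 = raw.decode("utf-8", errors="replace")

# A %-escaped form spells out its bytes in plain ASCII, so the page
# encoding plays no role in recovering them.
escaped = "D%C3%BCrst"
escaped_bytes = unquote_to_bytes(escaped)
escaped_text = escaped_bytes.decode("utf-8")

print(repr(as_latin1), repr(as_utf8), repr(escaped_text))
```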

>  From the test page at <http://lookout.net/test/iri/mixenc.php> Test 3
> mixes the raw bytes which would represent U+FF21 FULLWIDTH LATIN CAPITAL
> LETTER A in UTF-8, along with iso-8859-1 raw bytes for the "ü" in
> "Dürst". The following hyperlink represents the test case where <0xNN>
> is a raw byte.

It also mixes path parts and query parts, which for historical reasons 
have to be treated somewhat differently.


> http://www.example.com/D<0xFC>rst/?<0xEF 0xBC 0xA1>
>
> The results of the display are as follows.
>
> Opera (11.50, Win7):
> http://www.example.com/DÃ¼rst/?%EF%BC%A1

The path part is double-encoded, hopelessly messed up. The query part 
may be okay because as far as I understand, current browsers interpret 
query parts according to the page encoding.
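
One plausible way such a display could arise, sketched in Python (this is a guess at the mechanism, not a statement about Opera's internals): the path is converted to UTF-8, and the resulting bytes are then read again as iso-8859-1. The query bytes, read once with the page encoding and %-escaped in that same encoding, come out as shown.

```python
from urllib.parse import quote

PAGE_ENCODING = "iso-8859-1"  # encoding the test page was served with

# Path: the page byte 0xFC is correctly read as "ü" ...
path = b"D\xfcrst".decode(PAGE_ENCODING)
# ... converted to UTF-8 (bytes C3 BC) ...
utf8_bytes = path.encode("utf-8")
# ... and then those bytes are wrongly read as iso-8859-1 a second time.
double_encoded = utf8_bytes.decode(PAGE_ENCODING)

# Query: the three raw bytes are read as three iso-8859-1 characters
# and escaped with the page encoding, which reproduces the same bytes.
query_chars = b"\xef\xbc\xa1".decode(PAGE_ENCODING)
query_escaped = quote(query_chars, safe="", encoding=PAGE_ENCODING)

print(double_encoded, query_escaped)
```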

> Firefox (5.0, Win7):
> http://www.example.com/Dürst/?���

The path part is okay. The query part is clearly broken.

> IE (8.0.7601.17514, Win7):
> http://www.example.com/Dürst/?ï¼��

The path part is okay. The query part may be okay (it displays the 
characters in the document, not some weird reinterpretations of the 
bytes they were represented with).

> Chrome (12.0.742.122, Win7):
> http://www.example.com/Dürst/?���
>
> Safari (5.0.4 (7533.20.27)):
> http://www.example.com/Dürst/?���

Same as above for Firefox.

> With the exception of IE, all of the above generated the following HTTP
> request :
>
> GET /D%C3%BCrst/?%EF%BC%A1

That's about right.

> IE of course does not escape the bytes in the query string.
>
> GET /D%C3%BCrst/?Ａ

Here the Ａ isn't for real (because we are not on a display), it's just 
an artefact of you choosing to display the bytes with UTF-8. My guess is 
that Apache or other servers are just interpreting the raw 8-bit bytes 
the same as %EF%BC%A1. That would mean that except for the raw/%-encoded 
difference, we have the same thing on the wire for all the browsers we 
tested. Good.
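
That guess can be sketched in Python, under the assumption that a server simply undoes %-escapes and otherwise takes bytes as they come (real servers may differ). The two request targets are the ones reported above; the helper names are illustrative.

```python
from urllib.parse import quote, unquote_to_bytes

PAGE_ENCODING = "iso-8859-1"

# Characters of the link as read from the page with its declared encoding.
path_chars = b"/D\xfcrst/".decode(PAGE_ENCODING)
query_chars = b"\xef\xbc\xa1".decode(PAGE_ENCODING)

# Path: UTF-8, %-escaped (the same for all browsers in the test).
path_escaped = quote(path_chars, safe="/", encoding="utf-8")

# Query, most browsers: page encoding, %-escaped.
target_escaped = path_escaped + "?" + quote(
    query_chars, safe="", encoding=PAGE_ENCODING
)

# Query, IE: page encoding, raw 8-bit bytes on the wire.
target_raw = path_escaped.encode("ascii") + b"?" + query_chars.encode(PAGE_ENCODING)

# After undoing %-escapes, both targets are the same byte sequence.
same_bytes = unquote_to_bytes(target_escaped) == unquote_to_bytes(target_raw)

# Showing the raw query bytes through a UTF-8 viewer yields U+FF21.
shown_as = query_chars.encode(PAGE_ENCODING).decode("utf-8")

print(target_escaped, same_bytes, shown_as)
```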



> I tried to capture some of these test results into a table form at:
> <https://spreadsheets0.google.com/spreadsheet/ccc?key=0AifoWoA0trUndEZSTlRRNnd5MzE3N3RYOVlIVFFMREE&hl=en_US#gid=5>

I'm not yet completely sure that I understand all of the descriptions.

> A question for browser implementers - In some cases it's obvious (Opera
> and MSIE) and others not so much: Do you know if the status bar display
> is using the page encoding or has converted the URI to UTF-8 for display?

As this is something we can't really test, my guess would be that it's 
(mostly) irrelevant. Display has to look right, stuff over the wire has 
to be the right bits.


Regards,    Martin.
Received on Monday, 25 July 2011 11:22:50 UTC