- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Mon, 25 Jul 2011 20:21:27 +0900
- To: Chris Weber <chris@lookout.net>
- CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Hello Chris, others,

On 2011/07/22 9:02, Chris Weber wrote:
> I'm going on a tangent from Martin's intent in the previous email, but
> it seems in the same vein overall. I was including some mixed encoding
> tests - iso-8859-1 mixed with UTF-8

UTF-8 -> what may also look like UTF-8

> in a hyperlink on a transitional HTML page

How much does transitional vs. strict or whatever affect IRIs? I never
thought of that, but obviously, that doesn't mean it couldn't make a
difference.

> served with the "iso-8859-1" Content-Type. The results are
> similar to Martin's test in the way bytes representing UTF-8 will be
> treated as such (most often) even in an iso-8859-1 page encoding.

In my test, these were %-escaped. %-escaping is a pure URI/IRI thing,
not related (on paper, at least) to page encoding. What you are using
are raw bytes, which obviously should be interpreted as characters
based on the page encoding.

> From the test page at <http://lookout.net/test/iri/mixenc.php> Test 3
> mixes the raw bytes which would represent U+FF21 FULLWIDTH LATIN CAPITAL
> LETTER A in UTF-8, along with iso-8859-1 raw bytes for the "ü" in
> "Dürst". The following hyperlink represents the test case where <0xNN>
> is a raw byte.

It also mixes path parts and query parts, which for historical reasons
have to be treated somewhat differently.

> http://www.example.com/D<0xFC>rst/?<0xEF 0xBC 0xA1>
>
> The results of the display are as follows.
>
> Opera (11.50, Win7):
> http://www.example.com/Dürst/?%EF%BC%A1

The path part is double-encoded, hopelessly messed up. The query part
may be okay because, as far as I understand, current browsers interpret
query parts according to the page encoding.

> Firefox (5.0, Win7):
> http://www.example.com/Dürst/?A

The path part is okay. The query part is clearly broken.

> IE (8.0.7601.17514, Win7):
> http://www.example.com/Dürst/?A

The path part is okay.
The query part may be okay (it displays the characters in the document,
not some weird reinterpretations of the bytes they were represented
with).

> Chrome (12.0.742.122, Win7):
> http://www.example.com/Dürst/?A
>
> Safari (5.0.4 (7533.20.27)):
> http://www.example.com/Dürst/?A

Same as above for Firefox.

> With the exception of IE, all of the above generated the following HTTP
> request :
>
> GET /D%C3%BCrst/?%EF%BC%A1

That's about right.

> IE of course does not escape the bytes in the query string.
>
> GET /D%C3%BCrst/?A

Here the A isn't for real (because we are not on a display), it's just
an artefact of you choosing to display the bytes as UTF-8. My guess is
that Apache or other servers just interpret the raw 8-bit bytes the
same as %EF%BC%A1. That would mean that, except for the raw/%-encoded
difference, we have the same thing on the wire for all the browsers we
tested. Good.

> I tried to capture some of these test results into a table form at:
> <https://spreadsheets0.google.com/spreadsheet/ccc?key=0AifoWoA0trUndEZSTlRRNnd5MzE3N3RYOVlIVFFMREE&hl=en_US#gid=5>

I'm not yet completely sure I understand all of the descriptions.

> A question for browser implementers - In some cases it's obvious (Opera
> and MSIE) and others not so much: Do you know if the status bar display
> is using the page encoding or has converted the URI to UTF-8 for display?

As this is something we can't really test, my guess would be that it's
(mostly) irrelevant. Display has to look right, stuff over the wire has
to be the right bits.

Regards, Martin.
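
For readers who want to reproduce the byte-level arithmetic discussed above, the following is a minimal Python sketch, not a model of any browser's actual code. It assumes the raw bytes of Test 3 as described in the message; the variable names are hypothetical. It shows the path bytes being decoded with the page encoding and re-escaped as UTF-8, the query bytes being escaped as they are, and what a reinterpretation of the query bytes as iso-8859-1 would produce instead.

```python
from urllib.parse import quote, unquote_to_bytes

# Raw bytes of the Test 3 link, as described in the message (assumed layout):
# path  "D<0xFC>rst"      -> iso-8859-1 byte for "ü"
# query <0xEF 0xBC 0xA1>  -> UTF-8 bytes for U+FF21
raw_path = b"/D\xfcrst/"
raw_query = b"\xef\xbc\xa1"

# Path: interpret the bytes using the page encoding, then %-escape as UTF-8.
request_path = quote(raw_path.decode("iso-8859-1"), safe="/", encoding="utf-8")

# Query: %-escape the raw bytes unchanged, with no reinterpretation.
request_query = quote(raw_query, safe="/")

# Contrast: reading the query bytes as iso-8859-1 and re-encoding them as
# UTF-8 turns three bytes into six (a double encoding).
double_encoded_query = quote(raw_query.decode("iso-8859-1"), encoding="utf-8")

print("GET " + request_path + "?" + request_query)
print("double-encoded query:", double_encoded_query)

# Un-escaping the %-encoded query gives back exactly the raw bytes, so a
# server sees the same octets whether or not the browser escaped them.
print(unquote_to_bytes(request_query) == raw_query)
```

The first printed line matches the request reported for the non-IE browsers; the last check illustrates why a raw-byte query and its %-escaped form can be equivalent from the server's point of view.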
Received on Monday, 25 July 2011 11:22:50 UTC