- From: 신정식, 申政湜 <jungshik@google.com>
- Date: Thu, 21 Jul 2011 17:43:26 -0700
- To: Chris Weber <chris@lookout.net>
- Cc: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
- Message-ID: <CADaTyXVgPChkDf41Z80+hHyfRY7=-wdVmuAgj0MOXX8zzFe4Hw@mail.gmail.com>
2011/7/21 Jungshik Shin (신정식, 申政湜) <jungshik@google.com>

> 2011/7/21 Jungshik Shin (신정식, 申政湜) <jungshik@google.com>
>
>> Hi,
>>
>> You didn't tell us exactly what you did. Could you tell us exactly
>> what you did?
>>
>> Did you put these URLs in an HTML page (href?)? In what encoding is the
>> HTML page (declared encoding)? ISO-8859-1 or UTF-8?
>
> I think your HTML page declared its encoding to be ISO-8859-1. Then
> it's not a mixed encoding, because xEF xBC xA1 is a perfectly fine
> ISO-8859-1 sequence.
>
> GET /D%C3%BCrst/?%EF%BC

Oops. I meant "/D%C3%BCrst/?%EF%BC%A1"

> The above is Chrome's internal representation of the URL in question (aside
> from the spec + host part). When displaying the URL in the omnibox, the path
> part is always interpreted as UTF-8. The query part is tested for 'UTF8ness'
> (after unescaping). If it *can* be interpreted as UTF-8, it's converted to
> characters. Otherwise, it remains %-escaped in the display.

You can confirm the above by copying and pasting the following into Chrome's
omnibox:

http://foobar.com//D%C3%BCrst/?%EF%BC

(Note that I dropped '%A1', making the unescaped query part invalid as UTF-8.)

Jungshik

> I have to check our code again, but I think that's what's happening. The
> way we deal with the query part for display purposes is not very robust and
> needs some change. I think we have to carry the referrer page encoding
> around and use that info to determine whether or not to unescape (and
> convert), instead of just checking for UTF8ness (because some byte sequences
> can be valid in both UTF-8 and other encodings). Of course, when a user
> directly types (or copies and pastes) a URL with a %-escaped query part into
> the omnibox, we don't have a referrer encoding and have to resort to either
> 1) leaving the query part escaped or 2) doing what we do now (UTF8ness
> check).
>
> Jungshik
>
>> Thanks,
>>
>> Jungshik
>>
>> On Thu, Jul 21, 2011 at 5:02 PM, Chris Weber <chris@lookout.net> wrote:
>>
>>> I'm going on a tangent from Martin's intent in the previous email, but it
>>> seems in the same vein overall. I was including some mixed encoding tests -
>>> iso-8859-1 mixed with UTF-8 in a hyperlink on a transitional HTML page
>>> served with the "iso-8859-1" Content-Type. The results are similar to
>>> Martin's test in that bytes representing UTF-8 will (most often) be
>>> treated as such, even in an iso-8859-1 page encoding.
>>>
>>> From the test page at <http://lookout.net/test/iri/mixenc.php>
>>> Test 3 mixes the raw bytes which would represent U+FF21 FULLWIDTH LATIN
>>> CAPITAL LETTER A in UTF-8, along with the iso-8859-1 raw byte for the "ü" in
>>> "Dürst". The following hyperlink represents the test case, where <0xNN> is a
>>> raw byte.
>>>
>>> http://www.example.com/D<0xFC>rst/?<0xEF 0xBC 0xA1>
>>>
>>> The results of the display are as follows.
>>>
>>> Opera (11.50, Win7):
>>> http://www.example.com/Dürst/?%EF%BC%A1
>>>
>>> Firefox (5.0, Win7):
>>> http://www.example.com/Dürst/?
>>>
>>> IE (8.0.7601.17514, Win7):
>>> http://www.example.com/Dürst/?ï¼¡
>>>
>>> Chrome (12.0.742.122, Win7):
>>> http://www.example.com/Dürst/?
>>>
>>> Safari (5.0.4 (7533.20.27)):
>>> http://www.example.com/Dürst/?
>>>
>>> With the exception of IE, all of the above generated the following HTTP
>>> request:
>>>
>>> GET /D%C3%BCrst/?%EF%BC%A1
>>>
>>> IE of course does not escape the bytes in the query string.
>>>
>>> GET /D%C3%BCrst/?A
>>>
>>> I tried to capture some of these test results in table form at:
>>> <https://spreadsheets0.google.com/spreadsheet/ccc?key=0AifoWoA0trUndEZSTlRRNnd5MzE3N3RYOVlIVFFMREE&hl=en_US#gid=5>
>>>
>>> A question for browser implementers - in some cases it's obvious (Opera
>>> and MSIE) and in others not so much: do you know if the status bar display
>>> is using the page encoding, or has it converted the URI to UTF-8 for display?
>>>
>>> Best regards,
>>> Chris
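
The display heuristic described in this message (path always read as UTF-8; query unescaped only if its bytes happen to be valid UTF-8) can be sketched roughly as follows. This is a minimal illustration in Python, not Chrome's actual code; the function names display_query and display_url are hypothetical, and only the Python standard library is assumed.

```python
from urllib.parse import unquote_to_bytes, urlsplit


def display_query(query: str) -> str:
    """Hypothetical helper: unescape the query for display only if the
    unescaped bytes form valid UTF-8; otherwise leave it %-escaped."""
    raw = unquote_to_bytes(query)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return query


def display_url(url: str) -> str:
    """Hypothetical helper: build a display string for a URL whose path
    and query are %-escaped, following the heuristic in the message."""
    parts = urlsplit(url)
    # The path is always interpreted as UTF-8 for display
    # (invalid bytes are replaced here; that choice is an assumption).
    path = unquote_to_bytes(parts.path).decode("utf-8", errors="replace")
    shown = f"{parts.scheme}://{parts.netloc}{path}"
    if parts.query:
        shown += "?" + display_query(parts.query)
    return shown


if __name__ == "__main__":
    # Query bytes EF BC A1 are valid UTF-8 (U+FF21), so they are unescaped.
    print(display_url("http://foobar.com//D%C3%BCrst/?%EF%BC%A1"))
    # Dropping %A1 leaves a truncated UTF-8 sequence, so it stays escaped.
    print(display_url("http://foobar.com//D%C3%BCrst/?%EF%BC"))
    # The same three bytes are also a valid ISO-8859-1 sequence, which is
    # why a UTF8ness check alone cannot tell the two encodings apart.
    print(bytes([0xEF, 0xBC, 0xA1]).decode("iso-8859-1"))
```

The last line shows the ambiguity raised above: without the referrer page's encoding, the bytes 0xEF 0xBC 0xA1 read equally well as one fullwidth character in UTF-8 or as three characters in ISO-8859-1.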
Received on Friday, 22 July 2011 00:43:51 UTC