Re: How browsers display IRI's with mixed encodings

2011/7/21 Jungshik Shin (신정식, 申政湜) <jungshik@google.com>

>
>
> 2011/7/21 Jungshik Shin (신정식, 申政湜) <jungshik@google.com>
>
> Hi,
>>
>> You didn't tell us exactly what you did. Could you describe the exact
>> steps?
>>
>> Did you put these URLs in an HTML page (href?)? In what encoding is the HTML
>> page (declared encoding)? ISO-8859-1 or UTF-8?
>>
>
> I think your HTML page declared its encoding to be ISO-8859-1. Then
> it's not a mixed encoding, because xEF xBC xA1 is a perfectly fine
> ISO-8859-1 sequence.
>
> GET /D%C3%BCrst/?%EF%BC
>

Oops. I meant  "/D%C3%BCrst/?%EF%BC%A1"

>
> The above is Chrome's internal representation of the URL in question (aside
> from the scheme + host part). When displaying the URL in the omnibox, the path
> part is always interpreted as UTF-8. The query part is tested for 'UTF8ness'
> (after unescaping). If it *can* be interpreted as UTF-8, it's converted to
> characters. Otherwise, it remains %-escaped in the display.
>
>
You can confirm the above by copy'n'pasting the following into Chrome's
omnibox: http://foobar.com//D%C3%BCrst/?%EF%BC
(Note that I dropped '%A1', making the unescaped query part invalid as
UTF-8.)
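
To make the display rule concrete, here is a minimal Python sketch of it. This
is not Chrome's code; display_url is a hypothetical helper, and it ignores
details a real browser has to handle (for example, keeping reserved characters
such as %26 escaped).

```python
from urllib.parse import unquote, unquote_to_bytes, urlsplit


def display_url(url: str) -> str:
    """Hypothetical helper: format a %-escaped URL for display."""
    parts = urlsplit(url)

    # Path: always interpreted as UTF-8 after unescaping.
    path = unquote(parts.path, encoding="utf-8", errors="replace")

    # Query: unescape to raw bytes, then test for "UTF8ness".
    raw_query = unquote_to_bytes(parts.query)
    try:
        query = raw_query.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: leave the query %-escaped in the display.
        query = parts.query

    shown = f"{parts.scheme}://{parts.netloc}{path}"
    if parts.query:
        shown += "?" + query
    return shown


# Complete sequence EF BC A1 is valid UTF-8 (U+FF21), so it is unescaped.
full = display_url("http://www.example.com/D%C3%BCrst/?%EF%BC%A1")

# Truncated sequence EF BC is not valid UTF-8, so it stays escaped.
truncated = display_url("http://www.example.com/D%C3%BCrst/?%EF%BC")
```

With the full query the sketch shows U+FF21 after the '?'; with '%A1' dropped
the query stays as %EF%BC.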

Jungshik



> I have to check our code again, but I think that's what's happening. The
> way we deal with the query part for display purposes is not very robust and
> needs some change. I think we have to carry the referrer page encoding
> around and use that info to determine whether or not to unescape (and
> convert), instead of just checking for UTF8ness, because some byte sequences
> can be valid in both UTF-8 and other encodings. Of course, when a user
> directly types (or copy'n'pastes) a URL with a %-escaped query part into the
> omnibox, we don't have a referrer encoding and have to resort to either 1)
> leaving the query part escaped or 2) doing what we do now (the UTF8ness
> check).
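
A rough Python sketch of the idea above, under the assumption that the referrer
page encoding is available as a plain encoding name. display_query and its
referrer_encoding parameter are hypothetical names for illustration, not
anything in Chrome. When no referrer encoding is known, the sketch takes
option 2 (the UTF-8 check); option 1 would simply return the escaped query.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def display_query(escaped_query: str,
                  referrer_encoding: Optional[str] = None) -> str:
    """Hypothetical helper: choose how to show a %-escaped query part."""
    raw = unquote_to_bytes(escaped_query)

    # No referrer (typed or pasted URL): fall back to the UTF-8 check.
    encoding = referrer_encoding or "utf-8"
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Bytes invalid in that encoding, or unknown encoding name:
        # keep the query %-escaped.
        return escaped_query


# The same bytes EF BC A1 are valid in both encodings but mean
# different characters, which is why the referrer encoding matters.
from_latin1_page = display_query("%EF%BC%A1", "iso-8859-1")
pasted = display_query("%EF%BC%A1")
pasted_truncated = display_query("%EF%BC")
```

Here the ISO-8859-1 referrer yields the three characters U+00EF U+00BC U+00A1,
while the pasted URL yields the single character U+FF21.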
>
> Jungshik
>
>
>
>
>>
>> Thanks,
>>
>> Jungshik
>>
>> On Thu, Jul 21, 2011 at 5:02 PM, Chris Weber <chris@lookout.net> wrote:
>>
>>> I'm going on a tangent from Martin's intent in the previous email, but it
>>> seems in the same vein overall.  I was including some mixed encoding tests -
>>> iso-8859-1 mixed with UTF-8 in a hyperlink on a transitional HTML page
>>> served with the "iso-8859-1" Content-Type.  The results are similar to
>>> Martin's test in the way bytes representing UTF-8 will be treated as such
>>> (most often) even in an iso-8859-1 page encoding.
>>>
>>> From the test page at <http://lookout.net/test/iri/mixenc.php>,
>>> Test 3 mixes the raw bytes which would represent U+FF21 FULLWIDTH LATIN
>>> CAPITAL LETTER A in UTF-8, along with iso-8859-1 raw bytes for the "ü" in
>>> "Dürst".  The following hyperlink represents the test case where <0xNN> is a
>>> raw byte.
>>>
>>> http://www.example.com/D<0xFC>rst/?<0xEF 0xBC 0xA1>
>>>
>>>
>>
>>
>>> The results of the display are as follows.
>>>
>>> Opera (11.50, Win7):
>>>  http://www.example.com/Dürst/?%EF%BC%A1
>>>
>>> Firefox (5.0, Win7):
>>>  http://www.example.com/Dürst/?
>>>
>>> IE (8.0.7601.17514, Win7):
>>>  http://www.example.com/Dürst/?ï¼¡
>>>
>>> Chrome (12.0.742.122, Win7):
>>>  http://www.example.com/Dürst/?
>>>
>>> Safari (5.0.4 (7533.20.27)):
>>>  http://www.example.com/Dürst/?
>>>
>>> With the exception of IE, all of the above generated the following HTTP
>>> request :
>>>
>>>  GET /D%C3%BCrst/?%EF%BC%A1
>>>
>>> IE of course does not escape the bytes in the query string.
>>>
>>>  GET /D%C3%BCrst/?A
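
One way the non-IE request line above could arise, sketched in Python. The
assumption here (not stated in the message) is that the browser first decodes
the raw bytes using the page encoding, then %-escapes the path as UTF-8 and the
query in the page encoding. request_target is a hypothetical helper, not any
browser's API.

```python
from urllib.parse import quote


def request_target(path: str, query: str, page_encoding: str) -> str:
    """Hypothetical helper: build the request line from decoded characters."""
    # Path: characters are converted to UTF-8 bytes, then %-escaped.
    escaped_path = quote(path, safe="/", encoding="utf-8")
    # Query: characters are converted back to page-encoding bytes,
    # then %-escaped (assumption about the browsers' behavior).
    escaped_query = quote(query, safe="=&", encoding=page_encoding)
    return f"GET {escaped_path}?{escaped_query}"


page_encoding = "iso-8859-1"

# Raw bytes from the test hyperlink, decoded with the page encoding.
path_chars = b"/D\xfcrst/".decode(page_encoding)       # 0xFC becomes U+00FC
query_chars = b"\xef\xbc\xa1".decode(page_encoding)    # three Latin-1 chars

line = request_target(path_chars, query_chars, page_encoding)
```

Under that assumption the raw 0xFC in the path turns into %C3%BC, while the
query bytes round-trip unchanged to %EF%BC%A1.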
>>>
>>> I tried to capture some of these test results into a table form at:
>>> <https://spreadsheets0.google.com/spreadsheet/ccc?key=0AifoWoA0trUndEZSTlRRNnd5MzE3N3RYOVlIVFFMREE&hl=en_US#gid=5>
>>>
>>> A question for browser implementers - In some cases it's obvious (Opera
>>> and MSIE) and others not so much: Do you know if the status bar display is
>>> using the page encoding or has converted the URI to UTF-8 for display?
>>>
>>>
>>> Best regards,
>>> Chris
>>>
>>>
>>>
>>>
>>>
>>
>

Received on Friday, 22 July 2011 00:43:51 UTC