Re: How browsers display IRI's with mixed encodings from 신정식, 申政湜 on 2011-07-22 (public-iri@w3.org from July 2011)

From: 신정식, 申政湜 <jungshik@google.com>
Date: Thu, 21 Jul 2011 17:29:37 -0700
To: Chris Weber <chris@lookout.net>
Cc: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <CADaTyXXuBJeoPp3kZ2cJg7dJCM_KvdumR0+8QiLA4Cz9DciPfA@mail.gmail.com>
2011/7/21 Jungshik Shin (신정식, 申政湜) <jungshik@google.com>

> Hi,
>
> You didn't tell us exactly what you did. Could you tell us exactly what you
> did?
>
> Did you put these URLs in an HTML page (href?)? In what encoding is the HTML
> page (declared encoding)? ISO-8859-1 or UTF-8?
>

I think your HTML page declared its encoding as ISO-8859-1. In that case, it's
not a mixed encoding, because 0xEF 0xBC 0xA1 is a perfectly fine ISO-8859-1
sequence.
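
To make the ambiguity concrete, here is a small Python sketch (an
illustration only, not anything taken from a browser) that decodes the same
three bytes under both encodings. The variable names are made up for the
example.

```python
# The three raw bytes from the query part of the test URL.
raw = bytes([0xEF, 0xBC, 0xA1])

# ISO-8859-1 maps every byte to one character, so decoding never fails.
as_latin1 = raw.decode("iso-8859-1")

# The same bytes are also one well-formed UTF-8 sequence (U+FF21).
as_utf8 = raw.decode("utf-8")

print(len(as_latin1), [hex(ord(c)) for c in as_latin1])
print(len(as_utf8), [hex(ord(c)) for c in as_utf8])
```

Both decodes succeed, so the bytes alone cannot tell us which encoding the
page author had in mind.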

GET /D%C3%BCrst/?%EF%BC

The above is Chrome's internal representation of the URL in question (aside
from the scheme + host part). When displaying the URL in the omnibox, the path
part is always interpreted as UTF-8. The query part is tested for 'UTF8ness'
(after unescaping). If it *can* be interpreted as UTF-8, it's converted to
characters. Otherwise, it remains %-escaped in the display.
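
A rough Python sketch of the display rule described above, written from the
description rather than from Chrome's source. The function name display_url
is hypothetical, and the sketch unescapes everything, whereas a real browser
would presumably keep reserved characters such as '&' escaped.

```python
from urllib.parse import unquote, unquote_to_bytes


def display_url(path: str, query: str) -> str:
    """Return a display form of an escaped path and query (sketch only)."""
    # Path: always interpreted as UTF-8 after unescaping.
    shown_path = unquote(path, encoding="utf-8", errors="replace")

    # Query: unescape to raw bytes, then test whether they are valid UTF-8.
    raw = unquote_to_bytes(query)
    try:
        shown_query = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: leave the query %-escaped in the display.
        shown_query = query
    return f"{shown_path}?{shown_query}"


print(display_url("/D%C3%BCrst/", "%EF%BC%A1"))
print(display_url("/D%C3%BCrst/", "%FC"))
```

With this rule a query of %EF%BC%A1 is shown as U+FF21, while a lone %FC
stays escaped because 0xFC alone is not valid UTF-8.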

I have to check our code again, but I think that's what's happening. The way
we deal with the query part for display purposes is not very robust and needs
some change. I think we have to carry the referrer page encoding around and
use that info to determine whether or not to unescape (and convert), instead
of just checking for UTF8ness, because some byte sequences can be valid in
both UTF-8 and other encodings. Of course, when a user directly types
(or copies and pastes) a URL with a %-escaped query part in the omnibox, we
don't have a referrer encoding and have to resort to either 1) leaving the
query part escaped or 2) doing what we do now (the UTF8ness check).
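
One possible reading of that idea, as a Python sketch. Everything here is an
assumption for illustration: the function display_query, its parameters, and
the choice to decode with the referrer encoding when it is known are not
taken from any existing code.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def display_query(query: str,
                  referrer_encoding: Optional[str] = None,
                  utf8_fallback: bool = True) -> str:
    """Choose a display form for an escaped query string (sketch only)."""
    raw = unquote_to_bytes(query)
    if referrer_encoding is None:
        # Typed or pasted URL: there is no referrer encoding to consult.
        if not utf8_fallback:
            return query          # option 1: leave the query escaped
        encoding = "utf-8"        # option 2: the current UTF8ness check
    else:
        encoding = referrer_encoding
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Bytes invalid in that encoding, or unknown encoding name.
        return query


print(display_query("%EF%BC%A1", "iso-8859-1"))
print(display_query("%EF%BC%A1", "utf-8"))
print(display_query("%EF%BC%A1", None, utf8_fallback=False))
```

With the referrer encoding carried along, the same escaped bytes are shown
as three Latin-1 characters for an ISO-8859-1 page and as one fullwidth
letter for a UTF-8 page.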

Jungshik




>
> Thanks,
>
> Jungshik
>
> On Thu, Jul 21, 2011 at 5:02 PM, Chris Weber <chris@lookout.net> wrote:
>
>> I'm going on a tangent from Martin's intent in the previous email, but it
>> seems in the same vein overall.  I was including some mixed encoding tests -
>> iso-8859-1 mixed with UTF-8 in a hyperlink on a transitional HTML page
>> served with the "iso-8859-1" Content-Type.  The results are similar to
>> Martin's test in the way bytes representing UTF-8 will be treated as such
>> (most often) even in an iso-8859-1 page encoding.
>>
>> From the test page at <http://lookout.net/test/iri/mixenc.php>
>> Test 3 mixes the raw bytes which would represent U+FF21 FULLWIDTH LATIN
>> CAPITAL LETTER A in UTF-8, along with iso-8859-1 raw bytes for the "ü" in
>> "Dürst".  The following hyperlink represents the test case where <0xNN> is a
>> raw byte.
>>
>> http://www.example.com/D<0xFC>rst/?<0xEF 0xBC 0xA1>
>>
>>
>> The results of the display are as follows.
>>
>> Opera (11.50, Win7):
>>  http://www.example.com/DÃ¼rst/?%EF%BC%A1
>>
>> Firefox (5.0, Win7):
>>  http://www.example.com/Dürst/?���
>>
>> IE (8.0.7601.17514, Win7):
>>  http://www.example.com/Dürst/?ï¼¡
>>
>> Chrome (12.0.742.122, Win7):
>>  http://www.example.com/Dürst/?���
>>
>> Safari (5.0.4 (7533.20.27)):
>>  http://www.example.com/Dürst/?���
>>
>> With the exception of IE, all of the above generated the following HTTP
>> request:
>>
>>  GET /D%C3%BCrst/?%EF%BC%A1
>>
>> IE of course does not escape the bytes in the query string.
>>
>>  GET /D%C3%BCrst/?Ａ
>>
>> I tried to capture some of these test results into a table form at:
>> <https://spreadsheets0.google.com/spreadsheet/ccc?key=0AifoWoA0trUndEZSTlRRNnd5MzE3N3RYOVlIVFFMREE&hl=en_US#gid=5>
>>
>> A question for browser implementers - In some cases it's obvious (Opera
>> and MSIE) and others not so much: Do you know if the status bar display is
>> using the page encoding or has converted the URI to UTF-8 for display?
>>
>>
>> Best regards,
>> Chris
>>
>>
>>
>>
>>
>
Received on Friday, 22 July 2011 00:30:03 UTC