Re: How browsers display IRI's with mixed encodings from 신정식 on 2011-07-22 (public-iri@w3.org from July 2011)

From: 신정식 <jshin1987+w3@gmail.com>
Date: Fri, 22 Jul 2011 00:46:18 -0700
To: Chris Weber <chris@lookout.net>
Cc: "Phillips, Addison" <addison@lab126.com>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <CAE1ONj_B9AG0Ys40fGjVgiHAN8V8yOhPJtw5=Gng_8wdTR2Zfg@mail.gmail.com>
On Thu, Jul 21, 2011 at 10:49 PM, Chris Weber <chris@lookout.net> wrote:

> On 7/21/2011 8:13 PM, Phillips, Addison wrote:
>
>>
>>> I was calling attention to Test 3 which was testing "UTF8ness", as
>>> Jungshik put it, in the query component.  It sounds like you're
>>> referring to Test 1 which had "UTF8ness" in the path, for which of
>>> course you're right it's a lie and should read something more like
>>> "Contains a byte sequence which is also valid UTF-8".
>>>
>>
>> No, I consider Test3 to be invalid also.
>>
>> Here's your href:
>>
>>    <a href='http://www.example.com/Dürst/?ï¼¡' id='test3'>
>>
>> There is no "UTF-8" in the query component. You again have a sequence of
>> ISO-8859-1 characters whose byte representation in the page encoding happens
>> to be a valid UTF-8 sequence. But it is three *characters* in Latin-1, not
>> merely three bytes.
>>
>
> Invalid how?  It was designed to test an HTML page with a stated
> Content-Type charset value "iso-8859-1" and a matching HTTP Content-Type.
>  It contains an anchor href hyperlink that happened to include a 3-byte
> sequence which was not only 3 valid individual characters in the page
> encoding... but together would also represent a single valid character in
> UTF-8 if interpreted that way.  That *was* the test.
>
>
>
>>> The point of this was to test the display as Martin had, but using
>>> unescaped bytes.  From the results of Test 3 it looks like Firefox,
>>> Chrome, and Safari all check for "UTF8ness" in the query component when
>>> displaying the IRI in spite of the page encoding, hence you can visually
>>> see the U+FF21 FULLWIDTH LATIN CAPITAL LETTER A.
>>>
>>
>> Which I consider to be a serious bug in handling an IRI. In theory, the
>> characters in the HTML document are converted to a sequence of Unicode
>> characters when the page is parsed. I should have three Unicode code points
>> in the query portion of the IRI in the above href. They happen to be encoded
>> using three ISO-8859-1 bytes. But they just as well could be encoded
>> as &#x00EF;&#x00BC;&#x00A1;
>>
>>
> Martin's test showed that even the path component containing "%C3%BC" would
> be percent-decoded and displayed as UTF-8... even when the page encoding was
> declared as iso-8859-1.


That's the correct and expected behavior. The path part is always assumed to
be in UTF-8 regardless of the referrer page encoding. The query part is a
different story.
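
To make the path/query split concrete, here is a rough Python sketch. It is
not taken from any browser's code; `to_uri` is a hypothetical helper, and
using the page encoding for the query is an assumption based on what this
thread describes. The query is the three Latin-1 characters from Test 3.

```python
from urllib.parse import quote, unquote


def to_uri(path: str, query: str, page_encoding: str) -> str:
    """Hypothetical helper: percent-encode the path as UTF-8 and the
    query in the page encoding (an assumption, per the thread)."""
    encoded_path = quote(path, safe="", encoding="utf-8")
    encoded_query = quote(query, safe="", encoding=page_encoding)
    return "/" + encoded_path + "/?" + encoded_query


path = "D\u00fcrst"            # "Dürst"
query = "\u00ef\u00bc\u00a1"   # the three Latin-1 characters from Test 3

latin1_page = to_uri(path, query, "iso-8859-1")
utf8_page = to_uri(path, query, "utf-8")

print(latin1_page)  # /D%C3%BCrst/?%EF%BC%A1
print(utf8_page)    # /D%C3%BCrst/?%C3%AF%C2%BC%C2%A1

# Reading the Latin-1 page's query bytes as UTF-8 gives a single
# character, U+FF21, which is what a "UTF8ness" check would display.
shown = unquote("%EF%BC%A1", encoding="utf-8")
print(f"U+{ord(shown):04X}")  # U+FF21
```

The path comes out the same for both pages, while the query differs by page
encoding. The UTF-8 page's query bytes are the same six bytes Chris saw on
the wire for the copied test page.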

Jungshik

>
>
>> (In fact, I just made the test with a UTF-8 encoded page at
>> http://www.inter-locale.com/test/iri-test1.html )
>>
>
> It looks like the original test case content was converted from iso-8859-1
> to UTF-8 when you copied the HTML over, which I don't think you intended.
>  The original three bytes <EF BC A1> are now <C3 AF C2 BC C2 A1> on your
> test page - I'm looking at the HTTP response on the wire using Fiddler.
>  Don't you love encodings? :)
>
>
>> I'm kind of incensed by this: how is it that regular users are supposed to
>> figure out how to work with IRIs if the characters they *see* are not the
>> characters they end up with? Effectively, the only page encoding that works
>> for encoding a query into a path in the text of an HTML page is UTF-8. I
>> guess that's a good thing (why serve anything else?)
>>
>
> That sounds like Martin's point too.
>
>
>
>>> Whereas Opera and MSIE do not and show you a) the
>>> percent-encoded bytes and b) the bytes represented in their page encoding
>>> respectively - do you agree with that assessment?
>>>
>>>
>> Yes. But I guess my point is: one of the key points about IRIs from the
>> very beginning has been that they used normal Unicode character sequences in
>> a normal manner. You could use a different character encoding for
>> serialization purposes, but the IRI is supposed to be WYSIWYG. The
>> translation to URI should, in my opinion, exhibit the Least Surprise.
>>
>>
> Agreed.
>
>> Addison
>>
>>
>
Received on Friday, 22 July 2011 07:46:55 UTC