Re: How browsers display IRI's with mixed encodings from Chris Weber on 2011-07-22 (public-iri@w3.org from July 2011)

From: Chris Weber <chris@lookout.net>
Date: Thu, 21 Jul 2011 22:49:13 -0700
To: "Phillips, Addison" <addison@lab126.com>
CC: "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <4E290F59.3040907@lookout.net>
On 7/21/2011 8:13 PM, Phillips, Addison wrote:
>>
>> I was calling attention to Test 3 which was testing "UTF8ness", as Jungshik put it,
>> in the query component.  It sounds like you're referring to Test 1 which had
>> "UTF8ness" in the path, for which of course you're right it's a lie and should
>> read something more like "Contains a byte sequence which is also valid UTF-8".
>
> No, I consider Test3 to be invalid also.
>
> Here's your href:
>
>     <a href='http://www.example.com/Dürst/?ï¼¡' id='test3'>
>
> There is no "UTF-8" in the query component. You again have a sequence of ISO-8859-1 characters whose byte representation in the page encoding happens to be a valid UTF-8 sequence. But it is three *characters* in Latin-1, not merely three bytes.

Invalid how?  It was designed to test an HTML page with a stated 
Content-Type charset value "iso-8859-1" and a matching HTTP 
Content-Type.  It contained an anchor href that happened to include a 
3-byte sequence which was not only three valid individual characters in 
the page encoding, but which together would also represent a single 
valid character in UTF-8 if interpreted that way.  That *was* the test.
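
To make the byte sequence concrete, here is a minimal Python sketch (not 
part of the original test page; the variable names are illustrative). It 
decodes the same three bytes once as iso-8859-1 and once as UTF-8:

```python
import unicodedata

# The three bytes that appear in the Test 3 query component.
raw = bytes([0xEF, 0xBC, 0xA1])

# Read per the declared page encoding: three separate Latin-1 characters.
as_latin1 = raw.decode("iso-8859-1")
print([f"U+{ord(c):04X}" for c in as_latin1])  # ['U+00EF', 'U+00BC', 'U+00A1']

# Read as UTF-8: one character.
as_utf8 = raw.decode("utf-8")
print([f"U+{ord(c):04X}" for c in as_utf8])    # ['U+FF21']
print(unicodedata.name(as_utf8))               # FULLWIDTH LATIN CAPITAL LETTER A
```

Both readings are valid, which is why the sequence was chosen for the test.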

>>
>> The point of this was to test the display as Martin had, but using unescaped
>> bytes.  From the results of Test 3 it looks like Firefox, Chrome, and Safari all
>> check for "UTF8ness" in the query component when displaying the IRI in spite of
>> the page encoding, hence you can visually see the U+FF21 FULLWIDTH LATIN
>> CAPITAL LETTER A.
>
> Which I consider to be a serious bug in handling an IRI. In theory, the characters in the HTML document are converted to a sequence of Unicode characters when the page is parsed. I should have three Unicode code points in the query portion of the IRI in the above href. They happen to be encoded using three ISO-8859-1 bytes. But they just as well could be encoded as &#x00EF;&#x00BC;&#x00A1;
>

Martin's test showed that even the path component containing "%C3%BC" 
would be percent-decoded and displayed as UTF-8... even when the page 
encoding was declared as iso-8859-1.
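
As a rough illustration of the two possible interpretations (a sketch 
using Python's urllib.parse.unquote, not a model of what any browser 
does internally), the same escape decodes differently depending on the 
charset assumed:

```python
from urllib.parse import unquote

escaped = "%C3%BC"

# Percent-decode and interpret the resulting bytes as UTF-8.
shown_as_utf8 = unquote(escaped, encoding="utf-8")
print(shown_as_utf8)    # ü

# Percent-decode and interpret the same bytes as iso-8859-1.
shown_as_latin1 = unquote(escaped, encoding="iso-8859-1")
print(shown_as_latin1)  # Ã¼
```

The first result corresponds to the display Martin's test showed; the 
second is what a display based on the declared page encoding would look 
like.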

> (In fact, I just made the test with a UTF-8 encoded page at http://www.inter-locale.com/test/iri-test1.html)

It looks like the original test case content was converted from 
iso-8859-1 to UTF-8 when you copied the HTML over, which I don't think 
you intended.  The original three bytes <EF BC A1> are now <C3 AF C2 BC 
C2 A1> on your test page - I'm looking at the HTTP response on the wire 
using Fiddler.  Don't you love encodings? :)
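
The byte change can be reproduced with a short Python sketch (assuming 
the copy step decoded the file as iso-8859-1 and saved it as UTF-8, 
which is my guess at what happened):

```python
# The original three bytes from the iso-8859-1 test page.
original = bytes([0xEF, 0xBC, 0xA1])

# Decode as iso-8859-1, then re-encode as UTF-8, as a transcoding editor would.
transcoded = original.decode("iso-8859-1").encode("utf-8")
print(transcoded.hex(" ").upper())  # C3 AF C2 BC C2 A1

# The transcoded bytes no longer decode to U+FF21 under UTF-8.
print(transcoded.decode("utf-8") == "\uff21")  # False
```

So the converted page carries three two-byte UTF-8 characters and no 
longer exercises the ambiguity the original was built to test.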

> I'm kind of incensed by this: how is it that regular users are supposed to figure out how to work with IRIs if the characters they *see* are not the characters they end up with? Effectively, the only page encoding that works for encoding a query into a path in the text of an HTML page is UTF-8. I guess that's a good thing (why serve anything else?)

That sounds like Martin's point too.

>
>> Whereas Opera and MSIE do not and show you a) the
>> percent-encoded bytes and b) the bytes represented in their page encoding
>> respectively - do you agree with that assessment?
>>
>
> Yes. But I guess my point is: one of the key points about IRIs from the very beginning has been that they used normal Unicode character sequences in a normal manner. You could use a different character encoding for serialization purposes, but the IRI is supposed to be WYSIWYG. The translation to URI should, in my opinion, exhibit the Least Surprise.
>

Agreed.

> Addison
>
Received on Friday, 22 July 2011 05:49:42 UTC