RE: How browsers display IRI's with mixed encodings from Phillips, Addison on 2011-07-22 (public-iri@w3.org from July 2011)

From: Phillips, Addison <addison@lab126.com>
Date: Thu, 21 Jul 2011 20:13:58 -0700
To: Chris Weber <chris@lookout.net>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <131F80DEA635F044946897AFDA9AC3476A947B0DB2@EX-SEA31-D.ant.amazon.com>

> 
> I was calling attention to Test 3 which was testing "UTF8ness", as Jungshik put it,
> in the query component.  It sounds like you're referring to Test 1 which had
> "UTF8ness" in the path, for which of course you're right it's a lie and should
> read something more like "Contains a byte sequence which is also valid UTF-8".

No, I consider Test 3 to be invalid also.

Here's your href:

   <a href='http://www.example.com/Dürst/?ï¼¡' id='test3'>

There is no "UTF-8" in the query component. You again have a sequence of ISO-8859-1 characters whose byte representation in the page encoding happens to be a valid UTF-8 sequence. But it is three *characters* in Latin-1, not merely three bytes.
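To make the distinction concrete, here is a minimal Python sketch. The variable names are mine and purely illustrative; the only facts it relies on are the three code points named in this thread (U+00EF, U+00BC, U+00A1) and the standard ISO-8859-1 and UTF-8 codecs.

```python
import unicodedata

# The three characters in the query of the href: U+00EF, U+00BC, U+00A1.
query = "\u00ef\u00bc\u00a1"

# Serialized in an ISO-8859-1 page, each character is exactly one byte.
latin1_bytes = query.encode("iso-8859-1")

# Those bytes also happen to be one well-formed UTF-8 sequence, which is
# the coincidence the test relies on.
reinterpreted = latin1_bytes.decode("utf-8")

print(len(query))          # number of characters in the query
print(latin1_bytes.hex())  # the page-encoding bytes
print(unicodedata.name(reinterpreted))
```

The query is three characters long; only a reinterpretation of its serialized bytes turns it into a single fullwidth letter.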

> 
> The point of this was to test the display as Martin had, but using unescaped
> bytes.  From the results of Test 3 it looks like Firefox, Chrome, and Safari all
> check for "UTF8ness" in the query component when displaying the IRI in spite of
> the page encoding, hence you can visually see the U+FF21 FULLWIDTH LATIN
> CAPITAL LETTER A.  

Which I consider to be a serious bug in handling an IRI. In theory, the characters in the HTML document are converted to a sequence of Unicode characters when the page is parsed. I should have three Unicode code points in the query portion of the IRI in the above href. They happen to be encoded using three ISO-8859-1 bytes. But they could just as well be encoded as the numeric character references &#x00EF;&#x00BC;&#x00A1;.

(In fact, I just made the test with a UTF-8 encoded page at http://www.inter-locale.com/test/iri-test1.html) 

Chrome, IE9, and FF show the "correct" (three Unicode character) sequence
Opera shows the "correct" percent-escaped sequence for three Unicode characters
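As a rough sketch of what "correct" means here: the numeric character references resolve to three code points, and, assuming the RFC 3987 mapping (convert each non-ASCII character to UTF-8, then percent-encode the bytes), the escaped form follows directly. The variable names are illustrative only.

```python
import html
from urllib.parse import quote

# Resolve the numeric character references the way an HTML parser would.
ncr_query = html.unescape("&#x00EF;&#x00BC;&#x00A1;")

# Three separate code points, whatever the page encoding was.
code_points = [f"U+{ord(c):04X}" for c in ncr_query]

# Assumed IRI-to-URI mapping: UTF-8 bytes of each character, percent-encoded.
escaped = quote(ncr_query, safe="", encoding="utf-8")

print(code_points)
print(escaped)
```

Under that assumption the escaped query is six percent-encoded bytes (two per character), not three.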

Now when I change the page encoding to ISO-8859-1...... (see http://www.inter-locale.com/test/iri-test2.html)

... Bad Things happen. Even though I *explicitly* encoded three Unicode characters:

Chrome and FF show a FULLWIDTH LATIN CAPITAL LETTER A
Opera shows the percent-encoded form of that character
IE9 does the right thing (three Unicode characters, at least visually)
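One plausible model of what is going on, and it is only my assumption about the browsers' internals, is that the query gets percent-encoded using the page encoding and the result is then decoded as UTF-8 for display. A short Python sketch of that model, with illustrative names:

```python
from urllib.parse import quote, unquote

# The three code points explicitly written in the page.
query = "\u00ef\u00bc\u00a1"

# Assumed browser behaviour: escape the query in the page encoding.
escaped_in_page_encoding = quote(query, safe="", encoding="iso-8859-1")

# Displaying that escaped form as if it were UTF-8 collapses it to one character.
displayed = unquote(escaped_in_page_encoding, encoding="utf-8")

# Escaping and displaying consistently in UTF-8 round-trips the original text.
round_tripped = unquote(quote(query, safe="", encoding="utf-8"), encoding="utf-8")

print(escaped_in_page_encoding)
print(displayed == "\uff21")
print(round_tripped == query)
```

If this model is right, the surprise comes from mixing two encodings in one round trip, which is why a UTF-8 page does not show the problem.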

I'm kind of incensed by this: how is it that regular users are supposed to figure out how to work with IRIs if the characters they *see* are not the characters they end up with? Effectively, the only page encoding that works for encoding a query into a path in the text of an HTML page is UTF-8. I guess that's a good thing (why serve anything else?)


> Whereas Opera and MSIE do not and show you a) the
> percent-encoded bytes and b) the bytes represented in their page encoding
> respectively - do you agree with that assessment?
> 

Yes. But I guess my point is: one of the key points about IRIs from the very beginning has been that they used normal Unicode character sequences in a normal manner. You could use a different character encoding for serialization purposes, but the IRI is supposed to be WYSIWYG. The translation to URI should, in my opinion, exhibit the Least Surprise.

Addison

Received on Friday, 22 July 2011 03:14:50 UTC