- From: Chris Weber <chris@lookout.net>
- Date: Thu, 21 Jul 2011 22:49:13 -0700
- To: "Phillips, Addison" <addison@lab126.com>
- CC: "public-iri@w3.org" <public-iri@w3.org>
On 7/21/2011 8:13 PM, Phillips, Addison wrote:
>>
>> I was calling attention to Test 3 which was testing "UTF8ness", as Jungshik put it,
>> in the query component. It sounds like you're referring to Test 1 which had
>> "UTF8ness" in the path, for which of course you're right it's a lie and should
>> read something more like "Contains a byte sequence which is also valid UTF-8".
>
> No, I consider Test3 to be invalid also.
>
> Here's your href:
>
> <a href='http://www.example.com/Dürst/?A' id='test3'>
>
> There is no "UTF-8" in the query component. You again have a sequence of ISO-8859-1 characters whose byte representation in the page encoding happens to be a valid UTF-8 sequence. But it is three *characters* in Latin-1, not merely three bytes.

Invalid how? It was designed to test an HTML page with a stated Content-Type charset value "iso-8859-1" and a matching HTTP Content-Type. It contains an anchor href hyperlink that happened to include a 3-byte sequence which was not only 3 valid individual characters in the page encoding... but together would also represent a single valid character in UTF-8 if interpreted that way. That *was* the test.

>>
>> The point of this was to test the display as Martin had, but using unescaped
>> bytes. From the results of Test 3 it looks like Firefox, Chrome, and Safari all
>> check for "UTF8ness" in the query component when displaying the IRI in spite of
>> the page encoding, hence you can visually see the U+FF21 FULLWIDTH LATIN
>> CAPITAL LETTER A.
>
> Which I consider to be a serious bug in handling an IRI. In theory, the characters in the HTML document are converted to a sequence of Unicode characters when the page is parsed. I should have three Unicode code points in the query portion of the IRI in the above href. They happen to be encoded using three ISO-8859-1 bytes. But they just as well could be encoded asA
>

Martin's test showed that even the path component containing "%C3%BC" would be percent-decoded and displayed as UTF-8... even when the page encoding was declared as iso-8859-1.

> (In fact, I just made the test with a UTF-8 encoded page at http://www.inter-locale.com/test/iri-test1.html)

It looks like the original test case content was converted from iso-8859-1 to UTF-8 when you copied the HTML over, which I don't think you intended. The original three bytes <EF BC A1> are now <C3 AF C2 BC C2 A1> on your test page - I'm looking at the HTTP response on the wire using Fiddler. Don't you love encodings? :)

> I'm kind of incensed by this: how is it that regular users are supposed to figure out how to work with IRIs if the characters they *see* are not the characters they end up with? Effectively, the only page encoding that works for encoding a query into a path in the text of an HTML page is UTF-8. I guess that's a good thing (why serve anything else?)

That sounds like Martin's point too.

>
>> Whereas Opera and MSIE do not and show you a) the
>> percent-encoded bytes and b) the bytes represented in their page encoding
>> respectively - do you agree with that assessment?
>>
>
> Yes. But I guess my point is: one of the key points about IRIs from the very beginning has been that they used normal Unicode character sequences in a normal manner. You could use a different character encoding for serialization purposes, but the IRI is supposed to be WYSIWYG. The translation to URI should, in my opinion, exhibit the Least Surprise.
>

Agreed.

> Addison
>
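
The byte sequences discussed in this thread can be checked with a few lines of Python. This is only a rough sketch for readers following along, not part of the original test suite: the variable names are made up for illustration, and it says nothing about what any particular browser does internally. It shows the two readings of <EF BC A1>, the transcoding that appears to have produced <C3 AF C2 BC C2 A1>, and the two ways "%C3%BC" can be percent-decoded.

```python
from urllib.parse import quote, unquote
import unicodedata

# The three bytes from Test 3's query component, as quoted in the thread.
raw = bytes([0xEF, 0xBC, 0xA1])

# Read in the declared page encoding (ISO-8859-1): three separate characters.
as_latin1 = raw.decode("iso-8859-1")

# Read as UTF-8: a single character.
as_utf8 = raw.decode("utf-8")

# Transcoding the Latin-1 text to UTF-8 turns each of the three bytes
# into a two-byte sequence, giving six bytes in total.
transcoded = as_latin1.encode("utf-8")

# Percent-decoding "%C3%BC" from the path component under each encoding.
path_utf8 = unquote("%C3%BC", encoding="utf-8")
path_latin1 = unquote("%C3%BC", encoding="iso-8859-1")

# Percent-encoding the three Latin-1 *characters* via UTF-8, which is one
# way to read "three Unicode code points in the query portion".
query_from_chars = quote(as_latin1)

print("Latin-1 reading:", [f"U+{ord(c):04X}" for c in as_latin1])
print("UTF-8 reading:  ", [unicodedata.name(c) for c in as_utf8])
print("Transcoded:     ", transcoded.hex(" ").upper())
print("Path as UTF-8:  ", [f"U+{ord(c):04X}" for c in path_utf8])
print("Path as Latin-1:", [f"U+{ord(c):04X}" for c in path_latin1])
print("Query from chars:", query_from_chars)
```

The same three bytes are either three characters or one, depending solely on which decoder is applied, which is the ambiguity the test was built to expose.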
Received on Friday, 22 July 2011 05:49:42 UTC