- From: 신정식 <jshin1987+w3@gmail.com>
- Date: Fri, 22 Jul 2011 00:46:18 -0700
- To: Chris Weber <chris@lookout.net>
- Cc: "Phillips, Addison" <addison@lab126.com>, "public-iri@w3.org" <public-iri@w3.org>
- Message-ID: <CAE1ONj_B9AG0Ys40fGjVgiHAN8V8yOhPJtw5=Gng_8wdTR2Zfg@mail.gmail.com>
On Thu, Jul 21, 2011 at 10:49 PM, Chris Weber <chris@lookout.net> wrote:

> On 7/21/2011 8:13 PM, Phillips, Addison wrote:
>
>>> I was calling attention to Test 3 which was testing "UTF8ness", as
>>> Jungshik put it, in the query component. It sounds like you're
>>> referring to Test 1 which had "UTF8ness" in the path, for which of
>>> course you're right it's a lie and should read something more like
>>> "Contains a byte sequence which is also valid UTF-8".
>>
>> No, I consider Test3 to be invalid also.
>>
>> Here's your href:
>>
>> <a href='http://www.example.com/Dürst/?ï¼¡' id='test3'>
>>
>> There is no "UTF-8" in the query component. You again have a sequence of
>> ISO-8859-1 characters whose byte representation in the page encoding
>> happens to be a valid UTF-8 sequence. But it is three *characters* in
>> Latin-1, not merely three bytes.
>
> Invalid how? It was designed to test an HTML page with a stated
> Content-Type charset value "iso-8859-1" and a matching HTTP Content-Type.
> It contains an anchor href hyperlink that happened to include a 3-byte
> sequence which was not only 3 valid individual characters in the page
> encoding... but together would also represent a single valid character in
> UTF-8 if interpreted that way. That *was* the test.
>
>>> The point of this was to test the display as Martin had, but using
>>> unescaped bytes. From the results of Test 3 it looks like Firefox,
>>> Chrome, and Safari all check for "UTF8ness" in the query component when
>>> displaying the IRI in spite of the page encoding, hence you can visually
>>> see the U+FF21 FULLWIDTH LATIN CAPITAL LETTER A.
>>
>> Which I consider to be a serious bug in handling an IRI. In theory, the
>> characters in the HTML document are converted to a sequence of Unicode
>> characters when the page is parsed. I should have three Unicode code
>> points in the query portion of the IRI in the above href. They happen to
>> be encoded using three ISO-8859-1 bytes. But they just as well could be
>> encoded asA
>
> Martin's test showed that even the path component containing "%C3%BC"
> would be percent-decoded and displayed as UTF-8... even when the page
> encoding was declared as iso-8859-1.

That's the correct and expected behavior. The path part is always assumed
to be in UTF-8 regardless of the referrer page encoding. The query part is a
different story.

Jungshik

>> (In fact, I just made the test with a UTF-8 encoded page at
>> http://www.inter-locale.com/test/iri-test1.html )
>
> It looks like the original test case content was converted from iso-8859-1
> to UTF-8 when you copied the HTML over, which I don't think you intended.
> The original three bytes <EF BC A1> are now <C3 AF C2 BC C2 A1> on your
> test page - I'm looking at the HTTP response on the wire using Fiddler.
> Don't you love encodings? :)
>
>> I'm kind of incensed by this: how is it that regular users are supposed
>> to figure out how to work with IRIs if the characters they *see* are not
>> the characters they end up with? Effectively, the only page encoding that
>> works for encoding a query into a path in the text of an HTML page is
>> UTF-8. I guess that's a good thing (why serve anything else?)
>
> That sounds like Martin's point too.
>
>>> Whereas Opera and MSIE do not and show you a) the percent-encoded bytes
>>> and b) the bytes represented in their page encoding respectively - do
>>> you agree with that assessment?
>>
>> Yes. But I guess my point is: one of the key points about IRIs from the
>> very beginning has been that they used normal Unicode character sequences
>> in a normal manner. You could use a different character encoding for
>> serialization purposes, but the IRI is supposed to be WYSIWYG. The
>> translation to URI should, in my opinion, exhibit the Least Surprise.
>
> Agreed.
>
>> Addison
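
For readers following the byte-level argument in this thread, here is a minimal Python sketch of the facts being debated. It is not taken from any of the test pages, and the variable names are purely illustrative. It only shows how the same three bytes <EF BC A1> read under each encoding, how the <C3 AF C2 BC C2 A1> sequence Chris saw on the wire arises, and how percent-encoding of the query differs by charset. It makes no claim about what any particular browser actually does.

```python
from urllib.parse import quote, unquote

# The three raw bytes from the Test 3 query component.
raw = bytes([0xEF, 0xBC, 0xA1])

# Read in the declared page encoding: three separate Latin-1 characters.
as_latin1 = raw.decode("iso-8859-1")
print([f"U+{ord(c):04X}" for c in as_latin1])  # U+00EF, U+00BC, U+00A1

# Read as UTF-8: a single character, FULLWIDTH LATIN CAPITAL LETTER A.
as_utf8 = raw.decode("utf-8")
print(f"U+{ord(as_utf8):04X}")  # U+FF21

# Re-saving the three Latin-1 characters as UTF-8 yields six bytes,
# matching what was observed on the converted test page.
reencoded = as_latin1.encode("utf-8")
print(reencoded.hex(" ").upper())  # C3 AF C2 BC C2 A1

# Path component: "%C3%BC" percent-decoded as UTF-8 gives "ü".
path_segment = unquote("D%C3%BCrst", encoding="utf-8")
print(path_segment)  # Dürst

# Query component: the percent-encoded form depends on the charset used.
query_as_page_encoding = quote(as_latin1, encoding="iso-8859-1")
query_as_utf8 = quote(as_latin1, encoding="utf-8")
print(query_as_page_encoding)  # %EF%BC%A1
print(query_as_utf8)           # %C3%AF%C2%BC%C2%A1
```

The last two lines are the crux of the disagreement: the same three visible characters serialize to different URIs depending on whether the query is encoded in the page encoding or in UTF-8.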
Received on Friday, 22 July 2011 07:46:55 UTC