RE: How browsers display IRI's with mixed encodings from Leif Halvard Silli on 2011-07-22 (public-iri@w3.org from July 2011)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Fri, 22 Jul 2011 03:40:21 +0200
To: "Phillips, Addison" <addison@lab126.com>
Cc: Chris Weber <chris@lookout.net>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <20110722034021252246.ebf3242e@xn--mlform-iua.no>

Phillips, Addison, Thu, 21 Jul 2011 17:30:38 -0700:

> What's the fixation with ISO-8859-1? 
 [ ... ]
> In other words, leaving aside the query part for a moment, shouldn't 
> IRI really say that valid UTF-8 sequences are interpreted as 
> characters and invalid UTF-8 sequences are treated as bytes?

Such a rule could be quite elegant: if the URL "looks nice", then it is 
probably correctly encoded. That's the positive side of it, which I 
quite like.
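
To make the proposed rule concrete, here is a minimal Python sketch. The 
function name display_form is made up for illustration, and the sketch 
judges the whole string at once rather than sequence by sequence, which 
is a simplification of what Addison describes:

```python
from urllib.parse import unquote_to_bytes

def display_form(escaped: str) -> str:
    """Hypothetical helper: show characters only if the bytes are valid UTF-8."""
    raw = unquote_to_bytes(escaped)      # "%C3%BC" -> b"\xc3\xbc"
    try:
        return raw.decode("utf-8")       # valid UTF-8: interpret as characters
    except UnicodeDecodeError:
        return escaped                   # invalid UTF-8: leave the escaped bytes alone

print(display_form("D%C3%BCrst"))        # valid UTF-8, so it "looks nice"
print(display_form("D%FCrst"))           # lone 0xFC is not valid UTF-8
```

With this rule, only the first link is displayed as "Dürst"; the second 
one stays in its escaped form.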

> Forced 
> interpretation of bytes in an unknown encoding leads to errors. 
> Especially since, as Leif pointed out, you can't see the difference 
> in the wire format.

But Martin's test shows that treating the data as bytes when it isn't 
valid UTF-8 is not a safe route to success. The situation might be 
different for fragment URLs: it should be pretty safe, or at least 
safer, to interpret fragment URLs as bytes, as that is likely to give 
the same result regardless of how they are interpreted.

But when it comes to links to external pages, the only meaningful 
approach, it seems to me, is to assume that the resource has a Unicode 
name.

I would suggest that one should be fixated on UTF-8 rather than on 
HTML5's (global) default legacy encoding (Windows-1252): although 
href="D%FCrst" in a (local) default legacy encoding such as 
Windows-1251 would render as "D<cyrillic-soft-sign>rst" (and thus not 
lead anywhere), we can assume that href="D%FCrst" works, or could work 
well, in the intended encoding. Thus, it would for the most part work 
very well if '%FC' is interpreted as the UTF-8 representation of 'ü' 
inside Windows-1252 pages and as the UTF-8 representation of 
'<cyrillic-soft-sign>' inside Windows-1251 pages.
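
As a rough sketch of that idea in Python (the function name 
reencode_as_utf8 is hypothetical, and the sketch assumes the page's 
encoding is already known; cp1252 and cp1251 are Python's codec names 
for Windows-1252 and Windows-1251):

```python
from urllib.parse import quote, unquote_to_bytes

def reencode_as_utf8(escaped: str, page_encoding: str) -> str:
    """Hypothetical helper: read the escaped bytes in the page's encoding,
    then escape the resulting characters again as UTF-8."""
    chars = unquote_to_bytes(escaped).decode(page_encoding)
    return quote(chars, safe="/")        # quote() encodes str as UTF-8 by default

# The same href gives a different UTF-8 name depending on the page encoding.
print(reencode_as_utf8("D%FCrst", "cp1252"))   # 0xFC read as 'ü'
print(reencode_as_utf8("D%FCrst", "cp1251"))   # 0xFC read as the Cyrillic soft sign
```

The first call yields D%C3%BCrst and the second D%D1%8Crst, so each link 
ends up pointing at the name that was meant in its own page's encoding.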

What would the drawback to such a solution be?

> Looking at your test page, I'm not sure how valid a test it is. The 
> page declares an encoding of ISO 8859-1. Having a "UTF-8 encoded 
> path" in the page is a lie. Those bytes are all valid windows-1252 
> characters (per HTML5, nearly all browsers treat ISO8859-1 as 
> windows-1252). So the path isn't actually "UTF-8 encoded". To me the 
> test looks broken.

My main gripe with that test is that I think it is quite important to 
test with links that actually lead to a Web resource, rather than 
"dry tests" which focus solely on display.
-- 
Leif H Silli

Received on Friday, 22 July 2011 01:40:59 UTC