- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Fri, 22 Jul 2011 03:40:21 +0200
- To: "Phillips, Addison" <addison@lab126.com>
- Cc: Chris Weber <chris@lookout.net>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Phillips, Addison, Thu, 21 Jul 2011 17:30:38 -0700:

> What's the fixation with ISO-8859-1?
[ ... ]
> In other words, leaving aside the query part for a moment, shouldn't
> IRI really say that valid UTF-8 sequences are interpreted as
> characters and invalid UTF-8 sequences are treated as bytes?

Such a rule could be quite elegant: if the URL "looks nice", then it is probably correctly encoded. That's the positive side of it, which I quite like.

> Forced
> interpretation of bytes in an unknown encoding leads to errors.
> Especially since, as Leif pointed out, you can't see the difference
> in the wire format.

But Martin's test shows that treating the sequence as bytes when it isn't valid UTF-8 is not a safe path to success. The situation might be different when it comes to fragment URLs: it should be pretty safe, or at least safer, to interpret fragment URLs as bytes, as that is likely to give the same result regardless of how they are interpreted. But when it comes to links to external pages, the only meaningful thing, it seems to me, is to assume that the resource has a Unicode name.

I would suggest that one should be fixated on UTF-8 rather than on HTML5's (global) default legacy encoding (Windows-1252): even though href="D%FCrst" in a (local) default legacy encoding such as Windows-1251 would render as "D<cyrillic soft sign>rst" (and thus not lead anywhere), we can assume that href="D%FCrst" works, or could work well, in the intended encoding. Thus, it would for the most part work very well if '%FC' is interpreted as the UTF-8 representation of 'ü' inside Windows-1252 pages and as the UTF-8 representation of '<cyrillic soft sign>' inside Windows-1251 pages. What would the drawback to such a solution be?

> Looking at your test page, I'm not sure how valid a test it is. The
> page declares an encoding of ISO 8859-1. Having a "UTF-8 encoded
> path" in the page is a lie. Those bytes are all valid windows-1252
> characters (per HTML5, nearly all browsers treat ISO8859-1 as
> windows-1252). So the path isn't actually "UTF-8 encoded". To me the
> test looks broken.

My main gripe with that test is that I think it is quite important to test with links which actually lead to a Web resource, rather than "dry tests" which solely focus on display.
-- 
Leif H Silli
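
For illustration only, here is a minimal Python sketch of the interpretation discussed in this message: percent-escapes that form valid UTF-8 are taken as UTF-8, and anything else is read in the page's legacy encoding and then re-expressed as UTF-8. The helper names `decode_path` and `to_utf8_wire` are hypothetical and not from any specification; the codec names `cp1252` and `cp1251` are Python's names for Windows-1252 and Windows-1251.

```python
from urllib.parse import quote, unquote_to_bytes


def decode_path(percent_encoded: str, page_encoding: str) -> str:
    """Hypothetical helper: valid UTF-8 wins, otherwise use the page encoding."""
    raw = unquote_to_bytes(percent_encoded)  # "%FC" becomes the single byte 0xFC
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume the escapes were written in the page's encoding.
        return raw.decode(page_encoding)


def to_utf8_wire(percent_encoded: str, page_encoding: str) -> str:
    """Hypothetical helper: re-escape the decoded characters as UTF-8."""
    return quote(decode_path(percent_encoded, page_encoding))


# The same href yields different characters depending on the page encoding.
print(to_utf8_wire("D%FCrst", "cp1252"))     # u-umlaut, re-escaped as UTF-8
print(to_utf8_wire("D%FCrst", "cp1251"))     # Cyrillic soft sign, re-escaped as UTF-8
print(to_utf8_wire("D%C3%BCrst", "cp1251"))  # already valid UTF-8, left unchanged
```

Note that the sketch shares the weakness pointed out in the quoted text: a byte sequence that happens to be valid UTF-8 is also valid Windows-1252, so the wire format alone cannot say which was intended.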
Received on Friday, 22 July 2011 01:40:59 UTC