- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Fri, 02 May 2003 08:59:47 +0200
- To: Martin Duerst <duerst@w3.org>
- Cc: public-iri@w3.org, uri@w3.org
* Martin Duerst wrote:
>> >Issue http://www.w3.org/International/iri-edit#idnuri-02 is about
>> >whether to use %-escaping or punycode to map the domain name part
>> >of an IRI to an URI. This was discussed at the IETF in San Francisco,
>> >and the general tendency there seemed to be towards using punycode.
>>
>>The hostname production rule in RFC 2396 (as updated by RFC 2732)
>>does not allow %-escaping, using %-escaping for URI conversion is
>>thus not an option, so why was this at all subject to discussion?
>
>Because of a lot of reasons:
>
>[...]

These are reasons to change RFC 2396 in a way that allows %-escapes in
the hostname component (and probably other components). Has this been
considered and refused? The only reason to create %-escapes in the
hostname part when converting IRIs to URIs is simplicity, and simplicity
is bad when it breaks conforming implementations.

Even though I wrote software that does not handle %-escapes in the
hostname component (it either fails to parse the URI or keeps the escape
sequence), and I know of a lot of code that does not support them either
(many PHP scripts, for example), I would support changing the production
rule in RFC 2396bis, e.g.

>- IRIs are defined in many specs (XML, XLink, XML Schema,...) without
>  the exception for domain names, so there is code out there that
>  produces (or will produce, once it receives IDNs) %-escapes.

this is a very good reason to do so. Various W3C Recommendations require
or recommend that conforming applications generate invalid URIs (as error
recovery from invalid URIs, oh well...). Not all implementations
implement these recommendations or requirements (not even the W3C MarkUp
Validator does), but there is indeed a lot of code out there that does,
so we have an interoperability problem here.

If RFC 2396bis does not change the hostname production rule (and probably
others), should there be errata for HTML4, XML, etc. to deal with
internationalized domain names?
Errata or not, should existing applications be updated to avoid
%-escapes in the hostname part? I implemented what HTML4 recommends for
invalid URIs in HTML Tidy; it currently "correctly" changes

  <a href='http://björn.höhrmann.de/'>...</a>

to

  <a href='http://bj%C3%B6rn.h%C3%B6hrmann.de/'>...</a>

Should Tidy try to keep the hostname part unchanged, should it use
punycode, or should Tidy continue to create invalid URI references? Tidy
gives a warning if a %URI; attribute value contains invalid characters;
should Tidy give an additional warning if the hostname component
contains invalid/non-ASCII characters? Should Tidy give a specific
warning if the hostname component contains %-escapes?

While we are here, should Tidy NFC-normalize the %URI; attribute value
before escaping it (IRI draft) or should it keep it as-is (HTML 4.01)?

>The reasons I know against using %-escapes are:

  * A lot of existing software does not handle %-escapes in hostnames;
    they break existing conforming software.

  * Spam filters might consider messages containing them as spam, since
    up to now invalid %-escapes in the hostname were used only by spam
    messages to hide the real link destination.

  * punycode can safely be transcoded to Unicode; what domain name does
    http://bj%F6rn/ refer to? IRI conversion could generate such a URI
    if the %F6 is an artifact from URI=>IRI conversion, and if you
    expect implementers to handle %-escapes, you need to define how;
    punycode already solved this issue.

  * If you have no idea what %C3%B6 could mean, what looks worse,
    %-escapes or punycode? Ok, %-escapes might win here...

Well, I am in favour of a change to RFC 2396bis and I agree that the
benefits outweigh the reasons against such a change, and iff RFC 2396bis
changes the production rule(s), the IRI specification could use
%-escapes rather than punycode. I am uncertain whether it should;
punycode is more reliable, while %-escapes are consistent with existing
defined error recovery behaviour.

So, again, why not change RFC 2396bis?
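
For illustration, the two hostname mappings discussed above can be
compared in a few lines of Python. This is only a sketch, not what Tidy
or any specification does: the function names are made up for this
example, NFC normalization before escaping is assumed (the IRI draft
behaviour asked about above), and the punycode form relies on Python's
built-in "idna" codec, which implements the IDNA rules of RFC 3490.

```python
import unicodedata
from urllib.parse import quote, unquote


def host_to_percent_escapes(host: str) -> str:
    """NFC-normalize (an assumption), UTF-8 encode, then %-escape non-ASCII bytes."""
    return quote(unicodedata.normalize("NFC", host), safe=".-")


def host_to_punycode(host: str) -> str:
    """Convert each label to its ASCII "xn--" form using the stdlib IDNA codec."""
    return host.encode("idna").decode("ascii")


def host_from_percent_escapes(host: str) -> str:
    """Undo %-escapes assuming UTF-8; raises UnicodeDecodeError if that fails."""
    return unquote(host, encoding="utf-8", errors="strict")


if __name__ == "__main__":
    host = "björn.höhrmann.de"
    print(host_to_percent_escapes(host))
    print(host_to_punycode(host))
    try:
        # %F6 alone is not valid UTF-8, so the intended name is ambiguous.
        host_from_percent_escapes("bj%F6rn")
    except UnicodeDecodeError:
        print("bj%F6rn: not UTF-8, cannot tell which domain name is meant")
```

The last call shows the ambiguity from the list of reasons: nothing in
bj%F6rn says which character encoding the escaped byte came from, while
the punycode form always decodes to a single Unicode name.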
Received on Friday, 2 May 2003 03:00:10 UTC