Re: issue idnuri-02: New approach, new text

* Martin Duerst wrote:
>> >Issue http://www.w3.org/International/iri-edit#idnuri-02 is about
>> >whether to use %-escaping or punycode to map the domain name part
>> >of an IRI to an URI. This was discussed at the IETF in San Francisco,
>> >and the general tendency there seemed to be towards using punycode.
>>
>>The hostname production rule in RFC 2396 (as updated by RFC 2732)
>>does not allow %-escaping, using %-escaping for URI conversion is
>>thus not an option, so why was this at all subject to discussion?
>
>Because of a lot of reasons:
>
>[...]

These are reasons to change RFC 2396 in a way that allows %-escapes
in the hostname component (and probably other components). Has this
been considered and refused?

The only reason to create %-escapes in the hostname part when converting
IRIs to URIs is simplicity, and simplicity is bad when it breaks
conforming implementations. Even though I have written software that does
not handle %-escapes in the hostname component (it either fails to parse
the URI or keeps the escape sequence), and I know of a lot of code that
does not support them either (many PHP scripts, for example), I would
support changing the production rule in RFC 2396bis, e.g.

>- IRIs are defined in many specs (XML, XLink, XML Schema,...) without
>   the exception for domain names, so there is code out there that
>   produces (or will produce, once it receives IDNs) %-escapes.

this is a very good reason to do so. Various W3C Recommendations require
or recommend that conforming applications generate invalid URIs (as error
recovery from invalid URIs, oh well...). Not all implementations
implement these recommendations or requirements (not even the W3C MarkUp
Validator does), but there is indeed a lot of code out there that does,
so we have an interoperability problem here.

If RFC 2396bis does not change the hostname production rule (and
probably others), should there be errata for HTML4, XML, etc. to deal
with internationalized domain names? Errata or not, should existing
applications be updated to avoid %-escapes in the hostname part? I
implemented what HTML4 recommends for invalid URIs in HTML Tidy; it
currently "correctly" changes

  <a href='http://björn.höhrmann.de/'>...</a>

to

  <a href='http://bj%C3%B6rn.h%C3%B6hrmann.de/'>...</a>
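
To make the two candidate mappings concrete, here is a minimal Python sketch (not Tidy's code) that applies both to the hostname above. It assumes Python's built-in "idna" codec, which implements the IDNA 2003 ToASCII operation, as a stand-in for the punycode conversion under discussion; the variable names are only illustrative.

```python
from urllib.parse import quote

host = "björn.höhrmann.de"

# Option 1: %-escape the UTF-8 bytes of the hostname,
# which is what the Tidy output above contains.
percent_escaped = quote(host)

# Option 2: punycode each label (IDNA 2003 ToASCII via the built-in codec).
punycode = host.encode("idna").decode("ascii")

print(percent_escaped)  # bj%C3%B6rn.h%C3%B6hrmann.de
print(punycode)         # xn--bjrn-6qa.xn--hhrmann-90a.de
```

Only the second form fits the hostname production rule in RFC 2396 as it stands.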

Should Tidy try to keep the hostname part unchanged, should it use
punycode, or should it continue to create invalid URI references? Tidy
gives a warning if a %URI; attribute value contains invalid characters.
Should it give an additional warning if the hostname component
contains invalid/non-ASCII characters? Should it give a specific
warning if the hostname component contains %-escapes?
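
The two extra warnings asked about here could be sketched roughly as follows. This is a hypothetical Python illustration, not Tidy's implementation: the function name `hostname_warnings` and the warning texts are made up for the example, and `urllib.parse.urlsplit` stands in for whatever URI parser an application actually uses.

```python
from urllib.parse import urlsplit

def hostname_warnings(uri):
    """Return warnings about the hostname component (hypothetical helper)."""
    # urlsplit lowercases the hostname but leaves %-escapes and
    # non-ASCII characters in place, so both can still be detected.
    host = urlsplit(uri).hostname or ""
    warnings = []
    if not host.isascii():
        warnings.append("hostname contains non-ASCII characters")
    if "%" in host:
        warnings.append("hostname contains %-escapes")
    return warnings

print(hostname_warnings("http://björn.höhrmann.de/"))
print(hostname_warnings("http://bj%C3%B6rn.h%C3%B6hrmann.de/"))
```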

While we are at it, should Tidy NFC-normalize the %URI; attribute value
before escaping it (IRI draft) or should it keep it as-is (HTML 4.01)?
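
The difference is visible in the escaped output. A small Python sketch, assuming the attribute value arrives in decomposed form (an "o" followed by U+0308 COMBINING DIAERESIS); the variable names are illustrative only:

```python
import unicodedata
from urllib.parse import quote

decomposed = "bjo\u0308rn"  # 'o' + U+0308, renders like "björn"
composed = unicodedata.normalize("NFC", decomposed)  # single U+00F6

# Kept as-is (HTML 4.01): the combining character is escaped separately.
as_is = quote(decomposed)

# NFC-normalized first (IRI draft): one precomposed character is escaped.
normalized = quote(composed)

print(as_is)       # bjo%CC%88rn
print(normalized)  # bj%C3%B6rn
```

The two results are different URIs, so the choice affects whether the reference still matches the intended resource.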

>The reasons I know against using %-escapes are:

  * A lot of existing software does not handle %-escapes in hostnames,
    so generating them breaks existing conforming software

  * Spam filters might consider messages containing them as spam, since
    up to now invalid %-escapes in the hostname were used only by spam
    messages to hide the real link destination

  * punycode can safely be transcoded to Unicode; what domain name
    does http://bj%F6rn/ refer to? IRI conversion could generate such
    a URI if the %F6 is an artifact of URI=>IRI conversion. If you
    expect implementers to handle %-escapes, you need to define how;
    punycode has already solved this issue.

  * if you have no idea what %C3%B6 could mean, what looks worse,
    %-escapes or punycode? Ok, %-escapes might win here...

Well, I am in favour of a change to RFC 2396bis and I agree that the
benefits outweigh the reasons against such a change, and iff RFC 2396bis
changes the production rule(s), the IRI specification could use %-escapes
rather than punycode. I am uncertain whether it should: punycode is more
reliable, while %-escapes are consistent with existing defined error
recovery behaviour.

So, again, why not change RFC 2396bis?

Received on Friday, 2 May 2003 03:00:10 UTC