Re: Using Punicode for host names in IRI -> URI translation from Martin J. Dürst on 2009-11-18 (public-iri@w3.org from November 2009)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Wed, 18 Nov 2009 20:02:21 +0900
To: Larry Masinter <masinter@adobe.com>
CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>, Pete Resnick <presnick@qualcomm.com>, Ted Hardie <ted.ietf@gmail.com>
Message-ID: <4B03D43D.1040406@it.aoyama.ac.jp>

Hello Larry, others,

Many thanks for picking up these topics. I'll answer each of the issues 
separately, which I hope will simplify the discussion. I'll try to 
follow up on the "phishing" and "comparison" topics tomorrow (or 
possibly Friday).

On 2009/11/18 3:56, Larry Masinter wrote:
> These are "strawman" proposals in response to the IAB talk at the IETF meeting last week: knock down if you can.
> ----
>
>
>   1.  Punicode for host names vs. non-public-DNS resolution
> A number of the concerns about using punicode for domain names when doing
> IRI -> URI translation seem to have come from the fact that there are
> widely deployed private networks which don't use punicode at all, but
> rather send UTF8 directly to DNS, which the DNS protocol allows.

Yes.


> However, I think for the use case of URI, we could take the position that URIs are really
> intended to be "UNIFORM" resource identifiers, whose primary use is for communication over
> the world-wide web, and that that use case should predominate, and that, for that reason,
> IRI -> URI translation MUST use punicode for host names.

Why MUST? I understand the goal is to not send %-encoding through the 
DNS, which is indeed a desirable goal, but URIs and the DNS aren't 
exactly the same, and sending some %-characters to the DNS won't create 
many false positives, so I don't understand why you give this goal such 
a high, virtually absolute, priority.

And what should browsers and other software do with URIs that already 
contain %-encoding in their domain-name part?

> We should then note that private
> environments with additional mappings may need to deploy software that
>
>   *   uses the IRI form directly (i.e., doesn't translate IRI -> URI first)
>   *   translates punicode host names back into UTF8 for sending to locally specified host name mappings (i.e., undoes the IRI -> URI translation)
>   *   provides alternative registration or lookup services for the punicode version of host names

This looks like the easiest path forward in the short term. And RFC 3987 
already allowed using punycode when converting from an IRI to a URI. (I 
agree that the wording there wasn't optimal, and can and should be 
improved.)
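
As a rough illustration of the two host-name forms under discussion, here 
is a minimal Python sketch. The host name "bücher.example" and both 
function names are made up for this example, and Python's built-in "idna" 
codec follows the older IDNA 2003 rules, so treat it as a sketch rather 
than as what any particular resolver or browser does.

```python
def punycode_host(host: str) -> str:
    """Host name as it would appear in a URI after a punycode translation."""
    # The built-in "idna" codec converts each non-ASCII label to "xn--..." form.
    return host.encode("idna").decode("ascii")


def utf8_host_bytes(host: str) -> bytes:
    """Host name as raw UTF-8 bytes, as some private networks send it to DNS."""
    return host.encode("utf-8")


print(punycode_host("bücher.example"))    # xn--bcher-kva.example
print(utf8_host_bytes("bücher.example"))  # b'b\xc3\xbccher.example'
```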

But because domain names in percent-encoded form will turn up anyway in 
various places (e.g. in the query parts of URIs), it doesn't make sense 
to disallow them in one location in a URI with a "MUST punycode". It 
would be much better to look at the long-term, well-layered structure of 
the whole thing. For me, the answer to the question on slide 23 of 
http://www.ietf.org/proceedings/09nov/slides/plenaryt-1.pdf

 >>>>
     How Many Layers of Encoding?

• How do we encode:
       A domain name…
           in an email address…
               in a “mailto” URL…
                   in a web page?

• Do we use:
   – Punycode (“xn‐‐…”) encoding for the domain name?
   – Email Quoted‐Printable (“=XX”) encoding?
   – URL percent (“%XX”) escaping?
   – HTML ampersand (“&#xxxx;”) codes?
• All of the above?
 >>>>

is that there are two rules:
1) When you can, use the character directly
2) When you can't, use the escaping convention of the current format

For the above question, it would mean that because we are in a Web page, 
we use the character directly, or use &#xHHHH; if the character cannot 
be represented in the encoding of the Web page.
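
To make the Web page case concrete, here is a small Python sketch of the 
two rules applied at the HTML layer. The function name and the sample 
host name are assumptions made for illustration, and the sketch ignores 
the separate need to escape markup characters such as "&" and "<".

```python
def for_web_page(text: str, page_encoding: str) -> str:
    """Keep each character if the page encoding can represent it (rule 1),
    otherwise fall back to an &#xHHHH; character reference (rule 2)."""
    out = []
    for ch in text:
        try:
            ch.encode(page_encoding)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(f"&#x{ord(ch):X};")
    return "".join(out)


print(for_web_page("bücher.example", "utf-8"))     # bücher.example
print(for_web_page("bücher.example", "us-ascii"))  # b&#xFC;cher.example
```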

If we remove the lowest part of the question and talk about
• How do we encode:
       A domain name…
           in an email address…
               in a “mailto” URL?

then we would use URI escaping conventions, i.e. UTF-8-based %HH. Of 
course, that applies only if we don't use an IRI; in an IRI the 
characters could again be used directly.
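
For the "mailto" layer, a minimal Python sketch of the same two rules 
might look as follows. The address is invented, and the choice to leave 
":" and "@" unescaped is a simplification for this example, not a full 
implementation of the mailto syntax.

```python
from urllib.parse import quote, unquote


def mailto_uri(address: str) -> str:
    """Build a mailto URI, escaping non-ASCII characters as UTF-8-based %HH."""
    return quote("mailto:" + address, safe=":@")


def mailto_iri(address: str) -> str:
    """Build a mailto IRI, where the characters can be used directly."""
    return "mailto:" + address


uri = mailto_uri("user@bücher.example")
print(uri)                                   # mailto:user@b%C3%BCcher.example
print(mailto_iri("user@bücher.example"))     # mailto:user@bücher.example
print(unquote(uri))                          # mailto:user@bücher.example
```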

If we remove another layer, we get:
• How do we encode:
       A domain name…
           in an email address?

I don't understand escaping conventions in email addresses per se very 
well, and I guess they may not be applicable because they don't deal 
with escaping arbitrary Unicode characters, but if we are speaking about 
an EAI email address, we wouldn't need any escaping at all.

If we remove another layer, we get:
• How do we encode:
       A domain name?

Here we have to choose punycode or UTF-8 depending on the circumstances. 
It is clear that on the public Internet, in DNS request and response 
packets, we have to use punycode. But if somebody designed an API to 
handle the issues discussed in draft-iab-idn-encoding-01.txt, I very 
much hope s/he would design it with UTF-8 (or whatever other encoding of 
Unicode is appropriate for the platform, e.g. UTF-16 for Java) at the 
core, potentially with the option to detect (and if necessary "backfix") 
punycode and UTF-8-based %-encoding, rather than with punycode at the core.
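
A hypothetical sketch of such an API in Python, with Unicode strings at 
the core, could look like this. Both function names are invented here, 
the built-in "idna" codec implements IDNA 2003 only, and malformed input 
simply raises an exception instead of being handled gracefully.

```python
from urllib.parse import unquote


def host_to_unicode(host: str) -> str:
    """Return the Unicode form of a host name given as Unicode,
    as UTF-8-based %HH escapes, or as punycode ("xn--") labels."""
    # "Backfix" percent-encoding; invalid UTF-8 raises UnicodeDecodeError.
    host = unquote(host, encoding="utf-8", errors="strict")
    labels = []
    for label in host.split("."):
        # "Backfix" punycode, one label at a time.
        if label.lower().startswith("xn--"):
            label = label.lower().encode("ascii").decode("idna")
        labels.append(label)
    return ".".join(labels)


def host_for_public_dns(host: str) -> str:
    """Convert to punycode only at the edge, when talking to the public DNS."""
    return host_to_unicode(host).encode("idna").decode("ascii")


print(host_to_unicode("b%C3%BCcher.example"))      # bücher.example
print(host_to_unicode("xn--bcher-kva.example"))    # bücher.example
print(host_for_public_dns("b%C3%BCcher.example"))  # xn--bcher-kva.example
```

The point of this layout is that punycode appears only in the last step, 
while everything else works on the characters themselves.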

Architecturally, the reasons for this should be very clear. What I think 
we have to do for the IRI spec is to try and make things go in the 
right direction architecturally without forbidding short-term shortcuts.

In terms of browser implementations, we are already half-way there (in 
all the cases below, I tested putting an IRI into the address field; all 
tests were done on Windows Vista):

- Opera, Amaya, and Google Chrome support %-escaping for non-ASCII 
characters in domain names.

- Mozilla (3.0.15) doesn't support it directly, but converts the 
%-escaping back to Unicode characters in the address field. This means 
that you have to go back again to the address field and hit return to 
find the page. This seems rather accidental, and probably the best thing 
is to fix it, but I'd be glad to hear from somebody involved in Mozilla 
development.

- Safari (4.0.3 (531.9.1)) is even more confusing: It converts the 
address field from percent-escaping to punycode but then claims that the 
punycode can't be reached (although I verified that it got the right 
punycode). It puts the readable (IRI) version of the domain name into a 
Google search box on the page. Again, if you go back to the address 
field and hit return, it finds the page, and converts the address field 
to an IRI. Again, this seems quite accidental, and probably the best 
thing is to fix it so that it works directly.

- IE 7 is the only browser that is straightforward: percent-escaping 
stays as percent-escaping, and the site isn't reached.

Regards,    Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Wednesday, 18 November 2009 11:03:24 UTC