RE: Using Punicode for host names in IRI -> URI translation from Larry Masinter on 2009-11-20 (public-iri@w3.org from November 2009)

From: Larry Masinter <masinter@adobe.com>
Date: Fri, 20 Nov 2009 09:24:08 -0800
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>, Pete Resnick <presnick@qualcomm.com>, Ted Hardie <ted.ietf@gmail.com>
Message-ID: <8B62A039C620904E92F1233570534C9B0118DC8FB024@nambx04.corp.adobe.com>
Larry:
>> However, I think for the use case of URI, we could take the position that URIs
>> are really intended to be "UNIFORM" resource identifiers, whose primary use is 
>> for communication over the world-wide web, and that that use case should 
>> predominate, and that, for that reason, IRI ->  URI translation MUST use 
>> punicode for host names.

Martin:

> Why MUST? I understand the goal is to not send %-encoding through the 
> DNS, which is indeed a desirable goal, but URIs and the DNS aren't 
> exactly the same, and sending some %-characters to the DNS won't create 
> many false positives, so I don't understand why you give this goal such 
> a high, virtually absolute, priority.

It isn't a "virtually absolute" priority, it's just a "greater" priority:
I don't see that there are any use cases where doing otherwise would be
better.  What is the advantage in making this optional?

Having an option for how to process URIs -- where some systems
do one thing and others do something else -- is not a situation we should
encourage or promote. It's harmful. 

Please note that I am suggesting requiring Punycode ONLY for the
host names in the "iregname" situation, not in the "mailto:"
situation at all.


> And what should browsers and other software do with URIs that already 
> contain %-encoding in their domain-name part?

If there are any -- which I hope is rare -- I think we should recommend
percent-decoding them to UTF-8, converting that to Unicode, and then
Punycoding those too.
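
To make that back-conversion concrete, here is a minimal sketch in
Python. It assumes the standard library's "idna" codec (which
implements IDNA 2003) is an acceptable stand-in for whatever ToASCII
step a real implementation would use; the name repunycode_host is
hypothetical.

```python
from urllib.parse import unquote


def repunycode_host(host: str) -> str:
    """Turn a percent-encoded UTF-8 host name into its Punycode (xn--) form."""
    # Step 1: undo the %HH escaping, reading the resulting bytes as UTF-8.
    # errors="strict" makes malformed UTF-8 raise instead of being replaced.
    unicode_host = unquote(host, encoding="utf-8", errors="strict")
    # Step 2: apply ToASCII label by label (Python's codec follows IDNA 2003).
    return unicode_host.encode("idna").decode("ascii")


print(repunycode_host("b%C3%BCcher.example"))
```

A host name that is already plain ASCII passes through unchanged, so
the same step can be applied to every host without checking first.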



>>We should then note that private
>>environments with additional mappings may need to deploy software that
>>
>>* uses the IRI form directly (i.e., don't translate IRI ->  URI first)
>>* translates punicode host names back into UTF8 for sending to locally
>>  specified host name mappings  (i.e., undo the IRI->URI translation)
>>*  provide alternative registration or lookup services for punicode 
>>   version of host names

> This looks like the easiest path forward in the short term. And RFC 3987 
> already allowed using punycode when converting from an IRI to an URI. (I 
> agree that the wording there wasn't optimal, and can and should be 
> improved.)

> But because domain names in percent-encoded form will turn up anyway in 
> various places (i.e. in query parts in URIs,...), it doesn't make sense 
> to disallow them in one location in an URI with a "MUST punycode".

I don't think this is an appropriate response. The scope of a MUST
is always "software that is compliant with this specification".

When the Romans were mandating the standard for roads across the
Roman Empire, they could write:

The width of the wheel ruts of a roman road MUST be 2 full paces. 
However, there are chariots with wider and narrower spacing between
the wheels, so road builders SHOULD allow tolerances.

Converting from an IRI to a URI MUST use Punycode to translate
a non-ASCII host name into US-ASCII. However, since there may be
some URIs with percent-encoded UTF-8 host names, those
SHOULD be converted back into Unicode and then Punycoded too.
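
As a rough illustration of what such a conversion might look like,
here is a sketch in Python. It is not a complete implementation of
the RFC 3987 mapping: userinfo is dropped, the "idna" codec is
IDNA 2003 only, and the name iri_to_uri and the SAFE character set
are my own assumptions for the example.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left alone outside the host: URI reserved characters plus
# "%", so that escapes already present are not encoded a second time.
SAFE = "/?:@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Convert an IRI to a URI: Punycode the host, %HH-escape the rest."""
    parts = urlsplit(iri)
    host = parts.hostname or ""
    # Host name: ToASCII (Punycode), never percent-encoding.
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc += f":{parts.port}"
    # Everything else: UTF-8 based percent-encoding of non-ASCII characters.
    path = quote(parts.path, safe=SAFE)
    query = quote(parts.query, safe=SAFE)
    fragment = quote(parts.fragment, safe=SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://bücher.example/straße"))
```

The point of the sketch is only that the two encodings apply to
different components: the host goes through Punycode, everything
after it goes through %HH.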

> It would be much better to look at the long-term, well-layered structure of 
> the whole thing. For me, the answer to the question on slide 23 of 
> http://www.ietf.org/proceedings/09nov/slides/plenaryt-1.pdf


In the long term, systems should use IRIs with Unicode directly and
not convert anything into ASCII.

> • How do we encode:
>       A domain name…
>           in an email address…
>               in a “mailto” URL…
>                   in a web page?

This scenario is a red herring in several different ways:
a) I'm only talking about the handling of host names in the
   iregname component of things like:

    scheme://iregname/path

   which doesn't cover domain names in email addresses.

b) Even then, I'm only talking about interoperability with legacy
   environments that do not support Unicode directly. "Web pages"
   support using Unicode directly, and UTF-8 support is mandated
   and well-supported in all current browsers.  So the answer
   is that no encoding is needed (other than UTF-8, so that the
   web page's sequence of Unicode characters can be represented
   as a sequence of bytes).

> • Do we use:
>   – Punycode (“xn‐‐…”) encoding for the domain name?
>   – Email Quoted‐Printable (“=XX”) encoding?
>   – URL percent (“%XX”) escaping?
>   – HTML ampersand (“&#xxxx;”) codes?
> • All of the above?

> Is that there are two rules:
> 1) When you can, use the character directly
> 2) When you can't, use the escaping convention of the current format

I agree with the conclusions for this case, but not the
generalization of your "two rules" to other situations. 

Producers of web pages shouldn't use any of these.

Web pages allow direct encoding of Unicode domain names
in HTML. Support of UTF-8 is required. All current browsers
and deployed environments support UTF-8.  
So for current software, "when you can't" doesn't apply.

Consumers of web pages MUST handle URL percent escaping
and HTML ampersand codes. Independent of the scheme
and any other encodings allowed. Those encodings are
processed before the scheme is even looked at or parsed.
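
A small Python sketch of that ordering, under my own simplifying
assumptions: the name decode_href is hypothetical, and a real
consumer would need more care with percent-encoded reserved
characters than this shows.

```python
from html import unescape
from urllib.parse import unquote


def decode_href(raw: str) -> tuple[str, str]:
    """Undo HTML and percent escaping, then split off the scheme."""
    # Layer 1: HTML character references such as &#xFC; (the page format).
    text = unescape(raw)
    # Layer 2: %HH escapes, interpreted as UTF-8.
    text = unquote(text)
    # Only now is the scheme looked at.
    scheme, _, rest = text.partition(":")
    return scheme.lower(), rest


print(decode_href("mailto:user@b&#xFC;cher.example"))
print(decode_href("mailto:user@b%C3%BCcher.example"))
```

Both inputs end up as the same scheme and the same Unicode address,
without the code ever needing to know it was a mailto: reference.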

I think the "mailto:" IRI scheme should disallow Email
Quoted-printable encoding, so that one doesn't apply.
Punycode encoding for domain names should be discouraged,
but consumers of mailto IRIs may need to be prepared to
decode it anyway (I'm not sure whether there are workflows
that would produce it).

For legacy software... well, we've covered that. Treat legacy
software that doesn't support UTF-8 as also software that
doesn't support anything other than ASCII.

> For the above question, it would mean that because we are in a Web page, 
> we use the character directly, or use &#xHHHH; if the character cannot 
> be represented in the encoding of the Web page.

I think "web pages should use UTF-8" is a better answer;
if you're going to have to use *any* encoding anywhere,
direct UTF-8 representation of the web page is better than
anything else.

> If we remove the lowest part of the question and talk about
> • How do we encode:
>       A domain name…
>           in an email address…
>               in a “mailto” URL?

> then we would use URI escaping conventions, i.e. UTF-8-based %HH, of 
> course only if we don't use an IRI, in which case again the characters 
> could be used directly.

I don't think I disagree with the conclusion, but rather with
the way you are saying it: you make the exceptional, unusual,
legacy-compatible situation into the main case, and the
correct, appropriate and recommended path into an
"only if we don't ...".

The normal case should be: use the Unicode domain name
directly in an IRI.

Your tests of various browsers seem to have dealt with
http: IRIs and not mailto: IRIs, which confuses the situation
quite a bit.

Larry
Received on Friday, 20 November 2009 17:26:48 UTC