RE: query on iregname conversion from Larry Masinter on 2009-09-02 (public-iri@w3.org from September 2009)

From: Larry Masinter <masinter@adobe.com>
Date: Wed, 2 Sep 2009 16:28:15 -0700
To: "Roy T. Fielding" <fielding@gbiv.com>
CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <8B62A039C620904E92F1233570534C9B0118DB9AC3FB@nambx04.corp.adobe.com>

>> I think we should specify that pct-encoding is always decoded before
>> use of a component in resolution,
>
> Well, the concern was that if you mapped IRI -> URI by pct-encoding
> the entire URI, you would then wind up sending around URIs with
> pct-encoded domain names, into previously compliant URI processors
> that would send the pct-encoded domain name to DNS.

> Why do we care?  Yes, it is possible that such a thing would happen,
> but the result is "not found" (a safe answer).  

Some processors get "not found" and others get a correct result.
This isn't "uniform" behavior. If http://<nonascii>/path MAY be
translated into pct-encoded form and MAY also be translated into
punycode form, then the end-processors will behave differently.
That's a lack of interoperability.
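
To make the divergence concrete, here is a minimal Python sketch of the
two translations applied to the same host. The host name
"bücher.example" and both function names are illustrative assumptions,
not anything taken from the IRI draft; Python's built-in "idna" codec
(IDNA 2003) stands in for the punycode step.

```python
from urllib.parse import quote


def pct_encode_host(host: str) -> str:
    """Option 1: pct-encode the UTF-8 octets of the host."""
    return quote(host, safe="")


def punycode_host(host: str) -> str:
    """Option 2: convert each label to its IDNA (punycode) form."""
    return host.encode("idna").decode("ascii")


host = "bücher.example"  # hypothetical non-ASCII host
print(pct_encode_host(host))  # b%C3%BCcher.example
print(punycode_host(host))    # xn--bcher-kva.example
```

A processor that hands the first form straight to DNS will not find the
name; one that receives the second form will.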

> The same processors
> will need to be updated anyway to check for pct-encoded domains that
> were entered by hand or by reference, or generated by processors
> that do not know about IDNA but do pct-encode anything that is not
> a valid URI character.

Processors that are doing IRI -> URI mapping SHOULD
*also* undo pct-encoded domains, that's fine, we're
asking them to change anyway.
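
As a rough illustration of "undoing" a pct-encoded domain, the sketch
below decodes the pct-encoded octets as UTF-8 and then applies IDNA.
The function name normalize_host is hypothetical, and the sketch
assumes the input is a bare host with no port or userinfo.

```python
from urllib.parse import unquote


def normalize_host(host: str) -> str:
    """Decode pct-encoded octets (as UTF-8), then convert to IDNA form."""
    decoded = unquote(host)  # unquote decodes as UTF-8 by default
    return decoded.encode("idna").decode("ascii")


print(normalize_host("b%C3%BCcher.example"))  # xn--bcher-kva.example
```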

> In other words, the situation exists regardless of how complex we
> make IRI parsing, 

Hmmm, I'm trying to simplify IRI parsing by offering one
algorithm, not two.

> so the best solution is to fix the processor to
> handle both Unicode and pct-encoded octets gracefully rather than
> make IRI syntax scheme-dependent. 

I think there are different things going around as "processor"
in this discussion, so I'm not sure which one you mean.

(a) IRI consumers should handle Unicode and pct-encoded
   octets, (check)
(b) these are handled gracefully (well, I dunno, it's all pretty
   clunky, seems like it's less clunky than before)
(c) IRI syntax scheme-dependent (NO! The IRI *syntax* is
   uniform. The IRI processing rules are about the same.
   The only thing more complicated is IRI -> URI translation,
   which right now has two options and I think there should
   be one.)

>  This is no different than the
> introduction of Host in HTTP causing all preexisting clients to
> become gradually obsolete because they could not access the
> increasing number of name-based virtual hosts.

Not sure I understand the analogy.

> Don't you think we can update the IRI document (Proposed Standard) to
> not allow (MUST NOT) or at least not encourage (SHOULD NOT) any
> conversion of IRI -> URI that results in pct-encoded domain names,
> at least more readily than we can update the URI spec and also expect
> updates to http:, ftp:, telnet:, etc. etc. URI scheme implementations
> to mandate pct-decode+punycode-encode transformations
> before DNS resolution?

> No.  I consider that to be an impossible requirement without
> hardcoding the syntax of every scheme into the processor, which
> would be far worse than the disease you are trying to cure.

I'm not sure "disease" and "cure" are the right analogy.
I think there's one bit really: is authority a domain name
or not? Otherwise, there's no 'hard coding' really, just
an option.

I suppose this affects generic IRI -> URI translators, but
there aren't that many of them, and as systems get upgraded
to handle IRIs directly, there will be fewer, not more.
So it seems like a win to me.
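
The "one bit" can be sketched as a single flag on a generic translator.
This is only an illustration under stated assumptions: the function
iri_to_uri and its authority_is_domain_name parameter are hypothetical,
the authority is assumed to be a bare host (no port or userinfo), and
only the host and path are translated.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str, authority_is_domain_name: bool) -> str:
    """Translate an IRI to a URI; the flag selects how the host is mapped."""
    parts = urlsplit(iri)
    if authority_is_domain_name:
        # Domain name: use the IDNA (punycode) form.
        host = parts.netloc.encode("idna").decode("ascii")
    else:
        # Not a domain name: pct-encode the UTF-8 octets.
        host = quote(parts.netloc, safe="")
    # Keep "/" and existing pct-encodings in the path untouched.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


print(iri_to_uri("http://bücher.example/straße", True))
# http://xn--bcher-kva.example/stra%C3%9Fe
```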

Larry

Received on Wednesday, 2 September 2009 23:29:01 UTC