Re: uri handling of hosts is too restrictive from Martin Duerst on 2004-02-15 (uri@w3.org from February 2004)

From: Martin Duerst <duerst@w3.org>
Date: Sun, 15 Feb 2004 11:18:56 -0500
To: "Adam M. Costello BOGUS address, see signature" <BOGUS@BOGUS.nicemice.net>(by way of Martin Duerst <duerst@w3.org>), uri@w3.org
Cc: public-iri@w3.org
Message-Id: <4.2.0.58.J.20040215100619.058bd3c8@localhost>

Hello Adam,

Many thanks for your comments.

At 09:51 04/02/15 -0500, Adam M. Costello BOGUS address, see signature wrote:

>"Roy T. Fielding" <fielding@gbiv.com> wrote:
>
> > This was implemented as part of removing hostname productions in favor
> > of general registered names.
>
>Martin Duerst <duerst@w3.org> replied:
>
> > The restriction of hostnames to DNS was discussed and agreed on at the
> > San Francisco IETF based on interactions with IRIs.
> >
> > The argument was that conversion from IRIs to URIs (defined in the
> > IRI spec) should take care of conversion from non-ASCII characters to
> > punycode in the DNS part.
>
>I was very happy to see the IRI draft take that approach.  The issue is
>explained very well in the issues list (040-reg-name):
>
>     report: Martin Duerst, 20 Mar 2003, URI BOF:
>
>     In order for internationalized characters in the authority
>     component to be handled directly by an IRI processor, it must
>     either
>
>       a) be able to encode the authority characters as %hh and rely on
>          gethostbyname to do the conversion, or
>
>       b) know that the scheme uses hostport and not registry-based names
>          and thus be able to convert the hostname to IDNA form.
>
>     action: Roy T. Fielding, 20 Mar 2003, URI BOF:
>
>     Note that IDNA was created specifically to avoid (a), so that
>     doesn't seem to be a viable alternative for the IETF.
>
>Exactly.  Why go to the trouble of defining a backward-compatible
>encoding (ACE) and then make it impossible to use?

I don't think the current RFC2396bis draft says that you can't use
ACE. If you use ACE, it will just work.

And I think a) is a bit too short; it should read:
       a) be able to encode the authority characters as %hh and rely on
          gethostbyname or a layer (just) above it to do the conversion, or


>What's the point of
>downgrading an IRI to a URI if the URI still fails on legacy software?

In practice, things are a little bit more complicated, but that actually
makes this choice a little bit easier.

When implementing IRIs on something like a browser, what I have
seen (or done myself) so far is that it is much easier to implement
the UTF-8 and %-escape steps in one place, and the IDN -> punycode
step much lower in the stack.

The IRI draft (if and when I get around to doing the edits this afternoon)
will change to convert everything to %-escapes, but it will contain
a note that points out that for backwards compatibility, in particular
for proxy and similar scenarios where IRI -> URI mapping and DNS
resolution are strictly separated (and under the condition that
the scheme is known to be DNS-based), implementations MAY convert
directly to punycode.

So in theory, this is a black-and-white distinction, but in practice,
it's not.
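
To make the two mappings concrete, here is a minimal sketch in Python.
It is only an illustration, not text from the IRI draft: the function
names and the host "bücher.example" are made up for this example, and
Python's built-in "idna" codec implements the original IDNA (RFC 3490)
rules.

```python
from urllib.parse import quote

def host_to_percent_escapes(host: str) -> str:
    """Generic mapping: UTF-8 encode, then %-escape each non-ASCII octet."""
    # quote() encodes as UTF-8 by default and leaves letters, digits,
    # '.', '-', '_' and '~' untouched.
    return quote(host, safe="")

def host_to_punycode(host: str) -> str:
    """Direct mapping for schemes known to be DNS-based: IDNA ToASCII."""
    # The codec converts each dot-separated label separately.
    return host.encode("idna").decode("ascii")

example = "bücher.example"  # hypothetical host name
print(host_to_percent_escapes(example))  # b%C3%BCcher.example
print(host_to_punycode(example))         # xn--bcher-kva.example
```

The first function needs no knowledge of the scheme; the second is only
safe when the authority is known to be a DNS name.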


>RFC-2396 defined the host field as a host name or IPv4 address; there
>was no mention of registered names.

Sorry, wrong. From http://www.ietf.org/rfc/rfc2396.txt:

 >>>>
3.2. Authority Component

    Many URI schemes include a top hierarchical element for a naming
    authority, such that the namespace defined by the remainder of the
    URI is governed by that authority.  This authority component is
    typically defined by an Internet-based server or a scheme-specific
    registry of naming authorities.

       authority     = server | reg_name
 >>>>

And while in San Francisco the general understanding was that
registry-based naming authorities that use DNS hostnames were
the only such URIs in deployment, this understanding has
crumbled in the meantime. In addition, it was considered
highly inadvisable to bet the future of URIs and IRIs on the DNS.



>Currently, a URI like http://www.w%33.org/ will fail on many browsers,
>which is no problem because the URI is invalid according to RFC-2396.

It works on IE, Opera, and Amaya. And it's not really an issue, because
nobody would actually use that except for testing. For %-escapes
derived from IDNs, it's very easy to make IRIs, IDNs, and this
%-escaping all work without problems. Please remember: a browser
that doesn't support IDNs just doesn't.
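
As a rough sketch of what such a layer just above gethostbyname could
do (Python again; the function name host_for_dns is hypothetical, and
the actual resolver call is left out so the example stays
self-contained):

```python
from urllib.parse import unquote

def host_for_dns(uri_host: str) -> str:
    """Turn a URI host that may contain %-escapes into a name for the resolver."""
    # Step 1: undo the %-escapes and read the octets as UTF-8.
    # errors="strict" rejects escapes that are not valid UTF-8.
    decoded = unquote(uri_host, encoding="utf-8", errors="strict")
    # Step 2: apply IDNA ToASCII; pure-ASCII names pass through unchanged.
    return decoded.encode("idna").decode("ascii")

# The result is what would be handed to something like socket.gethostbyname().
print(host_for_dns("www.w%33.org"))         # www.w3.org
print(host_for_dns("b%C3%BCcher.example"))  # xn--bcher-kva.example
```

A browser without IDN support would simply skip step 2 and fail on the
second name, but it could still handle the first.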


>By the way, the draft contains a factual error:
>
> > The reg-name syntax allows for percent-encoded octets, which is
> > necessary to enable internationalized domain names to be provided in
> > URIs;
>
>Every IDN has an ACE form; therefore percent-escapes are not necessary
>for using IDNs in URIs.  Percent-escapes would be necessary for
>using internationalized reg-names (because reg-names are not domain
>names and IDNA does not apply to them), but not necessary for using
>internationalized domain names.

I suggest changing this to:

The reg-name syntax allows for percent-encoded octets, in order to
enable internationalized domain names to be provided in URIs in
a uniform way;


>Stephen Pollei <stephen_pollei@comcast.net> wrote:
>
> > So it's my understanding that lots of names are legal, just not
> > recommended.

>RFC-952 gave the syntax:

>So there is no doubt that host names can contain only ASCII letters,
>digits, hyphens, and dots.  It's an open-and-shut case.

So Stephen's host, with an underscore, just doesn't exist, or what?
Even if every browser actually gets there? Is the tail wagging the
dog here, or what do you think is going on?


Regards,    Martin.
Received on Sunday, 15 February 2004 11:23:30 UTC