Re: uri handling of hosts is too restrictive from Martin Duerst on 2004-02-19 (public-iri@w3.org from February 2004)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 18 Feb 2004 19:13:13 -0500
To: uri@w3.org, uri@w3.org, public-iri@w3.org
Message-Id: <4.2.0.58.J.20040218185541.060f9ee8@localhost>

Hello Adam,

I still plan to look at your previous emails,
but here is an answer to this one.

At 11:35 04/02/18 +0000, Adam M. Costello BOGUS address, see signature wrote:

>I wrote:
>
> > Another (more general) way out is to introduce an explicit
> > half-way-house between IRIs and URIs.
>
>After some further thought, I would make a few tweaks to that idea.
>
>First, percent-encoding would always be allowed in all components of all
>IRIs;

Currently it's not; in particular, it is not allowed in the
'scheme' component.

And it's not really IRIs that should need percent-encoding,
although you do need it in some cases, namely when characters
are not encoded as UTF-8 in the corresponding URI.
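
To illustrate that case, here is a minimal Python sketch of the
URI-to-IRI direction. The function name uri_to_iri and the example
URIs are made up for illustration, and the all-or-nothing handling
of each run of escapes is a simplification, not the full procedure
in the IRI draft.

```python
import re
from urllib.parse import unquote_to_bytes

# A run of percent-escapes whose bytes are all >= 0x80 (non-ASCII).
_NON_ASCII_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")

def uri_to_iri(uri: str) -> str:
    """Turn percent-escapes into literal characters where possible.

    Hypothetical helper: a run of escapes that decodes as UTF-8 becomes
    the characters it encodes; a run that is not valid UTF-8 (for
    example Latin-1 bytes) is kept percent-encoded, because the IRI
    has no other way to carry those bytes.
    """
    def decode_run(match: re.Match) -> str:
        run = match.group(0)
        try:
            return unquote_to_bytes(run).decode("utf-8")
        except UnicodeDecodeError:
            return run  # not UTF-8: the escapes have to stay

    return _NON_ASCII_ESCAPES.sub(decode_run, uri)
```

With this sketch, a path escaped as UTF-8 (r%C3%A9sum%C3%A9) comes
out with literal characters, while the Latin-1 form (r%E9sum%E9)
is returned unchanged.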


>individual schemes would be unable to prohibit percent-encoding
>anywhere.  Second, if an individual scheme restricts a component to
>contain only a certain subset of Unicode characters (for example, the
>ASCII subset), scheme-specific IRI consumers would be required to check
>the component before using it, and fail gracefully if any characters are
>found outside the subset.

What do you mean by 'fail gracefully'? That the IRI would just not
be resolved? Or something else? And why would that have to be checked
before use? Why could the failure not simply be the result of actual use?


>(That would prevent IRIs from suffering some of the problems we are now
>seeing with URIs.  In URIs, percent-encoding was prohibited in the host
>component, and non-ASCII was prohibited in the host component, and there
>was no requirement telling URI consumers what to do if they should find
>either of those things in the host component, so now we have different
>implementations behaving differently when they encounter such things.)

Well, yes. But that's just a result of how things are implemented,
not a problem in the specification, I guess.


>The rule for converting an IRI to a URI would be:
>
>1) If you recognize the scheme, then verify that no component contains
>characters that it's not supposed to contain.

Again, why check before use? Of course if the IRI or URI is actually
used, there will naturally be some things that don't work out.
But why should we, for example, check that a domain name doesn't
contain a $, when it's just as easy to let the lookup go ahead and
simply fail to find it? For IDNA, a very strict check was important because
there is a justified fear that some registrants will step over
the boundary of what's allowed in terms of characters. But in
general, that's not that much of an issue.
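
As a rough Python sketch of what I mean by letting actual use
decide: the helper name resolve_host and its lookup parameter are
invented for this example, and socket.gethostbyname stands in for
whatever resolver an implementation really uses.

```python
import socket

def resolve_host(host: str, lookup=socket.gethostbyname):
    """Try to resolve the host and report failure, with no pre-check.

    Hypothetical helper: a name containing a character such as '$'
    is not validated up front; the lookup just fails and the caller
    gets None, the same as for any other name that does not exist.
    """
    try:
        return lookup(host)
    except (OSError, UnicodeError):
        # socket.gaierror is a subclass of OSError; UnicodeError covers
        # names that the resolver's own encoding step rejects.
        return None
```

The lookup parameter is only there so that the behaviour can be
exercised without touching the network.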


>If the verification
>succeeds, then apply whatever conversions are appropriate for each
>component.

We already made an exception for domain names. I don't want to
make any other exceptions. The goal is not a hodgepodge of
scheme-specific conventions, but to take advantage of the
fact that many URI schemes already are based on UTF-8,
many others allow UTF-8 to be used (in many parts at least)
and UTF-8 is also the recommendation for new schemes.

Schemes and parts that are not based on UTF-8 just have to
live with the fact that they cannot use the benefits of IRIs,
or upgrade themselves.
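
For concreteness, a minimal Python sketch of the generic IRI-to-URI
conversion with the one exception for domain names. The function
names are made up, the sketch assumes the exception means IDNA
ToASCII for the host (Python's "idna" codec), and it ignores
userinfo and other details of the draft.

```python
from urllib.parse import urlsplit, urlunsplit

def percent_encode_non_ascii(text: str) -> str:
    """Replace each non-ASCII character by its percent-encoded UTF-8 bytes."""
    return "".join(
        ch if ord(ch) < 128
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in text
    )

def iri_to_uri(iri: str) -> str:
    """Hypothetical helper: generic conversion, scheme-independent.

    Simplifying assumptions: any userinfo is dropped, and the host is
    converted with IDNA rather than percent-encoding.
    """
    parts = urlsplit(iri)
    netloc = parts.netloc
    if parts.hostname:
        host = parts.hostname.encode("idna").decode("ascii")
        netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((
        parts.scheme,
        netloc,
        percent_encode_non_ascii(parts.path),
        percent_encode_non_ascii(parts.query),
        percent_encode_non_ascii(parts.fragment),
    ))
```

Nothing in iri_to_uri looks at the scheme; the same rule applies
whether or not the scheme is recognized.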


>2) If the verification failed, or if you didn't recognize the scheme,
>then perform the generic conversion to percent-encoded UTF-8 as described
>in the IRI draft, and prepend the prefix i- to the scheme.

Why should i- be prepended? New schemes can be designed
so that they fit together well with IRIs (if the relevant
BCP guidelines are followed, that will be the case automatically).


Regards,    Martin.
Received on Wednesday, 18 February 2004 19:13:45 UTC