Re: uri handling of hosts is too restrictive

Graham Klyne <GK@ninebynine.org> wrote:

> > The rule can be: percent-encoding is allowed everywhere except the
> > scheme, and individual schemes cannot make exceptions to this rule.
>
> I think that's pretty close to what we have (if permitted
> normalization is taken into account - section 6.2.2.2).

I was talking about IRIs there, not URIs.

[This thread was cross-posted to both mailing lists early on, when it
was talking about both URIs and IRIs, but it has progressively shifted
attention more toward IRIs.  I forgot to manually fix the Reply-To:
field in my last message to point to both lists, so your message went
only to the URI list.  I've restored the former To: and Reply-To: fields
for this message, but someone please tell me if it's time to stop the
cross-posting.]

URIs already have a legacy of schemes that prohibit percent-encoding in
some components (all schemes that cite RFC 2396 and contain a hostname
component prohibit percent-encoding in the hostname).  The main point I
was trying to make in the quotation above is that the IRI spec should
avoid that pitfall by explicitly requiring all IRI consumers to expect
percent-encoding in all components (except the scheme component) of all
schemes.  Individual schemes should not be able to ban percent-encoding
anywhere in IRIs.

> > If an individual scheme restricts a component to contain only ASCII
> > characters, then scheme-specific IRI consumers would be required
> > to check the component before using it, and fail gracefully if any
> > non-ASCII characters are found.
> >
> > That's much simpler, requiring only one bit of knowledge about the
> > syntax of the component (whether it allows non-ASCII).
>
> this is about *generic* URI syntax, and I'm currently implementing a
> *generic* URI parser.  How am I supposed to know whether a particular
> scheme restricts a particular component in any particular way?

Please note the phrase "scheme-specific IRI consumers" in the quoted
passage.  The proposed rule is not about URI parsers at all, nor is it
about generic IRI parsers; it is about scheme-specific IRI consumers.
(It's too late to add a similar requirement for scheme-specific URI
consumers, because there's already a huge installed base.)

The proposed rule, in other words, is that if you're going to apply
scheme-specific knowledge to handle a component in a scheme-specific
way, then you must first use that same knowledge to check for non-ASCII
characters where they don't belong.  On the other hand, if you have no
scheme-specific knowledge then you're incapable of performing that
check, but that's okay because you're also incapable of doing what the
check prevents: feeding the component to a scheme-specific
ASCII-assuming operation.  The only operations you know are generic
IRI-component operations, all of which are designed to handle
non-ASCII.
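
To make the division of labor concrete, here is a minimal Python sketch
of the rule.  Every name in it (decode_component, require_ascii,
host_for_ascii_scheme, NonAsciiComponentError) is hypothetical and comes
from no spec or library.  It also assumes the check is applied after
percent-decoding, which the rule as stated above does not spell out.

```python
from urllib.parse import unquote


class NonAsciiComponentError(ValueError):
    """Raised when an ASCII-only component contains non-ASCII text."""


def decode_component(component: str) -> str:
    """Generic IRI-component operation: undo percent-encoding.

    Needs no scheme knowledge and copes with non-ASCII text.
    """
    return unquote(component, encoding="utf-8", errors="strict")


def require_ascii(component: str) -> str:
    """Check a scheme-specific consumer makes before ASCII-only use.

    Assumption: the check runs after percent-decoding, so "%C3%A4" is
    rejected just like a literal non-ASCII character.
    """
    decoded = decode_component(component)
    if not decoded.isascii():
        raise NonAsciiComponentError(component)
    return decoded


def host_for_ascii_scheme(host: str) -> str:
    """Hypothetical consumer for a scheme whose host must be ASCII."""
    # Only reached with scheme knowledge, so the check comes first.
    return require_ascii(host).lower()
```

A generic consumer would call only decode_component and never needs to
know which components are ASCII-only.  A scheme-specific consumer goes
through require_ascii and can catch NonAsciiComponentError to fail
gracefully.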

AMC
http://www.nicemice.net/amc/

Received on Thursday, 19 February 2004 06:50:18 UTC