Re: uri handling of hosts is too restrictive from Martin Duerst on 2004-04-27 (public-iri@w3.org from April 2004)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 27 Apr 2004 15:21:00 +0900
To: public-iri@w3.org
Message-Id: <4.2.0.58.J.20040427140741.071a7bd0@localhost>

Hello Adam,

This is all related to issue idnuri-02
(http://www.w3.org/International/iri-edit/#idnuri-02).
I have tentatively closed this issue.


At 20:01 04/02/19 +0000, Adam M. Costello BOGUS address, see signature wrote:

>Martin Duerst <duerst@w3.org> wrote:

> > And it's not really IRIs that should need percent-encoding, although
> > you need it in some cases, if characters are not encoded as UTF-8 in
> > the corresponding URI.
>
>Percent-encoding could also be useful for displaying an IRI when
>the local charset is not Unicode, or when the available fonts are
>insufficient.  If an IRI contains many non-ASCII characters that are
>displayable, plus one character that's not displayable, it might be
>nice to use percent-encoding only for the oddball and display the
>rest intelligibly, rather than convert the entire IRI to a URI.  If
>that displayed IRI is cut & pasted or manually retyped into another
>application, it should be handled properly.

This is currently allowed by the IRI spec. In practice, however,
there may be other ways to display non-displayable characters,
and cut-and-paste is usually able to copy even non-displayable
characters.
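
As an illustration of the selective escaping discussed above, here is
a minimal Python sketch. It assumes that "displayable" can be
approximated by "encodable in the local charset"; the function name
escape_undisplayable and the default charset are hypothetical and are
not taken from the IRI draft.

```python
def escape_undisplayable(iri: str, local_charset: str = "iso-8859-1") -> str:
    """Percent-encode (as UTF-8) only the characters the local charset lacks.

    Assumption: "not displayable" is approximated by "not encodable in
    local_charset". Real display depends on fonts as well.
    """
    pieces = []
    for ch in iri:
        try:
            ch.encode(local_charset)   # representable locally: keep as-is
            pieces.append(ch)
        except UnicodeEncodeError:     # the "oddball": escape its UTF-8 bytes
            pieces.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
    return "".join(pieces)
```

The result is still an IRI: characters that were kept stay as they
are, and the escaped ones use the same %HH form that full conversion
to a URI would produce.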


>If an individual scheme restricts a component to contain only ASCII
>characters, then scheme-specific IRI consumers would be required
>to check the component before using it, and fail gracefully if any
>non-ASCII characters are found.
>
>That's much simpler, requiring only one bit of knowledge about the
>syntax of the component (whether it allows non-ASCII).

Well, yes, but what exactly is a "scheme-specific IRI consumer"?
In the implementation I know, there is no such thing. IRIs get
converted to %HH, then the scheme-specific logic takes this apart,
then for some schemes, DNS resolution is called, which knows
about %HH and IDNs and does the right thing. What is such an
implementation supposed to do? Why should the spec give requirements
about things that don't exist in implementations?
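
The pipeline described above can be sketched roughly as follows. This
is only an illustration under assumptions, not the implementation
referred to in this message: the function names iri_to_uri and
host_for_dns are hypothetical, and Python's built-in "idna" codec
(which follows the IDNA 2003 rules) stands in for the DNS layer that
knows about %HH and IDNs.

```python
from urllib.parse import unquote, urlsplit


def iri_to_uri(iri: str) -> str:
    """Generic step: percent-encode every non-ASCII character as UTF-8."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in iri
    )


def host_for_dns(uri: str) -> str:
    """Resolver-side step: undo %HH in the host, then apply ToASCII."""
    host = urlsplit(uri).hostname or ""       # scheme logic takes the URI apart
    decoded = unquote(host)                   # %HH is interpreted as UTF-8
    return decoded.encode("idna").decode("ascii")
```

Note that no step here is a separate "scheme-specific IRI consumer";
the conversion is generic and only the resolver looks at the host.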


> > What do you mean by 'fail gracefully'?
>
>If the component is supposed to be a Foo, and a Foo is supposed to be
>ASCII, and the component contains non-ASCII, then you must not use
>the component as a Foo (whatever that means).  If you were about to
>do something that entailed using the component as a Foo (for example,
>passing it to something that takes a Foo as an argument), then you
>must abort the attempt, and the error is something like "invalid Foo
>(non-ASCII)".

This just sounds to me like two very general principles:
- defensive programming
- good error messages

I don't see a particular point in mentioning these in the IRI spec,
because they are also not mentioned in other IETF specs. Nor do
I see any good reason for mentioning them for one particular point
in the IRI spec, because they should apply to all of the spec.
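
For concreteness, the kind of check being discussed could look like
the following Python sketch. The names InvalidComponentError and
require_ascii are hypothetical, and treating percent-encoded bytes
above 0x7F as non-ASCII is an assumption of this sketch.

```python
from urllib.parse import unquote_to_bytes


class InvalidComponentError(ValueError):
    """Raised when a component that must be ASCII contains non-ASCII."""


def require_ascii(component: str, kind: str) -> str:
    """Return component unchanged, or abort with a clear error message."""
    # unquote_to_bytes encodes raw non-ASCII characters as UTF-8 and
    # decodes %HH escapes, so both forms show up as bytes >= 0x80.
    if any(b >= 0x80 for b in unquote_to_bytes(component)):
        raise InvalidComponentError(f"invalid {kind} (non-ASCII)")
    return component
```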


> > And why would that have to be checked before use?  Why could it not
> > simply be the result of actual use?
>
>Because the original Foo spec might be old (even if the IRI scheme
>containing a Foo component is more recent), and might have its own
>installed base of stuff that does not behave interoperably when
>presented with a non-ASCII Foo, and therefore it might have needed to
>introduce a client-side downgrading operation in order to safely extend
>the syntax.  If the IRI consumer blindly tries to use the Foo component
>as a Foo without performing the downgrading operation, the result will
>be unpredictable.

Yes. We have a sloppy spec/implementation on the one hand, and
somebody sending stuff they are not supposed to send on the
other hand. Not surprising that it doesn't work.


>Maybe there will be a misleading error message like
>"Foo xyz not found" even though xyz actually exists,

That's always a possibility for URIs and IRIs. Not all schemes may
be known, the network isn't perfect, and so on.


>or maybe the
>mangled request will map onto some other Foo by coincidence or malice.

If you think this needs to be documented as a security issue,
please say so (please propose some wording).


>Ideally, the Foo spec should have specified what to do whenever you
>encounter a syntactically invalid Foo, so that Foo implementations bear
>full responsibility for interoperability as the Foo syntax is extended,
>and nothing about the Foo syntax need be known at the IRI-processing
>layer.  But there is one kind of syntax extension where neglect has
>been the rule rather than the exception: the extension from ASCII to
>non-ASCII.  Because it has been so common for protocols to assume ASCII
>without saying enough about how to react to non-ASCII, and because
>the ASCII-to-non-ASCII transition is the same one being made by the
>introduction of IRIs, and because IRIs are uniquely positioned as a
>narrow interface between a wide range of protocols and a wide range of
>applications (sort of like IP is a narrow interface between a wide range
>of networks and a wide range of applications), IRIs are a good place to
>interpose a simple type-safety check.

Well, IRIs are defined as generic. Because the checks needed are
specific to different protocols and so on, I don't think that such
checks belong in a generic spec. If a spec needs fixing, it
should be fixed. Using another, vaguely related spec to try
and fix the first spec is probably a bad idea.


> > > (That would prevent IRIs from suffering some of the problems we are
> > > now seeing with URIs.  In URIs, percent-encoding was prohibited
> > > in the host component, and non-ASCII was prohibited in the host
> > > component, and there was no requirement telling URI consumers
> > > what to do if they should find either of those things in the host
> > > component, so now we have different implementations behaving
> > > differently when they encounter such things.)
> >
> > Well, yes.  But that's just a result of how things are implemented,
> > not a problem in the specification, I guess.
>
>I think it's a problem in the specification.  I think we've learned the
>hard way that specs need to say what to do when you encounter unexpected
>syntax, otherwise it's difficult to ever extend the syntax.

I agree. But I don't think the IRI spec is the right place to fix
all the other specs.


>RFC-2396 said the host component does not contain percent-escapes, but
>didn't say what to do if it did contain them, so some implementations
>decode the escapes, and some don't, and neither group is wrong.

And RFC 2396bis fixes that.


> > We already made an exception for domain names.  I don't want to make
> > any other exceptions.  The goal is not a hodgepodge of scheme-specific
> > conventions, but to take advantage of the fact that many URI schemes
> > already are based on UTF-8, many others allow UTF-8 to be used (in
> > many parts at least) and UTF-8 is also the recommendation for new
> > schemes.
>
>I agree with those goals, but there is a distinct possibility that an
>ACE will be defined for email local parts, in which case IRI-to-URI
>converters with knowledge of mailto: will want to use the ACE for
>compatibility with existing mailto: resolvers.

Yes, in case such a possibility becomes reality, some converters
might do that, if they think that helps. They will do that whether or
not a spec tells them to. On the other hand, the mailto: URI scheme should
be updated to allow %HH (based on UTF-8) in the LHS, and to otherwise
be better internationalized.


>Maybe there are
>other ASCII-only components lurking in existing URI schemes facing
>backward-compatibility challenges similar to those of domain names, and
>maybe they will likewise find it necessary to use the ACE approach to
>internationalization.

Do you know of any?


>The IRI spec would not need to mention any of the individual
>scheme-specific exceptions.  It mentions the IDN exception because ihost
>is a potential component of IRIs in general, and domain names are used
>in a great many schemes, but those reasons wouldn't apply to any other
>exceptional components (like email local parts).

Okay.


> > > 2) If the verification failed, or if you didn't recognize the
> > > scheme, then perform the generic conversion to percent-encoded UTF-8
> > > as described in the IRI draft, and prepend the prefix i- to the
> > > scheme.
> >
> > Why should i- be prepended?
>
>Because URI processing does not include the ASCII-component-check
>(whereas IRI processing, being a new spec, could include the check).
>Blindly dumping non-ASCII characters (even percent-encoded ones) into
>a URI would bypass the check.  If the URI contains a component that
>used to be limited to ASCII, legacy implementations might behave in
>unpredictable ways when that component contains (percent-encoded)
>non-ASCII.

I think there is a tradeoff. Introducing your i- pattern would
mean that the chance that any subsequent URI resolver actually
resolves that URI would currently be zero, and might stay very
close to zero for a very long time. As we know, introducing
a new URI scheme is very hard.

The alternative is to not use the i-, meaning that already
in quite a few implementations, the URI in question can be
resolved, and this number will be increasing faster than in
the i- case, at the expense of an occasional unpredictability
(which in most cases is just a 'not found').

For me, having things actually work, maybe with occasional
hiccups, is clearly preferable to a theoretically safe
solution that doesn't work in practice.


>Basically, i-foo: means "this identifier was blindly converted from a
>foo: IRI without foo-specific knowledge, so it does not necessarily
>conform to foo: URI syntax, but it does conform to generic URI syntax,
>and you can certainly recover the foo: IRI".

There are many other ways (e.g. by hand) to create foo: URIs that
don't conform to foo: URI syntax. The IRI draft clearly says that
you are not supposed to use non-ASCII characters where the scheme
can't handle it. Please see
http://www.w3.org/International/iri-edit/draft-duerst-iri.html#UTF8use
for actual text.


>Another answer to your question ("Why should i- be prepended?") is:  So
>that the IRI spec does not invite applications to violate the IDNA spec.
>The ireg-name component is an IDN-aware slot in schemes that use domain
>names there (because the IRI draft invites the usage of non-ASCII domain
>names there and cites IDNA).  The corresponding reg-name slot in the
>URI is IDN-unaware.  To convert a foo: IRI to a foo: URI, IDNA requires
>ToASCII to be applied.  But when the application doesn't know the
>scheme, the IRI draft invites the application to use percent-encoding
>instead, disregarding the IDNA requirement.

Well, I think that IDNA tried very hard to predict all cases of
use of IDNs, and put down general rules that would apply for all
cases. But in general, such things are just impossible. reg-name
is a typical example: a slot that can contain both domain names
and other stuff. And URIs are a typical example: In RFC 2396,
this slot only allowed US-ASCII. In RFC 2396bis, %HH is also
allowed. Implementations have evolved likewise.

The IRI spec does the best it reasonably can to navigate in this
area. Requiring everything to be prefixed with i-, in practice
making things less working, just to nominally conform to IDNA,
doesn't seem to make sense.

Not every application will know all relevant schemes, but the
number of current schemes using DNS in reg-name is not that large,
and any future schemes can be defined to allow %HH from the start.
So in practice, it is not too difficult for IRI implementations to
follow IDNA, and there is definitely nothing in the IRI spec
that says that implementations should disregard IDNA.


> > New schemes can be designed so that they fit together well with IRIs
> > (if the relevant BCP guidelines are used, that will be the case
> > automatically).
>
>The resolvers of those new schemes can simply strip off the i- prefix if
>they know that the generic IRI-to-URI conversion is sufficient for those
>schemes.  That could be mentioned in the IRI spec and in the guidelines
>for creating new schemes.

Designing things so that the future gets more complicated, rather
than more straightforward, just to deal with some sloppy specs/
implementations, does not seem to be a good idea.


>By the way, I should insert a rule 0 in my proposed IRI-to-URI
>conversion:
>
>0) If the IRI contains no non-ASCII characters (not even percent-encoded
>ones) then stop; it's already a URI.
>
>(Without this rule, if the scheme was unknown, the only effect of the
>other rules would be to prepend the i- prefix, which would be protecting
>nothing.)

Well, yes. And don't add an i- prefix if there already is one,
and make sure we reserve all scheme names starting with i-, and
a few other 'details'. Way too much hassle for what it's worth,
sorry.
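
To make the amount of detail concrete, here is a rough Python sketch
of the proposed conversion with rule 0 and the "no double prefix"
detail. It is an illustration only: the function names are
hypothetical, the scheme-specific verification step of the proposal
is left out entirely, and reserving scheme names that start with i-
is a registration matter that code cannot cover.

```python
from urllib.parse import unquote_to_bytes


def has_non_ascii(iri: str) -> bool:
    """True if the IRI has non-ASCII characters, raw or percent-encoded."""
    return any(b >= 0x80 for b in unquote_to_bytes(iri))


def convert_with_i_prefix(iri: str) -> str:
    # Rule 0: nothing non-ASCII, so the IRI is already a URI.
    if not has_non_ascii(iri):
        return iri
    # (The scheme-specific verification step would go here; omitted.)
    # Generic conversion: percent-encode non-ASCII characters as UTF-8.
    uri = "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in iri
    )
    scheme, sep, _rest = uri.partition(":")
    # Don't add an i- prefix if there is no scheme or one is already there.
    if not sep or scheme.lower().startswith("i-"):
        return uri
    return "i-" + uri
```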


Regards,     Martin.
Received on Tuesday, 27 April 2004 02:24:35 UTC