Re: uri handling of hosts is too restrictive from Martin Duerst on 2004-05-04 (public-iri@w3.org from May 2004)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 04 May 2004 17:08:42 +0900
To: public-iri@w3.org
Message-Id: <4.2.0.58.J.20040504170757.07904060@localhost>
It's now a week since I tentatively closed issue
http://www.w3.org/International/iri-edit/#idnuri-02.
I haven't heard about any problem, so I'm closing this issue.

Regards,     Martin.

At 15:21 04/04/27 +0900, Martin Duerst wrote:

>Hello Adam,
>
>This is all related to issue idnuri-02
>(http://www.w3.org/International/iri-edit/#idnuri-02).
>I have tentatively closed this issue.
>
>
>At 20:01 04/02/19 +0000, Adam M. Costello BOGUS address, see signature wrote:
>
>>Martin Duerst <duerst@w3.org> wrote:
>
>> > And it's not really IRIs that should need percent-encoding, although
>> > you need it in some cases, if characters are not encoded as UTF-8 in
>> > the corresponding URI.
>>
>>Percent-encoding could also be useful for displaying an IRI when
>>the local charset is not Unicode, or when the available fonts are
>>insufficient.  If an IRI contains many non-ASCII characters that are
>>displayable, plus one character that's not displayable, it might be
>>nice to use percent-encoding only for the oddball and display the
>>rest intelligibly, rather than convert the entire IRI to a URI.  If
>>that displayed IRI is cut & pasted or manually retyped into another
>>application, it should be handled properly.
>
>This is currently allowed by the IRI spec. In practice, however,
>there may be other ways to display non-displayable characters,
>and cut-and-paste is usually able to copy even non-displayable
>characters.
>
>
>>If an individual scheme restricts a component to contain only ASCII
>>characters, then scheme-specific IRI consumers would be required
>>to check the component before using it, and fail gracefully if any
>>non-ASCII characters are found.
>>
>>That's much simpler, requiring only one bit of knowledge about the
>>syntax of the component (whether it allows non-ASCII).
>
>Well, yes, but what exactly is a "scheme-specific IRI consumer"?
>In the implementation I know, there is no such thing. IRIs get
>converted to %HH, then the scheme-specific logic takes this apart,
>then for some schemes, DNS resolution is called, which knows
>about %HH and IDNs and does the right thing. What is such an
>implementation supposed to do? Why should the spec give requirements
>about things that don't exist in implementations?
>
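The pipeline described above (convert the IRI to %HH, let scheme-specific logic take it apart, then hand the host to a resolver that understands both %HH and IDNs) can be sketched roughly as follows. This is only an illustration in Python, not the implementation referred to in the message: the function names are hypothetical, the conversion skips the normalization steps of the IRI draft, and Python's built-in "idna" codec stands in for the ToASCII step.

```python
from urllib.parse import quote, unquote, urlsplit


def iri_to_percent_encoded(iri: str) -> str:
    """Generic step: replace each non-ASCII character by its %HH-encoded UTF-8 bytes."""
    return "".join(c if ord(c) < 128 else quote(c, safe="") for c in iri)


def host_for_dns(uri: str) -> str:
    """Resolver-side step (hypothetical): undo %HH in the host, then apply ToASCII."""
    host = urlsplit(uri).hostname or ""
    # unquote() decodes %HH as UTF-8; the "idna" codec converts each label to its ACE form.
    return unquote(host).encode("idna").decode("ascii")


uri = iri_to_percent_encoded("http://b\u00fccher.example/caf\u00e9")
print(uri)
print(host_for_dns(uri))
```

In this sketch nothing before the resolver needs to know whether a component is ASCII-only, which is the situation the question about a "scheme-specific IRI consumer" is pointing at.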
>
>> > What do you mean by 'fail gracefully'?
>>
>>If the component is supposed to be a Foo, and a Foo is supposed to be
>>ASCII, and the component contains non-ASCII, then you must not use
>>the component as a Foo (whatever that means).  If you were about to
>>do something that entailed using the component as a Foo (for example,
>>passing it to something that takes a Foo as an argument), then you
>>must abort the attempt, and the error is something like "invalid Foo
>>(non-ASCII)".
>
>This just sounds to me like two very general principles:
>- defensive programming
>- good error messages
>
>I don't see a particular point in mentioning these in the IRI spec,
>because they are also not mentioned in other IETF specs. Nor do
>I see any good reason for mentioning them for one particular point
>in the IRI spec, because they should apply to all of the spec.
>
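For concreteness, the check being discussed ("if a Foo is supposed to be ASCII and the component contains non-ASCII, abort with an error such as 'invalid Foo (non-ASCII)'") might look like the Python sketch below. The function and exception names are hypothetical, and the sketch only looks at literal non-ASCII characters, not percent-encoded ones.

```python
class InvalidComponentError(ValueError):
    """Raised when an ASCII-only component contains non-ASCII characters."""


def require_ascii(component: str, kind: str = "Foo") -> str:
    """Return the component unchanged if it is pure ASCII, otherwise abort.

    Only literal non-ASCII characters are detected here; a %HH escape
    that encodes a non-ASCII byte would pass this check.
    """
    if not component.isascii():
        raise InvalidComponentError(f"invalid {kind} (non-ASCII)")
    return component


try:
    require_ascii("ex\u00e4mple")
except InvalidComponentError as err:
    print(err)
```

The caller needs exactly one bit of knowledge about the component (whether it allows non-ASCII), which is the point made in the quoted proposal.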
>
>> > And why would that have to be checked before use?  Why could it not
>> > simply be the result of actual use?
>>
>>Because the original Foo spec might be old (even if the IRI scheme
>>containing a Foo component is more recent), and might have its own
>>installed base of stuff that does not behave interoperably when
>>presented with a non-ASCII Foo, and therefore it might have needed to
>>introduce a client-side downgrading operation in order to safely extend
>>the syntax.  If the IRI consumer blindly tries to use the Foo component
>>as a Foo without performing the downgrading operation, the result will
>>be unpredictable.
>
>Yes. We have a sloppy spec/implementation on the one hand, and
>somebody sending stuff they are not supposed to send on the
>other hand. Not surprising that it doesn't work.
>
>
>>Maybe there will be a misleading error message like
>>"Foo xyz not found" even though xyz actually exists,
>
>That's always a possibility for URIs and IRIs. Not all schemes may
>be known, and the network isn't perfect,...
>
>
>>or maybe the
>>mangled request will map onto some other Foo by coincidence or malice.
>
>If you think this needs to be documented as a security issue,
>please say so (please propose some wording).
>
>
>>Ideally, the Foo spec should have specified what to do whenever you
>>encounter a syntactically invalid Foo, so that Foo implementations bear
>>full responsibility for interoperability as the Foo syntax is extended,
>>and nothing about the Foo syntax need be known at the IRI-processing
>>layer.  But there is one kind of syntax extension where neglect has
>>been the rule rather than the exception: the extension from ASCII to
>>non-ASCII.  Because it has been so common for protocols to assume ASCII
>>without saying enough about how to react to non-ASCII, and because
>>the ASCII-to-non-ASCII transition is the same one being made by the
>>introduction of IRIs, and because IRIs are uniquely positioned as a
>>narrow interface between a wide range of protocols and a wide range of
>>applications (sort of like IP is a narrow interface between a wide range
>>of networks and a wide range of applications), IRIs are a good place to
>>interpose a simple type-safety check.
>
>Well, IRIs are defined as generic. Because the checks needed are
>specific to different protocols, etc., I don't think that such
>checks belong in a generic spec. If a spec needs fixing, it
>should be fixed. Using another, vaguely related spec to try
>and fix the first spec is probably a bad idea.
>
>
>> > > (That would prevent IRIs from suffering some of the problems we are
>> > > now seeing with URIs.  In URIs, percent-encoding was prohibited
>> > > in the host component, and non-ASCII was prohibited in the host
>> > > component, and there was no requirement telling URI consumers
>> > > what to do if they should find either of those things in the host
>> > > component, so now we have different implementations behaving
>> > > differently when they encounter such things.)
>> >
>> > Well, yes.  But that's just a result of how things are implemented,
>> > not a problem in the specification, I guess.
>>
>>I think it's a problem in the specification.  I think we've learned the
>>hard way that specs need to say what to do when you encounter unexpected
>>syntax, otherwise it's difficult to ever extend the syntax.
>
>I agree. But I don't think the IRI spec is the right place to fix
>all the other specs.
>
>
>>RFC-2396 said the host component does not contain percent-escapes, but
>>didn't say what to do if it did contain them, so some implementations
>>decode the escapes, and some don't, and neither group is wrong.
>
>And RFC 2396bis fixes that.
>
>
>> > We already made an exception for domain names.  I don't want to make
>> > any other exceptions.  The goal is not a hodgepodge of scheme-specific
>> > conventions, but to take advantage of the fact that many URI schemes
>> > already are based on UTF-8, many others allow UTF-8 to be used (in
>> > many parts at least) and UTF-8 is also the recommendation for new
>> > schemes.
>>
>>I agree with those goals, but there is a distinct possibility that an
>>ACE will be defined for email local parts, in which case IRI-to-URI
>>converters with knowledge of mailto: will want to use the ACE for
>>compatibility with existing mailto: resolvers.
>
>Yes, in the case such a possibility becomes reality, some converters
>might do that, if they think that helps. They will do that whether or
>not a spec tells them to. On the other hand, the mailto: URI scheme should
>be updated to allow %HH (based on UTF-8) in the LHS, and to otherwise
>be better internationalized.
>
>
>>Maybe there are
>>other ASCII-only components lurking in existing URI schemes facing
>>backward-compatibility challenges similar to those of domain names, and
>>maybe they will likewise find it necessary to use the ACE approach to
>>internationalization.
>
>Do you know of any?
>
>
>>The IRI spec would not need to mention any of the individual
>>scheme-specific exceptions.  It mentions the IDN exception because ihost
>>is a potential component of IRIs in general, and domain names are used
>>in a great many schemes, but those reasons wouldn't apply to any other
>>exceptional components (like email local parts).
>
>Okay.
>
>
>> > > 2) If the verification failed, or if you didn't recognize the
>> > > scheme, then perform the generic conversion to percent-encoded UTF-8
>> > > as described in the IRI draft, and prepend the prefix i- to the
>> > > scheme.
>> >
>> > Why should i- be prepended?
>>
>>Because URI processing does not include the ASCII-component-check
>>(whereas IRI processing, being a new spec, could include the check).
>>Blindly dumping non-ASCII characters (even percent-encoded ones) into
>>a URI would bypass the check.  If the URI contains a component that
>>used to be limited to ASCII, legacy implementations might behave in
>>unpredictable ways when that component contains (percent-encoded)
>>non-ASCII.
>
>I think there is a tradeoff. Introducing your i- pattern would
>mean that the chance that any subsequent URI resolver actually
>resolves that URI currently would be zero, and might stay very
>close to zero for a very long time. As we know, introducing
>a new URI scheme is very hard.
>
>The alternative is to not use the i-, meaning that already
>in quite a few implementations, the URI in question can be
>resolved, and this number will be increasing faster than in
>the i- case, at the expense of an occasional unpredictability
>(which in most cases is just a 'not found').
>
>For me, having things actually work, maybe with occasional
>hiccups, is clearly preferable to a theoretically safe
>solution that doesn't work in practice.
>
>
>>Basically, i-foo: means "this identifier was blindly converted from a
>>foo: IRI without foo-specific knowledge, so it does not necessarily
>>conform to foo: URI syntax, but it does conform to generic URI syntax,
>>and you can certainly recover the foo: IRI".
>
>There are many other ways (e.g. by hand) to create foo: URIs that
>don't conform to foo: URI syntax. The IRI draft clearly says that
>you are not supposed to use non-ASCII characters where the scheme
>can't handle it. Please see
>http://www.w3.org/International/iri-edit/draft-duerst-iri.html#UTF8use
>for actual text.
>
>
>>Another answer to your question ("Why should i- be prepended?") is:  So
>>that the IRI spec does not invite applications to violate the IDNA spec.
>>The ireg-name component is an IDN-aware slot in schemes that use domain
>>names there (because the IRI draft invites the usage of non-ASCII domain
>>names there and cites IDNA).  The corresponding reg-name slot in the
>>URI is IDN-unaware.  To convert a foo: IRI to a foo: URI, IDNA requires
>>ToASCII to be applied.  But when the application doesn't know the
>>scheme, the IRI draft invites the application to use percent-encoding
>>instead, disregarding the IDNA requirement.
>
>Well, I think that IDNA tried very hard to predict all cases of
>use of IDNs, and put down general rules that would apply for all
>cases. But in general, such things are just impossible. reg-name
>is a typical example: a slot that can contain both domain names
>and other stuff. And URIs are a typical example: In RFC 2396,
>this slot only allowed US-ASCII. In RFC 2396bis, %HH is also
>allowed. Implementations have evolved likewise.
>
>The IRI spec does the best it reasonably can to navigate in this
>area. Requiring everything to be prefixed with i-, in practice
>making things less working, just to nominally conform to IDNA,
>doesn't seem to make sense.
>
>Not every application will know all relevant schemes, but the
>number of current schemes using DNS in reg-name is not that large,
>and any future schemes can be defined to allow %HH from the start.
>So in practice, it is not too difficult for IRI implementations to
>follow IDNA, and there is definitely nothing in the IRI spec
>that says that implementations should disregard IDNA.
>
>
>> > New schemes can be designed so that they fit together well with IRIs
>> > (if the relevant BCP guidelines are used, that will be the case
>> > automatically).
>>
>>The resolvers of those new schemes can simply strip off the i- prefix if
>>they know that the generic IRI-to-URI conversion is sufficient for those
>>schemes.  That could be mentioned in the IRI spec and in the guidelines
>>for creating new schemes.
>
>Designing things so that the future gets more complicated, rather
>than more straightforward, just to deal with some sloppy specs/
>implementations, does not seem to be a good idea.
>
>
>>By the way, I should insert a rule 0 in my proposed IRI-to-URI
>>conversion:
>>
>>0) If the IRI contains no non-ASCII characters (not even percent-encoded
>>ones) then stop; it's already a URI.
>>
>>(Without this rule, if the scheme was unknown, the only effect of the
>>other rules would be to prepend the i- prefix, which would be protecting
>>nothing.)
>
>Well, yes. And don't add an i- prefix if there already is one,
>and make sure we reserve all scheme names starting with i-, and
>a few other 'details'. Way too much hassle for what it's worth,
>sorry.
>
>
>Regards,     Martin.
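
The i- proposal quoted above, as far as it appears in this message (rule 0, rule 2, and the "don't add an i- prefix if there already is one" detail), could be sketched in Python as follows. This is an illustration under assumptions, not text from either author: the function names are hypothetical, the input is assumed to be an absolute IRI whose scheme is not recognized, and the generic conversion is reduced to percent-encoding non-ASCII characters as UTF-8.

```python
import re
from urllib.parse import quote

# A %HH escape whose first hex digit is 8-F encodes a non-ASCII byte.
_PCT_NON_ASCII = re.compile(r"%[89A-Fa-f][0-9A-Fa-f]")


def has_non_ascii(iri: str) -> bool:
    """True if the IRI contains literal or percent-encoded non-ASCII."""
    return (not iri.isascii()) or _PCT_NON_ASCII.search(iri) is not None


def generic_to_uri(iri: str) -> str:
    """Simplified generic conversion: %HH-encode the UTF-8 bytes of non-ASCII characters."""
    return "".join(c if ord(c) < 128 else quote(c, safe="") for c in iri)


def proposed_iri_to_uri(iri: str) -> str:
    """Sketch of the proposal for an IRI whose scheme is not recognized."""
    # Rule 0: nothing non-ASCII, so the IRI is already a URI.
    if not has_non_ascii(iri):
        return iri
    # Rule 2: generic conversion, then prepend i- to the scheme.
    uri = generic_to_uri(iri)
    scheme, sep, rest = uri.partition(":")
    if not sep or scheme.lower().startswith("i-"):
        # No scheme found, or the prefix is already there: leave it alone.
        return uri
    return f"i-{scheme}{sep}{rest}"


print(proposed_iri_to_uri("http://example.org/a"))
print(proposed_iri_to_uri("foo://example.org/r\u00e9sum\u00e9"))
```

Even this small sketch needs the extra cases listed in the reply (an existing i- prefix, reserving scheme names that start with i-), which illustrates the "hassle" objection.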
Received on Tuesday, 4 May 2004 04:17:52 UTC