- From: Martin Duerst <duerst@w3.org>
- Date: Tue, 27 Apr 2004 15:21:00 +0900
- To: public-iri@w3.org
Hello Adam,

This is all related to issue idnuri-02
(http://www.w3.org/International/iri-edit/#idnuri-02).
I have tentatively closed this issue.

At 20:01 04/02/19 +0000, Adam M. Costello BOGUS address, see signature wrote:

>Martin Duerst <duerst@w3.org> wrote:
>
> > And it's not really IRIs that should need percent-encoding, although
> > you need it in some cases, if characters are not encoded as UTF-8 in
> > the corresponding URI.
>
>Percent-encoding could also be useful for displaying an IRI when
>the local charset is not Unicode, or when the available fonts are
>insufficient. If an IRI contains many non-ASCII characters that are
>displayable, plus one character that's not displayable, it might be
>nice to use percent-encoding only for the oddball and display the
>rest intelligibly, rather than convert the entire IRI to a URI. If
>that displayed IRI is cut & pasted or manually retyped into another
>application, it should be handled properly.

This is currently allowed by the IRI spec. In practice, however, there
may be other ways to display non-displayable characters, and
cut-and-paste is usually able to copy even non-displayable characters.

>If an individual scheme restricts a component to contain only ASCII
>characters, then scheme-specific IRI consumers would be required
>to check the component before using it, and fail gracefully if any
>non-ASCII characters are found.
>
>That's much simpler, requiring only one bit of knowledge about the
>syntax of the component (whether it allows non-ASCII).

Well, yes, but what exactly is a "scheme-specific IRI consumer"? In
the implementation I know, there is no such thing. IRIs get converted
to %HH, then the scheme-specific logic takes this apart, then for some
schemes, DNS resolution is called, which knows about %HH and IDNs and
does the right thing. What is such an implementation supposed to do?
Why should the spec give requirements about things that don't exist in
implementations?

> > What do you mean by 'fail gracefully'?
>
>If the component is supposed to be a Foo, and a Foo is supposed to be
>ASCII, and the component contains non-ASCII, then you must not use
>the component as a Foo (whatever that means). If you were about to
>do something that entailed using the component as a Foo (for example,
>passing it to something that takes a Foo as an argument), then you
>must abort the attempt, and the error is something like "invalid Foo
>(non-ASCII)".

This just sounds to me like two very general principles:
- defensive programming
- good error messages

I don't see a particular point in mentioning these in the IRI spec,
because they are also not mentioned in other IETF specs. Nor do I see
any good reason for mentioning them for one particular point in the
IRI spec, because they should apply to all of the spec.

> > And why would that have to be checked before use? Why could it not
> > simply be the result of actual use?
>
>Because the original Foo spec might be old (even if the IRI scheme
>containing a Foo component is more recent), and might have its own
>installed base of stuff that does not behave interoperably when
>presented with a non-ASCII Foo, and therefore it might have needed to
>introduce a client-side downgrading operation in order to safely extend
>the syntax. If the IRI consumer blindly tries to use the Foo component
>as a Foo without performing the downgrading operation, the result will
>be unpredictable.

Yes. We have a sloppy spec/implementation on the one hand, and somebody
sending stuff they are not supposed to send on the other hand. Not
surprising that it doesn't work.
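To make the generic %HH step mentioned above concrete, here is a
minimal sketch in Python (illustrative only; it ignores the
domain-name exception that comes up further down, and real code would
operate on a parsed IRI rather than a raw string):

    def iri_to_uri(iri):
        # Percent-encode the UTF-8 bytes of each non-ASCII character;
        # ASCII characters (including existing %HH escapes) pass
        # through unchanged.
        return ''.join(c if ord(c) < 128
                       else ''.join('%%%02X' % b for b in c.encode('utf-8'))
                       for c in iri)

    # iri_to_uri('http://example.org/r\u00e9sum\u00e9')
    #   -> 'http://example.org/r%C3%A9sum%C3%A9'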
>Maybe there will be a misleading error message like
>"Foo xyz not found" even though xyz actually exists,

That's always a possibility for URIs and IRIs. Not all schemes may be
known, and the network isn't perfect,...

>or maybe the
>mangled request will map onto some other Foo by coincidence or malice.

If you think this needs to be documented as a security issue, please
say so (please propose some wording).

>Ideally, the Foo spec should have specified what to do whenever you
>encounter a syntactically invalid Foo, so that Foo implementations bear
>full responsibility for interoperability as the Foo syntax is extended,
>and nothing about the Foo syntax need be known at the IRI-processing
>layer. But there is one kind of syntax extension where neglect has
>been the rule rather than the exception: the extension from ASCII to
>non-ASCII. Because it has been so common for protocols to assume ASCII
>without saying enough about how to react to non-ASCII, and because
>the ASCII-to-non-ASCII transition is the same one being made by the
>introduction of IRIs, and because IRIs are uniquely positioned as a
>narrow interface between a wide range of protocols and a wide range of
>applications (sort of like IP is a narrow interface between a wide range
>of networks and a wide range of applications), IRIs are a good place to
>interpose a simple type-safety check.

Well, IRIs are defined as generic. Because the checks needed are
specific to the different protocols, I don't think that such checks
belong in a generic spec. If a spec needs fixing, it should be fixed.
Using another, vaguely related spec to try and fix the first spec is
probably a bad idea.

> > > (That would prevent IRIs from suffering some of the problems we are
> > > now seeing with URIs. In URIs, percent-encoding was prohibited
> > > in the host component, and non-ASCII was prohibited in the host
> > > component, and there was no requirement telling URI consumers
> > > what to do if they should find either of those things in the host
> > > component, so now we have different implementations behaving
> > > differently when they encounter such things.)
> >
> > Well, yes. But that's just a result of how things are implemented,
> > not a problem in the specification, I guess.
>
>I think it's a problem in the specification. I think we've learned the
>hard way that specs need to say what to do when you encounter unexpected
>syntax, otherwise it's difficult to ever extend the syntax.

I agree. But I don't think the IRI spec is the right place to fix all
the other specs.

>RFC-2396 said the host component does not contain percent-escapes, but
>didn't say what to do if it did contain them, so some implementations
>decode the escapes, and some don't, and neither group is wrong.

And RFC 2396bis fixes that.

> > We already made an exception for domain names. I don't want to make
> > any other exceptions. The goal is not a hodgepodge of scheme-specific
> > conventions, but to take advantage of the fact that many URI schemes
> > already are based on UTF-8, many others allow UTF-8 to be used (in
> > many parts at least) and UTF-8 is also the recommendation for new
> > schemes.
>
>I agree with those goals, but there is a distinct possibility that an
>ACE will be defined for email local parts, in which case IRI-to-URI
>converters with knowledge of mailto: will want to use the ACE for
>compatibility with existing mailto: resolvers.

Yes, in case such a possibility becomes reality, some converters might
do that, if they think that helps. They will do that whether or not a
spec tells them to. On the other hand, the mailto: URI scheme should
be updated to allow %HH (based on UTF-8) in the LHS, and to otherwise
be better internationalized.
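For illustration only, here is what the two downgrade strategies would
look like for a local part such as 'rené' (a Python sketch; both are
purely hypothetical, since no ACE for local parts exists and mailto:
does not yet allow %HH in the LHS; the xn-- prefix is borrowed from
IDNA just for the example):

    local = 'ren\u00e9'                # 'rené'

    # (a) generic percent-encoding of the UTF-8 bytes:
    pct = ''.join(c if ord(c) < 128
                  else ''.join('%%%02X' % b for b in c.encode('utf-8'))
                  for c in local)      # -> 'ren%C3%A9'

    # (b) a punycode-based ACE, if one were ever defined for local
    #     parts:
    ace = 'xn--' + local.encode('punycode').decode('ascii')
                                       # -> 'xn--ren-dma'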
>Maybe there are
>other ASCII-only components lurking in existing URI schemes facing
>backward-compatibility challenges similar to those of domain names, and
>maybe they will likewise find it necessary to use the ACE approach to
>internationalization.

Do you know of any?

>The IRI spec would not need to mention any of the individual
>scheme-specific exceptions. It mentions the IDN exception because ihost
>is a potential component of IRIs in general, and domain names are used
>in a great many schemes, but those reasons wouldn't apply to any other
>exceptional components (like email local parts).

Okay.

> > > 2) If the verification failed, or if you didn't recognize the
> > > scheme, then perform the generic conversion to percent-encoded UTF-8
> > > as described in the IRI draft, and prepend the prefix i- to the
> > > scheme.
> >
> > Why should i- be prepended?
>
>Because URI processing does not include the ASCII-component-check
>(whereas IRI processing, being a new spec, could include the check).
>Blindly dumping non-ASCII characters (even percent-encoded ones) into
>a URI would bypass the check. If the URI contains a component that
>used to be limited to ASCII, legacy implementations might behave in
>unpredictable ways when that component contains (percent-encoded)
>non-ASCII.

I think there is a tradeoff. Introducing your i- pattern would mean
that the chance that any subsequent URI resolver actually resolves
that URI would currently be zero, and might stay very close to zero
for a very long time. As we know, introducing a new URI scheme is very
hard. The alternative is to not use the i-, meaning that the URI in
question can already be resolved in quite a few implementations, and
this number will increase faster than in the i- case, at the expense
of occasional unpredictability (which in most cases is just a 'not
found'). For me, having things actually work, maybe with occasional
hiccups, is clearly preferable to a theoretically safe solution that
doesn't work in practice.
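For reference, the overall conversion Adam is proposing would look
roughly like this (a sketch; rule 1 is paraphrased rather than quoted
in this message, and scheme_converters is a made-up registry of
scheme-specific converters that return None on failure):

    import re

    def adams_conversion(iri, scheme_converters):
        scheme, rest = iri.split(':', 1)

        # Rule 0 (which Adam adds further down): stop if the IRI is
        # all ASCII and no %HH escape encodes a non-ASCII byte; it is
        # already a URI.
        if (all(ord(c) < 128 for c in iri)
                and not re.search('%[89A-Fa-f][0-9A-Fa-f]', iri)):
            return iri

        # Rule 1 (paraphrased): a converter that knows the scheme
        # verifies and converts scheme-specifically.
        convert = scheme_converters.get(scheme)
        if convert is not None:
            uri = convert(rest)
            if uri is not None:
                return scheme + ':' + uri

        # Rule 2: otherwise fall back to the generic conversion
        # (iri_to_uri from the first sketch) and prepend i- to the
        # scheme.
        return 'i-' + scheme + ':' + iri_to_uri(rest)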
>Basically, i-foo: means "this identifier was blindly converted from a
>foo: IRI without foo-specific knowledge, so it does not necessarily
>conform to foo: URI syntax, but it does conform to generic URI syntax,
>and you can certainly recover the foo: IRI".

There are many other ways (e.g. by hand) to create foo: URIs that
don't conform to foo: URI syntax. The IRI draft clearly says that you
are not supposed to use non-ASCII characters where the scheme can't
handle them. Please see
http://www.w3.org/International/iri-edit/draft-duerst-iri.html#UTF8use
for the actual text.

>Another answer to your question ("Why should i- be prepended?") is: So
>that the IRI spec does not invite applications to violate the IDNA spec.
>The ireg-name component is an IDN-aware slot in schemes that use domain
>names there (because the IRI draft invites the usage of non-ASCII domain
>names there and cites IDNA). The corresponding reg-name slot in the
>URI is IDN-unaware. To convert a foo: IRI to a foo: URI, IDNA requires
>ToASCII to be applied. But when the application doesn't know the
>scheme, the IRI draft invites the application to use percent-encoding
>instead, disregarding the IDNA requirement.

Well, I think that IDNA tried very hard to predict all cases of use of
IDNs, and to put down general rules that would apply to all cases. But
in general, such things are just impossible. reg-name is a typical
example: a slot that can contain both domain names and other stuff.
And URIs are a typical example: in RFC 2396, this slot only allowed
US-ASCII. In RFC 2396bis, %HH is also allowed. Implementations have
evolved likewise. The IRI spec does the best it reasonably can to
navigate in this area. Requiring everything to be prefixed with i-,
in practice making things work less well, just to nominally conform to
IDNA, doesn't seem to make sense. Not every application will know all
relevant schemes, but the number of current schemes using DNS in
reg-name is not that large, and any future schemes can be defined to
allow %HH from the start. So in practice, it is not too difficult for
IRI implementations to follow IDNA, and there is definitely nothing in
the IRI spec that says that implementations should disregard IDNA.
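To make the domain-name exception concrete: a converter that follows
it applies ToASCII to ireg-name and the generic %HH conversion
everywhere else. A sketch (assuming Python's standard idna codec,
which implements RFC 3490 ToASCII, and an IRI with no userinfo or
port in its netloc):

    from urllib.parse import urlsplit, urlunsplit

    def iri_to_uri_with_idna(iri):
        # IDNA ToASCII for the ireg-name slot, generic %HH conversion
        # (iri_to_uri from the first sketch) for everything else.
        parts = urlsplit(iri)
        ace = (parts.hostname or '').encode('idna').decode('ascii')
        return iri_to_uri(urlunsplit(parts._replace(netloc=ace)))

    # iri_to_uri_with_idna('http://r\u00e9sum\u00e9.example/caf\u00e9')
    #   -> 'http://xn--rsum-bpad.example/caf%C3%A9'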
> > New schemes can be designed so that they fit together well with IRIs
> > (if the relevant BCP guidelines are used, that will be the case
> > automatically).
>
>The resolvers of those new schemes can simply strip off the i- prefix if
>they know that the generic IRI-to-URI conversion is sufficient for those
>schemes. That could be mentioned in the IRI spec and in the guidelines
>for creating new schemes.

Designing things so that the future gets more complicated, rather than
more straightforward, just to deal with some sloppy
specs/implementations, does not seem to be a good idea.

>By the way, I should insert a rule 0 in my proposed IRI-to-URI
>conversion:
>
>0) If the IRI contains no non-ASCII characters (not even percent-encoded
>ones) then stop; it's already a URI.
>
>(Without this rule, if the scheme was unknown, the only effect of the
>other rules would be to prepend the i- prefix, which would be protecting
>nothing.)

Well, yes. And don't add an i- prefix if there already is one, and make
sure we reserve all scheme names starting with i-, and a few other
'details'. Way too much hassle for what it's worth, sorry.

Regards,    Martin.

Received on Tuesday, 27 April 2004 02:24:35 UTC