- From: Martin Duerst <duerst@w3.org>
- Date: Tue, 04 May 2004 17:08:42 +0900
- To: public-iri@w3.org
It's now a week since I tentatively closed issue http://www.w3.org/International/iri-edit/#idnuri-02. I haven't heard about any problem, so I'm closing this issue. Regards, Martin. At 15:21 04/04/27 +0900, Martin Duerst wrote: >Hello Adam, > >This is all related to issue idnuri-02 >(http://www.w3.org/International/iri-edit/#idnuri-02). >I have tentatively closed this issue. > > >At 20:01 04/02/19 +0000, Adam M. Costello BOGUS address, see signature wrote: > >>Martin Duerst <duerst@w3.org> wrote: > >> > And it's not really IRIs that should need percent-encoding, although >> > you need it in some cases, if characters are not encoded as UTF-8 in >> > the corresponding URI. >> >>Percent-encoding could also be useful for displaying an IRI when >>the local charset is not Unicode, or when the available fonts are >>insufficient. If an IRI contains many non-ASCII characters that are >>displayable, plus one character that's not displayable, it might be >>nice to use percent-encoding only for the oddball and display the >>rest intelligibly, rather than convert the entire IRI to a URI. If >>that displayed IRI is cut & pasted or manually retyped into another >>application, it should be handled properly. > >This is currently allowed by the IRI spec. In practice, however, >there may be other ways to display non-displayable characters, >and cut-and-paste is usually able to copy even non-displayable >characters. > > >>If an individual scheme restricts a component to contain only ASCII >>characters, then scheme-specific IRI consumers would be required >>to check the component before using it, and fail gracefully if any >>non-ASCII characters are found. >> >>That's much simpler, requiring only one bit of knowledge about the >>syntax of the component (whether it allows non-ASCII). > >Well, yes, but what exactly is a "scheme-specific IRI consumer"? >In the implementation I know, there is no such thing. IRIs get >converted to %HH, then the scheme-specific logic takes this apart, >then for some schemes, DNS resolution is called, which knows >about %HH and IDNs and does the right thing. What is such an >implementation supposed to do? Why should the spec give requirements >about things that don't exist in implementations? > > >> > What do you mean by 'fail gracefully'? >> >>If the component is supposed to be a Foo, and a Foo is supposed to be >>ASCII, and the component contains non-ASCII, then you must not use >>the component as a Foo (whatever that means). If you were about to >>do something that entailed using the component as a Foo (for example, >>passing it to something that takes a Foo as an argument), then you >>must abort the attempt, and the error is something like "invalid Foo >>(non-ASCII)". > >This just sounds to me like two very general principles: >- defensive programming >- good error messages > >I don't see a particular point in mentioning these in the IRI spec, >because they are also not mentioned in other IETF specs. Nor do >I see any good reason for mentioning them for one particular point >in the IRI spec, because they should apply to all of the spec. > > >> > And why would that have to be checked before use? Why could it not >> > simply be the result of actual use? >> >>Because the original Foo spec might be old (even if the IRI scheme >>containing a Foo component is more recent), and might have its own >>installed base of stuff that does not behave interoperably when >>presented with a non-ASCII Foo, and therefore it might have needed to >>introduce a client-side downgrading operation in order to safely extend >>the syntax. If the IRI consumer blindly tries to use the Foo component >>as a Foo without performing the downgrading operation, the result will >>be unpredictable. > >Yes. We have a sloppy spec/implementation on the one hand, and >somebody sending stuff they are not supposed to send on the >other hand. Not surprising that it doesn't work. > > >>Maybe there will be a misleading error message like >>"Foo xyz not found" even though xyz actually exists, > >That's always a possibility for URIs and IRIs. Not all schemes may >be known, and the network isn't perfect,... > > >>or maybe the >>mangled request will map onto some other Foo by coincidence or malice. > >If you think this needs to be documented as a security issue, >please say so (please propose some wording). > > >>Ideally, the Foo spec should have specified what to do whenever you >>encounter a syntactically invalid Foo, so that Foo implementations bear >>full responsibility for interoperability as the Foo syntax is extended, >>and nothing about the Foo syntax need be known at the IRI-processing >>layer. But there is one kind of syntax extension where neglect has >>been the rule rather than the exception: the extension from ASCII to >>non-ASCII. Because it has been so common for protocols to assume ASCII >>without saying enough about how to react to non-ASCII, and because >>the ASCII-to-non-ASCII transition is the same one being made by the >>introduction of IRIs, and because IRIs are uniquely positioned as a >>narrow interface between a wide range of protocols and a wide range of >>applications (sort of like IP is a narrow interface between a wide range >>of networks and a wide range of applications), IRIs are a good place to >>interpose a simple type-safety check. > >Well, IRIs are defined as generic. Because the checks needed are >specific to different protocols,..., I don't think that such >checks belong into a generic spec. If a spec needs fixing, it >should be fixed. Using another, vaguely related spec to try >and fix the first spec is probably a bad idea. > > >> > > (That would prevent IRIs from suffering some of the problems we are >> > > now seeing with URIs. In URIs, percent-encoding was prohibited >> > > in the host component, and non-ASCII was prohibited in the host >> > > component, and there was no requirement telling URI consumers >> > > what to do if they should find either of those things in the host >> > > component, so now we have different implementations behaving >> > > differently when they encounter such things.) >> > >> > Well, yes. But that's just a result of how things are implemented, >> > not a problem in the specification, I guess. >> >>I think it's a problem in the specification. I think we've learned the >>hard way that specs need to say what to do when you encounter unexpected >>syntax, otherwise it's difficult to ever extend the syntax. > >I agree. But I don't think the IRI spec is the right place to fix >all the other specs. > > >>RFC-2396 said the host component does not contain percent-escapes, but >>didn't say what to do if it did contain them, so some implementations >>decode the escapes, and some don't, and neither group is wrong. > >And RFC 2396bis fixes that. > > >> > We already made an exception for domain names. I don't want to make >> > any other exceptions. The goal is not a hodgepodge of scheme-specific >> > conventions, but to take advantage of the fact that many URI schemes >> > already are based on UTF-8, many others allow UTF-8 to be used (in >> > many parts at least) and UTF-8 is also the recommendation for new >> > schemes. >> >>I agree with those goals, but there is a distinct possibility that an >>ACE will be defined for email local parts, in which case IRI-to-URI >>converters with knowledge of mailto: will want to use the ACE for >>compatibility with existing mailto: resolvers. > >Yes, in the case such a possibility becomes reality, some converters >might do that, if they think that helps. They will do that whether or >not a spec tells them to. On the other hand, the mailto: URI scheme should >be updated to allow %HH (based on UTF-8) in the LHS, and to otherwise >be better internationalized. > > >>Maybe there are >>other ASCII-only components lurking in existing URI schemes facing >>backward-compatibility challenges similar to those of domain names, and >>maybe they will likewise find it necessary to use the ACE approach to >>internationalization. > >Do you know of any? > > >>The IRI spec would not need to mention any of the individual >>scheme-specific exceptions. It mentions the IDN exception because ihost >>is a potential component of IRIs in general, and domain names are used >>in a great many schemes, but those reasons wouldn't apply to any other >>exceptional components (like email local parts). > >Okay. > > >> > > 2) If the verification failed, or if you didn't recognize the >> > > scheme, then perform the generic conversion to percent-encoded UTF-8 >> > > as described in the IRI draft, and prepend the prefix i- to the >> > > scheme. >> > >> > Why should i- be prepended? >> >>Because URI processing does not include the ASCII-component-check >>(whereas IRI processing, being a new spec, could include the check). >>Blindly dumping non-ASCII characters (even percent-encoded ones) into >>a URI would bypass the check. If the URI contains a component that >>used to be limited to ASCII, legacy implementations might behave in >>unpredictable ways when that component contains (percent-encoded) >>non-ASCII. > >I think there is a tradeoff. Introducing your i- pattern would >mean that the chance that any subsequent URI resolver actually >resolves that URI currently would be zero, and might stay very >close to zero for a very long time. As we know, introducing >a new URI scheme is very hard. > >The alternative is to not use the i-, meaning that already >in quite a few implementations, the URI in question can be >resolved, and this number will be increasing faster than in >the i- case, at the expense of an occasional unpredictability >(which in most cases is just a 'not found'). > >For me, having things actually work, maybe with occasional >hickups, is clearly preferable to a theoretically safe >solution that doesn't work in practice. > > >>Basically, i-foo: means "this identifier was blindly converted from a >>foo: IRI without foo-specific knowledge, so it does not necessarily >>conform to foo: URI syntax, but it does conform to generic URI syntax, >>and you can certainly recover the foo: IRI". > >There are many other ways (e.g. by hand) to create foo: URIs that >don't conform to foo: URI syntax. The IRI draft clearly says that >you are not supposed to use non-ASCII characters where the scheme >can't handle it. Please see >http://www.w3.org/International/iri-edit/draft-duerst-iri.html#UTF8use >for actual text. > > >>Another answer to your question ("Why should i- be prepended?") is: So >>that the IRI spec does not invite applications to violate the IDNA spec. >>The ireg-name component is an IDN-aware slot in schemes that use domain >>names there (because the IRI draft invites the usage of non-ASCII domain >>names there and cites IDNA). The corresponding reg-name slot in the >>URI is IDN-unaware. To convert a foo: IRI to a foo: URI, IDNA requires >>ToASCII to be applied. But when the application doesn't know the >>scheme, the IRI draft invites the application to use percent-encoding >>instead, disregarding the IDNA requirement. > >Well, I think that IDNA tried very hard to predict all cases of >use of IDNs, and put down general rules that would apply for all >cases. But in general, such things are just impossible. reg-name >is a typical example: a slot that can contain both domain names >and other stuff. And URIs are a typical example: In RFC 2396, >this slot only allowed US-ASCIII. In RFC 2396bis, %HH is also >allowed. Implementations have evolved likewise. > >The IRI spec does the best it reasonably can to navigate in this >area. Requiring everything to be prefixed with -i, in practice >making things less working, just to nominally conform to IDNA, >doesn't seem to make sense. > >Not every application will know all relevant schemes, but the >number of current schemes using DNS in reg-name is not that large, >and any future schemes can be defined to allow %HH from the start. >So in practice, it is not too difficult for IRI implementations to >follow IDNA, and there is definitely nothing in the IRI spec >that says that implementations should disregards IDNA. > > >> > New schemes can be designed so that they fit together well with IRIs >> > (if the relevant BCP guidelines are used, that will be the case >> > automatically). >> >>The resolvers of those new schemes can simply strip off the i- prefix if >>they know that the generic IRI-to-URI conversion is sufficient for those >>schemes. That could be mentioned in the IRI spec and in the guidelines >>for creating new schemes. > >Designing things so that the future gets more complicated, rather >than more straightforward, just to deal with some sloppy specs/ >implementations, does not seem to be a good idea. > > >>By the way, I should insert a rule 0 in my proposed IRI-to-URI >>conversion: >> >>0) If the IRI contains no non-ASCII characters (not even percent-encoded >>ones) then stop; it's already a URI. >> >>(Without this rule, if the scheme was unknown, the only effect of the >>other rules would be to prepend the i- prefix, which would be protecting >>nothing.) > >Well, yes. And don't add a i- prefix if there already is one, >and make sure we reserve all scheme names starting with i-, and >a few other 'details'. Way too much hassle for what it's worth, >sorry. > > >Regards, Martin.
Received on Tuesday, 4 May 2004 04:17:52 UTC