- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Tue, 15 Sep 2009 12:32:19 +0900
- To: Larry Masinter <masinter@adobe.com>
- CC: "Roy T. Fielding" <fielding@gbiv.com>, Erik van der Poel <erikv@google.com>, "Henry S. Thompson" <ht@cogsci.ed.ac.uk>, "tag@w3.org" <tag@w3.org>, "public-iri@w3.org" <public-iri@w3.org>, Michel SUIGNARD <Michel@suignard.com>
[I have added public-iri@w3.org to the cc list.]

On 2009/09/03 9:12, Larry Masinter wrote:
> This took a while, but here's the next cut at recasting IRIs and dealing with "web address":
>
> http://larry.masinter.net/iribis-hack.html
> http://larry.masinter.net/iribis-hack.txt
> http://larry.masinter.net/iribis-hack.xml
> http://tools.ietf.org/rfcdiff?url1=draft-duerst-iri-bis.txt&url2=http://larry.masinter.net/iribis-hack.txt

I have read through this new draft, trying to concentrate on the changed pieces.

Overall, I'm scared about the tendency to use MUST without much more careful examination. I have nothing against *allowing* scheme-specific short-cuts, optimizations, or short-term backwards-compatibility variants, but in the long term, UTF-8 is much more important than punycode, and scheme-independent processing isn't something to be thrown away easily. But I'm quite confident that such a result can be obtained with a new version of the draft which is much closer to the current -06.txt than the one discussed here.

In some more detail:

Title: I strongly suggest removing "URI" from the title and limiting the current effort to the "allow scheme-specific conversion" part and the "LEIRI/Web address" part. These alone are quite serious, and we can always make another rewrite effort once these have been addressed really successfully, although I can't see the need for getting too much involved in URIs at all at the moment.

Abstract: "TO ALLOW RECONCILIATION WITH CURRENT PRACTICE": This is too strong. At least change to "SOME CURRENT PRACTICE", as there are definitely implementations that can handle %-encoding in regnames, and there will be more as time goes on (see Roy's point about the Host header in HTTP).

Abstract, shortest para: 'addition' ... 'additional': reword to avoid repetition.

Document structure: Section 5.5 is the wrong place for LEIRIs and friends (I'll call these legacy addresses from now on).
What we need is a short notice in section 5 (Normalization/Comparison) about legacy addresses, but legacy addresses in and by themselves need a separate section (the more I think about it, the more my conclusion is that an appendix is the best place).

Introduction: "increasing numbers of protocols" -> "an increasing number of protocols" (one number, many protocols) (that may have been in there for ages)

Definitions: "parsed IRI component": Don't start a definition with "similarly". (Definitions should be reasonably usable outside of context.)

3., before 3.1: This clearly needs more text talking about the overall choices and procedures.

3.1: "Convert to UCS" -> "Converting to UCS" (some other titles have the same problem; verbs don't work well in titles)

3.1, first para: Remove "or octet stream...". Of course the "sequence of Unicode characters" will be represented somehow, but that's not relevant here.

3.12 para 2: "benormalized" -> "be normalized"

3.2, para 1: "IRI. this" -> "IRI. This"

3.2, para 1: Is the intent to say that relative URIs should be absolutized first, and then parsed? If yes, then say so. If no, say what else. I'm absolutely not sure that this will work; we have to very carefully check all kinds of interactions (relative -> absolute does some parsing as far as I understand, and HTML5 tries to convert '\' to '/' in paths, which probably also interacts).

3.2, para 2: What about unknown schemes? Simply give up, or what?

3.2, para 3: Needs much more care and detail, and can't stay a Note.

3.2, para 4: "Subseqent processing rules may be used to define other syntactic components.": What exactly is this supposed to mean???

3.3, para 3 (NOTE): Why is a MAY harmful? IRIs are well-defined, and we have to allow implementations to process only valid ones, and not other garbage. "The non-printable characters should be stripped by most software, so by the time you get here...": This reads like a "survival of the fittest" for control characters.
3.3: "Hex encode" -> %-encode (That's what both RFC 3986 and 3987 have used, and even if many people (incl. me) don't like it, there's no reason to change it just to create even more confusion.)

3.4, para 1: Again an unjustified MUST. There are implementations that don't do this, for good reasons, and they work and shouldn't be made nonconformant. Also, we have to work on what to do for IDNA2008 here.

3.4, para 2: If ToASCII fails, then it fails. End of story. That's another reason why converting to %-encoding makes sense; IRIs/URIs cannot and shouldn't be concerned with the details of the various namespaces that they contain or grandfather.

3.4, Note 1: "The server side implementation would be responsible": "would be" -> "is".

3.4, Note 2: What about e.g. http://r%C3%A9sum%C3%A9.example.org in an IRI? Will that get converted to punycode, or not?

3.4, Note 3: This needs to go somewhere else; it doesn't fit here.

3.5: This is webaddress-specific and needs to be moved.

3.6: Now we suddenly have a SHOULD. Does this trump all the MUSTs in the details, or what?

3.7.1, last example: There's some inconsistency re. "natto" (maybe from a long time ago).

7.: Clarifying that URI schemes are also IRI schemes is a good idea. But this does it the wrong way: It separates URI schemes and IRI schemes, and claims that only four schemes (ftp, http, https, imap) can be used with IRIs when actually there are quite a few more. (What was the criterion for obtaining the above small list?)

That's what I have for the moment.

Regards, Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
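
Illustration for the 3.4 comments (a minimal sketch, not taken from the draft): the two candidate representations of a non-ASCII host name can be compared with Python's standard library. The helper names `host_to_percent` and `host_to_ascii` are hypothetical, and the sketch assumes Python's built-in "idna" codec, which implements ToASCII as defined for IDNA2003 (RFC 3490), not IDNA2008.

```python
from urllib.parse import quote, unquote


def host_to_percent(reg_name: str) -> str:
    """Hypothetical helper: %-encode the UTF-8 octets of non-ASCII characters."""
    # quote() leaves ASCII letters, digits and "_.-~" untouched and encodes
    # everything else as UTF-8 %XX escapes.
    return quote(reg_name, safe="")


def host_to_ascii(reg_name: str) -> str:
    """Hypothetical helper: undo %-encoding, then apply IDNA2003 ToASCII."""
    # Decode %XX escapes as UTF-8; invalid UTF-8 raises UnicodeDecodeError.
    decoded = unquote(reg_name, encoding="utf-8", errors="strict")
    # If ToASCII fails, the codec raises UnicodeError and we let it propagate.
    return decoded.encode("idna").decode("ascii")


if __name__ == "__main__":
    print(host_to_percent("résumé.example.org"))
    print(host_to_ascii("r%C3%A9sum%C3%A9.example.org"))
```

In this sketch the %-encoded form is decoded before ToASCII is applied, so both spellings of the host end up as the same "xn--" label. Whether an IRI processor should do that is exactly the open question in 3.4, Note 2; the code only shows that the conversion is mechanically possible.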
Received on Tuesday, 15 September 2009 11:25:05 UTC