Re: IRI update for "parse then (if needed, translate)" vs "translate, then parse"

[I have added public-iri@w3.org to the cc list.]

On 2009/09/03 9:12, Larry Masinter wrote:
> This took a while, but here's the next cut at recasting IRIs and dealing with "web address":
>
>
> http://larry.masinter.net/iribis-hack.html
> http://larry.masinter.net/iribis-hack.txt
> http://larry.masinter.net/iribis-hack.xml
> http://tools.ietf.org/rfcdiff?url1=draft-duerst-iri-bis.txt&url2=http://larry.masinter.net/iribis-hack.txt


I have read through this new draft, trying to concentrate on the changed 
pieces.

Overall, I'm scared about the tendency to use MUST without much more 
careful examination.

I have nothing against *allowing* scheme-specific short-cuts, 
optimizations, or short-time backwards compatibility variants, but in 
the long term, UTF-8 is much more important than punycode, and 
scheme-independent processing isn't something to be thrown away easily.

But I'm quite confident that such a result can be obtained with a new 
version of the draft which is much closer to the current -06.txt than 
the one discussed here.

In some more detail:

Title: I strongly suggest removing "URI" from the title and limiting the 
current effort to the "allow scheme-specific conversion" part and the 
"LEIRI/Web address" part. These alone are quite serious, and we can 
always make another rewrite effort once these have been addressed 
successfully, although I can't see the need for getting too much 
involved in URIs at all at the moment.

Abstract: "TO ALLOW RECONCILIATION WITH CURRENT PRACTICE": This is too 
strong. At least change to "SOME CURRENT PRACTICE", as there are 
definitely implementations that can handle %-encoding in regnames, and 
there will be more as time goes on. (see Roy's point about the Host 
header in HTTP).

Abstract, shortest para: 'addition' ... 'additional': reword to avoid 
repetition.

Document structure: Section 5.5 is the wrong place for LEIRIs and 
friends (I'll call these legacy addresses from now on). What we need is 
a short notice in section 5 (Normalization/Comparison) about legacy 
addresses, but legacy addresses in and by themselves need a separate 
section (the more I think about it, the more my conclusion is that an 
appendix is the best place).

Introduction: "increasing numbers of protocols" -> "an increasing number 
of protocols" (one number, many protocols) (that may have been in there 
for ages)

Definitions: "parsed IRI component": Don't start a definition with 
"similarly". (definitions should be reasonably usable outside of context)

3., before 3.1: This clearly needs more text talking about the overall 
choices and procedures.

3.1 "Convert to UCS" -> "Converting to UCS" (some other titles have the 
same problem; verbs don't work well in titles)

3.1, first para: Remove "or octet stream...". Of course the "sequence of 
Unicode characters" will be represented somehow, but that's not relevant 
here.

3.1, para 2: "benormalized" -> "be normalized"

3.2, para 1: "IRI. this" -> "IRI. This"

3.2, para 1: Is the intent to say that for relative URIs, they should be 
absolutized first, and then parsed? If yes, then say so. If no, say what 
else. I'm not at all sure that this will work; we have to check all 
kinds of interactions very carefully (relative -> absolute does 
some parsing as far as I understand, and HTML5 tries to convert '\' to 
'/' in paths, which probably also interacts).
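To illustrate why the order matters, here is a minimal sketch using Python's urllib.parse as a stand-in for a generic resolver. The base, the reference, and both function names are made up for illustration; neither the draft nor HTML5 specifies this exact procedure.

```python
from urllib.parse import urljoin

# Hypothetical inputs: a base IRI and a legacy reference using '\'.
BASE = "http://example.org/a/b/c"
REF = "..\\d"

def translate_then_resolve(base: str, ref: str) -> str:
    """Convert '\\' to '/' first, then resolve against the base."""
    return urljoin(base, ref.replace("\\", "/"))

def resolve_then_translate(base: str, ref: str) -> str:
    """Resolve against the base first, then convert '\\' to '/'."""
    return urljoin(base, ref).replace("\\", "/")

early = translate_then_resolve(BASE, REF)
late = resolve_then_translate(BASE, REF)

print(early)
print(late)
# The two results differ: in the second case the resolver saw "..\d"
# as one ordinary path segment, so no dot-segment removal took place.
print(early == late)
```

Resolution already splits the reference into segments, so any translation applied afterwards comes too late to influence dot-segment handling.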

3.2, para 2: What about unknown schemes? Simply give up, or what?

3.2, para 3: Needs much more care and detail, and can't stay a Note.

3.2, para 4: "Subseqent processing rules may be used to define other 
syntactic components.": What exactly is this supposed to mean???

3.3, para 3 (NOTE): Why is a MAY harmful? IRIs are well-defined, and we 
have to allow implementations to process only valid ones, and not other 
garbage.
"The non-printable characters should be stripped by most software, so by 
the time you get here...": This reads like a "survival of the fittest" 
for control characters.

3.3: "Hex encode" -> "%-encode" (That's what both RFC 3986 and 3987 have 
used, and even if many people (incl. me) don't like it, there's no 
reason to change it just to create even more confusion.)
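For readers less familiar with the term, %-encoding here means converting each non-ASCII character to UTF-8 and writing every resulting octet as "%HH". A minimal sketch follows; the function name is made up, and it simplifies by treating every character above U+007F the same way, without the ucschar/iprivate distinctions of RFC 3987.

```python
def percent_encode_non_ascii(component: str) -> str:
    """%-encode only the non-ASCII characters of an IRI component.

    ASCII characters, including existing '%' escapes and delimiters,
    are passed through unchanged.
    """
    out = []
    for ch in component:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # One "%HH" triplet per UTF-8 octet, upper-case hex digits.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)

print(percent_encode_non_ascii("r\u00e9sum\u00e9"))  # r%C3%A9sum%C3%A9
```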

3.4, para 1: Again an unjustified MUST. There are implementations that 
don't do this, for good reasons, and they work and shouldn't be made 
nonconformant. Also, we have to work on what to do for IDNA2008 here.

3.4, para 2: If ToASCII fails, then it fails. End of story. That's 
another reason why converting to %-encoding makes sense; IRIs/URIs 
cannot and shouldn't be concerned with the details of the various 
namespaces that they contain or grandfather.

3.4, Note 1: "The server side implementation would be responsible": 
"would be" -> "is".

3.4, Note 2: What about e.g. http://r%C3%A9sum%C3%A9.example.org in an 
IRI? Will that get converted to punycode, or not?
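One possible answer, sketched in Python under the assumption that the %-escapes are UTF-8 and that ToASCII is applied after undoing them. The function name is made up, and Python's built-in "idna" codec implements IDNA2003, not IDNA2008. Whether the draft intends this behavior is exactly the open question.

```python
from urllib.parse import unquote

def reg_name_to_ascii(reg_name: str) -> str:
    """Undo UTF-8 %-escapes in a reg-name, then apply ToASCII per label."""
    # errors="strict" makes invalid UTF-8 escapes raise instead of
    # being silently replaced.
    decoded = unquote(reg_name, encoding="utf-8", errors="strict")
    # The "idna" codec raises UnicodeError if ToASCII fails for a label.
    return decoded.encode("idna").decode("ascii")

escaped = reg_name_to_ascii("r%C3%A9sum%C3%A9.example.org")
literal = reg_name_to_ascii("r\u00e9sum\u00e9.example.org")
print(escaped)
print(escaped == literal)
```

With this approach the %-encoded and the literal form of the host name end up as the same punycode name, which is what a scheme-independent %-encoding step followed by a scheme-specific conversion would give.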

3.4, Note 3: This needs to go somewhere else, it doesn't fit here.

3.5: This is webaddress-specific, needs to be moved.

3.6: Now we suddenly have a SHOULD. Does this trump all the MUSTs in the 
details, or what?

3.7.1, last example: There's some inconsistency re. "natto" (maybe from 
a long time ago)


7.: Clarifying that URI schemes are also IRI schemes is a good idea. But 
this does it the wrong way: It separates URI schemes and IRI schemes, 
and claims that only four schemes (ftp, http, https, imap) can be used 
with IRIs when actually there are quite a few more. (what was the 
criterion for obtaining the above small list?)


That's what I have for the moment.


Regards,   Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp

Received on Tuesday, 15 September 2009 11:25:05 UTC