Re: IRI update for "parse then (if needed, translate)" vs "translate, then parse" from Martin J. Dürst on 2009-09-15 (public-iri@w3.org from September 2009)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Tue, 15 Sep 2009 12:32:19 +0900
To: Larry Masinter <masinter@adobe.com>
CC: "Roy T. Fielding" <fielding@gbiv.com>, Erik van der Poel <erikv@google.com>, "Henry S. Thompson" <ht@cogsci.ed.ac.uk>, "tag@w3.org" <tag@w3.org>, "public-iri@w3.org" <public-iri@w3.org>, Michel SUIGNARD <Michel@suignard.com>
Message-ID: <4AAF0AC3.5050505@it.aoyama.ac.jp>

[I have added public-iri@w3.org to the cc list.]

On 2009/09/03 9:12, Larry Masinter wrote:
> This took a while, but here's the next cut at recasting IRIs and dealing with "web address":
>
>
> http://larry.masinter.net/iribis-hack.html
> http://larry.masinter.net/iribis-hack.txt
> http://larry.masinter.net/iribis-hack.xml
>
http://tools.ietf.org/rfcdiff?url1=draft-duerst-iri-bis.txt&url2=http://larry.masinter.net/iribis-hack.txt

I have read through this new draft, trying to concentrate on the changed
pieces.

Overall, I'm scared about the tendency to use MUST without much more
careful examination.

I have nothing against *allowing* scheme-specific short-cuts,
optimizations, or short-time backwards compatibility variants, but in
the long term, UTF-8 is much more important than punycode, and
scheme-independent processing isn't something to be thrown away easily.

But I'm quite confident that such a result can be obtained with a new
version of the draft which is much closer to the current -06.txt than
the one discussed here.

In some more detail:

Title: I strongly suggest removing "URI" from the title and limiting the
current effort to the "allow scheme-specific conversion" part and the
"LEIRI/Web address" part. These alone are quite serious, and we can
always make another rewrite effort once these have been addressed really
successfully, although I can't see the need for getting too much
involved in URIs at all at the moment.

Abstract: "TO ALLOW RECONCILIATION WITH CURRENT PRACTICE": This is too
strong. At least change to "SOME CURRENT PRACTICE", as there are
definitely implementations that can handle %-encoding in regnames, and
there will be more as time goes on. (see Roy's point about the Host
header in HTTP).

Abstract, shortest para: 'addition' ... 'additional': reword to avoid
repetition.

Document structure: Section 5.5 is the wrong place for LEIRIs and
friends (I'll call these legacy addresses from now on). What we need is
a short notice in section 5 (Normalization/Comparison) about legacy
addresses, but legacy addresses in and by themselves need a separate
section (the more I think about it, the more my conclusion is that an
appendix is the best place).

Introduction: "increasing numbers of protocols" -> "an increasing number
of protocols" (one number, many protocols) (that may have been in there
for ages)

Definitions: "parsed IRI component": Don't start a definition with
"similarly". (definitions should be reasonably usable outside of context)

3., before 3.1: This clearly needs more text talking about the overall
choices and procedures.

3.1 "Convert to UCS" -> "Converting to UCS" (some other titles have the
same problem; verbs don't work well in titles)

3.1, first para: Remove "or octet stream...". Of course the "sequence of
Unicode characters" will be represented somehow, but that's not relevant
here.

3.1, para 2: "benormalized" -> "be normalized"

3.2, para 1: "IRI. this" -> "IRI. This"

3.2, para 1: Is the intent to say that for relative URIs, they should be
absolutized first, and then parsed? If yes, then say so. If no, say what
else. I'm absolutely not sure that this will work; we have to very
carefully check all kinds of interactions (relative -> absolute does
some parsing as far as I understand, and HTML5 tries to convert '\' to
'/' in paths, which probably also interacts).
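
To make the interaction concrete, here is a minimal sketch using Python's
standard urllib.parse. It is an illustration only: the base and the
references are made up, and urljoin follows generic RFC 3986-style
resolution, not the HTML5 rules. It shows that resolving a relative
reference already involves splitting the path on '/', so whether '\' is
rewritten before or after that step changes the result.

```python
from urllib.parse import urljoin

BASE = "http://example.org/a/b"  # hypothetical base, for illustration only

def resolve(reference, backslash_fix=False):
    # Optionally mimic a '\' -> '/' rewrite *before* resolution (an assumed
    # ordering; the draft under review does not pin this down).
    if backslash_fix:
        reference = reference.replace("\\", "/")
    return urljoin(BASE, reference)

print(resolve("../c"))                        # generic dot-segment resolution
print(resolve("..\\c"))                       # '\' is just a path character here
print(resolve("..\\c", backslash_fix=True))   # rewritten first, then resolved
```

Without the rewrite, "..\c" is one opaque path segment; with it, the same
input becomes a dot-segment and climbs a directory.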

3.2, para 2: What about unknown schemes? Simply give up, or what?

3.2, para 3: Needs much more care and detail, and can't stay a Note.

3.2, para 4: "Subseqent processing rules may be used to define other
syntactic components.": What exactly is this supposed to mean???

3.3, para 3 (NOTE): Why is a MAY harmful? IRIs are well-defined, and we
have to allow implementations to process only valid ones, and not other
garbage.
"The non-printable characters should be stripped by most software, so by
the time you get here...": This reads like a "survival of the fittest"
for control characters.

3.3: "Hex encode" -> "%-encode" (That's what both RFC 3986 and 3987 have
used, and even if many people (incl. me) don't like it, there's no
reason to change it just to create even more confusion.)
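
For readers less familiar with the term, a small sketch of what
%-encoding means in RFC 3986/3987 terms: the characters are encoded as
UTF-8 and each resulting octet outside the unreserved set is written as
%HH. The helper name percent_encode is hypothetical; quote is the
standard-library function.

```python
from urllib.parse import quote

def percent_encode(text):
    # Encode as UTF-8, then write every octet outside the unreserved set
    # (letters, digits, "-", ".", "_", "~") as %HH.
    return quote(text, safe="", encoding="utf-8")

print(percent_encode("résumé"))
```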

3.4, para 1: Again an unjustified MUST. There are implementations that
don't do this, for good reasons, and they work and shouldn't be made
nonconformant. Also, we have to work on what to do for IDNA2008 here.

3.4, para 2: If ToASCII fails, then it fails. End of story. That's
another reason why converting to %-encoding makes sense; IRIs/URIs
cannot and shouldn't be concerned with the details of the various
namespaces that they contain or grandfather.

3.4, Note 1: "The server side implementation would be responsible":
"would be" -> "is".

3.4, Note 2: What about e.g. http://r%C3%A9sum%C3%A9.example.org in an
IRI? Will that get converted to punycode, or not?
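
One possible reading, sketched in Python purely for illustration: undo
the %-encoding in the regname first, then apply ToASCII. The function
name regname_to_ascii is hypothetical, and Python's built-in "idna"
codec implements IDNA2003, so this says nothing about IDNA2008. Whether
the draft intends this order is exactly the open question.

```python
from urllib.parse import unquote, urlsplit

def regname_to_ascii(iri):
    # Pull out the host part; urlsplit lowercases it, which is harmless here.
    host = urlsplit(iri).hostname
    # Assumption: %-encoded octets in the regname are UTF-8 and are decoded
    # before ToASCII is applied.
    decoded = unquote(host, encoding="utf-8", errors="strict")
    # The stdlib "idna" codec performs nameprep and ToASCII (IDNA2003).
    return decoded.encode("idna").decode("ascii")

print(regname_to_ascii("http://r%C3%A9sum%C3%A9.example.org"))
print(regname_to_ascii("http://résumé.example.org"))
```

Under that reading both spellings end up as the same punycode host. An
implementation that skips the decoding step would leave the %-encoded
form untouched.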

3.4, Note 3: This needs to go somewhere else, it doesn't fit here.

3.5: This is webaddress-specific, needs to be moved.

3.6: Now we suddenly have a SHOULD. Does this trump all the MUSTs in the
details, or what?

3.7.1, last example: There's some inconsistency re. "natto" (maybe from
a long time ago)

7.: Clarifying that URI schemes are also IRI schemes is a good idea. But
this does it the wrong way: It separates URI schemes and IRI schemes,
and claims that only four schemes (ftp, http, https, impa) can be used
with IRIs when actually there are quite a few more. (what was the
criterion for obtaining the above small list?)

That's what I have for the moment.

Regards, Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp

Received on Tuesday, 15 September 2009 11:25:05 UTC