Re: IRI update for "parse then (if needed, translate)" vs "translate, then parse" from Erik van der Poel on 2009-09-15 (public-iri@w3.org from September 2009)

From: Erik van der Poel <erikv@google.com>
Date: Tue, 15 Sep 2009 08:43:08 -0700
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: Larry Masinter <masinter@adobe.com>, "Roy T. Fielding" <fielding@gbiv.com>, "Henry S. Thompson" <ht@cogsci.ed.ac.uk>, "tag@w3.org" <tag@w3.org>, "public-iri@w3.org" <public-iri@w3.org>, Michel SUIGNARD <Michel@suignard.com>
Message-ID: <c07a32650909150843y32652448m3108003d1344301e@mail.gmail.com>

Yes, a relative reference must definitely be parsed before absolutizing. Two
backslashes after http: are also treated as (forward) slashes by major
browsers.
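
To illustrate why the order matters, here is a minimal Python sketch. It is not taken from the draft or from any browser's code. The helper names (slashify, resolve_early, resolve_late) are made up for this example, and the rule "treat '\' as '/' only before the first '?' or '#'" is an assumption chosen to keep the sketch small.

```python
import re
from urllib.parse import urljoin


def slashify(ref: str) -> str:
    """Replace backslashes with slashes, but only before the first '?' or '#'.

    Assumption for this sketch: query and fragment are left untouched.
    """
    m = re.search(r"[?#]", ref)
    cut = m.start() if m else len(ref)
    return ref[:cut].replace("\\", "/") + ref[cut:]


def resolve_early(base: str, ref: str) -> str:
    """Translate the reference first, then resolve it against the base."""
    return urljoin(base, slashify(ref))


def resolve_late(base: str, ref: str) -> str:
    """Resolve first, then translate: '..\\' is not seen as a dot segment."""
    return slashify(urljoin(base, ref))


base = "http://example.org/a/b/c.html"
ref = "..\\img\\a.png"

print(resolve_early(base, ref))  # dot segment is resolved against the base
print(resolve_late(base, ref))   # a literal '..' segment is left in the path
print(slashify("http:\\\\example.org\\x"))  # two backslashes after the scheme
```

With urljoin, the late variant leaves an unresolved ".." in the path, because the backslash-separated reference is treated as a single path segment at resolution time.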

Erik

On Mon, Sep 14, 2009 at 8:32 PM, "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:
> [I have added public-iri@w3.org to the cc list.]
>
> On 2009/09/03 9:12, Larry Masinter wrote:
>>
>> This took a while, but here's the next cut at recasting IRIs and dealing
>> with "web address":
>>
>>
>> http://larry.masinter.net/iribis-hack.html
>> http://larry.masinter.net/iribis-hack.txt
>> http://larry.masinter.net/iribis-hack.xml
>
>>
>> http://tools.ietf.org/rfcdiff?url1=draft-duerst-iri-bis.txt&url2=http://larry.masinter.net/iribis-hack.txt
>
>
> I have read through this new draft, trying to concentrate on the changed
> pieces.
>
> Overall, I'm scared about the tendency to use MUST without much more careful
> examination.
>
> I have nothing against *allowing* scheme-specific short-cuts, optimizations,
> or short-time backwards compatibility variants, but in the long term, UTF-8
> is much more important than punycode, and scheme-independent processing
> isn't something to be thrown away easily.
>
> But I'm quite confident that such a result can be obtained with a new
> version of the draft which is much closer to the current -06.txt than the
> one discussed here.
>
> In some more detail:
>
> Title: I strongly suggest removing "URI" from the title and limiting the
> current effort to the "allow scheme-specific conversion" part and the
> "LEIRI/Web address" part. These alone are quite serious, and we can always
> make another rewrite effort once these have been addressed really
> successfully, although I can't see the need for getting too much involved in
> URIs at all at the moment.
>
> Abstract: "TO ALLOW RECONCILIATION WITH CURRENT PRACTICE": This is too
> strong. At least change to "SOME CURRENT PRACTICE", as there are definitely
> implementations that can handle %-encoding in regnames, and there will be
> more as time goes on (see Roy's point about the Host header in HTTP).
>
> Abstract, shortest para: 'addition' ... 'additional': reword to avoid
> repetition.
>
> Document structure: Section 5.5 is the wrong place for LEIRIs and friends
> (I'll call these legacy addresses from now on). What we need is a short
> notice in section 5 (Normalization/Comparison) about legacy addresses, but
> legacy addresses in and by themselves need a separate section (the more I
> think about it, the more my conclusion is that an appendix is the best
> place).
>
> Introduction: "increasing numbers of protocols" -> "an increasing number of
> protocols" (one number, many protocols) (that may have been in there for
> ages)
>
> Definitions: "parsed IRI component": Don't start a definition with
> "similarly". (definitions should be reasonably usable outside of context)
>
> 3., before 3.1: This clearly needs more text talking about the overall
> choices and procedures.
>
> 3.1 "Convert to UCS" -> "Converting to UCS" (some other titles have the same
> problem; verbs don't work well in titles)
>
> 3.1, first para: Remove "or octet stream...". Of course the "sequence of
> Unicode characters" will be represented somehow, but that's not relevant
> here.
>
> 3.12 para 2: "benormalized" -> "be normalized"
>
> 3.2, para 1: "IRI. this" -> "IRI. This"
>
> 3.2, para 1: Is the intent to say that relative URIs should be absolutized
> first, and then parsed? If yes, then say so. If no, say what else. I'm
> absolutely not sure that this will work; we have to very carefully check all
> kinds of interactions (relative -> absolute does some parsing as far as I
> understand, and HTML5 tries to convert '\' to '/' in paths, which probably
> also interacts).
>
> 3.2, para 2: What about unknown schemes? Simply give up, or what?
>
> 3.2, para 3: Needs much more care and detail, and can't stay a Note.
>
> 3.2, para 4: "Subseqent processing rules may be used to define other
> syntactic components.": What exactly is this supposed to mean???
>
> 3.3, para 3 (NOTE): Why is a MAY harmful? IRIs are well-defined, and we have
> to allow implementations to process only valid ones, and not other garbage.
> "The non-printable characters should be stripped by most software, so by the
> time you get here...": This reads like a "survival of the fittest" for
> control characters.
>
> 3.3: "Hex encode" -> "%-encode". (That's what both RFC 3986 and 3987 have
> used, and even if many people (incl. me) don't like it, there's no reason to
> change it just to create even more confusion.)
>
> 3.4, para 1: Again an unjustified MUST. There are implementations that don't
> do this, for good reasons, and they work and shouldn't be made
> nonconformant. Also, we have to work on what to do for IDNA2008 here.
>
> 3.4, para 2: If ToASCII fails, then it fails. End of story. That's another
> reason why converting to %-encoding makes sense; IRIs/URIs cannot and
> shouldn't be concerned with the details of the various namespaces that they
> contain or grandfather.
>
> 3.4, Note 1: "The server side implementation would be responsible": "would
> be" -> "is".
>
> 3.4, Note 2: What about e.g. http://r%C3%A9sum%C3%A9.example.org in an IRI?
> Will that get converted to punycode, or not?
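
One possible reading of that question, as a hedged Python sketch: percent-decode the reg-name as UTF-8, then apply ToASCII. Whether the draft intends this conversion is exactly what is being asked, so this shows only one candidate answer. The function name host_to_ascii is made up for this example, and Python's built-in "idna" codec implements IDNA2003 (RFC 3490), not IDNA2008.

```python
from urllib.parse import urlsplit, unquote


def host_to_ascii(iri: str) -> str:
    """Percent-decode the host as UTF-8, then encode each label with ToASCII."""
    host = urlsplit(iri).hostname or ""
    # errors="strict" makes invalid UTF-8 escapes raise instead of passing through.
    decoded = unquote(host, encoding="utf-8", errors="strict")
    # The "idna" codec raises UnicodeError if ToASCII fails for a label.
    return decoded.encode("idna").decode("ascii")


print(host_to_ascii("http://r%C3%A9sum%C3%A9.example.org"))
```

If either step fails, the sketch simply raises; it does not fall back to %-encoding.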
>
> 3.4, Note 3: This needs to go somewhere else, it doesn't fit here.
>
> 3.5: This is webaddress-specific, needs to be moved.
>
> 3.6: Now we suddenly have a SHOULD. Does this trump all the MUSTs in the
> details, or what?
>
> 3.7.1, last example: There's some inconsistency re. "natto" (maybe from a
> long time ago)
>
>
> 7.: Clarifying that URI schemes are also IRI schemes is a good idea. But
> this does it the wrong way: It separates URI schemes and IRI schemes, and
> claims that only four schemes (ftp, http, https, imap) can be used with IRIs
> when actually there are quite a few more. (what was the criterion for
> obtaining the above small list?)
>
>
> That's what I have for the moment.
>
>
> Regards,   Martin.
>
> --
> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
>
Received on Tuesday, 15 September 2009 15:43:55 UTC