Re: FW: New Version Notification for draft-duerst-iri-bis-07 from Erik van der Poel on 2009-11-01 (public-iri@w3.org from November 2009)

From: Erik van der Poel <erikv@google.com>
Date: Sun, 1 Nov 2009 07:41:25 -0800
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: Larry Masinter <masinter@adobe.com>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <c07a32650911010741r73d2e999l5e3eb6ef2e7bc9b4@mail.gmail.com>

Thanks for the new iri-bis-07 draft. Many of the changes are in the
right direction.

It's great that there are detailed steps for conversion between IRIs
and URIs (in both directions), but to ensure interoperability (while
maintaining security), we need to know how to convert the domain name
part of a URI into a DNS packet (or other name lookup protocol). We
also need to know how to convert the domain name to the HTTP Host:
header. I suppose the HTTP-specific rules should be specified in the
HTTP spec(s), but we probably don't want to put DNS-specific rules
into the main DNS spec(s), do we? In particular, I'm thinking about
the recommendations and rules regarding such things as %2E (%-encoded
dot).
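
To make the question concrete, here is a rough Python sketch of one
possible policy for turning the host part of a URI into DNS labels.
Nothing in it comes from the draft: the function name
host_to_dns_labels is hypothetical, the choice to %-decode before
splitting on dots is an assumption (it is exactly the %2E question
raised above), and Python's built-in "idna" codec implements the
older IDNA 2003 rules.

```python
from urllib.parse import unquote


def host_to_dns_labels(host: str) -> list:
    """Hypothetical policy: %-decode the host, then apply IDNA, then split."""
    # Assumption: %-escapes in the host are UTF-8 and are decoded first,
    # so an escaped dot (%2E) ends up acting as a label separator.
    decoded = unquote(host, encoding="utf-8", errors="strict")
    # Python's built-in codec converts each non-ASCII label to Punycode.
    ascii_host = decoded.encode("idna")
    return ascii_host.split(b".")


print(host_to_dns_labels("www%2Eexample.com"))
print(host_to_dns_labels("b%C3%BCcher.example"))
```

A stricter policy could instead reject any host containing %2E; the
sketch only shows why the spec needs to pick one.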

Although we probably want to recommend "pure" IRIs and URIs (to
content producers), we will find mixtures of %-encoded and
not-%-encoded text in the real world. We probably need to be a bit
more explicit about rules and recommendations for this in the URI <->
IRI conversions (in both directions).
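
As an illustration of such a mixture, the sketch below %-encodes only
the characters that are not yet encoded and leaves existing escapes
alone. The function name iri_component_to_uri is hypothetical, and
using UTF-8 for the not-yet-encoded characters is an assumption; the
draft may require different handling.

```python
def iri_component_to_uri(component: str) -> str:
    """Percent-encode non-ASCII characters; keep existing %XX escapes as-is."""
    out = []
    for ch in component:
        if ord(ch) < 0x80:
            # ASCII, including "%" from escapes already present, is untouched.
            out.append(ch)
        else:
            # Assumption: raw non-ASCII characters are encoded as UTF-8.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


# A path that mixes an existing escape with a raw non-ASCII character.
print(iri_component_to_uri("/caf%C3%A9/n\u00e9"))
```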

In the IRI to URI conversion steps, we now parse the IRI before
performing any Punycoding and %-encoding. This matches current
implementations. However, I believe we need the analogous change in
the URI to IRI conversion steps. I.e. we need to parse the URI and
then use a single character encoding (charset) for each URI component
(mainly /path and ?query). The current draft says "Re-percent-encode
any octet produced in step 2 that is not part of a strictly legal
UTF-8 octet sequence." This would break some URIs, since it specifies
a per-octet rule rather than the per-component rule. In the IRI to URI
conversion, we only have one charset (the "document" charset), but in
the URI to IRI conversion, we potentially have more than one charset
(e.g. /path is UTF-8 and ?query is GB2312). Such mixtures are rare,
and content producers should be warned not to use them, but
implementers need to know how to process such exceptions.
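
The difference between the two rules can be sketched in Python as
follows. This is a simplified illustration, not the draft's
algorithm: it only looks at %-escapes for octets >= 0x80, it skips
the draft's step that excludes characters unsuitable for IRIs, and
the names per_octet and per_component are hypothetical.

```python
import codecs
import re


def _repercent(exc):
    """Error handler: put octets that are not valid UTF-8 back as %XX."""
    bad = exc.object[exc.start:exc.end]
    return "".join(f"%{b:02X}" for b in bad), exc.end


codecs.register_error("repercent", _repercent)

# A run of one or more %XX escapes whose value is >= 0x80.
_HIGH_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def _run_bytes(run: str) -> bytes:
    return bytes.fromhex(run.replace("%", ""))


def per_octet(component: str) -> str:
    """Decode every valid UTF-8 sequence; re-escape stray octets one by one."""
    return _HIGH_RUN.sub(
        lambda m: _run_bytes(m.group()).decode("utf-8", "repercent"),
        component,
    )


def per_component(component: str) -> str:
    """Decode only if every escaped run in the component is valid UTF-8."""
    try:
        for run in _HIGH_RUN.findall(component):
            _run_bytes(run).decode("utf-8")
    except UnicodeDecodeError:
        return component  # treat the whole component as some other charset
    return _HIGH_RUN.sub(
        lambda m: _run_bytes(m.group()).decode("utf-8"), component
    )


query = "q=%D6%D0%C4%A3"  # four octets from a GB2312-encoded query
print(per_octet(query))      # the last two octets happen to be valid UTF-8
print(per_component(query))  # left untouched
```

With the per-octet rule, the octets C4 A3 are decoded as U+0123 even
though they belong to GB2312 text, so the component ends up half
escaped and half decoded. With the per-component rule, the query is
left as it was, while a path such as /caf%C3%A9 is still decoded.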

Erik

On Thu, Oct 29, 2009 at 12:18 AM, "Martin J. Dürst"
<duerst@it.aoyama.ac.jp> wrote:
> On 2009/10/29 10:20, Larry Masinter wrote:
>>
>> Due to some personal difficulties, the split of the document
>> into three parts (parsing, domain names, BCP on character
>> handling, BIDI, etc.) didn't happen. However, Martin did
>> heroically get a new draft out based on some of the
>> interim work.
>
> I admit that I got a new draft out, but I have to strongly deny
> "heroically". Most of the changes are from Larry, and the only thing I did
> was to tweak a few things where I had opinions that differed somewhat from
> Larry, and to submit a draft before the deadline just so that we have
> something in the repository.
>
> Anyway, please have a look and comment!
>
> Regards,   Martin.
>
> --
> #-# Martin J. Dürst, Professor, Aoyama Gakuin University
> #-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
>
>

Received on Sunday, 1 November 2009 15:41:59 UTC