Re: Announcing draft-duerst-iri-02.txt from Martin Duerst on 2002-11-26 (www-international@w3.org from October to December 2002)

From: Martin Duerst <duerst@w3.org>
Date: Wed, 27 Nov 2002 05:26:39 +0900
To: "Mark Davis" <mark.davis@jtcsv.com>
Cc: <www-international@w3.org>
Message-Id: <4.2.0.58.J.20021127052439.049879a0@localhost>
Hello Mark,

We just updated that section and added some (very simple)
examples. Please have a look again at
http://www.w3.org/International/iri-edit/draft-duerst-iri.txt

Regards,    Martin.

At 16:22 02/11/20 -0800, Mark Davis wrote:
>It would just protect against a dumb reading. And lord knows, you will get
>those.
>
>Suppose Joe Blow has: "A<C2>B<C0>".
>
> > > >   3) Re-escape any octets produced in step 2) that are not part of
> > > >      a/any strictly legal UTF-8 octet sequence.
>
>For Rule 3, Joe thinks: "Hmmm. C2 *is* part of a legal UTF-8 octet sequence,
>the sequence <C2><80>. Therefore, I should not re-escape it. However, <C0>
>is never part of any legal UTF-8 octet sequence. Therefore I will re-escape
>that one!"
>
>So after applying Rule 3, Joe has: "A<C2>B%C0".
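For comparison, here is a minimal Python sketch of the stricter reading of Rule 3 applied to Joe's string, where an octet counts as legal only if it belongs to a legal sequence at its actual position in the buffer. The function name step3 and the error-handler name "iri-reescape" are hypothetical, and Python's strict UTF-8 decoder is assumed to be an acceptable stand-in for "strictly legal UTF-8".

```python
import codecs


def _percent_escape(exc):
    """Error handler: percent-escape the octets the decoder rejected."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    escaped = "".join("%{:02X}".format(b) for b in bad)
    # Resume decoding right after the rejected octets.
    return escaped, exc.end


# "iri-reescape" is a hypothetical handler name chosen for this sketch.
codecs.register_error("iri-reescape", _percent_escape)


def step3(octets: bytes) -> str:
    """Decode legal UTF-8 sequences; re-escape every other octet."""
    return octets.decode("utf-8", errors="iri-reescape")


# Joe's string "A<C2>B<C0>" as raw octets.
print(step3(b"A\xc2B\xc0"))  # A%C2B%C0
```

Under this reading <C2> is re-escaped as well, because in this string it is followed by "B" rather than by a continuation octet.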
>
>Mark
>__________________________________
>http://www.macchiato.com
>►  “Eppur si muove” ◄
>----- Original Message -----
>From: "Martin Duerst" <duerst@w3.org>
>To: "Mark Davis" <mark.davis@jtcsv.com>
>Cc: <www-international@w3.org>
>Sent: Wednesday, November 20, 2002 14:08
>Subject: Re: Announcing draft-duerst-iri-02.txt
>
>
> >
> > Hello Mark,
> >
> > Many thanks for your comment.
> >
> > At 13:41 02/11/18 -0800, Mark Davis wrote:
> > > >   3) Re-escape any octets produced in step 2) that are not part of
> > > >      a/any strictly legal UTF-8 octet sequence.
> > >
> > >Changing 'any' for the second doesn't work. And some of the octets may
> > >have come from #1 (I guess)
> >
> > As a URI is always purely ASCII, there is no possibility of
> > non-UTF-8 octets coming in during step 1.
> >
> >
> > >I'd recommend:
> > >
> > >Re-escape any octet that is not part of a strictly legal UTF-8 octet
> > >sequence within the sequence of octets representing the URI.
> >
> > I have made the first change, namely  'octets'  ->  'octet'.
> >
> > I'm not sure about the second part,
> > "within the sequence of octets representing the URI".
> > It seems a bit too obvious to me.
> >
> > Regards,   Martin.
> >
> >
> > >[a bit clumsy -- perhaps that buffer can be given a name in definition #1]
> > >
> > >Mark
> > >__________________________________
> > >http://www.macchiato.com
> > >►  “Eppur si muove” ◄
> > >----- Original Message -----
> > >From: "Martin Duerst" <duerst@w3.org>
> > >To: "Mark Davis" <mark@macchiato.com>
> > >Cc: <www-international@w3.org>
> > >Sent: Monday, November 18, 2002 12:47
> > >Subject: Re: Announcing draft-duerst-iri-02.txt
> > >
> > >
> > > >
> > > > Hello Mark,
> > > >
> > > > Many thanks for your comments. Some detail questions below.
> > > > Looking forward to your feedback.
> > > >
> > > > At 13:16 02/11/15 -0800, Mark Davis wrote:
> > > > >One of the steps is the following:
> > > > >
> > > > >        3) Re-escape any octets that are not part of a strictly legal
> > > > >           UTF-8 octet sequence.
> > > > >This needs to be clearer. Suppose you have the invalid sequence:
> > > > >
> > > > >...<C2><C3><80>...
> > > > >
> > > > >One could re-escape the entire sequence,
> > > > >
> > > > >...%C2%C3%80...
> > > > >
> > > > >or one could re-escape the minimal-length invalid sequences,
> > > > >proceeding from right to left.
> > > > >
> > > > >...%C2<C3><80>...
> > > > >
> > > > >I assume that the latter is what is meant, but it should be clearer
> > > > >in the text of the clause. For that matter, any single octet above <7F>
> > > > >is invalid, so a perverse reading of the clause would require all of
> > > > >them to be escaped!
> > > >
> > > > My interpretation is that <C3><80> is a strictly legal UTF-8
> > > > sequence, and therefore the <C3> and <80> octets are part of a
> > > > strictly legal UTF-8 octet sequence, and so only <C2> can be
> > > > re-escaped.
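The interpretation above can be sketched in Python as follows. This is only an illustration under assumptions: the function name reescape_invalid is hypothetical, the scan runs left to right, and Python's strict UTF-8 decoder is taken as the definition of "strictly legal". At each position the sketch looks for the shortest legal sequence starting there; if there is none, that single octet is re-escaped.

```python
def reescape_invalid(octets: bytes) -> str:
    """Keep legal UTF-8 sequences as characters; re-escape stray octets."""
    out = []
    i = 0
    while i < len(octets):
        # A legal UTF-8 sequence is 1 to 4 octets long.
        for n in (1, 2, 3, 4):
            chunk = octets[i:i + n]
            try:
                text = chunk.decode("utf-8")  # strict decoding
            except UnicodeDecodeError:
                continue
            out.append(text)
            i += len(chunk)
            break
        else:
            # No legal sequence starts here: re-escape this one octet.
            out.append("%{:02X}".format(octets[i]))
            i += 1
    return "".join(out)


# The sequence <C2><C3><80> from the discussion.
print(reescape_invalid(b"\xc2\xc3\x80"))  # %C2À
```

Only <C2> is re-escaped; <C3><80> survives as U+00C0. The "perverse" reading would instead produce %C2%C3%80.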
> > > >
> > > > What would you propose to make this easier to understand?
> > > > Would it be better to replace 'a' by 'any'?
> > > >
> > > >   3) Re-escape any octets that are not part of any strictly legal
> > > >      UTF-8 octet sequence.
> > > >
> > > > Or do you have another idea of how to make this clearer?
> > > >
> > > > >        4) Re-escape all octets that in UTF-8 represent characters
> > > > >           that are not appropriate according to Section 5.1.
> > > > >Should this not also say Section 4.1?
> > > >
> > > > Good point. Done.
> > > >
> > > >
> > > > >It is also unclear what to do with a sequence like %G1. Does it turn
> > > > >into %25G1?
> > > >
> > > > That's not a legal URI, so it is not a legal input. So we should
> > > > never get it. If we get it, it's not converted to an octet in
> > > > step 2), and can therefore not be re-escaped. But maybe it would
> > > > help to say clearly that the 're-escape' refers to those octets
> > > > produced in step 2):
> > > >
> > > >   3) Re-escape any octets produced in step 2) that are not part of
> > > >      a/any strictly legal UTF-8 octet sequence.
> > > >
> > > > What do you think?
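A small Python sketch of why %G1 never reaches step 3: if step 2 converts only well-formed escapes ("%" followed by two hexadecimal digits), a malformed one is simply left as literal text. The function name unescape_octets is hypothetical, and the sketch is deliberately simplified: it assumes every well-formed escape may be converted and ignores any restrictions the draft places on which escapes to convert.

```python
import re

# A well-formed escape: "%" followed by exactly two hexadecimal digits.
_HEX_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")


def unescape_octets(uri: str) -> bytes:
    """Step 2 sketch: turn each well-formed %HH escape into its octet."""
    raw = uri.encode("ascii")  # a URI is assumed to be pure ASCII
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


# %C3%80 becomes two octets; %G1 is not an escape and stays untouched.
print(unescape_octets("a%C3%80%G1"))  # b'a\xc3\x80%G1'
```

Since "%G1" produces no octet in step 2, there is nothing for step 3 to re-escape, and it does not turn into %25G1.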
> > > >
> > > >
> > > > Regards,    Martin.
> > > >
> > > >
> >
> >
Received on Tuesday, 26 November 2002 15:42:39 UTC