Re: Announcing draft-duerst-iri-02.txt from Mark Davis on 2002-11-21 (www-international@w3.org from October to December 2002)

From: Mark Davis <mark.davis@jtcsv.com>
Date: Wed, 20 Nov 2002 16:22:22 -0800
To: "Martin Duerst" <duerst@w3.org>
Cc: <www-international@w3.org>
Message-ID: <006901c290f4$106771f0$7cde2b09@DAVIS1>
It would just protect against a dumb reading. And lord knows, you will get
those.

Suppose Joe Blow has: "A<C2>B<C0>".

> > >   3) Re-escape any octets produced in step 2) that are not part of
> > >      a/any strictly legal UTF-8 octet sequence.

For Rule 3, Joe thinks: "Hmmm. C2 *is* part of a legal UTF-8 octet sequence,
the sequence <C2><80>. Therefore, I should not re-escape it. However, <C0>
is never part of any legal UTF-8 octet sequence. Therefore I will re-escape
that one!"

So after applying Rule 3, Joe has: "A<C2>B%C0".
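
To make the intended reading concrete: treat the whole buffer as one octet
sequence, and re-escape only the octets that a strict UTF-8 decoder rejects
at their position in that sequence. The following is a minimal Python sketch
under that assumption, not text from the draft. The names `_percent_escape`,
`"percent-escape"` and `reescape_invalid_utf8` are made up for illustration,
and the function returns decoded characters rather than raw octets purely
for readability.

```python
import codecs


def _percent_escape(err):
    """Error handler (hypothetical name): %-escape each rejected octet."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Resume decoding right after the rejected octets.
    return "".join(f"%{b:02X}" for b in bad), err.end


# Register the handler under an illustrative name.
codecs.register_error("percent-escape", _percent_escape)


def reescape_invalid_utf8(octets: bytes) -> str:
    """Keep strictly legal UTF-8 sequences, %-escape every other octet."""
    return octets.decode("utf-8", errors="percent-escape")


print(reescape_invalid_utf8(b"A\xc2B\xc0"))    # A%C2B%C0
print(reescape_invalid_utf8(b"\xc2\xc3\x80"))  # %C2 followed by U+00C0
```

Under this reading Joe's <C2> is re-escaped as well, because in his buffer it
is followed by "B" and not by a continuation octet.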

Mark
__________________________________
http://www.macchiato.com
►  “Eppur si muove” ◄

----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
To: "Mark Davis" <mark.davis@jtcsv.com>
Cc: <www-international@w3.org>
Sent: Wednesday, November 20, 2002 14:08
Subject: Re: Announcing draft-duerst-iri-02.txt


>
> Hello Mark,
>
> Many thanks for your comment.
>
> At 13:41 02/11/18 -0800, Mark Davis wrote:
> > >   3) Re-escape any octets produced in step 2) that are not part of
> > >      a/any strictly legal UTF-8 octet sequence.
> >
> >Changing 'any' for the second doesn't work. And some of the octets may
> >have come from #1 (I guess)
>
> As a URI is always purely ASCII, there is no possibility of
> non-UTF-8 octets coming in at step 1.
>
>
> >I'd recommend:
> >
> >Re-escape any octet that is not part of a strictly legal UTF-8 octet
> >sequence within the sequence of octets representing the URI.
>
> I have made the first change, namely  'octets'  ->  'octet'.
>
> I'm not sure about the second part,
> "within the sequence of octets representing the URI".
> It seems a bit too obvious to me.
>
> Regards,   Martin.
>
>
> >[a bit clumsy -- perhaps that buffer can be given a name in definition #1]
> >
> >Mark
> >__________________________________
> >http://www.macchiato.com
> >►  “Eppur si muove” ◄
> >----- Original Message -----
> >From: "Martin Duerst" <duerst@w3.org>
> >To: "Mark Davis" <mark@macchiato.com>
> >Cc: <www-international@w3.org>
> >Sent: Monday, November 18, 2002 12:47
> >Subject: Re: Announcing draft-duerst-iri-02.txt
> >
> >
> > >
> > > Hello Mark,
> > >
> > > Many thanks for your comments. Some detail questions below.
> > > Looking forward to your feedback.
> > >
> > > At 13:16 02/11/15 -0800, Mark Davis wrote:
> > > >One of the steps is the following:
> > > >
> > > > >        3) Re-escape any octets that are not part of a strictly legal
> > > > >           UTF-8 octet sequence.
> > > >This needs to be clearer. Suppose you have the invalid sequence:
> > > >
> > > >...<C2><C3><80>...
> > > >
> > > >One could re-escape the entire sequence,
> > > >
> > > >...%C2%C3%80...
> > > >
> > > > >or one could re-escape the minimal-length invalid sequences,
> > > > >proceeding from right to left.
> > > >
> > > >...%C2<C3><80>...
> > > >
> > > > >I assume that the latter is what is meant, but it should be clearer in
> > > > >the text of the clause. For that matter, any single octet above <7F> is
> > > > >invalid, so a perverse reading of the clause would require all of them
> > > > >to be escaped!
> > >
> > > My interpretation is that <C3><80> is a strictly legal UTF-8
> > > sequence, and therefore the <C3> and <80> octets are part of a
> > > strictly legal UTF-8 octet sequence, and so only <C2> can be
> > > re-escaped.
> > >
> > > What would you propose to make this easier to understand?
> > > Would it be better to replace 'a' by 'any'?
> > >
> > >   3) Re-escape any octets that are not part of any strictly legal
> > >      UTF-8 octet sequence.
> > >
> > > Or do you have another idea of how to make this clearer?
> > >
> > > > >        4) Re-escape all octets that in UTF-8 represent characters
> > > > >           that are not appropriate according to Section 5.1.
> > > >Should this not also say Section 4.1?
> > >
> > > Good point. Done.
> > >
> > >
> > > > >It is also unclear what to do with a sequence like %G1. Does it turn
> > > > >into %25G1?
> > >
> > > That's not a legal URI, so it is not a legal input. So we should
> > > never get it. If we get it, it's not converted to an octet in
> > > step 2), and can therefore not be re-escaped. But maybe it would
> > > help to say clearly that the 're-escape' refers to those octets
> > > produced in step 2):
> > >
> > >   3) Re-escape any octets produced in step 2) that are not part of
> > >      a/any strictly legal UTF-8 octet sequence.
> > >
> > > What do you think?
> > >
> > >
> > > Regards,    Martin.
> > >
> > >
>
>
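
On the %G1 question quoted above: one way to read step 2) is that only
well-formed %HH triplets are converted to octets, so a malformed triplet
such as %G1 is left untouched and is never a candidate for re-escaping.
The Python sketch below illustrates that reading only. The name
`unescape_to_octets` is hypothetical, and the sketch assumes the input is
pure ASCII, as the thread says a URI always is, so any octet above <7F> in
the result can only have been produced by this step.

```python
import re

# Matches only well-formed escapes: '%' followed by two hex digits.
_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")


def unescape_to_octets(uri: str) -> bytes:
    """Convert well-formed %HH triplets to octets; leave the rest as-is."""
    raw = uri.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    return _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


print(unescape_to_octets("a%G1%C3%80"))  # b'a%G1\xc3\x80'
```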
Received on Wednesday, 20 November 2002 19:22:32 UTC