Re: Announcing draft-duerst-iri-02.txt from Martin Duerst on 2002-11-20 (www-international@w3.org from October to December 2002)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 21 Nov 2002 07:08:09 +0900
To: "Mark Davis" <mark.davis@jtcsv.com>
Cc: <www-international@w3.org>
Message-Id: <4.2.0.58.J.20021121064653.02a5bb68@localhost>

Hello Mark,

Many thanks for your comment.

At 13:41 02/11/18 -0800, Mark Davis wrote:
> >   3) Re-escape any octets produced in step 2) that are not part of
> >      a/any strictly legal UTF-8 octet sequence.
>
>Changing 'any' for the second doesn't work. And some of the octets may have
>come from #1 (I guess)

As a URI is always purely ASCII, there is no possibility of
non-UTF-8 octets coming in at step 1.


>I'd recommend:
>
>Re-escape any octet that is not part of a strictly legal UTF-8 octet
>sequence within the sequence of octets representing the URI.

I have made the first change, namely  'octets'  ->  'octet'.

I'm not sure about the second part,
"within the sequence of octets representing the URI".
It seems a bit too obvious to me.
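
To make the intended reading of step 3) concrete, here is a minimal
sketch in Python. It is only an illustration under stated assumptions,
not text from the draft: the function names are hypothetical, step 2)
is simplified to unescape only well-formed %HH escapes with a value of
0x80 or above, and step 4) is not covered at all. "Strictly legal" is
taken to mean whatever Python's strict UTF-8 decoder accepts.

```python
import re

# Well-formed escapes only; something like "%G1" never matches.
_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")


def unescape_non_ascii(uri: str) -> bytes:
    """Step 2 (simplified assumption): turn %HH escapes >= 0x80 into octets."""
    def repl(match):
        value = int(match.group(1), 16)
        # Escapes of ASCII values are left alone in this sketch.
        return bytes([value]) if value >= 0x80 else match.group(0)

    # A URI is pure ASCII, so encoding cannot introduce non-UTF-8 octets.
    return _ESCAPE.sub(repl, uri.encode("ascii"))


def reescape_invalid(octets: bytes) -> str:
    """Step 3: re-escape each octet that is not part of a legal UTF-8 sequence."""
    out = []
    i = 0
    while i < len(octets):
        octet = octets[i]
        if octet < 0x80:
            out.append(chr(octet))
            i += 1
            continue
        # Try to read one character of 2, 3 or 4 octets starting here.
        for length in (2, 3, 4):
            chunk = octets[i:i + length]
            try:
                char = chunk.decode("utf-8")  # strict by default
            except UnicodeDecodeError:
                continue
            if len(char) == 1:
                out.append(char)
                i += len(chunk)
                break
        else:
            # No legal sequence starts at this octet: escape it alone.
            out.append("%%%02X" % octet)
            i += 1
    return "".join(out)


def uri_to_iri_sketch(uri: str) -> str:
    """Hypothetical helper chaining the two simplified steps."""
    return reescape_invalid(unescape_non_ascii(uri))
```

With this reading, "%C2%C3%80" keeps %C2 escaped while <C3><80> becomes
the single character U+00C0, and "%G1" passes through unchanged because
it is never turned into an octet in the first place.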

Regards,   Martin.


>[a bit clumsy -- perhaps that buffer can be given a name in definition #1]
>
>Mark
>__________________________________
>http://www.macchiato.com
>►  “Eppur si muove” ◄
>----- Original Message -----
>From: "Martin Duerst" <duerst@w3.org>
>To: "Mark Davis" <mark@macchiato.com>
>Cc: <www-international@w3.org>
>Sent: Monday, November 18, 2002 12:47
>Subject: Re: Announcing draft-duerst-iri-02.txt
>
>
> >
> > Hello Mark,
> >
> > Many thanks for your comments. Some detail questions below.
> > Looking forward to your feedback.
> >
> > At 13:16 02/11/15 -0800, Mark Davis wrote:
> > >One of the steps is the following:
> > >
> > >        3) Re-escape any octets that are not part of a strictly legal
> > >           UTF-8 octet sequence.
> > >This needs to be clearer. Suppose you have the invalid sequence:
> > >
> > >...<C2><C3><80>...
> > >
> > >One could re-escape the entire sequence,
> > >
> > >...%C2%C3%80...
> > >
> > >or one could re-escape the minimal-length invalid sequences, proceeding
> > >from right to left.
> > >
> > >...%C2<C3><80>...
> > >
> > >I assume that the latter is what is meant, but it should be clearer in
> > >the text of the clause. For that matter, any single octet above <7F> is
> > >invalid, so a perverse reading of the clause would require all of them to
> > >be escaped!
> >
> > My interpretation is that <C3><80> is a strictly legal UTF-8
> > sequence, and therefore the <C3> and <80> octets are part of a
> > strictly legal UTF-8 octet sequence, and so only <C2> can be
> > re-escaped.
> >
> > What would you propose to make this easier to understand?
> > Would it be better to replace 'a' by 'any'?
> >
> >   3) Re-escape any octets that are not part of any strictly legal
> >      UTF-8 octet sequence.
> >
> > Or do you have another idea of how to make this clearer?
> >
> > >        4) Re-escape all octets that in UTF-8 represent characters that
> > >           are not appropriate according to Section 5.1.
> > >Should this not also say Section 4.1?
> >
> > Good point. Done.
> >
> >
> > >It is also unclear what to do with a sequence like %G1. Does it turn into
> > >%25G1?
> >
> > That's not a legal URI, so it is not a legal input. So we should
> > never get it. If we get it, it's not converted to an octet in
> > step 2), and can therefore not be re-escaped. But maybe it would
> > help to say clearly that the 're-escape' refers to those octets
> > produced in step 2):
> >
> >   3) Re-escape any octets produced in step 2) that are not part of
> >      a/any strictly legal UTF-8 octet sequence.
> >
> > What do you think?
> >
> >
> > Regards,    Martin.
> >
> >

Received on Wednesday, 20 November 2002 17:26:46 UTC