Re: Announcing draft-duerst-iri-02.txt from Mark Davis on 2002-11-18 (www-international@w3.org from October to December 2002)

From: Mark Davis <mark.davis@jtcsv.com>
Date: Mon, 18 Nov 2002 13:41:04 -0800
To: "Martin Duerst" <duerst@w3.org>
Cc: <www-international@w3.org>
Message-ID: <00a001c28f4b$32cba040$82de2b09@DAVIS1>

>   3) Re-escape any octets produced in step 2) that are not part of
>      a/any strictly legal UTF-8 octet sequence.

Using 'any' for the second one doesn't work. And some of the octets may have
come from #1 (I guess). I'd recommend:

Re-escape any octet that is not part of a strictly legal UTF-8 octet
sequence within the sequence of octets representing the URI.

[a bit clumsy -- perhaps that buffer can be given a name in definition #1]
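
For illustration, here is a minimal Python sketch of that reading. It is not
taken from the draft: the function name reescape_invalid_utf8 is hypothetical,
and it assumes the unescaped octets are already collected in a single buffer.
Octets that belong to a strictly legal UTF-8 sequence are kept as the
character they encode; every other octet is re-escaped on its own.

```python
def reescape_invalid_utf8(octets: bytes) -> str:
    """Keep strictly legal UTF-8 sequences, percent-escape all other octets.

    Hypothetical helper for illustration; the name and signature are
    assumptions, not part of draft-duerst-iri.
    """
    out = []
    i = 0
    while i < len(octets):
        # A UTF-8 character is 1 to 4 octets long; try each length in turn.
        for length in (1, 2, 3, 4):
            chunk = octets[i:i + length]
            try:
                # Python's decoder is strict: it rejects overlong forms,
                # surrogates, and truncated sequences.
                out.append(chunk.decode("utf-8"))
                i += len(chunk)
                break
            except UnicodeDecodeError:
                continue
        else:
            # No legal sequence starts here: re-escape this single octet.
            out.append("%%%02X" % octets[i])
            i += 1
    return "".join(out)


# The invalid sequence <C2><C3><80> from the discussion:
print(reescape_invalid_utf8(b"\xc2\xc3\x80"))
```

Only <C2> is re-escaped here, because <C3><80> is itself a legal sequence.
Since a lead octet fixes the length of its sequence and can never be mistaken
for a continuation octet, legal sequences cannot overlap, so scanning left to
right or right to left should mark the same octets.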

Mark
__________________________________
http://www.macchiato.com
►  “Eppur si muove” ◄

----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
To: "Mark Davis" <mark@macchiato.com>
Cc: <www-international@w3.org>
Sent: Monday, November 18, 2002 12:47
Subject: Re: Announcing draft-duerst-iri-02.txt


>
> Hello Mark,
>
> Many thanks for your comments. Some detail questions below.
> Looking forward to your feedback.
>
> At 13:16 02/11/15 -0800, Mark Davis wrote:
> >One of the steps is the following:
> >
> >        3) Re-escape any octets that are not part of a strictly legal
> >           UTF-8 octet sequence.
> >This needs to be clearer. Suppose you have the invalid sequence:
> >
> >...<C2><C3><80>...
> >
> >One could re-escape the entire sequence,
> >
> >...%C2%C3%80...
> >
> >or one could re-escape the minimal-length invalid sequences, proceeding
> >from right to left.
> >
> >...%C2<C3><80>...
> >
> >I assume that the latter is what is meant, but it should be clearer in the
> >text of the clause. For that matter, any single octet above <7F> is
> >invalid, so a perverse reading of the clause would require all of them to
> >be escaped!
>
> My interpretation is that <C3><80> is a strictly legal UTF-8
> sequence, and therefore the <C3> and <80> octets are part of a
> strictly legal UTF-8 octet sequence, and so only <C2> can be
> re-escaped.
>
> What would you propose to make this easier to understand?
> Would it be better to replace 'a' by 'any'?
>
>   3) Re-escape any octets that are not part of any strictly legal
>      UTF-8 octet sequence.
>
> Or do you have another idea of how to make this clearer?
>
> >        4) Re-escape all octets that in UTF-8 represent characters that
> >           are not appropriate according to Section 5.1.
> >Should this not also say Section 4.1?
>
> Good point. Done.
>
>
> >It is also unclear what to do with a sequence like %G1. Does it turn into
> >%25G1?
>
> That's not a legal URI, so it is not a legal input. So we should
> never get it. If we get it, it's not converted to an octet in
> step 2), and can therefore not be re-escaped. But maybe it would
> help to say clearly that the 're-escape' refers to those octets
> produced in step 2):
>
>   3) Re-escape any octets produced in step 2) that are not part of
>      a/any strictly legal UTF-8 octet sequence.
>
> What do you think?
>
>
> Regards,    Martin.
>
>

Received on Monday, 18 November 2002 18:10:49 UTC