- From: Mark Davis <mark.davis@jtcsv.com>
- Date: Mon, 18 Nov 2002 13:41:04 -0800
- To: "Martin Duerst" <duerst@w3.org>
- Cc: <www-international@w3.org>
> 3) Re-escape any octets produced in step 2) that are not part of
> a/any strictly legal UTF-8 octet sequence.

Changing 'any' for the second doesn't work. And some of the octets may
have come from #1 (I guess).

I'd recommend:

Re-escape any octet that is not part of a strictly legal UTF-8 octet
sequence within the sequence of octets representing the URI.

[a bit clumsy -- perhaps that buffer can be given a name in definition #1]

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
To: "Mark Davis" <mark@macchiato.com>
Cc: <www-international@w3.org>
Sent: Monday, November 18, 2002 12:47
Subject: Re: Announcing draft-duerst-iri-02.txt

> Hello Mark,
>
> Many thanks for your comments. Some detail questions below.
> Looking forward to your feedback.
>
> At 13:16 02/11/15 -0800, Mark Davis wrote:
> >One of the steps is the following:
> >
> >   3) Re-escape any octets that are not part of a strictly legal
> >      UTF-8 octet sequence.
> >
> >This needs to be clearer. Suppose you have the invalid sequence:
> >
> >...<C2><C3><80>...
> >
> >One could re-escape the entire sequence,
> >
> >...%C2%C3%80...
> >
> >or one could re-escape the minimal-length invalid sequences, proceeding
> >from right to left.
> >
> >...%C2<C3><80>...
> >
> >I assume that the latter is what is meant, but it should be clearer in the
> >text of the clause. For that matter, any single octet above <7F> is
> >invalid, so a perverse reading of the clause would require all of them to
> >be escaped!
>
> My interpretation is that <C3><80> is a strictly legal UTF-8
> sequence, and therefore the <C3> and <80> octets are part of a
> strictly legal UTF-8 octet sequence, and so only <C2> can be
> re-escaped.
>
> What would you propose to make this easier to understand?
> Would it be better to replace 'a' by 'any'?
>
>    3) Re-escape any octets that are not part of any strictly legal
>       UTF-8 octet sequence.
>
> Or do you have another idea of how to make this clearer?
>
> >   4) Re-escape all octets that in UTF-8 represent characters that
> >      are not appropriate according to Section 5.1.
> >
> >Should this not also say Section 4.1?
>
> Good point. Done.
>
> >It is also unclear what to do with a sequence like %G1. Does it turn into
> >%25G1?
>
> That's not a legal URI, so it is not a legal input. So we should
> never get it. If we get it, it's not converted to an octet in
> step 2), and can therefore not be re-escaped. But maybe it would
> help to say clearly that the 're-escape' refers to those octets
> produced in step 2):
>
>    3) Re-escape any octets produced in step 2) that are not part of
>       a/any strictly legal UTF-8 octet sequence.
>
> What do you think?
>
> Regards, Martin.
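
The reading discussed in this thread, where only <C2> is re-escaped in <C2><C3><80>, can be illustrated with a short Python sketch. It is not text from the draft, and it rests on assumptions: "strictly legal" is taken to mean whatever Python's strict UTF-8 decoder accepts (no overlong forms, no surrogates, nothing above U+10FFFF), the scan runs left to right, and the function name reescape_invalid_octets is hypothetical. The input is assumed to be the octet buffer left after step 2).

```python
def _expected_length(lead: int) -> int:
    """Length implied by a UTF-8 lead octet, or 0 if it cannot start a sequence."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation octets, C0/C1, and F5..FF never start a sequence


def _is_strict_utf8(chunk: bytes) -> bool:
    """Assumption: 'strictly legal' == accepted by Python's strict decoder."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def reescape_invalid_octets(octets: bytes) -> bytes:
    """Hypothetical helper: %-escape every octet that is not part of a
    strictly legal UTF-8 sequence, leaving legal sequences untouched."""
    out = bytearray()
    i = 0
    while i < len(octets):
        n = _expected_length(octets[i])
        chunk = octets[i:i + n]
        if n and len(chunk) == n and _is_strict_utf8(chunk):
            out += chunk  # legal sequence: copy it through unchanged
            i += n
        else:
            # Escape only this one octet, then retry from the next octet.
            out += "%{:02X}".format(octets[i]).encode("ascii")
            i += 1
    return bytes(out)


# Mark's example: the legal pair <C3><80> is kept, only <C2> is escaped.
assert reescape_invalid_octets(b"\xC2\xC3\x80") == b"%C2\xC3\x80"
```

Because legal UTF-8 sequences cannot overlap, escaping one octet at a time and retrying should give the minimal re-escaping for this example; whether the draft intends exactly this behaviour is the open question in the thread.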
Received on Monday, 18 November 2002 18:10:49 UTC