- From: Mark Davis <mark.davis@jtcsv.com>
- Date: Mon, 2 Dec 2002 11:22:08 -0800
- To: "Martin Duerst" <duerst@w3.org>
- Cc: <www-international@w3.org>
However, having examples that illustrate the correct conversions will
ameliorate the problems with the wording. A few comments:

1. You have

> This example contains the sequence '%fc', which is the same character
> as in the previous example, but represented using iso-8859-1.

This is misleading: %fc may or may not represent U+00FC LATIN SMALL
LETTER U WITH DIAERESIS; it all depends on the encoding that it
originated from. The wording would be better as:

This example contains the sequence '%fc', which would represent a U+00FC
LATIN SMALL LETTER U WITH DIAERESIS in the iso-8859-1 encoding. (It would
represent other characters in other encodings. For example, %fc in
iso-8859-5 represents a U+045C CYRILLIC SMALL LETTER KJE.)

2. A minor wording item. For each example, you say "This example" when it
would be clearer to say "The following example".

3. Using uppercase for the hex makes it stand out better, e.g.

http://www.example.org/D%fcrst
=>
http://www.example.org/D%FCrst

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
To: "Mark Davis" <mark.davis@jtcsv.com>
Cc: <www-international@w3.org>
Sent: Tuesday, November 26, 2002 12:26
Subject: Re: Announcing draft-duerst-iri-02.txt

> Hello Mark,
>
> We just updated that section and added some (very simple)
> examples. Please have a look again at
> http://www.w3.org/International/iri-edit/draft-duerst-iri.txt
>
> Regards, Martin.
>
> At 16:22 02/11/20 -0800, Mark Davis wrote:
> >It would just protect against a dumb reading. And lord knows, you will get
> >those.
> >
> >Suppose Joe Blow has: "A<C2>B<C0>".
> >
> > > > > 3) Re-escape any octets produced in step 2) that are not part of
> > > > >    a/any strictly legal UTF-8 octet sequence.
> >
> >For Rule 3, Joe thinks: "Hmmm. C2 *is* part of a legal UTF-8 octet sequence,
> >the sequence <C2><80>. Therefore, I should not re-escape it.
> >However, <C0> is never part of any legal UTF-8 octet sequence. Therefore I
> >will escape that one!"
> >
> >So after applying Rule 3, Joe has: "A<C2>B%C0".
> >
> >Mark
> >__________________________________
> >http://www.macchiato.com
> >► “Eppur si muove” ◄
> >----- Original Message -----
> >From: "Martin Duerst" <duerst@w3.org>
> >To: "Mark Davis" <mark.davis@jtcsv.com>
> >Cc: <www-international@w3.org>
> >Sent: Wednesday, November 20, 2002 14:08
> >Subject: Re: Announcing draft-duerst-iri-02.txt
> >
> >
> > > Hello Mark,
> > >
> > > Many thanks for your comment.
> > >
> > > At 13:41 02/11/18 -0800, Mark Davis wrote:
> > > > > 3) Re-escape any octets produced in step 2) that are not part of
> > > > >    a/any strictly legal UTF-8 octet sequence.
> > > >
> > > >Changing 'any' for the second doesn't work. And some of the octets may
> > > >have come from #1 (I guess)
> > >
> > > As a URI is always purely ASCII, there is no possibility of
> > > non-utf-8-octets to come in in step 1.
> > >
> > > >I'd recommend:
> > > >
> > > >Re-escape any octet that is not part of a strictly legal UTF-8 octet
> > > >sequence within the sequence of octets representing the URI.
> > >
> > > I have made the first change, namely 'octets' -> 'octet'.
> > >
> > > I'm not sure about the second part,
> > > "within the sequence of octets representing the URI".
> > > It seems a bit too obvious to me.
> > >
> > > Regards, Martin.
> > >
> > > >[a bit clumsy -- perhaps that buffer can be given a name in definition #1]
> > > >
> > > >Mark
> > > >__________________________________
> > > >http://www.macchiato.com
> > > >► “Eppur si muove” ◄
> > > >----- Original Message -----
> > > >From: "Martin Duerst" <duerst@w3.org>
> > > >To: "Mark Davis" <mark@macchiato.com>
> > > >Cc: <www-international@w3.org>
> > > >Sent: Monday, November 18, 2002 12:47
> > > >Subject: Re: Announcing draft-duerst-iri-02.txt
> > > >
> > > >
> > > > > Hello Mark,
> > > > >
> > > > > Many thanks for your comments. Some detail questions below.
> > > > > Looking forward to your feedback.
> > > > >
> > > > > At 13:16 02/11/15 -0800, Mark Davis wrote:
> > > > > >One of the steps is the following:
> > > > > >
> > > > > > 3) Re-escape any octets that are not part of a strictly legal
> > > > > >    UTF-8 octet sequence.
> > > > > >This needs to be clearer. Suppose you have the invalid sequence:
> > > > > >
> > > > > >...<C2><C3><80>...
> > > > > >
> > > > > >One could re-escape the entire sequence,
> > > > > >
> > > > > >...%C2%C3%80...
> > > > > >
> > > > > >or one could re-escape the minimal-length invalid sequences,
> > > > > >proceeding from right to left.
> > > > > >
> > > > > >...%C2<C3><80>...
> > > > > >
> > > > > >I assume that the latter is what is meant, but it should be clearer
> > > > > >in the text of the clause. For that matter, any single octet above
> > > > > ><7F> is invalid, so a perverse reading of the clause would require
> > > > > >all of them to be escaped!
> > > > >
> > > > > My interpretation is that <C3><80> is a strictly legal UTF-8
> > > > > sequence, and therefore the <C3> and <80> octets are part of a
> > > > > strictly legal UTF-8 octet sequence, and so only <C2> can be
> > > > > re-escaped.
> > > > >
> > > > > What would you propose to make this easier to understand?
> > > > > Would it be better to replace 'a' by 'any'?
> > > > >
> > > > > 3) Re-escape any octets that are not part of any strictly legal
> > > > >    UTF-8 octet sequence.
> > > > >
> > > > > Or do you have another idea of how to make this clearer?
> > > > >
> > > > > > 4) Re-escape all octets that in UTF-8 represent characters that
> > > > > >    are not appropriate according to Section 5.1.
> > > > > >Should this not also say Section 4.1?
> > > > >
> > > > > Good point. Done.
> > > > >
> > > > >
> > > > > >It is also unclear what to do with a sequence like %G1. Does it
> > > > > >turn into %25G1?
> > > > >
> > > > > That's not a legal URI, so it is not a legal input. So we should
> > > > > never get it. If we get it, it's not converted to an octet in
> > > > > step 2), and can therefore not be re-escaped. But maybe it would
> > > > > help to say clearly that the 're-escape' refers to those octets
> > > > > produced in step 2):
> > > > >
> > > > > 3) Re-escape any octets produced in step 2) that are not part of
> > > > >    a/any strictly legal UTF-8 octet sequence.
> > > > >
> > > > > What do you think?
> > > > >
> > > > >
> > > > > Regards, Martin.
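
The reading of step 3) that the thread converges on (leave valid sequences
such as <C3><80> alone and re-escape only the octets that cannot be placed
inside a strictly legal UTF-8 sequence) might be sketched in Python as
follows. This is an illustration under stated assumptions, not text from
the draft: the names unescape_legal_utf8 and _convert_run are hypothetical,
only escapes for octets 0x80 and above are unescaped (escapes for ASCII
octets are left untouched), the scan runs left to right, and step 4)
(characters not appropriate per Section 5.1) is not attempted.

```python
import re

# A run of consecutive percent-escapes whose octet value is 0x80 or above.
# Escapes of ASCII octets (e.g. %41) are deliberately left alone in this sketch.
_HIGH_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def _convert_run(match):
    """Hypothetical helper: convert one run of high-octet escapes."""
    octets = bytes.fromhex(match.group(0).replace("%", ""))
    out = []
    i = 0
    while i < len(octets):
        # Try to read one strictly legal multi-octet UTF-8 sequence at i.
        for length in (2, 3, 4):
            chunk = octets[i:i + length]
            try:
                text = chunk.decode("utf-8")  # strict: rejects overlongs, surrogates
            except UnicodeDecodeError:
                continue
            if len(text) == 1:
                out.append(text)
                i += len(chunk)
                break
        else:
            # This octet starts no legal sequence: re-escape it alone,
            # using uppercase hex, and move on by one octet.
            out.append("%{:02X}".format(octets[i]))
            i += 1
    return "".join(out)


def unescape_legal_utf8(uri):
    """Unescape legal UTF-8 sequences; keep every other high octet escaped."""
    return _HIGH_ESCAPES.sub(_convert_run, uri)
```

With this sketch, "...%C2%C3%80..." becomes "...%C2" followed by U+00C0 and
"...", while Joe Blow's "A%C2B%C0" comes back unchanged, because neither
<C2> nor <C0> is part of a legal sequence in that input.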
Received on Monday, 2 December 2002 14:22:19 UTC
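
Comment 1 in the message above (that '%fc' identifies a character only once
the originating encoding is known) can be checked with a few lines of
Python. This is a small sketch, not part of the original message; the
function name describe_octet is hypothetical, and the encoding names are
the ones Python's standard codecs accept.

```python
import unicodedata


def describe_octet(octet, encoding):
    """Return 'U+XXXX NAME' for a single octet decoded in the given encoding."""
    ch = bytes([octet]).decode(encoding)
    return "U+{:04X} {}".format(ord(ch), unicodedata.name(ch))


# The same octet, 0xFC, read under two different single-byte encodings.
for encoding in ("iso-8859-1", "iso-8859-5"):
    print(encoding, "->", describe_octet(0xFC, encoding))
```

The two lines printed name U+00FC LATIN SMALL LETTER U WITH DIAERESIS for
iso-8859-1 and U+045C CYRILLIC SMALL LETTER KJE for iso-8859-5, matching
the wording proposed in comment 1.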