
Re: Announcing draft-duerst-iri-02.txt

From: Mark Davis <mark.davis@jtcsv.com>
Date: Mon, 2 Dec 2002 11:22:08 -0800
Message-ID: <003801c29a38$1bf9d880$6ede2b09@DAVIS1>
To: "Martin Duerst" <duerst@w3.org>
Cc: <www-international@w3.org>

However, having examples that illustrate the correct conversions will
ameliorate the problems with the wording. A few comments:

1. You have

>   This example contains the sequence '%fc', which is the same character
>   as in the previous example, but represented using iso-8859-1.

This is misleading: %fc may or may not represent U+00FC LATIN SMALL
LETTER U WITH DIAERESIS; it all depends on the encoding that it originated
from. The wording would be better as:

   This example contains the sequence '%fc', which would represent a U+00FC
LATIN SMALL LETTER U WITH DIAERESIS in the iso-8859-1 encoding. (It would
represent other characters in other encodings. For example, %fc in
iso-8859-5 represents a U+045C CYRILLIC SMALL LETTER KJE.)
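
The point about %fc can be checked with a minimal Python sketch. The helper
name describe_octet is hypothetical and the two encodings are simply the ones
named above; it decodes the single octet behind the escape and reports the
resulting character.

```python
import unicodedata

def describe_octet(hex_digits: str, encoding: str) -> str:
    """Decode one escaped octet in the given encoding and describe the result."""
    octet = bytes.fromhex(hex_digits)   # "fc" -> b"\xfc"
    char = octet.decode(encoding)       # interpretation depends on the encoding
    return f"U+{ord(char):04X} {unicodedata.name(char)}"

# The same octet yields different characters in different encodings.
for encoding in ("iso-8859-1", "iso-8859-5"):
    print(f"%fc in {encoding}: {describe_octet('fc', encoding)}")
```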

2. A minor wording item. For each example, you say "   This example " when
it would be clearer to say "The following example".

3. Using uppercase for the hex digits makes them stand out better, e.g.

http://www.example.org/D%fcrst
=>
http://www.example.org/D%FCrst
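
As a rough illustration of item 3, the following Python sketch uppercases the
hex digits of every well-formed escape and leaves everything else alone. The
function name uppercase_escapes is hypothetical, not something taken from the
draft.

```python
import re

# A percent sign followed by exactly two hex digits.
_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")

def uppercase_escapes(uri: str) -> str:
    """Return uri with the hex digits of each %XX escape in uppercase."""
    return _ESCAPE.sub(lambda match: match.group(0).upper(), uri)

print(uppercase_escapes("http://www.example.org/D%fcrst"))
```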


Mark
__________________________________
http://www.macchiato.com
►  “Eppur si muove” ◄

----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
To: "Mark Davis" <mark.davis@jtcsv.com>
Cc: <www-international@w3.org>
Sent: Tuesday, November 26, 2002 12:26
Subject: Re: Announcing draft-duerst-iri-02.txt


>
> Hello Mark,
>
> We just updated that section and added some (very simple)
> examples. Please have a look again at
> http://www.w3.org/International/iri-edit/draft-duerst-iri.txt
>
> Regards,    Martin.
>
> At 16:22 02/11/20 -0800, Mark Davis wrote:
> >It would just protect against a dumb reading. And lord knows, you will
> >get those.
> >
> >Suppose Joe Blow has: "A<C2>B<C0>".
> >
> > > > >   3) Re-escape any octets produced in step 2) that are not part of
> > > > >      a/any strictly legal UTF-8 octet sequence.
> >
> >For Rule 3, Joe thinks: "Hmmm. C2 *is* part of a legal UTF-8 octet
> >sequence, the sequence <C2><80>. Therefore, I should not re-escape it.
> >However, <C0> is never part of any legal UTF-8 octet sequence. Therefore
> >I will re-escape that one!"
> >
> >So after applying Rule 3, Joe has: "A<C2>B%C0".
> >
> >Mark
> >__________________________________
> >http://www.macchiato.com
> >►  “Eppur si muove” ◄
> >----- Original Message -----
> >From: "Martin Duerst" <duerst@w3.org>
> >To: "Mark Davis" <mark.davis@jtcsv.com>
> >Cc: <www-international@w3.org>
> >Sent: Wednesday, November 20, 2002 14:08
> >Subject: Re: Announcing draft-duerst-iri-02.txt
> >
> >
> > >
> > > Hello Mark,
> > >
> > > Many thanks for your comment.
> > >
> > > At 13:41 02/11/18 -0800, Mark Davis wrote:
> > > > >   3) Re-escape any octets produced in step 2) that are not part of
> > > > >      a/any strictly legal UTF-8 octet sequence.
> > > >
> > > >Changing 'any' for the second doesn't work. And some of the octets
> > > >may have come from #1 (I guess)
> > >
> > > As a URI is always purely ASCII, there is no possibility of
> > > non-UTF-8 octets coming in at step 1.
> > >
> > >
> > > >I'd recommend:
> > > >
> > > >Re-escape any octet that is not part of a strictly legal UTF-8 octet
> > > >sequence within the sequence of octets representing the URI.
> > >
> > > I have made the first change, namely  'octets'  ->  'octet'.
> > >
> > > I'm not sure about the second part,
> > > "within the sequence of octets representing the URI".
> > > It seems a bit too obvious to me.
> > >
> > > Regards,   Martin.
> > >
> > >
> > > >[a bit clumsy -- perhaps that buffer can be given a name in
> > > >definition #1]
> > > >
> > > >Mark
> > > >__________________________________
> > > >http://www.macchiato.com
> > > >►  “Eppur si muove” ◄
> > > >----- Original Message -----
> > > >From: "Martin Duerst" <duerst@w3.org>
> > > >To: "Mark Davis" <mark@macchiato.com>
> > > >Cc: <www-international@w3.org>
> > > >Sent: Monday, November 18, 2002 12:47
> > > >Subject: Re: Announcing draft-duerst-iri-02.txt
> > > >
> > > >
> > > > >
> > > > > Hello Mark,
> > > > >
> > > > > Many thanks for your comments. Some detail questions below.
> > > > > Looking forward to your feedback.
> > > > >
> > > > > At 13:16 02/11/15 -0800, Mark Davis wrote:
> > > > > >One of the steps is the following:
> > > > > >
> > > > > >        3) Re-escape any octets that are not part of a strictly
> > > > > >           legal UTF-8 octet sequence.
> > > > > >This needs to be clearer. Suppose you have the invalid sequence:
> > > > > >
> > > > > >...<C2><C3><80>...
> > > > > >
> > > > > >One could re-escape the entire sequence,
> > > > > >
> > > > > >...%C2%C3%80...
> > > > > >
> > > > > >or one could re-escape the minimal-length invalid sequences,
> > > > > >proceeding from right to left.
> > > > > >
> > > > > >...%C2<C3><80>...
> > > > > >
> > > > > >I assume that the latter is what is meant, but it should be clearer
> > > > > >in the text of the clause. For that matter, any single octet above
> > > > > ><7F> is invalid, so a perverse reading of the clause would require
> > > > > >all of them to be escaped!
> > > > >
> > > > > My interpretation is that <C3><80> is a strictly legal UTF-8
> > > > > sequence, and therefore the <C3> and <80> octets are part of a
> > > > > strictly legal UTF-8 octet sequence, and so only <C2> can be
> > > > > re-escaped.
> > > > >
> > > > > What would you propose to make this easier to understand?
> > > > > Would it be better to replace 'a' by 'any'?
> > > > >
> > > > >   3) Re-escape any octets that are not part of any strictly legal
> > > > >      UTF-8 octet sequence.
> > > > >
> > > > > Or do you have another idea of how to make this clearer?
> > > > >
> > > > > >        4) Re-escape all octets that in UTF-8 represent characters
> > > > > >           that are not appropriate according to Section 5.1.
> > > > > >Should this not also say Section 4.1?
> > > > >
> > > > > Good point. Done.
> > > > >
> > > > >
> > > > > >It is also unclear what to do with a sequence like %G1. Does it
> > > > > >turn into %25G1?
> > > > >
> > > > > That's not a legal URI, so it is not a legal input. So we should
> > > > > never get it. If we get it, it's not converted to an octet in
> > > > > step 2), and can therefore not be re-escaped. But maybe it would
> > > > > help to say clearly that the 're-escape' refers to those octets
> > > > > produced in step 2):
> > > > >
> > > > >   3) Re-escape any octets produced in step 2) that are not part of
> > > > >      a/any strictly legal UTF-8 octet sequence.
> > > > >
> > > > > What do you think?
> > > > >
> > > > >
> > > > > Regards,    Martin.
> > > > >
> > > > >
> > >
> > >
>
>
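
The reading of step 3 discussed in the quoted thread (re-escape only those
octets that are not part of a strictly legal UTF-8 sequence within the actual
octet buffer) can be sketched in Python as follows. This is only an
illustration under stated assumptions, not the draft's algorithm: the function
name reescape_invalid_utf8 is hypothetical, the scan runs left to right, and
Python's strict UTF-8 decoder is taken as the definition of "strictly legal".

```python
def _is_legal_utf8(chunk: bytes) -> bool:
    """True if chunk decodes under Python's strict UTF-8 decoder."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def reescape_invalid_utf8(octets: bytes) -> bytes:
    """Keep legal UTF-8 sequences as raw octets; re-escape every other octet as %XX."""
    out = bytearray()
    i = 0
    while i < len(octets):
        lead = octets[i]
        # Sequence length implied by the lead octet (0 = never a legal lead).
        if lead < 0x80:
            length = 1
        elif 0xC2 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            length = 0
        chunk = octets[i:i + length]
        if length and len(chunk) == length and _is_legal_utf8(chunk):
            out += chunk          # part of a legal sequence in this buffer
            i += length
        else:
            out += f"%{lead:02X}".encode("ascii")  # re-escape this one octet
            i += 1
    return bytes(out)

# The two cases from the thread: <C2><C3><80> and "A<C2>B<C0>".
print(reescape_invalid_utf8(b"\xC2\xC3\x80"))
print(reescape_invalid_utf8(b"A\xC2B\xC0"))
```

Under these assumptions, a lone <C2> is re-escaped because no legal sequence
containing it occurs at that position in the buffer, even though <C2> could be
part of a legal sequence elsewhere.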
Received on Monday, 2 December 2002 14:22:19 GMT
