Re: Announcing draft-duerst-iri-02.txt from Martin Duerst on 2002-11-18 (www-international@w3.org from October to December 2002)

From: Martin Duerst <duerst@w3.org>
Date: Tue, 19 Nov 2002 05:47:16 +0900
To: "Mark Davis" <mark@macchiato.com>
Cc: <www-international@w3.org>
Message-Id: <4.2.0.58.J.20021116082221.07a100a8@localhost>

Hello Mark,

Many thanks for your comments. Some detailed questions follow below.
Looking forward to your feedback.

At 13:16 02/11/15 -0800, Mark Davis wrote:
>One of the steps is the following:
>
>        3) Re-escape any octets that are not part of a strictly legal
>           UTF-8 octet sequence.
>This needs to be clearer. Suppose you have the invalid sequence:
>
>...<C2><C3><80>...
>
>One could re-escape the entire sequence,
>
>...%C2%C3%80...
>
>or one could re-escape the minimal-length invalid sequences, proceeding
>from right to left.
>
>...%C2<C3><80>...
>
>I assume that the latter is what is meant, but it should be clearer in the 
>text of the clause. For that matter, any single octet above <7F> is 
>invalid, so a perverse reading of the clause would require all of them to 
>be escaped!

My interpretation is that <C3><80> is a strictly legal UTF-8
sequence, and therefore the <C3> and <80> octets are part of a
strictly legal UTF-8 octet sequence, and so only <C2> can be
re-escaped.
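
To make this reading concrete, here is a minimal Python sketch. It is
only an illustration, not text from the draft: the function name
reescape_invalid_octets is hypothetical, and it relies on Python's
strict UTF-8 decoder to decide what counts as a "strictly legal"
sequence. It keeps every octet that belongs to a legal sequence and
escapes each remaining octet on its own.

```python
def reescape_invalid_octets(octets: bytes) -> bytes:
    """Escape every octet that is not part of a legal UTF-8 sequence.

    Hypothetical helper for illustration; legality is delegated to
    Python's strict UTF-8 decoder (rejects overlong forms, surrogates).
    """
    out = bytearray()
    i = 0
    while i < len(octets):
        # Try the shortest candidate first: UTF-8 sequences are 1-4 octets.
        for length in (1, 2, 3, 4):
            chunk = octets[i:i + length]
            try:
                chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            out += chunk          # legal sequence: keep it unescaped
            i += len(chunk)
            break
        else:
            # No legal sequence starts here: escape this single octet.
            out += b"%%%02X" % octets[i]
            i += 1
    return bytes(out)


if __name__ == "__main__":
    print(reescape_invalid_octets(b"\xc2\xc3\x80"))
```

For the sequence <C2><C3><80> this escapes only <C2> and leaves
<C3><80> alone, which matches the interpretation above.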

What would you propose to make this easier to understand?
Would it be better to replace 'a' by 'any'?

  3) Re-escape any octets that are not part of any strictly legal
     UTF-8 octet sequence.

Or do you have another idea of how to make this clearer?

>        4) Re-escape all octets that in UTF-8 represent characters that
>           are not appropriate according to Section 5.1.
>Should this not also say Section 4.1?

Good point. Done.

>It is also unclear what to do with a sequence like %G1. Does it turn into 
>%25G1?

That's not a legal URI, so it is not legal input, and we should
never get it. If we do get it, it is not converted to an octet in
step 2), and therefore cannot be re-escaped in step 3). But maybe it
would help to say clearly that 're-escape' refers only to those
octets produced in step 2):

  3) Re-escape any octets produced in step 2) that are not part of
     a/any strictly legal UTF-8 octet sequence.
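
A rough Python sketch of how steps 2) and 3) could fit together under
this wording is given here. It is an assumption-laden illustration, not
the draft's algorithm: the names convert_escapes and legal_utf8_length
are hypothetical, and step 2) is simplified to unescape only
well-formed %HH escapes whose value is 0x80 or above, so ASCII escapes
such as %25 are deliberately left untouched.

```python
import re

# A well-formed escape: '%' followed by exactly two hexadecimal digits.
ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")


def legal_utf8_length(octets: bytes, i: int) -> int:
    """Length of the legal UTF-8 sequence starting at i, or 0 if none."""
    for length in (1, 2, 3, 4):
        try:
            octets[i:i + length].decode("utf-8")
            return length
        except UnicodeDecodeError:
            pass
    return 0


def convert_escapes(uri: bytes) -> bytes:
    # Step 2) (simplified assumption): turn %HH into an octet only when
    # HH >= 0x80, and remember which octets were produced this way.
    octets = bytearray()
    produced = []
    i = 0
    while i < len(uri):
        m = ESCAPE.match(uri, i)
        if m and int(m.group(1), 16) >= 0x80:
            octets.append(int(m.group(1), 16))
            produced.append(True)
            i = m.end()
        else:
            # Includes malformed input such as "%G1": copied unchanged.
            octets.append(uri[i])
            produced.append(False)
            i += 1

    # Step 3): re-escape octets produced in step 2) that are not part
    # of any strictly legal UTF-8 sequence.
    out = bytearray()
    i = 0
    while i < len(octets):
        n = legal_utf8_length(bytes(octets), i)
        if n:
            out += octets[i:i + n]
            i += n
        elif produced[i]:
            out += b"%%%02X" % octets[i]
            i += 1
        else:
            out.append(octets[i])
            i += 1
    return bytes(out)


if __name__ == "__main__":
    print(convert_escapes(b"%C2%C3%80"))
    print(convert_escapes(b"%G1"))
```

In this sketch %G1 never becomes an octet in step 2), so step 3) has
nothing to re-escape and the input comes back unchanged.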

What do you think?

Regards,    Martin.

Received on Monday, 18 November 2002 15:57:31 UTC