- From: Martin Duerst <duerst@w3.org>
- Date: Tue, 19 Nov 2002 05:47:16 +0900
- To: "Mark Davis" <mark@macchiato.com>
- Cc: <www-international@w3.org>
Hello Mark,
Many thanks for your comments. Some detailed questions below.
Looking forward to your feedback.
At 13:16 02/11/15 -0800, Mark Davis wrote:
>One of the steps is the following:
>
> 3) Re-escape any octets that are not part of a strictly legal UTF-8
> octet sequence.
>This needs to be clearer. Suppose you have the invalid sequence:
>
>...<C2><C3><80>...
>
>One could re-escape the entire sequence,
>
>...%C2%C3%80...
>
>or one could re-escape the minimal-length invalid sequences, proceeding
>from right to left.
>
>...%C2<C3><80>...
>
>I assume that the latter is what is meant, but it should be clearer in the
>text of the clause. For that matter, any single octet above <7F> is
>invalid, so a perverse reading of the clause would require all of them to
>be escaped!
My interpretation is that <C3><80> is a strictly legal UTF-8
sequence, and therefore the <C3> and <80> octets are part of a
strictly legal UTF-8 octet sequence, and so only <C2> can be
re-escaped.
What would you propose to make this easier to understand?
Would it be better to replace 'a' by 'any'?
3) Re-escape any octets that are not part of any strictly legal
UTF-8 octet sequence.
Or do you have another idea of how to make this clearer?
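To make the intended reading concrete, here is a minimal sketch in Python. It is not text from the draft. The function name is hypothetical, and it assumes that "strictly legal" means whatever Python's strict UTF-8 decoder accepts (no overlong forms, no surrogates, at most four octets). Octets that belong to some legal sequence are kept; every other octet is re-escaped individually.

```python
def reescape_illegal_octets(octets: bytes) -> bytes:
    """Re-escape each octet that is not part of a strictly legal UTF-8 sequence.

    Assumption: "strictly legal" is approximated by Python's strict decoder.
    """
    out = bytearray()
    i = 0
    while i < len(octets):
        # Try the shortest candidate first; a legal UTF-8 character is 1-4 octets.
        for length in (1, 2, 3, 4):
            chunk = octets[i:i + length]
            try:
                chunk.decode("utf-8")  # strict decoding raises on illegal input
            except UnicodeDecodeError:
                continue
            out += chunk               # legal sequence: keep the octets as they are
            i += len(chunk)
            break
        else:
            # No legal sequence starts at this octet: re-escape only this octet.
            out += b"%%%02X" % octets[i]
            i += 1
    return bytes(out)


# Mark's example: <C2><C3><80> becomes %C2<C3><80>
print(reescape_illegal_octets(b"\xc2\xc3\x80"))
```

Under this reading, <C3><80> survives because it forms a legal sequence on its own, and only the stray <C2> is re-escaped.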
> 4) Re-escape all octets that in UTF-8 represent characters that
> are not appropriate according to Section 5.1.
>Should this not also say Section 4.1?
Good point. Done.
>It is also unclear what to do with a sequence like %G1. Does it turn into
>%25G1?
That's not a legal URI, so it is not a legal input. So we should
never get it. If we get it, it's not converted to an octet in
step 2), and can therefore not be re-escaped. But maybe it would
help to say clearly that the 're-escape' refers to those octets
produced in step 2):
3) Re-escape any octets produced in step 2) that are not part of
a/any strictly legal UTF-8 octet sequence.
What do you think?
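As an illustration of the "produced in step 2)" wording, here is a rough Python sketch. It is not the draft's algorithm. The names convert, _percent_escape, and the "reescape" error-handler label are hypothetical, and it makes the simplifying assumption that step 2) only converts escapes of octets 0x80 and above, so escapes of US-ASCII characters such as %25 are left alone. A string like %G1 is never matched as an escape, so no octet is produced from it and nothing can be re-escaped.

```python
import codecs
import re


def _percent_escape(error):
    """Error handler: turn the offending octets back into %HH escapes."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    return "".join(f"%{b:02X}" for b in bad), error.end


# Hypothetical handler name, registered for use with bytes.decode().
codecs.register_error("reescape", _percent_escape)

# Assumption: only escapes of non-ASCII octets (%80 to %FF) are converted.
_HIGH_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def convert(uri: str) -> str:
    def handle_run(match):
        # Step 2 (sketch): turn a run of escapes into octets.
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        # Step 3 (sketch): keep legal UTF-8, re-escape the remaining octets.
        return octets.decode("utf-8", errors="reescape")

    return _HIGH_ESCAPES.sub(handle_run, uri)


print(convert("a%C2%C3%80b"))  # only %C2 stays escaped
print(convert("%G1"))          # not an escape, so left untouched
```

Because the re-escaping happens only inside runs that were actually converted, input that is not a legal escape passes through unchanged rather than turning into %25G1.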
Regards, Martin.
Received on Monday, 18 November 2002 15:57:31 UTC