Announcing draft-duerst-iri-02.txt from Mark Davis on 2002-11-15 (www-international@w3.org from October to December 2002)

From: Mark Davis <mark@macchiato.com>
Date: Fri, 15 Nov 2002 13:16:09 -0800
To: "Martin Duerst" <duerst@w3.org>
Cc: <www-international@w3.org>, <mark@macchiato.com>
Message-ID: <003c01c28cec$52786e70$8500a8c0@Davis2>

One of the steps is the following:

       3) Re-escape any octets that are not part of a strictly legal UTF-
          8 octet sequence.

This needs to be clearer. Suppose you have the invalid sequence:

...<C2><C3><80>...

One could re-escape the entire sequence:

...%C2%C3%80...

or one could re-escape only the minimal-length invalid sequences, proceeding from right to left:

...%C2<C3><80>...

I assume that the latter is what is meant, but it should be clearer in the text of the clause. For that matter, any single octet above <7F> is invalid when taken by itself, so a perverse reading of the clause would require all of them to be escaped!
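
To make the second reading concrete, here is a minimal Python sketch. It is only an illustration of that interpretation, not text from the draft. The names `reescape_invalid` and `percent_escape` are made up for this example, and the sketch relies on Python's strict UTF-8 decoder, which scans left to right; for the sequence above that gives the same result as the right-to-left description.

```python
import codecs


def _percent_escape(err):
    # Hypothetical error handler (not from the draft): turn each octet that
    # the strict UTF-8 decoder rejected into %XX, then resume decoding
    # immediately after the rejected octets.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(f"%{b:02X}" for b in bad), err.end


codecs.register_error("percent_escape", _percent_escape)


def reescape_invalid(octets: bytes) -> str:
    # Octets that form a legal UTF-8 sequence are decoded to characters;
    # every other octet is re-escaped individually.
    return octets.decode("utf-8", errors="percent_escape")
```

With this sketch, `reescape_invalid(b"\xC2\xC3\x80")` returns `%C2` followed by the single character U+00C0: only the stray <C2> is escaped, and <C3><80> survives as a legal two-octet sequence.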

       4) Re-escape all octets that in UTF-8 represent characters that
          are not appropriate according to Section 5.1.

Should this not also say Section 4.1?

It is also unclear what to do with a sequence like %G1. Does it turn into %25G1?
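
If the answer is yes, the rule could be sketched as follows. This is an assumption about one possible resolution, not something the draft states, and `escape_stray_percent` is a hypothetical name.

```python
import re

# A "%" that is NOT followed by exactly two hex digits is treated as a
# literal percent sign and escaped as %25 (assumed behaviour, see above).
_STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escape_stray_percent(text: str) -> str:
    # Well-formed escapes such as %41 are left untouched.
    return _STRAY_PERCENT.sub("%25", text)
```

Under this reading, `%G1` becomes `%25G1`, while a well-formed escape such as `%41` is passed through unchanged.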

Mark

Received on Friday, 15 November 2002 16:16:49 UTC