Re: encoding related phrasing from Mike Brown on 2003-09-26 (uri@w3.org from September 2003)

From: Mike Brown <mike@skew.org>
Date: Fri, 26 Sep 2003 11:32:00 -0600 (MDT)
To: uri@w3.org
Message-Id: <200309261732.h8QHW0Wd055377@chilled.skew.org>

I want to revise my suggestion for rewording section 2.1.  In the
last paragraph, I didn't follow my own advice! Also, there are a
couple of other details that I want to address.

1. I missed a "character set".

2. Prior to the adoption of RFCs 2277 and 2718, protocols and
URI schemes were free to mandate the use of encodings other than
UTF-8 as the basis for %-escaping, or to not speak to the issue at
all (HTTP being the most notorious example). This should be
acknowledged when recommending UTF-8.

3. Link the first mention of escaping to section 2.4 (#escape).

4. Even though the reader probably can figure out what is meant,
the recommended action to encode-then-escape can be difficult to
follow to the letter. If you escape an octet, then you have a
triplet of characters. So far, so good. But if you don't escape
an octet, then you've got... an octet. You might say to just use
characters represented by the unescaped octets, but then this
makes me think the whole example is redundant, saying, in effect,
"to escape certain characters, encode them all so you know how to
escape them, but then just escape the ones you need to." What's
the point? Just drop this entirely.

So, the last paragraph should be sufficient if it reads like this:

  In accordance with the trend toward UTF-8 [RFC2279] (see also
  [RFC2277] and [RFC2718]), when a URI scheme defines a component
  that represents textual data consisting of characters from the
  Unicode / ISO/IEC 10646 repertoire and does not mandate the
  use of some other encoding, we recommend using UTF-8 [RFC2279]
  to determine the octets used to escape [#escape] characters
  that are not in the unreserved set.
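
To make the intended behavior concrete, here is a minimal Python sketch of
that recommendation. It is illustrative only: the name escape_component is
hypothetical, and the unreserved set is assumed to be ALPHA, DIGIT, "-",
".", "_" and "~" (the set later standardized in RFC 3986), which may differ
from the set in the draft under discussion. Characters in the unreserved
set pass through as-is; every other character is encoded as UTF-8 and each
resulting octet becomes a %XX triplet.

```python
import string

# Assumption: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~".
# The draft being discussed may define this set differently.
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")


def escape_component(text: str) -> str:
    """Escape every character outside the unreserved set, using UTF-8."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            # Unreserved characters are emitted unchanged.
            out.append(ch)
        else:
            # UTF-8 determines the octets; each octet becomes one triplet.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)
```

Under these assumptions, escape_component("café") yields "caf%C3%A9". Note
that only the characters that need escaping are ever encoded, so the
question in point 4 of what to do with an unescaped octet never comes up.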

Received on Friday, 26 September 2003 13:31:57 UTC