RE: encoding related phrasing from McDonald, Ira on 2003-09-27 (uri@w3.org from September 2003)

From: McDonald, Ira <imcdonald@sharplabs.com>
Date: Sat, 27 Sep 2003 11:47:28 -0700
To: "'Mike Brown'" <mike@skew.org>, uri@w3.org
Message-ID: <116DB56CD7DED511BC7800508B2CA537B001CE@mailsrvnt02.enet.sharplabs.com>

Hi Mike,

Bunch of good and thoughtful comments - thanks.

Below, "trend toward UTF-8" is a little weak.  You cited 
RFC 2277 "IETF Policy on Character Sets and Languages",
which is also BCP 18, an IETF best practices document.
RFC 2277 says on pages 2-3:

   "All protocols MUST identify, for all character data, which charset is
   in use.

>  Protocols MUST be able to use the UTF-8 charset, which consists of
   the ISO 10646 coded character set combined with the UTF-8 character
   encoding scheme, as defined in [10646] Annex R (published in
   Amendment 2), for all text.

   Protocols MAY specify, in addition, how to use other charsets or
   other character encoding schemes for ISO 10646, such as UTF-16, but
>  lack of an ability to use UTF-8 is a violation of this policy; such a
   violation would need a variance procedure ([BCP9] section 9) with
   clear and solid justification in the protocol specification document
   before being entered into or advanced upon the standards track."

While a URI is not a protocol, it _is_ a protocol element, containing
a (sort of) human-readable identifier, which is typically displayed to
human users.  As such, I believe that UTF-8 support in URIs is mandated
by RFC 2277.

Cheers,
- Ira McDonald
  High North Inc


-----Original Message-----
From: Mike Brown [mailto:mike@skew.org]
Sent: Friday, September 26, 2003 1:32 PM
To: uri@w3.org
Subject: Re: encoding related phrasing



I want to revise my suggestion for rewording section 2.1.  In the
last paragraph, I didn't follow my own advice! Also, there are a
couple of other details that I want to address.

1. I missed a "character set".

2. Prior to the adoption of RFCs 2277 and 2718, protocols and
URI schemes were free to mandate the use of encodings other than
UTF-8 as the basis for %-escaping, or to not speak to the issue at
all (HTTP being the most notorious example). This should be
acknowledged when recommending UTF-8.

3. Link the first mention of escaping to section 2.4 (#escape).

4. Even though the reader probably can figure out what is meant,
the recommended action to encode-then-escape can be difficult to
follow to the letter. If you escape an octet, then you have a
triplet of characters. So far, so good. But if you don't escape 
an octet, then you've got ...an octet. You might say to just use
characters represented by the unescaped octets, but then this
makes me think the whole example is redundant, saying, in effect,
"to escape certain characters, encode them all so you know how to
escape them, but then just escape the ones you need to." What's
the point? Just drop this entirely.

So, the last paragraph should be sufficient if it reads like this:

  In accordance with the trend toward UTF-8 [RFC2279] (see also
  [RFC2277] and [RFC2718]), when a URI scheme defines a component
  that represents textual data consisting of characters from the
  Unicode / ISO/IEC 10646 repertoire and does not mandate the
  use of some other encoding, we recommend using UTF-8 [RFC2279]
  to determine the octets used to escape [#escape] characters
  that are not in the unreserved set.
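
As a rough illustration of the recommendation above, here is a minimal
Python sketch of using UTF-8 to determine the escaped octets. The helper
name escape_component is hypothetical, and the unreserved set is assumed
to be the one later published in RFC 3986 (letters, digits, "-", ".",
"_", "~"); the draft under discussion may define it differently.

```python
# Assumption: unreserved set as later published in RFC 3986, section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def escape_component(text: str) -> str:
    """Hypothetical helper: %-escape text using its UTF-8 octets."""
    out = []
    for ch in text:
        if ch in UNRESERVED:
            # Unreserved characters are copied through as characters.
            out.append(ch)
        else:
            # Everything else is encoded as UTF-8, and each resulting
            # octet becomes a "%XX" triplet.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


print(escape_component("caf\u00e9 au lait"))  # prints caf%C3%A9%20au%20lait
```

In this sketch, unreserved characters are copied through as characters
and only the rest are encoded, so the question in point 4 of what to do
with an unescaped octet does not come up.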

Received on Saturday, 27 September 2003 14:48:18 UTC