- From: McDonald, Ira <imcdonald@sharplabs.com>
- Date: Sat, 27 Sep 2003 11:47:28 -0700
- To: "'Mike Brown'" <mike@skew.org>, uri@w3.org
Hi Mike,

Bunch of good and thoughtful comments - thanks.

Below, "trend toward UTF-8" is a little weak. You cited RFC 2277, "IETF Policy on Character Sets and Languages", which is also BCP 18, an IETF best practices document.

RFC 2277 says on pages 2-3:

"All protocols MUST identify, for all character data, which charset is in use.

> Protocols MUST be able to use the UTF-8 charset, which consists of the ISO 10646 coded character set combined with the UTF-8 character encoding scheme, as defined in [10646] Annex R (published in Amendment 2), for all text.

Protocols MAY specify, in addition, how to use other charsets or other character encoding schemes for ISO 10646, such as UTF-16, but
> lack of an ability to use UTF-8 is a violation of this policy; such a violation would need a variance procedure ([BCP9] section 9) with clear and solid justification in the protocol specification document before being entered into or advanced upon the standards track."

While a URI is not a protocol, it _is_ a protocol element, containing a (sort of) human-readable identifier, which is typically displayed to human users. As such, I believe that UTF-8 support in URI is mandated by RFC 2277.

Cheers,
- Ira McDonald
  High North Inc

-----Original Message-----
From: Mike Brown [mailto:mike@skew.org]
Sent: Friday, September 26, 2003 1:32 PM
To: uri@w3.org
Subject: Re: encoding related phrasing

I want to revise my suggestion for rewording section 2.1. In the last paragraph, I didn't follow my own advice! Also, there are a couple of other details that I want to address.

1. I missed a "character set".

2. Prior to the adoption of RFCs 2277 and 2718, protocols and URI schemes were free to mandate the use of encodings other than UTF-8 as the basis for %-escaping, or to not speak to the issue at all (HTTP being the most notorious example). This should be acknowledged when recommending UTF-8.

3. Link the first mention of escaping to section 2.4 (#escape).

4. Even though the reader probably can figure out what is meant, the recommended action to encode-then-escape can be difficult to follow to the letter. If you escape an octet, then you have a triplet of characters. So far, so good. But if you don't escape an octet, then you've got ...an octet. You might say to just use the characters represented by the unescaped octets, but then this makes me think the whole example is redundant, saying, in effect, "to escape certain characters, encode them all so you know how to escape them, but then just escape the ones you need to." What's the point? Just drop this entirely.

So, the last paragraph should be sufficient if it reads like this:

   In accordance with the trend toward UTF-8 [RFC2279] (see also [RFC2277] and [RFC2718]), when a URI scheme defines a component that represents textual data consisting of characters from the Unicode / ISO/IEC 10646 repertoire and does not mandate the use of some other encoding, we recommend using UTF-8 [RFC2279] to determine the octets used to escape [#escape] characters that are not in the unreserved set.
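
For illustration only, the recommendation in the proposed paragraph could be sketched roughly as follows in Python. This is not text from either message. The function name escape_component is hypothetical, and the unreserved set used here (letters, digits, "-", ".", "_", "~") is an assumption borrowed from the later RFC 3986; the draft under discussion may define the set differently. Unreserved characters pass through as characters, and UTF-8 is consulted only to find the octets of the characters that must be escaped, which sidesteps the "you've got ...an octet" problem raised in point 4.

```python
# Assumed unreserved set (RFC 3986 style); the 2003 draft may differ.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "-._~"
)


def escape_component(text: str) -> str:
    """Escape every character outside UNRESERVED via its UTF-8 octets.

    Hypothetical helper: unreserved characters are copied as-is, and each
    other character becomes one %XX triplet per octet of its UTF-8 encoding.
    """
    out = []
    for ch in text:
        if ch in UNRESERVED:
            # Unreserved: stays a character, never converted to octets.
            out.append(ch)
        else:
            # Not unreserved: UTF-8 determines the octets to escape.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    print(escape_component("café"))
    print(escape_component("a b"))
```

A scheme that mandates some other encoding would swap "utf-8" for that encoding; the sketch only covers the case where the scheme is silent.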
Received on Saturday, 27 September 2003 14:48:18 UTC