- From: Dan Brotsky <dbrotsky@Adobe.COM>
- Date: Thu, 1 Mar 2001 08:57:47 -0500
- To: "John Glavin" <john@riverfrontsoftware.com>
- Cc: <w3c-dist-auth@w3.org>
At 3:52 PM -0800 2/28/01, John Glavin wrote: >But I run into a problem with the mydocsonline.com DAV server which >says it is using UTF-8 Encoding but returns the href as >href: Magn%FCs.txt This is not UTF-8 encoded, because >characters > 0x80 in UTF-8 will be encoded in a multibyte sequence. >This is normal ISO-8859 (Latin) Encoding. As Greg pointed out on dav-dev, it's very important *not* to confuse the "UTF-8 encoding" explicitly employed for the XML response bodies with any possible "UTF-8 encoding" of non-ascii characters in the URLs in these bodies (before they are %-escaped). XML bodies are explicitly encoded, but this encoding does NOT apply to any URL segments that are %-encoded *inside* the XML bodies. The fact that URLs, in and of themselves, do not indicate the encoding used for their path segments is a known issue in the HTTP community, and is specifically referred to in several specs. At this point, it is pretty much up to clients and servers to agree on conventions about what encoding will be used in such path segments. There seem to be two common appraches: 1. No encoding is done at all. The platform-specific character set components of a filename are viewed as an octet stream and are then %-encoded directly into the URL. Since many platform-specific character sets are very close to ISO 8859-1 (ISO Latin 1), this often looks like the client or server is using an ISO 8859-1 encoding of these segments, but it's often just the platform encoding. 2. Encoding is done by translating the platform-specific character set components of a filename into Unicode, and then UTF-8 (or sometimes ISO 8859-1) encoding that Unicode into an octet stream which is then %-encoded for use in an URL. The advantage of approach 2 [in the case where UTF-8 is used], as I outlined in an earlier message to dav-dev, is that clients and servers with different platform character sets but who are both capable of encoding certain characters can unambiguously communicate about filenames that contain those characters. This is increasingly important in today's global net community as wide-character-capable clients with divergent platform character sets proliferate. I am encouraged to see that many of the "big" browser clients (IE and Netscape) are beginning to use UTF-8 encodings of platform character sets in the URLs they generate and to understand UTF-8 encodings in URLs they receive. It would be nice to see the prevalent DAV servers (such as mod_dav) take approach #2 with UTF-8 as well. (Approach #2 using ISO 8859-1 has far fewer advantages, since ISO 8859-1 encodes a very restricted subset of Unicode.) dan
Received on Thursday, 1 March 2001 11:58:24 UTC