Re: UTF-8 Encoding Question from Dan Brotsky on 2001-03-01 (w3c-dist-auth@w3.org from January to March 2001)

From: Dan Brotsky <dbrotsky@Adobe.COM>
Date: Thu, 1 Mar 2001 08:57:47 -0500
To: "John Glavin" <john@riverfrontsoftware.com>
Cc: <w3c-dist-auth@w3.org>
Message-Id: <p04330106b6c402a9a549@[192.168.1.6]>

At 3:52 PM -0800 2/28/01, John Glavin wrote:
>But I run into a problem with the mydocsonline.com DAV server which 
>says it is using UTF-8 Encoding but returns the href as
>href: Magn%FCs.txt      This is not UTF-8 encoded, because 
>characters > 0x80 in UTF-8 will be encoded in a multibyte sequence. 
>This is normal ISO-8859 (Latin) Encoding.

As Greg pointed out on dav-dev, it's very important *not* to confuse 
the "UTF-8 encoding" explicitly employed for the XML response bodies 
with any possible "UTF-8 encoding" of non-ascii characters in the 
URLs in these bodies (before they are %-escaped).  XML bodies are 
explicitly encoded, but this encoding does NOT apply to any URL 
segments that are %-encoded *inside* the XML bodies.

The fact that URLs, in and of themselves, do not indicate the 
encoding used for their path segments is a known issue in the HTTP 
community, and is specifically referred to in several specs.  At this 
point, it is pretty much up to clients and servers to agree on 
conventions about what encoding will be used in such path segments. 
There seem to be two common appraches:

1. No encoding is done at all.  The platform-specific character set 
components of a filename are viewed as an octet stream and are then 
%-encoded directly into the URL.  Since many platform-specific 
character sets are very close to ISO 8859-1 (ISO Latin 1), this often 
looks like the client or server is using an ISO 8859-1 encoding of 
these segments, but it's often just the platform encoding.

2. Encoding is done by translating the platform-specific character 
set components of a filename into Unicode, and then UTF-8 (or 
sometimes ISO 8859-1) encoding that Unicode into an octet stream 
which is then %-encoded for use in an URL.

The advantage of approach 2 [in the case where UTF-8 is used], as I 
outlined in an earlier message to dav-dev, is that clients and 
servers with different platform character sets but who are both 
capable of encoding certain characters can unambiguously communicate 
about filenames that contain those characters.  This is increasingly 
important in today's global net community as wide-character-capable 
clients with divergent platform character sets proliferate.

I am encouraged to see that many of the "big" browser clients (IE and 
Netscape) are beginning to use UTF-8 encodings of platform character 
sets in the URLs they generate and to understand UTF-8 encodings in 
URLs they receive.  It would be nice to see the prevalent DAV servers 
(such as mod_dav) take approach #2 with UTF-8 as well.  (Approach #2 
using ISO 8859-1 has far fewer advantages, since ISO 8859-1 encodes a 
very restricted subset of Unicode.)

     dan

Received on Thursday, 1 March 2001 11:58:24 UTC