Re: UTF-8 Encoding Question

We just had a discussion related to this on mod_dav's mailing list.

On Wed, Feb 28, 2001 at 03:52:37PM -0800, John Glavin wrote:
>...
> But I run into a problem with the mydocsonline.com DAV server which says
> it is using UTF-8 Encoding but returns the href as href: Magn%FCs.txt
> This is not UTF-8 encoded, because characters > 0x80 in UTF-8 will be
> encoded in a multibyte sequence.  This is normal ISO-8859 (Latin) Encoding.

There are two references to UTF-8 in the response: the Content-Type header
and the XML document header:

    Content-Type: text/xml; charset="utf-8"
    <?xml version="1.0" encoding="utf-8"?>

Both of these refer to the *response body*. In that sense, all characters in
the body are properly UTF-8 encoded.

The URL itself is in its "escaped" form. See section 2.4.2 of RFC 2396 for
more info. Section 2.1 covers the general problem of UTF-8 encodings for
URLs.


To be more concrete: Section 2.1 defines two types of characters, "URI
characters" and "original characters". The "utf-8" above refers to the URI
characters, since that is what is sitting in the body of the response.

The % escaping will give you a set of octets. The question then becomes,
"what encoding will transform those octets into the 'original' characters?"
At the moment, you do not have enough information to do that. There is no
attribute or header or other item that you can inspect for that.
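
As an illustration (not from the original exchange), here is a minimal Python
sketch of that step. It unescapes the href from John's example and tries two
candidate decodings. The helper name try_decode is made up for this sketch.
Nothing in the response says which decoding is right; the code only shows
what each guess yields.

```python
from urllib.parse import unquote_to_bytes

# The href exactly as it appears in the PROPFIND response ("URI characters").
href = "Magn%FCs.txt"

# Undoing the %-escaping yields octets, not characters.
octets = unquote_to_bytes(href)


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None if the octets
    are not valid in the given encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Guess 1: the octets are UTF-8. A lone 0xFC is not valid UTF-8, so this fails.
as_utf8 = try_decode(octets, "utf-8")

# Guess 2: the octets are Latin-1. Every octet is valid Latin-1, so this
# always "succeeds", whether or not it is what the author of the name meant.
as_latin1 = try_decode(octets, "latin-1")
```

Note that a failed UTF-8 decode rules UTF-8 out, but a successful one does
not prove it in: plenty of octet sequences are valid in both encodings.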

> In this case I am not sure what to do.  I use the Windows API call
> MultiByteToWideChar function but I need to tell it to use either UTF-8 or
> ANSI code pages.  For the mydocsonline server I need to use ANSI however
> they are telling me to use UTF-8 and using UTF-8 wont work.
> 
> When I use Webfolders it works properly on the mydocsonline server and
> somehow knows to not use UTF-8 decoding.  Does anyone have any idea why it
> works or how I could really detect which code page to use ?

I think your statement about it "working" for some servers, and not working
for mydocsonline (which is based on an early mod_dav; the current mod_dav
has the same issue, though), is based on a presumption that the character
set for the URI characters == the charset of the original characters. That
assumption is being made by servers and clients today.

In mod_dav's case, we take the URI's (unescaped) octets and simply save the
resource under that name. We then return it using the same octet sequence
(properly escaped). The net effect is that we keep the same encoding of the
"original characters" for the client. Of course, the problem arises when one
client saves using a UTF-8 encoding and another reads as Latin-1.

But mod_dav does not have enough information from the client to decode the
URL into (say) Unicode, and save that. If it could, then we could always
return a UTF-8 encoding for the original characters (although we would still
have no way to tell that encoding to the client; clients would just continue
to assume the response encoding matches that encoding and it would *happen*
to match).


To be really clear, let's go with a little diagram here:

    Unicode resource name ("original characters")
       |
     [ UTF-8 encoding ]
       |
    URI octet sequence
       |
     [ URI %-escaping ]
       |
    Returned URI ("URI characters")

But another totally legal scenario is:

    Latin-1 resource name ("original characters")
       |
     [ identity (no) encoding ]
       |
    URI octet sequence
       |
     [ URI %-escaping ]
       |
    Returned URI ("URI characters")

There is nothing in the response body to indicate which of the above two
forms is occurring. Similarly, there is nothing in the request body to
indicate which was used for the Request-URI. Because of the latter, servers
are just as broken if they make an assumption about how to decode from the
URI octets into original characters.
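
The two diagrams above can be sketched in a few lines of Python. This is only
an illustration under the assumption that the resource name is the one from
John's example; the variable names are made up here. The same original
characters produce two different, equally legal returned URIs, and neither
one carries a label saying which encoding was applied.

```python
from urllib.parse import quote

# The "original characters": u-umlaut is U+00FC.
original = "Magn\u00fcs.txt"

# Scenario 1: UTF-8 encoding, then URI %-escaping.
utf8_octets = original.encode("utf-8")
uri_via_utf8 = quote(utf8_octets)

# Scenario 2: Latin-1 (one octet per character), then URI %-escaping.
latin1_octets = original.encode("latin-1")
uri_via_latin1 = quote(latin1_octets)

# Same name, two different "URI characters" strings.
same_uri = (uri_via_utf8 == uri_via_latin1)
```

The second form is the one the mydocsonline server returned.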

mod_dav does not attempt to decode/encode between octets and original
characters. It just keeps them as octets. But that does imply that the
encoding used by the client when it stored the resource had better be the
same encoding used when accessing the resource, and the same decoding used
for a PROPFIND result.
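
A rough Python model of that octet-preserving behavior, to show where it
breaks. This is a sketch, not mod_dav's code: the dict and the put, get, and
propfind_hrefs functions are hypothetical stand-ins for the server. Client A
stores a resource using UTF-8; client B assumes Latin-1.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical stand-in for the server's namespace, keyed by raw octets.
store = {}


def put(request_uri, body):
    # Unescape to octets and save under exactly those octets.
    store[unquote_to_bytes(request_uri)] = body


def get(request_uri):
    return store.get(unquote_to_bytes(request_uri))


def propfind_hrefs():
    # Return the same octet sequences, properly %-escaped.
    return [quote(name) for name in store]


name = "Magn\u00fcs.txt"

# Client A encodes the name as UTF-8 before escaping.
uri_from_a = quote(name.encode("utf-8"))
put(uri_from_a, b"hello")

# Client B reads the PROPFIND result and decodes the octets as Latin-1.
hrefs = propfind_hrefs()
seen_by_b = unquote_to_bytes(hrefs[0]).decode("latin-1")

# Client B builds a Request-URI for the same name using Latin-1.
uri_from_b = quote(name.encode("latin-1"))

found_by_a = get(uri_from_a)  # same encoding as at store time
found_by_b = get(uri_from_b)  # different octets, so a different resource name
```

Client A round-trips fine. Client B sees a mangled name in the listing and
gets nothing back when it asks for the name it thinks it is looking for.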

RFC 2396, section 2.1 explicitly punts this issue to a future date. I seem
to recall an Internet-Draft, or possibly even a new RFC, that addresses it,
but I can't point to one offhand.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Received on Wednesday, 28 February 2001 19:54:05 UTC