- From: Greg Stein <gstein@lyra.org>
- Date: Thu, 1 Mar 2001 00:33:55 -0800
- To: John Glavin <john@riverfrontsoftware.com>, w3c-dist-auth@w3.org
That would be a good heuristic for handling this.

We could augment the DAV:href element with an optional attribute, like this:

  <D:href D:original-charset="iso-8859-1">http://some.host/Magn%FCs.txt</D:href>

and

  <D:href D:original-charset="utf-8">http://some.host/C%C3%A9sar.txt</D:href>

That will help within the responses (if a server supplies it, then you don't
have to guess; otherwise, fall back to your heuristic). We'd still need a way
to determine the original charset of the Request-URI, though, to fully solve
the problem.

JimW: can we list this as an "Issue" for RFC 2518?

Cheers,
-g

On Wed, Feb 28, 2001 at 10:57:05PM -0800, John Glavin wrote:
> Thanks, this makes sense now. I guess what I can do is see if the URI
> contains a valid UTF-8 sequence and if it does then assume it's UTF-8
> encoded. I got the following text from RFC-2279
>
>    UTF-8 strings can be fairly reliably recognized as such by a
>    simple algorithm, i.e. the probability that a string of characters
>    in any other encoding appears as valid UTF-8 is low, diminishing
>    with increasing string length.
>
> Webfolders must be doing something similar to this since it understands
> both UTF-8 and Latin-1.
>
> John Glavin
> RiverFront Software
> john@webdrive.com
> http://www.webdrive.com
>
>
> ----- Original Message -----
> From: "Greg Stein" <gstein@lyra.org>
> To: "John Glavin" <john@riverfrontsoftware.com>
> Cc: <w3c-dist-auth@w3.org>
> Sent: Wednesday, February 28, 2001 4:54 PM
> Subject: Re: UTF-8 Encoding Question
>
>
> > We just had a discussion related to this on mod_dav's mailing list.
> >
> > On Wed, Feb 28, 2001 at 03:52:37PM -0800, John Glavin wrote:
> > >...
> > > But I run into a problem with the mydocsonline.com DAV server which says
> > > it is using UTF-8 Encoding but returns the href as href: Magn%FCs.txt
> > > This is not UTF-8 encoded, because characters > 0x80 in UTF-8 will be
> > > encoded in a multibyte sequence. This is normal ISO-8859 (Latin)
> > > Encoding.
> >
> > There are two references to UTF-8 in the response: the Content-Type header
> > and the XML document header:
> >
> >   Content-Type: text/xml; charset="utf-8"
> >   <?xml version="1.0" encoding="utf-8"?>
> >
> > Both of these refer to the *response body*. In that sense, all characters
> > in the body are properly UTF-8 encoded.
> >
> > The URL itself is in its "escaped" form. See sections 2.4.2 of RFC 2396
> > for more info. Section 2.1 covers the general problem of UTF-8 encodings
> > for URLs.
> >
> >
> > To be more concrete. Section 2.1 defines two types of characters: "URI
> > characters", and "original characters". The "utf-8" above refers to the
> > URI characters since that is what is sitting in the body of the response.
> >
> > The % escaping will give you a set of octets. The question then becomes,
> > "what encoding will transform those octets into the 'original'
> > characters?" At the moment, you do not have enough information to do
> > that. There is no attribute or header or other item that you can inspect
> > for that.
> >
> > > In this case I am not sure what to do. I use the Windows API call
> > > MultiByteToWideChar function but I need to tell it to use either UTF-8
> > > or ANSI code pages. For the mydocsonline server I need to use ANSI
> > > however they are telling me to use UTF-8 and using UTF-8 wont work.
> > >
> > > When I use Webfolders it works properly on the mydocsonline server and
> > > somehow knows to not use UTF-8 decoding. Does anyone have any idea why
> > > it works or how I could really detect which code page to use ?
> >
> > I think your statement about it "working" for some servers, and not
> > working for mydocsonline (which is based on an early mod_dav; the current
> > mod_dav has the same issue, tho) is based on a presumption that the
> > character set for the URI characters == the charset of the original
> > characters. That assumption is being made by servers and clients today.
> >
> > In mod_dav's case, we take the URI's (unescaped) octets and simply save
> > the resource under that name. We then return it using the same octet
> > sequence (properly escaped). The net effect is that we keep the same
> > encoding of the "original characters" for the client. Of course, the
> > problem arises when one client saves using a UTF-8 encoding and another
> > reads as Latin-1.
> >
> > But mod_dav does not have enough information from the client to decode
> > the URL into (say) Unicode, and save that. If it could, then we could
> > always return a UTF-8 encoding for the original characters (although we
> > would still have no way to tell that encoding to the client; clients
> > would just continue to assume the response encoding matches that encoding
> > and it would *happen* to match).
> >
> >
> > To be really clear, let's go with a little diagram here:
> >
> >     Unicode resource name ("original characters")
> >                 |
> >         [ UTF-8 encoding ]
> >                 |
> >         URI octet sequence
> >                 |
> >         [ URI %-escaping ]
> >                 |
> >     Returned URI ("URI characters")
> >
> > But another totally legal scenario is:
> >
> >     Latin-1 resource name ("original characters")
> >                 |
> >     [ identity (no) encoding ]
> >                 |
> >         URI octet sequence
> >                 |
> >         [ URI %-escaping ]
> >                 |
> >     Returned URI ("URI characters")
> >
> > There is nothing in the response body to indicate which of the above two
> > forms is occurring. Similarly, there is nothing in the request body to
> > indicate which was used for the Request-URI. Because of the latter,
> > servers are just as broken if they make an assumption about how to decode
> > from the URI octets into original characters.
> >
> > mod_dav does not attempt to decode/encode between octets and original
> > characters. It just keeps them as octets. But that does imply that the
> > encoding used by the client when it stored the resource better be the
> > same encoding used when accessing the resource and the same decoding used
> > for a PROPFIND result.
> >
> > RFC 2396, section 2.1 explicitly punts this issue to a future date. It
> > seems that I recall an internet draft, or even possibly a new RFC, but
> > I'm not immediately aware of it.
> >
> > Cheers,
> > -g
> >
> > --
> > Greg Stein, http://www.lyra.org/

--
Greg Stein, http://www.lyra.org/
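
For illustration only, here is a minimal Python sketch of the client-side heuristic discussed in this thread: undo the %-escaping to get the URI octets, honor a declared charset if one is supplied, otherwise try UTF-8 and fall back to Latin-1. The function name decode_href and its declared_charset parameter are hypothetical. The D:original-charset attribute it stands in for is only the proposal made in this message, not part of RFC 2518, and nothing here describes what Webfolders or mod_dav actually do.

```python
from typing import Optional, Tuple
from urllib.parse import unquote_to_bytes


def decode_href(escaped: str, declared_charset: Optional[str] = None) -> Tuple[str, str]:
    """Return (original characters, charset used) for a %-escaped URI.

    declared_charset is a hypothetical stand-in for the proposed
    D:original-charset attribute; pass None when the server gives no hint.
    """
    # Step 1: undo the URI %-escaping to recover the raw octet sequence.
    octets = unquote_to_bytes(escaped)

    # Step 2: if the server declared the original charset, do not guess.
    if declared_charset is not None:
        return octets.decode(declared_charset), declared_charset

    # Step 3: heuristic. Octets that form valid UTF-8 are assumed to be UTF-8.
    try:
        return octets.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Step 4: fallback. Every octet is a valid ISO-8859-1 character,
        # so this decode cannot fail (which does not make the guess right).
        return octets.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    for href in ("http://some.host/Magn%FCs.txt",
                 "http://some.host/C%C3%A9sar.txt"):
        print(decode_href(href))
```

Two caveats on this sketch: a pure-ASCII URI is reported as "utf-8" because ASCII is a subset of both charsets, and a short Latin-1 name whose octets happen to form valid UTF-8 will be misread, which is the low but nonzero probability the RFC 2279 text quoted above refers to.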
Received on Thursday, 1 March 2001 03:33:45 UTC