Re: what should the charset be in the response to the server from Martin Duerst on 2004-05-06 (www-international@w3.org from April to June 2004)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 06 May 2004 16:04:38 +0900
To: "Chris Haynes" <chris@harvington.org.uk>, <www-international@w3.org>
Cc: Michel Suignard <michelsu@microsoft.com>
Message-Id: <4.2.0.58.J.20030811105244.0546e258@localhost>

Hello Chris,

While trying to clear up the remaining IRI issues, I found that
I had planned to reply to this message of yours, but never got
around to doing it.

At 17:20 03/08/07 +0100, Chris Haynes wrote:

>  "Martin Duerst" Replied:
>
>
> > At 12:15 03/07/26 +0100, Chris Haynes wrote:
> >
> > >  "Jungshik Shin" replied at: Saturday, July 26, 2003 11:31 AM
> >
> > > >   It also depends on whether or not you set 'send URLs always in
> > > > UTF-8' in Tools|Options(?) in MS IE.
> > > >
> > >
> > >True, but I'm trying to find a 'reliable' mechanism which is not
> > >dependent on user-accessible controls.
> > >IMHO, this is also a 'dangerous' option, in that it goes against the
> > >de facto conventions and anticipates (perhaps incorrectly) the
> > >recommendations of the proposed IRI RFC. It can only safely be used
> > >with a 'consenting' server site.
> >
> > Sorry, no. The main dangerous thing is that authors use non-ASCII
> > characters in URIs (without any %HH escaping) when this is clearly
> > forbidden.
> >
> > Regards,  Martin.
>
>
>Martin,
>
>Are you saying that you approve of relying on users to select the
>(Microsoft-specific)  'send URLs always in
>UTF-8'  menu option  to ensure that UTF8 gets returned to the server?
>
>That is what was being suggested.

Well, my above statement was meant in the following sense:
There is NO spec that would allow inclusion of non-ASCII
characters in URIs. The IRI spec is the first one that
defines something similar to a URI that actually allows this.
Any authors that for example put raw iso-8859-1 characters
into a URI in a page in iso-8859-1 are therefore wrong;
any 'it works' effect is coincidental, not according to specs.
Suggesting that a browser that anticipates a future spec
(the IRI spec) is dangerous, while (implicitly) blessing
browsers and pages that don't conform to any spec is in
my eyes a dangerous idea.
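
To make the %HH point concrete, here is a minimal sketch in Python
(the helper name escape_path and the sample path are only
illustrative assumptions). It shows that the same character is
escaped to different byte sequences depending on which encoding is
assumed, which is why a raw non-ASCII character in a URI is ambiguous.

```python
from urllib.parse import quote

def escape_path(path: str, encoding: str) -> str:
    """Percent-escape every non-ASCII character of a path (hypothetical helper).

    The bytes that get escaped depend on the chosen character encoding.
    """
    return quote(path, safe="/", encoding=encoding)

# The same character, escaped under two different encodings.
latin1_form = escape_path("/café", "iso-8859-1")  # one byte for 'é'
utf8_form = escape_path("/café", "utf-8")         # two bytes for 'é'
print(latin1_form, utf8_form)
```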

>My argument was that any current HTTP-like system in which the
>character encoding could be modified by menu controls in the user
>agent, (and in which the actual encoding used is *not* conveyed in the
>request) was inherently unreliable.

I think we have to look at different parts of an HTTP request separately.
There are mainly two parts: the 'path' part and the 'query' part.

With respect to the path part, this is indeed influenced by the
'send URLs always in UTF-8' option in MS IE. But there are ways
to get around this. For an example, see my Apache 'mod_fileiri'
module, which can map requests both in a legacy encoding and
in UTF-8 back to the file in question.
[see http://www.w3.org/2003/06/mod_fileiri/Overview.html for an overview,
including pointers to the actual code and to a talk of mine].
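
The general idea of accepting both forms can be sketched as follows.
This is not the mod_fileiri code, only an illustration in Python
under the assumption that the legacy encoding is iso-8859-1; the
function name decode_request_path is hypothetical. It tries UTF-8
first, because bytes in a legacy encoding rarely form valid UTF-8,
and falls back to the legacy encoding otherwise.

```python
from urllib.parse import unquote_to_bytes

def decode_request_path(raw_path: str, legacy_encoding: str = "iso-8859-1") -> str:
    """Map a percent-escaped request path back to one Unicode file name.

    Hypothetical helper: tries UTF-8 first, then the assumed legacy encoding.
    """
    raw = unquote_to_bytes(raw_path)  # undo the %HH escaping, yielding bytes
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the request used the legacy encoding.
        return raw.decode(legacy_encoding)

# Both request forms resolve to the same name.
from_utf8 = decode_request_path("/caf%C3%A9")
from_legacy = decode_request_path("/caf%E9")
print(from_utf8 == from_legacy)
```

A heuristic like this can misjudge a legacy byte sequence that
happens to be valid UTF-8, so it is a sketch of the approach rather
than a complete solution.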

With respect to the query part, this is not affected by the
'send URLs always in UTF-8' option in MS IE. The query part
is always sent in the encoding of the actual page, except
for some browsers that implement the 'accept-charset' attribute
on <form>. But for queries, it is rather easy to, for example,
convert all the forms that submit to that query URI to UTF-8.
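
As a rough illustration of why serving the forms in UTF-8 helps,
the following Python sketch simulates a browser that encodes the
query in the encoding of the page and a server that always decodes
as UTF-8. The names submit_query and read_query and the field "q"
are assumptions made for the example.

```python
from urllib.parse import parse_qs, urlencode

def submit_query(fields: dict, page_encoding: str) -> str:
    """Simulate a browser encoding form fields in the page's encoding."""
    return urlencode(fields, encoding=page_encoding)

def read_query(query: str) -> dict:
    """Simulate a server that always decodes query strings as UTF-8."""
    # Undecodable bytes become U+FFFD instead of raising an error.
    return parse_qs(query, encoding="utf-8", errors="replace")

# Form served in UTF-8: the server recovers the original text.
good = read_query(submit_query({"q": "café"}, "utf-8"))

# Form served in iso-8859-1: the server sees a replacement character.
bad = read_query(submit_query({"q": "café"}, "iso-8859-1"))
print(good, bad)
```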

You are right that the (perceived) character encoding of the
page can affect both parts. Of course, users might always
change the character encoding, and as a result send something
that the server gets as garbage. However, users don't use
menus just for fun, and if anybody ever did come and complain,
the server side would be fully justified in saying "don't mess
around with the settings if you expect your queries to work".
So this is very much a theoretical concern.

Regards,    Martin.

Received on Thursday, 6 May 2004 04:48:09 UTC