Re: what should the charset be in the response to the server from Chris Haynes on 2004-05-06 (www-international@w3.org from April to June 2004)

From: Chris Haynes <chris@harvington.org.uk>
Date: Thu, 6 May 2004 13:38:04 +0100
To: <www-international@w3.org>, "Martin Duerst" <duerst@w3.org>
Cc: "Michel Suignard" <michelsu@microsoft.com>
Message-ID: <015501c43366$fbf64cc0$0200000a@ringo>
Thanks for the response, Martin,

I only noticed this response _after_ I had replied to your other response on the
IRI list, so I apologize that my earlier response did not take into account this
message of yours.

Trying to bring this topic to closure, I think my core worry arises each time
there are what-appear-to-me-to-be normative statements that 'the page encoding
determines the encoding used in requests derived from that page' - ignoring the
possibility of users having changed the encoding setting.

We obviously both agree that users 'should not' use these controls (just as I
disapprove of the use of 'tone controls' and spectral filters in Hi Fi systems
for other than 'loudness' compensation), but I get worried every time the
possibility of their use is ignored.

The situation is not purely 'theoretical': I've seen reports that it is common
practice in some countries for people to switch to their 'national' character
set every time they appear to have a problem in viewing a page - which could be
occasioned by their browser not having UTF-8 support.

I help provide support to the users of an open-source web server, and we
frequently get requests for help from people managing web services who, having
read the appropriate RFCs and W3 specs in detail, had not appreciated that user
agents can change the encoding in ways which the request-receiving server cannot
detect.
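
To make the failure concrete, here is a minimal sketch in Python (my own
illustration, not anything taken from an RFC or from the server's code). It
assumes a form field containing the sample word 'café' and a server that expects
UTF-8 because that is how it served the page; both are assumptions for the
example only.

```python
from urllib.parse import quote, unquote

text = "café"  # hypothetical form-field value

# The same field as a browser might percent-encode it under two encodings.
as_utf8 = quote(text, encoding="utf-8")          # page encoding left alone
as_latin1 = quote(text, encoding="iso-8859-1")   # user switched the encoding

# Case 1: the server assumes UTF-8 and the browser really sent UTF-8.
decoded_ok = unquote(as_utf8, encoding="utf-8", errors="replace")

# Case 2: the server assumes UTF-8 but the browser sent ISO-8859-1.
# The lone byte 0xE9 is not valid UTF-8, so it becomes U+FFFD.
decoded_bad = unquote(as_latin1, encoding="utf-8", errors="replace")

# Case 3: the server assumes ISO-8859-1 but the browser sent UTF-8.
# Every byte is a valid ISO-8859-1 character, so nothing fails at all:
# the server silently receives the wrong text.
decoded_silent = unquote(as_utf8, encoding="iso-8859-1")

print(as_utf8, as_latin1)
print(repr(decoded_ok), repr(decoded_bad), repr(decoded_silent))
```

Nothing in either percent-encoded string says which encoding produced it, so
the server can only guess, and in the third case it has no signal at all that
the guess was wrong.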

I suppose I'm just keen to make sure that wherever this topic appears, the
potential behavior of the vast majority of browsers in the world is adequately
and completely described.

If there were an RFC somewhere which said that the user agent 'MUST NOT' change
the encoding, and real-world browsers were ignoring this stricture, I would
agree that other RFCs were right to describe what should be, rather than what
is.

But as far as I know, the ability for users to override the encoding does not
contravene any existing RFC, and therefore other RFCs ought at least to
recognize that possibility, and not imply, by omission, a level of certainty
which can never be assured.

I think I would have a very poor view of any web site which told me it was my
fault a request got garbled because I made use of a freely-available control on
my browser.

Let me try to conclude this by just asking that, so long as user control over
the encoding is permitted by RFCs, that possibility is explicitly recognized by
other RFCs, and that we don't try to pretend that it does not exist or, even
worse, that failures and errors in decoding are the user's fault for breaking an
unwritten, untestable non-rule.

Chris


----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
To: "Chris Haynes" <chris@harvington.org.uk>; <www-international@w3.org>
Cc: "Michel Suignard" <michelsu@microsoft.com>
Sent: Thursday, May 06, 2004 8:04 AM
Subject: Re: what should the charset be in the response to the server


> Hello Chris,
>
> In trying to clear up the remaining IRI issues, I found out that
> I planned to reply to this message of yours, but didn't get around
> to doing it.
>
> At 17:20 03/08/07 +0100, Chris Haynes wrote:
>
> >  "Martin Duerst" Replied:
> >
> >
> > > At 12:15 03/07/26 +0100, Chris Haynes wrote:
> > >
> > > >  "Jungshik Shin" replied at: Saturday, July 26, 2003 11:31 AM
> > >
> > > > > >   It also depends on whether or not you set 'send URLs always in
> > > > > > UTF-8' in Tools|Options(?) in MS IE.
> > > > >
> > > >
> > > >True, but I'm trying to find a 'reliable' mechanism which is not
> > > >dependent on user-accessible controls.
> > > >IMHO, this is also a 'dangerous' option, in that it goes against the
> > > >de facto conventions and anticipates (perhaps incorrectly) the
> > > >recommendations of the proposed IRI RFC. It can only safely be used
> > > >with a 'consenting' server site.
> > >
> > > Sorry, no. The main dangerous thing is that authors use non-ASCII
> > > characters in URIs (without any %HH escaping) when this is clearly
> > > forbidden.
> > >
> > > Regards,  Martin.
> >
> >
> >Martin,
> >
> >Are you saying that you approve of relying on users to select the
> >(Microsoft-specific) 'send URLs always in UTF-8' menu option to ensure
> >that UTF-8 gets returned to the server?
> >
> >That is what was being suggested.
>
> Well, my above statement was meant in the following sense:
> There is NO spec that would allow inclusion of non-ASCII
> characters in URIs. The IRI spec is the first one that
> defines something similar to a URI that actually allows this.
> Any authors that for example put raw iso-8859-1 characters
> into a URI in a page in iso-8859-1 are therefore wrong;
> any 'it works' effect is coincidental, not according to specs.
> Suggesting that a browser that anticipates a future spec
> (the IRI spec) is dangerous, while (implicitly) blessing
> browsers and pages that don't conform to any spec is in
> my eyes a dangerous idea.
>
>
> >My argument was that any current HTTP-like system in which the
> >character encoding could be modified by menu controls in the user
> >agent, (and in which the actual encoding used is *not* conveyed in the
> >request) was inherently unreliable.
>
> I think we have to look at different parts of an HTTP request separately.
> There are mainly two parts: the 'path' part and the 'query' part.
>
> With respect to the path part, this is indeed influenced by the
> 'send URLs always in UTF-8' option in MS IE. But there are ways
> to get around this. For an example, see my Apache 'mod_fileiri'
> module, which allows mapping requests both in a legacy encoding and
> in UTF-8 back to the file in question.
> [see http://www.w3.org/2003/06/mod_fileiri/Overview.html for an overview,
> including pointers to the actual code and to a talk of mine].
>
> With respect to the query part, this is not affected by the
> 'send URLs always in UTF-8' option in MS IE. The query part
> is always sent in the encoding of the actual page, except
> for some browsers that implement the 'accept-charset' attribute
> on <form>. But for queries, it is rather easy to e.g. convert
> all the forms related to that query URI to UTF-8.
>
> You are right that the (perceived) character encoding of the
> page can affect both parts. Of course, users might always
> change the character encoding, and as a result send something
> that the server gets as garbage. However, users don't use
> menus just for fun, and if anybody would ever come and complain,
> the server side would be very justified to say "don't mess
> around with the settings if you expect your queries to work".
> So this is very much a theoretical concern.
>
>
> Regards,    Martin.
>
>
Received on Thursday, 6 May 2004 08:39:09 UTC