Re: what should the charset be in the response to the server from Martin Duerst on 2004-05-07 (www-international@w3.org from April to June 2004)

From: Martin Duerst <duerst@w3.org>
Date: Fri, 07 May 2004 15:02:55 +0900
To: "Chris Haynes" <chris@harvington.org.uk>, <www-international@w3.org>
Cc: "Michel Suignard" <michelsu@microsoft.com>, public-iri@w3.org
Message-Id: <4.2.0.58.J.20040507142830.05a7f0f8@localhost>
Hello Chris,

Many thanks for your reply. I have copied the IRI list
because I think this discussion is relevant for the current
draft.

At 13:38 04/05/06 +0100, Chris Haynes wrote:
>Thanks for the response, Martin,
>
>I only noticed this response _after_ I had replied to your other response
>on the IRI list, so I apologize that my earlier response did not take
>this message of yours into account.
>
>Trying to bring this topic to closure, I think my core worry arises each time
>there are what-appear-to-me-to-be normative statements that 'the page encoding
>determines the encoding used in requests derived from that page', ignoring
>the possibility of users having changed the encoding setting.
>
>We obviously both agree that users 'should not' use these controls (just as I
>disapprove of the use of 'tone controls' and spectral filters in Hi Fi systems
>for other than 'loudness' compensation), but I get worried every time the
>possibility of their use is ignored.
>
>The situation is not purely 'theoretical': I've seen reports that it is
>common practice in some countries for people to switch to their 'national'
>character set every time they appear to have a problem in viewing a page,
>which could be occasioned by their browser not having UTF-8 support.

Ok, so let's have a look at this case: Either switching to their 'national'
character encoding solves the problem, in which case the page was badly
labeled, and the page author is to blame. Or switching does not solve
the problem, in which case the user may not even be able to read the
page, and therefore won't fill in the form. Or the page only contains
US-ASCII characters to begin with, and the user doesn't have any reason
to switch encodings.

That probably leaves us with just one intermediate case: The page is
mostly in US-ASCII, but with a few other characters (e.g. 'smart quotes',...).
The user sees some problem, tries to fix it by switching the encoding.
That doesn't help, so the user gives up, and just fills in the form
(which is readable enough to complete the task).

If you know about any other scenarios where switching encoding and then
filling in the form with a wrong encoding can happen realistically,
please tell me.


>I help provide support to the users of an open-source web server, and we
>frequently get requests for help from people managing web services who, having
>read the appropriate RFCs and W3 specs in detail, had not appreciated that
>user agents can change the encoding in ways which the request-receiving
>server cannot detect.

I gave a tutorial about Web internationalization for years, and
the issue of encoding in forms always came up. From the time when
the first browsers supporting UTF-8 came out, that was always given as
the answer, and I haven't heard anybody question it before you. But
of course your mileage may vary.

But there is an additional point: A server isn't helpless against users
changing the encoding. UTF-8 has the very helpful property of having
very specific byte sequences. It is easy to check these with a
regular expression; for an example, please see
http://www.w3.org/International/questions/qa-forms-utf-8.html.
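
As a rough illustration of that kind of check, here is a minimal sketch
in Python. It is not the code from the page linked above; the names
UTF8_RE and looks_like_utf8 are made up for this example, and the byte
ranges follow the well-formed UTF-8 sequences of RFC 3629.

```python
import re

# One alternative per shape of well-formed UTF-8 sequence (RFC 3629).
# UTF8_RE and looks_like_utf8 are hypothetical names for this sketch.
UTF8_RE = re.compile(
    rb"""(?:
        [\x09\x0A\x0D\x20-\x7E]              # ASCII: tab, LF, CR, printable
      | [\xC2-\xDF][\x80-\xBF]               # 2-byte, excluding overlongs
      | \xE0[\xA0-\xBF][\x80-\xBF]           # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3-byte, straight
      | \xED[\x80-\x9F][\x80-\xBF]           # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4-byte, planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}            # 4-byte, planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4-byte, plane 16
    )*""",
    re.VERBOSE,
)


def looks_like_utf8(raw: bytes) -> bool:
    """Return True if the raw (already %HH-unescaped) bytes are well-formed UTF-8."""
    return UTF8_RE.fullmatch(raw) is not None


if __name__ == "__main__":
    print(looks_like_utf8("café".encode("utf-8")))       # bytes sent in UTF-8
    print(looks_like_utf8("café".encode("iso-8859-1")))  # bytes sent in a legacy encoding
```

Note that pure US-ASCII input passes whatever encoding the user selected,
and a legacy-encoded string can in rare cases happen to be well-formed
UTF-8, so this is a strong hint rather than a proof.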


>I suppose I'm just keen to make sure that wherever this topic appears, the
>potential behavior of the vast majority of browsers in the world is adequately
>and completely described.
>
>If there were an RFC somewhere which said that the user agent 'MUST NOT'
>change the encoding, and real-world browsers were ignoring this stricture,
>I would agree that other RFCs were right to describe what should be, rather
>than what is.
>
>But as far as I know, the ability for users to override the encoding does not
>contravene any existing RFC, and therefore other RFCs ought at least to
>recognize that possibility, and not imply, by omission, a level of certainty
>which can never be assured.
>
>I think I would have a very poor view of any web site which told me it was my
>fault a request got garbled because I made use of a freely-available control
>on my browser.
>
>Let me try to conclude this by just asking that, so long as user control over
>the encoding is permitted by RFCs, that possibility is explicitly recognized
>by other RFCs, and that we don't try to pretend that it does not exist or,
>even worse, that failures and errors in decoding are the user's fault for
>breaking an unwritten, untestable non-rule.

I'm still not sure to what extent this is really happening. But I have
clarified this issue by expanding the sentence in question as follows:

"Likewise, when setting up a new Web form using UTF-8 as the encoding
of the form page, the returned query URIs will use UTF-8 as an encoding
(unless the user for whatever reason changes the character encoding)
and will therefore be compatible with IRIs."

This leaves it to the reader to judge for him/herself how high
the probability is that the user is switching code pages.
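
To make the effect concrete, here is a small sketch in Python of what the
query part looks like when the same form value is submitted in UTF-8 and
in a legacy encoding, and what a server that assumes UTF-8 gets out of
each. The field name "q" and the value are made up for illustration;
real browsers do the encoding step themselves.

```python
from urllib.parse import parse_qs, urlencode

value = "café"  # hypothetical form input

# What the browser sends if the form page is (still) treated as UTF-8 ...
utf8_query = urlencode({"q": value}, encoding="utf-8")
# ... and what it sends if the user has switched to a legacy encoding.
latin1_query = urlencode({"q": value}, encoding="iso-8859-1")

print(utf8_query)    # q=caf%C3%A9
print(latin1_query)  # q=caf%E9

# A server that assumes UTF-8 recovers the first value intact; in the
# second, the stray byte becomes U+FFFD (REPLACEMENT CHARACTER).
print(parse_qs(utf8_query, encoding="utf-8"))
print(parse_qs(latin1_query, encoding="utf-8", errors="replace"))
```

Only the first form of the query is compatible with IRIs in the sense of
the sentence above.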

Regards,    Martin.


>Chris
>
>
>----- Original Message -----
>From: "Martin Duerst" <duerst@w3.org>
>To: "Chris Haynes" <chris@harvington.org.uk>; <www-international@w3.org>
>Cc: "Michel Suignard" <michelsu@microsoft.com>
>Sent: Thursday, May 06, 2004 8:04 AM
>Subject: Re: what should the charset be in the response to the server
>
>
> > Hello Chris,
> >
> > In trying to clear up the remaining IRI issues, I found out that
> > I planned to reply to this message of yours, but didn't get around
> > to doing it.
> >
> > At 17:20 03/08/07 +0100, Chris Haynes wrote:
> >
> > >  "Martin Duerst" Replied:
> > >
> > >
> > > > At 12:15 03/07/26 +0100, Chris Haynes wrote:
> > > >
> > > > >  "Jungshik Shin" replied at: Saturday, July 26, 2003 11:31 AM
> > > >
> > > > > >   It also depends on whether or not you set 'send URLs always in
> > > > > > UTF-8' in Tools|Options(?) in MS IE.
> > > > > >
> > > > >
> > > > >True, but I'm trying to find a 'reliable' mechanism which is not
> > > > >dependent on user-accessible controls.
> > > > >IMHO, this is also a 'dangerous' option, in that it goes against the
> > > > >de facto conventions and anticipates (perhaps incorrectly) the
> > > > >recommendations of the proposed IRI RFC. It can only safely be used
> > > > >with a 'consenting' server site.
> > > >
> > > > Sorry, no. The main dangerous thing is that authors use non-ASCII
> > > > characters in URIs (without any %HH escaping) when this is clearly
> > > > forbidden.
> > > >
> > > > Regards,  Martin.
> > >
> > >
> > >Martin,
> > >
> > >Are you saying that you approve of relying on users to select the
> > >(Microsoft-specific) 'send URLs always in UTF-8' menu option to ensure
> > >that UTF-8 gets returned to the server?
> > >
> > >That is what was being suggested.
> >
> > Well, my above statement was meant in the following sense:
> > There is NO spec that would allow inclusion of non-ASCII
> > characters in URIs. The IRI spec is the first one that
> > defines something similar to a URI that actually allows this.
> > Any authors that for example put raw iso-8859-1 characters
> > into a URI in a page in iso-8859-1 are therefore wrong;
> > any 'it works' effect is coincidental, not according to specs.
> > Suggesting that a browser that anticipates a future spec
> > (the IRI spec) is dangerous, while (implicitly) blessing
> > browsers and pages that don't conform to any spec is in
> > my eyes a dangerous idea.
> >
> >
> > >My argument was that any current HTTP-like system in which the
> > >character encoding could be modified by menu controls in the user
> > >agent, (and in which the actual encoding used is *not* conveyed in the
> > >request) was inherently unreliable.
> >
> > I think we have to look at different parts of an HTTP request separately.
> > There are mainly two parts: the 'path' part and the 'query' part.
> >
> > With respect to the path part, this is indeed influenced by the
> > 'send URLs always in UTF-8' option in MS IE. But there are ways
> > to get around this. For an example, see my Apache 'mod_fileiri'
> > module, which allows requests both in a legacy encoding and
> > in UTF-8 to be mapped back to the file in question.
> > [see http://www.w3.org/2003/06/mod_fileiri/Overview.html for an overview,
> > including pointers to the actual code and to a talk of mine].
> >
> > With respect to the query part, this is not affected by the
> > 'send URLs always in UTF-8' option in MS IE. The query part
> > is always sent in the encoding of the actual page, except
> > for some browsers that implement the 'accept-charset' attribute
> > on <form>. But for queries, it is rather easy to e.g. convert
> > all the forms related to that query URI to UTF-8.
> >
> > You are right that the (perceived) character encoding of the
> > page can affect both parts. Of course, users might always
> > change the character encoding, and as a result send something
> > that the server gets as garbage. However, users don't use
> > menus just for fun, and if anybody ever came and complained,
> > the server side would be quite justified to say "don't mess
> > around with the settings if you expect your queries to work".
> > So this is very much a theoretical concern.
> >
> >
> > Regards,    Martin.
> >
> >
Received on Friday, 7 May 2004 02:10:47 UTC