
Re: what should the charset be in the response to the server

From: Martin Duerst <duerst@w3.org>
Date: Sun, 09 May 2004 09:38:32 +0900
Message-Id: <4.2.0.58.J.20040509093805.05a6acc8@localhost>
To: "Chris Haynes" <chris@harvington.org.uk>, <www-international@w3.org>
Cc: "Michel Suignard" <michelsu@microsoft.com>, <public-iri@w3.org>

Hello Chris,

Many thanks for your clear response. I have closed this issue.

Regards,    Martin.

At 10:30 04/05/07 +0100, Chris Haynes wrote:

>Martin,
>
>Many thanks for the response.
>
>Your expanded sentence fully addresses this issue, as far as I am concerned.
>
>Chris
>
>
>----- Original Message -----
>From: "Martin Duerst" <duerst@w3.org>
>To: "Chris Haynes" <chris@harvington.org.uk>; <www-international@w3.org>
>Cc: "Michel Suignard" <michelsu@microsoft.com>; <public-iri@w3.org>
>Sent: Friday, May 07, 2004 7:02 AM
>Subject: Re: what should the charset be in the response to the server
>
>
> >
> > Hello Chris,
> >
> > Many thanks for your reply. I have copied the IRI list
> > because I think this discussion is relevant for the current
> > draft.
> >
> > At 13:38 04/05/06 +0100, Chris Haynes wrote:
> > >Thanks for the response, Martin,
> > >
> > > >I only noticed this response _after_ I had replied to your other response
> > > >on the IRI list, so I apologize that my earlier response did not take
> > > >this message of yours into account.
> > > >
> > > >Trying to bring this topic to closure, I think my core worry arises each
> > > >time there are what appear to me to be normative statements that 'the page
> > > >encoding determines the encoding used in requests derived from that page',
> > > >ignoring the possibility of users having changed the encoding setting.
> > >
> > > >We obviously both agree that users 'should not' use these controls (just
> > > >as I disapprove of the use of 'tone controls' and spectral filters in
> > > >Hi-Fi systems for other than 'loudness' compensation), but I get worried
> > > >every time the possibility of their use is ignored.
> > > >
> > > >The situation is not purely 'theoretical'. I've seen reports that it is
> > > >common practice in some countries for people to switch to their 'national'
> > > >character set every time they appear to have a problem in viewing a page,
> > > >which could be occasioned by their browser not having UTF-8 support.
> >
> > Ok, so let's have a look at this case: Either switching to their 'national'
> > character encoding solves the problem, in which case the page was badly
> > labeled, and the page author is to blame. Or switching does not solve
> > the problem, in which case the user may even not be able to read the
> > page, and therefore won't fill in the form. Or the page only contains
> > US-ASCII characters to begin with, and the user doesn't have any reason
> > to switch encodings.
> >
> > That probably leaves us with just one intermediate case: The page is
> > mostly in US-ASCII, but with a few other characters (e.g. 'smart
> > quotes', ...).
> > The user sees some problem, tries to fix it by switching the encoding.
> > That doesn't help, so the user gives up, and just fills in the form
> > (which is readable enough to complete the task).
> >
> > If you know about any other scenarios where switching encoding and then
> > filling in the form with a wrong encoding can happen realistically,
> > please tell me.
> >
> >
> > > >I help provide support to the users of an open-source web server, and we
> > > >frequently get requests for help from people managing web services who,
> > > >having read the appropriate RFCs and W3C specs in detail, had not
> > > >appreciated that user agents can change the encoding in ways which the
> > > >request-receiving server cannot detect.
> >
> > I gave tutorials about Web internationalization for years, and
> > the issue of encoding in forms always came up, but from the time when
> > the first browsers supporting UTF-8 came out, that was always given as
> > an answer, and I haven't heard anybody question this before you. But
> > of course your mileage may vary.
> >
> > But there is an additional point: A server isn't helpless against users
> > changing the encoding. UTF-8 has the very helpful property of having
> > very specific byte sequences. It is easy to check these with a
> > regular expression; for an example, please see
> > http://www.w3.org/International/questions/qa-forms-utf-8.html.
> >
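The byte-sequence check described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the code from the W3C page: the pattern lists the well-formed UTF-8 byte ranges, and the names UTF8_RE and looks_like_utf8 are made up for this example. Pure US-ASCII input always passes, and a legacy-encoded value can occasionally happen to be valid UTF-8, so a match is strong evidence rather than proof.

```python
import re
from urllib.parse import unquote_to_bytes

# Well-formed UTF-8 byte sequences (hypothetical name, for illustration only).
UTF8_RE = re.compile(rb"""
    (?:
        [\x00-\x7F]                        # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences, no overlongs
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte sequences
      | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*
""", re.VERBOSE)


def looks_like_utf8(percent_encoded: str) -> bool:
    """Return True if the %HH-decoded bytes form well-formed UTF-8."""
    raw = unquote_to_bytes(percent_encoded)
    return UTF8_RE.fullmatch(raw) is not None


print(looks_like_utf8("caf%C3%A9"))  # True: e-acute sent as UTF-8
print(looks_like_utf8("caf%E9"))     # False: e-acute sent as iso-8859-1
```

A server that receives a value failing this check can assume that the form was not submitted in UTF-8 and fall back to a legacy encoding or reject the request.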
> >
> > > >I suppose I'm just keen to make sure that wherever this topic appears, the
> > > >potential behavior of the vast majority of browsers in the world is
> > > >adequately and completely described.
> > >
> > > >If there were an RFC somewhere which said that the user agent 'MUST NOT'
> > > >change the encoding, and real-world browsers were ignoring this
> > > >stricture, I would agree that other RFCs were right to describe what
> > > >should be, rather than what is.
> > >
> > > >But as far as I know, the ability for users to override the encoding does
> > > >not contravene any existing RFC, and therefore other RFCs ought at least
> > > >to recognize that possibility, and not imply, by omission, a level of
> > > >certainty which can never be assured.
> > >
> > > >I think I would have a very poor view of any web site which told me it was
> > > >my fault a request got garbled because I made use of a freely-available
> > > >control on my browser.
> > >
> > > >Let me try to conclude this by just asking that, so long as user control
> > > >over the encoding is permitted by RFCs, that possibility is explicitly
> > > >recognized by other RFCs, and that we don't try to pretend that it does
> > > >not exist or, even worse, that failures and errors in decoding are the
> > > >user's fault for breaking an unwritten, untestable non-rule.
> >
> > I'm still not sure to what extent this is really happening. But I have
> > clarified this issue by expanding the sentence in question as follows:
> >
> > "Likewise, when setting up a new Web form using UTF-8 as the encoding
> > of the form page, the returned query URIs will use UTF-8 as an encoding
> > (unless the user for whatever reason changes the character encoding)
> > and will therefore be compatible with IRIs."
> >
> > This leaves it to the reader to judge for him/herself how high
> > the probability is that the user is switching code pages.
> >
> > Regards,    Martin.
> >
> >
> > >Chris
> > >
> > >
> > >----- Original Message -----
> > >From: "Martin Duerst" <duerst@w3.org>
> > >To: "Chris Haynes" <chris@harvington.org.uk>; <www-international@w3.org>
> > >Cc: "Michel Suignard" <michelsu@microsoft.com>
> > >Sent: Thursday, May 06, 2004 8:04 AM
> > >Subject: Re: what should the charset be in the response to the server
> > >
> > >
> > > > Hello Chris,
> > > >
> > > > > In trying to clear up the remaining IRI issues, I found out that
> > > > > I planned to reply to this message of yours, but didn't get around
> > > > > to doing it.
> > > >
> > > > At 17:20 03/08/07 +0100, Chris Haynes wrote:
> > > >
> > > > >  "Martin Duerst" Replied:
> > > > >
> > > > >
> > > > > > At 12:15 03/07/26 +0100, Chris Haynes wrote:
> > > > > >
> > > > > > >  "Jungshik Shin" replied at: Saturday, July 26, 2003 11:31 AM
> > > > > >
> > > > > > > > >   It also depends on whether or not you set 'send URLs always in
> > > > > > > > > UTF-8' in Tools|Options(?) in MS IE.
> > > > > > > >
> > > > > > >
> > > > > > >True, but I'm trying to find a 'reliable' mechanism which is not
> > > > > > >dependent on user-accessible controls.
> > > > > > > >IMHO, this is also a 'dangerous' option, in that it goes against the
> > > > > > > >de facto conventions and anticipates (perhaps incorrectly) the
> > > > > > > >recommendations of the proposed IRI RFC. It can only safely be used
> > > > > > > >with a 'consenting' server site.
> > > > > >
> > > > > > Sorry, no. The main dangerous thing is that authors use non-ASCII
> > > > > > characters in URIs (without any %HH escaping) when this is clearly
> > > > > > forbidden.
> > > > > >
> > > > > > Regards,  Martin.
> > > > >
> > > > >
> > > > >Martin,
> > > > >
> > > > >Are you saying that you approve of relying on users to select the
> > > > >(Microsoft-specific)  'send URLs always in
> > > > >UTF-8'  menu option  to ensure that UTF8 gets returned to the server?
> > > > >
> > > > >That is what was being suggested.
> > > >
> > > > Well, my above statement was meant in the following sense:
> > > > There is NO spec that would allow inclusion of non-ASCII
> > > > characters in URIs. The IRI spec is the first one that
> > > > > defines something similar to a URI that actually allows this.
> > > > > Any authors that for example put raw iso-8859-1 characters
> > > > > into a URI in a page in iso-8859-1 are therefore wrong;
> > > > any 'it works' effect is coincidental, not according to specs.
> > > > Suggesting that a browser that anticipates a future spec
> > > > (the IRI spec) is dangerous, while (implicitly) blessing
> > > > browsers and pages that don't conform to any spec is in
> > > > my eyes a dangerous idea.
> > > >
> > > >
> > > > >My argument was that any current HTTP-like system in which the
> > > > >character encoding could be modified by menu controls in the user
> > > > >agent, (and in which the actual encoding used is *not* conveyed in the
> > > > >request) was inherently unreliable.
> > > >
> > > > > I think we have to look at different parts of an HTTP request separately.
> > > > There are mainly two parts: the 'path' part and the 'query' part.
> > > >
> > > > With respect to the path part, this is indeed influenced by the
> > > > 'send URLs always in UTF-8' option in MS IE. But there are ways
> > > > to get around this. For an example, see my Apache 'mod_fileiri'
> > > > > module, which allows requests both in a legacy encoding and
> > > > > in UTF-8 to be mapped back to the file in question.
> > > > > [see http://www.w3.org/2003/06/mod_fileiri/Overview.html for an overview,
> > > > > including pointers to the actual code and to a talk of mine].
> > > >
> > > > With respect to the query part, this is not affected by the
> > > > 'send URLs always in UTF-8' option in MS IE. The query part
> > > > is always sent in the encoding of the actual page, except
> > > > for some browsers that implement the 'accept-charset' attribute
> > > > on <form>. But for queries, it is rather easy to e.g. convert
> > > > all the forms related to that query URI to UTF-8.
> > > >
> > > > You are right that the (perceived) character encoding of the
> > > > page can affect both parts. Of course, users might always
> > > > change the character encoding, and as a result send something
> > > > that the server gets as garbage. However, users don't use
> > > > > menus just for fun, and if anybody ever came and complained,
> > > > the server side would be very justified to say "don't mess
> > > > around with the settings if you expect your queries to work".
> > > > So this is very much a theoretical concern.
> > > >
> > > >
> > > > Regards,    Martin.
> > > >
> > > >
> >
> >
Received on Saturday, 8 May 2004 20:44:56 GMT
