Re: what should the charset be in the response to the server

Martin,

Many thanks for the response.

Your expanded sentence fully addresses this issue, as far as I am concerned.

Chris


----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
To: "Chris Haynes" <chris@harvington.org.uk>; <www-international@w3.org>
Cc: "Michel Suignard" <michelsu@microsoft.com>; <public-iri@w3.org>
Sent: Friday, May 07, 2004 7:02 AM
Subject: Re: what should the charset be in the response to the server


>
> Hello Chris,
>
> Many thanks for your reply. I have copied the IRI list
> because I think this discussion is relevant for the current
> draft.
>
> At 13:38 04/05/06 +0100, Chris Haynes wrote:
> >Thanks for the response, Martin,
> >
> >I only noticed this response _after_ I had replied to your other response on the
> >IRI list, so I apologize that my earlier response did not take into account this
> >message of yours.
> >
> >Trying to bring this topic to closure, I think my core worry arises each time
> >there are what-appear-to-me-to-be normative statements that 'the page encoding
> >determines the encoding used in requests derived from that page' - ignoring the
> >possibility of users having changed the encoding setting.
> >
> >We obviously both agree that users 'should not' use these controls (just as I
> >disapprove of the use of 'tone controls' and spectral filters in Hi Fi systems
> >for other than 'loudness' compensation), but I get worried every time the
> >possibility of their use is ignored.
> >
> >The situation is not purely 'theoretical': I've seen reports that it is common
> >practice in some countries for people to switch to their 'national' character
> >set every time they appear to have a problem in viewing a page - which could be
> >occasioned by their browser not having UTF-8 support.
>
> Ok, so let's have a look at this case: Either switching to their 'national'
> character encoding solves the problem, in which case the page was badly
> labeled, and the page author is to blame. Or switching does not solve
> the problem, in which case the user may not even be able to read the
> page, and therefore won't fill in the form. Or the page only contains
> US-ASCII characters to begin with, and the user doesn't have any reason
> to switch encodings.
>
> That probably leaves us with just one intermediate case: The page is
> mostly in US-ASCII, but with a few other characters (e.g. 'smart quotes',...).
> The user sees some problem, tries to fix it by switching the encoding.
> That doesn't help, so the user gives up, and just fills in the form
> (which is readable enough to complete the task).
>
> If you know about any other scenarios where switching encoding and then
> filling in the form with a wrong encoding can happen realistically,
> please tell me.
>
>
> >I help provide support to the users of an open-source web server, and we
> >frequently get requests for help from people managing web services who, having
> >read the appropriate RFCs and W3 specs in detail, had not appreciated that user
> >agents can change the encoding in ways which the request-receiving server cannot
> >detect.
>
> I was giving a tutorial about Web internationalization for years, and
> the issue of encoding in forms always came up, but from the time when
> the first browsers supporting UTF-8 came out, that was always given as
> an answer, and I haven't heard anybody question this before you. But
> of course your mileage may vary.
>
> But there is an additional point: A server isn't helpless against users
> changing the encoding. UTF-8 has the very helpful property of having
> very specific byte sequences. It is easy to check these with a
> regular expression; for an example, please see
> http://www.w3.org/International/questions/qa-forms-utf-8.html.
>
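A minimal sketch of the server-side check described above, written in Python and relying on the strict UTF-8 decoder rather than the regular expression from the linked page. The function name `is_utf8_form_value` and the sample values are hypothetical and do not come from the thread.

```python
from urllib.parse import unquote_to_bytes


def is_utf8_form_value(raw_value: str) -> bool:
    """Return True if a percent-encoded form value decodes as strict UTF-8."""
    # Undo %HH escaping to recover the bytes the browser actually sent.
    # '+' is left alone here; it is ASCII and does not affect the check.
    raw_bytes = unquote_to_bytes(raw_value)
    try:
        raw_bytes.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_utf8_form_value("caf%C3%A9"))  # 'é' sent as UTF-8
print(is_utf8_form_value("caf%E9"))     # 'é' sent as ISO-8859-1
```

Pure US-ASCII input passes in either case, so the check only says something when non-ASCII characters are present. A short legacy-encoded value could in principle also happen to be valid UTF-8, so this is a strong hint rather than a proof.
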
>
> >I suppose I'm just keen to make sure that wherever this topic appears, the
> >potential behavior of the vast majority of browsers in the world is adequately
> >and completely described.
> >
> >If there were an RFC somewhere which said that the user agent 'MUST NOT' change
> >the encoding, and that real-world browsers were ignoring this stricture, I would
> >agree that other RFCs were right to describe what should be, rather than what
> >is.
> >
> >But as far as I know, the ability for users to override the encoding does not
> >contravene any existing RFC, and therefore other RFCs ought at least to
> >recognize that possibility, and not imply, by omission, a level of certainty
> >which can never be assured.
> >
> >I think I would have a very poor view of any web site which told me it was my
> >fault a request got garbled because I made use of a freely-available control on
> >my browser.
> >
> >Let me try to conclude this by just asking that, so long as user control over
> >the encoding is permitted by RFCs, that possibility is explicitly recognized by
> >other RFCs, and that we don't try to pretend that it does not exist or, even
> >worse, that failures and errors in decoding are the user's fault for breaking an
> >unwritten, untestable non-rule.
>
> I'm still not sure to what extent this is really happening. But I have
> clarified this issue by expanding the sentence in question as follows:
>
> "Likewise, when setting up a new Web form using UTF-8 as the encoding
> of the form page, the returned query URIs will use UTF-8 as an encoding
> (unless the user for whatever reason changes the character encoding)
> and will therefore be compatible with IRIs."
>
> This leaves it to the reader to judge for him/herself how high
> the probability is that the user is switching code pages.
>
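To illustrate the expanded sentence, here is a small sketch of how the same form value is escaped under two encodings. It uses Python's `urllib.parse.urlencode`; the field name `q` and the value are made up for the example, and an actual browser's form submission may differ in detail.

```python
from urllib.parse import urlencode

form_data = {"q": "café"}

# Form page in UTF-8: the query carries percent-escaped UTF-8 bytes.
utf8_query = urlencode(form_data, encoding="utf-8")

# Same form after the user switched the encoding to ISO-8859-1.
latin1_query = urlencode(form_data, encoding="iso-8859-1")

print(utf8_query)
print(latin1_query)
```

The two query strings differ only in the escaped bytes for the non-ASCII character, and nothing in the request itself says which encoding was used.
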
> Regards,    Martin.
>
>
> >Chris
> >
> >
> >----- Original Message -----
> >From: "Martin Duerst" <duerst@w3.org>
> >To: "Chris Haynes" <chris@harvington.org.uk>; <www-international@w3.org>
> >Cc: "Michel Suignard" <michelsu@microsoft.com>
> >Sent: Thursday, May 06, 2004 8:04 AM
> >Subject: Re: what should the charset be in the response to the server
> >
> >
> > > Hello Chris,
> > >
> > > In trying to clear up the remaining IRI issues, I found out that
> > > I planned to reply to this message of yours, but didn't get around
> > > to doing it.
> > >
> > > At 17:20 03/08/07 +0100, Chris Haynes wrote:
> > >
> > > >  "Martin Duerst" Replied:
> > > >
> > > >
> > > > > At 12:15 03/07/26 +0100, Chris Haynes wrote:
> > > > >
> > > > > >  "Jungshik Shin" replied at: Saturday, July 26, 2003 11:31 AM
> > > > >
> > > > > > > >   It also depends on whether or not you set 'send URLs always in
> > > > > > > > UTF-8' in Tools|Options(?) in MS IE.
> > > > > > >
> > > > > >
> > > > > > >True, but I'm trying to find a 'reliable' mechanism which is not
> > > > > > >dependent on user-accessible controls.
> > > > > > >IMHO, this is also a 'dangerous' option, in that it goes against the de
> > > > > > >facto conventions and anticipates (perhaps incorrectly) the
> > > > > > >recommendations of the proposed IRI RFC. It can only safely be used
> > > > > > >with a 'consenting' server site.
> > > > >
> > > > > Sorry, no. The main dangerous thing is that authors use non-ASCII
> > > > > characters in URIs (without any %HH escaping) when this is clearly
> > > > > forbidden.
> > > > >
> > > > > Regards,  Martin.
> > > >
> > > >
> > > >Martin,
> > > >
> > > >Are you saying that you approve of relying on users to select the
> > > >(Microsoft-specific) 'send URLs always in UTF-8' menu option to ensure
> > > >that UTF-8 gets returned to the server?
> > > >
> > > >That is what was being suggested.
> > >
> > > Well, my above statement was meant in the following sense:
> > > There is NO spec that would allow inclusion of non-ASCII
> > > characters in URIs. The IRI spec is the first one that
> > > defines something similar to a URI that actually allows this.
> > > Any authors that for example put raw iso-8859-1 characters
> > > into a URI in a page in iso-8859-1 are therefore wrong;
> > > any 'it works' effect is coincidental, not according to specs.
> > > Suggesting that a browser that anticipates a future spec
> > > (the IRI spec) is dangerous, while (implicitly) blessing
> > > browsers and pages that don't conform to any spec is in
> > > my eyes a dangerous idea.
> > >
> > >
> > > >My argument was that any current HTTP-like system in which the
> > > >character encoding could be modified by menu controls in the user
> > > >agent, (and in which the actual encoding used is *not* conveyed in the
> > > >request) was inherently unreliable.
> > >
> > > I think we have to look at different parts of an HTTP request separately.
> > > There are mainly two parts: the 'path' part and the 'query' part.
> > >
> > > With respect to the path part, this is indeed influenced by the
> > > 'send URLs always in UTF-8' option in MS IE. But there are ways
> > > to get around this. For an example, see my Apache 'mod_fileiri'
> > > module, which allows mapping requests both in a legacy encoding and
> > > in UTF-8 back to the file in question.
> > > [see http://www.w3.org/2003/06/mod_fileiri/Overview.html for an overview,
> > > including pointers to the actual code and to a talk of mine].
> > >
> > > With respect to the query part, this is not affected by the
> > > 'send URLs always in UTF-8' option in MS IE. The query part
> > > is always sent in the encoding of the actual page, except
> > > for some browsers that implement the 'accept-charset' attribute
> > > on <form>. But for queries, it is rather easy to e.g. convert
> > > all the forms related to that query URI to UTF-8.
> > >
> > > You are right that the (perceived) character encoding of the
> > > page can affect both parts. Of course, users might always
> > > change the character encoding, and as a result send something
> > > that the server gets as garbage. However, users don't use
> > > menus just for fun, and if anybody ever came to complain,
> > > the server side would be fully justified in saying "don't mess
> > > around with the settings if you expect your queries to work".
> > > So this is very much a theoretical concern.
> > >
> > >
> > > Regards,    Martin.
> > >
> > >
>
>

Received on Friday, 7 May 2004 05:32:39 UTC