- From: Chris Haynes <chris@harvington.org.uk>
- Date: Fri, 7 May 2004 10:30:10 +0100
- To: <www-international@w3.org>, "Martin Duerst" <duerst@w3.org>
- Cc: "Michel Suignard" <michelsu@microsoft.com>, <public-iri@w3.org>
Martin,

Many thanks for the response. Your expanded sentence fully addresses this
issue, as far as I am concerned.

Chris

----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
To: "Chris Haynes" <chris@harvington.org.uk>; <www-international@w3.org>
Cc: "Michel Suignard" <michelsu@microsoft.com>; <public-iri@w3.org>
Sent: Friday, May 07, 2004 7:02 AM
Subject: Re: what should the charset be in the response to the server

>
> Hello Chris,
>
> Many thanks for your reply. I have copied the IRI list
> because I think this discussion is relevant for the current
> draft.
>
> At 13:38 04/05/06 +0100, Chris Haynes wrote:
> >Thanks for the response, Martin,
> >
> >I only noticed this response _after_ I had replied to your other
> >response on the IRI list, so I apologize that my earlier response did
> >not take into account this message of yours.
> >
> >Trying to bring this topic to closure, I think my core worry arises each
> >time there are what-appear-to-me-to-be normative statements that 'the
> >page encoding determines the encoding used in requests derived from that
> >page' - ignoring the possibility of users having changed the encoding
> >setting.
> >
> >We obviously both agree that users 'should not' use these controls (just
> >as I disapprove of the use of 'tone controls' and spectral filters in
> >Hi Fi systems for other than 'loudness' compensation), but I get worried
> >every time the possibility of their use is ignored.
> >
> >The situation is not purely 'theoretical'; I've seen reports that it is
> >common practice in some countries for people to switch to their
> >'national' character set every time they appear to have a problem in
> >viewing a page - which could be occasioned by their browser not having
> >UTF-8 support.
>
> Ok, so let's have a look at this case: Either switching to their 'national'
> character encoding solves the problem, in which case the page was badly
> labeled, and the page author is to blame.
> Or switching does not solve
> the problem, in which case the user may not even be able to read the
> page, and therefore won't fill in the form. Or the page only contains
> US-ASCII characters to begin with, and the user doesn't have any reason
> to switch encodings.
>
> That probably leaves us with just one intermediate case: The page is
> mostly in US-ASCII, but with a few other characters (e.g. 'smart quotes',...).
> The user sees some problem, tries to fix it by switching the encoding.
> That doesn't help, so the user gives up, and just fills in the form
> (which is readable enough to complete the task).
>
> If you know about any other scenarios where switching encoding and then
> filling in the form with a wrong encoding can happen realistically,
> please tell me.
>
> >I help provide support to the users of an open-source web server, and we
> >frequently get requests for help from people managing web services who,
> >having read the appropriate RFCs and W3 specs in detail, had not
> >appreciated that user agents can change the encoding in ways which the
> >request-receiving server cannot detect.
>
> I was giving a tutorial about Web internationalization for years, and
> the issue of encoding in forms always came up, but from the time when
> the first browsers supporting UTF-8 came out, that was always given as
> an answer, and I haven't heard anybody question this before you. But
> of course your mileage may vary.
>
> But there is an additional point: A server isn't helpless against users
> changing the encoding. UTF-8 has the very helpful property of having
> very specific byte sequences. It is easy to check these with a
> regular expression; for an example, please see
> http://www.w3.org/International/questions/qa-forms-utf-8.html.
>
> >I suppose I'm just keen to make sure that wherever this topic appears,
> >the potential behavior of the vast majority of browsers in the world is
> >adequately and completely described.
> >
> >If there were an RFC somewhere which said that the user agent 'MUST NOT'
> >change the encoding, and that real-world browsers were ignoring this
> >stricture, I would agree that other RFCs were right to describe what
> >should be, rather than what is.
> >
> >But as far as I know, the ability for users to override the encoding
> >does not contravene any existing RFC, and therefore other RFCs ought at
> >least to recognize that possibility, and not imply, by omission, a level
> >of certainty which can never be assured.
> >
> >I think I would have a very poor view of any web site which told me it
> >was my fault a request got garbled because I made use of a
> >freely-available control on my browser.
> >
> >Let me try to conclude this by just asking that, so long as user control
> >over the encoding is permitted by RFCs, that possibility is explicitly
> >recognized by other RFCs, and that we don't try to pretend that it does
> >not exist or, even worse, that failures and errors in decoding are the
> >user's fault for breaking an unwritten, untestable non-rule.
>
> I'm still not sure to what extent this is really happening. But I have
> clarified this issue by expanding the sentence in question as follows:
>
> "Likewise, when setting up a new Web form using UTF-8 as the encoding
> of the form page, the returned query URIs will use UTF-8 as an encoding
> (unless the user for whatever reason changes the character encoding)
> and will therefore be compatible with IRIs."
>
> This leaves it to the reader to judge for him/herself how high
> the probability is that the user is switching code pages.
>
> Regards, Martin.
> >
> >Chris
> >
> >
> >----- Original Message -----
> >From: "Martin Duerst" <duerst@w3.org>
> >To: "Chris Haynes" <chris@harvington.org.uk>; <www-international@w3.org>
> >Cc: "Michel Suignard" <michelsu@microsoft.com>
> >Sent: Thursday, May 06, 2004 8:04 AM
> >Subject: Re: what should the charset be in the response to the server
> >
> >
> > > Hello Chris,
> > >
> > > In trying to clear up the remaining IRI issues, I found out that
> > > I planned to reply to this message of yours, but didn't get around
> > > to doing it.
> > >
> > > At 17:20 03/08/07 +0100, Chris Haynes wrote:
> > >
> > > > "Martin Duerst" Replied:
> > > >
> > > > > At 12:15 03/07/26 +0100, Chris Haynes wrote:
> > > > >
> > > > > > "Jungshik Shin" replied at: Saturday, July 26, 2003 11:31 AM
> > > > > >
> > > > > > > It also depends on whether or not you set 'send URLs always
> > > > > > > in UTF-8' in Tools|Options(?) in MS IE.
> > > > > >
> > > > > >True, but I'm trying to find a 'reliable' mechanism which is not
> > > > > >dependent on user-accessible controls.
> > > > > >IMHO, this is also a 'dangerous' option, in that it goes against
> > > > > >the de facto conventions and anticipates (perhaps incorrectly)
> > > > > >the recommendations of the proposed IRI RFC. It can only safely
> > > > > >be used with a 'consenting' server site.
> > > > >
> > > > > Sorry, no. The main dangerous thing is that authors use non-ASCII
> > > > > characters in URIs (without any %HH escaping) when this is clearly
> > > > > forbidden.
> > > > >
> > > > > Regards, Martin.
> > > >
> > > >Martin,
> > > >
> > > >Are you saying that you approve of relying on users to select the
> > > >(Microsoft-specific) 'send URLs always in UTF-8' menu option to
> > > >ensure that UTF8 gets returned to the server?
> > > >
> > > >That is what was being suggested.
> > >
> > > Well, my above statement was meant in the following sense:
> > > There is NO spec that would allow inclusion of non-ASCII
> > > characters in URIs. The IRI spec is the first one that
> > > defines something similar to a URI that actually allows this.
> > > Any authors that for example put raw iso-8859-1 characters
> > > into a URI in a page in iso-8859-1 are therefore wrong;
> > > any 'it works' effect is coincidental, not according to specs.
> > > Suggesting that a browser that anticipates a future spec
> > > (the IRI spec) is dangerous, while (implicitly) blessing
> > > browsers and pages that don't conform to any spec is in
> > > my eyes a dangerous idea.
> > >
> > >
> > > >My argument was that any current HTTP-like system in which the
> > > >character encoding could be modified by menu controls in the user
> > > >agent, (and in which the actual encoding used is *not* conveyed in the
> > > >request) was inherently unreliable.
> > >
> > > I think we have to look at different parts of an HTTP request separately.
> > > There are mainly two parts: the 'path' part and the 'query' part.
> > >
> > > With respect to the path part, this is indeed influenced by the
> > > 'send URLs always in UTF-8' option in MS IE. But there are ways
> > > to get around this. For an example, see my Apache 'mod_fileiri'
> > > module, which makes it possible to map requests both in a legacy
> > > encoding and in UTF-8 back to the file in question.
> > > [see http://www.w3.org/2003/06/mod_fileiri/Overview.html for an overview,
> > > including pointers to the actual code and to a talk of mine].
> > >
> > > With respect to the query part, this is not affected by the
> > > 'send URLs always in UTF-8' option in MS IE. The query part
> > > is always sent in the encoding of the actual page, except
> > > for some browsers that implement the 'accept-charset' attribute
> > > on <form>. But for queries, it is rather easy to e.g. convert
> > > all the forms related to that query URI to UTF-8.
> > >
> > > You are right that the (perceived) character encoding of the
> > > page can affect both parts. Of course, users might always
> > > change the character encoding, and as a result send something
> > > that the server gets as garbage. However, users don't use
> > > menus just for fun, and if anybody ever came and complained,
> > > the server side would be very justified to say "don't mess
> > > around with the settings if you expect your queries to work".
> > > So this is very much a theoretical concern.
> > >
> > >
> > > Regards, Martin.
> > >
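
The point made in the thread that a server "isn't helpless" against a
changed encoding can be sketched roughly as follows. This is only a minimal
illustration, not the regular expression from the W3C page cited above: it
relies on Python's strict UTF-8 decoder to reject ill-formed byte sequences
instead. The function name decode_form_value and the iso-8859-1 fallback are
assumptions made for the example; a real site would pick the legacy encoding
its users are likely to switch to.

```python
from typing import Tuple
from urllib.parse import unquote_to_bytes


def decode_form_value(raw: str, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Decode one percent-encoded form value, preferring UTF-8.

    Hypothetical helper: returns (text, encoding_used). The fallback
    encoding is an assumption, not something the request tells us.
    """
    # Form submissions encode spaces as '+', so undo that first,
    # then turn the %HH escapes back into raw bytes.
    data = unquote_to_bytes(raw.replace("+", " "))
    try:
        # Strict decoding fails on any byte sequence that is not
        # well-formed UTF-8, which is the property the check relies on.
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Not UTF-8: assume the user agent sent the legacy encoding.
        return data.decode(fallback), fallback
```

For instance, decode_form_value("caf%C3%A9") is decoded as UTF-8, while
decode_form_value("caf%E9") falls through to the legacy encoding. The check
is a heuristic: a few legacy-encoded byte sequences also happen to be
well-formed UTF-8, so it makes misdetection unlikely rather than impossible.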
Received on Friday, 7 May 2004 05:32:39 UTC