- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Fri, 01 May 2009 18:18:49 +0900
- To: Dan Connolly <connolly@w3.org>
- CC: "Henry S. Thompson" <ht@inf.ed.ac.uk>, www-tag@w3.org, public-iri@w3.org, Apps Discuss <discuss@apps.ietf.org>, Lisa Dusseault <lisa.dusseault@messagingarchitects.com>, Alexey Melnikov <alexey.melnikov@isode.com>
Hello Dan,

[same added cross-postings as for previous mail.]

On 2009/05/01 1:33, Dan Connolly wrote:
> On Tue, 2009-04-28 at 16:31 +0100, Henry S. Thompson wrote:
> [...]
>> [WEBADDR] A preliminary draft of a possible RFC for Web Addresses
>> (extracted from HTML5 [1]): http://www.w3.org/html/wg/href/draft.html
>> [not yet in RFC format, converted version expected RSN]
> [...]
>> I am sure that the above summaries can be improved. In particular it
>> would be helpful to have clear statements from their respective
>> authors/owners as to what the _requirements_ for the three new
>> documents ([IRI-BIS], [LEIRI] and [WEBADDR]) are. Only after we have
>> those would it make sense to turn to the question of whether we can
>> merge some or all of them.
>
> Good question. I had hoped to document this in the draft by now.
>
> One of the main requirements the design in [WEBADDR] takes on is the
> non-western search engine problem.
>
> My understanding is that MS IE implemented a pre-IRI, pre-Unicode
> convention that form submission data should go in the encoding that
> the form page was encoded in, and the servers bought into this.

I'm not sure whether this was IE or Netscape. It was definitely very
early in the game. It may already have been present in some of the very
early localized browsers. Or we may simply see it as a consequence of
the fact that early browsers didn't convert to Unicode, but kept data
in the incoming encoding internally.

For servers, it was also easy to adapt. On the assumption that most
forms are submitted from pages downloaded from the same server, it's
easy to guess that a server working e.g. with EUC-JP would send its
documents out in EUC-JP and would prefer getting its form data back in
EUC-JP. So everything worked out relatively easily.

HTML4 actually sanctions this behavior, as follows (see
http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset):

>>>>
The default value for this attribute is the reserved string "UNKNOWN".
User agents may interpret this value as the character encoding that was
used to transmit the document containing this FORM element.
>>>>

[Btw, I have no idea what HTML5 does (or doesn't) do with
accept-charset. For a long time, I thought it was dead, but then a few
years ago, somebody told me that Mozilla had implemented it as part of
their drive to become more standards-compliant.]

> Mozilla tried to follow the IRI spec, but
> "This breaks most sub-category links on a big greek on-line pc& gadget
> shop (http://www.e-shop.gr) rendering the site unusable."
> https://bugzilla.mozilla.org/show_bug.cgi?id=284474

Unfortunately, the test page mentioned in that bug report is no longer
available. It seems the domain expired.

The issue of what to do with form submission for IRIs is actually one
of the currently open issues for the IRI spec. Please see
http://www.w3.org/International/iri-edit/#IRI-and-forms-107.

At http://lists.w3.org/Archives/Public/public-iri/2007Jul/0000.html, I
wrote "Where exactly is the boundary between these two behaviours?".
What I mean is the following. Query parts exist:

a) in "freestanding" IRIs, e.g. on a napkin, or when typed into the
   address field of a browser;
b) in plain links, e.g. a@href, or img@src, ...;
c) when created as part of a form submission.

It seems clear to me that for a), the only choice we have is to use
UTF-8. Things such as "the page currently being displayed below" or
"the encoding corresponding to the language version of the browser
used" just don't make sense.

It also seems clear to me that for c), using the encoding of the
enclosing page is as reasonable a choice as it always was. This can be
handled completely outside any IRI-related spec, and as far as I can
see, seems to have been done already (see
http://www.w3.org/TR/html5/forms.html#url-encoded-form-data, step
6.2.1; somebody needs to check the other submission methods, or
confirm that they are irrelevant in this context.)
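The "encoding of the enclosing page" behaviour for c) is easy to see concretely. Here is a minimal sketch (Python, purely illustrative; the example string and the set of encodings are my choice, not taken from any spec): the same form value yields different percent-encoded query bytes depending on the page's encoding, and only the UTF-8 variant matches what an IRI-based mapping would produce.

```python
from urllib.parse import quote

value = "青山"  # "Aoyama"; the same characters appear in the Google URL later in this mail

# A browser that keeps form data in the page's encoding produces
# different query bytes for the same visible string:
for page_encoding in ("euc-jp", "shift_jis", "utf-8"):
    print(page_encoding, "->", quote(value.encode(page_encoding)))

# Only the UTF-8 variant is what RFC 3987 prescribes for IRIs:
assert quote(value.encode("utf-8")) == "%E9%9D%92%E5%B1%B1"
```

A server that receives such bytes without any charset indication has no reliable way to tell these variants apart, which is exactly the interoperability problem under discussion.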
[Oh, and accept-charset on forms is alive in HTML5, which means that
there is always a way to indicate that you want the query part in
UTF-8 (see
http://www.w3.org/TR/html5/forms.html#attr-form-accept-charset). I
think it would be good to point out in a guideline to authors that a
single value, preferably UTF-8, is best.]

So, b) is really the hard case. I definitely would have preferred it if
this went with UTF-8. The test page mentioned at
https://bugzilla.mozilla.org/show_bug.cgi?id=284474 talks about
pressing the submit button, not about following a link, so this may
have been c) rather than b), but there is no way to check anymore. It
would probably make sense to recheck to make sure that it also applies
to b) (or provide a pointer to a test that's still available).

> I mostly made my peace with this in a June 2008 discussion:
>
> Re: expected results for URI encoding tests?
> http://lists.w3.org/Archives/Public/public-html/2008Jun/0369.html
>
> p.s. I wish I could tell the story of the non-western search engine
> problem/requirement more straightforwardly, but I seem to have some
> sort of writer's block.

Well, the problem is that when a server gets data (in a path component,
and even more so in a query part), it needs to know what encoding the
data is in to make sense of it and provide a reasonable answer. The
reason why you are calling this the "search engine" problem may be that
this problem is most prominent with search engines, because everybody
agrees that people want to search in their language, not limited to
US-ASCII. But it applies to any kind of query part.

Some search engines have their own way of passing encoding information.
As an example, in
http://www.google.com/search?q=%E9%9D%92%E5%B1%B1&ie=utf-8&oe=utf-8,
"ie" indicates the input encoding (i.e. what's sent to the server), and
"oe" indicates the output encoding (what should be sent back).
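To make the server side of this concrete, here is a hedged sketch (Python; the function name, the fallback strategy, and the per-site legacy encoding are my own illustration, not something any spec mandates) of how a receiver might interpret incoming query bytes: honour an explicit "ie"-style declaration when present, otherwise try strict UTF-8 and fall back to a configured legacy encoding.

```python
from urllib.parse import urlsplit, parse_qs, unquote_to_bytes

SITE_LEGACY_ENCODING = "euc-jp"  # hypothetical per-site configuration

def decode_query_value(percent_encoded, declared=None):
    """Decode one percent-encoded query value whose charset may be unsignalled."""
    raw = unquote_to_bytes(percent_encoded)
    if declared:                   # an explicit "ie"-style declaration wins
        return raw.decode(declared)
    try:                           # otherwise guess: strict UTF-8 first...
        return raw.decode("utf-8")
    except UnicodeDecodeError:     # ...then the site's legacy encoding
        return raw.decode(SITE_LEGACY_ENCODING)

url = "http://www.google.com/search?q=%E9%9D%92%E5%B1%B1&ie=utf-8&oe=utf-8"
query = urlsplit(url).query
ie = parse_qs(query).get("ie", [None])[0]  # the "ie" value itself is ASCII

print(decode_query_value("%E9%9D%92%E5%B1%B1", ie))  # declared UTF-8
print(decode_query_value("%A4%A2"))  # EUC-JP bytes, invalid as UTF-8: fallback
```

The guessing branch is fragile by nature (some legacy byte sequences happen to be valid UTF-8), which is one more reason an explicit declaration, or a uniform move to UTF-8, is preferable.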
Currently, I really don't understand why we more or less managed to
move to "path part is UTF-8" from a time when path parts were also
encoded in the encoding of the page that contained the link, whereas we
haven't been able to make this transition for query parts.

Is the reason that path parts are much more often used independently of
any electronic substrate that would allow one to know the encoding
(e.g. on the sides of buses)?

Is the reason that "the encoding is the encoding of the containing
page" was considered a de facto standard worth following for query
parts, but just an implementation accident for path parts?

Was it that RFC 3987 (and the drafts leading up to it) were not clear
enough on how to handle query parts, and how this related to form
submission?

Is it that CGI scripts and the like (which usually process query parts)
are much more difficult to change than the servers themselves (which
usually process path parts)?

Was it that IE made the decision at a time when it could have gone
either way, but that now there is much more i18n content, and it's
harder to change?

Regards,
Martin.

> Somebody else care to give it a try? TimBL? Ian? Henri?

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Friday, 1 May 2009 09:19:48 UTC