Re: A new RFC for Web Addresses/Hypertext References: Background wrt LEIRIs from Martin J. Dürst on 2009-05-01 (www-tag@w3.org from May 2009)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Fri, 01 May 2009 18:18:49 +0900
To: Dan Connolly <connolly@w3.org>
CC: "Henry S. Thompson" <ht@inf.ed.ac.uk>, www-tag@w3.org, public-iri@w3.org, Apps Discuss <discuss@apps.ietf.org>, Lisa Dusseault <lisa.dusseault@messagingarchitects.com>, Alexey Melnikov <alexey.melnikov@isode.com>
Message-ID: <49FABE79.9090307@it.aoyama.ac.jp>

Hello Dan,

[same added cross-postings as for previous mail.]

On 2009/05/01 1:33, Dan Connolly wrote:
> On Tue, 2009-04-28 at 16:31 +0100, Henry S. Thompson wrote:
> [...]
>>   [WEBADDR] A preliminary draft of a possible RFC for Web Addresses
>>             (extracted from HTML5 [1]):
>>     http://www.w3.org/html/wg/href/draft.html [not yet in RFC format,
>>                                                converted version expected
>>                                                RSN]
> [...]
>> I am sure that the above summaries can be improved.  In particular it
>> would be helpful have clear statements from their respective
>> authors/owners as to what the _requirements_ for the three new
>> documents ([IRI-BIS], [LEIRI] and [WEBADDR]) are.  Only after we have
>> those would it make sense to turn to the question of whether we can
>> merge some or all of them.
>
> Good question. I had hoped to document this in the draft by
> now.
>
> One of the main requirements the design in [WEBADDR] takes
> on is the non-western search engine problem.
>
> My understanding is that MS IE implemented a pre-IRI,
> pre-unicode convention that form submission data should
> go in the encoding that the form page was encoded in,
> and the servers bought into this.

I'm not sure whether this was IE or Netscape. It was definitely very early in 
the game. It may already have been in some of the very early localized 
browsers. Or we may simply see it as a consequence of the fact that 
early browsers didn't do a conversion to Unicode, but kept data in the 
incoming encoding internally.

For servers, it was also easy to adapt. On the assumption that most 
forms are sent from pages downloaded from the same server, it's easy to 
guess that a server working e.g. with EUC-JP would send its documents 
out in EUC-JP and would prefer getting its form data back in EUC-JP.
So everything worked out relatively easily.
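
As a rough illustration of that round trip (only a sketch: Python's 
urllib stands in for both the browser and the server, and the sample 
string, the constant and the two helper names are my own assumptions, 
not anything taken from an actual implementation):

```python
from urllib.parse import quote, unquote

PAGE_ENCODING = "euc-jp"  # assumed: the encoding the server uses for its pages


def browser_submit(value: str, page_encoding: str) -> str:
    """Percent-encode a form value using the encoding of the containing page."""
    return quote(value, safe="", encoding=page_encoding)


def server_receive(escaped: str, server_encoding: str) -> str:
    """Decode the value, assuming it arrives in the server's own encoding."""
    return unquote(escaped, encoding=server_encoding, errors="strict")


sent = browser_submit("青山", PAGE_ENCODING)
received = server_receive(sent, PAGE_ENCODING)
print(sent, received)
```

As long as both sides agree on the encoding, the server gets back 
exactly the text the user typed, without either side ever having to 
declare the encoding.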

HTML4 actually sanctions this behavior, as follows (see 
http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset):
 >>>>
The default value for this attribute is the reserved string "UNKNOWN". 
User agents may interpret this value as the character encoding that was 
used to transmit the document containing this FORM element.
 >>>>

[btw, I have no idea what HTML5 does (or doesn't) do with 
accept-charset. For a long time, I thought it was dead, but then a few 
years ago, somebody told me that Mozilla had implemented it as part of 
their drive to become more standards-compliant.]


> Mozilla tried to follow the IRI spec, but
> "This breaks most sub-category links on a big greek on-line pc & gadget
> shop
> (http://www.e-shop.gr) rendering the site unusable."
> https://bugzilla.mozilla.org/show_bug.cgi?id=284474

Unfortunately, the test page mentioned in that bug report is no longer 
available. It seems the domain expired.

The issue of what to do with form submission for IRIs is actually one of 
the currently open issues for the IRI spec. Please see 
http://www.w3.org/International/iri-edit/#IRI-and-forms-107.

At http://lists.w3.org/Archives/Public/public-iri/2007Jul/0000.html,
I write "Where exactly is the boundary between these two
behaviours?".

What I mean is the following. Query parts exist:
a) in "freestanding" IRIs, e.g. on a napkin, or when typed into
    the address field of a browser.
b) in plain links, e.g. a@href, or img@src,...
c) when created as part of a form submission

It seems clear to me that for a), the only choice we have is to use 
UTF-8. Things such as "the page currently being displayed below" or "the 
encoding corresponding to the language version of the browser used" just 
don't make sense.

It also seems clear to me that for c), using the encoding of the 
enclosing page is as reasonable a choice as it always was. This can be 
handled completely outside any IRI-related spec, and as far as I can 
see, seems to have been done already
(see http://www.w3.org/TR/html5/forms.html#url-encoded-form-data,
step 6.2.1; somebody needs to check the other submission methods, or 
confirm that they are irrelevant in this context.)
[Oh, and accept-charset on forms is alive in HTML5, which means that 
there is always a way to indicate that you want the query part in UTF-8
(see http://www.w3.org/TR/html5/forms.html#attr-form-accept-charset).
I think it would be good to point out in a guideline to authors that a 
single value, preferably UTF-8, is best.]

So, b) is really the hard case. I definitely would have preferred if 
this went with UTF-8. The test page mentioned at
https://bugzilla.mozilla.org/show_bug.cgi?id=284474 talks about pressing 
the submit button, not about pressing a link, so this may have been c) 
rather than b), but there is no way to check anymore. But it would 
probably make sense to recheck to make sure that it also applies to b) 
(or provide a pointer to a test that's still available).
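
To make the three cases concrete, here is a minimal sketch of how the 
choice of encoding could be written down. Everything in it is 
hypothetical: the Context names, the query_encoding function and the 
links_follow_page flag are mine, accept-charset is assumed to carry a 
single encoding name, and the default for case b) only reflects my own 
preference, not a settled answer.

```python
from enum import Enum
from typing import Optional


class Context(Enum):
    FREESTANDING = "a"     # written on a napkin, typed into the address field
    PLAIN_LINK = "b"       # a@href, img@src, ...
    FORM_SUBMISSION = "c"  # query part created by submitting a form


def query_encoding(
    context: Context,
    page_encoding: Optional[str] = None,
    accept_charset: Optional[str] = None,
    links_follow_page: bool = False,
) -> str:
    """Return the encoding to use when percent-encoding a query part."""
    if context is Context.FREESTANDING:
        # No containing page exists, so UTF-8 is the only sensible choice.
        return "utf-8"
    if context is Context.FORM_SUBMISSION:
        # An explicit accept-charset wins; otherwise use the page encoding.
        # The final UTF-8 fallback is an assumption of this sketch.
        return accept_charset or page_encoding or "utf-8"
    # Case b) is the open question; the flag picks between the two behaviours.
    if links_follow_page and page_encoding:
        return page_encoding
    return "utf-8"


print(query_encoding(Context.FORM_SUBMISSION, page_encoding="euc-jp"))
```

The flag for case b) is there precisely because this is the case where 
the two behaviours disagree; a) and c) need no such switch.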

> I mostly made my peace with this in a June 2008
> discussion:
>
> Re: expected results for URI encoding tests?
> http://lists.w3.org/Archives/Public/public-html/2008Jun/0369.html
>
>
> p.s. I wish I could tell the story of the non-western
> search engine problem/requirement more straightforwardly,
> but I seem to have some sort of writer's block.

Well, the problem is that when a server gets data (in a path component, 
and even more in a query part), it needs to know what encoding the data 
is in to make sense of it and provide a reasonable answer. The reason why you are 
calling this the "search engine" problem may be that this problem is 
most prominent with search engines, because everybody agrees that people 
want to search in their language, not limited to US-ASCII.
But it applies to any kind of query parts.
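
A small sketch of what goes wrong when the server guesses (assumptions: 
the sample string is mine, EUC-JP is just one example of a legacy page 
encoding, and Python's urllib stands in for the server):

```python
from urllib.parse import quote, unquote

original = "青山"
# What a link or form on an EUC-JP page might put on the wire.
escaped = quote(original, safe="", encoding="euc-jp")

# The same percent-escaped bytes, decoded under two different guesses.
right_guess = unquote(escaped, encoding="euc-jp", errors="replace")
wrong_guess = unquote(escaped, encoding="utf-8", errors="replace")

print(right_guess == original, wrong_guess == original)
```

Nothing in the percent-escaped bytes themselves says which guess is 
correct; that information has to come from somewhere else.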

Some search engines have their own way of passing encoding information. 
For example, in 
http://www.google.com/search?q=%E9%9D%92%E5%B1%B1&ie=utf-8&oe=utf-8, 
"ie" indicates the input encoding (i.e. what's sent to the server), and "oe" 
indicates the output encoding (what should be sent back).
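
For illustration, a server-side script could use such a parameter 
roughly as follows. This is only a sketch using Python's urllib.parse; 
the two-pass approach and the UTF-8 fallback are my assumptions and say 
nothing about how Google actually processes these parameters.

```python
from urllib.parse import parse_qs, urlsplit

url = "http://www.google.com/search?q=%E9%9D%92%E5%B1%B1&ie=utf-8&oe=utf-8"
query = urlsplit(url).query

# First pass: only read "ie", whose value is plain ASCII. Other values may
# come out garbled here, which does not matter because they are discarded.
declared = parse_qs(query, encoding="ascii", errors="replace")
input_encoding = declared.get("ie", ["utf-8"])[0]  # fallback is an assumption

# Second pass: decode all parameters using the declared input encoding.
params = parse_qs(query, encoding=input_encoding, errors="strict")
print(params["q"][0])
```

The point is that the encoding travels with the request itself, so the 
server does not have to guess it from the page that contained the form.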

Currently, I really don't understand why we more or less managed to move 
to "path part is UTF-8" from a time when path parts were also encoded 
in the encoding of the page that contained the link, whereas we haven't 
been able to make this transition for query parts. Is the reason that 
path parts are used much more often independently of any electronic 
substrate that would allow one to know the encoding (e.g. on the sides 
of buses)? Is the 
reason that "the encoding is the encoding of the containing page" was 
considered a de facto standard worth following for query parts, 
but just an implementation accident for path parts? Was it that RFC 3987 
(and the drafts leading up to it) were not clear enough on how to handle 
query parts, and how this related to form submission?
Is it that CGI scripts and the like (which usually process query parts) 
are much more difficult to change than the servers themselves (which 
usually process path parts)? Was it that IE made the decision at a time 
when it could have gone either way, but that now there is much more 
i18n content, and it's harder to change?

Regards,    Martin.



> Somebody else care to give it a try? TimBL? Ian?
> Henri?
>

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp
Received on Friday, 1 May 2009 09:19:48 UTC