Re: [iri] #128: use of the term 'origin' from Martin J. Dürst on 2012-07-18 (public-iri@w3.org from July 2012)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Wed, 18 Jul 2012 20:08:57 +0900
To: Dave Thaler <dthaler@microsoft.com>
CC: Larry Masinter <masinter@adobe.com>, "stpeter@stpeter.im" <stpeter@stpeter.im>, "public-iri@w3.org" <public-iri@w3.org>, Chris Lilley <chris@w3.org>
Message-ID: <50069949.9080308@it.aoyama.ac.jp>

Hello Dave,

Sorry to be very late with my answer.

On 2012/07/11 9:05, Dave Thaler wrote:
> Personally I dislike the change to allow using the document charset and prefer the 3987
> behavior.

I also very much dislike this, and I very much wish we could fix it!
If you happen to know a way to convince the IE team at Microsoft to fix 
this, please tell us.

Some background on how browsers got to where they are with query 
encoding is in a P.S. to this mail.


> On the question of "other than HTML", URIs and/or IRIs can appear in many contexts...
> In normal text in an email message, or in a PDF file or Word doc or whatever else.

Yes indeed.

> Allowing it to vary complicates frameworks considerably since now the doc charset
> has to be passed from whatever extracts the URI from the document (HTML or otherwise)
> and whatever else needs to know the interpretation (normalizer code, comparison code,
> whatever).   Various API frameworks already have various sorts of "Uri" classes that
> take in a URI-like string and let you do things like get the URI form or the IRI form,
> or various components or whatever.   Of course those would have to change for
> any bis, but this also means the constructor needs to change since you cannot
> correctly interpret an IRI(bis) without knowing the document charset.

This is indeed a very important point. Libraries and tooling are too 
often overlooked.

I think the current draft also doesn't say anything about cases where 
"document charset" information is not available (e.g. when you type a 
query part into a browser's address bar, or when a query part appears on 
a napkin). We should make sure it says that UTF-8 is used in such cases.
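
To make that fallback concrete, here is a minimal Python sketch of how a 
library might pick the charset for the query part. The function names 
(effective_query_charset, encode_query) are made up for illustration and 
are not from the draft, and treating every "utf-*" codec as UCS-based is 
my simplifying assumption.

```python
import codecs
from urllib.parse import quote


def effective_query_charset(document_charset=None):
    """Pick the charset used to percent-encode an http(s) query part."""
    if document_charset is None:
        # No document context (address bar, napkin): fall back to UTF-8.
        return "utf-8"
    # Normalize aliases such as "UTF8" or "latin1" to the codec's own name.
    name = codecs.lookup(document_charset).name
    # Assumption: any "utf-*" codec counts as UCS-based and maps via UTF-8.
    return "utf-8" if name.startswith("utf-") else name


def encode_query(query, document_charset=None):
    """Percent-encode a query string in the effective charset."""
    charset = effective_query_charset(document_charset)
    return quote(query, safe="=&", encoding=charset)


print(encode_query("résumé"))                # r%C3%A9sum%C3%A9
print(encode_query("résumé", "iso-8859-1"))  # r%E9sum%E9
print(encode_query("résumé", "UTF-16"))      # r%C3%A9sum%C3%A9
```

Note that the caller has to pass the document charset in from outside, 
which is exactly the extra plumbing Dave is worried about.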


> I'm not yet convinced that's a change worth making.

Do you see a chance to convince the IE team to fix this?
We'd then also have to convince the Mozilla and WebKit folks.

If we can't convince them, then our only hope is that UTF-8 content is 
increasing steadily on the Web (IEEE Spectrum showed a graph provided by 
Mark Davis that had UTF-8 (without pure ASCII) at 60%). I think we 
should be careful to make sure that we write the spec so that it doesn't 
make things overly complicated in a world where essentially all Web 
pages are UTF-8.


Regards,    Martin.

P.S.: And here is the story of why query parts are treated the (odd!) 
way they are in browsers.

In the mid '90s, Web pages in all kinds of encodings started to show 
up. CGI scripts took in data from forms, and there was a serious 
problem: In what encoding should the form data be sent back to the 
server? It was the most frequent question asked on mailing lists related 
to I18N and the Web, and at the Unicode conference.

RFC 2070 (HTML I18N, now historic) introduced the accept-charset 
attribute (see http://tools.ietf.org/html/rfc2070#section-5.1), but that 
was not implemented by browsers. A convention started to emerge, which 
was that the character encoding of the document containing the form 
would be used.

This was taken over by HTML4 (see 
http://www.w3.org/TR/html4/interact/forms.html#adef-accept-charset (*)), 
although the accept-charset attribute was moved from individual fields 
to the form element itself. For a long time the accept-charset attribute 
was not implemented, but it finally was when Mozilla got completely 
rewritten, mostly according to spec. From there it spread to other 
browsers, to the extent that it ended up in HTML5 (see 
http://www.w3.org/TR/html5/the-form-element.html#attr-form-accept-charset).

So for forms, we are all set: you can have a page in Shift_JIS with a 
form that uses UTF-8 for application/x-www-form-urlencoded, which means 
that you can display the query part as an IRI, or you could have the 
reverse, which means that you have to use %-encoding for the Shift_JIS 
bytes.
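
As a rough illustration of the two cases, the following Python sketch 
serializes the same form field as application/x-www-form-urlencoded once 
in UTF-8 and once in Shift_JIS. The field name "q" and its value are 
arbitrary examples, and the encoding argument only stands in for what 
accept-charset would select in a browser.

```python
from urllib.parse import urlencode, unquote_to_bytes

# Hypothetical form with a single text field.
fields = {"q": "日本語"}

# Form submitted as UTF-8 (e.g. accept-charset="UTF-8" on a Shift_JIS page).
utf8_body = urlencode(fields, encoding="utf-8")
# Form submitted in Shift_JIS (the reverse case).
sjis_body = urlencode(fields, encoding="shift_jis")

print(utf8_body)  # q=%E6%97%A5%E6%9C%AC%E8%AA%9E
print(sjis_body)

# The UTF-8 bytes decode directly, so the query can be shown as an IRI.
utf8_bytes = unquote_to_bytes(utf8_body.split("=", 1)[1])
print(utf8_bytes.decode("utf-8"))  # 日本語

# The Shift_JIS bytes are not valid UTF-8, so they must stay %-encoded.
sjis_bytes = unquote_to_bytes(sjis_body.split("=", 1)[1])
try:
    sjis_bytes.decode("utf-8")
    shows_as_iri = True
except UnicodeDecodeError:
    shows_as_iri = False
print(shows_as_iri)  # False
```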

The problem with all this is that browser makers thought that a query 
part in a complete IRI (e.g. in the href attribute of an <a> element or 
the src attribute of an <img> element) is just like a form, and so 
should use the document charset. RFC 3987 nowhere mentions that the 
query part should be treated differently from the rest of the IRI, but 
in hindsight, it might have been a good idea to put a big reminder into 
RFC 3987, saying "all this also applies to the query part". And of 
course there's no accept-charset attribute for <a> or <img>.
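
The difference between the two treatments can be sketched in Python 
roughly as follows. This is only an approximation under stated 
assumptions: iri_to_uri is a hypothetical helper, example.org is a 
placeholder host, and the host and fragment are assumed to be ASCII 
already (no IDNA handling).

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched, in addition to the unreserved ones that
# quote() always keeps. "%" is kept so existing escapes survive.
PATH_SAFE = "/%:@!$&'()*+,;="
QUERY_SAFE = PATH_SAFE + "?"


def iri_to_uri(iri, query_charset="utf-8"):
    """Map an http(s) IRI to a URI (hypothetical helper).

    The path always goes through UTF-8. The query goes through
    query_charset: UTF-8 for plain RFC 3987, or the document charset
    for what browsers do with href and src attributes.
    """
    parts = urlsplit(iri)
    path = quote(parts.path, safe=PATH_SAFE, encoding="utf-8")
    query = quote(parts.query, safe=QUERY_SAFE, encoding=query_charset)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


iri = "http://example.org/café?résumé"

# RFC 3987: the whole IRI, query included, is mapped via UTF-8.
print(iri_to_uri(iri))
# http://example.org/caf%C3%A9?r%C3%A9sum%C3%A9

# Browser behaviour for a link in an iso-8859-1 document.
print(iri_to_uri(iri, query_charset="iso-8859-1"))
# http://example.org/caf%C3%A9?r%E9sum%E9
```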

(*) There was a small tweak, in that in RFC 2070, the accept-charset 
attribute was on each (textual) form element, but for HTML4, we moved it 
to the form element itself.




> -Dave
>
>> -----Original Message-----
>> From: "Martin J. Dürst" [mailto:duerst@it.aoyama.ac.jp]
>> Sent: Tuesday, July 10, 2012 4:27 AM
>> To: Larry Masinter
>> Cc: stpeter@stpeter.im; public-iri@w3.org; Chris Lilley
>> Subject: Re: [iri] #128: use of the term 'origin'
>>
>> On 2012/06/17 0:28, Larry Masinter wrote:
>>> does this apply to any format other than HTML? I'm not sure that this
>>> applies to anything else... Within image/svg+xml, for example? The notion of
>>> document charset doesn't apply to some formats.
>>
>> Hello Larry,
>>
>> Very good idea to test this. I tested the various browsers that I have, looking
>> at the actual requests in Wireshark, everything on Windows 7.
>> The test consisted of the attached SVG file in iso-8859-1 with a link to an
>> existing domain but a non-existing page with a query part with non-ASCII
>> characters.
>>
>> Here are the results:
>>
>> Opera 12:
>> GET /non-existent?r%C3%A9sum%C3%A9 HTTP/1.1\r\n
>> This means the query part is sent as percent-encoded UTF-8.
>>
>> Safari (5.1.7):
>> GET /non-existent?r%E9sum%E9 HTTP/1.1\r\n
>> This means that the query part is sent as percent-encoded iso-8859-1.
>>
>> IE9:
>> GET /non-existent?r\351sum\351 HTTP/1.1\r\n
>> This means that the query part is sent as RAW iso-8859-1.
>>
>> Firefox 13.0.1:
>> GET /non-existent?r%E9sum%E9 HTTP/1.1\r\n
>> This means that the query part is sent as percent-encoded iso-8859-1.
>>
>> Chrome 20:
>> GET /non-existent?r%E9sum%E9 HTTP/1.1\r\n
>> This means that the query part is sent as percent-encoded iso-8859-1.
>>
>> With the exception of Opera, SVG seems to follow HTML. But there are SVG
>> user agents that are not browsers. If somebody has one of these, please run
>> this test and tell us what you got.
>>
>> Also, there are formats other than HTML and SVG.
>>
>> Regards,   Martin.
>>
>>
>>> Connected by DROID on Verizon Wireless
>>>
>>>
>>> -----Original message-----
>>> From: iri issue tracker<trac+iri@grenache.tools.ietf.org>
>>> To: "draft-ietf-iri-3987bis@tools.ietf.org"<draft-ietf-iri-3987bis@tools.ietf.org>,
>>> "stpeter@stpeter.im"<stpeter@stpeter.im>
>>> Cc: "public-iri@w3.org"<public-iri@w3.org>
>>> Sent: Mon, Jun 11, 2012 19:38:45 GMT+00:00
>>> Subject: Re: [iri] #128: use of the term 'origin'
>>>
>>> #128: use of the term 'origin'
>>>
>>> #choose ticket.new
>>>     #when True
>>>    While reviewing 3987bis for i18n terminology, I came across this
>>>    paragraph (Section 3.5):
>>>
>>>       For compatibility with existing deployed HTTP infrastructure, the
>>>       following special case applies for schemes "http" and "https" and
>>>       IRIs whose origin has a document charset other than one which is UCS-
>>>       based (e.g., UTF-8 or UTF-16).  In such a case, the "query" component
>>>       of an IRI is mapped into a URI by using the document charset rather
>>>       than UTF-8 as the binary representation before pct-encoding.  This
>>>       mapping is not applied for any other scheme or component.
>>>
>>>    The term 'origin' could be ambiguous here. It doesn't seem to be
>>>    referencing the Web Origin Concept (RFC 6454) but instead seems to be
>>>    based on the "document" (broadly construed) in which the http or https
>>>    URL is found (e.g., as a hyperlink in an HTML document or perhaps as
>>>    running text in an email message). It would be good to make that clear.
>>>     #end
>>>     #otherwise
>>>       #if changes_body
>>> Changes (by stpeter@…):
>>>
>>>
>>>       #end
>>>       #if changes_descr
>>>         #if not changes_body and not change.comment and change.author
>>> Description changed by stpeter@…:
>>>         #end
>>>
>>> --
>>>       #end
>>>       #if change.comment
>>>
>>> Comment(by stpeter@…):
>>>
>>>    One way to remove the ambiguity would be to change "origin" here to
>>>    something else, but even then I think we'd need additional text. I
>>>    tentatively propose the following:
>>>
>>>       For compatibility with existing deployed HTTP infrastructure, the
>>>       following special case applies for the schemes "http" and "https"
>>>       when an IRI is found in a document whose charset is not based on UCS
>>>       (e.g., not UTF-8 or UTF-16).  In such a case, the "query" component
>>>       of an IRI is mapped into a URI by using the document charset rather
>>>       than UTF-8 as the binary representation before pct-encoding.  This
>>>       mapping is not applied for any other scheme or component.
>>>       #end
>>>     #end
>>> #end
>>>
>>> --
>>> -----------------------+---------------------------------------
>>>    Reporter:  stpeter@…  |       Owner:  draft-ietf-iri-3987bis@…
>>>        Type:  defect     |      Status:  new
>>>    Priority:  minor      |   Milestone:
>>> Component:  3987bis    |     Version:
>>>    Severity:  -          |  Resolution:
>>>    Keywords:             |
>>> -----------------------+---------------------------------------
>>>
>>> Ticket
>>> URL:<http://trac.tools.ietf.org/wg/iri/trac/ticket/128#comment:1>
>>> iri<http://tools.ietf.org/wg/iri/>
>>>
Received on Wednesday, 18 July 2012 11:09:38 UTC