RE: [iri] #128: use of the term 'origin'

Personally I dislike the change to allow using the document charset and prefer the 3987
behavior.

On the question of "other than HTML": URIs and IRIs can appear in many contexts,
such as normal text in an email message, a PDF file, or a Word doc.
Allowing the interpretation to vary complicates frameworks considerably, since the doc
charset now has to be passed from whatever extracts the URI from the document (HTML or
otherwise) to whatever else needs to know the interpretation (normalizer code, comparison
code, and so on).   Various API frameworks already have "Uri" classes that
take in a URI-like string and let you get the URI form, the IRI form,
or individual components.   Of course those would have to change for
any bis, but this also means the constructor needs to change, since you cannot
correctly interpret an IRI(bis) without knowing the document charset.
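
To make the extra parameter concrete, here is a minimal Python sketch of the proposed
query mapping. The function name, the UCS_BASED set, and the choice of "safe" characters
are my own assumptions for illustration, not text from 3987bis or from any existing
"Uri" class.

```python
from urllib.parse import quote

# Assumption: charsets treated as "UCS-based" for the purpose of this sketch.
UCS_BASED = {"utf-8", "utf-16", "utf-32"}


def map_query_to_uri(query: str, scheme: str, document_charset: str = "utf-8") -> str:
    """Hypothetical helper: percent-encode the query component of an IRI.

    For http/https IRIs found in a document whose charset is not UCS-based,
    the document charset is used as the byte representation; in every other
    case UTF-8 is used (the RFC 3987 behavior).
    """
    charset = document_charset.lower()
    if scheme.lower() in ("http", "https") and charset not in UCS_BASED:
        encoding = charset
    else:
        encoding = "utf-8"
    # Leave ASCII query characters and existing percent-escapes untouched.
    # Raises UnicodeEncodeError if a character has no mapping in the charset.
    return quote(query, safe="%=&+/?:@!$'()*,;", encoding=encoding)


print(map_query_to_uri("résumé", "http", "iso-8859-1"))  # r%E9sum%E9
print(map_query_to_uri("résumé", "http", "utf-8"))       # r%C3%A9sum%C3%A9
```

The point is the signature: every caller now has to supply document_charset, and the
same IRI string yields different URIs depending on where it was found.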

I'm not yet convinced that's a change worth making.

-Dave

> -----Original Message-----
> From: "Martin J. Dürst" [mailto:duerst@it.aoyama.ac.jp]
> Sent: Tuesday, July 10, 2012 4:27 AM
> To: Larry Masinter
> Cc: stpeter@stpeter.im; public-iri@w3.org; Chris Lilley
> Subject: Re: [iri] #128: use of the term 'origin'
> 
> On 2012/06/17 0:28, Larry Masinter wrote:
> > does this apply to any format other than HTML? I'm not sure that this
> > applies to anything else... Within image/svg+xml, for example? The notion of
> > document charset doesn't apply to some formats.
> 
> Hello Larry,
> 
> Very good idea to test this. I tested the various browsers that I have, looking
> at the actual requests in Wireshark, everything on Windows 7.
> The test consisted of the attached SVG file in iso-8859-1 with a link to an
> existing domain but a non-existing page with a query part with non-ASCII
> characters.
> 
> Here are the results:
> 
> Opera 12:
> GET /non-existent?r%C3%A9sum%C3%A9 HTTP/1.1\r\n
> This means the query part is sent as percent-encoded UTF-8.
> 
> Safari (5.1.7):
> GET /non-existent?r%E9sum%E9 HTTP/1.1\r\n
> This means that the query part is sent as percent-encoded iso-8859-1.
> 
> IE9:
> GET /non-existent?r\351sum\351 HTTP/1.1\r\n
> This means that the query part is sent as RAW iso-8859-1.
> 
> Firefox 13.0.1:
> GET /non-existent?r%E9sum%E9 HTTP/1.1\r\n
> This means that the query part is sent as percent-encoded iso-8859-1.
> 
> Chrome 20:
> GET /non-existent?r%E9sum%E9 HTTP/1.1\r\n
> This means that the query part is sent as percent-encoded iso-8859-1.
> 
> With the exception of Opera, SVG seems to follow HTML. But there are SVG
> user agents that are not browsers. If somebody has one of these, please run
> this test and tell us what you got.
> 
> Also, there are formats other than HTML and SVG.
> 
> Regards,   Martin.
> 
> 
> > -----Original message-----
> > From: iri issue tracker<trac+iri@grenache.tools.ietf.org>
> > To: "draft-ietf-iri-3987bis@tools.ietf.org"<draft-ietf-iri-3987bis@tools.ietf.org>,
> >   "stpeter@stpeter.im"<stpeter@stpeter.im>
> > Cc: "public-iri@w3.org"<public-iri@w3.org>
> > Sent: Mon, Jun 11, 2012 19:38:45 GMT+00:00
> > Subject: Re: [iri] #128: use of the term 'origin'
> >
> > #128: use of the term 'origin'
> >
> >   While reviewing 3987bis for i18n terminology, I came across this
> >   paragraph (Section 3.5):
> >
> >      For compatibility with existing deployed HTTP infrastructure, the
> >      following special case applies for schemes "http" and "https" and
> >      IRIs whose origin has a document charset other than one which is UCS-
> >      based (e.g., UTF-8 or UTF-16).  In such a case, the "query" component
> >      of an IRI is mapped into a URI by using the document charset rather
> >      than UTF-8 as the binary representation before pct-encoding.  This
> >      mapping is not applied for any other scheme or component.
> >
> >   The term 'origin' could be ambiguous here. It doesn't seem to be
> >   referencing the Web Origin Concept (RFC 6454) but instead seems to be
> >   based on the "document" (broadly construed) in which the http or https
> >   URL is found (e.g., as a hyperlink in an HTML document or perhaps as
> >   running text in an email message). It would be good to make that clear.
> >
> > Comment(by stpeter@…):
> >
> >   One way to remove the ambiguity would be to change "origin" here to
> >   something else, but even then I think we'd need additional text. I
> >   tentatively propose the following:
> >
> >      For compatibility with existing deployed HTTP infrastructure, the
> >      following special case applies for the schemes "http" and "https"
> >      when an IRI is found in a document whose charset is not based on UCS
> >      (e.g., not UTF-8 or UTF-16).  In such a case, the "query" component
> >      of an IRI is mapped into a URI by using the document charset rather
> >      than UTF-8 as the binary representation before pct-encoding.  This
> >      mapping is not applied for any other scheme or component.
> >
> > --
> > -----------------------+---------------------------------------
> >   Reporter:  stpeter@…  |       Owner:  draft-ietf-iri-3987bis@…
> >       Type:  defect     |      Status:  new
> >   Priority:  minor      |   Milestone:
> > Component:  3987bis    |     Version:
> >   Severity:  -          |  Resolution:
> >   Keywords:             |
> > -----------------------+---------------------------------------
> >
> > Ticket URL:<http://trac.tools.ietf.org/wg/iri/trac/ticket/128#comment:1>
> > iri<http://tools.ietf.org/wg/iri/>
> >

Received on Wednesday, 11 July 2012 00:05:43 UTC