Re: Migration of HTTP to the use of IRIs [queryclarify-16]

Hello Chris,

I have added a new paragraph to section 7.8 of the IRI spec, reading
as follows:

 >>>>>>>>
It is often possible to reduce the effort and dependencies for upgrading
to IRIs by using UTF-8 rather than another encoding where there is a free
choice of encodings. For example, when setting up a new file-based Web
server, using UTF-8 as the encoding of the file names will make the
transition to IRIs easier. Likewise, when setting up a new Web form
using UTF-8 as the encoding of the form page, the returned query
URIs will use UTF-8 as an encoding and will therefore be compatible
with IRIs.
 >>>>>>>>
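
As a rough illustration of the paragraph above (a sketch only, not part
of the spec text), the following Python snippet shows how the same form
value ends up in a query URI depending on the encoding of the form page.
The field name "city" and its value are made up for the example, and it
assumes the browser sends the query back in the encoding of the page,
as described in the quoted mail below.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical form field and value, for illustration only.
fields = {"city": "Zürich"}

# Query string as sent when the form page was served as UTF-8
# (assuming the browser reuses the page encoding).
utf8_query = urlencode(fields, encoding="utf-8")

# Query string as sent when the same page was served as ISO-8859-1.
latin1_query = urlencode(fields, encoding="iso-8859-1")

print(utf8_query)    # city=Z%C3%BCrich
print(latin1_query)  # city=Z%FCrich

# Only the UTF-8 form round-trips under the IRI convention, where
# %hh escapes are interpreted as UTF-8 octets.
decoded = parse_qs(utf8_query, encoding="utf-8", errors="strict")
print(decoded)       # {'city': ['Zürich']}
```

The UTF-8 variant is the one that matches what IRI-to-URI conversion
would produce for the same characters.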

This is to cover the issue queryclarify-16, and it also gives
some more information with regard to UTF-8. When this section
was originally written (which was one of the very early drafts,
if I remember correctly), browsers and other clients that
understood UTF-8 were not very widely available, so this advice
was not appropriate, but this has clearly changed in the meantime.

I have marked this issue as tentatively closed.

Regards,    Martin.


At 01:19 03/06/28 -0400, Martin Duerst wrote:

>Hello Chris,
>
>Many thanks for your comments on the IRI spec.
>
>I have noted your issue as:
>http://www.w3.org/International/iri-edit#queryclarify-16
>
>More explanations below.
>
>At 23:21 03/06/26 +0100, Chris Haynes wrote:
>
>>Dear Martin / Michel,
>>
>>I'm looking at draft-duerst-iri-04 from the viewpoint of a provider of
>>web server technology. I'm trying to understand the likely migration
>>path to the use of IRIs, and I'm concerned that there's a gap I don't
>>see being filled.
>>
>>It may well be that filling the gap is outside the scope of your
>>Internet Draft, but unless the gap is filled, I fear there may be a
>>_long_ delay before IRIs are adopted where they are most needed.
>>
>>Your section 7.8, Upgrading Strategy, contains some useful thoughts /
>>advice, which I have summarised to myself as "Don't put in an
>>IRI-aware server until all the resources on the site(s) you serve are
>>published in IRI".
>>
>>However that's not the problem that concerns me.
>>
>>I'm concerned about the encoding of HTTP GET query strings, typically
>>carrying text inserted by a user into a browser's form.
>
>You are right that section 7.8 does not address query strings,
>and that it doesn't say so clearly, and that there is otherwise
>not too much about how query strings are supposed to work.
>I have noted this specific aspect of your mail as an issue,
>and will try to update the draft accordingly.
>
>
>>Assume below that "I" am the developer of a web server. (I'm not, but
>>I advise someone who is).
>>
>>I want to support IRIs as soon as possible. I know that 'out there'
>>are many different makes and releases of browsers; I have no control
>>over them.
>>
>>As is well known, there is no mechanism in RFCs 2396 / 2616 for
>>indicating the encoding associated with any %hh octet-triplets in
>>URIs.
>
>Agreed.
>
>
>>Unless I've missed something, your draft implies that user agents
>>(browsers) may perform IRI to URI conversion, so that 'my' server sees
>>an RFC 2396-conformant URI.
>
>Well, they actually have to do this conversion, because HTTP
>does not allow anything other than a URI in the request.
>
>
>>How do I know it was originally an IRI and that I should apply the
>>reverse conversions of your section 3.2 before extracting the query
>>name-value pairs?
>
>You don't. Equally well, you don't know whether the name/value
>pairs were in iso-8859-1 (Latin-1), or shift-jis, or whatever.
>HTTP does not help you there at all.
>
>
>>The problem is not 'academic': the vast majority of browser requests
>>received today which have %hh triplets use encodings other than
>>UTF-8, and these will continue to arrive for the next 20-or-more
>>years.
>
>Well, for query parts, you actually have quite some control over
>what encoding you get the query part back in. For a few years now,
>browsers have sent back the query part in the encoding that they received
>the page in. This works quite well. So if you want to have any
>idea of what you get back from a browser, you have to know how
>you send out your pages. And if you use UTF-8 for your pages,
>then you get three main benefits when compared to other encodings:
>- UTF-8 can handle the widest range of characters
>- UTF-8 will bring your GET request in line with IRIs
>- UTF-8 can be checked with very high reliability
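
The last point in the list above can be sketched as follows. This is
only a minimal check under stated assumptions, not something taken from
the draft: the function name looks_like_utf8 is hypothetical, and the
idea is simply to undo the %hh escaping and attempt a strict UTF-8
decode of the resulting octets.

```python
from urllib.parse import unquote_to_bytes


def looks_like_utf8(escaped: str) -> bool:
    """Return True if the %hh-escaped value decodes as strict UTF-8.

    Hypothetical helper for illustration only.
    """
    raw = unquote_to_bytes(escaped)  # undo %hh escaping, keep raw octets
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("Z%C3%BCrich"))  # True: C3 BC is a valid UTF-8 sequence
print(looks_like_utf8("Z%FCrich"))     # False: a lone FC octet is not valid UTF-8
```

Pure ASCII input also passes this check, so a positive result only shows
that the octets are consistent with UTF-8, not that the sender
intended it.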
>
>For more information, please also see the Q&A page that we put up
>recently:
>http://www.w3.org/International/questions/qa-forms-utf-8.html
>
>
>>You may well answer that the way IRIs are to be applied is to be
>>scheme-dependent; the problem/opportunity is for the HTTP RFC2616++
>>community to address.
>
>Well, part of it could be addressed scheme-by-scheme. For example,
>a new scheme could require that only UTF-8 be used in the query
>part. It can also be addressed by other technologies, for example,
>XForms, which requires the use of UTF-8 in the query part of GET
>requests (see http://www.w3.org/TR/xforms/slice11.html#serialize-urlencode).
>
>
>>I would feel *far* more comfortable if I knew that they were aware of
>>this and if there were draft proposals visible on this list and being
>>checked for feasibility and for 'compatibility' with your drafts.
>>I've seen no evidence for this, and you don't appear to
>>cross-reference any related HTTP activity in your draft.
>>
>>
>>It is not beyond the bounds of possibility, for example, that the HTTP
>>community might conclude that they cannot provide IRI support unless
>>your RFC-to-be includes some kind of marker or syntactic construction
>>within "URIs which were converted from IRIs" which explicitly
>>identifies them as such.
>>
>>In other words it might be found that all IRIs MUST be mapped into
>>some character sequence which IS NOT a 'legal' URI (by the current
>>RFC2396), so that the receiver knows that the reverse process of your
>>section 3.2 MUST be applied.
>
>This would mean choosing a different escape convention.
>We considered this years ago, but decided against it.
>Using something that is illegal in a URI would not have
>worked, and would still not work, with the current
>infrastructure.
>
>
>>There are other approaches the HTTP community could take, which
>>_would_ be compatible with your current draft, (and I have my own
>>candidate solution), but surely there should at least be some kind of
>>'existence proof' or 'feasibility study' by which they agree that they
>>_can_ work with your proposals before they are finalized?
>
>I hope what I have explained above is enough of an 'existence proof'.
>
>Please tell me if you don't think so.
>
>Regards,    Martin.
>
>
>>Without some kind of 'roadmap' for HTTP use of IRIs I don't see how
>>anyone can pass final judgement on your draft.
>>
>>Please reassure me by telling me I'm an idiot for not knowing about
>>XXX or not reading YYY.
>>
>>Regards,
>>
>>Chris Haynes

Received on Thursday, 6 May 2004 04:47:44 UTC