Re: Migration of HTTP to the use of IRIs [queryclarify-16] from Chris Haynes on 2004-05-06 (public-iri@w3.org from May 2004)

From: Chris Haynes <chris@harvington.org.uk>
Date: Thu, 6 May 2004 12:49:45 +0100
To: <public-iri@w3.org>, "Martin Duerst" <duerst@w3.org>
Message-ID: <013c01c43360$3bb82f60$0200000a@ringo>
Martin,

Thanks for this response.

Actually, my original core concern has now been covered in your section 1.2.a -
Applicability, where you make it clear that "the intent is not to introduce IRIs
into contexts that are not defined to accept them".

This now makes it clear that new schemes will be required to replace http:,
https: etc. These will need to be self-identifying in some way, so that
receiving equipment will know that an IRI is being presented.

So, as I commented last June, I await with interest the recognition among those
responsible for the HTTP scheme that new schemes with new names are required
before IRIs can be used.


Returning to the logged issue...

Your new paragraph in 7.8 is helpful, but not, I fear, strictly accurate.

The phrase "returned query URIs will use UTF-8 as an encoding" is accurate only
if the browser's user has not manually changed the page encoding via the menu
commands available to her (e.g. with MSIE the "View - Encoding" menu sequence).
It can easily be demonstrated that this user selection of the encoding overrides
the encoding declared in the HTML text or associated HTTP header when requests
are formulated.

'will use' is therefore too strong.
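
To make this concrete, here is a minimal Python sketch (not part of the IRI
draft) of how the same form value produces different %hh triplets depending on
the encoding the browser applies when it formulates the request. The field name
`q` and the sample value are made up for illustration, and ISO-8859-1 merely
stands in for whatever encoding a user might select.

```python
from urllib.parse import urlencode

def build_query(fields, encoding):
    """Serialize form fields as a GET query string using the given encoding.

    `encoding` models the encoding the browser actually applies, which may
    be the page's declared encoding or one the user selected manually.
    """
    return urlencode(fields, encoding=encoding)

# Hypothetical form field and value, for illustration only.
utf8_query = build_query({"q": "café"}, "utf-8")
latin1_query = build_query({"q": "café"}, "iso-8859-1")

print(utf8_query)    # q=caf%C3%A9
print(latin1_query)  # q=caf%E9
```

Nothing in either query string tells the server which encoding was applied.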

Changing the phrase to "returned query URIs will, by default, use UTF-8 as an
encoding" is an accurate statement - it just leaves open the question of what
'by default' means.
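
One way a server might cope with "by default" is sketched below in Python. This
is only an assumption about how a receiver could behave, not something the draft
prescribes: it percent-decodes the value, tries a strict UTF-8 decode, and falls
back to a second encoding if that fails. The function name and the ISO-8859-1
fallback are hypothetical choices.

```python
from urllib.parse import unquote_to_bytes

def decode_query_value(raw, fallback="iso-8859-1"):
    """Decode one percent-encoded query value.

    Returns (text, encoding_used). UTF-8 is tried first and strictly;
    `fallback` is an assumed legacy encoding, not something HTTP signals.
    """
    # In form-encoded query strings '+' stands for a space.
    data = unquote_to_bytes(raw.replace("+", " "))
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback), fallback

print(decode_query_value("caf%C3%A9"))  # ('café', 'utf-8')
print(decode_query_value("caf%E9"))     # ('café', 'iso-8859-1')
```

This is a heuristic: pure ASCII values are reported as UTF-8, and a legacy byte
sequence that happens to be valid UTF-8 would be misread, although that is rare
in practice.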

Chris


----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
To: "Chris Haynes" <chris@harvington.org.uk>; <public-iri@w3.org>
Sent: Thursday, May 06, 2004 8:55 AM
Subject: Re: Migration of HTTP to the use of IRIs [queryclarify-16]


> Hello Chris,
>
> I have added a new paragraph to section 7.8 of the IRI spec, reading
> as follows:
>
>  >>>>>>>>
> It is often possible to reduce the effort and dependencies for upgrading
> to IRIs by using UTF-8 rather than another encoding where there is a free
> choice of encodings. For example, when setting up a new file-based Web
> server, using UTF-8 as the encoding of the file names will make the
> transition to IRIs easier. Likewise, when setting up a new Web form
> using UTF-8 as the encoding of the form page, the returned query
> URIs will use UTF-8 as an encoding and will therefore be compatible
> with IRIs.
>  >>>>>>>>
>
> This is to cover the issue queryclarify-16, and it also gives
> some more information with regard to UTF-8. When this section
> was originally written (which was one of the very early drafts,
> if I remember correctly), browsers and other clients that
> understood UTF-8 were not very widely available, so this advice
> was not appropriate, but this has clearly changed in the meantime.
>
> I have marked this issue as tentatively closed.
>
> Regards,    Martin.
>
>
> At 01:19 03/06/28 -0400, Martin Duerst wrote:
>
> >Hello Chris,
> >
> >Many thanks for your comments on the IRI spec.
> >
> >I have noted your issue as:
> >http://www.w3.org/International/iri-edit#queryclarify-16
> >
> >More explanations below.
> >
> >At 23:21 03/06/26 +0100, Chris Haynes wrote:
> >
> >>Dear Martin / Michel,
> >>
> >>I'm looking at draft-duerst-iri-04 from the viewpoint of a provider of
> >>web server technology. I'm trying to understand the likely migration
> >>path to the use of IRIs, and I'm concerned that there's a gap I don't
> >>see being filled.
> >>
> >>It may well be that filling the gap is outside the scope of your
> >>Internet Draft, but unless the gap is filled, I fear there may be a
> >>_long_ delay before IRIs are adopted where they are most needed.
> >>
> >>Your section 7.8, Upgrading Strategy, contains some useful thoughts /
> >>advice, which I have summarised to myself as "Don't put in an
> >>IRI-aware server until all the resources on the site(s) you serve are
> >>published in IRI".
> >>
> >>However that's not the problem that concerns me.
> >>
> >>I'm concerned about the encoding of HTTP GET query strings, typically
> >>carrying text inserted by a user into a browser's form.
> >
> >You are right that section 7.8 does not address query strings,
> >and that it doesn't say so clearly, and that there is otherwise
> >not too much about how query strings are supposed to work.
> >I have noted this specific aspect of your mail as an issue,
> >and will try to update the draft accordingly.
> >
> >
> >>Assume below that "I" am the developer of a web server. (I'm not, but
> >>I advise someone who is).
> >>
> >>I want to support IRIs as soon as possible. I know that 'out there'
> >>are many different makes and releases of browsers; I have no control
> >>over them.
> >>
> >>As is well known, there is no mechanism in RFCs 2396 / 2616 for
> >>indicating the encoding associated with any %hh octet-triplets in
> >>URIs.
> >
> >Agreed.
> >
> >
> >>Unless I've missed something, your draft implies that user agents
> >>(browsers) may perform IRI to URI conversion, so that 'my' server sees
> >>an RFC 2396-conformant URI.
> >
> >Well, they actually have to do this conversion, because HTTP
> >does not allow anything other than a URI in the request.
> >
> >
> >>How do I know it was originally an IRI and that I should apply the
> >>reverse conversions of your section 3.2 before extracting the query
> >>name-value pairs?
> >
> >You don't. Equally well, you don't know whether the name/value
> >pairs were in iso-8859-1 (Latin-1), or shift-jis, or whatever.
> >HTTP does not help you there at all.
> >
> >
> >>The problem is not 'academic': the vast majority of browser requests
> >>received today which have %hh triplets use encodings other than
> >>UTF-8, and these will continue to arrive for the next 20-or-more
> >>years.
> >
> >Well, for query parts, you actually have quite some control over
> >the encoding in which you get the query part back. For a few years now,
> >browsers have sent back the query part in the encoding that they received
> >the page in. This works quite well. So if you want to have any
> >idea of what you get back from a browser, you have to know how
> >you send out your pages. And if you use UTF-8 for your pages,
> >then you get three main benefits when compared to other encodings:
> >- UTF-8 can handle the widest range of characters
> >- UTF-8 will bring your GET request in line with IRIs
> >- UTF-8 can be checked with very high reliability
> >
> >For more information, please also see the Q&A page that we put up
> >recently:
> >http://www.w3.org/International/questions/qa-forms-utf-8.html
> >
> >
> >>You may well answer that the way IRIs are to be applied is to be
> >>scheme-dependent; the problem/opportunity is for the HTTP RFC2616++
> >>community to address.
> >
> >Well, part of it could be addressed scheme-by-scheme. For example,
> >a new scheme could require that only UTF-8 be used in the query
> >part. It can also be addressed by other technologies, for example,
> >XForms, which requires the use of UTF-8 in the query part of GET
> >requests (see http://www.w3.org/TR/xforms/slice11.html#serialize-urlencode).
> >
> >
> >>I would feel *far* more comfortable if I knew that they were aware of
> >>this and if there were draft proposals visible on this list and being
> >>checked for feasibility and for 'compatibility' with your drafts.
> >>I've seen no evidence for this, and you don't appear to
> >>cross-reference any related HTTP activity in your draft.
> >>
> >>
> >>It is not beyond the bounds of possibility, for example, that the HTTP
> >>community might conclude that they cannot provide IRI support unless
> >>your RFC-to-be includes some kind of marker or syntactic construction
> >>within "URIs which were converted from IRIs" which explicitly
> >>identifies them as such.
> >>
> >>In other words it might be found that all IRIs MUST be mapped into
> >>some character sequence which IS NOT a 'legal' URI (by the current
> >>RFC2396), so that the receiver knows that the reverse process of your
> >>section 3.2 MUST be applied.
> >
> >This would mean choosing a different escape convention.
> >We considered this years ago, but decided against it.
> >Using something that is illegal in a URI would not have
> >worked, and would still not work, with the current infrastructure.
> >
> >
> >>There are other approaches the HTTP community could take, which
> >>_would_ be compatible with your current draft, (and I have my own
> >>candidate solution), but surely there should at least be some kind of
> >>'existence proof' or 'feasibility study' by which they agree that they
> >>_can_ work with your proposals before they are finalized?
> >
> >I hope what I have explained above is enough of an 'existence proof'.
> >
> >Please tell me if you don't think so.
> >
> >Regards,    Martin.
> >
> >
> >>Without some kind of 'roadmap' for HTTP use of IRIs I don't see how
> >>anyone can pass final judgement on your draft.
> >>
> >>Please reassure me by telling me I'm an idiot for not knowing about
> >>XXX or not reading YYY.
> >>
> >>Regards,
> >>
> >>Chris Haynes
>
>
>
Received on Thursday, 6 May 2004 07:50:52 UTC