Re: Migration of HTTP to the use of IRIs [altdesign-17]

Michael,
Thanks for your patience.

So I think you are saying that the flaw in my logic lies in my assertion that
there is no syntactic indication of the use of an IRI. Your position, in
effect, is that the syntactic indication is only present when needed, and is
implicit in the use of UTF-8 encoding.

Your assertion below is that the vast majority of URIs which were not encoded
in UTF-8 will contain one or more octet sub-sequences which are not legal
representations of characters in UTF-8. I should use the presence of such
sequences to conclude that the URI was not encoded in UTF-8, and therefore that
conversion to an IRI is not applicable.
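
To make this concrete with a hypothetical sketch (modern Java; the class and
method names are my own invention): a Latin-1 e-acute escaped as %E9 yields the
lone octet 0xE9, which cannot stand on its own in legal UTF-8, whereas the
UTF-8 escaping %C3%A9 yields a legal two-octet sequence.

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class LegalUtf8Demo {

        // True if the octets form a complete, legal UTF-8 sequence.
        static boolean isLegalUtf8(byte[] octets) {
            try {
                StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(octets));
                return true;
            } catch (CharacterCodingException e) {
                return false;
            }
        }

        public static void main(String[] args) {
            // Latin-1 e-acute, escaped as %E9: a lone 0xE9 octet
            System.out.println(isLegalUtf8(new byte[] { (byte) 0xE9 }));  // false
            // UTF-8 e-acute, escaped as %C3%A9: the octets 0xC3 0xA9
            System.out.println(isLegalUtf8(
                new byte[] { (byte) 0xC3, (byte) 0xA9 }));                // true
        }
    }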

My first thought was to try applying the processing you define in (draft 7)
sect. 3.2, to see if that would provide a 'failure indication' that I could use.

But I came to your Step 3:

"Re-escape any octet produced in step 2 that is not part of a strictly legal
UTF-8 sequence".


This step re-absorbs octet sequences which are illegal in UTF-8 into the IRI
world (for example, a Latin-1 'e-acute' escaped as %E9 simply survives as the
still-escaped %E9 in the resulting IRI), so applying section 3.2 in its
entirety _cannot_ be used as the basis of a decision on whether or not UTF-8
encoding was used in the original escaping.

Section 3.2 can only be applied if it is desired to _force_ everything that is
received into an IRI.

Your draft 7 does not provide the basis for deciding whether or not the URI
should be treated in this way, i.e. it gives no opportunity for concluding
that the presented URI was encoded using some other (legacy) encoding.

You may recall that my concern is for the design of a web server including
something like a Servlet handler, which has to decode the URI before it can
identify and invoke the referenced servlet (which might know what encoding was
used in URIs identifying that Servlet).

In this 'real world' that I keep worrying about, there will be a long
transition phase during which many inbound URLs will contain escapes generated
using other encodings. Forcing them into IRIs is not appropriate behaviour; the
appropriate decoding must be selected and applied by some other means.

It seems to me that in this situation, where URLs containing encodings other
than UTF-8 are to be handled differently rather than being forced into IRIs by
your section 3.2, a different sequence is required. Something like:

A)  Convert the received URI into an octet sequence as follows: Each %HH triplet
generates an octet whose value is defined by the hex digits HH. All other
(ASCII) characters generate an octet whose value is that of the code point of
that character in the ASCII/UTF-8 code table.

B) Attempt to process the octet sequence generated by A as a UTF-8-encoded
octet sequence. If the octet sequence is 'legal', i.e. it is the correct
encoding of a sequence of integer values (but not necessarily representing
valid Unicode code points), then the URI does represent an IRI and the
processing of (draft 7 sect. 3.2) should be applied to extract the IRI.

C) If step B finds one or more octet sequences which do not form part of any
'legal' UTF-8 sequence, then no IRI is involved and the interpretation of the
presented URI is to be decided by other means.

Note that the application of the procedure A-C above will mean that your step 3
will never be applied.
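
Purely as illustration, here is a minimal sketch of steps A-C (modern Java; the
names are mine and hypothetical, and a real implementation would also have to
reject malformed %HH triplets, which this sketch assumes to be well-formed):

    import java.io.ByteArrayOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class UriEncodingCheck {

        // Step A: convert the received URI into an octet sequence.
        static byte[] percentDecode(String uri) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int i = 0; i < uri.length(); i++) {
                char c = uri.charAt(i);
                if (c == '%' && i + 2 < uri.length()) {
                    // %HH triplet -> one octet whose value is HH
                    out.write(Integer.parseInt(uri.substring(i + 1, i + 3), 16));
                    i += 2;
                } else {
                    // other (ASCII) character -> octet with its code point value
                    out.write(c);
                }
            }
            return out.toByteArray();
        }

        // Steps B and C: was the original escaping done in UTF-8?
        static boolean wasUtf8Encoded(String uri) {
            try {
                StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(percentDecode(uri)));
                return true;  // step B: legal UTF-8, so sect. 3.2 applies
            } catch (CharacterCodingException e) {
                return false; // step C: another encoding; decide by other means
            }
        }
    }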


So I think we have two possible scenarios:

Scenario 1)  The world is to be viewed as containing only IRIs.  _All_ received
URIs are converted into IRIs consisting of a sequence of 'appropriate' (your
step 4) Unicode characters. Any non-UTF-8 escapes remain as still-escaped
sequences in the IRI; no attempt has been made to interpret them as characters
in some other encoding.

Scenario 2)  In a world in which URIs intended to represent IRIs co-exist with
URIs encoded using other character encodings, and where the difference has to
be detected so that the appropriate decoding can be applied, my steps A-C must
first be undertaken. If steps A-C indicate that another encoding was used, then
the URI is to be handled in some other way, and no IRI is involved. If no
evidence of a different encoding is found, then it is to be assumed that
conversion to an IRI is valid and your steps 1-5 should be applied (but step 3
will never be invoked).
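
In code terms, the choice between the two scenarios might look something like
the following hypothetical Java sketch (Scenario, uriToIri, wasUtf8Encoded and
handleByOtherMeans are stand-in names of mine, not anything defined in your
draft):

    class IriDispatch {
        enum Scenario { FORCE_IRI, DETECT_LEGACY }

        // Stand-ins for the real processing; names are hypothetical.
        static String uriToIri(String uri) { return uri; }           // sect. 3.2 steps 1-5
        static boolean wasUtf8Encoded(String uri) { return true; }   // my steps A-C
        static String handleByOtherMeans(String uri) { return uri; } // legacy decoding

        static String handle(String uri, Scenario scenario) {
            if (scenario == Scenario.FORCE_IRI) {
                return uriToIri(uri);           // Scenario 1: everything becomes an IRI
            } else if (wasUtf8Encoded(uri)) {
                return uriToIri(uri);           // Scenario 2: UTF-8 found; step 3 vacuous
            } else {
                return handleByOtherMeans(uri); // Scenario 2: legacy encoding; no IRI
            }
        }
    }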

My tentative conclusion is this:

The IRI draft 7 does not provide any support or advice for those needing to
recognize and process (intelligently and efficiently) URIs containing encodings
other than UTF-8.

Where this needs to be done, something akin to my steps A-C is necessary before
it can be decided that URI-to-IRI conversion should be applied.


My concerns would be assuaged if there were a Section or Appendix in the IRI
Internet-Draft:

- Recognizing these transitional / co-existence needs,
- Detailing the necessary and sufficient URI inspection required to decide
whether or not to invoke IRI processing,
- Containing cautions about the remote possibility of incorrect decisions
being made.

I'd be prepared to help draft it.

Footnote 1:
In a 'real' implementation, the two processing sequences 1-5 and A-C could be
undertaken in a single pass through the URI using a merged algorithm,
parameterised to define how it should proceed if a non-UTF-8 octet sequence is
detected (i.e. parameterised to adopt Scenario 1 or 2). The performance penalty
of my proposed addition would be insignificant.
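
For what it is worth, such a merged single pass might look like the following
hypothetical Java sketch. It de-escapes and validates in one scan; to be
strictly complete it would also have to reject overlong UTF-8 forms and
malformed %HH triplets:

    class SinglePassScan {
        // De-escape each %HH on the fly and feed every octet through a small
        // UTF-8 state machine; 'pending' counts the continuation octets still
        // owed by the last lead octet. Under Scenario 1 the 'return false'
        // branches would instead re-escape the offending octets (your step 3)
        // and carry on.
        static boolean singlePassIsUtf8(String uri) {
            int pending = 0;
            for (int i = 0; i < uri.length(); i++) {
                int octet = uri.charAt(i);
                if (octet == '%' && i + 2 < uri.length()) {
                    octet = Integer.parseInt(uri.substring(i + 1, i + 3), 16);
                    i += 2;
                }
                if (pending > 0) {
                    if ((octet & 0xC0) != 0x80) return false; // continuation expected
                    pending--;
                } else if (octet < 0x80) {
                    // plain ASCII octet: nothing to check
                } else if ((octet & 0xE0) == 0xC0) { pending = 1; // 2-octet lead
                } else if ((octet & 0xF0) == 0xE0) { pending = 2; // 3-octet lead
                } else if ((octet & 0xF8) == 0xF0) { pending = 3; // 4-octet lead
                } else { return false; } // stray continuation or illegal octet
            }
            return pending == 0; // a sequence cut short at the end also fails
        }
    }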

Footnote 2:
Your approach of assuming that an IRI interpretation is valid in all situations
in which UTF-8 has been used ought also to be validated. People are already
using UTF-8 encoding with no knowledge of IRIs. I've not explored what impact
the application of the step 4 and 5 processing of your draft (i.e. beyond
de-escaping and decoding the UTF-8 characters) could have, and whether or not
it could cause any problems for pre-IRI users of UTF-8. I don't intend to
pursue this line of enquiry ;-)


Chris


----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
To: "Chris Haynes" <chris@harvington.org.uk>; "Michel Suignard"
<michelsu@windows.microsoft.com>
Cc: <public-iri@w3.org>
Sent: Sunday, May 09, 2004 1:37 AM
Subject: Re: Migration of HTTP to the use of IRIs [altdesign-17]


> Hello Chris,
>
> I have changed the issue for this mail to altdesign-17, because it
> seems more appropriate.
>
> At 11:07 04/05/07 +0100, Chris Haynes wrote:
>
> >Michel,
> >
> >Thanks for this comment, but I think my point is still valid - even just for
> >presentational uses.
> >
> >Given that many URI encodings exist 'in the wild' which use %HH escaping of
> >non-UTF-8 sequences, I fail to see how one can know that it is valid to
> >convert
> >any such URI into an IRI (as per sect. 3.2) - even if just for presentational
> >purposes.
>
> Section 3.2 very clearly says that there is a risk that you convert
> to something that didn't exist previously.
> But in practice, this is not that much of an issue, because it is
> very rare to find reasonable text encoded in legacy encodings that
> matches UTF-8 byte patterns. Please try to find some examples yourself,
> and you will see this.
>
>
> >My concern is the same:  unless there is some kind of syntactic indicator
> >within
> >the URI as a whole, how can one reliably know that UTF-8 has been used and
> >that
> >it is intended to have a corresponding IRI?
>
> You are correct that one cannot do this with 100% certainty.
> But then, if you study the URI spec very carefully, you will
> find that it also doesn't guarantee that an 'a' in an URI
> actually corresponds to an 'a' in the original data (e.g.
> file name). For details, please see the "Laguna Beach"
> example in Section 2.5 of draft-fielding-uri-rfc2396bis-05.txt,
> for example at
> http://gbiv.com/protocols/uri/rev-2002/draft-fielding-uri-rfc2396bis-05.txt.
>
> So in those rare cases where an URI with an octet sequence
> that by chance corresponds to an UTF-8 pattern, but that was
> never intended as UTF-8, is converted to an IRI, one will just
> get a weird name, but reusing that name again e.g. in a browser
> that accepts IRIs will lead back to the original resource.
>
>
>
> >It seems to me that IRI will only be deployed accurately and effectively
> >if one
> >of the following situations occurs:
> >
> >1) New protocol schemes (e.g. httpi, httpis ) are introduced which make it
> >explicit that this spec. applies to the URI,
>
> Introducing a new URI scheme is *extremely* expensive. I have heard
> Tim Berners-Lee say this over and over again, and I know he knows it.
> And in the case at hand, it's highly unnecessary. The cost of an
> occasional accidental 'wrong' conversion back to an IRI (as discussed
> above) is much, much smaller than the cost of introducing new schemes.
>
> And what would the real benefit of new schemes be? Would they be
> useful to distinguish URIs from true IRIs (I'm writing 'true' IRIs
> here to exclude URIs which are by definition also IRIs)? Not really,
> it's much cheaper to identify IRIs by checking for non-ASCII characters.
>
> So they would only be used to distinguish URIs without known origin
> from URIs originating from conversion from IRIs. But assume I had
> an IRI like like http://www.example.org/ros&#xE9; (rose'). In order
> to pass it to others whom I know can only process URIs, not IRIs,
> would I want to convert it to http://www.example.org/ros%C3%A9,
> or to httpi://www.example.org/ros%C3%A9 ? The former strictly
> speaking looses the information that this was an IRI, so converting
> it back to rose' is a guess (but because of the UTF-8 patters,
> actually a rather safe one). But it actually will go to the
> right page, on hunderds of millions of Web browsers, without
> exception. The later can safely be converted back to the IRI
> (by all the software that knows how to do this, which currently
> numbers exactly 0). But it will work only on the browsers
> that know the httpi: scheme (again, currently numbering
> exactly 0). For me the alternative is very clear,
> http://www.example.org/ros%C3%A9 works in much more cases,
> and is therefore much better.
>
>
> >2) They are used within a closed environment in which it is a convention that
> >only IRIs and IRI-derived URIs are in use (no legacy-encoding escapes, or they
> >are allowed to be mis-interpreted)
>
> The current draft clearly allows legacy-encoded escapes, for backwards
> compatibility. I'm not sure what you mean by 'mis-interpreted', but
> if you mean that they are converted to IRIs, then yes, the current
> draft allows this in those cases where it is possible (i.e. the
> byte pattern matches UTF-8,...). But this misinterpretation does
> not lead to an actual misinterpretation of the resource that the
> IRI identifies.
>
>
> >3) A new market-dominating user agent is launched which behaves as if (2)
> >above
> >were the case - i.e. there is an attempt to establish IRIs as the de facto
> >default through market force, ignoring or discarding resulting errors of
> >presentation or of resource identification.
> >
> >My big fear is that without rapid progress on (1), IRIs on the open Internet
> >will only ever take off if someone does (3) - which will be without benefit
> >of adequate standards backing.
>
> I'm not sure I understand you. Several browsers, for example
> Opera and Safari, already implement IRIs. MS IE also does it
> if the relevant flag is set correctly. And the standard is
> close to done; this is the last real issue I'm trying to close.
> So I don't see the problem.
>
>
> >I'd love to either:
> >
> >a) be shown that my logic is faulty
>
> I guess yes. Not in theory, where absolute correctness is the
> only goal, but in practice, where big numbers and deployment
> are important.
>
> >or
> >
> >b) be pleasantly surprised by being told that there _is_  RFC work taking
> >place
> >on new schemes covering at least the space of http(s)
>
> Some schemes may benefit from an update, in particular those that
> haven't thought about internationalization. The first example that
> would come to my mind is the mailto: scheme.
>
>
> Regards,    Martin.
>
>
>
> >otherwise, I fail to understand how IRIs will 'take off' in the 'real
> >world' - where they are so badly needed.
> >
> >Chris
> >
> >
> >
> >
> >----- Original Message -----
> >From: "Michel Suignard" <michelsu@windows.microsoft.com>
> >To: "Chris Haynes" <chris@harvington.org.uk>
> >Cc: <public-iri@w3.org>; "Martin Duerst" <duerst@w3.org>
> >Sent: Friday, May 07, 2004 1:43 AM
> >Subject: RE: Migration of HTTP to the use of IRIs [queryclarify-16]
> >
> >
> >
> > > From:  Chris Haynes
> > > Sent: Thursday, May 06, 2004 4:50 AM
> > >
> > > Actually, my original core concern has now been covered in your
> > > section 1.2.a - Applicability, where you make it clear that "the intent
> > > is not to introduce IRIs into contexts that are not defined to accept
> > > them".
> > >
> > > This now makes it clear that new schemes will be required to replace
> > > http:, https: etc. These will need to be self-identifying in some way,
> > > so that receiving equipment will know that an IRI is being presented.
> > >
> > > So, as I commented last June, I await with interest the recognition
> > > among those responsible for the HTTP scheme that new schemes with new
> > > names are required before IRIs can be used.
> >
> >I'd like to comment on that. The IRI spec is fairly explicit that IRIs
> >can be used as presentation elements for URI protocol elements (ref
> >clause 3 intro). This recognizes that applications out there have not
> >waited for us to create presentation layers that use non-ASCII native
> >characters for schemes that supposedly should not use them (such as
> >http). The presentation layer principle is there to support that. So
> >I expect IRIs to be used for both purposes:
> >- presentation layer for existing URI schemes
> >- core layer for new schemes defined exclusively using IRIs for protocol
> >element syntax.
> >
> >For a while I'd expect the vast majority of IRI usage to be in the first
> >category.
> >
> >Michel
> >
> >
>
>
>

Received on Sunday, 9 May 2004 06:07:24 UTC