Re: Migration of HTTP to the use of IRIs [altdesign-17] from Martin Duerst on 2004-05-10 (public-iri@w3.org from May 2004)

From: Martin Duerst <duerst@w3.org>
Date: Mon, 10 May 2004 12:15:19 +0900
To: "Chris Haynes" <chris@harvington.org.uk>
Cc: <public-iri@w3.org>, "Michel Suignard" <michelsu@windows.microsoft.com>
Message-Id: <4.2.0.58.J.20040509221222.059ff850@localhost>
Hello Chris,

At 11:05 04/05/09 +0100, Chris Haynes wrote:

>Michael,

[Well, Martin that would have been.]


>Thanks for your patience.
>
>So I think you are saying that the flaw in my logic is when I asserted that
>there is no syntactic indication of the use of an IRI. Your assertion, in
>effect, is that the syntactic indication is only present when needed, and is
>implicit in the use of UTF-8 encoding.

Not exactly. There is no syntactic indication needed for HTTP
and URIs/IRIs to work the way they are designed.

HTTP may in the future decide to introduce a convention to let the
client tell the server about character encodings e.g. in query parameters,
but this is idenpendent of the IRI spec. Such a convention may use another
scheme (which I doubt very much), a special named query parameter
(much more likely, some browsers can already do this, and some
sites (e.g. google) use this), or something else.

Such a solution could take UTF-8 as a special case, or it could
treat UTF-8 just as one choice among many. IRIs will definitely
provide a push towards UTF-8, but they cannot force anybody to
use UTF-8.


>Your assertion below is that the vast majority of cases which were not encoded
>in UTF-8 will generate one or more octet sub-sequences which are not legal
>representations of characters in UTF-8.

Yes. Or they will trivially be ASCII-only, in which case
the question of encoding is mostly irrelevant.


>I should use the presence of such
>sequences to conclude that the URI was not encoded in UTF-8 and therefore that
>conversion to an IRI  is not applicable.
>
>My first thought was to try applying the processing you define in (draft 
>7) sect
>3.2, to see if that would provide a 'failure indication' that I could use.
>
>But I came to your Step 3:
>
>"Re-escape any octet produced in step 2 that is not part of a strictly legal
>UTF-8 sequence".
>
>
>This step re-absorbs octet sequences which are illegal in UTF-8 into the IRI
>world, so, applying section 3.2 in its entirety _cannot_ be used as the 
>basis of
>a decision on whether or not UTF-8 encoding was used in the original escaping.

Yes. Section 3.2 isn't something that returns a boolean, it returns
an IRI. And it tries to convert as many escapes as possible into
actual characters. If that's not what you need, don't use Section 3.2.


>Section 3.2 can only be applied if it is desired to _force_ everything that is
>received into an IRI.
>
>Your draft 7  does not provide the basis for deciding whether or not the URI
>should be treated in this way. i.e. it does not give any opportunity for
>concluding that the presented URI was encoded using some other (legacy)
>encoding.

Yes. There may be many reasons why somebody may want to use Section 3.2,
and many other reasons why one wouldn't want to use it, or would not
want to use it exactly as described.


>You may recall that my concern is for the design of a web server including
>something like a Servlet handler, which has to decode the URI before it can
>identify and invoke the referenced servlet (which might know what encoding was
>used in URIs identifying that Servlet).

If there is really a cyclic interdependency (i.e. servlet knows encoding,
but servlet handler has to know encoding to be able to call servlet),
then the only thing you can really do seems to be trial-and-error.

If there is not really a cyclic interdependency (i.e. servlet knows
encoding, and could handle it, but you want to do decoding in the
servlet handler code), then this might just be bad design and
software engineering.

This does not mean that the servlet handler couldn't do certain
things on behalf of the servlet, but you most probably need a
more flexible interaction.


>In this 'real world' that I keep worrying about there will be a long 
>transition
>phase when there will be many inbound URLs which contain escapes generated 
>using
>other encodings.

Yes in general. But this will be very different for different servers.
I know of cases where UTF-8 is already used throughout since years,
and I'm sure there are cases where legacy encodings will still be
used in some years. Thus different servers will have different
needs when it comes to analyzing incomming HTTP URIs, in particular
query parts.


>Forcing them into IRIs is not appropriate behaviour; by some
>other means the appropriate decoding must be selected and applied.

Agreed. But I don't think the draft ever says you have to do that.

In more general terms, I'm not sure why you want to convert an
URI arrived on the server to an IRI. What you want to do is to
take the URI appart and work on resolving it.


>It seems to me that, in this situation, where URLs containing encodings other
>than UTF-8 are to be handled differently, rather than be forced into IRIs by
>your section 3.2,

Section 3.2 in no way forces you to convert URIs to IRIs on the server!


>a different sequence is required. Something like:
>
>A)  Convert the received URI into an octet sequence as follows: Each %HH 
>triplet
>generates an octet whose value is defined by the hex digits HH. All other
>(ASCII) characters generate an octet whose value is that of the code point of
>that character in the ASCII/UTF-8 code table.
>
>B) Attempt to process the octet sequence generated by B as a UTF_8-encoded 
>octet
>sequence. If the octet sequence is 'legal', i.e. it is the correct 
>encoding of a
>sequence of integer values (but not necessarily representing valid Unicode 
>code
>points), then the URI does represent an IRI and the processing of (draft 7
>sect.32.) should be applied to extract the IRI.
>
>C) If, in step B, there should have been found one or more octet sequences 
>which
>did not form part of any 'legal' UTF_8 sequence, then no IRIs are involved and
>the interpretation of the presented URI is to be decided by other means.

In some cases, a procedure like the above may be appropriate for
implementing some servlet logic. But please note that what you are
actually trying to do really has nothing to do with reconstructing
an IRI from an URI; what you are trying to do is to reconstruct the
original characters that should be handled to the servlet.

Looking at the details, I see the following issues:
- The decision should probably not be taken for the whole URI, but
   e.g. on the query part only. There can easily be cases where the
   query part is in UTF-8, but the path part is not, or the other
   way round.
- The procedure isn't really complete if you end with 'is to be
   decided by other means'. Here several situations may arise,
   and they may need different solutions. In general, you need
   to have a way to know the actual encoding. The most general
   way to do this is to include a hidden form element in the
   form, with a text that will be encoded differently in the
   encodings you want to distinguish (e.g. UTF-8, iso-2022-jp,
   euc-jp, and shift_jis for Japanese).

So if you are working with content, forms, and an audience where
you expect query parts in a variety of encodings, your above
procedure won't really cut it.


>Note that the application of the procedure A-C above will mean that your 
>step 3
>will never be applied.
>
>
>So I think we have two possible scenarios:
>
>Scenario 1)  The world is to be viewed as containing only IRIs.  _All_ 
>received
>URIs are converted into IRIs consisting of a sequence of  'appropriate' (your
>step 4) UTF characters.  Any non-UTF-8 escapes are still present as
>still-escaped sequences in the IRI; there has been no attempt to interpret 
>these
>as characters in some other encoding.

It's much better to think about this per server or Web application.
There may be Web applications where only IRIs are expected. If
non-UTF-8 escapes are found, then rather than keep these as still-
escaped sequences, they should produce an appropriate error message
to the user, e.g. "You have submitted data to this application
that could not be processed correctly, because it was not encoded
correctly by your browser. If you have a very old browser, please
upgrade. ..."


>Scenario 2)  In a word in which URIs intended to represent IRIs co-exist with
>URIs encoded using other character encodings, and where the difference has 
>to be
>detected so that the appropriate decoding can be applied, then my steps 
>A-C must
>first be undertaken. If my steps A-C indicate that another encoding was used,
>then the URI is to be handled in some other way, and no IRI is involved. If no
>evidence of a different encoding is found, then it is to be assumed that
>conversion to an IRI is valid and your steps 1-5 should be applied (but step 3
>will never be invoked).

For servers where this is the case, if there is only one other encoding,
then the above might be okay. But if there are potentially multiple
legacy encodings, that won't do the job.


>My tentative conclusion is this:
>
>The IRI draft 7 does not provide any support or advice for those needing to
>recognize and process (intelligently and efficiently) URIs containing 
>encodings
>other than UTF-8.

That is right. The draft is about IRIs, and about URIs resulting from
conversion from IRIs. It's not implementation advice for implementers
of servers and Web applications on how to distinguish different legacy
encodings.


>Where this needs to be done, something akin to my steps A-C is necessary, 
>before
>it can be decided that URI to IRI conversion should be applied.
>
>
>My concerns would be assuaged if there were a Section or Appendix in the IRI
>Internet-Draft :
>
>- Recognizing these transitional / co-existence needs,

I don't think distinguishing legacy encodings is part of what the
IRI spec should do.


>- Detailing the necessary and sufficient URI inspection required to decide
>whether or not to invoke IRI processing,

What may be going on on the server is not a conversion from URIs to IRIs,
it's an attempt to extract original data from an URI. Because Section
3.2 looks somewhat similar to what you had in mind to do that job,
you got confused.


>- Containing the cautions about the remote possibility of incorrect decisions
>being made.

There is already such caution. The draft says that detecting UTF-8
is correct with high probablity; it doesn't say it's always correct.


>I'd be prepared to help draft it.
>
>Footnote 1:
>In a 'real' implementation the two processing sequences 1-5 and A-C could be
>undertaken in a single pass through the URI using a merged algorithm,
>parameterised to define how it should proceed if a non-UTF_8 octet sequence
>should be detected (i.e. parameterised to adopt Scenarios 1 or 2). The
>performance penalty of my proposed addition would be insignificant.

Anybody who finds a more efficient way to do things is always
welcome to use that way. Specs usually try to give clear and
precise instructions, rather than to be most efficient.


>Footnote 2:
>Your approach of assuming that an IRI interpretation is valid in all 
>situations
>in which UTF_8 has been used ought also to be validated. People are already
>using UTF-8 encoding with no knowledge of IRIs.  I've not explored what impact
>the application of the stage 4+5 processing of your draft (i.e. beyond that of
>de-escaping and decoding the UTF-8 characters) could have, and whether or 
>not it
>could cause any problems for pre-IRI users of UTF-8. I don't intent to pursue
>this line of enquiry ;-)

Well, if somebody has used UTF-8 up to now, others can use IRIs.
As an example, I have used http://www.w3.org/People/D%C3%BCrst
for years. If somebody inputs it in a browser, on some browsers
(e.g. Opera) they will actually see a real IRI.
The only danger with this kind of IRIs is that they then give
that IRI to somebody else, and that person doesn't have a browser
that can resolve IRIs yet. But that's not anything that could
not happen with something that was created to be used as an
IRI from the start.


Regards,    Martin.



>Chris
>
>
>----- Original Message -----
>From: "Martin Duerst" <duerst@w3.org>
>To: "Chris Haynes" <chris@harvington.org.uk>; "Michel Suignard"
><michelsu@windows.microsoft.com>
>Cc: <public-iri@w3.org>
>Sent: Sunday, May 09, 2004 1:37 AM
>Subject: Re: Migration of HTTP to the use of IRIs [altdesign-17]
>
>
> > Hello Chris,
> >
> > I have changed the issue for this mail to altdesign-17, because it
> > seems more appropriate.
> >
> > At 11:07 04/05/07 +0100, Chris Haynes wrote:
> >
> > >Michel,
> > >
> > >Thanks for this comment, but I think my point is still valid - even 
> just for
> > >presentational uses.
> > >
> > >Given that many URI encodings exist 'in the wild' which use %HH 
> escaping of
> > >non-UTF-8 sequences, I fail to see how one can know that it is valid to
> > >convert
> > >any such URI into an IRI (as per sect. 3.2) - even if just for 
> presentational
> > >purposes.
> >
> > Section 3.2 very clearly says that there is a risk that you convert
> > to something that didn't exist previously.
> > But in practice, this is not that much of an issue, because it is
> > very rare to find reasonable text encoded in legacy encodings that
> > matches UTF-8 byte patters. Please try to find some examples yourself,
> > and you will see this.
> >
> >
> > >My concern is the same:  unless there is some kind of syntactic indicator
> > >within
> > >the URI as a whole, how can one reliably know that UTF-8 has been used and
> > >that
> > >it is intended to have a corresponding IRI?
> >
> > You are correct that one cannot do this with 100% certainty.
> > But then, if you study the URI spec very carefully, you will
> > find that it also doesn't guarantee that an 'a' in an URI
> > actually corresponds to an 'a' in the original data (e.g.
> > file name). For details, please see the "Laguna Beach"
> > example in Section 2.5 of draft-fielding-uri-rfc2396bis-05.txt,
> > for example at
> > 
> http://gbiv.com/protocols/uri/rev-2002/draft-fielding-uri-rfc2396bis-05.txt.
> >
> > So in those rare cases where an URI with an octet sequence
> > that by chance corresponds to an UTF-8 pattern, but that was
> > never intended as UTF-8, is converted to an IRI, one will just
> > get a weird name, but reusing that name again e.g. in a browser
> > that accepts IRIs will lead back to the original resource.
> >
> >
> >
> > >It seems to me that IRI will only be deployed accurately and effectively
> > >if one
> > >of the following situations occurs:
> > >
> > >1) New protocol schemes (e.g. httpi, httpis ) are introduced which make it
> > >explicit that this spec. applies to the URI,
> >
> > Introducing a new URI scheme is *extremely* expensive. I have heard
> > Tim Berners-Lee say this over and over again, and I know he knows it.
> > And in the case at hand, it's highly unnecessary. The cost of an
> > occasional accidental 'wrong' conversion back to an IRI (as discussed
> > above) is much, much smaller than the cost of introducing new schemes.
> >
> > And what would the real benefit of new schemes be? Would they be
> > useful to distinguish URIs from true IRIs (I'm writing 'true' IRIs
> > here to exclude URIs which are by definition also IRIs). Not really,
> > it's much cheaper to identify IRIs by checking for non-ASCII characters.
> >
> > So they would only be used to distinguish URIs without known origin
> > from URIs originating from conversion from IRIs. But assume I had
> > an IRI like like http://www.example.org/ros&#xE9; (rose'). In order
> > to pass it to others whom I know can only process URIs, not IRIs,
> > would I want to convert it to http://www.example.org/ros%C3%A9,
> > or to httpi://www.example.org/ros%C3%A9 ? The former strictly
> > speaking looses the information that this was an IRI, so converting
> > it back to rose' is a guess (but because of the UTF-8 patters,
> > actually a rather safe one). But it actually will go to the
> > right page, on hunderds of millions of Web browsers, without
> > exception. The later can safely be converted back to the IRI
> > (by all the software that knows how to do this, which currently
> > numbers exactly 0). But it will work only on the browsers
> > that know the httpi: scheme (again, currently numbering
> > exactly 0). For me the alternative is very clear,
> > http://www.example.org/ros%C3%A9 works in much more cases,
> > and is therefore much better.
> >
> >
> > >2) They are used within a closed environment in which it is a 
> convention that
> > >only IRIs and IRI-derived URIs are in use (no legacy-encoding escapes, or
>they
> > >are allowed to be mis-interpreted)
> >
> > The current draft clearly allows legacy-encoded escapes, for backwards
> > compatibility. I'm not sure what you mean by 'mis-interpreted', but
> > if you mean that they are converted to IRIs, then yes, the current
> > draft allows this in those cases where it is possible (i.e. the
> > byte pattern matches UTF-8,...). But this misinterpretation does
> > not lead to an actual misinterpretation of the resource that the
> > IRI identifies.
> >
> >
> > >3) A new market-dominating user agent is launched which behaves as if (2)
> > >above
> > >were the case - i.e. there is an attempt to establish IRIs as the de facto
> > >default through market force, ignoring or discarding resulting errors of
> > >presentation or of resource identification.
> > >
> > >My big fear is that without rapid progress on (1), IRIs on the open 
> Internet
> > >will only ever take off if someone does (3) - which will be without 
> benefit
>of
> > >adequate standards backing.
> >
> > I'm not sure I understand you. Several browsers, for example
> > Opera and Safari, already implement IRIs. MS IE also does it
> > if the relevant flag is set correctly. And the standard is
> > close to done; this is the last real issue I'm trying to close.
> > So I don't see the problem.
> >
> >
> > >I'd love to either:
> > >
> > >a) be shown that my logic is faulty
> >
> > I guess yes. Not in theory, where absolute correctness is the
> > only goal, but in practice, where big numbers and deployment
> > are important.
> >
> > >or
> > >
> > >b) be pleasantly surprised by being told that there _is_  RFC work taking
> > >place
> > >on new schemes covering at least the space of http(s)
> >
> > Some schemes may benefit from an update, in particular those that
> > haven't thought about internationalization. The first example that
> > would come to my mind is the mailto: scheme.
> >
> >
> > Regards,    Martin.
> >
> >
> >
> > >otherwise, I fail to understand how IRIs will 'take off' in the 'real
>world' -
> > >where they are so badly needed.
> > >
> > >Chris
> > >
> > >
> > >
> > >
> > >----- Original Message -----
> > >From: "Michel Suignard" <michelsu@windows.microsoft.com>
> > >To: "Chris Haynes" <chris@harvington.org.uk>
> > >Cc: <public-iri@w3.org>; "Martin Duerst" <duerst@w3.org>
> > >Sent: Friday, May 07, 2004 1:43 AM
> > >Subject: RE: Migration of HTTP to the use of IRIs [queryclarify-16]
> > >
> > >
> > >
> > > > From:  Chris Haynes
> > > > Sent: Thursday, May 06, 2004 4:50 AM
> > > >
> > > > Actually, my original core concern has now been covered in your
> > >section
> > > > 1.2.a - Applicability, where you make it clear that "the intent is not
> > >to
> > > > introduce IRIs into contexts that are not defined to accept them".
> > > >
> > > > This now makes it clear that new schemas will be required to replace
> > > > http: , https: etc. These will need to be self-identifying in some
> > >way, so
> > > > that receiving equipment will know that an IRI is being presented.
> > > >
> > > > So, as I commented last June, I await with interest the recognition
> > >among
> > > > those responsible for the HTTP schema that new schemas with new names
> > >are
> > > > required before IRIs can be used.
> > >
> > >I'd like to comment on that. The IRI spec is fairly explicit on that IRI
> > >can be used as presentation elements for URI protocol elements (ref
> > >clause 3 intro). This is to recognize that applications out there have
> > >not waited for us for creating presentation layers that use non ascii
> > >native characters for schemes that supposedly should not use them (such
> > >as http). The presentation layer principle is there to support that. So
> > >I expect IRI to be used for both purposes:
> > >- presentation layer for existing URI schemes
> > >- core layer for new schemes exclusively defined using IRI for protocol
> > >elements syntax.
> > >
> > >For a while I'd expect the vast majority of IRI usage to be in the first
> > >category.
> > >
> > >Michel
> > >
> > >
> >
> >
> >
Received on Sunday, 9 May 2004 23:16:05 UTC