Re: Migration of HTTP to the use of IRIs [altdesign-17] from Chris Haynes on 2004-05-11 (public-iri@w3.org from May 2004)

From: Chris Haynes <chris@harvington.org.uk>
Date: Tue, 11 May 2004 08:27:48 +0100
To: "Martin Duerst" <duerst@w3.org>
Cc: <public-iri@w3.org>, "Michel Suignard" <michelsu@windows.microsoft.com>
Message-ID: <01ab01c43729$789935b0$0200000a@ringo>
Ok Martin,

Having carefully re-read the draft, and having checked the terms of reference
for this activity, I now understand that I was placing unwarrented expectations
on the scope and applicability of this work.

I therefore have no further concerns re. altdesign-17

Chris


"Martin Duerst" replied:

>
> Hello Chris,
>
> At 11:05 04/05/09 +0100, Chris Haynes wrote:
>
> >Michael,
>
> [Well, Martin that would have been.]
>
>
> >Thanks for your patience.
> >
> >So I think you are saying that the flaw in my logic is when I asserted that
> >there is no syntactic indication of the use of an IRI. Your assertion, in
> >effect, is that the syntactic indication is only present when needed, and is
> >implicit in the use of UTF-8 encoding.
>
> Not exactly. There is no syntactic indication needed for HTTP
> and URIs/IRIs to work the way they are designed.
>
> HTTP may in the future decide to introduce a convention to let the
> client tell the server about character encodings e.g. in query parameters,
> but this is idenpendent of the IRI spec. Such a convention may use another
> scheme (which I doubt very much), a special named query parameter
> (much more likely, some browsers can already do this, and some
> sites (e.g. google) use this), or something else.
>
> Such a solution could take UTF-8 as a special case, or it could
> treat UTF-8 just as one choice among many. IRIs will definitely
> provide a push towards UTF-8, but they cannot force anybody to
> use UTF-8.
>
>
> >Your assertion below is that the vast majority of cases which were not
encoded
> >in UTF-8 will generate one or more octet sub-sequences which are not legal
> >representations of characters in UTF-8.
>
> Yes. Or they will trivially be ASCII-only, in which case
> the question of encoding is mostly irrelevant.
>
>
> >I should use the presence of such
> >sequences to conclude that the URI was not encoded in UTF-8 and therefore
that
> >conversion to an IRI  is not applicable.
> >
> >My first thought was to try applying the processing you define in (draft
> >7) sect
> >3.2, to see if that would provide a 'failure indication' that I could use.
> >
> >But I came to your Step 3:
> >
> >"Re-escape any octet produced in step 2 that is not part of a strictly legal
> >UTF-8 sequence".
> >
> >
> >This step re-absorbs octet sequences which are illegal in UTF-8 into the IRI
> >world, so, applying section 3.2 in its entirety _cannot_ be used as the
> >basis of
> >a decision on whether or not UTF-8 encoding was used in the original
escaping.
>
> Yes. Section 3.2 isn't something that returns a boolean, it returns
> an IRI. And it tries to convert as many escapes as possible into
> actual characters. If that's not what you need, don't use Section 3.2.
>
>
> >Section 3.2 can only be applied if it is desired to _force_ everything that
is
> >received into an IRI.
> >
> >Your draft 7  does not provide the basis for deciding whether or not the URI
> >should be treated in this way. i.e. it does not give any opportunity for
> >concluding that the presented URI was encoded using some other (legacy)
> >encoding.
>
> Yes. There may be many reasons why somebody may want to use Section 3.2,
> and many other reasons why one wouldn't want to use it, or would not
> want to use it exactly as described.
>
>
> >You may recall that my concern is for the design of a web server including
> >something like a Servlet handler, which has to decode the URI before it can
> >identify and invoke the referenced servlet (which might know what encoding
was
> >used in URIs identifying that Servlet).
>
> If there is really a cyclic interdependency (i.e. servlet knows encoding,
> but servlet handler has to know encoding to be able to call servlet),
> then the only thing you can really do seems to be trial-and-error.
>
> If there is not really a cyclic interdependency (i.e. servlet knows
> encoding, and could handle it, but you want to do decoding in the
> servlet handler code), then this might just be bad design and
> software engineering.
>
> This does not mean that the servlet handler couldn't do certain
> things on behalf of the servlet, but you most probably need a
> more flexible interaction.
>
>
> >In this 'real world' that I keep worrying about there will be a long
> >transition
> >phase when there will be many inbound URLs which contain escapes generated
> >using
> >other encodings.
>
> Yes in general. But this will be very different for different servers.
> I know of cases where UTF-8 is already used throughout since years,
> and I'm sure there are cases where legacy encodings will still be
> used in some years. Thus different servers will have different
> needs when it comes to analyzing incomming HTTP URIs, in particular
> query parts.
>
>
> >Forcing them into IRIs is not appropriate behaviour; by some
> >other means the appropriate decoding must be selected and applied.
>
> Agreed. But I don't think the draft ever says you have to do that.
>
> In more general terms, I'm not sure why you want to convert an
> URI arrived on the server to an IRI. What you want to do is to
> take the URI appart and work on resolving it.
>
>
> >It seems to me that, in this situation, where URLs containing encodings other
> >than UTF-8 are to be handled differently, rather than be forced into IRIs by
> >your section 3.2,
>
> Section 3.2 in no way forces you to convert URIs to IRIs on the server!
>
>
> >a different sequence is required. Something like:
> >
> >A)  Convert the received URI into an octet sequence as follows: Each %HH
> >triplet
> >generates an octet whose value is defined by the hex digits HH. All other
> >(ASCII) characters generate an octet whose value is that of the code point of
> >that character in the ASCII/UTF-8 code table.
> >
> >B) Attempt to process the octet sequence generated by B as a UTF_8-encoded
> >octet
> >sequence. If the octet sequence is 'legal', i.e. it is the correct
> >encoding of a
> >sequence of integer values (but not necessarily representing valid Unicode
> >code
> >points), then the URI does represent an IRI and the processing of (draft 7
> >sect.32.) should be applied to extract the IRI.
> >
> >C) If, in step B, there should have been found one or more octet sequences
> >which
> >did not form part of any 'legal' UTF_8 sequence, then no IRIs are involved
and
> >the interpretation of the presented URI is to be decided by other means.
>
> In some cases, a procedure like the above may be appropriate for
> implementing some servlet logic. But please note that what you are
> actually trying to do really has nothing to do with reconstructing
> an IRI from an URI; what you are trying to do is to reconstruct the
> original characters that should be handled to the servlet.
>
> Looking at the details, I see the following issues:
> - The decision should probably not be taken for the whole URI, but
>    e.g. on the query part only. There can easily be cases where the
>    query part is in UTF-8, but the path part is not, or the other
>    way round.
> - The procedure isn't really complete if you end with 'is to be
>    decided by other means'. Here several situations may arise,
>    and they may need different solutions. In general, you need
>    to have a way to know the actual encoding. The most general
>    way to do this is to include a hidden form element in the
>    form, with a text that will be encoded differently in the
>    encodings you want to distinguish (e.g. UTF-8, iso-2022-jp,
>    euc-jp, and shift_jis for Japanese).
>
> So if you are working with content, forms, and an audience where
> you expect query parts in a variety of encodings, your above
> procedure won't really cut it.
>
>
> >Note that the application of the procedure A-C above will mean that your
> >step 3
> >will never be applied.
> >
> >
> >So I think we have two possible scenarios:
> >
> >Scenario 1)  The world is to be viewed as containing only IRIs.  _All_
> >received
> >URIs are converted into IRIs consisting of a sequence of  'appropriate' (your
> >step 4) UTF characters.  Any non-UTF-8 escapes are still present as
> >still-escaped sequences in the IRI; there has been no attempt to interpret
> >these
> >as characters in some other encoding.
>
> It's much better to think about this per server or Web application.
> There may be Web applications where only IRIs are expected. If
> non-UTF-8 escapes are found, then rather than keep these as still-
> escaped sequences, they should produce an appropriate error message
> to the user, e.g. "You have submitted data to this application
> that could not be processed correctly, because it was not encoded
> correctly by your browser. If you have a very old browser, please
> upgrade. ..."
>
>
> >Scenario 2)  In a word in which URIs intended to represent IRIs co-exist with
> >URIs encoded using other character encodings, and where the difference has
> >to be
> >detected so that the appropriate decoding can be applied, then my steps
> >A-C must
> >first be undertaken. If my steps A-C indicate that another encoding was used,
> >then the URI is to be handled in some other way, and no IRI is involved. If
no
> >evidence of a different encoding is found, then it is to be assumed that
> >conversion to an IRI is valid and your steps 1-5 should be applied (but step
3
> >will never be invoked).
>
> For servers where this is the case, if there is only one other encoding,
> then the above might be okay. But if there are potentially multiple
> legacy encodings, that won't do the job.
>
>
> >My tentative conclusion is this:
> >
> >The IRI draft 7 does not provide any support or advice for those needing to
> >recognize and process (intelligently and efficiently) URIs containing
> >encodings
> >other than UTF-8.
>
> That is right. The draft is about IRIs, and about URIs resulting from
> conversion from IRIs. It's not implementation advice for implementers
> of servers and Web applications on how to distinguish different legacy
> encodings.
>
>
> >Where this needs to be done, something akin to my steps A-C is necessary,
> >before
> >it can be decided that URI to IRI conversion should be applied.
> >
> >
> >My concerns would be assuaged if there were a Section or Appendix in the IRI
> >Internet-Draft :
> >
> >- Recognizing these transitional / co-existence needs,
>
> I don't think distinguishing legacy encodings is part of what the
> IRI spec should do.
>
>
> >- Detailing the necessary and sufficient URI inspection required to decide
> >whether or not to invoke IRI processing,
>
> What may be going on on the server is not a conversion from URIs to IRIs,
> it's an attempt to extract original data from an URI. Because Section
> 3.2 looks somewhat similar to what you had in mind to do that job,
> you got confused.
>
>
> >- Containing the cautions about the remote possibility of incorrect decisions
> >being made.
>
> There is already such caution. The draft says that detecting UTF-8
> is correct with high probablity; it doesn't say it's always correct.
>
>
> >I'd be prepared to help draft it.
> >
> >Footnote 1:
> >In a 'real' implementation the two processing sequences 1-5 and A-C could be
> >undertaken in a single pass through the URI using a merged algorithm,
> >parameterised to define how it should proceed if a non-UTF_8 octet sequence
> >should be detected (i.e. parameterised to adopt Scenarios 1 or 2). The
> >performance penalty of my proposed addition would be insignificant.
>
> Anybody who finds a more efficient way to do things is always
> welcome to use that way. Specs usually try to give clear and
> precise instructions, rather than to be most efficient.
>
>
> >Footnote 2:
> >Your approach of assuming that an IRI interpretation is valid in all
> >situations
> >in which UTF_8 has been used ought also to be validated. People are already
> >using UTF-8 encoding with no knowledge of IRIs.  I've not explored what
impact
> >the application of the stage 4+5 processing of your draft (i.e. beyond that
of
> >de-escaping and decoding the UTF-8 characters) could have, and whether or
> >not it
> >could cause any problems for pre-IRI users of UTF-8. I don't intent to pursue
> >this line of enquiry ;-)
>
> Well, if somebody has used UTF-8 up to now, others can use IRIs.
> As an example, I have used http://www.w3.org/People/D%C3%BCrst
> for years. If somebody inputs it in a browser, on some browsers
> (e.g. Opera) they will actually see a real IRI.
> The only danger with this kind of IRIs is that they then give
> that IRI to somebody else, and that person doesn't have a browser
> that can resolve IRIs yet. But that's not anything that could
> not happen with something that was created to be used as an
> IRI from the start.
>
>
> Regards,    Martin.
>
>
>
> >Chris
> >
> >
> >----- Original Message -----
> >From: "Martin Duerst" <duerst@w3.org>
> >To: "Chris Haynes" <chris@harvington.org.uk>; "Michel Suignard"
> ><michelsu@windows.microsoft.com>
> >Cc: <public-iri@w3.org>
> >Sent: Sunday, May 09, 2004 1:37 AM
> >Subject: Re: Migration of HTTP to the use of IRIs [altdesign-17]
> >
> >
> > > Hello Chris,
> > >
> > > I have changed the issue for this mail to altdesign-17, because it
> > > seems more appropriate.
> > >
> > > At 11:07 04/05/07 +0100, Chris Haynes wrote:
> > >
> > > >Michel,
> > > >
> > > >Thanks for this comment, but I think my point is still valid - even
> > just for
> > > >presentational uses.
> > > >
> > > >Given that many URI encodings exist 'in the wild' which use %HH
> > escaping of
> > > >non-UTF-8 sequences, I fail to see how one can know that it is valid to
> > > >convert
> > > >any such URI into an IRI (as per sect. 3.2) - even if just for
> > presentational
> > > >purposes.
> > >
> > > Section 3.2 very clearly says that there is a risk that you convert
> > > to something that didn't exist previously.
> > > But in practice, this is not that much of an issue, because it is
> > > very rare to find reasonable text encoded in legacy encodings that
> > > matches UTF-8 byte patters. Please try to find some examples yourself,
> > > and you will see this.
> > >
> > >
> > > >My concern is the same:  unless there is some kind of syntactic indicator
> > > >within
> > > >the URI as a whole, how can one reliably know that UTF-8 has been used
and
> > > >that
> > > >it is intended to have a corresponding IRI?
> > >
> > > You are correct that one cannot do this with 100% certainty.
> > > But then, if you study the URI spec very carefully, you will
> > > find that it also doesn't guarantee that an 'a' in an URI
> > > actually corresponds to an 'a' in the original data (e.g.
> > > file name). For details, please see the "Laguna Beach"
> > > example in Section 2.5 of draft-fielding-uri-rfc2396bis-05.txt,
> > > for example at
> > >
> > http://gbiv.com/protocols/uri/rev-2002/draft-fielding-uri-rfc2396bis-05.txt.
> > >
> > > So in those rare cases where an URI with an octet sequence
> > > that by chance corresponds to an UTF-8 pattern, but that was
> > > never intended as UTF-8, is converted to an IRI, one will just
> > > get a weird name, but reusing that name again e.g. in a browser
> > > that accepts IRIs will lead back to the original resource.
> > >
> > >
> > >
> > > >It seems to me that IRI will only be deployed accurately and effectively
> > > >if one
> > > >of the following situations occurs:
> > > >
> > > >1) New protocol schemes (e.g. httpi, httpis ) are introduced which make
it
> > > >explicit that this spec. applies to the URI,
> > >
> > > Introducing a new URI scheme is *extremely* expensive. I have heard
> > > Tim Berners-Lee say this over and over again, and I know he knows it.
> > > And in the case at hand, it's highly unnecessary. The cost of an
> > > occasional accidental 'wrong' conversion back to an IRI (as discussed
> > > above) is much, much smaller than the cost of introducing new schemes.
> > >
> > > And what would the real benefit of new schemes be? Would they be
> > > useful to distinguish URIs from true IRIs (I'm writing 'true' IRIs
> > > here to exclude URIs which are by definition also IRIs). Not really,
> > > it's much cheaper to identify IRIs by checking for non-ASCII characters.
> > >
> > > So they would only be used to distinguish URIs without known origin
> > > from URIs originating from conversion from IRIs. But assume I had
> > > an IRI like like http://www.example.org/ros&#xE9; (rose'). In order
> > > to pass it to others whom I know can only process URIs, not IRIs,
> > > would I want to convert it to http://www.example.org/ros%C3%A9,
> > > or to httpi://www.example.org/ros%C3%A9 ? The former strictly
> > > speaking looses the information that this was an IRI, so converting
> > > it back to rose' is a guess (but because of the UTF-8 patters,
> > > actually a rather safe one). But it actually will go to the
> > > right page, on hunderds of millions of Web browsers, without
> > > exception. The later can safely be converted back to the IRI
> > > (by all the software that knows how to do this, which currently
> > > numbers exactly 0). But it will work only on the browsers
> > > that know the httpi: scheme (again, currently numbering
> > > exactly 0). For me the alternative is very clear,
> > > http://www.example.org/ros%C3%A9 works in much more cases,
> > > and is therefore much better.
> > >
> > >
> > > >2) They are used within a closed environment in which it is a
> > convention that
> > > >only IRIs and IRI-derived URIs are in use (no legacy-encoding escapes, or
> >they
> > > >are allowed to be mis-interpreted)
> > >
> > > The current draft clearly allows legacy-encoded escapes, for backwards
> > > compatibility. I'm not sure what you mean by 'mis-interpreted', but
> > > if you mean that they are converted to IRIs, then yes, the current
> > > draft allows this in those cases where it is possible (i.e. the
> > > byte pattern matches UTF-8,...). But this misinterpretation does
> > > not lead to an actual misinterpretation of the resource that the
> > > IRI identifies.
> > >
> > >
> > > >3) A new market-dominating user agent is launched which behaves as if (2)
> > > >above
> > > >were the case - i.e. there is an attempt to establish IRIs as the de
facto
> > > >default through market force, ignoring or discarding resulting errors of
> > > >presentation or of resource identification.
> > > >
> > > >My big fear is that without rapid progress on (1), IRIs on the open
> > Internet
> > > >will only ever take off if someone does (3) - which will be without
> > benefit
> >of
> > > >adequate standards backing.
> > >
> > > I'm not sure I understand you. Several browsers, for example
> > > Opera and Safari, already implement IRIs. MS IE also does it
> > > if the relevant flag is set correctly. And the standard is
> > > close to done; this is the last real issue I'm trying to close.
> > > So I don't see the problem.
> > >
> > >
> > > >I'd love to either:
> > > >
> > > >a) be shown that my logic is faulty
> > >
> > > I guess yes. Not in theory, where absolute correctness is the
> > > only goal, but in practice, where big numbers and deployment
> > > are important.
> > >
> > > >or
> > > >
> > > >b) be pleasantly surprised by being told that there _is_  RFC work taking
> > > >place
> > > >on new schemes covering at least the space of http(s)
> > >
> > > Some schemes may benefit from an update, in particular those that
> > > haven't thought about internationalization. The first example that
> > > would come to my mind is the mailto: scheme.
> > >
> > >
> > > Regards,    Martin.
> > >
> > >
> > >
> > > >otherwise, I fail to understand how IRIs will 'take off' in the 'real
> >world' -
> > > >where they are so badly needed.
> > > >
> > > >Chris
> > > >
> > > >
> > > >
> > > >
> > > >----- Original Message -----
> > > >From: "Michel Suignard" <michelsu@windows.microsoft.com>
> > > >To: "Chris Haynes" <chris@harvington.org.uk>
> > > >Cc: <public-iri@w3.org>; "Martin Duerst" <duerst@w3.org>
> > > >Sent: Friday, May 07, 2004 1:43 AM
> > > >Subject: RE: Migration of HTTP to the use of IRIs [queryclarify-16]
> > > >
> > > >
> > > >
> > > > > From:  Chris Haynes
> > > > > Sent: Thursday, May 06, 2004 4:50 AM
> > > > >
> > > > > Actually, my original core concern has now been covered in your
> > > >section
> > > > > 1.2.a - Applicability, where you make it clear that "the intent is not
> > > >to
> > > > > introduce IRIs into contexts that are not defined to accept them".
> > > > >
> > > > > This now makes it clear that new schemas will be required to replace
> > > > > http: , https: etc. These will need to be self-identifying in some
> > > >way, so
> > > > > that receiving equipment will know that an IRI is being presented.
> > > > >
> > > > > So, as I commented last June, I await with interest the recognition
> > > >among
> > > > > those responsible for the HTTP schema that new schemas with new names
> > > >are
> > > > > required before IRIs can be used.
> > > >
> > > >I'd like to comment on that. The IRI spec is fairly explicit on that IRI
> > > >can be used as presentation elements for URI protocol elements (ref
> > > >clause 3 intro). This is to recognize that applications out there have
> > > >not waited for us for creating presentation layers that use non ascii
> > > >native characters for schemes that supposedly should not use them (such
> > > >as http). The presentation layer principle is there to support that. So
> > > >I expect IRI to be used for both purposes:
> > > >- presentation layer for existing URI schemes
> > > >- core layer for new schemes exclusively defined using IRI for protocol
> > > >elements syntax.
> > > >
> > > >For a while I'd expect the vast majority of IRI usage to be in the first
> > > >category.
> > > >
> > > >Michel
> > > >
> > > >
> > >
> > >
> > >
>
>
>
Received on Tuesday, 11 May 2004 03:29:56 UTC