- From: Chris Haynes <chris@harvington.org.uk>
- Date: Tue, 11 May 2004 08:27:48 +0100
- To: "Martin Duerst" <duerst@w3.org>
- Cc: <public-iri@w3.org>, "Michel Suignard" <michelsu@windows.microsoft.com>
Ok Martin, Having carefully re-read the draft, and having checked the terms of reference for this activity, I now understand that I was placing unwarrented expectations on the scope and applicability of this work. I therefore have no further concerns re. altdesign-17 Chris "Martin Duerst" replied: > > Hello Chris, > > At 11:05 04/05/09 +0100, Chris Haynes wrote: > > >Michael, > > [Well, Martin that would have been.] > > > >Thanks for your patience. > > > >So I think you are saying that the flaw in my logic is when I asserted that > >there is no syntactic indication of the use of an IRI. Your assertion, in > >effect, is that the syntactic indication is only present when needed, and is > >implicit in the use of UTF-8 encoding. > > Not exactly. There is no syntactic indication needed for HTTP > and URIs/IRIs to work the way they are designed. > > HTTP may in the future decide to introduce a convention to let the > client tell the server about character encodings e.g. in query parameters, > but this is idenpendent of the IRI spec. Such a convention may use another > scheme (which I doubt very much), a special named query parameter > (much more likely, some browsers can already do this, and some > sites (e.g. google) use this), or something else. > > Such a solution could take UTF-8 as a special case, or it could > treat UTF-8 just as one choice among many. IRIs will definitely > provide a push towards UTF-8, but they cannot force anybody to > use UTF-8. > > > >Your assertion below is that the vast majority of cases which were not encoded > >in UTF-8 will generate one or more octet sub-sequences which are not legal > >representations of characters in UTF-8. > > Yes. Or they will trivially be ASCII-only, in which case > the question of encoding is mostly irrelevant. > > > >I should use the presence of such > >sequences to conclude that the URI was not encoded in UTF-8 and therefore that > >conversion to an IRI is not applicable. > > > >My first thought was to try applying the processing you define in (draft > >7) sect > >3.2, to see if that would provide a 'failure indication' that I could use. > > > >But I came to your Step 3: > > > >"Re-escape any octet produced in step 2 that is not part of a strictly legal > >UTF-8 sequence". > > > > > >This step re-absorbs octet sequences which are illegal in UTF-8 into the IRI > >world, so, applying section 3.2 in its entirety _cannot_ be used as the > >basis of > >a decision on whether or not UTF-8 encoding was used in the original escaping. > > Yes. Section 3.2 isn't something that returns a boolean, it returns > an IRI. And it tries to convert as many escapes as possible into > actual characters. If that's not what you need, don't use Section 3.2. > > > >Section 3.2 can only be applied if it is desired to _force_ everything that is > >received into an IRI. > > > >Your draft 7 does not provide the basis for deciding whether or not the URI > >should be treated in this way. i.e. it does not give any opportunity for > >concluding that the presented URI was encoded using some other (legacy) > >encoding. > > Yes. There may be many reasons why somebody may want to use Section 3.2, > and many other reasons why one wouldn't want to use it, or would not > want to use it exactly as described. > > > >You may recall that my concern is for the design of a web server including > >something like a Servlet handler, which has to decode the URI before it can > >identify and invoke the referenced servlet (which might know what encoding was > >used in URIs identifying that Servlet). > > If there is really a cyclic interdependency (i.e. servlet knows encoding, > but servlet handler has to know encoding to be able to call servlet), > then the only thing you can really do seems to be trial-and-error. > > If there is not really a cyclic interdependency (i.e. servlet knows > encoding, and could handle it, but you want to do decoding in the > servlet handler code), then this might just be bad design and > software engineering. > > This does not mean that the servlet handler couldn't do certain > things on behalf of the servlet, but you most probably need a > more flexible interaction. > > > >In this 'real world' that I keep worrying about there will be a long > >transition > >phase when there will be many inbound URLs which contain escapes generated > >using > >other encodings. > > Yes in general. But this will be very different for different servers. > I know of cases where UTF-8 is already used throughout since years, > and I'm sure there are cases where legacy encodings will still be > used in some years. Thus different servers will have different > needs when it comes to analyzing incomming HTTP URIs, in particular > query parts. > > > >Forcing them into IRIs is not appropriate behaviour; by some > >other means the appropriate decoding must be selected and applied. > > Agreed. But I don't think the draft ever says you have to do that. > > In more general terms, I'm not sure why you want to convert an > URI arrived on the server to an IRI. What you want to do is to > take the URI appart and work on resolving it. > > > >It seems to me that, in this situation, where URLs containing encodings other > >than UTF-8 are to be handled differently, rather than be forced into IRIs by > >your section 3.2, > > Section 3.2 in no way forces you to convert URIs to IRIs on the server! > > > >a different sequence is required. Something like: > > > >A) Convert the received URI into an octet sequence as follows: Each %HH > >triplet > >generates an octet whose value is defined by the hex digits HH. All other > >(ASCII) characters generate an octet whose value is that of the code point of > >that character in the ASCII/UTF-8 code table. > > > >B) Attempt to process the octet sequence generated by B as a UTF_8-encoded > >octet > >sequence. If the octet sequence is 'legal', i.e. it is the correct > >encoding of a > >sequence of integer values (but not necessarily representing valid Unicode > >code > >points), then the URI does represent an IRI and the processing of (draft 7 > >sect.32.) should be applied to extract the IRI. > > > >C) If, in step B, there should have been found one or more octet sequences > >which > >did not form part of any 'legal' UTF_8 sequence, then no IRIs are involved and > >the interpretation of the presented URI is to be decided by other means. > > In some cases, a procedure like the above may be appropriate for > implementing some servlet logic. But please note that what you are > actually trying to do really has nothing to do with reconstructing > an IRI from an URI; what you are trying to do is to reconstruct the > original characters that should be handled to the servlet. > > Looking at the details, I see the following issues: > - The decision should probably not be taken for the whole URI, but > e.g. on the query part only. There can easily be cases where the > query part is in UTF-8, but the path part is not, or the other > way round. > - The procedure isn't really complete if you end with 'is to be > decided by other means'. Here several situations may arise, > and they may need different solutions. In general, you need > to have a way to know the actual encoding. The most general > way to do this is to include a hidden form element in the > form, with a text that will be encoded differently in the > encodings you want to distinguish (e.g. UTF-8, iso-2022-jp, > euc-jp, and shift_jis for Japanese). > > So if you are working with content, forms, and an audience where > you expect query parts in a variety of encodings, your above > procedure won't really cut it. > > > >Note that the application of the procedure A-C above will mean that your > >step 3 > >will never be applied. > > > > > >So I think we have two possible scenarios: > > > >Scenario 1) The world is to be viewed as containing only IRIs. _All_ > >received > >URIs are converted into IRIs consisting of a sequence of 'appropriate' (your > >step 4) UTF characters. Any non-UTF-8 escapes are still present as > >still-escaped sequences in the IRI; there has been no attempt to interpret > >these > >as characters in some other encoding. > > It's much better to think about this per server or Web application. > There may be Web applications where only IRIs are expected. If > non-UTF-8 escapes are found, then rather than keep these as still- > escaped sequences, they should produce an appropriate error message > to the user, e.g. "You have submitted data to this application > that could not be processed correctly, because it was not encoded > correctly by your browser. If you have a very old browser, please > upgrade. ..." > > > >Scenario 2) In a word in which URIs intended to represent IRIs co-exist with > >URIs encoded using other character encodings, and where the difference has > >to be > >detected so that the appropriate decoding can be applied, then my steps > >A-C must > >first be undertaken. If my steps A-C indicate that another encoding was used, > >then the URI is to be handled in some other way, and no IRI is involved. If no > >evidence of a different encoding is found, then it is to be assumed that > >conversion to an IRI is valid and your steps 1-5 should be applied (but step 3 > >will never be invoked). > > For servers where this is the case, if there is only one other encoding, > then the above might be okay. But if there are potentially multiple > legacy encodings, that won't do the job. > > > >My tentative conclusion is this: > > > >The IRI draft 7 does not provide any support or advice for those needing to > >recognize and process (intelligently and efficiently) URIs containing > >encodings > >other than UTF-8. > > That is right. The draft is about IRIs, and about URIs resulting from > conversion from IRIs. It's not implementation advice for implementers > of servers and Web applications on how to distinguish different legacy > encodings. > > > >Where this needs to be done, something akin to my steps A-C is necessary, > >before > >it can be decided that URI to IRI conversion should be applied. > > > > > >My concerns would be assuaged if there were a Section or Appendix in the IRI > >Internet-Draft : > > > >- Recognizing these transitional / co-existence needs, > > I don't think distinguishing legacy encodings is part of what the > IRI spec should do. > > > >- Detailing the necessary and sufficient URI inspection required to decide > >whether or not to invoke IRI processing, > > What may be going on on the server is not a conversion from URIs to IRIs, > it's an attempt to extract original data from an URI. Because Section > 3.2 looks somewhat similar to what you had in mind to do that job, > you got confused. > > > >- Containing the cautions about the remote possibility of incorrect decisions > >being made. > > There is already such caution. The draft says that detecting UTF-8 > is correct with high probablity; it doesn't say it's always correct. > > > >I'd be prepared to help draft it. > > > >Footnote 1: > >In a 'real' implementation the two processing sequences 1-5 and A-C could be > >undertaken in a single pass through the URI using a merged algorithm, > >parameterised to define how it should proceed if a non-UTF_8 octet sequence > >should be detected (i.e. parameterised to adopt Scenarios 1 or 2). The > >performance penalty of my proposed addition would be insignificant. > > Anybody who finds a more efficient way to do things is always > welcome to use that way. Specs usually try to give clear and > precise instructions, rather than to be most efficient. > > > >Footnote 2: > >Your approach of assuming that an IRI interpretation is valid in all > >situations > >in which UTF_8 has been used ought also to be validated. People are already > >using UTF-8 encoding with no knowledge of IRIs. I've not explored what impact > >the application of the stage 4+5 processing of your draft (i.e. beyond that of > >de-escaping and decoding the UTF-8 characters) could have, and whether or > >not it > >could cause any problems for pre-IRI users of UTF-8. I don't intent to pursue > >this line of enquiry ;-) > > Well, if somebody has used UTF-8 up to now, others can use IRIs. > As an example, I have used http://www.w3.org/People/D%C3%BCrst > for years. If somebody inputs it in a browser, on some browsers > (e.g. Opera) they will actually see a real IRI. > The only danger with this kind of IRIs is that they then give > that IRI to somebody else, and that person doesn't have a browser > that can resolve IRIs yet. But that's not anything that could > not happen with something that was created to be used as an > IRI from the start. > > > Regards, Martin. > > > > >Chris > > > > > >----- Original Message ----- > >From: "Martin Duerst" <duerst@w3.org> > >To: "Chris Haynes" <chris@harvington.org.uk>; "Michel Suignard" > ><michelsu@windows.microsoft.com> > >Cc: <public-iri@w3.org> > >Sent: Sunday, May 09, 2004 1:37 AM > >Subject: Re: Migration of HTTP to the use of IRIs [altdesign-17] > > > > > > > Hello Chris, > > > > > > I have changed the issue for this mail to altdesign-17, because it > > > seems more appropriate. > > > > > > At 11:07 04/05/07 +0100, Chris Haynes wrote: > > > > > > >Michel, > > > > > > > >Thanks for this comment, but I think my point is still valid - even > > just for > > > >presentational uses. > > > > > > > >Given that many URI encodings exist 'in the wild' which use %HH > > escaping of > > > >non-UTF-8 sequences, I fail to see how one can know that it is valid to > > > >convert > > > >any such URI into an IRI (as per sect. 3.2) - even if just for > > presentational > > > >purposes. > > > > > > Section 3.2 very clearly says that there is a risk that you convert > > > to something that didn't exist previously. > > > But in practice, this is not that much of an issue, because it is > > > very rare to find reasonable text encoded in legacy encodings that > > > matches UTF-8 byte patters. Please try to find some examples yourself, > > > and you will see this. > > > > > > > > > >My concern is the same: unless there is some kind of syntactic indicator > > > >within > > > >the URI as a whole, how can one reliably know that UTF-8 has been used and > > > >that > > > >it is intended to have a corresponding IRI? > > > > > > You are correct that one cannot do this with 100% certainty. > > > But then, if you study the URI spec very carefully, you will > > > find that it also doesn't guarantee that an 'a' in an URI > > > actually corresponds to an 'a' in the original data (e.g. > > > file name). For details, please see the "Laguna Beach" > > > example in Section 2.5 of draft-fielding-uri-rfc2396bis-05.txt, > > > for example at > > > > > http://gbiv.com/protocols/uri/rev-2002/draft-fielding-uri-rfc2396bis-05.txt. > > > > > > So in those rare cases where an URI with an octet sequence > > > that by chance corresponds to an UTF-8 pattern, but that was > > > never intended as UTF-8, is converted to an IRI, one will just > > > get a weird name, but reusing that name again e.g. in a browser > > > that accepts IRIs will lead back to the original resource. > > > > > > > > > > > > >It seems to me that IRI will only be deployed accurately and effectively > > > >if one > > > >of the following situations occurs: > > > > > > > >1) New protocol schemes (e.g. httpi, httpis ) are introduced which make it > > > >explicit that this spec. applies to the URI, > > > > > > Introducing a new URI scheme is *extremely* expensive. I have heard > > > Tim Berners-Lee say this over and over again, and I know he knows it. > > > And in the case at hand, it's highly unnecessary. The cost of an > > > occasional accidental 'wrong' conversion back to an IRI (as discussed > > > above) is much, much smaller than the cost of introducing new schemes. > > > > > > And what would the real benefit of new schemes be? Would they be > > > useful to distinguish URIs from true IRIs (I'm writing 'true' IRIs > > > here to exclude URIs which are by definition also IRIs). Not really, > > > it's much cheaper to identify IRIs by checking for non-ASCII characters. > > > > > > So they would only be used to distinguish URIs without known origin > > > from URIs originating from conversion from IRIs. But assume I had > > > an IRI like like http://www.example.org/rosé (rose'). In order > > > to pass it to others whom I know can only process URIs, not IRIs, > > > would I want to convert it to http://www.example.org/ros%C3%A9, > > > or to httpi://www.example.org/ros%C3%A9 ? The former strictly > > > speaking looses the information that this was an IRI, so converting > > > it back to rose' is a guess (but because of the UTF-8 patters, > > > actually a rather safe one). But it actually will go to the > > > right page, on hunderds of millions of Web browsers, without > > > exception. The later can safely be converted back to the IRI > > > (by all the software that knows how to do this, which currently > > > numbers exactly 0). But it will work only on the browsers > > > that know the httpi: scheme (again, currently numbering > > > exactly 0). For me the alternative is very clear, > > > http://www.example.org/ros%C3%A9 works in much more cases, > > > and is therefore much better. > > > > > > > > > >2) They are used within a closed environment in which it is a > > convention that > > > >only IRIs and IRI-derived URIs are in use (no legacy-encoding escapes, or > >they > > > >are allowed to be mis-interpreted) > > > > > > The current draft clearly allows legacy-encoded escapes, for backwards > > > compatibility. I'm not sure what you mean by 'mis-interpreted', but > > > if you mean that they are converted to IRIs, then yes, the current > > > draft allows this in those cases where it is possible (i.e. the > > > byte pattern matches UTF-8,...). But this misinterpretation does > > > not lead to an actual misinterpretation of the resource that the > > > IRI identifies. > > > > > > > > > >3) A new market-dominating user agent is launched which behaves as if (2) > > > >above > > > >were the case - i.e. there is an attempt to establish IRIs as the de facto > > > >default through market force, ignoring or discarding resulting errors of > > > >presentation or of resource identification. > > > > > > > >My big fear is that without rapid progress on (1), IRIs on the open > > Internet > > > >will only ever take off if someone does (3) - which will be without > > benefit > >of > > > >adequate standards backing. > > > > > > I'm not sure I understand you. Several browsers, for example > > > Opera and Safari, already implement IRIs. MS IE also does it > > > if the relevant flag is set correctly. And the standard is > > > close to done; this is the last real issue I'm trying to close. > > > So I don't see the problem. > > > > > > > > > >I'd love to either: > > > > > > > >a) be shown that my logic is faulty > > > > > > I guess yes. Not in theory, where absolute correctness is the > > > only goal, but in practice, where big numbers and deployment > > > are important. > > > > > > >or > > > > > > > >b) be pleasantly surprised by being told that there _is_ RFC work taking > > > >place > > > >on new schemes covering at least the space of http(s) > > > > > > Some schemes may benefit from an update, in particular those that > > > haven't thought about internationalization. The first example that > > > would come to my mind is the mailto: scheme. > > > > > > > > > Regards, Martin. > > > > > > > > > > > > >otherwise, I fail to understand how IRIs will 'take off' in the 'real > >world' - > > > >where they are so badly needed. > > > > > > > >Chris > > > > > > > > > > > > > > > > > > > >----- Original Message ----- > > > >From: "Michel Suignard" <michelsu@windows.microsoft.com> > > > >To: "Chris Haynes" <chris@harvington.org.uk> > > > >Cc: <public-iri@w3.org>; "Martin Duerst" <duerst@w3.org> > > > >Sent: Friday, May 07, 2004 1:43 AM > > > >Subject: RE: Migration of HTTP to the use of IRIs [queryclarify-16] > > > > > > > > > > > > > > > > > From: Chris Haynes > > > > > Sent: Thursday, May 06, 2004 4:50 AM > > > > > > > > > > Actually, my original core concern has now been covered in your > > > >section > > > > > 1.2.a - Applicability, where you make it clear that "the intent is not > > > >to > > > > > introduce IRIs into contexts that are not defined to accept them". > > > > > > > > > > This now makes it clear that new schemas will be required to replace > > > > > http: , https: etc. These will need to be self-identifying in some > > > >way, so > > > > > that receiving equipment will know that an IRI is being presented. > > > > > > > > > > So, as I commented last June, I await with interest the recognition > > > >among > > > > > those responsible for the HTTP schema that new schemas with new names > > > >are > > > > > required before IRIs can be used. > > > > > > > >I'd like to comment on that. The IRI spec is fairly explicit on that IRI > > > >can be used as presentation elements for URI protocol elements (ref > > > >clause 3 intro). This is to recognize that applications out there have > > > >not waited for us for creating presentation layers that use non ascii > > > >native characters for schemes that supposedly should not use them (such > > > >as http). The presentation layer principle is there to support that. So > > > >I expect IRI to be used for both purposes: > > > >- presentation layer for existing URI schemes > > > >- core layer for new schemes exclusively defined using IRI for protocol > > > >elements syntax. > > > > > > > >For a while I'd expect the vast majority of IRI usage to be in the first > > > >category. > > > > > > > >Michel > > > > > > > > > > > > > > > > > > > >
Received on Tuesday, 11 May 2004 03:29:56 UTC