- From: Chris Haynes <chris@harvington.org.uk>
- Date: Sun, 23 Feb 2003 10:39:43 -0000
- To: <uri@w3.org>, "Martin Duerst" <duerst@w3.org>, "Tim Bray" <tbray@textuality.com>
- Cc: "WWW-Tag" <www-tag@w3.org>
Martin Duerst replied to Tim Bray:

> > Personally, if I were rewriting 2396 I would simply decree that all
> > %-escaping be done on the UTF-8 mapping and only the UTF-8 mapping,
>
> I would be the first to agree with that. Unfortunately, it's not
> as easy as that, because there is some legacy out there.

--------------------------

I've spent many days thinking through this problem.

My personal conclusion is that the only 'correct and complete' way out of this hole will have to meet the following requirements:

1) Any 'new' URI-RFC _has_ to mandate that:
   "All %-escaping be done on the UTF-8 mapping and only the UTF-8 mapping"

2) All 'New' schemes are distinguishable by some kind of syntactic tagging,

3) In URIs not containing that tagging the old 2396 rules (or lack thereof) continue to apply.

Now, to meet these requirements we need a method of indicating that an instance of a URI conforms to the 'new' URI-RFC.

I can think of three possibilities:

Option 1 - URI tagging using a 2396-illegal character sequence
-----------------------------------------------------------------

In Option 1, URIs conforming to the 'new, common rules' are marked by placing a character sequence which is illegal in 2396 (such as %II) in a well-known place in the syntax.

URI-receivers which are aware of the new RFC will detect the marker and interpret the URI accordingly. Those which do not understand it may break or corrupt the data in some way.

Option 2 - A totally new URI syntax
-----------------------------------

In this we use some syntactic formulation, such as

    http[URI/1.0]://host....

which cannot occur in 2396. The value in the brackets is the version number of the URI encoding scheme (making it possible to make future changes as well).

Note that the [URI/n.n] would be part of the URI syntax, not something done on a scheme-by-scheme basis.
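To make Option 2 concrete, here is a minimal Python sketch of how a receiver might separate the version tag from the rest of the URI. The `[URI/n.n]` form is only the illustrative syntax from this message, and the function name `split_version_tag` and the regular expression are assumptions of this sketch, not part of any RFC.

```python
import re

# Hypothetical pattern for the Option 2 syntax sketched above:
# a scheme name, an optional [URI/n.n] tag, a colon, then everything else.
_TAGGED_URI = re.compile(
    r"^(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)"
    r"(?:\[URI/(?P<version>\d+\.\d+)\])?"
    r":(?P<rest>.*)$",
    re.DOTALL,
)


def split_version_tag(uri):
    """Return (scheme, version, rest).

    version is None when the URI carries no tag, in which case the
    old 2396 rules (or lack thereof) would continue to apply.
    """
    match = _TAGGED_URI.match(uri)
    if match is None:
        raise ValueError("not a recognisable URI: %r" % uri)
    return match.group("scheme").lower(), match.group("version"), match.group("rest")
```

An untagged URI yields a version of None, which is how a receiver written this way would tell 'old' URIs from 'new' ones (requirement 3).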

Option 3 - Scheme-by-scheme changes
-----------------------------------

Under this approach individual URI scheme users 'call up' the new rules on a scheme-by-scheme basis, e.g. for HTTP we introduce two new schemes:

    httpi:
    httpis:

which are 'Internationalized' protocols and which mandate the UTF-8 encoding (and other internationalization features?). Other schemes (ftpi: and so on) would follow.

All the individual new schemes need to do is invoke as 'mandatory' some rules defined in the new, common URI-RFC.

In HTTP-land I think they would have to issue a new HTTP-RFC, defining a new version of HTTP (1.3?) which would accept all four scheme names.

Discussion
-------------

The only advantage Option 1 has is that (in the RFC-world) it only needs a single RFC change to implement. Other than that I regard it as too ugly for words (but as it meets the requirements I included it).

Option 2 is marginally more elegant, and is better 'engineering' in that we now have a protocol whose instances identify their versions. But, as this protocol is used directly by humans, it might be thought to be a little inconvenient (it's 9 additional, not very memorable characters to be remembered from the side of a bus). It would also be more of a 'shock' to legacy receivers which were not expecting it.

My own preference is for Option 3 - new schemes.

Rationale for choosing Option 3
-------------------------------

The '%hh encoding is unspecified' disaster is so pervasive, and there are so many different, incompatible legacy adaptations and work-arounds, that I think it's inconceivable that we can find a way out without replacing in some way all the 'tributary' RFCs which call up URIs.

My belief is that we have reached the limits of ingenious work-arounds and syntactic contortions - and have to face this Internationalization issue head-on across the whole of the IETF and W3C with a robust, properly-engineered solution.

The 'communities' who make the different uses of URIs (browser developers, web server developers, FTP developers etc.) are going to have to change their products to use the new URI, so having a new scheme name and new RFCs as part of that community change does not seem too onerous a task.

The HTTP scheme names I have suggested (httpi and httpis):

a) Are reasonably human-friendly,
b) Indicate their purpose (i for International),
c) Maintain the visual significance of security (the terminating 's').

Obviously, it is for the HTTP community (and the like) to decide on and register their new schemes; the URI-RFC community is above such detail. I only suggest names as part of this feasibility exploration.

Implementation and Migration
------------------------------

The most demanding test of the viability of this approach that I can think of (in the HTTP world) is to consider what happens if a 'new' browser sends a request to an 'old' web server.

If the URI has characters which need to be encoded (or makes use of other features provided by the new URI-RFC), the browser first attempts a connection thus:

    GET httpi://www.host.com/user/D%FCrst HTTP/1.3

Note that the HTTP version is (the as-yet-fictitious) 1.3 - which, I assume, declares knowledge of httpi: and hence the new URI encoding.

If the server understands HTTP/1.3, and hence httpi: and the new encoding, it can cope and respond normally.

If the server does not understand HTTP/1.3 it should reply with a 505 'HTTP Version Not Supported' error.

The browser must now decide what to do. There can be no single, correct behaviour if the requested URI contains characters which are unrepresentable in 2396.

It might decide to try:

    GET http://www.host.com/user/D%FCrst HTTP/1.1

or it might tell the user it can't cope.
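
As an illustration of the browser's choice, the following Python sketch builds both request lines for a given path. It is only one possible reading of the scenario: the function name `request_lines` is hypothetical, and ISO-8859-1 is assumed as the legacy charset purely for the example. Note that %FC is the single ISO-8859-1 byte for 'ü', whereas the UTF-8 mapping of the same character produces %C3%BC, so under the 'UTF-8 only' rule the two requests would differ in their escaped form as well as in scheme and version.

```python
from urllib.parse import quote


def request_lines(host, path, legacy_charset="iso-8859-1"):
    """Return (new_request, fallback_request_or_None).

    new_request uses the proposed httpi: scheme with UTF-8 %-escaping.
    The fallback uses http: with the assumed legacy charset; it is None
    when the path cannot be represented in that charset, i.e. the case
    where the browser has to tell the user it can't cope.
    """
    new_request = "GET httpi://%s%s HTTP/1.3" % (host, quote(path, encoding="utf-8"))
    try:
        legacy_path = quote(path, encoding=legacy_charset, errors="strict")
    except UnicodeEncodeError:
        return new_request, None
    return new_request, "GET http://%s%s HTTP/1.1" % (host, legacy_path)
```

A browser written along these lines would send the first request, and only on a 505 reply consider the second; which legacy charset to guess is exactly the part that has no single correct answer.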

Sundry thoughts and issues
---------------------------

In HTTP practice, one could still use port 80 as the default for both http: and httpi: and 443 for https: and httpis:

Subsidiary scheme names like 'mailto' evolving to 'mailtoi' or 'imailto' would be dependent on an RFC821-replacement, and so on.

Chris Haynes
Retired Chartered Engineer
Harvington, Evesham, UK
Received on Sunday, 23 February 2003 05:46:31 UTC