Re: "How to Compare URIs" update 3

Martin Duerst replied to Tim Bray:
>
> > Personally, if I were rewriting 2396 I would simply decree that
> > all %-escaping be done on the UTF-8 mapping and only the UTF-8
> > mapping,
>
> I would be the first to agree with that. Unfortunately, it's not
> as easy as that, because there is some legacy out there.
>
--------------------------

I've spent many days thinking through this problem.

My personal conclusion is that the only 'correct and complete' way out
of this hole will have to meet the following requirements:

1) Any 'new' URI-RFC _has_ to mandate that:
"All %-escaping be done on the UTF-8 mapping and only the UTF-8
mapping" (see the sketch after this list);

2) All 'new' schemes are distinguishable by some kind of syntactic
tagging;

3) In URIs not containing that tagging, the old 2396 rules (or lack
thereof) continue to apply.


Now, to meet these requirements we need a method of indicating that an
instance of a URI conforms to the 'new' URI-RFC.

I can think of three possibilities:


Option 1 - URI tagging using a 2396-illegal character sequence
-----------------------------------------------------------------

In Option 1, URIs conforming to the 'new, common rules' are marked by
placing a character sequence which is illegal in 2396 (such as %II,
illegal because 'I' is not a hex digit) in a well-known place in the
syntax.

URI-receivers which are aware of the new RFC will detect the marker
and interpret the URI accordingly.

Those which do not understand it may break or corrupt the data in some
way.
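
As a rough illustration (both the %II marker and its position after
'://' are only example choices here), a receiver aware of the new RFC
might test for the marker like this:

    # Hypothetical Option 1 check: look for the 2396-illegal marker
    # "%II" in a well-known place, here immediately after "://".
    def uses_new_rules(uri: str) -> bool:
        sep = uri.find("://")
        return sep != -1 and uri[sep + 3:].startswith("%II")

    print(uses_new_rules("http://%IIwww.host.com/"))   # True
    print(uses_new_rules("http://www.host.com/"))      # False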
------------------------


Option 2 - A totally new URI syntax
-----------------------------------

In this option we use some syntactic formulation, such as

http[URI/1.0]://host....

which cannot occur in 2396. The value in the brackets is the version
number of the URI encoding scheme (making it possible to make future
changes as well).

Note that the [URI/n.n] would be part of the URI syntax, not something
done on a scheme-by-scheme basis.
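
A receiver could strip the tag mechanically. A hypothetical sketch,
assuming the bracket syntax above (nothing here is a registered
format):

    # Split an Option 2 URI into scheme, version tag and the
    # equivalent untagged URI; return None for a plain 2396 URI.
    import re

    TAGGED = re.compile(r"^([a-z][a-z0-9+.-]*)\[URI/(\d+)\.(\d+)\](://.*)$")

    def split_versioned(uri):
        m = TAGGED.match(uri)
        if m is None:
            return None                    # old rules apply
        scheme, major, minor, rest = m.groups()
        return scheme, (int(major), int(minor)), scheme + rest

    print(split_versioned("http[URI/1.0]://www.host.com/"))
    # -> ('http', (1, 0), 'http://www.host.com/')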


Option 3 - Scheme-by-scheme changes
-----------------------------------

Under this approach, individual URI scheme users 'call up' the new
rules on a scheme-by-scheme basis, e.g. for HTTP we introduce two new
schemes:

httpi:
httpis:

which are 'Internationalized' protocols and which mandate the UTF-8
encoding (and other internationalization features?).

Other schemes ( ftpi: and so on ) would follow.

All the individual new schemes need to do is invoke as 'mandatory'
some rules defined in the new, common URI-RFC.

In HTTP-land I think they would have to issue a new HTTP-RFC, defining
a new version of HTTP (1.3?) which would accept all four scheme names.
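
As a sketch of how little each new scheme needs to carry, one can
imagine the scheme name simply selecting an escaping rule (httpi and
httpis are my suggested names, not registered schemes):

    # Illustrative Option 3 dispatch: each scheme 'calls up' its
    # escaping rule; only the new schemes invoke the common URI-RFC.
    from urllib.parse import quote

    SCHEME_ENCODING = {
        "http":   None,      # 2396: %hh mapping left unspecified
        "https":  None,
        "httpi":  "utf-8",   # new scheme: UTF-8 mapping mandatory
        "httpis": "utf-8",
    }

    def escape_path(scheme: str, path: str) -> str:
        enc = SCHEME_ENCODING[scheme]
        if enc is None:
            raise ValueError(scheme + ": legacy scheme, mapping undefined")
        return quote(path, safe="/", encoding=enc)

    print(escape_path("httpi", "/user/D\u00fcrst"))   # /user/D%C3%BCrst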

Discussion
-------------

The only advantage Option 1 has is that (in the RFC-world) it only
needs a single RFC change to implement.  Other than that I regard it
as too ugly for words (but as it meets the requirements I included
it).

Option 2 is marginally more elegant, and is better 'engineering' in
that we now have a protocol whose instances identify their versions.
But, as this protocol is used directly by humans, it might be thought
to be a little inconvenient (it's 9 additional, not very memorable
characters to be remembered from the side of a bus). It would also be
more of a 'shock' to legacy receivers which were not expecting it.



My own preference is for Option 3 - new schemes.

Rationale for choosing Option 3
-------------------------------

The '%hh encoding is unspecified' disaster is so pervasive, and there
are so many different, incompatible legacy adaptations and
work-arounds, that I think it's inconceivable that we can find a way
out without replacing, in some way, all the 'tributary' RFCs which
call up URIs.

My belief is that we have reached the limits of ingenious work-arounds
and syntactic contortions - and have to face this Internationalization
issue head-on across the whole of the IETF and W3C with a robust,
properly-engineered solution.

The 'communities' who make the different uses of URIs (browser
developers, web server developers, FTP developers etc. ) are going to
have to change their products to use the new URI, so having a new
scheme name and new RFCs as part of that community change does not
seem too onerous a task.

The HTTP scheme names I have suggested (httpi and httpis):
a) Are reasonably human-friendly;
b) Indicate their purpose ('i' for International);
c) Maintain the visual significance of security (the terminating 's').

Obviously, it is for the HTTP community (and the like) to decide on
and register their new schemes; the URI-RFC community is above such
detail.  I only suggest names as part of this feasibility exploration.


Implementation and Migration
------------------------------

The most demanding test of the viability of this approach that I can
think of (in the HTTP world) is to consider what happens if a 'new'
browser sends a request to an 'old' web server.

If the URI has characters which need to be encoded (or makes use of
other features provided by the new URI-RFC), the browser first
attempts a connection thus:

    GET httpi://www.host.com/user/D%C3%BCrst HTTP/1.3

Note that the HTTP version is the as-yet-fictitious 1.3, which, I
assume, declares knowledge of httpi: and hence the new URI encoding.
(Note also that %C3%BC is the UTF-8 mapping of the u-umlaut, as the
new rules require.)

If the server understands HTTP/1.3, and hence httpi: and the new
encoding, it can cope and respond normally.

If the server does not understand HTTP/1.3 it should reply with a
    505 HTTP Version Not Supported
error.

The browser must now decide what to do. There can be no single,
correct behaviour if the requested URI contains characters which are
unrepresentable in 2396.

It might decide to retry with the legacy scheme, guessing a legacy
%hh mapping (here Latin-1):
    GET http://www.host.com/user/D%FCrst HTTP/1.1

or it might tell the user it can't cope.
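
Sketched as code (everything named here is hypothetical: the 1.3
version string, the httpi: scheme, and send(), which stands in for a
real HTTP client returning a numeric status code):

    # Browser-side fallback for the migration case above.
    def fetch(send, host, utf8_path, legacy_path=None):
        status = send("GET httpi://" + host + utf8_path + " HTTP/1.3")
        if status != 505:
            return status                  # server knows HTTP/1.3, httpi:
        if legacy_path is None:
            raise RuntimeError("URI not representable under 2396 rules")
        # Retry with the old scheme and a guessed legacy %hh mapping.
        return send("GET http://" + host + legacy_path + " HTTP/1.1")

    # e.g. fetch(send, "www.host.com", "/user/D%C3%BCrst", "/user/D%FCrst")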


Sundry thoughts and issues
---------------------------

In HTTP practice, one could still use port 80 as the default for both
http: and httpi:, and 443 for https: and httpis:.
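
As a configuration sketch (again with the suggested, unregistered
scheme names):

    # Default ports under this suggestion; the new schemes reuse
    # the ports of their legacy counterparts.
    DEFAULT_PORT = {"http": 80, "httpi": 80, "https": 443, "httpis": 443}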

Subsidiary scheme names, like 'mailto' evolving to 'mailtoi' or
'imailto', would depend on an RFC 821 replacement, and so on.


Chris Haynes
Retired Chartered Engineer
Harvington, Evesham, UK

Received on Sunday, 23 February 2003 05:46:31 UTC