Consequences of Not Sharing URI Syntax

Jim Gettys (jg@pa.dec.com)
Mon, 9 Feb 1998 11:22:56 -0800


Date: Mon, 9 Feb 1998 11:22:56 -0800
From: jg@pa.dec.com (Jim Gettys)
Message-Id: <9802091922.AA09426@pachyderm.pa.dec.com>
To: uri@Bunyip.Com
Subject: Consequences of Not Sharing URI Syntax


I hesitate to tread into this quagmire, but I much prefer referencing 
a URI document to duplicating material in the HTTP spec, which I will 
have to do if the URI document debate is not settled quickly, so I now 
join the fray.  IETF rules require that draft standards make normative 
references only to other draft standards, so I cannot use the proposed 
standard URL documents in the HTTP document; unless this is settled 
quickly, I won't have any option....

I did not have a strong opinion in the past, but I've been forced to 
form an opinion, and somewhat to my surprise, I now have a strong opinion.

I hope the material below makes the case well; I am very pressed for time 
to complete the HTTP spec soon, and have a deadline (or maybe liveline 
is a better term) looming that will limit the debating time I have 
available.  I will try to get a detailed critique of what I believe should 
change in Roy's draft in the next day or two.

 			- Jim Gettys


Introduction
------------

The importance of sharing particular pieces of URI syntax has never been 
well understood or documented.  Most URI design has been based on existing 
practice, and usually shares most of the generic syntax; there have been 
exceptions.  Both intuition and reduction of code required to implement 
new URI schemes have encouraged significant uniformity of design, but 
recent understanding, and I hope this document, show that it is vital 
to share as much syntax as possible between URI schemes. 

Without an understanding of the consequences of particular choices, 
however, it has been unclear to designers of new URI schemes whether a 
particular piece of syntax is appropriate to their application, and the 
consequences of not sharing a particular piece of syntax have not been 
clear, so the resulting URI syntax design has sometimes been poor. 

Both from a review of discussions on the URI mailing list as part of being 
asked about the proposed URI syntax and semantics specification, and 
as part of a meeting I recently attended, I've come to realize that there 
are two quite subtle consequences to (not) sharing components of URI 
syntax that have profound impact on the future evolutionary flexibility for 
the World Wide Web. 

     Note: NOTHING I am saying is different for any URx.

Documenting these issues to guide those involved in URI scheme design 
has become vital to future World Wide Web evolution. 

These consequences have to do, generically, with: 

     o Constraints imposed by Content on the Web
     o Constraints imposed by need for information hiding, to enable 
     software in the Web to remain modular and extensible.

There have been two views of URI syntax: 

   1) more or less ``anything goes'' after the colon
   2) (more) ``uniform'' sharing of URI syntax to the extent it may make 
     sense

The fundamental problem has been to distinguish the merits of each 
approach.  The strongest arguments for "anything goes" have been 

     o the constraints of the syntax can make things difficult for scheme 
     designers
     o existing syntax of identifiers can be adopted without further 
     thought, which are already in widespread use and familiar to those 
     who use them

The strongest arguments for the "uniform" approach that I had previously 
seen were: 

     o general simplicity
     o fewer parsers to build
     o and general design intuition that uniformity is better than chaos

While I have generally preferred the "uniform" approach, I did not have 
a strong opinion.  If this document succeeds in its intent, however, you 
will decide that the uniform approach is not only desirable, but vital 
for long term Web architecture. 

So what are the consequences of following each path? 

View of URI syntax as a Class Hierarchy
---------------------------------------

One way of framing the discussion is to view URI syntax as though it 
were an object hierarchy. Then there is a set of methods that can be 
applied to a URI string: 

     Scheme(URI),
     Fragment(URI),
     Relpath(URI), etc.

Note that not all methods necessarily apply to a particular scheme 
(analogous to an unimplemented method), and some schemes might 
define additional methods (analogous to subclassing). 

In these terms, the debate can be framed as: 

     o whether different URx's inherit from Object (``anything goes''),
     o or if they inherit from a basic ``uniform'' URI syntax.
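
The two alternatives might be sketched in Python roughly as follows. This is 
only an illustration of the analogy: the class names (OpaqueURI, GenericURI, 
HierarchicalURI) are hypothetical, the parsing is deliberately naive, and 
relpath() handles nothing more than replacing the last path segment.

```python
class OpaqueURI:
    """'Anything goes': only the scheme is known; the rest is opaque."""

    def __init__(self, text):
        self.text = text

    def scheme(self):
        # Everything before the first colon.
        return self.text.partition(":")[0]


class GenericURI(OpaqueURI):
    """'Uniform': fragment handling is shared by every scheme built on this."""

    def fragment(self):
        # Everything after the first '#', or None when there is no fragment.
        _, sep, frag = self.text.partition("#")
        return frag if sep else None

    def relpath(self, reference):
        # Analogous to an unimplemented method: a scheme without
        # hierarchical paths simply does not provide this.
        raise NotImplementedError("relative paths do not apply to this scheme")


class HierarchicalURI(GenericURI):
    """A scheme that also has '/'-separated paths (naive illustration)."""

    def relpath(self, reference):
        # Replace the last path segment of the base with the reference.
        base = self.text.partition("#")[0]
        return base.rsplit("/", 1)[0] + "/" + reference


doc = HierarchicalURI("http://www.w3.org/foo/bar/baz.html#foo")
print(doc.scheme(), doc.fragment(), doc.relpath("qux.html"))
```

Code written against GenericURI can ask any URI for its fragment without 
knowing the scheme; code written against OpaqueURI can ask only for the 
scheme and must learn everything else scheme by scheme.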

Class hierarchy design is known to be difficult! How do we evaluate the 
choice? 

Consequences of URI's Being Embedded in Content
-----------------------------------------------

The utility of embedding links into documents is certainly now clear to 
the world.  But the fact that links are internally embedded into many 
data types (e.g. HTML, XML, Microsoft Word, Adobe PDF, etc.) has 
consequences.  Note that below I mean "naming authority" to be the 
scheme-specific delegatee of part of a name space; for example, the 
www.w3.org in the URL http://www.w3.org/foo/bar/baz.html. 

     o If fragment syntax (to the extent of understanding that the URI 
     is a fragment) isn't shared between two schemes (e.g. 
     <a href="#foo">), you can't move individual, completely 
     self-referential documents between schemes without rewriting the 
     document.  In the Web, the fragment syntax is a property of the 
     media type, and is evaluated by the client.
     o If fragment syntax is not shared between different media types of 
     the same capability (e.g. HTML, XML, Word, or image types 
     such as GIF, JPEG, PNG), then you can't have a URI reference 
     that can evolve to superior media types as they become available, 
     or that is even likely to work properly today with content 
     negotiation.
     o If relative syntax (to the extent of understanding that the URI 
     is relative, and what part of the URI string is relative) isn't 
     shared between two schemes (e.g. <a href="foo">), you can't 
     move sets of documents that are internally self-referential 
     between schemes without rewriting.
     o If ".." syntax as a path component in relative URI's isn't shared 
     between schemes, you can't easily have sets of document sets and 
     refer to them between schemes without rewriting.
     o If / syntax (to the extent of understanding that the URI refers to a 
     path relative to the current naming authority) isn't shared, you 
     can't easily move multiple sets of documents up or down in a 
     relative hierarchy of names and share a common set of documents 
     between them, either within that scheme or between schemes, 
     without rewriting the content.  The best example is a site that 
     has a common set of GIF, JPEG and PNG images, where you want to 
     reorganize the site by changing the depth of a subtree, or by 
     moving it from one directory to another where the depth isn't 
     the same.
     o If naming authority syntax (e.g. what comes after "//" in most URL 
     schemes) and relative path syntax (to the extent of understanding 
     that the URI has a naming authority, and what part of the URI 
     string is the naming authority vs. path) aren't shared between 
     two schemes, you can't share identical name spaces and serve 
     them up via different schemes.  (The naming authority syntax is 
     a property of the scheme.)  The fact that HTTP and FTP have the 
     same syntax, for example, has often been exploited by sites 
     transitioning from FTP archive service to HTTP archive service, 
     so that the URL's can be identical between schemes except for 
     the scheme; the same content can be served via two schemes 
     simultaneously.
     o Query syntax (to the extent of understanding that the URI has a 
     query, and what part of the URI string is the query) raises the 
     same question of sharing between two schemes; here the syntax 
     is a property of the server, rather than the client.
     o There are a few other pieces of URI path syntax for which this 
     document does not explore the consequences, but I think you can 
     work it out for yourself, given these examples.
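
The cases above can be tried out with a small Python sketch. It relies on 
urljoin from the standard urllib.parse module, which treats http and ftp as 
sharing the generic relative syntax; the link targets other than 
/foo/bar/baz.html (qux.html, ../images/logo.gif, /icons/home.gif) are 
made-up names for illustration only.

```python
from urllib.parse import urljoin

# Links as they might appear inside a document; none of them names a scheme.
EMBEDDED_LINKS = ["#foo", "qux.html", "../images/logo.gif", "/icons/home.gif"]


def resolve_all(base, links=EMBEDDED_LINKS):
    """Resolve each embedded link against the URL the document came from."""
    return [urljoin(base, link) for link in links]


# The same unmodified document, served under two different schemes.
via_http = resolve_all("http://www.w3.org/foo/bar/baz.html")
via_ftp = resolve_all("ftp://www.w3.org/foo/bar/baz.html")

for http_url, ftp_url in zip(via_http, via_ftp):
    print(http_url, "|", ftp_url)
```

Because fragment, relative, ".." and "/" syntax are shared, every link 
resolves to the same place under either scheme, differing only in the scheme 
name, and the document itself never has to be rewritten.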

Digital Signatures
------------------

Digital signatures on content will increase even further the importance 
of maintaining bit-for-bit integrity of content. Original signatures may 
require a private key only available at the time of signing, and may or 
may not be embedded into content in the same fashion as URI's.  Therefore 
as signature technology is deployed, if syntax differs gratuitously 
between schemes, it will strongly discourage making old content available 
via new schemes that might be deployed. 
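
A rough Python illustration of the point, under stated assumptions: a 
SHA-256 digest stands in for whatever a real signature would cover, the 
sample markup is invented, and the "newscheme:" link form is hypothetical.

```python
import hashlib


def digest(content: bytes) -> str:
    """Hex SHA-256 digest, standing in for what a signature would cover."""
    return hashlib.sha256(content).hexdigest()


# Content whose only link is relative, so it names no scheme or host.
original = b'<a href="../images/logo.gif">logo</a>'

# Serving the same bytes over another scheme leaves the digest intact.
served_via_other_scheme = bytes(original)

# If the new scheme did not share relative syntax, the link would have to
# be rewritten (the "newscheme:" form is invented for illustration).
rewritten = original.replace(b"../images/logo.gif", b"newscheme:images/logo.gif")

print(digest(original) == digest(served_via_other_scheme))  # True
print(digest(original) == digest(rewritten))                # False
```

Once the content has to be rewritten, the original signature no longer 
matches, and re-signing needs the private key, which may no longer be 
available.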

Impact on Opacity of Interfaces
-------------------------------

     o If fragment syntax is not solely media type dependent (e.g. 
     depends on the scheme), then introducing a new scheme would 
     (potentially) require that each media viewer be updated for that 
     scheme.  This is likely to be a prohibitive amount of work.
     o Similarly, to be able to introduce new schemes into the web, 
     without having to modify all URI access code in applications, the 
     URI parsing code in applications must be able to remove the 
     fragment from the base URI, or it will have to be updated for 
     each scheme.
     o Likewise, relative URI parsing and following of links cannot be 
     independent of scheme unless relative URI syntax is shared; 
     otherwise user agents and other programs that follow relative 
     links would have to be updated for each new scheme introduced.

These examples show that unless syntax is shared, new schemes will be 
very hard to introduce into the Web.
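
As a minimal sketch of what shared fragment syntax buys, the following 
Python function removes the fragment without knowing anything about the 
scheme. The function name and the "newscheme:" example are hypothetical, and 
the code assumes only that the first "#" introduces the fragment.

```python
def split_fragment(uri_reference):
    """Split 'base#fragment' at the first '#', with no knowledge of the scheme.

    Returns (base, fragment); fragment is None when no '#' is present.
    """
    base, sep, fragment = uri_reference.partition("#")
    return base, (fragment if sep else None)


# An existing scheme and a scheme this code has never heard of are handled
# by exactly the same code path.
for reference in (
    "http://www.w3.org/foo/bar/baz.html#foo",
    "newscheme:some/opaque/name#foo",
    "ftp://www.w3.org/foo/bar/baz.html",
):
    print(split_fragment(reference))
```

The access code would fetch the base, and the media viewer would interpret 
the fragment; neither needs to change when a new scheme appears. If each 
scheme marked fragments differently, this function would need a table of 
per-scheme rules that had to be extended for every new scheme.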

Conclusions
-----------

The sections above show that the more sharing of basic URI syntax there 
is, the more likely it is that (a set of) complex objects can be 
transported unmodified between different schemes (e.g. FTP to HTTP to 
HTTP/NG to URN, and to other schemes).  Similarly, content can evolve to 
more useful types without breaking URI references if fragment syntax is 
shared among related content types (e.g. named anchors in documents).  
Digital signatures on content will further increase the importance of 
maintaining bit-for-bit integrity of content. 

Some naming systems lack the semantic meanings covered by the commonly 
used URI syntax, and sometimes those naming systems provide additional 
semantic meaning for those systems.  For those naming systems in which 
parts of the URI syntax do not apply, it is clearly acceptable in my view 
to ignore that part of the syntax. I hope this document convinces you, 
however, that where the semantic meanings of name components are 
identical, mapping them into a common URI syntax in fact has major medium 
and long term benefits to the World Wide Web.  For those who are working 
on facilities which add new semantic meanings that might be shared between 
schemes, I hope this document convinces you it is worth working on defining 
what that common syntax should be. 

If the same content cannot be served up under alternate schemes, or moved 
to future schemes used in the Web, it will greatly inhibit introduction 
of new schemes into the Web.  If Web software cannot be written without 
intimate intertwining of knowledge between components, and therefore must 
be updated whenever new schemes or content types are introduced, it will 
greatly inhibit introduction of new schemes and software into the Web. 

If URI syntax, therefore, is gratuitously different for the same semantic 
meaning, it will strongly discourage future innovation in the World Wide 
Web. The more random URI syntax is between schemes, the more Web evolution 
will be inhibited, and the more programmers and protocol designers we'll 
keep employed kludging around... (job security!).  But since there is enough 
work to go around in the Web, I believe it is clear that unity of URI 
syntax for semantically equivalent constructions is essential for the 
future health of the World Wide Web. 

A single URI specification that covers general URI syntax, along with 
guidance on how to design new URI schemes (and the consequences of different 
design decisions), probably as a separate new document, is preferable to 
splitting the URI spec into several specifications (e.g. scheme, vs. 
independent URL and URN specs).  Each URI scheme should be able to reference 
this single syntax and semantics specification, and in doing so make clear 
which components of the generic URI syntax apply to that scheme (and which 
components do not!).  The November draft submitted by Fielding is closest 
to this model, but does need some further work; e.g. the host part of the 
document needs clear delineation from the rest of the URI spec, so that it 
is clear that this is additional syntax which is common to a number of 
schemes, but not at all inherent in URI syntax. 


Jim Gettys 
Digital Equipment Corporation 
Visiting Scientist, World Wide Web Consortium