What I

Michael Mealling (Michael.Mealling@oit.gatech.edu)
Wed, 15 Mar 1995 12:40:52 -0500 (EST)

From: Michael.Mealling@oit.gatech.edu (Michael Mealling)
Message-Id: <199503151740.MAA12803@oit.gatech.edu>
Subject: What I
To: uri@bunyip.com
Date: Wed, 15 Mar 1995 12:40:52 -0500 (EST)

This is a draft of a whitepaper on what I plan on talking about in Danvers.
It's a proposal that I and a few others are implementing now. I don't know
if we will have anything demoable in Danvers or not. This is finals week
and I'm basically going insane. ;-)


             Support for Uniform Resource Characteristics within a 
                      Simple Internet Directory Service 
                                Michael Mealling

The explosive growth of the Internet in recent years has prompted the
need for standard distributed database functions such as caching and
replication. In classical distributed databases information about a given
object was maintained along with that object. The Internet currently has
no method for tying what is termed metadata, or data about data, to the
object which it is describing. This need has prompted the IETF to begin
work on a standard form for structuring this metadata. This entity is
called a Uniform Resource Characteristic.

The Uniform Resource Characteristic, or URC, is of little use if there is no
distributed database into which it can be inserted, updated and retrieved.
Therefore some system must be created that allows users to easily
publish information about their resources, update that information and
easily find the given resource again upon request. This would be fairly
easy if it were not for three factors: 1) URCs can potentially contain nested
information that is bound to some subelement instead of the object itself;
2) nested substructures and particular instances of resources must be
maintained by extremely distributed authorities; and 3) the number of
resources on the network is so large that any system that implemented full
searching would have severe scaling problems.

The first issue arises very often, even when only simple information is included
in the URC. For example, by including an identifier for the publisher within
the URC we now have a case where the URC contains information that is
not bound to the object described. There is an extraneous relationship, but
it tends to pollute the database to the degree that complex query
languages (e.g. DSSSL, SQL) must be used to separate nesting levels.

One very complex and common example is the case of digital signatures over
subsets of the URC. If a document is archived at two remote locations, each
location may want to sign this fact with its own key. This causes very
extraneous information to make its way into a system that already has scaling
problems.

The second issue, nested and distributed authority, leads to the very
disturbing fact that it is impossible to ever have the 'complete' URC
for a given resource. Therefore a system must be able to deal with
URC information being inherently distributed. Building a distributed
database where each object itself can be distributed on the scales existing
on the Internet is inherently problematic.

Scalability of such distributed database systems requires some query
routing system that causes searches to be done on information sets that
are worth the overhead. While the area of query routing has had some
research, the issue of nested structures from above suggests that some
operations need to be done on the metadata to 'filter' out the extraneous
information. This filtered information makes a good candidate for
indexed forward knowledge which allows global searches to be pruned based
on whether or not the forward knowledge indicates a search that is useful.

A fourth anecdotal issue is the problem of software use on the Internet.
Any system that is fundamentally too complex is often shunned in favor of
systems that are simple to install and administer. A good example is the
almost complete lack of use of X.500 based systems by organizations in
the Internet community. One of the reasons that Usenet News and DNS
are so prevalent is that they are easily installed and maintained.

The proposed solution being developed by several different parties around
the Internet is to utilize a system being developed within the IETF for
a simple Internet directory service. This service is primarily based on
template oriented information systems such as SOLO and whois++. These
systems create what is called a centroid or unique word list of all words
appearing in an attribute. This centroid is then passed up to more generalized
index servers. These index servers then serve as pruning points for decisions
about whether or not a search is useful.
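As a rough illustration of the centroid idea described above, the following
Python sketch builds a per-attribute unique word list from a handful of
templates and lets an index server use it to decide whether forwarding a
query is worthwhile. The record contents, server names, and the function and
class names are all invented for illustration; none of them come from the
whois++ or SOLO specifications.

```python
from collections import defaultdict


def build_centroid(templates):
    """Return {attribute: set of unique lowercase words} for a list of templates."""
    centroid = defaultdict(set)
    for template in templates:
        for attribute, value in template.items():
            centroid[attribute.lower()].update(value.lower().split())
    return dict(centroid)


class IndexServer:
    """Holds centroids passed up from lower-level servers (hypothetical class)."""

    def __init__(self):
        self.centroids = {}

    def register(self, server_name, centroid):
        self.centroids[server_name] = centroid

    def servers_worth_asking(self, attribute, word):
        """Prune the search: keep only servers whose centroid contains the word."""
        attribute, word = attribute.lower(), word.lower()
        return sorted(
            name
            for name, centroid in self.centroids.items()
            if word in centroid.get(attribute, set())
        )


# Made-up templates held by two made-up servers.
server_a = [{"Title": "Campus Network Guide", "Publisher": "Example Press"}]
server_b = [{"Title": "Directory Service Notes", "Publisher": "Sample House"}]

index = IndexServer()
index.register("server-a", build_centroid(server_a))
index.register("server-b", build_centroid(server_b))
```

With this toy data, `index.servers_worth_asking("Title", "directory")` names
only `server-b`, so a query for that word never has to reach `server-a`.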
The one commonality of these systems is that they are template
(i.e. attribute/value) based. This on the surface would appear limiting
but in the case of URCs provides a scoping function that makes the problem
much more manageable. The proposed solution is to filter a given URC so that
it can be fit unambiguously into a template ready for indexing via whois++.
This provides the filtering factor and punts the issues of management and
distributed authority out of the system used for search and retrieval.
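The filtering step might look something like the sketch below. It assumes,
purely for illustration, that a URC can be represented as a nested Python
dictionary; the attribute names, the `keep` list and the sample record are
hypothetical and not part of any URC draft. Only top-level values bound to
the object itself survive, while nested substructures such as publisher
records or signatures stay behind in the full URC.

```python
def filter_urc(urc, keep=("id", "title", "author", "subject")):
    """Flatten a nested URC (assumed dict form) into an attribute/value template.

    Only top-level string values named in `keep` are copied; anything
    nested (publisher records, signatures, ...) is left out of the template.
    """
    template = {}
    for attribute in keep:
        value = urc.get(attribute)
        if isinstance(value, str):
            template[attribute] = value
    return template


# Made-up URC with nested information bound to sub-elements.
example_urc = {
    "id": "example-doc-1",
    "title": "Directory Service Notes",
    "author": "A. Writer",
    "publisher": {"name": "Sample House", "id": "pub-42"},
    "signatures": [{"signer": "archive-one", "covers": ["title"]}],
}

template = filter_urc(example_urc)
```

The resulting `template` holds just the `id`, `title` and `author` values,
which is the kind of flat record a centroid can be computed over.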
The problem this causes is that for the user the important information
is still stored in the URC and not the resulting template. This means that
the final result of the search should be a fully formatted and syntactically
rich URC. Therefore there needs to be some method by which the
whois++ query can finally resolve into something that isn't a template.

The final solution appears to be the addition of a MODE command to whois++
(and possibly SOLO) that allows the client doing the query to request
that the server switch protocols (and thus query language and data format)
entirely. This allows the same connection to issue the query in whois++,
switch MODES to something much more capable, and then reissue a richer
query that returns a much richer data element: the final URC.
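To make the flow concrete, here is a small Python sketch of a single
connection that starts out answering template searches and then switches
mode to return full URCs. The `MODE <name>` syntax, the mode name `urc`,
the bare-word search and the response tuples are assumptions made for this
example only; the text above does not define the command's syntax, and this
is not real whois++ wire format.

```python
class Session:
    """One client connection to a hypothetical multi-protocol server."""

    def __init__(self, templates, urcs):
        self.templates = templates  # id -> flat template (search view)
        self.urcs = urcs            # id -> full URC (rich view)
        self.mode = "whois++"

    def handle(self, line):
        """Return a (status, payload) tuple for one request line."""
        words = line.split()
        if not words:
            return ("ERROR", "empty request")
        if words[0].upper() == "MODE":  # assumed syntax: MODE <name>
            if len(words) != 2 or words[1].lower() not in ("whois++", "urc"):
                return ("ERROR", "unknown mode")
            self.mode = words[1].lower()
            return ("OK", self.mode)
        if self.mode == "whois++":
            # Simplified template search: ids whose values contain the word.
            word = words[0].lower()
            hits = sorted(
                rid
                for rid, template in self.templates.items()
                if any(word in value.lower().split() for value in template.values())
            )
            return ("OK", hits)
        # Rich mode: look up the full URC by id.
        if words[0] in self.urcs:
            return ("OK", self.urcs[words[0]])
        return ("ERROR", "no such URC")


session = Session(
    templates={"doc-1": {"title": "Directory Service Notes"}},
    urcs={"doc-1": {"title": "Directory Service Notes",
                    "publisher": {"name": "Sample House"}}},
)
search = session.handle("directory")   # template search in whois++ mode
switch = session.handle("MODE urc")    # same connection, new mode
full = session.handle("doc-1")         # richer query, richer answer
```

The point of the sketch is only that the search view and the rich view can
live behind one connection, with the cheap template query used to find the
identifier before the full URC is requested.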
This creates a system that can easily map between different semantic spaces
such as MARC, TEI and whois++ all within one server. This solves several
problems of management, legacy systems and data format incompatibility.

<HR><A HREF="http://www.gatech.edu/michael.html">
<ADDRESS>Michael Mealling</ADDRESS>