- From: Michael Mealling <Michael.Mealling@oit.gatech.edu>
- Date: Wed, 15 Mar 1995 12:40:52 -0500 (EST)
- To: uri@bunyip.com
This is a draft of a whitepaper on what I plan on talking about in Danvers. It's a proposal that I and a few others are implementing now. I don't know if we will have anything demoable in Danvers or not. This is finals week and I'm basically going insane. ;-)

-MM

Support for Uniform Resource Characteristics within a Simple Internet Directory Service

Michael Mealling
michael@gatech.edu

The explosive growth of the Internet in recent years has prompted the need for standard distributed database functions such as caching and replication. In classical distributed databases, information about a given object was maintained along with that object. The Internet currently has no method for tying what is termed metadata, or data about data, to the object which it describes. This need has prompted the IETF to begin work on a standard form for structuring this metadata. This entity is called a Uniform Resource Characteristic.

The Uniform Resource Characteristic, or URC, is of little use if there is no distributed database into which it can be inserted, updated and retrieved. Therefore some system must be created that allows users to easily publish information about their resources, update that information and easily find the given resource again upon request. This would be fairly easy if it were not for three factors: 1) URCs can potentially contain nested information that is bound to some subelement instead of the object itself; 2) nested substructures and particular instances of resources must be maintained by extremely distributed authorities; and 3) the number of resources on the network is so large that any system that implemented full searching would have severe scaling problems.

The first issue arises very often when even simple information is included in the URC. For example, by including an identifier for the publisher within the URC we now have a case where the URC contains information that is not bound to the object described.
There is an extraneous relationship, but it tends to pollute the database to the degree that complex query languages (e.g. DSSSL, SQL) must be used to separate nesting levels. One very complex and common example is the case of digital signatures over subsets of the URC. If a document is archived at two remote locations, each location may want to sign this fact with its key. This causes very extraneous information to make its way into a system that already has scaling problems.

The second issue, nested and distributed authority, leads to the very disturbing fact that it is impossible to ever have the 'complete' URC for a given resource. Therefore a system must be able to deal with URC information being inherently distributed.

Building a distributed database where each object itself can be distributed, on the scales existing on the Internet, is inherently problematic. Scalability of such distributed database systems requires some query routing system that causes searches to be done only on information sets that are worth the overhead. While the area of query routing has had some research, the issue of nested structures from above suggests that some operations need to be done on the metadata to 'filter' out the extraneous information. This filtered information makes a good candidate for indexed forward knowledge, which allows global searches to be pruned based on whether or not the forward knowledge indicates that a search would be useful.

A fourth, anecdotal issue is the problem of software use on the Internet. Any system that is fundamentally too complex is often shunned in favor of systems that are simple to install and administer. A good example is the almost complete lack of use of X.500 based systems by organizations in the Internet community. One of the reasons that Usenet News and DNS are so prevalent is that they are easily installed and maintained.
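
As a rough illustration of the filtering and forward-knowledge pruning described under the third issue above, the sketch below drops nested sub-records from a URC, indexes the words that remain, and uses that index to decide which servers are worth querying. No URC syntax is specified here, so the dictionary layout, the field names, the sample records, and the helper names (`filter_urc`, `forward_knowledge`, `route_query`) are all assumptions made purely for illustration.

```python
import re

def filter_urc(urc):
    """Keep only flat, string-valued attributes; drop nested sub-records.

    Assumption: a URC is modelled as a dict, and anything that is not a
    plain string (publisher records, signature lists) counts as nested.
    """
    return {attr: value for attr, value in urc.items() if isinstance(value, str)}

def forward_knowledge(urcs):
    """Map each attribute to the set of unique lowercase words seen in it."""
    index = {}
    for urc in urcs:
        for attr, value in filter_urc(urc).items():
            words = re.findall(r"\w+", value.lower())
            index.setdefault(attr, set()).update(words)
    return index

def route_query(servers, attr, word):
    """Return the servers whose forward knowledge says a search could match."""
    return sorted(
        name for name, index in servers.items()
        if word.lower() in index.get(attr, set())
    )

# Hypothetical records: the nested parts describe the publisher and the
# archives, not the resource itself, so they are filtered out before indexing.
urc_a = {
    "title": "Uniform Resource Characteristics",
    "author": "Example Author",
    "publisher": {"name": "Example Press", "city": "Atlanta"},
    "signatures": [{"signer": "archive-1"}, {"signer": "archive-2"}],
}
urc_b = {"title": "Directory Services", "author": "Another Author"}

servers = {
    "server-1": forward_knowledge([urc_a]),
    "server-2": forward_knowledge([urc_b]),
}

print(route_query(servers, "title", "resource"))  # ['server-1']
print(route_query(servers, "title", "atlanta"))   # [] (the nested publisher data was filtered out)
```

Only `server-1` would be searched for the first query; the second query is pruned everywhere because the publisher's city never reached the index.
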
The proposed solution being developed by several different parties around the Internet is to utilize a system being developed within the IETF for a simple Internet directory service. This service is primarily based on template-oriented information systems such as SOLO and whois++. These systems create what is called a centroid, a list of the unique words appearing in an attribute. This centroid is then passed up to more generalized index servers. These index servers then serve as pruning points for decisions about whether or not a search is useful. The one commonality of these systems is that they are template (i.e. attribute/value) based. On the surface this would appear limiting, but in the case of URCs it provides a scoping function that makes the problem much more manageable.

The proposed solution is to filter a given URC so that it fits unambiguously into a template ready for indexing via whois++. This provides the filtering factor and punts the issues of management and distributed authority out of the system used for search and retrieval. The problem this causes is that, for the user, the important information is still stored in the URC and not in the resulting template. This means that the final result of the search should be a fully formatted and syntactically rich URC. Therefore there needs to be some method by which the whois++ query can finally resolve into something that isn't a template.

The final solution appears to be the addition of a MODE command to whois++ (and possibly SOLO) that allows the client doing the query to request that the server switch protocols (and thus query language and data format) entirely. This allows the same connection to issue the query in whois++, switch modes to something much more capable, and then reissue a richer query that returns a much richer data element: the final URC. This creates a system that can easily map between different semantic spaces such as MARC, TEI and whois++ all within one server.
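
To make the MODE idea more concrete, here is a minimal sketch of one session that first answers a template-style query and then, after a mode switch, returns the full URC. It is not the whois++ protocol: the `MODE <name>` syntax, the mode names, the simplified `attribute=word` query form, the record handles, and the `Session` class are all hypothetical, chosen only to show the control flow.

```python
def to_template(urc):
    """Flatten a URC into attribute/value pairs, dropping nested parts."""
    return {attr: value for attr, value in urc.items() if isinstance(value, str)}

class Session:
    """Toy server-side session: starts in template mode and can switch."""

    def __init__(self, urcs):
        self.urcs = urcs          # full URCs keyed by a hypothetical handle
        self.mode = "whois++"     # assumed name for the template mode

    def handle(self, line):
        line = line.strip()
        command, _, arg = line.partition(" ")
        if command.upper() == "MODE":          # hypothetical syntax: MODE <name>
            if arg not in ("whois++", "urc"):
                return "error: unsupported mode " + arg
            self.mode = arg
            return "ok: " + arg
        if self.mode == "whois++":
            return self._search(line)
        return self._fetch(line)

    def _search(self, query):
        """Template mode: match one word against one flat attribute."""
        attr, _, word = query.partition("=")
        return [
            handle for handle, urc in sorted(self.urcs.items())
            if word.lower() in to_template(urc).get(attr, "").lower().split()
        ]

    def _fetch(self, handle):
        """URC mode: return the full, nested record (or None if unknown)."""
        return self.urcs.get(handle)

# Hypothetical data: one URC with a nested publisher record.
urcs = {
    "doc-1": {
        "title": "Simple Internet Directory Service",
        "publisher": {"name": "Example Press"},
    }
}

session = Session(urcs)
hits = session.handle("title=directory")   # template query finds the handle
session.handle("MODE urc")                 # same session, richer mode
full = session.handle(hits[0])             # full URC, nested parts included
print(hits, full["publisher"]["name"])
```

The search itself only ever touches the flattened template; the nested publisher record becomes visible again only after the switch.
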
Such a multi-protocol server solves several problems of management, legacy systems and data format incompatibility.

--
Michael Mealling
michael.mealling@oit.gatech.edu
http://www.gatech.edu/michael.html
Received on Wednesday, 15 March 1995 12:41:01 UTC