URC Davnpt; re Dan's 27jan proposal from Terry Allen on 1995-01-29 (uri@w3.org from January 1995)

From: Terry Allen <terry@ora.com>
Date: Sun, 29 Jan 1995 11:37:45 PST
To: uri@bunyip.com, davenport@ora.com
Message-Id: <199501291937.LAA16155@rock>
In this post I consider aspects of Dan's proposed solution
(which is pretty good).

| Date: Fri, 27 Jan 95 19:04:04 EST
| Message-Id: <9501272343.AA22920@ulua.hal.com>
| From: "Daniel W. Connolly" <connolly@hal.com>
| To: Multiple recipients of list <html-wg@oclc.org>
| Subject: Redundancy in links, Davenport Prososal [long]

...

| Given a system like harvest[2], it makes sense to handle queries like
| "find me the document whose publisher is O'Reilly and Associates,
| published in 1994 under the title 'DNS and Bind'." Their model for
| distributed indexing, brokers, replication, and caching (with
| taxonomies and query routing in the works) has me convinced that it's
| the right way to go.
 
Harvest, as described (I haven't actually used it), looks like 
a good solution for gathering and collating metainformation.
(That's a piece I hadn't wanted to worry about for an initial
trial, but it's a necessary piece in a full working system.)

...

| That brings me to another point: The sharing of information can only
| be automated to the point that it can be formalized. I've been trying
| to find some formalism for the way the web works. I've decided that
| this is a useful exercise for areas like security, where you have to
| be 100% sure of your conclusions relative to your premises.
| 
| But for the web in general, 100% accuracy and authenticity is not
| necessary. The web is a model for human knowledge, and human knowledge
| is generally not clean and precise -- it's not even 100% consistent.
| So I think that instead of modelling the web with formal systems like
| Larch[3], a more "fuzzy" AI knowledge-representation sort of approach
| like Algernon[4] is the way to go. Traditional formal systems like
| Larch rely on consistency, which is not a good model for the knowledge
| base deployed on the web.
 
The fuzzy system I have in mind for the long run is what I do when
I go to the library catalogue with a citation along the lines of
"Title is `DNS and the Blind,' by some guy named Crikey, and
it was published by, I think, Addison-Wesley."  Every piece of
information there is wrong, but I can still find the book
(DNS and BIND, by Paul Albitz & Cricket Liu, published by ORA).
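
A rough sketch of that kind of fuzzy lookup, in Python.  The three
catalogue records, the field names, and the scoring rule (sum of
string similarities per field) are all invented for illustration;
a real catalogue search would be far more elaborate.

```python
from difflib import SequenceMatcher

# Hypothetical catalogue records, invented for this sketch.
CATALOGUE = [
    {"title": "DNS and BIND", "author": "Paul Albitz & Cricket Liu",
     "publisher": "O'Reilly & Associates"},
    {"title": "TCP/IP Network Administration", "author": "Craig Hunt",
     "publisher": "O'Reilly & Associates"},
    {"title": "The C Programming Language", "author": "Kernighan & Ritchie",
     "publisher": "Prentice Hall"},
]

def similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_word_similarity(guess, field):
    """Best similarity between the guess and any single word of the field."""
    return max((similarity(guess, word) for word in field.split()), default=0.0)

def score(citation, record):
    """Sum the per-field similarities; every field may be partly wrong."""
    return (similarity(citation["title"], record["title"])
            + best_word_similarity(citation["author"], record["author"])
            + similarity(citation["publisher"], record["publisher"]))

def rank(citation, catalogue):
    """Return the catalogue records ordered from best to worst match."""
    return sorted(catalogue, key=lambda r: score(citation, r), reverse=True)

citation = {"title": "DNS and the Blind", "author": "Crikey",
            "publisher": "Addison-Wesley"}
best = rank(citation, CATALOGUE)[0]
print(best["title"], "/", best["author"])
```

The point is only that no single field has to be right for the
ranking to put the intended book first.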

Others want their writers to be able to walk over to a bookshelf,
pull a book off the shelf, open it, and construct a link
to "the illustration of network topology in Flatland, in
the book `Moebius Network'" and have it resolve to an appropriate
Net object, a list of possible appropriate Net objects, or
a message stating that the book exists only in print, etc.

| The URN model of publisher ID/local-identifier may be sufficient for
| the applications of moving the traditional publishing model onto the
| web. But that is only one application of the technology that it takes
| to achieve high quality links. Another application may have some other
| idea of what the "critical meta-information" is. For example, for bulk
| file distribution (ala archie/ftp), the MD5 is critical.

Thinking over my first proposal, in the early stages of football
deprivation, I see that URNs can be DNSish up to a point, and a
DNSish approach to resolving them will probably work.  Whether
we want to set up all the pieces yet I don't know, but

	URN:ISBN:56592:something

could point to URN servers that know about ISBN servers that
can locate servers with records for publisher 56592 (ORA),
and it's up to those last servers to resolve [something].

A complexity I didn't address in my first proposal is how to
specify what I want returned from a URN query (what I want
is a list of URLs, but someone else might want the whole
URC set, the URC set of the parent URN, if any, or an 
alternate URN, etc., etc.).
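
A minimal sketch of the DNSish delegation and of a "what do I want
back" switch, in Python.  The nested tables stand in for separate
servers, the `want` parameter and its values are hypothetical, and
the example.com URLs are placeholders; nothing here describes an
existing URN service.

```python
# Each nested table stands in for a separate server (an assumption of this
# sketch): top level -> namespace (ISBN) -> publisher -> local identifier.
NAMESPACES = {
    "ISBN": {
        "56592": {  # publisher 56592 (ORA), as in the example above
            "something": {
                "urls": ["http://example.com/a", "ftp://example.com/b"],
                "urc": {"publisher": "ORA"},
            },
        },
    },
}

def resolve(urn, want="urls"):
    """Resolve a URN one level at a time, returning the requested facet.

    `want` is a hypothetical selector: "urls" for a list of URLs,
    "urc" for the whole URC set.
    """
    scheme, namespace, publisher, local = urn.split(":", 3)
    if scheme.upper() != "URN":
        raise ValueError("not a URN: " + urn)
    publishers = NAMESPACES[namespace.upper()]  # top level knows namespaces
    records = publishers[publisher]             # namespace knows publishers
    record = records[local]                     # publisher resolves the rest
    return record[want]

print(resolve("URN:ISBN:56592:something"))
print(resolve("URN:ISBN:56592:something", want="urc"))
```

Each level needs to know only the next one down, which is the DNSish
part; the `want` argument is one possible place to say what should
come back.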

| OK... so... now that I've done a brain dump, how about a specific answer
| to the "Davenport proposal":

| Problem Statement
| =================
| 
| The Davenport Group is a group of experts in technical documentation,
| mostly representing Unix system vendors.  They have developed DocBook,
| a shared SGML-based representation for technical documentation. They
| will probably be using a combination of CD-ROM distribution and the
| Internet to deliver their technical documentation.

Yes.

| They are developing hypertext documentation; they each have solutions
| for CD-ROM distribution, but while the World-Wide Web is the most
| widely-deployed technology for internet distribution, it does not meet
| their needs for response time nor reliability of links over time. As

nor complexity of link resolution services.

| publishers, they are willing to invest resources to increase the
| quality of service for the information they provide over the web.

Yes.  

| Moreover, the solution for increased reliability must be shared among
| the vendors and publishers, as the links will cross company
| boundaries. Ideally, the solution will be part of an Internet-wide
| strategy to increase the quality of service in information retrieval.

Yes, although smaller successes will be instructive, too.

| Theory of Operation
| ===================
| 
| The body of information offered by these vendors can be regarded as a
| sort of distributed relational database, the rows being individual
| documents (retrievable entities, to be precise), and the columns being
| attributes of those documents, such as content, publisher, author,
| title, date of publication, etc.

Yes, probably that model is sufficient for our docs, but I am well
aware that it isn't sufficient for publishing in general.  The
retrievable entities have interrelations that may need to be
modelled, and which can be both complex and nonhierarchical.  One
can quickly develop situations in which the info can't be 
presented in a table (at least so simple a table), nor in a tree.
(I'd be happy to dig out my example about Sir John Chardin's 
Journey to Persia,  in which the branches of the tree grow
together.)  Considering that publishing happens in time,
this may only be a matter of multiple inheritance (and I
don't know how that is best to be represented in an URC set
or even in a TEI header).
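
To make the "branches grow together" problem concrete, here is a small
Python sketch.  The work names are placeholders (not the Chardin
example), and the `DERIVED_FROM` mapping is just one assumed way to
record that a published item descends from more than one earlier item.

```python
# Hypothetical works; each maps to the works it was derived from.
DERIVED_FROM = {
    "first-edition": [],
    "translation": ["first-edition"],
    "revised-edition": ["first-edition"],
    "combined-reprint": ["translation", "revised-edition"],  # two parents
}

def ancestors(work, graph=DERIVED_FROM):
    """Collect every work that `work` descends from, directly or indirectly."""
    seen = []
    stack = list(graph[work])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(graph[parent])
    return seen

def fits_in_a_tree(graph):
    """A tree allows at most one parent per node; multiple inheritance breaks it."""
    return all(len(parents) <= 1 for parents in graph.values())

print(sorted(ancestors("combined-reprint")))
print(fits_in_a_tree(DERIVED_FROM))
```

A flat table row or a single parent pointer cannot hold the last
entry; a list of parents (a graph) can.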

My working visualization is of a multidimensional infospace in 
which many values (author, title, etc) intersect in some published 
works, sometimes in quirky ways.  

| The pattern of access on this database is much like many databases:
| some columns are searched, and then the relevant row is selected. This
| motivates keeping a certain portion of this data, sometimes referred
| to as "meta-data," or indexing information, highly available.

I would call all that data metadata.  The search may be more
complex.
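
The simplest form of the search-columns-then-select-row pattern can be
sketched in Python as below.  The rows and the example.com URLs are
invented, and exact equality on each column is an assumption; a more
complex search would replace the equality test.

```python
# Hypothetical rows of the "distributed relational database"; the URLs are
# placeholders, not real locations.
ROWS = [
    {"publisher": "O'Reilly and Associates", "title": "DNS and Bind",
     "year": 1994, "url": "http://example.com/dns-and-bind"},
    {"publisher": "O'Reilly and Associates", "title": "Another Title",
     "year": 1993, "url": "http://example.com/another-title"},
]

def select(rows, **criteria):
    """Return every row whose columns equal all of the given values."""
    return [row for row in rows
            if all(row.get(column) == value
                   for column, value in criteria.items())]

matches = select(ROWS, publisher="O'Reilly and Associates",
                 title="DNS and Bind", year=1994)
print([row["url"] for row in matches])
```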

| The harvest system is a natural match. Each vendor or publisher would
| operate a gatherer, which culls the indexing information from the rows
| of the database that it maintains. A harvest broker would collect the
| indexing information into an aggregate index. This gatherer/broker
| collection interaction is very efficient, and the load on a
| publisher's server would be minimal. The broker can be replicated to
| provide sufficiently high availability.
| Typically, a harvest broker exports a forms-based HTTP searching
| interface. But locating documents in the davenport database is a
| non-interactive process in this system. Ultimately, smart browsers
| can be deployed to conduct the search of the nearest broker and
| select the appropriate document automatically. But the system should
| interoperate with existing web clients.
| Hence the typical HTTP/harvest proxy will have to be modified to not
| only search the index, but also select the appropriate document and
| retrieve it. To decrease latency, a harvest cache should be collocated
| with each such proxy.

Would it not suffice for the proxy (the URC resolver) to return
not the document but a pointer to it (URN or set of URLs)?  In
earlier correspondence you seemed to be thinking of the metadata
being at the same site as the document, but that need not be
the case.

| Ideally, links would be represented in the harvest query syntax, or a
| simple s-expression syntax. (Wow! In surfing around for references, I
| just found an example of how these links could be implemented. See the
| PRDM project[2].) But since the only information passed from

(parenthetically, PRDM is a reasonable representation of the info
I'd want to input and validate as a teiheader)

| contemporary browsers to proxy servers is a URL, the query syntax will
| have to be embedded in the URL syntax.
| I'll leave the details aside for now, but for example, the query:
| 
| 	(Publisher-ISBN: 1232) AND (Title: "Microsoft Windows User Guide")
| 		AND (Edition: Second)
| might be encoded as:
| 	harvest://davenport?publisher-isbn=1232;title=Microsoft%20Windows%20Users%20Guide;edition=Second
| 
| Each client browser is configured with the host and port of the
| nearest davenport broker/HTTP proxy. The reason for the "//davenport"
| in the above URL is that such a proxy could serve other application
| indices as well. Ultimately, browsers might implement the harvest:
| semantics natively, and the browser could use the Harvest Server
| Registry to resolve the "davenport" keyword to the address of a
| suitable broker.
| To resolve the above link, the browser client contacts the proxy and
| sends the full URL. The proxy contacts a nearby davenport broker,
| which processes the query and returns results. The broker then selects
| any match from those results.
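
The encoding quoted above can be sketched in Python as follows.  This
assumes the "//davenport" form of the URL that the quoted text refers
to, lower-cased attribute names, ";" between pairs, and ordinary
percent-encoding of values; the function names are hypothetical and
the field values are taken from the encoded form of the example.

```python
from urllib.parse import quote, unquote

def encode_query(index, fields):
    """Embed (attribute, value) pairs in a harvest: URL for the given index."""
    pairs = ";".join(f"{name.lower()}={quote(value)}" for name, value in fields)
    return f"harvest://{index}?{pairs}"

def decode_query(url):
    """Recover the index keyword and the (attribute, value) pairs from the URL."""
    index, _, query = url.split("://", 1)[1].partition("?")
    fields = []
    for pair in query.split(";"):
        name, _, value = pair.partition("=")
        fields.append((name, unquote(value)))
    return index, fields

url = encode_query("davenport", [
    ("Publisher-ISBN", "1232"),
    ("Title", "Microsoft Windows Users Guide"),
    ("Edition", "Second"),
])
print(url)
print(decode_query(url))
```

Note that the AND between the terms is implied by the ";" separator;
the sketch has no way to express any other operator.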

I kinda want to separate the matter of the query syntax from 
the means of transport of the query.  Leaving aside the issue
of how the author knows what syntax to use for a given target 
(you are assuming Harvest, I'm punting), I have thought of
another URLchitecture.  Tell me what you think of this
(please excuse errors in syntax):

	http://hostname/urc2url/urcteidavenport?querygoeshere

Somehow I or my system has to determine an appropriate hostname,
perhaps by pointing to a host that "ought to know":

	http://semi.omniscient.com/urc2url/...

and semi.omniscient.com hands off the rest of the URL more or 
less in the manner you propose, but needs to know only where
it can find a server that knows about urc2url services,
which in turn need to locate databases for the urcteidavenport
information.
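
As a sketch of that hand-off in Python: the host that "ought to know"
splits the path into a service name and a database name, looks each
up, and returns pointers (a list of URLs) rather than the document
itself.  The registry contents and the example.com URL are invented;
only the URL shape comes from the proposal above.

```python
from urllib.parse import urlsplit

# Hypothetical registry: service name -> database name -> query -> pointers.
SERVICES = {
    "urc2url": {
        "urcteidavenport": {
            "querygoeshere": ["http://example.com/docs/some-document"],
        },
    },
}

def hand_off(url):
    """Route a urc2url-style URL and return a list of URLs (pointers)."""
    parts = urlsplit(url)
    service, database = parts.path.strip("/").split("/", 1)
    databases = SERVICES[service]    # which server handles this service
    records = databases[database]    # which database that server consults
    return records.get(parts.query, [])  # pointers only, not the document

print(hand_off("http://semi.omniscient.com/urc2url/urcteidavenport?querygoeshere"))
```

The query string is passed through untouched, which keeps the query
syntax separate from the means of transport.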

| Through careful administration of the links and the index, all the
| matches should identify replicas of the same entity, possibly on
| different ftp/http/gopher servers. An alternative to manually
| replicating the data on these various servers would be to let the
| harvest cache collocated with the broker provide high availability of
| the document content.

| Security Considerations
| =======================

[sensible remarks deleted]
 
| 
| Conclusions
| ===========
| 
| I believe this solution scales well in many ways. It allows the
| publishers to be responsible for the quality of the index and the
| links, while delegating the responsibility of high-availability to
| broker and cache/proxy servers. The publishers could reach agreements
| with network providers to distribute those brokers among the client
| population (much like the GNN is available through various sites.)

I am not sure that, for Net publishing in general, the distributed
model of DNS will work for the bibliographic databases (as opposed
to URN resolution).  One could imagine distributing the URC
resolution load DNSishly by establishing URC resolvers that are
specialized along the vectors of the metadata:  the Shakespeare
URC resolution service, or the Ancient History service.  (Becoming
more highly specialized, we would have the Bronte service, which
would eventually split into the Charlotte Bronte service, the
Emily Bronte service, and the Anne Bronte service ...  nahh,
let's not.)

But otherwise there's going to be some efficiency of scale to
be realized by collocating great gobs of metadata in relatively
few places.  When I want to look up information about a print
book on the Net, I go to one of the big catalogues (I use
Melvyl a lot; if I don't find what I want I try Olorin, etc.)
rather than hope the item is in the Sonoma County Public Library
catalogue.  If I had to check hundreds of catalogues to resolve
a URC query it would use significant resources.

I think the librarians need to weigh in on this issue, as I'm
sure they've thought about it more deeply.

| It allows those cache/proxy servers to provide high-availability to
| other applications as well as the davenport community. (The Linux
| community and the Computer Science Technical reports community already
| operate harvest brokers.)

Can you supply pointers?  Sample URLs?  Do they use the harvest:// method?

| The impact on clients is minimal -- a one-time configuration of the
| address of the nearest proxy. I believe that the benefits to the
| respective parties outweigh the cost of deployment, and that this
| solution is very feasible.

Yep, for Davenport it would probably work just fine, at least
until we got into multiple editions in various languages that
have complex interrelations.  So the two issues that remain unresolved
for me are:  will this approach really scale for complex 
bibliographic queries (do we need some equivalents to the big 
library catalogues), and how complex/simple should/can the
bibliographic model (URC set, your table example) be, both
for Davenport purposes and for Net publishing in general?
 

-- 
Terry Allen  (terry@ora.com)   O'Reilly & Associates, Inc.
Editor, Digital Media Group    101 Morris St.
			       Sebastopol, Calif., 95472
monthly column at:  http://www.ora.com/gnn/meta/imedia/webworks/allen/

A Davenport Group sponsor.  For information on the Davenport 
  Group see ftp://ftp.ora.com/pub/davenport/README.html
	or  http://www.ora.com/davenport/README.html
Received on Sunday, 29 January 1995 14:37:59 UTC