Re: URC proposal for Davenport Group

Daniel W. Connolly (connolly@hal.com)
Mon, 23 Jan 1995 11:54:55 -0600


Message-Id: <9501231754.AA20926@ulua.hal.com>
To: Terry Allen <terry@ora.com>
Cc: davenport@ora.com, uri@bunyip.com, hackers@ora.com
Subject: Re: URC proposal for Davenport Group 
In-Reply-To: Your message of "Sun, 22 Jan 1995 17:09:16 PST."
             <199501230109.RAA07091@rock> 
Date: Mon, 23 Jan 1995 11:54:55 -0600
From: "Daniel W. Connolly" <connolly@hal.com>

In message <199501230109.RAA07091@rock>, Terry Allen writes:
>
>A feature I need not presently available in applications 
>is the ability to resolve abstract names to addresses automatically
>by means of very constrained queries against a bibliographic database
>*and* initiate appropriate browser activity without human intervention.
>Where is that deployed now?

Whoa... too much abstraction in there for me; too many undefined
terms. What's the (testable) distinction between an "abstract name"
and an "address"? These sound like implementation details. I'm asking
about user-visible (and/or author-visible) features of this
hypothetical system.

I like the concrete example...

>| Just write:
>| 
>| 	<ulink uri="http://www.microsoft.com/windows3.1/userguide.html">
>| 	Windows 3.1 User's Guide</ulink>
>| and run a caching HTTP server. The client will consult the cache, and
>
>Then caching has to be reinterpreted as "knowing about all the SGML
>entities for which files are installed on the local system," not just 
>"knowing about recently fetched and still valid entities."

OK. Sounds good to me. To accomplish this, you just initialize the
cache to think that all the "SGML entities installed on the local
system" are actually "recently fetched and still valid entities."  You
might be able to do this with a clever combination of CGI scripts.  Or
you might have to hack the server code.
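
A minimal sketch of that initialization, assuming the cache is nothing
more than a URL-to-file table. The names seed_cache and lookup are
hypothetical; a real caching server keeps its own store format, so
treat this as an illustration of the idea rather than a recipe for any
particular server.

```python
import os

def seed_cache(cache, local_root, base_url):
    """Register every file under local_root as if it had just been fetched."""
    base = base_url.rstrip("/")
    for dirpath, _dirs, files in os.walk(local_root):
        for name in files:
            path = os.path.join(dirpath, name)
            # Turn the on-disk location into the URL path it stands in for.
            rel = os.path.relpath(path, local_root).replace(os.sep, "/")
            cache[base + "/" + rel] = path
    return cache

def lookup(cache, url):
    """Return the local file for url, or None on a cache miss."""
    return cache.get(url)
```

After seeding, a request for an installed entity is answered locally,
and anything else falls through to a normal fetch.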

Another option would be to take advantage of the fact that many
browsers support per-uri-scheme proxies, and create a new URI scheme,
say sgml: or davenport:. The link would be:

	<ulink uri="davenport://www.microsoft.com/win31/userguide">...

and the browser would be configured to use an HTTP proxy for davenport:
URIs. This leaves you the option to re-define the behaviour of these
links in the future, without going back and revising the documents.
(This is perhaps one way to distinguish "abstract names" from "locations."
The question is that of who owns the semantics of the names.)

For example, the proxy server could implement replication (see below...)
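
Here is a rough sketch of the per-scheme proxy decision a browser
makes. The env table stands in for the browser's configuration, and
the <scheme>_proxy key naming is an assumption modeled on the usual
environment-variable convention; proxy_for and route are made-up names.

```python
from urllib.parse import urlsplit

def proxy_for(uri, env):
    """Return the proxy URL configured for uri's scheme, or None."""
    scheme = urlsplit(uri).scheme.lower()
    return env.get(scheme + "_proxy")

def route(uri, env):
    """Decide where a request goes: (server to contact, what to ask it for)."""
    proxy = proxy_for(uri, env)
    if proxy is not None:
        # Proxied: hand the whole URI to the proxy, which owns its meaning.
        return proxy, uri
    # Direct: contact the named server and ask for the path.
    parts = urlsplit(uri)
    return parts.scheme + "://" + parts.netloc, parts.path or "/"
```

Since the documents only ever say davenport:, changing what those
links do is a matter of changing the proxy, not the documents.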

>| and links to the document in, say, postscript, PDF, and
>| RTF format, along with ordering information for hardcopy (or a link to
>| such ordering info). The user chooses which URL to follow manually --
>| I haven't seen any successful techniques for automating this step.
>
>Automation is exactly what I want to achieve.  I think the required
>technique is to query an appropriately constructed bibliographic
>database.  For the purpose at hand, I assume all the docs are in 
>SGML, not some choice of formats.

I think I'm missing something. You want it so:

	* the user selects a ULINK
	* the browser uses info from the ULINK to find another document
	* that document is displayed

Where does the URC and the bibliographic database come in? As far as I
can tell, the browser just takes a URL from the ULINK, decodes it into
a server and path as usual, and sends the path to the server. The
server looks up the path (ok... this could be some sort of database
lookup: is this the piece you want to explore?) and returns an SGML
entity.

>[...] And I haven't 
>yet heard of a browser that will make a choice among a list of URLs 
>returned from a query.

Where does this "list of URLs" come in? Is it a list of various formats?
If so, what rules would you use to automatically choose from them?

Or is it a list of replicas? Replication would be a handy feature. If
that's what you're after, this could be an interesting project. How
would replicas be created/maintained/administered?


>I agree that the technology exists to achieve what I want; this 
>project is meant to explore what technology would work well and
>what behavior is desired in the system overall.  It may be that the 
>only piece that needs specifying except in prose is the URC format.  

I fail to see why a URC format is needed at all here. All I see is a
need for servers to do identifier->entity mappings, for which URLs
provide sufficient information.


I don't think I understand your notion of "constrained queries against
a bibliographic database." The database I see has two fields:
identifier, and SGML entity. The query is "select entity where
identifier='...'".  The '...' can be the path from the URL; and for a
quick-and-dirty implementation, the unix filesystem can represent this
database.
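
The quick-and-dirty version might look something like this. The
function name select_entity and the idea of a single root directory
are assumptions for illustration; the check against ".." is something
I added so the identifier can't wander outside the database.

```python
import os

def select_entity(root, identifier):
    """Return the bytes of the entity named by identifier, or None if absent."""
    root = os.path.realpath(root)
    candidate = os.path.realpath(os.path.join(root, identifier.lstrip("/")))
    # Refuse identifiers that climb out of the root (e.g. "/../../etc").
    if os.path.commonpath([root, candidate]) != root:
        return None
    if not os.path.isfile(candidate):
        return None
    with open(candidate, "rb") as f:
        return f.read()
```

The identifier is just the path part of the URL, so the "query" is a
file open and a miss is a missing file.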


Another interesting application would be to put the URC in the
_source_ document (kinda like a HyTime bibloc), and use that as the
key for the database lookup. You'd trade precise-but-fragile URL paths
for more traditional fuzzy-but-robust citations. For example:

 	<ulink linkend="MS94">Windows 3.1 User's Guide</ulink>
	...

	<!-- BibTeX/SGML adaptation used here. MARC/TEI/etc. influences
			should also be considered -->
	<book id=MS94>
		<editor>Fred Jones</>
		<title>Microsoft Windows 3.1 User's Guide</>
		<publisher>Microsoft Press
		<year>1993
		<edition>2nd
	</book>

So the client uses the whole <book> element as the query key. The
first question is: what network entity does it consult to resolve
the query? The "local" SGML cache server? Some "well known"
server like http://www.davenport.org/ (a configuration option
on the browser)? The publisher's server? How do you find the
publisher's server? Perhaps the <publisher> element would be
expanded to specify:

	<publisher server="http://sgml.microsoft.com">Microsoft Press

or, in the case where a publisher leases space through an access
provider:

	<publisher server="http://www.dec.com">O'Reilly &amp; Associates</>


Ah! Bright idea: the ideal way to implement this would be a big
distributed citation index, right? Here are some implementation
possibilities:


I. Quick and dirty hack:

	1. URCs are encoded as great big URLs, using cgi-bin/forms
	style syntax:

	<ulink uri="davenport:/citation-index?editor=Fred Jones;title=...">...

	2. existing browsers are used. They are locally configured,
	e.g. setenv davenport_proxy="http://localhost:8010/"

	3. a proxy runs on the local (or nearby) host. It maps /citation-index
	to a cgi-bin program which does a WAIS query against a list
	of sources. The list of sources is a local configuration issue;
	there might be some "well known" or "bootstrap" WAIS sources,
	say wais.davenport.org or wais.oclc.org or whatever; as time
	goes by, a few dominant servers emerge. Each site could supplement
	the database with a locally maintained WAIS database.

	4. Each of these WAIS servers contains a large collection
	of URCs. The WAIS servers are configured to know, for
	example, that exact matches on titles are a lot more important
	than exact matches on years. A site like OCLC might even license
	the commercial WAIS implementation.

	5. The WAIS query comes back with the WAIS doc-ids of the best
	items from all the servers, including scores. The proxy selects
	the "best" item (highest scoring? closest? some combination of
	them?) and fetches the source of that item, and passes it
	back to the browser.
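
Steps 1, 3 and 5 of the hack above could be sketched roughly as
follows. All three function names are hypothetical. Note one
deliberate difference: this uses the standard "&" separator and
percent/plus escaping from the forms encoding, not the ";" and bare
space written in the example link. The WAIS query itself (step 4) is
not shown; hits are assumed to arrive as (score, doc_id) pairs.

```python
from urllib.parse import urlencode, urlsplit, parse_qsl

def urc_to_uri(fields):
    """Step 1: encode citation fields as one big davenport: URI."""
    return "davenport:/citation-index?" + urlencode(fields)

def uri_to_urc(uri):
    """Step 3: inside the proxy, recover the citation fields from the URI."""
    return dict(parse_qsl(urlsplit(uri).query))

def pick_best(hits):
    """Step 5: take the doc-id of the highest-scoring (score, doc_id) hit."""
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[0])[1]
```

Picking by raw score alone is the simplest possible policy; "closest"
or some combination would need more information than a score.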

II. Harvest-based version
	
	1. user selects link; link points via IDREF (or some other
	HyTime locladder) to a URC (e.g. the <book> element above)

	2. Browser translates URC to Harvest SOIF format, and conducts
	a Harvest broker query against a list of brokers. (This task
	could be conducted by a proxy as above).

	3. Each publisher maintains a harvest gatherer for the information
	it publishes. Large sites like OCLC run caching brokers that
	do periodic bulk-updates from the publishers, and allow high-speed
	searching of the aggregate database. Client sites will typically
	search the OCLC broker, plus a handful of individual publishers
	that OCLC doesn't carry, plus their own local broker.

	4. Client (browser/proxy) retrieves "best" item from query
	results and displays it.


The choice of query protocols (WAIS, Harvest, or an HTTP/cgi/forms
application like lycos/archiplex/aliweb) isn't critical: the critical
thing is the ability to do distributed queries; i.e. the client (the
browser/proxy) consults _multiple_ servers, which share an
understanding of the precision/recall issues for the overall database;
i.e. they share a scoring system. At a minimum, this implies that
they share an abstract understanding of the schema of the distributed
database; in practice, it will be convenient if this schema has
a concrete representation, i.e. a URC syntax.
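
To make the "shared scoring system" point concrete, here is a toy
sketch. The weights are invented numbers, the field names follow the
<book> example, and each "server" is just an in-memory list standing
in for a real WAIS or Harvest source. It only works because every
server scores against the same schema with the same weights.

```python
# Hypothetical shared weights: all servers must agree on these for
# scores from different servers to be comparable.
WEIGHTS = {"title": 5.0, "editor": 3.0, "publisher": 2.0, "year": 1.0}

def score(query, record):
    """Sum the weights of the query fields that the record matches exactly."""
    return sum(WEIGHTS.get(field, 0.0)
               for field, value in query.items()
               if record.get(field) == value)

def distributed_query(query, servers):
    """Ask every server, then merge the hits into one list, best first."""
    hits = []
    for name, records in servers.items():
        for record in records:
            s = score(query, record)
            if s > 0:
                hits.append((s, name, record))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits
```

If two servers used different weights, merging their hit lists by
score would be meaningless, which is the argument for writing the
schema down.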

I still have serious doubts about the ability to automate choosing the
"best" item from the query results. Perhaps these databases will be
very carefully maintained and controlled, and precision of the
queries will be sufficiently high that the client can just pick the
top scoring item.

Dan