URC prop/Davenport cont'd

Terry Allen (terry@ora.com)
Tue, 24 Jan 1995 08:28:46 PST


Message-Id: <199501241628.IAA23782@rock>
From: Terry Allen <terry@ora.com>
Date: Tue, 24 Jan 1995 08:28:46 PST
To: davenport@ora.com, uri@bunyip.com, hackers@ora.com
Subject: URC prop/Davenport cont'd

Thanks for all the response.  Some clarification is evidently 
in order.


[Dan replying to me:]
| >A feature I need not presently available in applications 
| >is the ability to resolve abstract names to addresses automatically
| >by means of very constrained queries against a bibliographic database
| >*and* initiate appropriate browser activity without human intervention.
| >Where is that deployed now?
| 
| Whoa... too much abstraction in there for me; too many undefined
| terms. What's the (testable) distinction between an "abstract name"
| and an "address"? These sound like implementation details. I'm asking
| about user-visible (and/or author-visible) features of this
| hypothetical system.

The abstract name persistently refers to an object that may exist
at more than one address, and at changing addresses.  

I didn't provide a sample URN, thinking to avoid drawing attention
to that detail, but here's how Mitra put it at
        http://www.path.net/mitra/urn.html

"ISBN's consist of a string of digits which look opaque to the user,
but actually contain two distinct parts, the first is the publisher.
So assuming an ISBN of 1234567890 where 12345 is the naming
authority id, the URN would be <urn:isbn:12345:1234567890>."

That's one way of doing it; the main point is to use the publisher
ID you already have, which marks off your name space.
For O'Reilly that could be, stripped of the outer wrapper,
        ISBN:[some actual ISBN]
or
        ISBN:56592:[some string]

That's an abstract name, neither the title of the work nor any
address at which it is stored.
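
A minimal sketch of how such a name splits into its parts, assuming
the colon-separated form Mitra shows above. The function and field
names are hypothetical, and nothing here is a settled URN syntax.

```python
from typing import NamedTuple


class IsbnUrn(NamedTuple):
    scheme: str      # naming scheme, e.g. "isbn"
    authority: str   # publisher (naming authority) ID, e.g. "56592"
    local: str       # string assigned within the publisher's name space


def parse_isbn_urn(urn: str) -> IsbnUrn:
    """Split a name of the assumed form urn:<scheme>:<authority>:<local>."""
    # Drop the optional <...> wrapper, then split into at most four fields.
    parts = urn.strip("<>").split(":", 3)
    if len(parts) != 4 or parts[0].lower() != "urn":
        raise ValueError(f"not a URN of the assumed form: {urn!r}")
    _, scheme, authority, local = parts
    return IsbnUrn(scheme.lower(), authority, local)
```

The authority field is the only part a client needs in order to
decide whose name space the name belongs to; the rest stays opaque.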

[Dan again:]

| I like the concrete example...

[which Dirk mistakenly attributed to me, Terry]

| >| Just write:
| >| 
| >| <ulink uri="http://www.microsoft.com/windows3.1/userguide.html">
| >| 	Windows 3.1 User's Guide</ulink>
| >| and run a caching HTTP server. The client will consult the cache, 
| >
| >Then caching has to be reinterpreted as "knowing about all the SGML
| >entities for which files are installed on the local system," not just 
| >"knowing about recently fetched and still valid entities."
| 
| OK. Sounds good to me. To accomplish this, you just initialize the
| cache to think that all the "SGML entities installed on the local
| system" are actually "recently fetched and still valid entities."  You
| might be able to do this with a clever combination of CGI scripts.  Or
| you might have to hack the server code.

Or should this function be broken out of the server, which might
make it more available to other SGML apps?

| Another option would be to take advantage of the fact that many
| browsers support per-uri-scheme proxies, and create a new URI scheme,
| say sgml: or davenport:. The link would be:
| 
| 	<ulink uri="davenport://www.microsoft.com/win31/userguide">...

But I want to be able to refer to the item without knowing where
it is when I write the link, or where the resolution service
is.  I can't give a pathname to anything specific.  So instead
I might have:

	<ulink urn="ISBN:56592:1-56592-002-3"> ...

or
	<ulink uri="URN:ISBN:56592:1-56592-002-3"> ...

or, it occurs to me, if I want to invent a scheme, make it URN
rather than the industry-specific "davenport":

	<ulink uri="URN://ISBN:56592:1-56592-002-3"> ...

which might do for RFCs, too:

	<ulink uri="IETF://RFC:RFC1737"> or something like that

That's a problem I remember you wanting to solve, Dan.  

| and the browser would be configured to use an HTTP proxy for davenport:
| URIs. This leaves you the option to re-define the behaviour of these
| links in the future, without going back and revising the documents.
| (This is perhaps one way to distinguish "abstract names" from
| "locations." The question is that of who owns the semantics of the
| names.)

O'Reilly owns the semantics of the ISBN namespace 56592 (or is it
1-56592? anyway... ), according to the work I've seen in the UR*
discussion.  

| >| and links to the document in, say, postscript, PDF, and
| >| RTF format, along with ordering information for hardcopy (or a link to
| >| such ordering info). The user chooses which URL to follow manually --
| >| I haven't seen any successful techniques for automating this step.
| >Automation is exactly what I want to achieve.  I think the required
| >technique is to query an appropriately constructed bibliographic
| >database.  For the purpose at hand, I assume all the docs are in 
| >SGML, not some choice of formats.
| 
| I think I'm missing something. You want it so:
| 	* the user selects a ULINK
| 	* the browser uses info from the ULINK to find another document
| 	* that document is displayed

Yep.

| Where does the URC and the bibliographic database come in? As far as I
| can tell, the browser just takes a URL from the ULINK, decodes it into
| a server and path as usual, and sends the path to the server. The
| server looks up the path (ok... this could be some sort of database
| lookup: is this the piece you want to explore?) and returns an SGML
| entity.

That's certainly part of what I want to explore.  The mapping of the
URN to a resolution service address should definitely be done outside
the document.  Over at that service, the server would have to do
some kind of database lookup in the URC set to return the URLs at
which the SGML entity may be addressed.  I would expect it to return
URLs instead of the entity itself (which might be located closer
to me on the Net, or doesn't that matter anymore?).  
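
The two lookups described above can be sketched as follows. This is
only an illustration: the host names, the in-memory tables, and the
function names are all assumptions, standing in for whatever the
resolution service and its URC set turn out to be.

```python
from typing import Dict, List, Optional

# Hypothetical table kept outside the document:
# publisher ID -> address of that publisher's resolution service.
RESOLVERS: Dict[str, str] = {"56592": "resolver.example.com"}

# Hypothetical URC sets, one per resolution service:
# URN -> URLs at which the SGML entity may currently be addressed.
URC_SETS: Dict[str, Dict[str, List[str]]] = {
    "resolver.example.com": {
        "URN:ISBN:56592:1-56592-002-3": [
            "http://mirror1.example.com/docs/book.sgm",
            "http://mirror2.example.com/docs/book.sgm",
        ],
    },
}


def find_resolver(urn: str) -> Optional[str]:
    """Step 1: map the URN's publisher ID to a resolution service."""
    parts = urn.split(":")
    if len(parts) < 4:
        return None
    return RESOLVERS.get(parts[2])


def resolve(urn: str) -> List[str]:
    """Step 2: ask that service's URC set for the entity's URLs."""
    service = find_resolver(urn)
    if service is None:
        return []
    # The service returns URLs, not the entity itself.
    return list(URC_SETS.get(service, {}).get(urn, []))
```

Because both tables live outside the document, either one can change
without touching the link that carries the URN.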

In the long run I'd want to be able to do more complex bibliographic
queries (TITLE:"X User's Guide"+AUTHOR:"Quercia"), but there are
so many complications there I want to put off that part while looking
at the other pieces.  That's why I think an URC database will be needed.

| >[...] And I haven't 
| >yet heard of a browser that will make a choice among a list of URLs 
| >returned from a query.
| 
| Where does this "list of URLs" come in? Is it a list of various formats?
| If so, what rules would you use to automatically choose from them?
| 
| Or is it a list of replicas? Replication would be a handy feature. If
| that's what you're after, this could be an interesting project. How
| would replicas be created/maintained/administered?

I had in mind that the URN would be that of the format appropriate
to the context in which the link was written.  If I have X User's
Guide in formats foo, bar, and SGML, I'd create URNs for each of
those versions (and how to nest those is another hairy issue I was
trying to avoid), so that I can point at the URN (abstract name) 
for the SGML version.  So in fact this would be a list of URLs for
identical (or functionally identical) objects.  As for how these
are created/maintained/administered, I don't have any direct
experience; I'm concerned with what happens once they're out there.

| >I agree that the technology exists to achieve what I want; this 
| >project is meant to explore what technology would work well and
| >what behavior is desired in the system overall.  It may be that the 
| >only piece that needs specifying except in prose is the URC format.  
| I fail to see why a URC format is needed at all here. All I see is a
| need for servers to do identifier->entity mappings, for which URLs
| provide sufficient information.

answered above?

| I don't think I understand your notion of "constrained queries against
| a bibliographic database." The database I see has two fields:
| identifier, and SGML entity. The query is "select entity where
| identifier='...'".  The '...' can be the path from the URL; and for a
| quick-and-dirty implementation, the unix filesystem can represent this
| database.

I want more indirection between identifier and entity than that.

| Another interesting application would be to put the URC in the
| _source_ document (kinda like a HyTime bibloc), and use that as the
| key for the database lookup. You'd trade precise-but-fragile URL paths
| for more traditional fuzzy-but-robust citations. For example:
| 
|  	<ulink linkend="MS94">Windows 3.1 User's Guide</ulink>
| 	...
| 
| 	<!-- BibTeX/SGML adaptation used here. MARC/TEI/etc. influences
| 			should also be considered -->
| 	<book id=MS94>
| 		<editor>Fred Jones</>
| 		<title>Microsoft Windows 3.1 User's Guide</>
| 		<publisher>Microsoft Press
| 		<year>1993
| 		<edition>2nd
| 	</book>

But when I write the ulink I *don't know* where the target
is going to be when the user clicks on the link.  You're
skipping some necessary steps.
 
| So the client uses the whole <book> element as the query key. The
| first question is: what network entity does it consult to resolve
| the query? The "local" SGML cache server? Some "well known"
| server like http://www.davenport.org/ (a configuration option
| on the browser)? The publisher's server? How do you find the
| publisher's server? Perhaps the <publisher> element would be
| expanded to specify:

For the ten or so Davenporters who have or might have doc
on the Internet, the Davenport Group might maintain a public
list of ISBN publisher IDs and the corresponding resolution
servers.  (Might help avoid spoofing.)  For this trial setup,
the user could copy over that file to the local system, or
we could get more Rococo and link initially to the (fictitious)
www.davenport.org server.  I don't care too much about this
part because I think that in the long run a stable, well known
setup will develop.
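
A sketch of reading such a locally copied list, assuming a plain
text file with one "publisher-ID server" pair per line and "#"
comments. That file format is invented here for illustration; no
such list exists yet.

```python
from typing import Dict


def load_resolver_list(text: str) -> Dict[str, str]:
    """Parse lines of the assumed form '<publisher-id> <server>'."""
    table: Dict[str, str] = {}
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments and padding
        if not line:
            continue
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {number}: expected 'id server', got {raw!r}")
        publisher_id, server = fields
        table[publisher_id] = server
    return table


SAMPLE = """\
# hypothetical Davenport list of ISBN publisher IDs
56592  resolver.example.com
12345  urc.example.org   # made-up second publisher
"""

resolvers = load_resolver_list(SAMPLE)
```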

[ ... ]
 
| Ah! Bright idea: the ideal way to implement this would be a big
| distributed citation index, right? Here are some implementation
| possibilities:

Personally, and in the long run, I prefer the "large concatenated 
citation index" idea, but this proposal envisions separate
indices to avoid problems in authenticating changes to the index.

| I. Quick and dirty hack:
| 	1. URCs are encoded as great big URLs, using cgi-bin/forms
| 	style syntax:
| 	<ulink uri="davenport:/citation-index?editor=Fred Jones;title=...">...
| 
| 	2. existing browsers are used. They are locally configured,
| 	e.g. setenv davenport_proxy="http://localhost:8010/"
| 
| 	3. a proxy runs on the local (or nearby) host. It maps /citation-index
| 	to a cgi-bin program which does a WAIS query against a list
| 	of sources. The list of sources is a local configuration issue;
| 	there might be some "well known" or "bootstrap" WAIS sources,
| 	say wais.davenport.org or wais.oclc.org or whatever; as time
| 	goes by, a few dominant servers emerge. Each site could supplement
| 	the database with a locally maintained WAIS database.
|
| 	4. Each of these WAIS servers contains a large collection
| 	of URCs. The WAIS servers are configured to know, for
| 	example, that exact matches on titles are a lot more important
| 	than exact matches on years. A site like OCLC might even license
| 	the commercial WAIS implementation.
| 
| 	5. The WAIS query comes back with the WAIS doc-ids of the best
| 	items from all the servers, including scores. The proxy selects
| 	the "best" item (highest scoring? closest? some combination of
| 	them?) and fetches the source of that item, and passes it
| 	back to the browser.

Now you're getting warm.  

I was under the impression that WAIS does full-text rather than
structured searching.  A difference between the free and 
commercial products?  I also want to *describe* the operation
so I could use something other than a particular commercial
product.
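
Independent of any search product, step 1 of the quick and dirty
hack quoted above (a citation packed into a davenport: URL) can be
sketched like this. The semicolon-separated field=value layout
follows Dan's example; the percent-escaping and the function names
are assumptions.

```python
from typing import Dict
from urllib.parse import quote, unquote

PREFIX = "davenport:/citation-index?"


def encode_citation(fields: Dict[str, str]) -> str:
    """Pack citation fields into one URL, escaping spaces, ';' and '='."""
    pairs = [f"{quote(k, safe='')}={quote(v, safe='')}" for k, v in fields.items()]
    return PREFIX + ";".join(pairs)


def decode_citation(uri: str) -> Dict[str, str]:
    """Recover the citation fields a proxy would hand to its search back end."""
    if not uri.startswith(PREFIX):
        raise ValueError(f"not a citation-index URI: {uri!r}")
    fields: Dict[str, str] = {}
    for pair in uri[len(PREFIX):].split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        fields[unquote(key)] = unquote(value)
    return fields
```

The decoded fields are structured data, so the proxy is free to hand
them to a full-text engine, a structured one, or a flat file.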
 
| II. Harvest-based version
| 	
| 	1. user selects link; link points via IDREF (or some other
| 	hytime locladder) to a URC (e.g. the <book> element above)

Nope, can't do that, as explained above; I don't know where
to point, have only an abstract name.
 
| 	2. Browser translates URC to Harvest SOIF format, and conducts
| 	a Harvest broker query against a list of brokers. (This task
| 	could be conducted by a proxy as above).
| 
| 	3. Each publisher maintains a harvest gatherer for the information
| 	it publishes. Large sites like OCLC run caching brokers that
| 	do periodic bulk-updates from the publishers, and allow high-speed
| 	searching of the aggregate database. Client sites will typically
| 	search the OCLC broker, plus a handful of individual publishers
| 	that OCLC doesn't carry, plus their own local broker.
| 
| 	4. Client (browser/proxy) retrieves "best" item from query
| 	results and displays it.

That's more plausible to me for the long run, as it has a nice
aggregated database.

| The choice of query protocols (WAIS, Harvest, or an HTTP/cgi/forms
| application like lycos/archieplex/aliweb) isn't critical: the critical

Right.  But I want to be able to establish a standard form for
the query (the content of ulink's uri att) that will work with
whatever protocol the browser actually invokes.

| thing is the ability to do distributed queries; i.e. the client (the
| browser/proxy) consults _multiple_ servers, which share an
| understanding of the precision/recall issues for the overall database;
| i.e. they share a scoring system. At a minimum, this implies that
| they share an abstract understanding of the schema of the distributed
| database; in practice, it will be convenient if this schema has
| a concrete representation, i.e. a URC syntax.
| 
| I still have serious doubts about the ability to automate choosing the
| "best" item from the query results. Perhaps these databases will be
| very carefully maintained and controlled, and precision of the
| queries will be sufficiently high that the client can just pick the
| top scoring item.

Absolutely right.  For this trial I want to constrain the allowed
queries (so as to avoid choosing a query language), and as a side
effect you would always get an exact hit or total miss.
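
A sketch of what "constrained" could mean in practice: only one
query form is accepted, and the answer is either the exact entry or
nothing, with no scoring step. The accepted pattern and the
upper-casing of names are assumptions made for the example.

```python
import re
from typing import Dict, List, Optional

# Assumed query form for the trial: URN:ISBN:<publisher-id>:<string>
ALLOWED = re.compile(r"URN:ISBN:\d+:[0-9A-Za-z-]+", re.IGNORECASE)


def constrained_lookup(index: Dict[str, List[str]], query: str) -> Optional[List[str]]:
    """Return the URLs for exactly this name, or None on a total miss."""
    if not ALLOWED.fullmatch(query):
        # Anything richer (title/author searches, say) is out of scope here.
        raise ValueError(f"query form not allowed in this trial: {query!r}")
    # Names are compared whole after upper-casing; there is no partial match.
    return index.get(query.upper())
```

With no partial matches there is never a ranked list for the browser
to choose from, which sidesteps the "best item" problem for now.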
 
[Dirk:]
| I believe what is missing here in Terry's example is a URN or Uniform
| Resource Name.  The URC is meant to be the "glue" that holds together URNs
| and URLs.  Instead of using URLs as the ulink, one would use a URN, which
| is a location (and maybe format, language, ...) independent name.  Then, a
| URN resolution server is queried, much like a DNS server, and returns a
| URC.  The URC holds 0 or more URLs, as well as select other meta-data.
| 
| So, a URC comes together with a bibliographic database because they both
| hold meta-data.
| 
| I'll rephrase Terry's example now as:
| 
| <ulink uri="urn:microsoft:doc/windows3.1/userguide">
|  Windows 3.1 User's Guide</ulink>

That was Dan's example; the slash still suggests a pathname,
but the idea is right.
 
 [...]

[Dan again:] 
| WHAT is the TESTABLE distinction between this thing you call a URN,
| and the currently deployed architecture for URIs? In what way is an
| http URI not location independent? Where is microsoft.com? That's not
| a location; it's a domain. It can be _anywhere_. How is it not format
| independent?  How is it not language independent?

I think Dirk and Michael handled this.  

[ ...]

| >hold meta-data.
| 
| More mumbo jumbo. Someone please give a precise definition of
| meta-data.  I understand terms like "relational database," "inverted
| index," "SQL," "precision," "recall." The term meta-data doesn't mean
| anything in particular to me.

I also give you credit for understanding how a library card 
catalogue, or slightly better, the National Union Catalogue, works.
 
[...]
 
| I can't believe you're going to make up a namespace to replace DNS. If
| I were a publisher, and I wanted to do anything on the internet, the
| first thing I'd do is register a domain. It's cheap, easy, and
| necessary for many of the things I'd want to do. Parties that do
| business on the internet in any way are going to register a domain. I
| think that's a given. Why not take advantage of that namespace? At the
| outside, you might want to do something like MX records to add some
| indirection to DNS. But I haven't seen anything in Terry's scenario
| that warrants that.

And Dirk and Michael handled that, too.  DNS has *nothing to do 
with it*. My domain has *nothing to do* with where my resolution service
might be or where my publications might be.
 
| Harvesting URCs... there's a notion that we might reasonably discuss.

Some of the rest of us have been reasonably discussing (and for some time)
how URCs might be set up, and that's the main focus of my proposal.
Whether you push or pull the info is another matter completely.

| If there's some part of the architecture (such as Harvest's technique
| of shipping index information in bulk between gatherers/caches/brokers)
| that requires sending URCs around, then that's motivation to define
| a data format. Now we're into the resource discovery problem.
| 
| But Terry's proposal only involved mapping identifiers to SGML entities.
| For that, nothing beyond HTTP is necessary.

You have certainly not shown that.  I can't map directly between
a URN, much less some other piece of information (title), and
an SGML entity if I don't know where the entity is or where 
the up-to-date mapping info might be.  Maybe I can use HTTP
to communicate among all the pieces, but I still need to
specify how the pieces interact.  As Michael pointed out,
yes, this is a resource discovery problem (although "discovery"
is a bit misleading; I should be referring to something I
know exists in this case).
  
[Michael:]
 
| Again, I'm not putting words in Terry's mouth. Just stating what I 
| thought he meant.

What you thought I meant is what I thought I meant, but maybe
I didn't hit it from the right angle.  But Dan, have you been
biting your tongue at all the URI discussion of the last 6 months
instead of commenting?

Finally, Roy tsks me for misusing URC, so read <urm ... for
<urc ... in that DTD.

| BTW, can we clarify what we mean by URC?  It made sense to talk about
| citations, and it also makes sense to talk about individual characteristics,
| but the use of URC to represent a set of characteristics is just too
| confusing.  Any chance that we can go back to calling the "set" a URM
| and an individual characteristic a URC?

Also, when I wrote:  "We seem to have dropped uri@bunyip.com 
and hackers@ora.com" I was wrong, misled by the format
of the messages on davenport.


-- 
Terry Allen  (terry@ora.com)   O'Reilly & Associates, Inc.
Editor, Digital Media Group    101 Morris St.
			       Sebastopol, Calif., 95472
A Davenport Group sponsor.  For information on the Davenport 
  Group see ftp://ftp.ora.com/pub/davenport/README.html
	or  http://www.ora.com/davenport/README.html